
Panel Data Analysis For Economics and The Melbourne Institute

This document provides an overview of a 3-day course on panel data analysis for economics and social sciences held in December 2013. The course will cover the basics of panel data, commonly used panel data sets, techniques for handling and describing panel data, basic estimation methods, and static and dynamic binary response models. It includes an agenda with lectures, teaching assistants, and exercises using Stata software and the Household, Income and Labour Dynamics in Australia data set. The goal is for participants to understand panel data concepts and apply them in empirical analysis.


Panel Data Analysis for Economics and the Social Sciences
Melbourne Institute
9-11 December 2013

Steve Pudney
ISER
University of Essex

Organisation

Monday                      Tuesday                     Wednesday
9.00-10.30  Lecture (SP)    9.00-10.30  Lecture (SP)    9.00-10.30  Lecture (SP)
10.30-11.00 Tea             10.30-11.00 Tea             10.30-11.00 Tea
11.00-12.45 Lecture (SP)    11.00-12.30 Lecture (SP)    11.00-12.45 Lecture (SP)
12.45-1.45  Lunch           12.30-1.30  Lunch           12.45-1.45  Lunch
1.45-3.00   Lecture (NW)    1.30-3.00   Stata: ex 2/4   1.45-3.30   Stata: ex 4/6
3.00-3.30   Tea             3.00-3.30   Tea             3.30-4.00   Tea
3.30-5.00   Stata: ex 1/2   3.30-5.00   Stata: ex 2/4   4.00-5.00   Lecture (NW)

Lectures: Theatre 3, Level 2, FBE Building, 111 Barry St


Catering: Foyer, Level 5, FBE Building, 111 Barry St
Lab: Room G08, 233 Bouverie St

Melbourne: 05/12/2013 (2)

Aims of the course
• Introduce the distinctive features of panel data.
• Review some panel data sets commonly used in social sciences.
• Present the advantages (and limitations) of panel data, and
consider what sort of questions panel data can(not) address.
• Show how to handle and describe panel data.
• Introduce the basic estimation techniques for panel data (linear
and non-linear).
• Discuss how to choose (and test for) the right technique for the
question being addressed.
• Discuss interpretation of results.
• Introduce dynamic modelling of panel data.


Structure of course
• 3 days:
  - 3 hours methods lectures
  - 2 hour lab sessions
  - 1 hour HILDA lecture
• Lab sessions will illustrate concepts using Stata
software (“industry standard” in survey-based applied
work)
• Main data will be from Household, Income and Labour
Dynamics in Australia (HILDA)
• Focus is on understanding the concepts and applying
them.
• Full lecture slides on the web
• Technical detail kept to a minimum but available in
appendices
Day 1: Basics
• What are panel data?
• Why use panel data?
• Handling panel data in Stata – some basic commands.
• Patterns of observations in panel data (non-response and
attrition)
• Within and between variation, transitions & cohort
analysis
• Inference using panel data: some identification issues
  - unobservables
  - age, time and cohort effects
• Regression analysis: within and between group
regression


Day 2:
Panel regression analysis and its generalisations

• Random effects regression


• Testing the FE and RE assumptions
• Endogeneity
  - The source of endogeneity
  - The between- and within-group IV estimator
  - Correlated individual effects

Day 3: Static & dynamic binary response models

Basics
• Types of discrete variables
• Why not linear regression?
Discrete response models
• Latent linear regression
• Conditional (fixed-effects) logit
• Static random effects logit and probit
Dynamic binary response


Stata exercises
Ex 0: Assembles a working panel data file from the raw HILDA data
files
Ex 1: Examines the data structure and calculates simple summary
measures
Ex 2: Estimates basic between- & within-group panel data regression
models
Ex 3: Uses random effects regression and tests the underlying zero
correlation assumption
Ex 4: Explores endogeneity and instrumental variable estimation
Ex 5: Compares the linear probability model, random effects probit &
logit and conditional logit

Some reading
• Econometrics texts
  - Arellano, M. (2003). Panel Data Econometrics. Oxford University Press.
  - Baltagi, B.H. (2013). Econometric Analysis of Panel Data (5th ed.). Wiley.
  - Hsiao, C. (2003). Analysis of Panel Data (2nd ed.). Cambridge University Press.
  - Wooldridge, J.M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT
    Press (chapters 10 and 11 and sections 14.4, 15.8 and 16.8).

• Sociology
  - Halaby, C.N. (2004). Panel models in sociological research: Theory into
    practice. Annual Review of Sociology 30, 507-544.

• Stata-specific
  - Cameron, A.C. & Trivedi, P.K. (2009). Microeconometrics Using Stata. Stata Press.
  - Rabe-Hesketh, S. & Skrondal, A. (2008). Multilevel and Longitudinal Modeling Using
    Stata (2nd ed.). Stata Press.
  - StataCorp (2013). Stata Statistical Software: Release 11: XT Reference Manual. Stata
    Corporation.


Day 1: Basics

What are Panel Data?
• Panel data involve regularly repeated observations on the same
individuals.
• Repeat observations may be different time periods or units
within clusters (e.g. workers within firms; siblings within
twin pairs)
• Individuals may be people, households, firms, areas, etc.
• In most analysis using household panels, the individual is the
person and the repeated observations are the different time
periods (waves).
• Sometimes, e.g. to isolate household (or family) effects, the
individual is the household (or family) and the repeated
observations are different persons within the household
• Multi-level analysis may involve more than 2 dimensions of
the sample, e.g. time periods within persons within
households (but note problem of defining the household
over time)

Some types of longitudinal data


• Cohort surveys
  - Birth cohorts, e.g. LSAC (Australia); NCDS, BCS70, MCS (UK)
  - Age group cohorts, e.g. youth cohorts: LSAY (Australia), NLSY,
    MtF, Add Health (US); older people: HRS, SHARE, ELSA
  - Many programme evaluation studies and social experiments
• Panel surveys
  - Rotating household panels (Labour Force Surveys, US SIPP)
  - Perpetual household panels: an indefinitely long horizon of
    regular repeated measurements (HILDA, BHPS, PSID, SOEP…)
  - Company panels: firms observed over time, linked to annual
    accounts information
• Non-temporal survey panels
  - Example: UK Workplace Employment Relations Survey (WERS), a
    cross-section of workplaces with 25 workers sampled within each
• Non-survey panels (aggregate panels)
  - Countries, regions, industries, etc. observed over time

Long-term household panels
• Individuals in their household context
• Perpetual panel survey, often with retrospective elements
(periods before 1st wave & between waves)
• Designed to maintain representativeness of the sampled
population over time. May use refreshment samples to deal
with immigration, panel fatigue/conditioning
• Examples worldwide include Australian HILDA, US PSID, Dutch HP,
  Swedish LoLS, German SOEP, BHPS, Canadian SLID, NZ SoFIE,
  pan-European ECHP, SHARE, and several in developing countries
  (e.g. South Africa, Indonesia, Ethiopia, Vietnam)
• Cross-National Equivalent File harmonises 8 hh panels:
  http://www.human.cornell.edu/pam/research/centers-programs/german-panel/cnef.cfm
• Useful resources for locating data are the HILDA website
  (http://www.melbourneinstitute.com/hilda/links.html) and KeepingTrack
  (http://www.iser.essex.ac.uk/ulsc/longitudinal-resources/keeping-track)
• Big differences in content, following rules, who is interviewed,
interview method, etc.


Why use panel data?


• Can isolate effects of unobserved differences between individuals
• Causal inference may be strengthened by temporal ordering
• Repeated current observation more reliable than recall of histories
in one-shot cross-section surveys
• Some phenomena are inherently longitudinal (e.g. poverty
persistence; unstable employment)
• Can study dynamics – may be important even if we’re only
interested in the long run
Example: $y_{it} = \beta x_{it} + \gamma x_{it-1} + u_{it}$
⇒ Long-run impact $= \beta + \gamma$
Regression of $y_{it}$ on $x_{it}$ ⇒
coefficient $b_{yx} = \beta + \gamma\,\mathrm{cov}(x_{it}, x_{it-1})/\mathrm{var}(x_{it}) \neq \beta + \gamma$
So a static model doesn’t necessarily give good estimates of the
long-run relationship
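
A short worked version of this argument, as a sketch only. The symbols $\beta$ and $\gamma$ for the two coefficients and $\rho$ for the first-order autocorrelation of $x_{it}$ are notation assumed here; $u_{it}$ is taken to be uncorrelated with $x_{it}$ and $x_{it-1}$, and $\mathrm{var}(x_{it})$ is taken to be constant over time.

$$
y_{it} = \beta x_{it} + \gamma x_{it-1} + u_{it}
\quad\Rightarrow\quad
\frac{\mathrm{cov}(y_{it}, x_{it})}{\mathrm{var}(x_{it})}
= \beta + \gamma\,\frac{\mathrm{cov}(x_{it}, x_{it-1})}{\mathrm{var}(x_{it})}
= \beta + \gamma\rho
$$

For instance, with $\beta = \gamma = 1$ and $\rho = 0.5$, the static regression coefficient tends to 1.5 while the long-run impact is $\beta + \gamma = 2$. The two coincide only if $\gamma = 0$ or $\rho = 1$.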

Limitations
BUT don’t expect too much…
• Variation between people usually far exceeds
variation over time for an individual
⇒ a panel with T waves doesn’t give T times the
information of a cross-section
• Variation over time may not exist for some important
variables
• Variation over time may be inflated by measurement
error
• Panel data impose a fixed timing structure;
continuous-time survival analysis may be more
informative
• We still need very strong assumptions to draw clear
inferences from panels: sequencing in time does not
necessarily reflect causation

Some terminology
A balanced panel has the same number of time observations (T)
for each of the n individuals
An unbalanced panel has different numbers of time observations
(Ti) on each individual
A compact panel covers only consecutive time periods for each
individual – there are no “gaps”
Attrition is the process of drop-out of individuals from the panel,
leading to an unbalanced and possibly non-compact panel
A short panel has a large number of individuals but few time
observations on each (e.g. HILDA has 7,400 households and 12
waves)
A long panel has a long run of time observations on each
individual, permitting separate time-series analysis for each

We consider only short panels

Handling panel data in Stata
• For our purposes, the unit of analysis or case is the
individual person
• A record for an individual case contains information
on the person’s state at different dates
• Data can be organised in two ways:
  - Wide form: data is sometimes supplied in this format
  - Long form: usually most convenient & needed for most
    panel data commands in Stata
  - Use Stata reshape command to convert between them.
• Three important operations:
  - Matching/merging
  - Aggregating
  - Appending


Wide format
• One row per case
• Observations on a variable for different time periods (or dates)
held in different columns
• Variable name identifies time (via prefix)

xwaveid   awage          bwage          cwage
          (Wage at w1)   (Wage at w2)   (Wage at w3)
10001     7.2            7.5            7.7
10002     6.3            missing        6.3
10003     5.4            5.4            missing

Long format
• multiple rows per case
• observations on a variable for different time periods held in
different rows for each individual
• The dataset’s row identifier identifies time (e.g. wave)

xwaveid   wave   wage
10001     1      7.2
10001     2      7.5
10001     3      7.7
10002     1      6.3
10002     3      6.3
10003     1      5.4
10003     2      5.4
…         …      …
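
As a minimal sketch of the wide-to-long conversion, the three example cases can be typed in and reshaped as follows. The renaming step is an assumption made for illustration: reshape expects the wave as a suffix on a common stub (wage1, wage2, wage3), whereas the example uses a letter prefix (awage, bwage, cwage).

```stata
* Sketch: convert the three-person wide-format example to long format
clear
input long xwaveid awage bwage cwage
10001 7.2 7.5 7.7
10002 6.3 .   6.3
10003 5.4 5.4 .
end

* move the wave indicator from a letter prefix to a numeric suffix
rename awage wage1
rename bwage wage2
rename cwage wage3

* one row per person-wave; wave is created from the numeric suffix
reshape long wage, i(xwaveid) j(wave)

* keep only the person-waves that were actually observed
drop if missing(wage)
list, sepby(xwaveid)
```

Dropping the missing rows is optional; it simply reproduces the unbalanced long-format listing, in which unobserved waves have no row.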

Matching (or merging)


• Joining two (or more) files at the same level of observation (e.g.
person files) where both (all) files contain the same identifier
variable used as key
• 1:1 - one case in master file corresponds to one case in “using
file” (i.e. file being matched in)
• 1 : many – one case in using file may be ‘distributed’ to many
cases in master file
• E.g. info about a household attached to each one of the household’s
members
• Either way, not all cases in the master file may receive a match; not
all cases in the using file may provide a match
• Stata’s merge command. Note:
• importance of checking: use tabulate _merge (see examples later)
• If there are multiple possible matches in the “using” file for a case in the
master file, Stata chooses one randomly, without warning !!!
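
A minimal sketch of attaching household information to each household member. The variable names (hhid, pid, hhinc) and the data are hypothetical. It uses the merge m:1 syntax available from Stata 11 onwards, under which Stata stops with an error if the key is not unique in the file declared as the "1" side.

```stata
* Sketch: household-level file to be matched in (the "using" file)
clear
input hhid hhinc
1 52000
2 31000
4 45000
end
tempfile hh
save `hh'

* person-level master file: several persons may share one household
clear
input pid hhid
101 1
102 1
201 2
301 3
end

* many persons (master) to one household (using)
merge m:1 hhid using `hh'

* always check how many cases matched
tabulate _merge
```

Here person 301 has no household record (_merge == 1) and household 4 has no members in the master file (_merge == 2), which is exactly the kind of pattern that tabulate _merge is meant to reveal.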

Aggregation

• Deriving group-level information from all the members of that
  group. Examples:
  - calculate household income from the incomes of its members
  - calculate number of children a woman has during her first
    marriage
• Group-level information can be used in two ways:
  - saved in a new file with the group (e.g. household or spell) as
    the case (collapse)
  - attributed to each of the group members within the existing
    file in which member is case (egen; bysort: …)
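
The two uses of group-level information can be sketched as follows, with hypothetical variable names (hhid, pid, inc) and made-up incomes.

```stata
* Sketch: household income from the incomes of its members
clear
input hhid pid inc
1 101 30000
1 102 22000
2 201 31000
end

* (a) attribute the household total to every member (person stays the case)
bysort hhid: egen hhinc = total(inc)
list

* (b) replace the data with one row per household (household becomes the case)
collapse (sum) hhtotal = inc (count) nmembers = pid, by(hhid)
list
```

Option (a) keeps the person-level file intact; option (b) discards it, so it is usually run on a copy or followed by a merge back into the person file.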


Appending
• Combining files with no index-based matching
• E.g. combining file A with n1 rows and file B with n2 rows
to produce a new file C with n1+n2 rows.
• Stata command: append
• Used to assemble a sequence of annual cross-section
data files into a single long-format panel data file
• Rows in new combined files are specific to a person-wave
combination
• Each variable must have the same name in each of
the annual cross-section files
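
A minimal sketch of stacking two annual files into a long-format panel. The file contents and variable names are made up for illustration; the wave identifier is added to each file before appending so that every row is tied to a person-wave combination.

```stata
* Sketch: first annual file, tagged as wave 1
clear
input pid wage
101 7.2
102 6.3
end
generate wave = 1
tempfile w1
save `w1'

* second annual file, same variable names, tagged as wave 2
clear
input pid wage
101 7.5
103 5.4
end
generate wave = 2

* stack the two files: 2 + 2 rows
append using `w1'
sort pid wave
list, sepby(pid)
```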

Ordering the data
• We now have a dataset in long format
• It’s a good idea to order the data for easier viewing.
“Eyeballing” the data is important!
• We also have to tell Stata which variable identifies
the individual (Stata calls this the panel variable).
• We may also have to tell Stata which variable
identifies the repeated observation (Stata calls this the
time variable).
  - For some types of panel analysis we don’t need to know the
    ordering of the repeated observations


sort pid wave

Before sorting           After sorting
PID     wave   wage      PID     wave   wage
10001   2      7.5       10001   1      7.2
10002   3      6.3       10001   2      7.5
10002   1      6.3       10001   3      7.7
10001   1      7.2       10002   1      6.3
10001   3      7.7       10002   3      6.3
10003   1      5.4       10003   1      5.4
10003   2      5.4       10003   2      5.4
…       …      …         …       …      …

Note: this panel is neither balanced nor compact

Panel and time variables

• Use tsset or xtset to tell Stata which are the panel and time variables:
. xtset pid wave
       panel variable: pid, 10002251 to 1.347e+08
        time variable: wave, 1 to 13, but with gaps

• Note that tsset & xtset automatically sort the data accordingly.


Describing panel data


• Ways of describing/summarising panel data:
  - Basic patterns of available cases
  - Between- and within-group components of variation
  - Transition tables
• Some basic notation:
  $y_{it}$ is the “dependent variable” to be analysed
  - i indexes the individual (pid), i = 1, 2, …, n
  - t indexes the repeated observation / time period (wave),
    t = 1, 2, …, $T_i$
• $y_{it}$ may be:
  - continuous (e.g. wages);
  - mixed discrete/continuous (e.g. hours of work);
  - binary (e.g. employed/not employed);
  - ordered discrete (e.g. Likert scale for happiness, attitudes, etc.);
  - unordered discrete (e.g. occupation)

Describe patterns of panel data: xtdes
. xtdes
     pid:  10002251, 10004491, ..., 1.347e+08          n =      16082
    wave:  1, 2, ..., 13                               T =         13
           Delta(wave) = 1; (13-1)+1 = 13
           (pid*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         1       1       2       7      13      13      13

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------------
     4648     28.90   28.90 |  1111111111111
      997      6.20   35.10 |  1............
      646      4.02   39.12 |  11...........
      376      2.34   41.46 |  ............1
      342      2.13   43.58 |  111..........
      327      2.03   45.62 |  1111.........
      261      1.62   47.24 |  ...........11
      254      1.58   48.82 |  .1...........
      251      1.56   50.38 |  ..........111
     7980     49.62  100.00 | (other patterns)
 ---------------------------+---------------
    16082    100.00         |  XXXXXXXXXXXXX


Between- and within-group variation


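Define the individual-specific or group mean for any variable, e.g.
$y_{it}$, as:

$$\bar{y}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_{it}$$

$y_{it}$ can be decomposed into 2 components:

$$y_{it} - \bar{y} = \underbrace{(y_{it} - \bar{y}_i)}_{\text{within}} + \underbrace{(\bar{y}_i - \bar{y})}_{\text{between}}$$

where $\bar{y} = \sum_{i=1}^{n} \sum_{t=1}^{T_i} y_{it} \,/\, (n\bar{T})$ and $\bar{T}$ is the average no. of periods per case

Corresponding decomposition of sum of squares:

$$\sum_{i=1}^{n} \sum_{t=1}^{T_i} (y_{it} - \bar{y})^2 = \sum_{i=1}^{n} \sum_{t=1}^{T_i} (y_{it} - \bar{y}_i)^2 + \sum_{i=1}^{n} \sum_{t=1}^{T_i} (\bar{y}_i - \bar{y})^2$$

or: $T_{yy} = W_{yy} + B_{yy}$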

Between- and within-group variation
• Between and within variation are the basis of linear
panel regression. Important concept to understand.
• Simple BHPS example: balanced panel (n=1119, T =
13) of workers who have reported their wages.
• From summarize, we have grand mean wage ($\bar{y}$) =
£9.84 per hour, and (overall) variance of wages =
32.63. Recall the standard formula for variance:

$$s^2 = \frac{\sum_{i=1}^{n} \sum_{t=1}^{T} (y_{it} - \bar{y})^2}{nT - 1} = \frac{T_{yy}}{nT - 1}$$


Between- and within-group variation (3)


• So $T_{yy}$ is the variance multiplied by its degrees of freedom, $nT - 1$
= 1119*13 – 1 = 14546 (or can calculate $T_{yy}$ ‘by hand’ in Stata –
see example in computer lab).
• We get $T_{yy}$ = 32.627956 * 14546 = 474606.3
• Can calculate $B_{yy}$ and $W_{yy}$ manually in Stata (see HILDA
example in computer lab). We get:
  - $B_{yy}$ = 358920.7
  - $W_{yy}$ = 115685.6
  - Check that $B_{yy} + W_{yy} = T_{yy}$ !!
• Proportion of between variation is $B_{yy} / T_{yy}$ = 76%. Most
variation is between people not within people!
• Measurement error may make this an underestimate!
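
One way the ‘by hand’ calculation might look, as a sketch on a tiny made-up (and deliberately unbalanced) dataset rather than the lab data. All variable names here are hypothetical.

```stata
* Sketch: total, within and between sums of squares by hand
clear
input pid wave wage
1 1 8
1 2 10
1 3 12
2 1 4
2 2 6
end

* grand mean and individual-specific means
egen ybar = mean(wage)
bysort pid: egen ybar_i = mean(wage)

* squared total, within and between deviations for each observation
generate double t2 = (wage - ybar)^2
generate double w2 = (wage - ybar_i)^2
generate double b2 = (ybar_i - ybar)^2

* sums of squares (the same value is stored on every row)
egen double Tyy = total(t2)
egen double Wyy = total(w2)
egen double Byy = total(b2)
display "Tyy = " Tyy[1] "   Wyy = " Wyy[1] "   Byy = " Byy[1]
```

With these five observations the grand mean is 8 and the individual means are 10 and 5, giving Tyy = 40, Wyy = 10 and Byy = 30, so the decomposition holds exactly even though the panel is unbalanced.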

Within and between deviations in the data
                            Grand    Ind.     Within   Between  Total
pid        wave  Wage      mean     mean     dev      dev      dev
10028005   1     9.302     9.841    10.948   -1.646   1.107    -0.539
10028005   2     10.444    9.841    10.948   -0.504   1.107    0.603
10028005   3     13.883    9.841    10.948   2.935    1.107    4.042
10028005   4     4.573     9.841    10.948   -6.375   1.107    -5.268
10028005   5     13.769    9.841    10.948   2.820    1.107    3.928
..         ..    ..        ..       ..       ..       ..       ..
10028005   13    12.914    9.841    10.948   1.966    1.107    3.073
10060111   1     13.046    9.841    12.953   0.094    3.112    3.205
10060111   2     12.923    9.841    12.953   -0.030   3.112    3.081
10060111   3     13.453    9.841    12.953   0.500    3.112    3.612
10060111   4     13.505    9.841    12.953   0.553    3.112    3.664
10060111   5     12.418    9.841    12.953   -0.535   3.112    2.577


Between- and within-group variation: xtsum


• Stata contains a ‘canned’ routine, xtsum, that summarises
within and between variation.
• Doesn’t give an exact decomposition:
  - Converts sums of squares to variance using different ‘degrees of
    freedom’ so they are not comparable
  - Reports square root (i.e. standard deviation) of these variances
  - Documentation is not very clear!

. xtsum wage
Variable      |      Mean Std. Dev.       Min      Max |    Obs
--------------+----------------------------------------+----------
wage  overall |  9.841044  5.712089  .3813552 121.7474 | N = 14547
      between |            4.969431  3.322259 46.54612 | n = 1119
      within  |            2.820121 -18.37394 108.5192 | T = 13
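
The xtsum figures can be linked back to the sums of squares reported earlier. As a rough check, and assuming (inferred from the numbers rather than from the documentation) that the between figure is the standard deviation of the n individual means with n − 1 degrees of freedom and that the within figure uses nT − 1:

$$
s_{\text{between}} = \sqrt{\frac{B_{yy}/T}{n-1}} = \sqrt{\frac{358920.7/13}{1118}} \approx 4.969,
\qquad
s_{\text{within}} = \sqrt{\frac{W_{yy}}{nT-1}} = \sqrt{\frac{115685.6}{14546}} \approx 2.820
$$

Both agree with the output above. Because the divisors differ, $s_{\text{between}}^2 + s_{\text{within}}^2 \approx 24.70 + 7.95 = 32.65$ does not reproduce the overall variance of 32.63 exactly.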

Transitions
• Want to compare state in this wave with state in last wave.
Example: part-time work status (binary variable PT)
• If we have xtset the data, can easily create lagged values of
variable: generate lpt = l.pt
• Then tabulate current against lagged value: tabulate lpt pt
. tabulate lpt pt, row
           | Part-time (<=30 hours
 Lagged PT |        total)
      work |         0          1 |     Total
-----------+----------------------+----------
         0 |    10,619        310 |    10,929
           |     97.16       2.84 |    100.00
-----------+----------------------+----------
         1 |       333      2,166 |     2,499
           |     13.33      86.67 |    100.00
-----------+----------------------+----------
     Total |    10,952      2,476 |    13,428
           |     81.56      18.44 |    100.00

• Same result with command: xttrans pt, freq
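
A minimal sketch of how the lag operator behaves once the data have been xtset, using made-up data with a gap (person 2 is not observed in wave 2). The variable names follow the slide; the values are hypothetical.

```stata
* Sketch: lagged part-time status in a small unbalanced panel
clear
input pid wave pt
1 1 0
1 2 1
1 3 1
2 1 0
2 3 0
end

xtset pid wave

* L.pt is the value one wave earlier for the same person;
* it is missing in a person's first wave and after a gap
generate lpt = L.pt
list, sepby(pid)

tabulate lpt pt, row
```

Only two of the five rows have a non-missing lag here, which is why the transition table is based on fewer observations than the full person-wave file.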



Transitions and measurement error


Analysis of transitions can give good indications of data (un)reliability
Example: UK Offending Crime & Justice Survey: 2 waves, 2003 & 2004

. tab d1evec if wave==1

 have you ever taken |
            cannabis |      Freq.     Percent        Cum.
---------------------+-----------------------------------
                 yes |        855       25.45       25.45
                  no |      2,477       73.72       99.17
          don't know |         13        0.39       99.55
don't want to answer |         15        0.45      100.00
---------------------+-----------------------------------
               Total |      3,360      100.00

Transition matrix
. xttrans d1evec, freq
  have you |
ever taken |        have you ever taken cannabis
  cannabis |       Yes         No         DK       DWTA |     Total
-----------+--------------------------------------------+----------
       Yes |       728        111          0          1 |       840
           |     86.67      13.21       0.00       0.12 |    100.00
-----------+--------------------------------------------+----------
        No |       251      2,189          6          7 |     2,453
           |     10.23      89.24       0.24       0.29 |    100.00
-----------+--------------------------------------------+----------
        DK |         2          9          1          1 |        13
           |     15.38      69.23       7.69       7.69 |    100.00
-----------+--------------------------------------------+----------
      DWTA |         9          5          0          1 |        15
           |     60.00      33.33       0.00       6.67 |    100.00
-----------+--------------------------------------------+----------
     Total |       990      2,314          7         10 |     3,321
           |     29.81      69.68       0.21       0.30 |    100.00


• 13% of people who’d used cannabis before 2003 say they’ve never used before 2004!!

Age and cohort: earnings profiles
How have different generations fared in the labour market?
Cohort-specific age-real earnings profiles for employees, HILDA

[Figure: earnings (vertical axis, 0 to 1000) against age (horizontal axis, 20 to 70), one line per birth cohort: 1941-45, 1946-50, 1951-55, 1956-60, 1961-65, 1966-70, 1971-75, 1976-80, 1981-85]


Age and cohort: earnings deflated by average earnings index

[Figure: earnings (vertical axis, 0 to 1000) against age (horizontal axis, 20 to 70), one line per birth cohort: 1941-45, 1946-50, 1951-55, 1956-60, 1961-65, 1966-70, 1971-75, 1976-80, 1981-85]

How did we do it?
gen cohort=year-hgage   // derive year of birth from age
recode cohort (-999/1940=.) (1941/1945=1) (1946/1950=2) (1951/1955=3) ///
              (1956/1960=4) (1961/1965=5) (1966/1970=6) (1971/1975=7) ///
              (1976/1980=8) (1981/1985=9) (1986/9999=.)

* use the collapse command to replace dataset by one containing real earnings
* averages for age-cohort groups
collapse rwsce, by(cohort hgage)
keep if hgage>=16 & hgage<=65

* create cohort-specific earnings variables
forvalues c=1/9 {
    gen reale`c'=rwsce if cohort==`c'
}
* label the variables created in the loop (reale1-reale9) for the graph legend
label variable reale1 "1941-45"
label variable reale2 "1946-50"
label variable reale3 "1951-55"
label variable reale4 "1956-60"
label variable reale5 "1961-65"
label variable reale6 "1966-70"
label variable reale7 "1971-75"
label variable reale8 "1976-80"
label variable reale9 "1981-85"

* plot the age-earnings profiles for real earnings
graph twoway scatter reale1-reale9 hgage, msize(small..) connect(l..) ///
    ytitle("earnings") ///
    yscale(titlegap(1)) xtitle("age") xscale(range(16 65) titlegap(1)) ///
    legend(rows(3))


Notation
We work with observed variables $y_{it}$, $z_i$ and $x_{it}$:
$y_{it}$ = dependent variable to be analysed
$z_i$ = time-invariant explanatory covariates
(e.g. year of birth, sex)
$x_{it}$ = time-varying explanatory covariates
(e.g. job tenure, marital status)
where i denotes individuals, t denotes time periods.
[Technically, $z_i$ and $x_{it}$ are row vectors, containing collections of variables]

Modelling approaches
Ways of thinking about panel data:
• A collection of cross-sections, one for each time period:
  - Between-group regression
  - The Structural Equations (SEM) approach – 1 equation for each time
    period (e.g. Bollen, 1989, Structural Equations with Latent Variables)
• A collection of time-series, one for each individual. Examples:
  - Within-group regression
  - Dynamic models with individual heterogeneity
  - Latent growth curve analysis (e.g. Acock & Li
    http://oregonstate.edu/dept/hdfs/papers/lgcgeneral.pdf#search=%22latent%20growth%20curve%20analysis%20oregon%22)
  - Trajectory analysis (e.g. Nagin & Tremblay, Child Development 1999)
• Comprehensive models try to capture both inter-individual and
  inter-period variation

Why use panel data?
The disadvantages of cross-section data
Example: cross-section earnings regression (single time period, t
subscript suppressed)
$y_i = z_i\alpha + x_i\beta + \varepsilon_i$
where:
$y_i$ = log wage;
$z_i$ = observable time-invariant factors (education, etc.);
$x_i$ = observable time-varying factors (e.g. job tenure);
$\varepsilon_i$ = random error (e.g. “luck”)

Possible misspecifications, causing bias:


• Omitted dynamics (lagged variables not observed)
• Reverse causation (e.g. pay and tenure jointly determined)
• Omitted unobservables (e.g. “ability”)


When to use regression methods


Regression models are suitable for the analysis of dependent
variables $y_{it}$ which can vary continuously, so:
  - Income, birthweight, etc. ⇒ regression appropriate
  - Age at retirement, interpolated grouped income, etc. ⇒
    regression may work OK
  - Age of school leaving, no. of visits to doctor last week, etc.
    ⇒ regression a bit risky
  - Binary variables (married/non-married, employed/non-employed,
    etc.) ⇒ regression is unreliable
Regression models also have technical problems when:
  - The sample is censored or truncated (e.g. if $y_{it}$ = hours of
    work and non-workers are recorded as zero or excluded)
  - There is no natural scale (e.g. Likert scales)

Related methods: Latent growth curves
Latent growth curve analysis is widely used in sociology,
psychology, criminology, etc. but not economics

Example: simple quadratic latent growth curve:
$y_{it} = u_i + \beta_i t + \gamma_i t^2 + \varepsilon_{it}$
where the intercept and slope coefficients ($u_i$, $\beta_i$, $\gamma_i$) vary
randomly across individuals
Advantage:
  - Doesn’t assume all individuals have the same coefficients
    (panel data regression assumes no variation in $\beta_i$, $\gamma_i$)
Disadvantages:
  - Purely descriptive: no theory of development
  - Crude dynamics (nothing changes the trend for an individual
    once it’s underway)

Related methods : SEMs


Structural equation modelling (SEM) is widely used in many disciplines,
but with differences in terminology.
In panel data applications, each year is described by a different equation:
Period 1: $y_{i1} = z_i\alpha_1 + x_{i1}\beta_1 + u_i + \varepsilon_{i1}$
    ⋮
Period T: $y_{iT} = z_i\alpha_T + x_{iT}\beta_T + u_i + \varepsilon_{iT}$

Advantage:
  - General structure (e.g. panel regression is the special case where the
    $\alpha_t$ and $\beta_t$ are the same in all periods)
Disadvantages:
  - No theory of how the parameters vary over time
  - Can’t predict outcomes in new periods
  - Difficult to use in long or very unbalanced panels

Related methods: multi-level models
Multi-level modelling is widely used throughout social
statistics. It generalises ordinary panel data applications to
multiple dimensions
Example: time periods (t) within individuals (i) within
households (h):
$y_{hit} = x_{hit}\beta + u_{hi} + w_h + \varepsilon_{hit}$
  - $w_h$ is the household effect, common to all individuals at all periods
    within household h
  - $u_{hi}$ is the individual effect, common to all time periods for the ith
    individual in household h
  - Some or all of the $\beta$-coefficients may also be allowed to vary

Specialist software is available for latent growth curve, SEM and
multi-level analysis (MLwiN, Mplus, LISREL, etc.). See also xtmixed and
GLLAMM in Stata


Two basic identification problems

(1) Unobservable variables


• Can we identify the impact of unobservables?
• Can we distinguish the impact of unobservables from the impact
of time-invariant observables?

(2) Age, cohort and time effects – can they be distinguished?


• Behaviour may change with age
• Current behaviour may be affected by experience in “formative
years” ⇒ cohort or year-of-birth effect
• Time may affect behaviour through changing social environment

Identification of unobservables
Example: wage models based on human capital theory:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$
where i = 1…n, t = 1 … $T_i$:
$y_{it}$ = log wage
$z_i$ = observable time-invariant factors (e.g. education)
$x_{it}$ = observable time-varying factors (e.g. job tenure)
$u_i$ = unobservable “ability” (assumed not to change over time)
$\varepsilon_{it}$ = “luck”

Pooled data regression of y on z and x ⇒ omitted variable bias:
Ability (u) is likely to be positively related to education (z)
⇒ upward bias in estimate of returns to education

But can we identify the effect of ui if we can’t observe it?


Identification of unobservables
The identification of the effect of $u_i$ rests on assumptions about the
correlation structure of the compound residual $v_{it}$:
$v_{it} = u_i + \varepsilon_{it}$
If individuals have been sampled at random, there is no correlation
across different individuals:
$\mathrm{cov}(u_i, u_j) = 0$
$\mathrm{cov}([\varepsilon_{i1} \dots \varepsilon_{iT}], [\varepsilon_{j1} \dots \varepsilon_{jT}]) = 0$
for any two (different) sampled individuals i and j

But there may be some correlation over time for any individual:
$\mathrm{cov}(v_{is}, v_{it}) \neq 0$ for two different periods $s \neq t$,
since:
$\mathrm{cov}(v_{is}, v_{it}) = \mathrm{cov}(u_i + \varepsilon_{is}, u_i + \varepsilon_{it}) = \mathrm{var}(u_i) + \mathrm{cov}(\varepsilon_{is}, \varepsilon_{it})$
If we assume $\mathrm{cov}(\varepsilon_{is}, \varepsilon_{it}) = 0$ then $u_i$ is the only source of correlation over
time, so its variance can be inferred (identified) from the serial
correlation of the residuals.
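
As a sketch of how this identification works in practice, add the assumptions (made here for illustration) that the idiosyncratic error has constant variance $\sigma_\varepsilon^2$, is serially uncorrelated and is uncorrelated with $u_i$, and write $\sigma_u^2$ for $\mathrm{var}(u_i)$. Then:

$$
\mathrm{corr}(v_{is}, v_{it}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2} \quad (s \neq t)
\qquad\Rightarrow\qquad
\sigma_u^2 = \mathrm{corr}(v_{is}, v_{it}) \times \mathrm{var}(v_{it})
$$

For example, if the compound residual has variance 0.25 and residuals from two different waves have correlation 0.6, then $\sigma_u^2 = 0.15$ and $\sigma_\varepsilon^2 = 0.10$.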

Identification with time-invariant covariates:
can we distinguish zi and ui?
Consider again the panel regression model:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$    (1)
Let $z_i\delta$ be any arbitrary combination of the z-variables (choose any value
for $\delta$ you like). Add it to the right-hand side and subtract it again:
$y_{it} = z_i\alpha + z_i\delta + x_{it}\beta + u_i - z_i\delta + \varepsilon_{it}$
Now re-write this as:
$y_{it} = z_i\alpha^* + x_{it}\beta + u_i^* + \varepsilon_{it}$    (2)
where $\alpha^*$ represents $(\alpha + \delta)$ and $u_i^*$ represents $(u_i - z_i\delta)$.

But (1) and (2) have exactly the same form, so we can’t tell whether we’re
estimating $\alpha$ or a completely arbitrary value $\alpha^* = (\alpha + \delta)$.
So the separate effects of $z_i\alpha$ and $u_i$ can’t be distinguished empirically
without further assumptions


Summary
In models like:
$y_{it} = z_i\alpha + x_{it}\beta + u_i + \varepsilon_{it}$

• We can only identify the effect of unobservable ability $u_i$ if
we can assume that $\varepsilon_{it}$ is serially independent, or has some
other restricted autocorrelation structure. See also
Calzolari & Magazzini (2009) on the difficulty of
identifying the serial correlation in $\varepsilon_{it}$ alongside $\mathrm{var}(u_i)$
[http://dse.univr.it/RePEc/ver/Wpaper/WP53.pdf]
• We cannot distinguish the separate effects of $z_i$ and $u_i$
without making further assumptions (e.g. no correlation
between $z_i$ and $u_i$).

Another problem: age, cohort & time effects
Identity relating age ($A_{it}$), period (t) and birth cohort ($B_i$):
$A_{it} \equiv t - B_i$
The three effects cannot be distinguished, even in principle. It would require an
ability to move a cohort forward or back in time (!) to measure
the effect of time holding age and cohort constant.
Glenn (Am. Sociol. Rev. 1976) “Cohort analysts’ futile quest –
statistical attempts to separate age, period & cohort effects”
• In a cross-section, t doesn’t vary, so time effects can’t be
estimated, and age and cohort are collinear – only their joint
effect can be estimated.
• In a panel, t varies but $A_{it}$, t and $B_i$ are collinear - only two
of the three effects can be estimated.
• So we can use (t, $B_i$), ($A_{it}$, $B_i$) or ($A_{it}$, t) as covariates, but not
all three.
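
A short worked illustration of the collinearity, assuming purely linear effects. The coefficients a, b and c are hypothetical symbols introduced only for this example. Substituting the identity into the regression function gives:

```latex
\begin{aligned}
E(y_{it}) &= a\,A_{it} + b\,t + c\,B_i \\
          &= a\,(t - B_i) + b\,t + c\,B_i \\
          &= (a + b)\,t + (c - a)\,B_i
\end{aligned}
```

Only the two combinations (a + b) and (c − a) are determined by the data; any value of a is compatible with them, which is why one of the three variables has to be left out.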


Age, cohort and time effects


• A possible solution is to think more deeply about the effects
of time and cohort and introduce further information.
• E.g. we may think it is the social environment at the time of
birth that generates differences between cohorts and the
present social environment that generates time effects.
• Let w(t) be variables describing the social environment at
historical time t (e.g. unemployment rate, income inequality,
crime rate).
• Then our model would use $A_{it}$, w(t) and w($B_i$) as covariates
• This breaks the exact relationship between age, time and
cohort effects and permits identification.

Pooled regression for panel data
The “standard” panel data regression model is:
$y_{it} = z_i \alpha + x_{it} \beta + u_i + \varepsilon_{it}$
We have observations indexed by $t = 1 \dots T_i$, $i = 1 \dots n$.
• A pooled regression of y on z and x using all the data together
would assume that there is no correlation across individuals,
nor across time periods for any individual
• This would ignore the individual effect u, which generates
correlation between the values of $(u_i + \varepsilon_{i1}) \dots (u_i + \varepsilon_{iT})$ for each
individual i
• So pooled regression doesn’t make best use of the data
• Under favourable conditions (if $u_i$ is uncorrelated with $z_i$ and $x_{it}$),
pooled regression gives unbiased but inefficient results, with
incorrect standard errors, t-ratios, etc.
• If $u_i$ is correlated with $z_i$ or $x_{it}$, pooled regression is also biased
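
The bias case can be illustrated with a small simulation. This is a do-file sketch under stated assumptions: the variable names (id, t, x, y, u), the true slope of 2 and the strength of the correlation between x and u are all made up for the example. Pooled OLS is compared with the within-group estimator introduced on the following slides.

```stata
* Simulated panel in which x_it is correlated with u_i; true beta = 2
clear
set seed 12345
set obs 500                      // n = 500 individuals
gen id = _n
gen u  = rnormal()               // unobserved individual effect u_i
expand 5                         // T = 5 periods for everyone
bysort id: gen t = _n
gen x = 0.5*u + rnormal()        // regressor correlated with u_i
gen y = 1 + 2*x + u + rnormal()

regress y x                      // pooled OLS ignores u_i
scalar b_pooled = _b[x]

xtset id t
xtreg y x, fe                    // within-group regression removes u_i
scalar b_within = _b[x]

display "pooled = " b_pooled "   within-group = " b_within
```

In this design the pooled slope should settle near 2.4 rather than 2, because cov(x, u)/var(x) = 0.5/1.25 = 0.4, while the within-group slope should be close to 2.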


Least-squares dummy variable (LSDV) regression


The panel data regression model is:
$y_{it} = z_i \alpha + x_{it} \beta + u_i + \varepsilon_{it}$
We have observations indexed by $t = 1 \dots T_i$, $i = 1 \dots n$.
The $u_i$ can be captured using dummy variables. Construct a set of n
dummy variables $D_{1i} \dots D_{ni}$, where:
$D_{ri} = 1$ if $i = r$ and 0 otherwise, for $r = 1 \dots n$
Thus $D_{ri}$ tells us whether observation (i, t) relates to person r.
The model is now:
$y_{it} = z_i \alpha + x_{it} \beta + u_1 D_{1i} + \dots + u_n D_{ni} + \varepsilon_{it}$
So $u_1 \dots u_n$ are now seen as the coefficients of a set of n dummy
variables.

Shortcut calculation of the LSDV regression
A multiple regression of y on (z, x) and ($D_1 \dots D_n$) can be done in
two stages:
Stage 1: Eliminate the effect of ($D_1 \dots D_n$) on each of the variables
(y, z, x) using the “within-group” data transformation:
$y^*_{it} \equiv y_{it} - \bar{y}_i$
$x^*_{it} \equiv x_{it} - \bar{x}_i$
$z^*_i \equiv z_i - \bar{z}_i \equiv 0$ (so $z_i$ is eliminated completely)

Stage 2: regress $y^*$ on $(z^*, x^*)$: in other words, $y_{it} - \bar{y}_i$ on $x_{it} - \bar{x}_i$

[Intuition: think of regressing a variable on a constant. Estimate
of constant is mean and residual is deviation from mean.]

This is exactly equivalent to regressing y on (z, x, $D_1, \dots, D_n$)
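
The equivalence between the dummy-variable regression and the two-stage shortcut can be verified numerically. The do-file sketch below uses simulated data; the variable names and parameter values are assumptions made for the illustration. The slope coefficients should agree up to rounding error.

```stata
* Simulated panel; true beta = 2
clear
set seed 2013
set obs 200                      // n = 200 individuals
gen id = _n
gen u  = rnormal()
expand 4                         // T = 4 periods each
bysort id: gen t = _n
gen x = 0.5*u + rnormal()
gen y = 1 + 2*x + u + rnormal()

* (a) LSDV: one explicit dummy per individual
regress y x i.id
scalar b_lsdv = _b[x]

* (b) Shortcut: subtract individual means, then run OLS
bysort id: egen double ybar = mean(y)
bysort id: egen double xbar = mean(x)
gen double ystar = y - ybar
gen double xstar = x - xbar
regress ystar xstar, noconstant
scalar b_within = _b[xstar]

display "LSDV = " b_lsdv "   within-group = " b_within
```

The coefficient estimates coincide, but the standard errors printed by the stage-2 regression in (b) do not allow for the n estimated means, so they are too small; `xtreg y x, fe` (after `xtset id t`) applies the degrees-of-freedom correction.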


Another interpretation of LSDV


Start differently, by thinking how we can cope with $u_i$.
We don’t know its statistical properties, so let’s try to
eliminate it from the model. We can eliminate it in
various ways, for example:

Time differencing: $y_{it} - y_{i,t-1} = (x_{it} - x_{i,t-1})\beta + \varepsilon_{it} - \varepsilon_{i,t-1}$

or

Within-group transform: $y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i$

The within-group approach is the most efficient in the
least squares sense (given serially uncorrelated, constant-variance $\varepsilon_{it}$).
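
The two ways of eliminating the individual effect can be compared directly. The do-file sketch below uses a simulated two-period panel; the variable names and parameter values are assumptions for the illustration. With exactly two periods per individual, time differencing and the within-group transform give the same slope estimate; with more periods they generally differ in any given sample.

```stata
* Simulated two-period panel; true beta = 2
clear
set seed 42
set obs 1000
gen id = _n
gen u  = rnormal()
expand 2                         // T = 2
bysort id: gen t = _n
gen x = 0.5*u + rnormal()
gen y = 1 + 2*x + u + rnormal()
xtset id t

regress D.y D.x, noconstant      // time differencing
scalar b_fd = _b[D.x]

xtreg y x, fe                    // within-group transform
scalar b_wg = _b[x]

display "first difference = " b_fd "   within-group = " b_wg
```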

A note on terminology
Different names are commonly used for this one estimation method:
• Least squares dummy variables (LSDV)
• Within-group regression
• Fixed-effects regression
• Covariance analysis regression

• “LSDV” refers to the method of derivation using explicit dummy
variables;
• “within-group” refers to the type of data transform implied by the
method;
• “fixed effects” is common but often poor terminology which
suggests (wrongly, in the case of sample survey data) that the $u_i$ are
fixed parameters;
• “covariance analysis” reflects the origins of the method as a
generalisation of analysis of variance in agricultural experiments.


Between-group regression
Instead of eliminating $u_i$ from the regression, we can amplify
it by averaging out all the within-individual variation, leaving
only between-individual variation to analyse:
Between-group transform: $\bar{y}_i = z_i \alpha + \bar{x}_i \beta + u_i + \bar{\varepsilon}_i$

Then regress $\bar{y}_i$ on $(z_i, \bar{x}_i)$ in one of two ways:

• Use one group-mean observation per individual
• Use $T_i$ copies of the group mean data for individual i
The former is (unfortunately) the Stata default: use the wls option
for the latter.
NB: The latter is equivalent to a weighted regression of $\bar{y}_i$ on $(z_i, \bar{x}_i)$, with a
weight of $T_i$ for individual i. It’s desirable to give more weight to cases
with many time observations, since they contain more information
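
The two versions of the between-group regression can be tried on an unbalanced simulated panel. The do-file sketch below is illustrative: the variable names (xperm, nper, T) and the design are assumptions. The final step reproduces the weighted version by hand, on the understanding (from the description of the wls option in the Stata manual) that it weights each group mean by the number of periods observed.

```stata
* Simulated unbalanced panel; true beta = 2, u_i uncorrelated with x
clear
set seed 7
set obs 300
gen id    = _n
gen u     = rnormal()
gen xperm = rnormal()            // persistent part of x
gen nper  = 2 + mod(id, 5)       // between 2 and 6 periods per person
expand nper
bysort id: gen t = _n
gen x = xperm + rnormal()
gen y = 1 + 2*x + u + rnormal()
xtset id t

xtreg y x, be                    // default: one group mean per person
scalar b_be = _b[x]

xtreg y x, be wls                // weighted version
scalar b_wls = _b[x]

* By hand: regress group means on group means, weighting by T_i
collapse (mean) y x (count) T = t, by(id)
regress y x [aweight = T]
display "be = " b_be "   be wls = " b_wls "   by hand = " _b[x]
```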

Within- & between-group estimates –
simple case
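Suppose that x (and therefore $\beta$) is a single variable
(scalar), and the panel is balanced ($T_i = T$). Want to
estimate:
Within-group: $y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i$
Between-group: $\bar{y}_i = \bar{x}_i \beta + u_i + \bar{\varepsilon}_i$

$\hat{\beta}_W = \dfrac{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(y_{it} - \bar{y}_i)}{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)^2} = \dfrac{w_{xy}}{w_{xx}}$ ;
$\hat{\beta}_B = \dfrac{\sum_{i=1}^{n}\sum_{t=1}^{T}(\bar{x}_i - \bar{x})(\bar{y}_i - \bar{y})}{\sum_{i=1}^{n}\sum_{t=1}^{T}(\bar{x}_i - \bar{x})^2} = \dfrac{b_{xy}}{b_{xx}}$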


Within-group estimate – simple case


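Can substitute for $y_{it} - \bar{y}_i$ in preceding formula, to obtain:

$\hat{\beta}_W = \dfrac{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)\left[(x_{it} - \bar{x}_i)\beta + \varepsilon_{it} - \bar{\varepsilon}_i\right]}{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)^2} = \beta + \dfrac{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)(\varepsilon_{it} - \bar{\varepsilon}_i)}{\sum_{i=1}^{n}\sum_{t=1}^{T}(x_{it} - \bar{x}_i)^2} = \beta + \dfrac{w_{x\varepsilon}}{w_{xx}}$

If $x_{it}$ and $\varepsilon_{it}$ are uncorrelated, $E(w_{x\varepsilon}) = 0$, so $E(\hat{\beta}_W) = \beta$

…which means, loosely speaking, that on average $\hat{\beta}_W$ is
correct (unbiased).
Note: for unbiasedness of $\hat{\beta}_B$, we need also that $x_{it}$ is
uncorrelated with $u_i$, so between-group regression is less
“robust” than within-group regression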
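
The same substitution can be sketched for the between-group estimator, which shows where the extra requirement comes from. The symbols $b_{xu}$ and $b_{x\varepsilon}$ are introduced here by analogy with $w_{x\varepsilon}$ and do not appear in the original slides.

```latex
\begin{aligned}
\bar{y}_i - \bar{y} &= (\bar{x}_i - \bar{x})\beta + (u_i - \bar{u}) + (\bar{\varepsilon}_i - \bar{\varepsilon}) \\[4pt]
\hat{\beta}_B &= \frac{b_{xy}}{b_{xx}}
  = \beta + \frac{b_{xu}}{b_{xx}} + \frac{b_{x\varepsilon}}{b_{xx}},
\qquad
b_{xu} = \sum_{i=1}^{n}\sum_{t=1}^{T} (\bar{x}_i - \bar{x})(u_i - \bar{u})
\end{aligned}
```

The term $b_{xu}/b_{xx}$ has no counterpart in the within-group expression, because $u_i$ is removed by the within-group transform. It averages to zero only if $x_{it}$ is uncorrelated with $u_i$.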
Within- & between-group relationships:
correlated individual effects

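[Figure: y plotted against x, showing several within-group (W-G) lines, individual effects u1, u2, u3, and the between-group (B-G) line; x-axis marked x1 … x4.]

In this example, individual effects are negatively correlated
with $x_i$, so B-G & W-G relationships differ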

Within- & between-group relationships:
uncorrelated individual effects

[Figure: y plotted against x with the between-group (B-G) line; x-axis marked x1 … x4.]

BHPS example of panel data estimation

The Stata command xtreg computes within-group and
between-group regressions

Example: within- and between-group regressions of log
earnings on age, year of birth and time, allowing for
unobserved individual effects:
* assumes the panel structure is already declared (e.g. xtset pid year)
gen age=year-cohort
gen lwage=ln(w_hr)
* within-group (fixed-effects) regression
xtreg lwage age cohort, fe
* between-group regression
xtreg lwage age cohort, be


Stata output: within-group regression


. xtreg lwage age cohort , fe

Fixed-effects (within) regression               Number of obs      =     61516
Group variable (i): pid                         Number of groups   =     10335

R-sq:  within  = 0.1217                         Obs per group: min =         1
       between = 0.0312                                        avg =       6.0
       overall = 0.0194                                        max =        14

                                                F(1,51180)         =   7094.59
corr(u_i, Xb)  = -0.4880                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .030061   .0003569    84.23   0.000     .0293615    .0307605
      cohort |  (dropped)
       _cons |   .8994719     .01369    65.70   0.000     .8726394    .9263045
-------------+----------------------------------------------------------------
     sigma_u |  .60455798
     sigma_e |  .28494801
         rho |  .81822708   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(10334, 51180) =    18.19        Prob > F = 0.0000

Stata output: between-group regression

. xtreg lwage age cohort , be

Between regression (regression on group means)  Number of obs      =     61516
Group variable (i): pid                         Number of groups   =     10335

R-sq:  within  = 0.1217                         Obs per group: min =         1
       between = 0.0356                                        avg =       6.0
       overall = 0.0313                                        max =        14

                                                F(2,10332)         =    190.55
sd(u_i + avg(e_i.))=  .5277749                  Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0188575   .0017201    10.96   0.000     .0154858    .0222292
      cohort |   .0105401   .0015325     6.88   0.000     .0075361    .0135442
       _cons |  -19.39964   3.065617    -6.33   0.000    -25.40885   -13.39044
------------------------------------------------------------------------------


Important points
• The “intra-class correlation” is:
$\rho = \mathrm{corr}(u_i + \varepsilon_{is},\, u_i + \varepsilon_{it}) = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$
= 81.8%, for any two different periods s, t
so variation between individuals is dominant
• The within-group $R^2$ is much higher than the
between-group $R^2$ (note minor differences in Stata
calculation of them for w-g & b-g commands)
• Since $\mathrm{corr}(u_i, x_{it}\hat{\beta}) = -0.488 < 0$ (reported by Stata as corr(u_i, Xb)), there’s a negative bias in
the between-group estimate of the age effect
(w-g coefficient = .030; b-g coefficient = .019)
• BUT: evidence of bias in between-group results
doesn’t necessarily imply that within-group results
are OK!
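
As a check on the first point, the intra-class correlation can be recomputed from the sigma_u and sigma_e values in the within-group output (rounded here to four decimal places):

```latex
\hat{\rho}
  = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_\varepsilon^2}
  = \frac{0.6046^2}{0.6046^2 + 0.2849^2}
  = \frac{0.3655}{0.3655 + 0.0812}
  \approx 0.818
```

This matches the rho of .818 reported by Stata.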

Appendix
Examples of other household panels


Specific examples - PSID


• Panel Study of Income Dynamics
• Based at SRC, University of Michigan
• Began in 1968 with 4,800 households.
• Original sample combined representative cross-section
and low-income sample. Now has around 7,000
households.
• Annual interviews 1968-96, biennial since 1997, with
household head (but covering all household members)
• Face-to-face PAPI 1968-72, mainly telephone
interviewing (CATI) since 1973.
• http://psidonline.isr.umich.edu/

Specific examples - GSOEP

• German Socio-Economic Panel Study


• Based at DIW, Berlin
• Began in 1984 with approx 6,000 households.
• Various “top-ups” including expansion to former
GDR. Now has around 12,000 households.
• Annual interviews with all adult members of household.
• Various interview modes with gradual introduction
of CAPI (computer-aided personal interviewing)
since 1998.
• http://www.diw.de/english/soep/


Specific examples – BHPS/UKHLS


• British Household Panel Survey. Based at ISER, University of
Essex
• Began in 1991 with approx 5,500 households (approx 10,000
adults) from England, Wales and (most of) Scotland. Extension
samples from Scotland and Wales (1500 households each)
added in 1999; sample from Northern Ireland (2000
households) added in 2001.
• Annual interviews with all adults (aged 16+) in household.
Interviews with 11-16s added in 1994
• Questionnaires have annually-repeated core + less frequent or
irregular additions. CAPI since 1999
• http://www.iser.essex.ac.uk/survey/bhps
• Now absorbed into the UK Household Longitudinal Survey
(Understanding Society) with 40,000 households
• http://www.understandingsociety.org.uk/