
Lab Introduction To STATA

The document provides an introduction and instructions for using the statistical software package STATA. It explains how to input data manually or import it from Excel. It then gives an example of how to estimate a panel data model in STATA to analyze the determinants of financial development across countries over time. Key steps include transforming variables, descriptive statistics, pooled OLS regression, testing if random effects are needed, and using the Hausman test to check if fixed effects are preferred.

Uploaded by

Afif

Introduction to STATA for Windows (Version 11)

STATA is a powerful statistical package which is also easy to use. This handout
presents the rudiments of STATA. It is by no means comprehensive, but it will allow
you to do basic panel regression analysis.

Panel Data Models


Notations:

$y_{it} = \alpha_1 + \beta x_{it} + \varepsilon_{it}$, $i = 1, \dots, N$; $t = 1, \dots, T$

- i indexes individual economic units, and t indexes time periods.


- Total number of observations: NT
- When each i has T observations, the panel is said to be a balanced one. If the
units have varying number of observations, we have an unbalanced panel.
- Example of panel data and how we usually organize it:
Cross-sectional units: countries
Time units: years

How to input data?

STATA works like any other spreadsheet: within STATA go to data editor (or type
edit), and input the data manually: each column represents a variable and each cell
the value of the relevant variable.

How to get Information from EXCEL into STATA?

STATA expects a single matrix or table of data from a single sheet, with at most one
line of text at the start defining the contents of the columns. Using your Windows
computer,

1. Start EXCEL

2. Enter data in rows and columns (or open the EXCEL file of interest)

[Open dataset file: Asian 10 Data.xls]

3. Highlight the data of interest, then pull down Edit and choose Copy

4. Start STATA

5. Select the data editor

6. Paste data into editor by choosing Paste (Ctrl V)

7. When prompted, indicate that the first row contains the variable names

8. Exit edit

Click Close

9. Save the data file (*.dta) – any file name you like
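
The copy-and-paste steps above can also be scripted. The following is a minimal sketch, not part of the original handout. It assumes Stata 12 or later (`import excel` is not available in Stata 11), that the workbook sits in the current working directory, and that its first row holds the variable names. The output file name `asian10.dta` is a hypothetical choice.

```stata
* Sketch only: requires Stata 12+ (import excel is not in Stata 11)
* Assumes "Asian 10 Data.xls" is in the current working directory
import excel using "Asian 10 Data.xls", firstrow clear

* Inspect what was read before saving
describe
list in 1/5

* Save as a Stata dataset (the file name is a hypothetical choice)
save "asian10.dta", replace
```

Scripting the import makes the data preparation reproducible, which the manual paste route does not.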

An example using STATA
Suppose we are interested in estimating the following model based on our panel of
countries

Empirical Model: The Determinants of Financial Development (FD)

$\ln FD_{it} = \alpha + \beta_1 \ln RGDPC_{it} + \beta_2 \ln INS_{it} + \beta_3 \ln FDI_{it} + \varepsilon_{it}$

where
FD = financial development (% of GDP)
RGDPC = real GDP per capita (Constant 2005 US dollar)
INS = institutions (scaled 1 - 100)
FDI = foreign direct investment (US$)

• First tell STATA which variable is i, and which is t.


.tsset code year
panel variable: code (strongly balanced)
time variable: year, 1999 to 2004
delta: 1 unit
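
Before estimating anything, it can help to confirm that the panel is organized as expected. The sketch below assumes the identifier variables are named `code` and `year`, as in the handout's dataset. `xtset` serves the same purpose as `tsset` when both a panel and a time variable are given.

```stata
* Confirm that each code-year pair identifies exactly one observation
isid code year

* Declare the panel structure
xtset code year

* Show the pattern of observations per unit (balanced or unbalanced)
xtdescribe
```

If `isid` stops with an error, the data contain duplicate country-year rows that should be resolved before running any `xt` command.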

• Transform to natural log series


.generate lfd = ln(fd)

.generate lrgdpc = ln(rgdpc)

.generate lins = ln(ins)

.generate lfdi = ln(fdi)
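
The four transformations above can be written as one loop. This is an optional sketch that produces the same variable names (`lfd`, `lrgdpc`, `lins`, `lfdi`), so skip it if those variables already exist.

```stata
* Create l<var> = ln(<var>) for each series in one loop
* (generate fails if lfd, lrgdpc, lins or lfdi already exist)
foreach v of varlist fd rgdpc ins fdi {
    generate l`v' = ln(`v')
}

* ln() returns missing for zero or negative values; count any such cases
foreach v of varlist fd rgdpc ins fdi {
    count if `v' <= 0
}
```

A nonzero count in the second loop means some observations are lost in the log transformation and will drop out of the regressions.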

• Descriptive statistics (Command: xtsum)


.xtsum lfd lrgdpc lins lfdi
Variable | Mean Std. Dev. Min Max | Observations
-----------------+--------------------------------------------+----------------
lfd overall | 4.483581 .6633935 2.92692 5.428259 | N = 170
between | .6339046 3.369432 5.25816 | n = 10
within | .2762236 3.433328 5.334476 | T = 17
| |
lrgdpc overall | 8.505417 1.461273 5.975189 10.52367 | N = 170
between | 1.519998 6.406903 10.46566 | n = 10
within | .2089482 7.846135 9.220709 | T = 17
| |
lins overall | 3.985642 .3996766 3.049427 4.505527 | N = 170
between | .4091821 3.382909 4.464227 | n = 10
within | .090348 3.652161 4.245211 | T = 17
| |
lfdi overall | 11.28412 1.263696 8.870247 14.16784 | N = 170
between | 1.130506 9.642214 13.26921 | n = 10
within | .6632357 9.524323 12.88804 | T = 17

• Pooled OLS Estimation (Command: regress or reg)

.regress lfd lrgdpc lins lfdi


Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 3, 166) = 74.68
Model | 42.7222872 3 14.2407624 Prob > F = 0.0000
Residual | 31.653087 166 .190681247 R-squared = 0.5744
-------------+------------------------------ Adj R-squared = 0.5667
Total | 74.3753742 169 .440090972 Root MSE = .43667

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | -.0280911 .0665201 -0.42 0.673 -.1594256 .1032435
lins | .9882674 .2226742 4.44 0.000 .5486288 1.427906
lfdi | .1823524 .0336401 5.42 0.000 .1159348 .24877
_cons | -1.27406 .5580983 -2.28 0.024 -2.375945 -.1721739
------------------------------------------------------------------------------

• Pooled OLS versus Random Effect (RE)

We wish to test whether RE (GLS) is necessary or whether pooled OLS (simple OLS) will do.
In other words, we check whether the data contain country-specific effects, i.e.
heterogeneity ($u_i$).

Random effects option (re):

.xtreg lfd lrgdpc lins lfdi, re

Random-effects GLS regression Number of obs = 170


Group variable: code Number of groups = 10

R-sq: within = 0.1743 Obs per group: min = 17


between = 0.5904 avg = 17.0
overall = 0.5148 max = 17

Random effects u_i ~ Gaussian Wald chi2(3) = 42.85


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

------------------------------------------------------------------------------
lfd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .3001345 .0925208 3.24 0.001 .1187972 .4814719
lins | .3380844 .2032213 1.66 0.096 -.0602219 .7363908
lfdi | .0511819 .0380584 1.34 0.179 -.0234111 .1257749
_cons | .0057854 .8241524 0.01 0.994 -1.609524 1.621094
-------------+----------------------------------------------------------------
sigma_u | .42025734
sigma_e | .25797259
rho | .72631933 (fraction of variance due to u_i)
------------------------------------------------------------------------------
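
The `rho` line in the output above is the share of the total error variance attributed to the country effect, sigma_u^2 / (sigma_u^2 + sigma_e^2). As a quick check, the sketch below recomputes it from the reported values and then from the results that `xtreg` stores; the second line must be run immediately after the `xtreg, re` command.

```stata
* rho = sigma_u^2 / (sigma_u^2 + sigma_e^2), using the values reported above
display .42025734^2 / (.42025734^2 + .25797259^2)

* The same quantity from stored results, run right after xtreg, re
display e(sigma_u)^2 / (e(sigma_u)^2 + e(sigma_e)^2)
```

Both lines should reproduce the reported rho of about 0.726.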

• Breusch & Pagan LM Test (Command: xttest0)


.xttest0
Breusch and Pagan Lagrangian multiplier test for random effects

lfd[code,t] = Xb + u[code] + e[code,t]

Estimated results:

| Var sd = sqrt(Var)
---------+-----------------------------
lfd | .440091 .6633935
e | .0665499 .2579726
u | .1766162 .4202573
Test: Var(u) = 0
chi2(1) = 469.12
Prob > chi2 = 0.0000
Hypotheses:
H0: $\sigma_u^2 = 0$ (pooled OLS model)
HA: $\sigma_u^2 > 0$ (random effects), i.e. heterogeneity

▪ The p-value < 0.05 – Reject H0. The random effect model is more appropriate
than OLS (pooled OLS model). In other words there are country-specific
effects (heterogeneity) in the data.

▪ The second test that is commonly used in applied panel data analysis seeks
to determine which is more appropriate: Random or fixed effects?

H0: $Cov(u_i, x_{it}) = 0$ (no correlation between $u_i$ and $x_{it}$: random effects)
HA: $Cov(u_i, x_{it}) \neq 0$ (correlation between $u_i$ and $x_{it}$: fixed effects)

▪ Rejection of the null (H0) favours the fixed effect model.

▪ Back to our example…

• Random versus Fixed Effects Model: Hausman Test

Fixed effects option (fe), the within-groups estimator:

.xtreg lfd lrgdpc lins lfdi, fe

Fixed-effects (within) regression Number of obs = 170


Group variable: code Number of groups = 10

R-sq: within = 0.1897 Obs per group: min = 17


between = 0.5536 avg = 17.0
overall = 0.4804 max = 17

F(3,157) = 12.25
corr(u_i, Xb) = -0.8095 Prob > F = 0.0000

[Note: corr(u_i, Xb) = -0.8095 shows that the errors u_i are correlated with the regressors in the fixed effects model.]

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1503694 3.86 0.000 .2834632 .8774794
lins | .5043667 .222531 2.27 0.025 .0648258 .9439076
lfdi | -.0198562 .0476584 -0.42 0.678 -.1139906 .0742782
_cons | -2.239735 1.326785 -1.69 0.093 -4.860386 .3809163
-------------+----------------------------------------------------------------
sigma_u | .7311766
sigma_e | .25797259
rho | .88929927 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9, 157) = 35.40 Prob > F = 0.0000

Notes on the output above:

▪ rho = 0.8893: 88.93% of the variance is due to differences across panels. 'rho' is
known as the intraclass correlation.

▪ F test that all u_i = 0: H0: common intercept; HA: different intercepts. The p-value
< 0.05, so we reject H0: the countries have different intercepts (this justifies FE).

.est store fixed

.xtreg lfd lrgdpc lins lfdi, re


Random-effects GLS regression Number of obs = 170
Group variable: code Number of groups = 10

R-sq: within = 0.1743 Obs per group: min = 17


between = 0.5904 avg = 17.0
overall = 0.5148 max = 17

Random effects u_i ~ Gaussian Wald chi2(3) = 42.85


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

------------------------------------------------------------------------------
lfd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .3001345 .0925208 3.24 0.001 .1187972 .4814719
lins | .3380844 .2032213 1.66 0.096 -.0602219 .7363908
lfdi | .0511819 .0380584 1.34 0.179 -.0234111 .1257749
_cons | .0057854 .8241524 0.01 0.994 -1.609524 1.621094
-------------+----------------------------------------------------------------
sigma_u | .42025734
sigma_e | .25797259
rho | .72631933 (fraction of variance due to u_i)
------------------------------------------------------------------------------

.hausman fixed
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fixed . Difference S.E.
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .3001345 .2803368 .1185364
lins | .5043667 .3380844 .1662823 .0906707
lfdi | -.0198562 .0511819 -.0710381 .028686
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from xtreg
B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 17.46
Prob>chi2 = 0.0006

▪ The p-value (0.0006) < 0.05, so we reject H0 and use the fixed effect model.

Steps to implement the Hausman Test

. xtreg Y X1 X2, fe
. est store fixed
. xtreg Y X1 X2, re
. hausman fixed

Note: Follow the above sequence
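
A variant of the sequence above stores both sets of estimates and names them explicitly, so the result does not depend on which model was run last. This is a sketch using the handout's variables; the stored names `fixed` and `random` are arbitrary labels.

```stata
xtreg lfd lrgdpc lins lfdi, fe
estimates store fixed

xtreg lfd lrgdpc lins lfdi, re
estimates store random

* Consistent estimator (FE) first, efficient estimator (RE) second
hausman fixed random

* Optional: base both covariance matrices on the same variance estimate,
* which can help when the default statistic turns out negative
* hausman fixed random, sigmamore
```

The order of the two names matters: the estimator that is consistent under both hypotheses goes first.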

Diagnostic Checks

a. Multicollinearity

• Detect by using Variance inflation factor (vif)

• Rule of thumb: a VIF above 10 (for an individual variable or for the mean) signals a
multicollinearity problem

Command: run the pooled OLS regression with regress, then vif

.regress lfd lrgdpc lins lfdi


Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 3, 166) = 74.68
Model | 42.7222872 3 14.2407624 Prob > F = 0.0000
Residual | 31.653087 166 .190681247 R-squared = 0.5744
-------------+------------------------------ Adj R-squared = 0.5667
Total | 74.3753742 169 .440090972 Root MSE = .43667

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | -.0280911 .0665201 -0.42 0.673 -.1594256 .1032435
lins | .9882674 .2226742 4.44 0.000 .5486288 1.427906
lfdi | .1823524 .0336401 5.42 0.000 .1159348 .24877
_cons | -1.27406 .5580983 -2.28 0.024 -2.375945 -.1721739
------------------------------------------------------------------------------

.vif
Variable | VIF 1/VIF
-------------+----------------------
lrgdpc | 8.37 0.119413
lins | 7.02 0.142451
lfdi | 1.60 0.624341
-------------+----------------------
Mean VIF | 5.67

Conclusion: No multicollinearity problem, since every VIF (and the mean VIF of 5.67) is
below 10. (One of the advantages of using panel data is that it reduces multicollinearity.)
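
To see where a VIF comes from, it can be recomputed by hand: regress one explanatory variable on the others and take 1 / (1 - R^2). The sketch below does this for `lrgdpc` and should reproduce the value that `vif` reports for that variable.

```stata
* Auxiliary regression of one regressor on the other regressors
regress lrgdpc lins lfdi

* VIF = 1 / (1 - R^2) of the auxiliary regression
display 1 / (1 - e(r2))
```

The 1/VIF column in the `vif` output is simply 1 - R^2 from this auxiliary regression, so a small 1/VIF means the variable is largely explained by the other regressors.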

b. Heteroskedasticity

• Detect by using the modified Wald statistic for groupwise heteroskedasticity in the
residuals of a fixed effect regression model (Greene, 2000, p. 598)

Command: xtreg Y X1 X2, fe (or xtgls), followed by xttest3

• If the xttest3 command is not available, install it first.

Command to install any user-written command: ssc install commandname (or findit commandname)

.ssc install xttest3
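
In a do-file it is convenient to install a user-written command only when it is missing. The following sketch uses `which`, which returns a nonzero code when Stata cannot find the command.

```stata
* Install xttest3 from SSC only if Stata cannot already find it
capture which xttest3
if _rc {
    ssc install xttest3
}
```

The same pattern works for any other command hosted on SSC.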

Since our final model is FE, run the fixed effects regression first:

.xtreg lfd lrgdpc lins lfdi, fe
Fixed-effects (within) regression Number of obs = 170
Group variable: code Number of groups = 10

R-sq: within = 0.1897 Obs per group: min = 17


between = 0.5536 avg = 17.0
overall = 0.4804 max = 17

F(3,157) = 12.25
corr(u_i, Xb) = -0.8095 Prob > F = 0.0000

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1503694 3.86 0.000 .2834632 .8774794
lins | .5043667 .222531 2.27 0.025 .0648258 .9439076
lfdi | -.0198562 .0476584 -0.42 0.678 -.1139906 .0742782
_cons | -2.239735 1.326785 -1.69 0.093 -4.860386 .3809163
-------------+----------------------------------------------------------------
sigma_u | .7311766
sigma_e | .25797259
rho | .88929927 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9, 157) = 35.40 Prob > F = 0.0000

.xttest3

Modified Wald test for groupwise heteroskedasticity


in fixed effect regression model

H0: sigma(i)^2 = sigma^2 for all i

chi2 (10) = 641.79


Prob>chi2 = 0.0000

H0: No heteroskedasticity (variances are constant)
HA: Heteroskedasticity (variances are not constant)

Conclusion: The p-value is < 0.05, reject the H0. This means that the variances are
not constant (there is a heteroskedasticity problem)

c. Serial Correlation [Autocorrelation]

• Detect by using the Wooldridge test for autocorrelation in panel data


Command: xtserial Y X1 X2

• If xtserial is not available, install it first.

Command to install any user-written command: ssc install commandname (or findit commandname)

ssc install xtserial
ssc install: "xtserial" not found at SSC, type -findit xtserial-

Since not found, try the second command: findit


findit xtserial

xtserial from http://www.stata.com/users/ddrukker


xtserial tests for serial correlation in linear panel-data models /
xtserial implements a test for serial correlation in the idiosyncratic /
errors of a linear panel-data model discussed by Wooldridge (2002). /
Drukker (2003) presents simulation evidence that this test has good size /

Double click the above http link then select (click here to install)

No prior estimation (such as running the FE model) is required:

.xtserial lfd lrgdpc lins lfdi

Wooldridge test for autocorrelation in panel data


H0: no first-order autocorrelation
F( 1, 9) = 266.285
Prob > F = 0.0000

H0: No serial correlation (no autocorrelation)


HA: Serial correlation

Conclusion: The p-value is < 0.05, reject the H0. This means that there is a serial
correlation problem.

The above diagnostic checks indicate heteroskedasticity and serial correlation problems.

How to rectify them?
Refer to Table 1 (robust standard error commands).

Source: Hoechle, Daniel. (2014) “Robust Standard Errors for Panel Regressions with
Cross-Sectional Dependence”, Stata Journal, page 4.
http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf

In our example, the diagnostic checks indicate there are two problems:
(i) heteroskedasticity, and
(ii) serial correlation problems

To rectify: use OLS with standard errors that are robust to heteroskedasticity and
serial correlation

Example: reg Y X1 X2 (plus the fixed-effect dummies if the model is fixed effects), cluster (code)

. regress lfd lrgdpc lins lfdi ndum1 ndum2 ndum3 ndum4 ndum5 ndum6
ndum7 ndum8 ndum9, cluster (code)
Linear regression Number of obs = 170
F( 2, 9) = .
Prob > F = .
R-squared = 0.8595
Root MSE = .25797

(Std. Err. adjusted for 10 clusters in code)


------------------------------------------------------------------------------
| Robust
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .5149604 1.13 0.289 -.5844501 1.745393
lins | .5043667 .4317009 1.17 0.273 -.4722086 1.480942
lfdi | -.0198562 .0857304 -0.23 0.822 -.2137919 .1740796
ndum1 | .2054068 .3983817 0.52 0.619 -.6957952 1.106609
ndum2 | -1.482869 1.830192 -0.81 0.439 -5.623051 2.657314
ndum3 | -.956752 .3559261 -2.69 0.025 -1.761913 -.1515913
ndum4 | -1.497996 2.093435 -0.72 0.492 -6.233674 3.237682
ndum5 | -1.536674 1.70846 -0.90 0.392 -5.40148 2.328132
ndum6 | -.7245539 1.132319 -0.64 0.538 -3.286037 1.836929
ndum7 | -.9213616 .3907418 -2.36 0.043 -1.805281 -.0374421
ndum8 | -1.993394 1.916071 -1.04 0.325 -6.327848 2.341059
ndum9 | -.2299998 .7426447 -0.31 0.764 -1.909979 1.449979
_cons | -1.325915 3.288071 -0.40 0.696 -8.764049 6.112218
------------------------------------------------------------------------------

or

. xtreg lfd lrgdpc lins lfdi, fe cluster (code)


Fixed-effects (within) regression Number of obs = 170
Group variable: code Number of groups = 10

R-sq: within = 0.1897 Obs per group: min = 17


between = 0.5536 avg = 17.0
overall = 0.4804 max = 17

F(3,9) = 1.10
corr(u_i, Xb) = -0.8095 Prob > F = 0.3987

(Std. Err. adjusted for 10 clusters in code)


------------------------------------------------------------------------------
| Robust
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .5008061 1.16 0.276 -.5524308 1.713373
lins | .5043667 .4198351 1.20 0.260 -.4453663 1.4541
lfdi | -.0198562 .083374 -0.24 0.817 -.2084614 .168749
_cons | -2.239735 4.18272 -0.54 0.605 -11.7017 7.222235
-------------+----------------------------------------------------------------
sigma_u | .7311766
sigma_e | .25797259
rho | .88929927 (fraction of variance due to u_i)
------------------------------------------------------------------------------
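
In Stata 11 and later, the two cluster-robust regressions above can be written without creating the dummies by hand, using factor-variable notation and the `vce()` option. This is a sketch of that alternative syntax, not the handout's own commands.

```stata
* LSDV form: i.code adds the country dummies automatically
regress lfd lrgdpc lins lfdi i.code, vce(cluster code)

* Within form: same slope coefficients as the LSDV regression
xtreg lfd lrgdpc lins lfdi, fe vce(cluster code)
```

As in the two outputs above, the slope coefficients agree while the robust standard errors differ slightly, because the two commands apply different degrees-of-freedom adjustments.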

However, if the diagnostic checks indicate only a heteroskedasticity problem:

Rectify: use robust standard error estimation

xtreg lfd lrgdpc lins lfdi, fe robust

Fixed-effects (within) regression Number of obs = 170


Group variable: code Number of groups = 10
R-sq: within = 0.1897 Obs per group: min = 17
between = 0.5536 avg = 17.0
overall = 0.4804 max = 17
F(3,157) = 6.86
corr(u_i, Xb) = -0.8095 Prob > F = 0.0002
(Std. Err. adjusted for clustering on code)
------------------------------------------------------------------------------
| Robust
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1652519 3.51 0.001 .2540675 .9068751
lins | .5043667 .297928 1.69 0.092 -.0840974 1.092831
lfdi | -.0198562 .0395808 -0.50 0.617 -.0980357 .0583234
_cons | -2.239735 1.831484 -1.22 0.223 -5.857262 1.377792
-------------+----------------------------------------------------------------
sigma_u | .7311766
sigma_e | .25797259
rho | .88929927 (fraction of variance due to u_i)
------------------------------------------------------------------------------

If the diagnostic checks indicate only a serial correlation problem:

Rectify: use one of the following.

(i) Fixed effects regression with AR(1) disturbances, xtregar

xtregar lfd lrgdpc lins lfdi, fe


FE (within) regression with AR(1) disturbances Number of obs = 160
Group variable: code Number of groups = 10
R-sq: within = 0.0300 Obs per group: min = 16
between = 0.4914 avg = 16.0
overall = 0.4226 max = 16
F(3,147) = 1.52
corr(u_i, Xb) = 0.4469 Prob > F = 0.2127
------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .1983066 .2174869 0.91 0.363 -.2314982 .6281113
lins | -.3173478 .1674276 -1.90 0.060 -.6482237 .0135282
lfdi | .0336502 .0431847 0.78 0.437 -.0516929 .1189933
_cons | 3.716865 .1815224 20.48 0.000 3.358135 4.075596
-------------+----------------------------------------------------------------
rho_ar | .90262172
sigma_u | .51752702
sigma_e | .1076283
rho_fov | .95854294 (fraction of variance because of u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9,147) = 5.30 Prob > F = 0.0000

(ii) Panel-corrected Standard Errors, xtpcse

Example: xtpcse Y X1 X2, correlation(ar1)

xtpcse lfd lrgdpc lins lfdi, correlation (ar1)


(note: estimates of rho outside [-1,1] bounded to be in the range [-1,1])

Prais-Winsten regression, correlated panels corrected standard errors (PCSEs)

Group variable: code Number of obs = 170


Time variable: year Number of groups = 10
Panels: correlated (balanced) Obs per group: min = 17
Autocorrelation: common AR(1) avg = 17
max = 17
Estimated covariances = 55 R-squared = 0.8708
Estimated autocorrelations = 1 Wald chi2(3) = 59.01
Estimated coefficients = 4 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
| Panel-corrected
lfd | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .3366847 .0698404 4.82 0.000 .1998 .4735694
lins | -.2289813 .2201351 -1.04 0.298 -.6604383 .2024756
lfdi | .047209 .0461653 1.02 0.306 -.0432733 .1376913
_cons | 2.030366 .7517767 2.70 0.007 .5569104 3.503821
-------------+----------------------------------------------------------------
rho | .9264655
------------------------------------------------------------------------------

Summary of the above results:

Results of Panel Data Analysis
Dependent Variable: ln FD

                      Pooled OLS   Random Effect   Fixed Effect   OLS with Hetero &
                                                                  Serial Correlation
Constant              -1.27        0.01            -2.24
                      (-2.28)      (0.01)          (-1.69)
ln RGDPC              -0.03        0.30            0.58
                      (-0.42)      (3.24)***       (3.86)***
ln INS                0.99         0.34            0.50
                      (4.44)***    (1.66)          (2.27)**
ln FDI                0.18         0.05            -0.02
                      (5.42)***    (1.34)          (-0.42)
Breusch-Pagan LM                   469.12          _
test                               (0.0000)***
Hausman test                       _               17.46
                                                   (0.0006)***
Observations          170          170             170

Multicollinearity     _            _               5.67           _
(mean vif)
Heteroskedasticity    _            _               641.79         _
(χ² statistic)                                     (0.0000)***
Serial Correlation    _            _               266.28         _
(F-stat)                                           (0.0000)***

1. Figures in parentheses are t-statistics, except for the Breusch-Pagan LM, Hausman,
heteroskedasticity and serial correlation tests, where they are p-values.
2. ** and *** indicate significance at the 5% and 1% levels, respectively.
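
A table like the one above can be assembled from stored estimates rather than typed by hand. The sketch below is one way to do it; the stored names (`m_pooled`, `m_re`, `m_fe`, `m_fecl`) are hypothetical labels, and the last model is the cluster-robust FE regression shown earlier.

```stata
quietly regress lfd lrgdpc lins lfdi
estimates store m_pooled

quietly xtreg lfd lrgdpc lins lfdi, re
estimates store m_re

quietly xtreg lfd lrgdpc lins lfdi, fe
estimates store m_fe

quietly xtreg lfd lrgdpc lins lfdi, fe vce(cluster code)
estimates store m_fecl

* Coefficients with t-statistics (z for RE) and the number of observations
estimates table m_pooled m_re m_fe m_fecl, b(%9.2f) t(%9.2f) stats(N)
```

The diagnostic statistics (Breusch-Pagan, Hausman, VIF, and so on) are not part of these stored results and still have to be added to the table separately.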

d. Cook’s Distance Outlier Test

Running an OLS regression

regress lfd lrgdpc lins lfdi


Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 3, 166) = 74.68
Model | 42.7222872 3 14.2407624 Prob > F = 0.0000
Residual | 31.653087 166 .190681247 R-squared = 0.5744
-------------+------------------------------ Adj R-squared = 0.5667
Total | 74.3753742 169 .440090972 Root MSE = .43667
------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | -.0280911 .0665201 -0.42 0.673 -.1594256 .1032435
lins | .9882674 .2226742 4.44 0.000 .5486288 1.427906
lfdi | .1823524 .0336401 5.42 0.000 .1159348 .24877
_cons | -1.27406 .5580983 -2.28 0.024 -2.375945 -.1721739
------------------------------------------------------------------------------

Predict Cook's distance for each observation, flag the observations above the cutoff
point, then list the outliers

. predict d1, cooksd

. quietly generate cutoff = d1 > 4/170 & e(sample)

. list country d1 if cutoff


     +----------------------+
     |   country         d1 |
     |----------------------|
 37. | Indonesia   .0432518 |
 38. | Indonesia   .0238333 |
 52. |     Japan   .0235541 |
 53. |     Japan   .0329401 |
 54. |     Japan   .0389948 |
     |----------------------|
138. |  Thailand   .0327251 |
154. |   Vietnam   .0366815 |
155. |   Vietnam   .0288114 |
156. |   Vietnam    .025521 |
     +----------------------+

Notes: In the cutoff 4/170, N = 170 is the number of observations; the 4 is fixed (do
not change it). Nine observations are flagged as outliers. Dropping them turns the data
into an unbalanced panel.

Data Editor excerpt (the rows around observations 37 and 38):

d1         cutoff
.0002801   0
.0012586   0
.0005517   0
.010484    0
.0432518   1
.0238333   1
.0214515   0
.0198299   0

The first two observations with cutoff = 1 are the Indonesia outliers (obs 37, 38).
You can check this in the Data Editor.
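
The cutoff can also be computed from the regression itself, so that it stays correct if the sample size changes. In this sketch the variable names `cooks` and `outlier` are hypothetical and chosen so that they do not clash with `d1` and `cutoff` created above.

```stata
regress lfd lrgdpc lins lfdi
predict cooks, cooksd

* Threshold 4/N, with N taken from the regression rather than typed by hand
display 4 / e(N)

* Flag observations above the threshold (missing distances are not flagged)
generate byte outlier = cooks > 4 / e(N) & !missing(cooks)
list country cooks if outlier
```

With N = 170 the threshold is 4/170, about 0.0235, which is why the Japan observation with d1 = .0235541 is only just flagged.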

Regress the model without outliers

. xtreg lfd lrgdpc lins lfdi if cutoff~=1, fe


Fixed-effects (within) regression Number of obs = 161
Group variable: code Number of groups = 10
R-sq: within = 0.2049 Obs per group: min = 14
between = 0.5206 avg = 16.1
overall = 0.4566 max = 17
F(3,148) = 12.71
corr(u_i, Xb) = -0.7235 Prob > F = 0.0000
------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .3981734 .1364567 2.92 0.004 .1285182 .6678286
lins | .7647175 .2018694 3.79 0.000 .3657989 1.163636
lfdi | .0002088 .0445651 0.00 0.996 -.0878572 .0882749
_cons | -1.954388 1.222277 -1.60 0.112 -4.369757 .4609806
-------------+----------------------------------------------------------------
sigma_u | .62516033
sigma_e | .21447702
rho | .89469397 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9, 148) = 47.64 Prob > F = 0.0000

The condition if cutoff~=1 excludes the flagged observations; ~= means "not equal to".

Compare this result with the one obtained without dropping the outliers. What can you
conclude?

Another Fixed Effect Estimation:

Least Square Dummy Variable (LSDV)


Another way to estimate fixed effects: a common intercept and n-1 binary
regressors (using dummies and regress)

i) The LSDV result can be obtained by generating country dummies (manual command).
Here "code" is the variable that identifies the cross-section unit (use the
corresponding variable name in your dataset).

.tabulate code, generate (ndum)
Code | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 10.00 10.00
2 | 17 10.00 20.00
3 | 17 10.00 30.00
4 | 17 10.00 40.00
5 | 17 10.00 50.00
6 | 17 10.00 60.00
7 | 17 10.00 70.00
8 | 17 10.00 80.00
9 | 17 10.00 90.00
10 | 17 10.00 100.00
------------+-----------------------------------
Total | 170 100.00

and then regress with these country dummies:

.regress lfd lrgdpc lins lfdi ndum2 ndum3 ndum4 ndum5 ndum6 ndum7
ndum8 ndum9 ndum10
Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 12, 157) = 80.05
Model | 63.9270465 12 5.32725388 Prob > F = 0.0000
Residual | 10.4483277 157 .066549858 R-squared = 0.8595
-------------+------------------------------ Adj R-squared = 0.8488
Total | 74.3753742 169 .440090972 Root MSE = .25797

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1503694 3.86 0.000 .2834632 .8774794
lins | .5043667 .222531 2.27 0.025 .0648258 .9439076
lfdi | -.0198562 .0476584 -0.42 0.678 -.1139906 .0742782
ndum2 | -1.688276 .457577 -3.69 0.000 -2.592077 -.7844744
ndum3 | -1.162159 .1163148 -9.99 0.000 -1.391903 -.932415
ndum4 | -1.703402 .5676667 -3.00 0.003 -2.824652 -.5821534
ndum5 | -1.742081 .4564517 -3.82 0.000 -2.643659 -.8405022
ndum6 | -.9299607 .2986868 -3.11 0.002 -1.519924 -.3399978
ndum7 | -1.126768 .1444713 -7.80 0.000 -1.412127 -.8414101
ndum8 | -2.198801 .5020856 -4.38 0.000 -3.190515 -1.207087
ndum9 | -.4354066 .1918365 -2.27 0.025 -.81432 -.0564931
ndum10 | -.2054068 .1297699 -1.58 0.115 -.4617269 .0509134
_cons | -1.120509 1.123154 -1.00 0.320 -3.33895 1.097933
------------------------------------------------------------------------------
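
The same LSDV estimates can be obtained without generating the dummies manually. The sketch below shows two standard alternatives, assuming Stata 11 or later for the `i.code` notation.

```stata
* LSDV without creating dummies by hand (the first country is the base)
regress lfd lrgdpc lins lfdi i.code

* Same slopes, with the country dummies absorbed rather than reported
areg lfd lrgdpc lins lfdi, absorb(code)
```

`areg` is convenient when the number of cross-section units is large and the individual dummy coefficients are not of interest.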

Individual and Time-Specific Effects
Summary: The Fixed Effects Model (Least Squares Dummy Variable Model)

Let us say N = 10 firms and T = 17 years.

1) Firm FE

Although there are no significant temporal/time effects, there are significant differences
among firms in this type of model. While the intercept is cross-section (group) specific and in
this case differs from firm to firm, it may or may not differ over time.

$Y_{it} = \beta_0 + \alpha_1 \text{Firm}_1 + \alpha_2 \text{Firm}_2 + \dots + \alpha_9 \text{Firm}_9 + \beta_1 X_{1it} + \beta_2 X_{2it} + \varepsilon_{it}$

2) Year/Time FE

In this case, the model would have no significant firm differences, but its residuals
might be autocorrelated owing to time-lagged temporal effects. The variables are
homogeneous across the firms, which could be similar in region or area of focus. For
example, technological changes or national policies would lead to group-specific
characteristics that may affect temporal changes in the variables being analyzed. We
could account for the time effect over the T years with T-1 dummy variables on the
right-hand side of the equation.

$Y_{it} = \beta_0 + \gamma_1 \text{Year}_1 + \gamma_2 \text{Year}_2 + \dots + \gamma_{16} \text{Year}_{16} + \beta_1 X_{1it} + \beta_2 X_{2it} + \varepsilon_{it}$

3) Firm FE + Year FE (two way fixed effect model)

There is another fixed effects panel model where the slope coefficients are constant, but the
intercept varies over firm as well as time. We would have a regression model with N-1 firm
dummies and T-1 time dummies. The model could be specified as follows:

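$Y_{it} = \beta_0 + \alpha_1 \text{Firm}_1 + \alpha_2 \text{Firm}_2 + \dots + \alpha_9 \text{Firm}_9 + \gamma_1 \text{Year}_1 + \gamma_2 \text{Year}_2 + \dots + \gamma_{16} \text{Year}_{16} + \beta_1 X_{1it} + \beta_2 X_{2it} + \varepsilon_{it}$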
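
Applied to the country panel used earlier in this handout, a two-way fixed effects model can be sketched as follows. It assumes the data have been declared with `code` as the panel variable and `year` as the time variable, and it needs Stata 11 or later for the `i.year` notation.

```stata
* Country fixed effects via the within estimator, year effects via dummies
xtreg lfd lrgdpc lins lfdi i.year, fe

* Joint test that all year dummies are zero
testparm i.year
```

A small p-value from `testparm` suggests that the time effects belong in the model; otherwise the one-way (country) fixed effects specification may be enough.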

Testing for Individual Effects / Specific Effects for LSDV
Now suppose we would like to know whether the differences in the country effects are
statistically significant (or whether these countries can share a common intercept).
How do we do that?

Option 1: Perform the joint test by restricting the country dummies using F-statistics
(after running the LSDV above)

.regress lfd lrgdpc lins lfdi ndum2 ndum3 ndum4 ndum5 ndum6 ndum7
ndum8 ndum9 ndum10
Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 12, 157) = 80.05
Model | 63.9270465 12 5.32725388 Prob > F = 0.0000
Residual | 10.4483277 157 .066549858 R-squared = 0.8595
-------------+------------------------------ Adj R-squared = 0.8488
Total | 74.3753742 169 .440090972 Root MSE = .25797
------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1503694 3.86 0.000 .2834632 .8774794
lins | .5043667 .222531 2.27 0.025 .0648258 .9439076
lfdi | -.0198562 .0476584 -0.42 0.678 -.1139906 .0742782
ndum2 | -1.688276 .457577 -3.69 0.000 -2.592077 -.7844744
ndum3 | -1.162159 .1163148 -9.99 0.000 -1.391903 -.932415
ndum4 | -1.703402 .5676667 -3.00 0.003 -2.824652 -.5821534
ndum5 | -1.742081 .4564517 -3.82 0.000 -2.643659 -.8405022
ndum6 | -.9299607 .2986868 -3.11 0.002 -1.519924 -.3399978
ndum7 | -1.126768 .1444713 -7.80 0.000 -1.412127 -.8414101
ndum8 | -2.198801 .5020856 -4.38 0.000 -3.190515 -1.207087
ndum9 | -.4354066 .1918365 -2.27 0.025 -.81432 -.0564931
ndum10 | -.2054068 .1297699 -1.58 0.115 -.4617269 .0509134
_cons | -1.120509 1.123154 -1.00 0.320 -3.33895 1.097933
------------------------------------------------------------------------------
.test ndum2 ndum3 ndum4 ndum5 ndum6 ndum7 ndum8 ndum9 ndum10
( 1) ndum2 = 0
( 2) ndum3 = 0
( 3) ndum4 = 0
( 4) ndum5 = 0
( 5) ndum6 = 0
( 6) ndum7 = 0
( 7) ndum8 = 0
( 8) ndum9 = 0
( 9) ndum10 = 0

F( 9, 157) = 35.40
Prob > F = 0.0000

This is the same as the FE (within-group) F-statistic on page 6.


H0: Common intercept for all countries
HA: At least one country has a different intercept

Therefore, we reject the null hypothesis of a common intercept for all countries (since
the p-value < 0.05). Thus, the country intercepts are not all equal (there is a fixed
effect).
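For reference, the joint test above is the standard F-test comparing a restricted model (common intercept, i.e. pooled OLS) with the unrestricted LSDV model. The formula below is a general statement of that test; the restricted residual sum of squares comes from the pooled OLS regression, which is not shown in this section, so it is left symbolic.

```latex
F = \frac{(RSS_R - RSS_U)/q}{RSS_U/(n-k)} \sim F(q,\; n-k)
\qquad\text{with}\qquad
q = 9,\quad n - k = 170 - 13 = 157,\quad RSS_U = 10.4483277
```

Here q is the number of country dummies being restricted to zero, and k = 13 counts the three slopes, nine dummies and the constant in the LSDV regression.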

Note: If there are too many country dummies to type out, use the “testparm” command
instead; it gives the same result.

.regress lfd lrgdpc lins lfdi ndum2-ndum10
Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 12, 157) = 80.05
Model | 63.9270465 12 5.32725388 Prob > F = 0.0000
Residual | 10.4483277 157 .066549858 R-squared = 0.8595
-------------+------------------------------ Adj R-squared = 0.8488
Total | 74.3753742 169 .440090972 Root MSE = .25797
------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | .5804713 .1503694 3.86 0.000 .2834632 .8774794
lins | .5043667 .222531 2.27 0.025 .0648258 .9439076
lfdi | -.0198562 .0476584 -0.42 0.678 -.1139906 .0742782
ndum2 | -1.688276 .457577 -3.69 0.000 -2.592077 -.7844744
ndum3 | -1.162159 .1163148 -9.99 0.000 -1.391903 -.932415
ndum4 | -1.703402 .5676667 -3.00 0.003 -2.824652 -.5821534
ndum5 | -1.742081 .4564517 -3.82 0.000 -2.643659 -.8405022
ndum6 | -.9299607 .2986868 -3.11 0.002 -1.519924 -.3399978
ndum7 | -1.126768 .1444713 -7.80 0.000 -1.412127 -.8414101
ndum8 | -2.198801 .5020856 -4.38 0.000 -3.190515 -1.207087
ndum9 | -.4354066 .1918365 -2.27 0.025 -.81432 -.0564931
ndum10 | -.2054068 .1297699 -1.58 0.115 -.4617269 .0509134
_cons | -1.120509 1.123154 -1.00 0.320 -3.33895 1.097933
------------------------------------------------------------------------------

The * is a wildcard; in this case it stands for the dummy numbers.

.testparm ndum*
( 1) ndum2 = 0
( 2) ndum3 = 0
( 3) ndum4 = 0
( 4) ndum5 = 0
( 5) ndum6 = 0
( 6) ndum7 = 0
( 7) ndum8 = 0
( 8) ndum9 = 0
( 9) ndum10 = 0

F( 9, 157) = 35.40
Prob > F = 0.0000

Note: The F-statistic above is the same as the one reported by the xtreg Y X1 X2, fe command on page 6.
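A shorter route to the same test, sketched under the assumption that the handout's panel is in memory and that the country identifier code is numeric, uses factor variables so that the ndum dummies do not have to be created by hand.

```stata
* Assumes the panel used in this handout is loaded, with numeric
* identifiers code (country) and year.
xtset code year

* LSDV with automatically generated country dummies
regress lfd lrgdpc lins lfdi i.code

* Joint test that all country dummies are zero
testparm i.code

* The within (FE) estimator reports the same test in its last line,
* labelled "F test that all u_i=0"
xtreg lfd lrgdpc lins lfdi, fe
```

If code is stored as a string, it would first have to be converted to a numeric variable (for example with encode) before xtset or i.code can be used.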

Testing for time-effect
Back to our example: since the final model is FE, we would now like to test whether
time fixed effects are needed when running the FE model.
This is a joint test of whether the dummies for all years are equal to 0. If they are,
no time fixed effects are needed.

Time fixed effects can be tested by creating time dummy variables.

Generate the time dummy variables with the command below. Here “year” is the name of
the variable that represents the time unit in the dataset, and tdum is the stem used to
name the new time dummy variables (any name you like).

.tabulate year, generate (tdum)

Year | Freq. Percent Cum.
------------+-----------------------------------
1996 | 10 5.88 5.88
1997 | 10 5.88 11.76
1998 | 10 5.88 17.65
1999 | 10 5.88 23.53
2000 | 10 5.88 29.41
2001 | 10 5.88 35.29
2002 | 10 5.88 41.18
2003 | 10 5.88 47.06
2004 | 10 5.88 52.94
2005 | 10 5.88 58.82
2006 | 10 5.88 64.71
2007 | 10 5.88 70.59
2008 | 10 5.88 76.47
2009 | 10 5.88 82.35
2010 | 10 5.88 88.24
2011 | 10 5.88 94.12
2012 | 10 5.88 100.00
------------+-----------------------------------
Total | 170 100.00

and then regress on these time dummies:

.regress lfd lrgdpc lins lfdi tdum2-tdum17


Source | SS df MS Number of obs = 170
-------------+------------------------------ F( 19, 150) = 11.23
Model | 43.6655247 19 2.29818551 Prob > F = 0.0000
Residual | 30.7098495 150 .20473233 R-squared = 0.5871
-------------+------------------------------ Adj R-squared = 0.5348
Total | 74.3753742 169 .440090972 Root MSE = .45247

------------------------------------------------------------------------------
lfd | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lrgdpc | -.0191519 .0698408 -0.27 0.784 -.1571507 .118847
lins | .917666 .2383268 3.85 0.000 .4467547 1.388577
lfdi | .2102009 .039418 5.33 0.000 .1323148 .2880871
tdum2 | .119034 .2025609 0.59 0.558 -.2812072 .5192751
tdum3 | .102125 .2028266 0.50 0.615 -.2986411 .5028912
tdum4 | -.0540308 .2031446 -0.27 0.791 -.4554253 .3473637
tdum5 | -.094798 .2032939 -0.47 0.642 -.4964875 .3068915
tdum6 | -.0256841 .2035174 -0.13 0.900 -.4278153 .3764471
tdum7 | -.0059049 .2042874 -0.03 0.977 -.4095575 .3977477
tdum8 | -.0190396 .2047366 -0.09 0.926 -.4235797 .3855005

tdum9 | -.0556393 .2052772 -0.27 0.787 -.4612477 .349969
tdum10 | -.1307652 .2055285 -0.64 0.526 -.53687 .2753396
tdum11 | -.1551338 .2075592 -0.75 0.456 -.565251 .2549835
tdum12 | -.1754475 .2089743 -0.84 0.402 -.588361 .237466
tdum13 | -.151699 .2097628 -0.72 0.471 -.5661704 .2627724
tdum14 | -.1072258 .2109714 -0.51 0.612 -.5240852 .3096337
tdum15 | -.1103227 .2134789 -0.52 0.606 -.5321369 .3114914
tdum16 | -.1038899 .2138792 -0.49 0.628 -.526495 .3187151
tdum17 | -.1108308 .2144058 -0.52 0.606 -.5344763 .3128147
_cons | -1.31946 .6046789 -2.18 0.031 -2.514248 -.1246719
------------------------------------------------------------------------------

.testparm tdum*
( 1) tdum2 = 0
( 2) tdum3 = 0
( 3) tdum4 = 0
( 4) tdum5 = 0
( 5) tdum6 = 0
( 6) tdum7 = 0
( 7) tdum8 = 0
( 8) tdum9 = 0
( 9) tdum10 = 0
(10) tdum11 = 0
(11) tdum12 = 0
(12) tdum13 = 0
(13) tdum14 = 0
(14) tdum15 = 0
(15) tdum16 = 0
(16) tdum17 = 0

F( 16, 150) = 0.29
Prob > F = 0.9969

H0: No time effects
HA: There are time effects

The result is similar to that of the “Short-cut Command” above. Again, we fail to reject
the null hypothesis. Therefore, time effects are not needed.
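Note that the regression above includes the time dummies but not the country effects. A sketch of the same joint test carried out inside the FE model, assuming the panel is in memory with numeric identifiers code and year, is:

```stata
* Assumes the handout's panel is loaded; code and year are the panel
* and time identifiers.
xtset code year

* FE (within) regression with year dummies added via factor variables
xtreg lfd lrgdpc lins lfdi i.year, fe

* H0: the coefficients on all year dummies are jointly zero
testparm i.year
```

The degrees of freedom of this F-test differ from those shown above because the country effects are now also absorbed; the decision rule is unchanged (reject H0 if Prob > F is below 0.05).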

Further investigation of the time dummy variables: drop the insignificant dummy
variables one at a time (the one with the highest p-value first) and see whether any
specific time/year dummy is significant. If one is significant, it should remain in the
model specification because it indicates a time/year effect.

Other STATA Commands
• To create descriptive statistics of one or more variables, type
summarize var1 var2

• To generate a frequency table of a variable, type
table var1

• To generate a histogram of one variable, type
hist var1

• To see if variables are correlated, type
correlate var1 var2

• To see the values of the variables in the first 10 observations, type
list var1 var2 in 1/10

• To generate a scatter plot, type
scatter var1 var2

• To generate a matrix of scatter plots, type
graph matrix var1 var2

• To apply a label to a variable, type (use straight quotes, not curly quotes)
label variable var1 "real GDP"

• When there are typos in the data or you want to recode the values of a variable, type
recode var1 (460304.1=460304)

• To generate a new variable, say the log of salary, type
generate lnsalary = ln(salary)

• To generate a lagged variable, say the lag of log salary, type (the data must first be
declared as time-series or panel data with tsset or xtset)
generate lnsalary1 = l.lnsalary

• To generate a first-differenced variable, say the first difference of log salary, type
generate fdlnsalary = lnsalary - lnsalary1

• To create a full set of dummy variables from an index variable, type
tabulate index_variable, generate (dummy_variable)

Example:

Time dummy:
tabulate year, generate (tdum)

Country dummy:
tabulate code, generate (ndum)

Firm dummy:
tabulate firm, generate (fdum)

• outreg2: transfers the output to MS Word or Excel as a formatted table (saves time)
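The lag, first-difference and outreg2 commands can be sketched together as follows. This assumes the handout's panel (lfd, lrgdpc, lins, lfdi, code, year) is in memory; the new variable names llfd and dlfd and the file name results.doc are made up for illustration. outreg2 is a user-written command, so it has to be installed once before use.

```stata
* Declare the panel structure; lags and differences are then taken
* within each country rather than across the stacked rows.
xtset code year

* Lag and first difference using time-series operators
generate llfd = L.lfd
generate dlfd = D.lfd

* outreg2 is user-written; install it once with:
*   ssc install outreg2
regress lfd lrgdpc lins lfdi
outreg2 using results.doc, replace ctitle(Pooled OLS)

xtreg lfd lrgdpc lins lfdi, fe
outreg2 using results.doc, append ctitle(Fixed Effects)
```

Using D.lfd gives the same values as subtracting the lag by hand, but it also guards against differencing across two different countries.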

Lab Exercise / Hands-On Session I(a)

Dataset: determinants of fd.xls


▫ N: 63 countries over T: 6 years (1999 – 2004)
▫ Cross-sectional dimension is larger than time dimension or N > T

Estimate the following model:


$$\ln PRI_{it} = \alpha + \beta_1 \ln GDPC_{it} + \beta_2 \ln TO_{it} + \beta_3 \ln FO_{it} + \varepsilon_{it}$$

where PRI = private sector credit (% of GDP), used as the measure of financial development (FD)
GDPC = real GDP per capita (US dollar)
TO = trade openness (% of GDP)
FO = financial openness (% of GDP)

a. Estimate the above model using pooled OLS. Interpret your results. When would this
estimation strategy be justified? Check for multicollinearity using variance inflation factor
(vif). What do you conclude?
b. Estimate the same model using the Random Effects (RE) estimator. Are Random Effects
justified?
c. Next, estimate the same model using the Fixed Effects (FE) estimator.
d. Conduct a Hausman test to compare the Fixed Effects (FE) and Random Effects (RE)
models. What do you conclude?
e. Test for heteroskedasticity in the Fixed Effects (FE) model. What do you conclude?
f. Test for serial correlation using the xtserial command. What do you conclude?
g. Perform an estimation that can rectify the problem(s) found in (e) and (f), and present the
result in the last column.
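One possible command sequence for steps (a) to (g) is sketched below. The variable names lpri, lgdpc, lto and lfo are hypothetical names for the logged variables, and code and year are assumed to be the numeric country and time identifiers; adjust them to the names in your dataset. xttest3 and xtserial are user-written commands that must be installed first, and clustered standard errors are only one common way of addressing (e) and (f).

```stata
* Hypothetical variable names: lpri lgdpc lto lfo (logs), code, year.
xtset code year

* (a) Pooled OLS and variance inflation factors
regress lpri lgdpc lto lfo
estat vif

* (b) Random effects and the Breusch-Pagan LM test
xtreg lpri lgdpc lto lfo, re
estimates store re
xttest0

* (c) Fixed effects
xtreg lpri lgdpc lto lfo, fe
estimates store fe

* (e) Groupwise heteroskedasticity in the FE model
*     (user-written: ssc install xttest3); run directly after xtreg, fe
xttest3

* (d) Hausman test, FE versus RE
hausman fe re

* (f) Serial correlation (user-written: findit xtserial)
xtserial lpri lgdpc lto lfo

* (g) FE with standard errors clustered by country
xtreg lpri lgdpc lto lfo, fe vce(cluster code)
```

The order matters in one place: xttest3 uses the results of the most recent xtreg, fe, so it is run before any other estimation command.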

Results of Panel Data Analysis

Dependent Variable: ln PRI

                         Pooled OLS    Random Effects    Fixed Effects
Constant
ln GDPC
ln TO
ln FO
Breusch-Pagan LM test    _
Hausman test             _
Observations
Multicollinearity
Heteroskedasticity
Serial Correlation

Notes: Figures in parentheses are t-statistics, except for the Breusch-Pagan LM test, Hausman test,
Heteroskedasticity and Serial Correlation tests, which are p-values. ** and *** indicate significance
at the 5% and 1% levels, respectively.

Lab Exercise / Hands-On Session I(b)

Greene (1997) provides a small panel data set with information on costs and output of 6
different firms, in 4 different periods of time (1955, 1960, 1965, and 1970). Your job is to
estimate a cost function using basic panel data techniques.

Data File: Firm.xls

The data are shown below in stacked form, i.e., the first T lines (here T = 4) refer to
firm 1, the next T lines refer to firm 2, and so on. The columns are self-explanatory. To
facilitate your work, I have included a firm-specific dummy variable for each firm, in
columns D1-D6.

Year Firm Cost Output D1 D2 D3 D4 D5 D6


2000 1 3.154 214 1 0 0 0 0 0
2001 1 4.271 419 1 0 0 0 0 0
2002 1 4.584 588 1 0 0 0 0 0
2003 1 5.849 1025 1 0 0 0 0 0
2000 2 3.859 696 0 1 0 0 0 0
2001 2 5.535 811 0 1 0 0 0 0
2002 2 8.127 1640 0 1 0 0 0 0
2003 2 10.966 2506 0 1 0 0 0 0
(...) (...) (...) (...) (...) (...) (...) (...) (...) (...)
2000 6 73.050 11796 0 0 0 0 0 1
2001 6 98.846 15551 0 0 0 0 0 1
2002 6 138.880 27218 0 0 0 0 0 1
2003 6 191.560 30958 0 0 0 0 0 1

Empirical Model

$$\ln Cost_{it} = \alpha_i + \beta \ln Output_{it} + u_{it}$$

Estimate the above model and answer the following:


a. Pooled OLS
b. Fixed Effects
c. Random Effects
d. Which should be used: Fixed Effects or Random Effects?
e. LSDV. Can we impose a common intercept for all firms?
f. Test whether time effects are needed in the model.
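As a starting point for part (e), the sketch below assumes Firm.xls has been loaded with the variable names exactly as in the table (Year, Firm, Cost, Output, D1-D6); lcost and loutput are new names chosen here for the logged variables. Because the regression includes a constant, only five of the six supplied dummies are entered, with firm 1 as the base.

```stata
* Assumes variables Year, Firm, Cost, Output and D1-D6 are in memory.
generate lcost = ln(Cost)
generate loutput = ln(Output)
xtset Firm Year

* LSDV using the supplied dummies; D1 is left out as the base firm
* to avoid perfect collinearity with the constant.
regress lcost loutput D2-D6

* H0: common intercept for all firms (all five dummies equal zero)
testparm D2-D6
```

Stata variable names are case-sensitive, so Cost and cost would be different variables; check the names after importing the data.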

Lab Exercise / Hands-On Session I(c)
Capital Asset Pricing Model (CAPM)

The Fama and MacBeth (1973) test of the CAPM involves a 2-step estimation procedure.
First, the betas are estimated in separate time series regressions for each firm. Second,
for each separate point in time, a cross-sectional regression of the excess returns on the
betas is conducted:

$$R_{it} - R_{ft} = \lambda_0 + \lambda_m \beta_{Pi} + u_i \qquad (1)$$

where the dependent variable, $R_{it} - R_{ft}$, is the excess return of stock i at time t and the
independent variable is the estimated beta for the portfolio (P) that the stock has been
allocated to. The betas of the firms themselves are not used on the RHS, but rather, the
betas of portfolios formed on the basis of firm size. If the CAPM holds, then $\lambda_0$ should not be
significantly different from zero and $\lambda_m$ should approximate the (time average) equity market
risk premium, $R_m - R_f$.

Fama and MacBeth proposed estimating this second stage (cross-sectional) regression
separately for each time period, and then taking the average of the parameter estimates to
conduct hypothesis tests. However, one could also achieve a similar objective using a panel
approach.
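For comparison with the panel approach, the period-by-period procedure described above could be sketched in Stata as follows. The variable names exret (excess return), beta and year are assumptions about how the Excel data are named once loaded; they are not taken from the file.

```stata
* Hypothetical variable names: exret, beta, year.
* Run one cross-sectional regression per year and keep the coefficients.
* The clear option replaces the data in memory with one row per year.
statsby _b, by(year) clear: regress exret beta

* Time averages of the yearly estimates
summarize _b_cons _b_beta

* t-tests on the averages across years
ttest _b_cons == 0
ttest _b_beta == 0
```

Because clear overwrites the dataset, save or reload the original data before moving on to the panel estimations in parts (a) to (e).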

The attached Excel file (Panel CAPM) contains data on return and beta, which consists of
2500 UK firms for 11 years.

a. Using Stata, estimate the above model using pooled OLS. Interpret your results.
When would this estimation strategy be justified?

b. Next, estimate the same model using the Random Effects estimator. Are Random
Effects justified?

c. Re-estimate Model 1 using the Fixed Effects estimator.

d. Compare the Fixed Effects and Random Effects estimates. Conduct a Hausman test
to compare the two models. What do you conclude?

e. Conduct Fixed Effects estimation with corrected standard errors.
