Using R for Introductory Econometrics
2nd edition
Florian Heiss
© Florian Heiss 2020. All rights reserved.
Address: Universitätsstraße 1, Geb. 24.31.01.24, 40225 Düsseldorf, Germany
Contents

Preface

1. Introduction
   1.1. Getting Started
        1.1.1. Software
        1.1.2. R Scripts
        1.1.3. Packages
        1.1.4. File names and the Working Directory
        1.1.5. Errors and Warnings
        1.1.6. Other Resources
   1.2. Objects in R
        1.2.1. Basic Calculations and Objects
        1.2.2. Vectors
        1.2.3. Special Types of Vectors
        1.2.4. Naming and Indexing Vectors
        1.2.5. Matrices
        1.2.6. Lists
   1.3. Data Frames and Data Files
        1.3.1. Data Frames
        1.3.2. Subsets of Data
        1.3.3. R Data Files
        1.3.4. Basic Information on a Data Set
        1.3.5. Import and Export of Text Files
        1.3.6. Import and Export of Other Data Formats
        1.3.7. Data Sets in the Examples
   1.4. Base Graphics
        1.4.1. Basic Graphs
        1.4.2. Customizing Graphs with Options
        1.4.3. Overlaying Several Plots
        1.4.4. Legends
        1.4.5. Exporting to a File
   1.5. Data Manipulation and Visualization: The Tidyverse
        1.5.1. Data visualization: ggplot Basics
        1.5.2. Colors and Shapes in ggplot Graphs
        1.5.3. Fine Tuning of ggplot Graphs
        1.5.4. Basic Data Manipulation with dplyr
        1.5.5. Pipes
        1.5.6. More Advanced Data Manipulation
   1.6. Descriptive Statistics
        1.6.1. Discrete Distributions: Frequencies and Contingency Tables
        1.6.2. Continuous Distributions: Histogram and Density
        1.6.3. Empirical Cumulative Distribution Function (ECDF)
        1.6.4. Fundamental Statistics
   1.7. Probability Distributions
        1.7.1. Discrete Distributions
        1.7.2. Continuous Distributions
        1.7.3. Cumulative Distribution Function (CDF)
        1.7.4. Random Draws from Probability Distributions
   1.8. Confidence Intervals and Statistical Inference
        1.8.1. Confidence Intervals
        1.8.2. t Tests
        1.8.3. p Values
        1.8.4. Automatic calculations
   1.9. More Advanced R
        1.9.1. Conditional Execution
        1.9.2. Loops
        1.9.3. Functions
        1.9.4. Outlook
   1.10. Monte Carlo Simulation
        1.10.1. Finite Sample Properties of Estimators
        1.10.2. Asymptotic Properties of Estimators
        1.10.3. Simulation of Confidence Intervals and t Tests
I. Regression Analysis with Cross-Sectional Data

2. The Simple Regression Model
   2.1. Simple OLS Regression
   2.2. Coefficients, Fitted Values, and Residuals
   2.3. Goodness of Fit
   2.4. Nonlinearities
   2.5. Regression through the Origin and Regression on a Constant
   2.6. Expected Values, Variances, and Standard Errors
   2.7. Monte Carlo Simulations
        2.7.1. One sample
        2.7.2. Many Samples
        2.7.3. Violation of SLR.4
        2.7.4. Violation of SLR.5

5. Multiple Regression Analysis: OLS Asymptotics
   5.1. Simulation Exercises
        5.1.1. Normally Distributed Error Terms
        5.1.2. Non-Normal Error Terms
        5.1.3. (Not) Conditioning on the Regressors
   5.2. LM Test

6. Multiple Regression Analysis: Further Issues
   6.1. Model Formulae
        6.1.1. Data Scaling: Arithmetic Operations Within a Formula
        6.1.2. Standardization: Beta Coefficients
        6.1.3. Logarithms
        6.1.4. Quadratics and Polynomials
        6.1.5. ANOVA Tables
        6.1.6. Interaction Terms
   6.2. Prediction
        6.2.1. Confidence Intervals for Predictions
        6.2.2. Prediction Intervals
        6.2.3. Effect Plots for Nonlinear Specifications

9. More on Specification and Data Issues
   9.1. Functional Form Misspecification
   9.2. Measurement Error
   9.3. Missing Data and Nonrandom Samples
   9.4. Outlying Observations
   9.5. Least Absolute Deviations (LAD) Estimation

II. Regression Analysis with Time Series Data

10. Basic Regression Analysis with Time Series Data
    10.1. Static Time Series Models
    10.2. Time Series Data Types in R
         10.2.1. Equispaced Time Series in R
         10.2.2. Irregular Time Series in R
    10.3. Other Time Series Models
         10.3.1. The dynlm Package
         10.3.2. Finite Distributed Lag Models
         10.3.3. Trends
         10.3.4. Seasonality

11. Further Issues In Using OLS with Time Series Data
    11.1. Asymptotics with Time Series
    11.2. The Nature of Highly Persistent Time Series
    11.3. Differences of Highly Persistent Time Series
    11.4. Regression with First Differences

12. Serial Correlation and Heteroscedasticity in Time Series Regressions
    12.1. Testing for Serial Correlation of the Error Term
    12.2. FGLS Estimation
    12.3. Serial Correlation-Robust Inference with OLS
    12.4. Autoregressive Conditional Heteroscedasticity

III. Advanced Topics

13. Pooling Cross-Sections Across Time: Simple Panel Data Methods
    13.1. Pooled Cross-Sections
    13.2. Difference-in-Differences
    13.3. Organizing Panel Data
    13.4. Panel-specific computations
    13.5. First Differenced Estimator

14. Advanced Panel Data Methods
    14.1. Fixed Effects Estimation
    14.2. Random Effects Models
    14.3. Dummy Variable Regression and Correlated Random Effects
    14.4. Robust (Clustered) Standard Errors

15. Instrumental Variables Estimation and Two Stage Least Squares
    15.1. Instrumental Variables in Simple Regression Models
    15.2. More Exogenous Regressors
    15.3. Two Stage Least Squares
    15.4. Testing for Exogeneity of the Regressors
    15.5. Testing Overidentifying Restrictions
    15.6. Instrumental Variables with Panel Data

16. Simultaneous Equations Models
    16.1. Setup and Notation
    16.2. Estimation by 2SLS
    16.3. Joint Estimation of System
    16.4. Outlook: Estimation by 3SLS

17. Limited Dependent Variable Models and Sample Selection Corrections
    17.1. Binary Responses
         17.1.1. Linear Probability Models
         17.1.2. Logit and Probit Models: Estimation
         17.1.3. Inference
         17.1.4. Predictions
         17.1.5. Partial Effects
    17.2. Count Data: The Poisson Regression Model
    17.3. Corner Solution Responses: The Tobit Model
    17.4. Censored and Truncated Regression Models
    17.5. Sample Selection Corrections

18. Advanced Time Series Topics
    18.1. Infinite Distributed Lag Models
    18.2. Testing for Unit Roots
    18.3. Spurious Regression
    18.4. Cointegration and Error Correction Models
    18.5. Forecasting

19. Carrying Out an Empirical Project
    19.1. Working with R Scripts
    19.2. Logging Output in Text Files
    19.3. Formatted Documents and Reports with R Markdown
         19.3.1. Basics
         19.3.2. Advanced Features
         19.3.3. Bottom Line
    19.4. Combining R with LaTeX
         19.4.1. Automatic Document Generation using Sweave and knitr
         19.4.2. Separating R and LaTeX code

Bibliography

Index
List of Tables
1.1. R functions for important arithmetic calculations
1.2. R functions specifically for vectors
1.3. Logical Operators
1.4. R functions for descriptive statistics
1.5. R functions for statistical distributions
The version of the textbook sold only in Europe, the Middle East, and Africa (Wooldridge, 2014) is mostly consistent, but lacks, among other things, the appendices on fundamental math, probability, and statistics.
All computer code used in this book can be downloaded to make it easier to replicate the results
and tinker with the specifications. The companion website also provides the full text of this book
for online viewing and additional material. It is located at
http://www.URfIE.net
1.1.1. Software
R is free and open source software. Its homepage is http://www.r-project.org/. There, a
wealth of information is available as well as the software itself. Most of the readers of this book will
not want to compile the software themselves, so downloading the pre-compiled binary distributions
is recommended. They are available for Windows, Mac, and Linux systems. Alternatively, Microsoft
R Open (MRO) is a 100% compatible open source R distribution which is optimized for computa-
tional speed.1 It is available at https://mran.microsoft.com/open/ for all relevant operating
systems.
After downloading, installing, and running R or MRO, the program window will look similar to
the screen shot in Figure 1.1. It provides some basic information on R and the installed version.
To the right of the > sign is the prompt where the user can type commands for R to evaluate.
We can type whatever we want here. After pressing the return key, the line is terminated, and R tries to make sense of what was typed and gives an appropriate answer. In the example shown in
Figure 1.1, this was done four times. The text we typed is shown next to the “>” sign, and R’s answer appears below the respective line next to the “[1]”.
Our first attempt did not work out well: we got an error message. Unfortunately, R does not
comprehend the language of Shakespeare. We will have to adjust and learn to speak R’s less poetic
language. The other experiments were more successful: We gave R simple computational tasks and
got the result (next to a “[1]”). The syntax should be easy to understand – apparently, R can do
simple addition, deals with the parentheses in the expected way, can calculate square roots (using
the term sqrt) and knows the number π.
1 In case you were wondering: MRO uses multi-threaded BLAS/LAPACK libraries and is therefore especially powerful for computations which involve large matrices.
R is used by typing commands such as these. Not only Apple users may be less than impressed
by the design of the user interface and the way the software is used. There are various approaches
to make it more user friendly by providing a different user interface added on top of plain R.
Notable examples include R commander, Deducer, RKWard, and RStudio. In the following, we will
use the latter which can be downloaded free of charge for the most common operating systems at
http://www.rstudio.com/.
A screen shot of the user interface is shown in Figure 1.2. There are several sub-windows. The
big one on the left named “Console” looks very similar and behaves exactly the same as the plain
R window. In addition, there are other windows and tabs some of which are obvious (like “Help”).
The usefulness of others will become clear soon. We will show some RStudio-specific tips and tricks
below, but all the calculations can be done with any user interface and plain R as well.
Here are a few quick tricks for working in the Console of RStudio:
• When starting to type a command, press the tabulator key (Tab) to see a list of suggested commands along with a short description. Typing sq followed by Tab gives a list of all R commands starting with sq.
• The F1 function key opens the full help page for the current command in the help window
(bottom right by default).2 The same can be achieved by typing ?command.
• With the ↑ and ↓ arrow keys, we can scroll through the previously entered commands to
repeat or correct them.
• With Ctrl on Windows or Command on a Mac pressed, ↑ will give you a list of all previous
commands. This list is also available in the “History” window (top right by default).
1.1.2. R Scripts
As already seen, we will have to get used to interacting with our software using written commands.
While this may seem odd to readers who do not have any experience with similar software at this
point, it is actually very common for econometrics software and there are good reasons for this. An
2 On some computers, the function keys are set to change the display brightness, volume, and the like by default. This can
be changed in the system settings.
important advantage is that we can easily collect all the commands we need for a project in a text file called an R script.
An R script contains all commands including those for reading the raw data, data manipulation,
estimation, post-estimation analyses, and the creation of graphs and tables. In a complex project,
these tasks can be divided into separate R scripts. The point is that the script(s) together with the
raw data generate the output used in the term paper, thesis, or research paper. We can then ask R to
evaluate all or some of the commands listed in the R script at once.
This is important since a key feature of the scientific method is reproducibility. Our thesis adviser
as well as the referee in an academic peer review process or another researcher who wishes to build
on our analyses must be able to fully understand where the results come from. This is easy if we can
simply present our R script which has all the answers.
Working with R scripts is not only best practice from a scientific perspective, but also very con-
venient once we get used to it. In a nontrivial data analysis project, it is very hard to remember all
the steps involved. If we manipulate the data for example by directly changing the numbers in a
spreadsheet, we will never be able to keep track of everything we did. Each time we make a mistake
(which is impossible to avoid), we can simply correct the command and let R start from scratch by
a simple mouse click if we are using scripts. And if there is a change in the raw data set, we can
simply rerun everything and get the updated tables and figures instantly.
Using R scripts is straightforward: We just write our commands into a text file and save it with
a “.R” extension. When using a user interface like RStudio, working with scripts is especially
convenient since it is equipped with a specialized editor for script files. To open the editor for
creating a new R script, use the menu File→New→R Script, or click on the symbol in the top
left corner, or press the buttons Ctrl + Shift ⇑ + N on Windows and Command + Shift ⇑ + N
simultaneously.
The window that opens in the top left part is the script editor. We can type arbitrary text, begin a
new line with the return key, and navigate using the mouse or the ↑ ↓ ← → arrow keys. Our
goal is not to type arbitrary text but sensible R commands. In the editor, we can also use tricks like
code completion that work in the Console window as described above. A new command is generally started in a new line, but a semicolon “;” can also be used if we want to cram more than one command into one line – which is often not a good idea in terms of readability.
Comments are an extremely useful tool for making R scripts more readable. These are lines beginning with a #. They are not evaluated by R but can (and should) be used to structure the script and
explain the steps. In the editor, comments are by default displayed in green to further increase the
readability of the script. R Scripts can be saved and opened using the File menu.
Given an R script, we can send lines of code to R to be evaluated. To run the line in which the cursor is, click on the Run button on top of the editor or simply press Ctrl + Return on Windows and Command + Return on a Mac. If we highlight multiple lines (with the mouse or by holding Shift ⇑
while navigating), all are evaluated. The whole script can be highlighted by pressing Ctrl + A on
Windows or Command + A on a Mac.
Figure 1.3 shows a screenshot of RStudio with an R script saved as “First-R-Script.R”. It consists
of six lines in total including three comments. It has been executed as can be seen in the Console
window: The lines in the script are repeated next to the > symbols and the answer of R (if there is
any) follows as though we had typed the commands directly into the Console.
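The contents of First-R-Script.R are not reproduced in this excerpt; a sketch of what such a six-line script (three comments, three commands) might contain is the following – treat it as an illustration, not the book's actual file:
# First example of an R script
# Define an object x:
x <- 5
# Display its square and the square root of 16:
x^2
sqrt(16)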
In what follows, we will do everything using R scripts. All these scripts are available for download to make it easy and convenient to reproduce all contents in real time when reading this book. As an example, here is a simple calculation together with R's answer:
> 5*(4-1)^2
[1] 45
We will discuss some additional hints for efficiently working with R scripts in Section 19.
1.1.3. Packages
The functionality of R can be extended relatively easily by advanced users. This is not only useful
for those who are able and willing to do this, but also for a novice user who can easily make use of
a wealth of extensions generated by a big and active community. Since these extensions are mostly
programmed in R, everybody can check and improve the code submitted by a user, so the quality
control works very well.
These extensions are called packages. The standard distribution of R already comes with a number
of packages. In RStudio, the list of currently installed packages can be seen in the “Packages”
window (bottom right by default). A click on the package name opens the corresponding help file
which describes what functionality it provides and how it works. This package index can also be
activated with help(package="package name").
On top of the packages that come with the standard installation, there are countless packages
available for download. If they meet certain quality criteria, they can be published on the official
“Comprehensive R Archive Network” (CRAN) servers at http://cran.r-project.org. Down-
loading and installing these packages is especially simple: In the Packages window of RStudio, click
on “Install Packages”, enter the name of the package and click on “Install”. If you prefer to do
it using code, here is how it works: install.packages("package name"). In both cases, the
package is added to our package list and is ready to be used.
In order to use a package in an R session, we have to activate it. This can be done by clicking on the check box next to the package name.3 Instead of having to click on a number of check boxes
3 The reason why not all installed packages are loaded automatically is that R saves valuable start-up time and system
resources and might be able to avoid conflicts between some packages.
in the “Packages” window before we can run an R script (and having to know which ones), it is
much more elegant to instead automatically activate the required packages by lines of code within
the script. This is done with the command library(package name).4 After activating a package,
nothing obvious happens immediately, but R understands more commands.
If we just want to use some function from a package once, it might not be worthwhile to load
the whole package. Instead, we can just write package::function(...). For example, most
common data sets can be imported using the function import from the package rio, see Section
1.3.6. Here is how to use it:
• We can either load the package and call the function:
library(rio)
import(filename)
• Or we call the function without loading the whole package:
rio::import(filename)
Packages can also contain data sets. The datasets package contains a number of example data
sets, see help(package="datasets"). It is included in standard R installations and loaded by de-
fault at startup. In this book, we heavily use the wooldridge package which makes all example data
sets conveniently available; see help(package="wooldridge") for a list. We can simply load a
data set, for example the one named affairs, with data(affairs, package="wooldridge").
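As a quick illustration (head is standard R; the exact output is not reproduced here):
# Load the affairs data from the wooldridge package:
data(affairs, package="wooldridge")
# Peek at the first six rows:
head(affairs)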
There are thousands of packages provided on CRAN. Here is a list of those we will use through-
out this book:
• AER (“Applied Econometrics with R”): Provided with the book with the same name by Kleiber
and Zeileis (2008). Provides some new commands, e.g. for instrumental variables estimation
and many interesting data sets.
• car (“Companion to Applied Regression”): A comprehensive package that comes with the
book of Fox and Weisberg (2011). Provides many new commands and data sets.
• censReg: Censored regression/tobit models.
• dummies: Automatically generating dummy/indicator variables.
• dynlm: Dynamic linear regression for time series.
• effects: Graphical and tabular illustration of partial effects, see Fox (2003).
• ggplot2: Advanced and powerful graphics, see Wickham (2009) and Chang (2012).
• knitr: Combine R and LaTeX code in one document, see Xie (2015).
• lmtest (“Testing Linear Regression Models”): Includes many useful tests for the linear regres-
sion model.
• maps: Draw geographical maps.
• mfx: Marginal effects, odds ratios and incidence rate ratios for GLMs.
• orcutt: Cochrane-Orcutt estimator for serially correlated errors.
• plm (“Linear Models for Panel Data”): A large collection of panel data methods, see Croissant
and Millo (2008).
• quantmod: Quantitative Financial Modelling, see http://www.quantmod.com.
• quantreg: Quantile regression, especially least absolute deviation (LAD) regression, see
Koenker (2012).
• rio (“A Swiss-Army Knife for Data I/O”): Conveniently import and export data files.
• rmarkdown: Convert R Markdown documents into HTML, MS Word, and PDF.
• sampleSelection: Sample selection models, see Toomet and Henningsen (2008).
4 The command require does almost the same as library.
1.1.4. File names and the Working Directory
There are several possibilities for R to interact with files. The most important ones are to load, save,
import, or export a data file. We might also want to save a generated figure as a graphics file or store
regression tables as text, spreadsheet, or LaTeX files.
Whenever we provide R with a file name, it can include the full path on the computer. Note that
the path separator has to be the forward slash / instead of the backslash \ which is common on MS
Windows computers. So the full (i.e. “absolute”) path to a script file might be something like
C:/Users/MyUserName/Documents/MyRProject/MyScript.R
on a Windows system or
~/MyRProject/MyScript.R
on a Mac or Linux system.
Hint: R installations on Windows machines also recognize a path like ~/MyRProject/MyScript.R. Here, ~ refers to the “Documents” folder of the current user.
If we do not provide any path, R will use the current “working directory” for reading or writing
files. It can be obtained by the command getwd(). In RStudio, it is also displayed on top of the Con-
sole window. To change the working directory, use the command setwd(path). Relative paths are interpreted relative to the current working directory. For a neat file organization, best practice is to generate a directory for each project (say MyRProject) with several sub-directories (say Rscripts, data, and figures). At the beginning of our script, we can use setwd("~/MyRProject") and afterwards refer to a data set in the respective sub-directory as data/MyData.RData and to a graphics file as figures/MyFigure.png.
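A minimal sketch of this setup at the top of a script (the directory and file names are just the examples used above, not actual files from the book):
# Use the project directory as working directory:
setwd("~/MyRProject")
getwd()                      # confirm the current working directory
# Refer to files with relative paths inside the project:
load("data/MyData.RData")    # hypothetical data file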
1.2. Objects in R
R can work with numbers, vectors, matrices, texts, data sets, graphs, functions, and many more
objects of different types. This section covers the most important ones we will frequently encounter
in the remainder of this book.
Object names can consist of letters, digits, and some special characters such as “.” and “_”. R is case sensitive, so x and X are different object names.
The content of an object is assigned using <- which is supposed to resemble an arrow and is
simply typed as the two characters “less than” and “minus”.6 In order to assign the value 5 to the
object x, type (the spaces are optional)
x <- 5
A new object with the name x is created and has the value 5. If there was an object with this name
before, its content is overwritten. From now on, we can use x in our calculations. Assigning a
value to an object will not produce any output. The simplest shortcut for immediately displaying
the result is to put the whole expression into parentheses as in (x <- 5). Script 1.3 (Objects.R)
shows simple examples using the three objects x, y, and z.
> x^2
[1] 25
A list of all currently defined object names can be obtained using ls(). In RStudio, it is also
shown in the “Workspace” window (top right by default). The command exists("name") checks
whether an object with the name “name” is defined and returns either TRUE or FALSE, see Section
1.2.3 for this type of “logical” object. Removing a previously defined object (for example x) from the
workspace is done using rm(x). All objects are removed with rm(list = ls()).
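A short sketch of these workspace commands in action (assuming an object x was defined as above):
ls()                  # list all defined objects, e.g. "x"
exists("x")           # TRUE
exists("somename")    # FALSE unless such an object was defined
rm(x)                 # remove the object x
rm(list = ls())       # remove all objects from the workspace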
1.2.2. Vectors
For statistical calculations, we obviously need to work with data sets including many numbers
instead of scalars. The simplest way we can collect many numbers (or other types of informa-
tion) is called a vector in R terminology. To define a vector, we can collect different values using
c(value1,value2,...). All the operators and functions used above can be used for vectors.
Then they are applied to each of the elements separately.7 The examples in Script 1.4 (Vectors.R)
should help to understand the concept and use of vectors.
6 Consistent with other programming languages, the assignment can also be done using x=5. R purists frown on this syntax.
It makes a lot of sense to distinguish the mathematical meaning of an equality sign from the assignment of a value to an
object. Mathematically, the equation x = x + 1 does not make any sense, but the assignment x<-x+1 does – it increases
the previous value of x by 1. We will stick to the standard R syntax using <- throughout this text.
7 Note that the multiplication of two vectors using the * operator also performs element-wise multiplication. For vector and matrix algebra, see Section 1.2.5 on matrices.
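To give a flavor of this element-wise behavior, here is a small sketch with made-up numbers (not taken from Script 1.4):
v1 <- c(1, 2, 3)
v2 <- c(10, 20, 30)
v1 + v2       # element-wise addition: 11 22 33
v1 * v2       # element-wise multiplication: 10 40 90
sqrt(v1)      # the function is applied to each element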
> sqrt(d)
[1] 2.449490 3.872983 5.291503 6.708204 8.124038 9.539392
There are also specific functions to create, manipulate and work with vectors. The most important
ones are shown in Table 1.2. Script 1.5 (Vector-Functions.R) provides examples to see them in
action. We will see in section 1.6 how to obtain descriptive statistics for vectors.
> length(a)
[1] 7
> min(a)
[1] 1
> max(a)
[1] 9
> sum(a)
[1] 32
> prod(a)
[1] 9072
> rep(1,20)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> seq(50)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50
> 5:15
[1] 5 6 7 8 9 10 11 12 13 14 15
> seq(4,20,2)
[1] 4 6 8 10 12 14 16 18 20
> cities
[1] "New York" "Los Angeles" "Chicago"
Another useful type is the logical vector. Each element can only take one of two values: TRUE or
FALSE. The easiest way to generate them is to state claims which are either true or false and let R
decide. Table 1.3 lists the main logical operators.
It should be noted that internally, FALSE is equal to 0 and TRUE is equal to 1 and we can do
calculations accordingly. Script 1.6 (Logical.R) demonstrates the most important features of logical
vectors and should be pretty self-explanatory.
Output of Script 1.6: Logical.R
> # Basic comparisons:
> 0 == 1
[1] FALSE
> 0 < 1
[1] TRUE
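Because TRUE counts as 1 and FALSE as 0, logical vectors can be used in arithmetic. A small sketch with made-up values:
grades <- c(2, 5, 3, 1, 4)
passed <- grades <= 3    # TRUE FALSE TRUE TRUE FALSE
sum(passed)              # number of TRUE values: 3
mean(passed)             # share of TRUE values: 0.6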
Many economic variables of interest have a qualitative rather than quantitative interpretation.
They only take a finite set of values and the outcomes don’t necessarily have a numerical meaning.
Instead, they represent qualitative information. Examples include gender, academic major, grade,
marital status, state, product type or brand. In some of these examples, the order of the outcomes
has a natural interpretation (such as the grades), in others, it does not (such as the state).
As a specific example, suppose we have asked our customers to rate our product on a scale of 1 (=“bad”), 2 (=“okay”), or 3 (=“good”). We have stored the answers of our ten respondents as the numbers 1, 2, and 3 in a vector. We could work directly with these numbers, but often,
it is convenient to use so-called factors. One advantage is that we can attach labels to the outcomes.
Given a vector x with a finite set of values, a new factor xf can be generated using the command
xf <- factor(x, labels=mylabels)
The vector mylabels includes the names of the outcomes; we could, for example, state xf <- factor(x, labels=c("bad","okay","good")). In this example, the outcomes are ordered,
so the labeling is not arbitrary. In cases like this, we should add the option ordered=TRUE. This is
done for a simple example with ten ratings in Script 1.7 (Factors.R).
Output of Script 1.7: Factors.R
> # Original ratings:
> x <- c(3,2,2,3,1,2,3,2,1,2)
> x
[1] 3 2 2 3 1 2 3 2 1 2
> xf
[1] good okay okay good bad okay good okay bad okay
Levels: bad okay good
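Factors also work nicely with functions like table; a quick sketch continuing this example (table is standard R, but this line is not part of Script 1.7):
table(xf)    # counts per level: bad 2, okay 5, good 3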
> avgs
Cobb Hornsby Jackson O’Doul Delahanty
0.366 0.358 0.356 0.349 0.346
> avgs[1:4]
Cobb Hornsby Jackson O’Doul
0.366 0.358 0.356 0.349
1.2.5. Matrices
Matrices are important tools for econometric analyses. Appendix D of Wooldridge (2019) introduces
the basic concepts of matrix algebra.8 R has a powerful matrix algebra system. Most often in applied
econometrics, matrices will be generated from an existing data set. We will come back to this below
and first look at three different ways to define a matrix object from scratch:
• matrix(vec,nrow=m) takes the numbers stored in vector vec and puts them into a matrix
with m rows.
• rbind(r1,r2,...) takes the vectors r1,r2,... (which obviously should have the same
length) as the rows of a matrix.
• cbind(c1,c2,...) takes the vectors c1,c2,... (which obviously should have the same
length) as the columns of a matrix.
Script 1.9 (Matrices.R) first demonstrates how the same matrix can be created using all three
approaches. A close inspection of the output reveals the technical detail that the rows and columns
of matrices can have names. The functions rbind and cbind automatically assign the names of the
vectors as row and column names, respectively. As demonstrated in the output, we can manipulate
the names using the commands rownames and colnames. This has only cosmetic consequences
and does not affect our calculations.
Output of Script 1.9: Matrices.R
> # Generating matrix A from one vector with all values:
> v <- c(2,-4,-1,5,7,0)
8 The stripped-down European and African textbook Wooldridge (2014) does not include the Appendix on matrix algebra.
> A
Alpha Beta Gamma
Aleph 2 -1 7
Bet -4 5 0
> diag( 3 )
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> A[,2]
Aleph Bet
-1 5
> A[,c(1,3)]
Alpha Gamma
Aleph 2 7
Bet -4 0
We can also create special matrices as the examples in the output show:
• diag(vec) (where vec is a vector) creates a diagonal matrix with the elements on the main
diagonal given in vector vec.
• diag(n) (where n is a scalar) creates the n×n identity matrix.
If instead of a vector or scalar, a matrix M is given as an argument to the function diag, it will return
the main diagonal of M.
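A quick sketch of this third use of diag with an arbitrary 3×3 matrix (not part of Script 1.9):
M <- matrix(1:9, nrow=3)   # 3x3 matrix, filled column by column
diag(M)                    # main diagonal: 1 5 9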
Finally, Script 1.9 (Matrices.R) shows how to access a subset of matrix elements. This is straight-
forward with indices that are given in brackets much like indices can be used for vectors as already
discussed. We can give a row and then a column index (or vectors of indices), separated by a comma:
• A[2,3] is the element in row 2, column 3
• A[2,c(1,2)] is a vector consisting of the elements in row 2, columns 1 and 2
• A[2,] is a vector consisting of the elements in row 2, all columns
The basic matrix algebra operations include:
• Matrix addition using the operator + as long as the matrices have the same dimensions.
• The operator * does not do matrix multiplication but rather element-wise multiplication.
• Matrix multiplication is done with the somewhat clumsy operator %*% (yes, it consists of three
characters!) as long as the dimensions of the matrices match.
• Transpose of a matrix X: as t(X)
• Inverse of a matrix X: as solve(X)
The examples in Script 1.10 (Matrix-Operators.R) should help to understand the workings of
these basic operations. In order to see how the OLS estimator for the multiple regression model
can be calculated using matrix algebra, see Section 3.2. Standard R is capable of many more matrix
algebra methods. Even more advanced methods are available in the Matrix package.
> A
[,1] [,2] [,3]
[1,] 2 -1 7
[2,] -4 5 0
> B
[,1] [,2] [,3]
[1,] 2 0 -1
[2,] 1 3 5
> A *B
[,1] [,2] [,3]
[1,] 4 0 -7
[2,] -4 15 0
> # Transpose:
> (C <- t(B) )
[,1] [,2]
[1,] 2 1
[2,] 0 3
[3,] -1 5
> # Inverse:
> solve(D)
[,1] [,2]
[1,] 0.0460251 -0.1422594
[2,] 0.0334728 -0.0125523
1.2.6. Lists
In R, a list is a generic collection of objects. Unlike vectors, the components can have different
types. Each component can (and in the cases relevant for us will) be named. Lists can be generated
with a command like
mylist <- list( name1=component1, name2=component2, ... )
The names of the components are returned by names(mylist). A component can be addressed by
name using mylist$name. These features are demonstrated in Script 1.11 (Lists.R).
We will encounter special classes of lists in the form of analysis results: Commands for statistical
analyses often return a list that contains characters (like the calling command), vectors (like the
parameter estimates), and matrices (like variance-covariance matrices). But we’re getting ahead of
ourselves – we will encounter this for the first time in Section 1.8.4.
Output of Script 1.11: Lists.R
> # Generate a list object:
> mylist <- list( A=seq(8,36,4), this="that", idm = diag(3))
$this
[1] "that"
$idm
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
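Continuing the mylist object from Script 1.11, the components can be listed and accessed by name as described above (a brief sketch; the corresponding output lines are not reproduced in this excerpt):
names(mylist)   # "A"    "this" "idm"
mylist$A        # 8 12 16 20 24 28 32 36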
> sales
product1 product2 product3
2008 0 1 2
2009 3 2 4
2010 6 3 4
2011 9 5 2
2012 7 9 3
2013 8 6 2
The outputs of the matrix sales_mat and the data frame sales look exactly the same, but they
behave differently. In RStudio, the difference can be seen in the Workspace window (top right by
default). It reports the content of sales_mat to be a “6x3 double matrix” whereas the content
of sales is “6 obs. of 3 variables”.
We can address a single variable var of a data frame df using the matrix-like syntax df[,"var"]
or by stating df$var.9 This can be used for extracting the values of a variable but also for creating
new variables. Sometimes, it is convenient not to have to type the name of the data frame several
times within a command. The function with(df, some expression using vars of df) can
help. Yet another (but not recommended) method for conveniently working with data frames is to
attach them before doing several calculations using the variables stored in them. It is important
to detach them later. Script 1.13 (Data-frames-vars.R) demonstrates these features. A very
powerful way to manipulate data frames using the “tidyverse” approach is presented in Sections
1.5.4–1.5.6 below.
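Judging from the output below, Script 1.13 computes the total sales of the three products in three equivalent ways (totalv1 to totalv3). A hedged sketch of the first two approaches (the actual script may differ in its details):
# Using the $ syntax:
sales$totalv1 <- sales$product1 + sales$product2 + sales$product3
# Using with() to avoid repeating the data frame name:
sales$totalv2 <- with(sales, product1 + product2 + product3)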
Output of Script 1.13: Data-frames-vars.R
> # Accessing a single variable:
> sales$product2
[1] 1 2 3 5 9 6
> detach(sales)
> # Result:
> sales
product1 product2 product3 totalv1 totalv2 totalv3
2008 0 1 2 3 3 3
2009 3 2 4 9 9 9
2010 6 3 4 13 13 13
2011 9 5 2 16 16 16
2012 7 9 3 19 19 19
2013 8 6 2 16 16 16
9 Technically, a data frame is just a special class of a list of variables. This is the reason why the $ syntax is the same as for general lists; see Section 1.2.6.
1.3.3. R Data Files
Of course, the file name can also contain an absolute or relative path, see Section 1.1.4. To save all
currently defined objects, use save(list=ls(), file="mydata.RData") instead. All objects
stored in mydata.RData can be loaded into the workspace with
load("mydata.RData")
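For a single data frame, a minimal sketch of saving and reloading might look like this (the file name oursales.RData is hypothetical):
# Save only the data frame sales:
save(sales, file = "oursales.RData")
# Later (or in another script), load it back into the workspace:
load("oursales.RData")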
> sales
product1 product2 product3 totalv1 totalv2 totalv3
2008 0 1 2 3 3 3
2009 3 2 4 9 9 9
2010 6 3 4 13 13 13
2011 9 5 2 16 16 16
2012 7 9 3 19 19 19
2013 8 6 2 16 16 16
For the general rules on the file name, once again consult Section 1.1.4. The optional arguments that
can be added, separated by comma, include but are not limited to:
• header=TRUE: The text file includes the variable names as the first line
• sep=",": Instead of spaces or tabs, the columns are separated by a comma. Instead, an arbi-
trary other character can be given. sep=";" might be another relevant example of a separator.
• dec=",": Instead of a decimal point, a decimal comma is used. For example, some interna-
tional versions of MS Excel produce these sorts of text files.
10 The commands read.csv and read.delim work very similarly but have different defaults for options like header and
sep.
• row.names=number: The values in column number number are used as row names instead
of variables.
RStudio provides a graphical user interface for importing text files which also allows previewing the
effects of changing the options: In the Workspace window, click on “Import Dataset”.
Figure 1.4 shows two flavors of a raw text file containing the same data. The file sales.txt
contains a header with the variable names. It can be imported with
mydata <- read.table("sales.txt", header=TRUE)
In file sales.csv, the columns are separated by a comma. The correct command for the import
would be
mydata <- read.table("sales.csv", sep=",")
Since this data file does not contain any variable names, they are set to their default values V1
through V4 in the resulting data frame mydata. They can be changed manually afterward, e.g. by
colnames(mydata) <- c("year","prod1","prod2","prod3").
Given some data in a data frame mydata, they can be exported to a text file using similar options
as for read.table using
write.table(mydata, file = "myfilename", ...)
Here, "myfilename" is the complete file name including the extension and the path, unless it is
located in the current working directory, see Section 1.1.4.
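Putting these pieces together, a sketch of importing the comma-separated file from above and exporting the result again (the export file name is made up for illustration):
# Import: no variable names in the file, columns separated by commas
mydata <- read.table("sales.csv", sep = ",")
colnames(mydata) <- c("year", "prod1", "prod2", "prod3")
# Export as a text file with a header line and without row names:
write.table(mydata, file = "sales-export.txt", row.names = FALSE)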
> mean(affairs2$naffairs)
[1] 1.455907
> mean(affairs3$naffairs)
[1] 1.455907
> mean(affairs4$naffairs)
[1] 1.455907
1.4. Base Graphics
1.4.1. Basic Graphs
[Figure 1.5: two function plots created with curve(); panel (a) shows x^2, panel (b) shows dnorm(x)]
curve( function(x), xmin, xmax )
where function(x) is the function to be plotted, written in general R syntax involving x, and xmin and xmax are the limits for the x axis. For example, the command curve( x^2, -2, 2 ) generated
Figure 1.5(a) and curve( dnorm(x), -3, 3 ) produced Figure 1.5(b).12
If we have data or other points in two vectors x and y, we can easily generate scatter plots, line
plots or similar two-way graphs. The command plot is a generic plotting command that is capable
of these types of graphs and more. We will see some of the more specialized uses later on. We define
two short vectors and simply call plot with the vectors as arguments:
x <- c(1,3,4,7,8,9)
y <- c(0,3,6,9,7,8)
plot(x,y)
This will generate Figure 1.6(a). The most fundamental option of these plots is the type. It can take
the values "p" (the default), "l", "b", "o", "s", "h", and more. The resulting plots are shown in
Figure 1.6.
12 The function dnorm(x) is the standard normal density, see Section 1.7.
[Figure 1.6: the vectors x and y plotted with different type options, including (c) type="b", (d) type="o", (e) type="s", and (f) type="h"]
1.4.2. Customizing Graphs with Options
Graphs can be customized with a large number of options, for example:
• The plot symbol can be changed using the option pch; integer values (0 through 18 are shown in the original figure) correspond to different symbols.
• The line type can be changed using the option lty. It can take (among other specifications) the values 1 through 6, each corresponding to a different dash pattern.
• The size of the points and texts can be changed using the option cex. It represents a factor
(standard: cex=1).
• The width of the lines can be changed using the option lwd. It represents a factor (standard:
lwd=1).
• The color of the lines and symbols can be changed using the option col=value. It can be
specified in several ways:
– By name: A list of available color names can be obtained by colors() and will include
several hundred color names from the obvious "black", "blue", "green" or "red" to
more exotic ones like "papayawhip".
– By a number corresponding to a list of colors (palette) that can be adjusted.
– Gray scale: gray(level) with level=0 indicating black and level=1 indicating white.
– By RGB values with a string of the form "#RRGGBB" where each of the pairs RR, GG, BB
consist of two hexadecimal digits.13 This is useful for fine-tuning colors.
– Using the function rgb(red, green, blue) where the arguments represent the RBG
values, normalized between 0 and 1 by default. They can also be normalized e.g. to be
between 0 and 255 with the additional option maxColorValue = 255.
– The rgb function can also define transparency with the additional option alpha=value,
where alpha=0 means fully transparent (i.e. invisible) and alpha=1 means fully opaque.
• A main title and a subtitle can be added using main="My Title" and sub="My Subtitle".
• The horizontal and vertical axis can be labeled using xlab="My x axis label" and
ylab="My y axis label".
• The limits of the horizontal and the vertical axis can be chosen using xlim=c(min,max) and
ylim=c(min,max), respectively.
• The axis labels can be set to be parallel to the axis (las=0), horizontal (las=1), perpendicular
to the axis (las=2), or vertical (las=3).
Some additional options should be set before the graph is created using the command
par(option1=value1, option2=value2, ...). For some options, this is the only pos-
sibility. An important example is the margin around the plotting area. It can be set either in inches
using mai=c(bottom, left, top, right) or in lines of usual text using mar=c(bottom,
left, top, right). In both cases, they are simply set to a numerical vector with four elements.
Another example is the possibility to easily put several plots below or next to each other in one
graph using the options mfcol or mfrow.
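A hedged sketch combining several of these options with the x and y vectors from above (all option values are chosen only for illustration):
par(mfrow = c(1, 2))     # two plots next to each other
plot(x, y, type = "b", pch = 16, lty = 2, lwd = 2, col = "blue",
     main = "My Title", xlab = "My x axis label",
     ylab = "My y axis label", xlim = c(0, 10), ylim = c(0, 10))
plot(x, y, type = "h", col = gray(0.5), las = 1)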
13 The RGB color model defines colors as a mix of the components red, green, and blue.
1.4.3. Overlaying Several Plots
[Figure 1.7: overlaying several plots; panel (b) is the result of Script 1.17 and includes a point marked as "outlier"]
There are also useful specialized commands for adding elements to an existing graph each of
which can be tweaked with the same formatting options presented above:
• points(x,y,...) and lines(x,y,...) add point and line plots much like plot with the
add=TRUE option.
• text(x,y,"mytext",...) adds text to coordinates (x,y). The option pos=number posi-
tions the text below, to the left of, above or to the right of the specified coordinates if pos is set
to 1, 2, 3, or 4, respectively.
• abline(a=value,b=value,...) adds a line with intercept a and slope b.
• abline(h=value(s),...) adds one or more horizontal line(s) at position h (which can be
a vector).
• abline(v=value(s),...) adds one or more vertical line(s) at position v (which can be a
vector).
• arrows(x0, y0, x1, y1, ...) adds an arrow from point x0,y0 to point x1,y1.
An example is shown in Script 1.17 (Plot-Overlays.R). It combines different plotting commands
and options to generate Figure 1.7(b).
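Script 1.17 itself is not reproduced here; a sketch of the same idea using the commands above (coordinates and the label are made up) could be:
plot(x, y)                         # basic point plot
abline(a = 0, b = 1, lty = 2)      # line with intercept 0 and slope 1
abline(h = 5, col = "gray")        # horizontal line at y = 5
text(7, 9, "outlier", pos = 4)     # label next to the point (7, 9)
arrows(4, 8, 6.7, 8.9)             # arrow pointing towards it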
14 The function dnorm(x,0,2) is the normal density with mean 0 and standard deviation 2, see Section 1.7.
[Figure 1.8: the sales of the three products plotted against year with matplot, using the digits 1, 2, and 3 as plotting symbols]
A convenient alternative for specifying the plots separately is to use the command matplot. It
expects several y variables as a matrix and x either as a vector or a matrix with the same dimen-
sions. We can use all formatting options discussed above which can be set as vectors. Script 1.18
(Plot-Matplot.R) demonstrates this command. The result is shown in Figure 1.8.
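A sketch of such a matplot call for the sales data from Section 1.3.1 (the exact formatting options of Script 1.18 may differ):
year <- 2008:2013
sales_mat <- cbind(product1 = c(0,3,6,9,7,8),
                   product2 = c(1,2,3,5,9,6),
                   product3 = c(2,4,4,2,3,2))
matplot(year, sales_mat, type = "b", lwd = 2, col = "black",
        xlab = "year", ylab = "sales")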
1.4.4. Legends
If we combine several plots into one, it is often useful to add a legend to a graph. The command is
legend(position,labels,formats,...) where
• position determines the placement. It can be a set of x and y coordinates but usually it is
more convenient to use one of the self-explanatory keywords "bottomright", "bottom",
"bottomleft", "left", "topleft", "top", "topright", "right", or "center".
• labels is a vector of strings that act as labels for the legend. It should be specified like
c("first label","second label",...).
[Figure 1.9: (a) the three density curves with a legend for sigma = 1, 2, 3; (b) the same graph with the legend entries σ = 1, σ = 2, σ = 3 and the label f(x) = 1/(√(2π)σ) · e^(−x²/(2σ²)) typeset as math expressions]
• formats is supposed to reproduce the line and marker styles used in the plot. We can use the
same options listed in Section 1.4.2 like pch and lty.
Script 1.19 (Plot-Legend.R) adds a legend to the plot of the different density functions. The result
can be seen in Figure 1.9(a).
Script 1.19: Plot-Legend.R
curve( dnorm(x,0,1), -10, 10, lwd=1, lty=1)
curve( dnorm(x,0,2),add=TRUE, lwd=2, lty=2)
curve( dnorm(x,0,3),add=TRUE, lwd=3, lty=3)
# Add the legend
legend("topright",c("sigma=1","sigma=2","sigma=3"), lwd=1:3, lty=1:3)
In the legend, but also everywhere within a graph (title, axis labels, texts, ...) we can also use
Greek letters, equations, and similar features in a relatively straightforward way. This is done using
the command expression(specific syntax). A complete list of that syntax can be found in
the help files somewhat hidden under plotmath. Instead of trying to reproduce this list, we just
give an example in Script 1.20 (Plot-Legend2.R). Figure 1.9(b) shows the result.
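A hedged sketch of such an annotation using standard plotmath syntax (the exact code of Script 1.20 may differ):
curve(dnorm(x, 0, 1), -10, 10, lwd = 1, lty = 1)
curve(dnorm(x, 0, 2), add = TRUE, lwd = 2, lty = 2)
curve(dnorm(x, 0, 3), add = TRUE, lwd = 3, lty = 3)
# Legend with Greek letters:
legend("topleft", expression(sigma == 1, sigma == 2, sigma == 3),
       lwd = 1:3, lty = 1:3)
# Add the density formula as a math expression:
text(6, 0.3,
     expression(f(x) == frac(1, sqrt(2 * pi) * sigma) * e^{-frac(x^2, 2 * sigma^2)}))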
15 Previously,the “tidyverse” used to be known as the “hadleyverse” after the most important developer in this area, Hadley
Wickham. In 2016, he humbly suggested to replace this term with “tidyverse”. By the way: Hadley Wickham is also a
great presenter and teacher. Feel encouraged to search for his name on YouTube and other channels.
16 The book can also be read at http://r4ds.had.co.nz/.
1.5.1. Data visualization: ggplot Basics
The “gg” in ggplot2 refers to a “grammar of graphics”. In this philosophy, a graph consists of
one or more geometric objects (or geoms). These could be points, lines, or other objects. They are
added to a graph with a function specific to the type. For example:
• geom_point(): points
• geom_line(): lines
• geom_smooth(): nonparametric regression
• geom_area(): ribbon
• geom_boxplot(): boxplot
There are many other geoms, including special ones for maps and other specific needs. These objects have visual features like the position on the x and y axes, which are given as variables in the data frame.
Also features like the color, shape or size of points can – instead of setting them globally – be linked
to variables in the data set. These connections are called aesthetic mappings and are defined in a
function aes(feature=variable, ...). For example:
• x=...: Variable to map to x axis
• y=...: Variable to map to y axis
• color=...: Variable to map to the color (e.g. of the point)17
• shape=...: Variable to map to the shape (e.g. of the point)
A ggplot2 graph is always initialized with a call of ggplot(). The geoms are added with a +. As a
basic example, we would like to use the data set mpg and map displ on the x axis and hwy on the
y axis. The basic syntax is shown in Script 1.22 (mpg-scatter.R) and the result in Figure 1.10(a).
Script 1.22: mpg-scatter.R
# load package
library(ggplot2)
# Generate the basic scatter plot (this call is assumed from the extended version
# shown below; the exact formatting in Script 1.22 may differ):
ggplot() + geom_point( data=mpg, mapping=aes(x=displ, y=hwy) )
Let us add a second “geom” to the graph. Nonparametric regression is a topic not covered in
Wooldridge (2019) or this book, but it is easy to implement with ggplot2. We will not go into
details here but simply use these tools for visualizing the relationship between two variables. For
more details, see for example Pagan and Ullah (2008).
Figure 1.10(b) shows the same scatter plot as before with a nonparametric regression function
added. It represents something like the average of hwy given displ is close to the respective value
on the axis. The grey ribbon around the line visualizes the uncertainty and is relatively wide for
very high displ values where the data are scarce. For most of the relevant area, there seems to be
a clearly negative relation between displacement and highway mileage.
Figure 1.10(b) can be generated by simply adding the appropriate geom to the scatter plot with
+geom_smooth(...):
ggplot() +
geom_point( data=mpg, mapping=aes(x=displ, y=hwy) ) +
geom_smooth(data=mpg, mapping=aes(x=displ, y=hwy) )
Note that the code for the graph spans three lines which makes it easier to read. We just add the +
to the end of the previous line to explicitly state that we’re not finished yet.
17 Since
Hadley Wickham is from New Zealand, the official name actually is colour, but the American version is accepted
synonymously.
[Figure 1.10: (a) scatter plot of hwy against displ for the mpg data; (b) the same scatter plot with a nonparametric regression function and its confidence ribbon added]
1.5.2. Colors and Shapes in ggplot Graphs
More interestingly, we can use different colors for groups of points defined by a third variable to explore and visualize relationships. For example, we can distinguish the points by the variable class. In ggplot2 terminology, we add a third aesthetic mapping from the variable class to the visual feature color (besides the mappings to the x and y axes).
[Figure 1.11: scatter plots of hwy against displ; in panel (b), the points are colored by the variable class]
define this mapping in the aes function. Script 1.25 (mpg-color2.R) implements this by setting
aes(color=class) as an option to geom_point. R automatically assigns a color to each value of
class. Optionally, we can choose the set of colors by adding (again with +) a scale specification.
We add +scale_color_grey() to request different shades of gray. The result is shown in Figure
1.11(b). Note that the legend is added automatically. There are many other options to choose the
color scale including scale_color_manual() for explicitly choosing the colors. If a numeric
variable is mapped to color, a continuous color scale is used.
Script 1.25: mpg-color2.R
ggplot(mpg, aes(displ, hwy)) +
geom_point( aes(color=class) ) +
geom_smooth(color="black") +
scale_color_grey()
A closer look at Figure 1.11(b) reveals that distinguishing seven values by color is hard, especially
if we restrict ourselves to gray scales. In addition (or as an alternative), we can use different point
shapes. This corresponds to a fourth mapping – in this case to the visual feature shape.
[Figure 1.12: (a) the points distinguished by class using both color and shape; (b) color and shape mapped within ggplot(), so the smoothing is also done separately by class]
We could
also map different variables to color and shape, but this would likely be too much information
squeezed into one graph. So Script 1.26 (mpg-color3.R) maps class to both color and shape.
We choose shapes number 1–7 with the additional +scale_shape_manual(values=1:7).18
The result is shown in Figure 1.12(a). Now we can more clearly see that there are two distinct
types of cars with very high displacement: gas guzzlers of type suv and pickup have a low mileage
and cars of type 2seater have a relatively high mileage. These turn out to be five versions of the
Chevrolet Corvette.
Script 1.26: mpg-color3.R
ggplot(mpg, aes(displ, hwy)) +
geom_point( aes(color=class, shape=class) ) +
geom_smooth(color="black") +
scale_color_grey() +
scale_shape_manual(values=1:7)
Let’s once again look at the aesthetic mappings: In Script 1.26 (mpg-color3.R), x and y are
mapped within the ggplot() call and are valid for all geoms, whereas shape and color are active
18 With more than 6 values, +scale_shape_manual(values=...) is required.
only within geom_point(). We can instead specify them within the ggplot() call to make them
valid for all geoms as it’s done in Script 1.27 (mpg-color4.R). The resulting graph is shown in
Figure 1.12(b). Now the smoothing is also done separately by class and indicated by color. The
mapping to shape is ignored by geom_smooth() because it makes no sense for the regression
function. This graph appears to be overloaded with information – if we find this type of graph
useful, we might want to consider aggregating the car classes into three or four broader types.19
Script 1.27: mpg-color4.R
ggplot(mpg, aes(displ, hwy, color=class, shape=class)) +
geom_point() +
geom_smooth(se=FALSE) +
scale_color_grey() +
scale_shape_manual(values=1:7)
19 We turn off the gray error bands for the smooths to avoid an even messier graph in Script 1.27 (mpg-color4.R) with the option se=FALSE of geom_smooth.
1.5.3. Fine Tuning of ggplot Graphs
[Figure 1.13: a fine-tuned version of the graph with the axis labels "Displacement [liters]" and "Miles/Gallon (Highway)", the legend title "Car type", and the note "Source: EPA through the ggplot2 package"]
1.5.4. Basic Data Manipulation with dplyr
> head(wdi_raw)
iso2c country SP.DYN.LE00.FE.IN year
1 1A Arab World 72.97131 2014
2 1A Arab World 72.79686 2013
3 1A Arab World 72.62239 2012
4 1A Arab World 72.44600 2011
5 1A Arab World 72.26116 2010
6 1A Arab World 72.05996 2009
> tail(wdi_raw)
iso2c country SP.DYN.LE00.FE.IN year
14515 ZW Zimbabwe 56.952 1965
14516 ZW Zimbabwe 56.521 1964
14517 ZW Zimbabwe 56.071 1963
14518 ZW Zimbabwe 55.609 1962
14519 ZW Zimbabwe 55.141 1961
14520 ZW Zimbabwe 54.672 1960
We would like to extract the relevant variables, filter out only the data for the US, rename the
variable of interest, sort by year in an increasing order, and generate a new variable using the dplyr
tools. The function names are verbs and quite intuitive to understand. They are focused on data
frames and all have the same structure: The first argument is always a data frame and the result is
one, too. So the general structure of dplyr commands is
new_data_frame <- some_verb(old_data_frame, details)
Script 1.30 (wdi-manipulation.R) performs a number of manipulations to the data set. The
first step is to filter the rows for the US. The function to do this is filter. We supply our raw data
and a condition and get the filtered data frame as a result. We would like to get rid of the ugly
variable name SP.DYN.LE00.FE.IN and rename it to LE_fem. In the tidyverse, this is done with
20 Actually, the package works with an updated version of a data frame called tibble, but that does not make any relevant
difference at this point.
21 Details and instructions for the WDI package can be found at https://github.com/vincentarelbundock/WDI.
rename(old_data, new_var=old_var). The next step is to select the relevant variables year
and LE_fem. The appropriate verb is select and we just list the chosen variables in the preferred
order. Finally, we order the data frame by year with the function arrange.
In this script, we repeatedly overwrite the data frame ourdata in each step. Section 1.5.5 intro-
duces a more elegant way to achieve the same result. We print the first and last six rows of data
after all the manipulation steps. They are in exactly the right shape for most data analysis tasks or
to produce a plot with ggplot. This is done in the last step of the script and should be familiar by
now. The result is printed as Figure 1.14.
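Script 1.30 itself is not reproduced in this excerpt; judging from the description and the pipe version in Script 1.31 below, the individual steps might look like this:
library(dplyr)
# Filter: only keep rows for the US
ourdata <- filter(wdi_raw, iso2c == "US")
# Rename: give the life expectancy variable a readable name
ourdata <- rename(ourdata, LE_fem = SP.DYN.LE00.FE.IN)
# Select: keep only the two relevant variables
ourdata <- select(ourdata, year, LE_fem)
# Arrange: sort by year in increasing order
ourdata <- arrange(ourdata, year)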
> tail(ourdata)
year LE_fem
50 2009 80.9
51 2010 81.0
52 2011 81.1
53 2012 81.2
54 2013 81.2
55 2014 81.3
> # Graph
> library(ggplot2)
[Figure 1.14: female life expectancy in the US plotted against year]
1.5.5. Pipes
Pipes are an important concept in the tidyverse. They are actually introduced in the package
magrittr which is automatically loaded with dplyr. The goal is to replace the repeated over-
writing of the data frame in Script 1.30 (wdi-manipulation.R) with something more concise, less
error-prone, and computationally more efficient.
To understand the concept of the pipe, consider a somewhat nonsensical example of sequential
computations: Our goal is to calculate $\exp(\log_{10}(6154))$, rounded to two digits. A one-liner with nested function calls would be
round(exp(log(6154,10)),2)
While this produces the correct result 44.22, it is somewhat hard to write, read, and debug. It is especially difficult to see which parentheses close which function call and which argument goes where. For more realistic problems, we would need many more nested function calls and this
approach would completely break down. An alternative would be to sequentially do the calculations
and store the results in a temporary variable:
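A minimal sketch of this (the name of the temporary object is ours):
tmp <- log(6154,10)
tmp <- exp(tmp)
tmp <- round(tmp,2)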
This is easier to read since we can clearly see the order of operations and which functions the
arguments belong to. A similar approach was taken in Script 1.30 (wdi-manipulation.R): The
data frame ourdata is overwritten over and over again. However, this is far from optimal – typing
ourdata so many times is tedious and error-prone and the computational costs are unnecessarily
high.
This is where the pipe comes into play. It is an operator that is written as %>%.22 It takes the
expression to the left hand side and uses it as the first argument for the function on the right hand
22 Conveniently, in RStudio, the pipe can be written with the key combination Ctrl + Shift + M on a Windows machine and Command + Shift + M on a Mac.
side. Therefore, 25 %>% sqrt() is the same as sqrt(25). Nesting is easily done, so our toy
example can be translated as
log(6154,10) %>%
exp() %>%
round(2)
First, log(6154,10) is evaluated. Its result is “piped” into the exp() function on the right hand
side. The next pipe takes this result as the first argument to the round function on the right hand
side. So we can read the pipe as and then: Calculate the log and then take the exponent and then do
the rounding to two digits. This version of the code is quite easily readable.
This approach can perfectly be used with dplyr, since these functions expect the old data frame
as the first input and return the new data frame. Script 1.31 (wdi-pipes.R) performs exactly the
same computations as Script 1.30 (wdi-manipulation.R) but uses pipes. Once we have understood the idea, this code is more convenient and powerful. It can directly be read as
• Take the data set wdi_raw and then ...
• filter the US data and then ...
• rename the variable and then ...
• select the variables and then ...
• order by year.
Script 1.31: wdi-pipes.R
library(dplyr)
# All manipulations with pipes:
ourdata <- wdi_raw %>%
filter(iso2c=="US") %>%
rename(LE_fem=SP.DYN.LE00.FE.IN) %>%
select(year, LE_fem) %>%
arrange(year)
(Figure 1.14: female life expectancy LE_fem in the US plotted against Year, 1960–2014. Source: World Bank, WDI.)
> tail(le_data)
iso2c country LE year
14515 ZW Zimbabwe 56.952 1965
14516 ZW Zimbabwe 56.521 1964
14517 ZW Zimbabwe 56.071 1963
14518 ZW Zimbabwe 55.609 1962
14519 ZW Zimbabwe 55.141 1961
14520 ZW Zimbabwe 54.672 1960
> tail(ctryinfo)
country
299 Kosovo
300 Sub-Saharan Africa excluding South Africa and Nigeria
301 Yemen, Rep.
302 South Africa
303 Zambia
304 Zimbabwe
income
299 Lower middle income
300 Aggregates
301 Low income
> # Join:
> alldata <- left_join(le_data, ctryinfo)
Joining, by = "country"
> tail(alldata)
iso2c country LE year income
14515 ZW Zimbabwe 56.952 1965 Low income
14516 ZW Zimbabwe 56.521 1964 Low income
14517 ZW Zimbabwe 56.071 1963 Low income
14518 ZW Zimbabwe 55.609 1962 Low income
14519 ZW Zimbabwe 55.141 1961 Low income
14520 ZW Zimbabwe 54.672 1960 Low income
Now we want to calculate the average life expectancy over all countries that share the same income
classification, separately by year. Within the tidyverse, dplyr offers the function summarize. The
structure is
summarize(olddf, newvar = somefunc(oldvars))
where somefunc is any function that accepts a vector and returns a scalar. In our case, we want to
calculate the average, so we choose the function mean, see Section 1.6. Since there are a few missing
values for the life expectancy, we need to use the option na.rm=TRUE. If we were to run
summarize(alldata, LE_avg = mean(LE, na.rm=TRUE))
then we would get the overall mean over all countries and years (which is around 65.5 years). That’s
not exactly our goal: We need to make sure that the average is to be taken separately by income
and year. This can be done by first grouping the data frame with group_by(income, year).
This grouping indicates to functions like summarize that the calculations should be done separately by group. Such a grouping can be removed with ungroup().
Script 1.33 (wdi-ctryavg.R) does these calculations. It first removes the rows that correspond to
aggregates (like Arab World) instead of individual countries and those countries that aren’t classi-
fied by the World Bank.24 Then, the grouping is added to the data set and the average is calculated.
The last six rows of data are shown: They correspond to the income group Upper middle income for the years 2009–2014. Now we are ready to plot the data with the familiar ggplot command. The result is shown in Figure 1.16. We have almost generated Figure 1.15. Readers interested in the final beautification steps can have a look at Script 1.34 (wdi-ctryavg-beautify.R) in Appendix IV.
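A sketch of these steps with pipes (the exact filter conditions for dropping aggregates and unclassified countries are our assumption based on the output shown above; the actual Script 1.33 may differ):
avgdata <- alldata %>%
  filter(income != "Aggregates") %>%       # drop aggregates like "Arab World"
  filter(income != "Not classified") %>%   # drop countries without a classification
  group_by(income, year) %>%
  summarize(LE_avg = mean(LE, na.rm=TRUE)) %>%
  ungroup()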
24 It turns out that for some reason South Sudan is not classified and is therefore removed from the analysis.
> # plot
> ggplot(avgdata, aes(year, LE_avg, color=income)) +
+   geom_line() +
+   scale_color_grey()
(Figure 1.16: average life expectancy LE_avg by year for the income groups High income, Low income, Lower middle income, and Upper middle income.)
Obviously, as a statistics program, R offers many commands for descriptive statistics. In this section,
we cover the most important ones for our purpose.
Suppose we have a sample of the random variables X and Y stored in the R vectors x and y, respec-
tively. For discrete variables, the most fundamental statistics are the frequencies of outcomes. The
command table(x) gives such a table of counts. If we provide two arguments like table(x,y),
we get the contingency table, i.e. the counts of each combination of outcomes for variables x and y.
For getting the sample shares instead of the counts, we can request prop.table(table(x)). For
the two-way tables, we can get a table of
• the overall sample share: prop.table(table(x,y))
• the share within x values (row percentages): prop.table(table(x,y),margin=1)
• the share within y values (column percentages): prop.table(table(x,y),margin=2)
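As a small illustration with made-up vectors (not taken from the book):
x <- c("a","b","b","a","b")
y <- c( 1,  1,  0,  0,  1 )
table(x)                            # frequencies of x
table(x,y)                          # contingency table of x and y
prop.table(table(x,y))              # overall sample shares
prop.table(table(x,y), margin=1)    # row percentages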
As an example, we look at the data set affairs.dta. It contains two variables we look at in Script
1.35 (Descr-Tables.R) to demonstrate the workings of the table and prop.table commands:
• kids = 1 if the respondent has at least one child
• ratemarr = Rating of the own marriage (1=very unhappy, 5=very happy)
In the R script, we first generate factor versions of the two variables of interest. In this way, we can
generate tables with meaningful labels instead of numbers for the outcomes, see Section 1.2.3. Then
different tables are produced. Of the 601 respondents, 430 (=71.5%) have children. Overall, 2.66% report being very unhappy with their marriage and 38.6% are very happy. In the contingency table with counts, we see for example that 136 respondents are very happy and have kids.
The table with the option margin=1 tells us, for example, that 81.25% of very unhappy individuals have children and only 58.6% of very happy respondents have kids. The last table reports the distribution of marriage ratings separately for people with and without kids: 56.1% of the respondents without kids are very happy, whereas only 31.6% of those with kids report being very happy
with their marriage. Before drawing any conclusions for your own family planning, please keep
on studying econometrics at least until you fully appreciate the difference between correlation and
causation!
There are several ways to graphically depict the information in these tables. Figure 1.17 demon-
strates the creation of basic pie and bar charts using the commands pie and barplot, respectively.
These figures can of course be tweaked in many ways, see the help pages and the general discussions
of graphics in section 1.4. We create vertical and horizontal (horiz=TRUE) bars, align the axis labels
to be horizontal (las=1) or perpendicular to the axes (las=2), include and position the legend, and
add a main title. The best way to explore the options is to tinker with the specification and observe
the results.
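A sketch of such charts (the factor versions of the variables and their labels are assumptions consistent with the description above; the book's script may differ):
data(affairs, package='wooldridge')
# factor versions with meaningful labels:
fkids <- factor(affairs$kids, labels=c("no","yes"))
fmarr <- factor(affairs$ratemarr,
                labels=c("very unhappy","unhappy","average","happy","very happy"))
# pie chart with a main title:
pie(table(fmarr), main="Distribution of Happiness")
# horizontal bar chart with horizontal axis labels:
barplot(table(fmarr), horiz=TRUE, las=1)
# bars split by kids, with a legend:
barplot(table(fkids, fmarr), las=2, legend.text=TRUE)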
(Figure 1.17: pie and bar charts of the marriage ratings (very unhappy, unhappy, average, happy, very happy), including a bar chart split by kids (no/yes). Title: Distribution of Happiness.)
A kernel density plot can be thought of as a more sophisticated version of a histogram. We cannot
go into detail here, but an intuitive (and oversimplifying) way to think about it is this: We could create a histogram bin of a certain width, centered at an arbitrary point x, and compute the density of observations within it. Doing this for many points x and plotting these x values against the resulting densities traces out a smooth curve. Here, we will not use this plot
as an estimator of a population distribution but rather as a pretty alternative to a histogram for the
descriptive characterization of the sample distribution. For details, see for example Silverman (1986).
In R, generating a kernel density plot is straightforward: plot( density(x) ) will automat-
ically choose appropriate parameters of the algorithm given the data and often produce a useful
result. Of course, these parameters (like the kernel and bandwidth for those who know what that is)
can be set manually. Also general plot options can be used.
(Figure: histograms of ROE. (a) hist(ROE); (b) hist(ROE, breaks=c(0,5,10,20,30,60)). Axes: ROE against Frequency and Density, respectively.)
Script 1.37 (KDensity.R) generates the graphs of Figure 1.19. In Sub-figure (b), a histogram is
overlaid with a kernel density plot by using the lines instead of the plot command for the latter.
We adjust the ylim axis limits and increase the line width using lwd.
Script 1.37: KDensity.R
# Subfigure (c): kernel density estimate
plot( density(ROE) )
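For Sub-figure (b), the overlay described above could look like this (a sketch; it assumes ROE has been extracted from the ceosal1 data, and the axis limits are our choice):
data(ceosal1, package='wooldridge')
ROE <- ceosal1$roe
# histogram on the density scale, leaving room for the density curve:
hist(ROE, freq=FALSE, ylim=c(0,0.07))
# add the kernel density estimate as a thicker line:
lines( density(ROE), lwd=3 )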
(Figure 1.19: kernel density plots of ROE; in panel (b) the density is overlaid on a histogram.)
(Figure: empirical CDF of ROE, produced by plot(ecdf(ROE)); plot title ecdf(ROE), vertical axis Fn(x).)
A box plot displays the median (the bold line), the upper and lower quartile (the box) and the
extreme points graphically. Figure 1.21 shows two examples. 50% of the observations are within
the interval covered by the box, 25% are above and 25% are below. The extreme points are marked
by the “whiskers” and outliers are printed as separate dots.25 In R, box plots are generated using
25 The definition of an “outlier” relative to “extreme values” is somewhat arbitrary. Here, a value is deemed an outlier if it
is further away from the box than 1.5 times the interquartile range (i.e. the height/width of the box).
(Figure 1.21: box plots of ROE. (a) boxplot(ROE, horizontal=TRUE); (b) boxplot(ROE~df$consprod), i.e. separate boxes by ceosal1$consprod.)
the boxplot command. We have to supply the data vector and can alter the design flexibly with
numerous options.
Figure 1.21(a) shows how to get a horizontally aligned plot and Figure 1.21(b) demonstrates how
to produce different plots by sub group defined by a second variable. The variable consprod from
the data set ceosal1 is equal to 1 if the firm is in the consumer product business and 0 otherwise.
Apparently, the ROE is much higher in this industry.26
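A sketch of the two variants (using the ceosal1 data directly; the book's script may define intermediate objects):
data(ceosal1, package='wooldridge')
# (a) horizontal box plot of ROE:
boxplot(ceosal1$roe, horizontal=TRUE)
# (b) separate box plots of ROE by the dummy consprod:
boxplot(ceosal1$roe ~ ceosal1$consprod)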
with the parameters n = 10 and p = 20% = 0.2. We know that the probability to get exactly
x ∈ {0, 1, . . . , 10} white balls for this distribution is28
$$f(x) = P(X = x) = \binom{n}{x} \cdot p^x \cdot (1-p)^{n-x} = \binom{10}{x} \cdot 0.2^x \cdot 0.8^{10-x} \qquad (1.1)$$
For example, the probability to get exactly $x = 2$ white balls is $f(2) = \binom{10}{2} \cdot 0.2^2 \cdot 0.8^{8} = 0.302$.
Of course, we can let R do these calculations using basic R commands we know from Section 1.1.
More conveniently, we can also use the built-in function for the Binomial distribution from Table 1.5
dbinom(x,n,p):
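For instance (a sketch, not necessarily the book's exact listing):
# pedestrian calculation with choose() and basic arithmetic:
choose(10,2) * 0.2^2 * 0.8^8
# the same using the built-in pmf of the Binomial distribution:
dbinom(2, 10, 0.2)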
We can also give vectors as one or more arguments to dbinom(x,n,p) and receive the results as
a vector. Script 1.39 (PMF-example.R) evaluates the pmf for our example at all possible values for
x (0 through 10). It displays a table of the probabilities and creates a bar chart of these probabilities
which is shown in Figure 1.22(a). Note that the option type="h" of the command plot draws
vertical lines instead of points, see Section 1.4. As always: feel encouraged to experiment!
(Figure 1.22: (a) Binomial pmf, plotted as fx against x = 0, ..., 10; (b) Standard normal pdf, dnorm(x) for x between -4 and 4.)
> # Plot
> plot(x, fx, type="h")
For continuous distributions like the uniform, logistic, exponential, normal, t, χ2 , or F distribution,
the probability density functions f ( x ) are also implemented for direct use in R. These can for example
be used to plot the density functions using the curve command (see Section 1.4). Figure 1.22(b)
shows the famous bell-shaped pdf of the standard normal distribution. It was created using the
command curve( dnorm(x), -4,4 ).
The probability that a standard normal random variable takes a value between −1.96 and 1.96 is
95%:
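In R, this probability can be calculated with the cdf function pnorm (a minimal sketch):
pnorm(1.96) - pnorm(-1.96)     # approximately 0.95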
Note that we get a slightly different answer than the one given in Wooldridge (2019) since we’re working
with the exact $\frac{2}{3}$ instead of the rounded .67. The same approach can be used for the second problem:
The graph of the cdf is a step function for discrete distributions and can therefore be best created
using the type="s" option of plot, see Section 1.4. For the urn example, the cdf is shown in Figure
1.23(a). It was created using the following code:
x <- seq(-1,10)
Fx <- pbinom(x, 10, 0.2)
plot(x, Fx, type="s")
The cdf of a continuous distribution can very well be plotted using the curve command. The S-
shaped cdf of the normal distribution is shown in Figure 1.23(b). It was simply generated with
curve( pnorm(x), -4,4 ).
(Figure 1.23: (a) Binomial cdf; (b) Standard normal cdf.)
Quantile function
The q-quantile $x_{[q]}$ of a random variable is the value for which the probability to sample a value $x \le x_{[q]}$ is just $q$. These values are important for example for calculating critical values of test statistics.
To give a simple example: Given $X$ is standard normal, the 0.975-quantile is $x_{[0.975]} \approx 1.96$. So the probability to sample a value less than or equal to 1.96 is 97.5%:
> qnorm(0.975)
[1] 1.959964
> rbinom(10,1,0.5)
[1] 1 1 0 0 0 0 1 0 1 0
the sample size to 1,000 or 10,000,000. Taking draws from the standard normal distribution is equally
simple:
> rnorm(10)
[1] 0.83446013 1.31241551 2.50264541 1.16823174 -0.42616558
[6] -0.99612975 -1.11394990 -0.05573154 1.17443240 1.05321861
Working with computer-generated random samples creates problems for the reproducibility of
the results. If you run the code above, you will get different samples. If we rerun the code, the
sample will change again. We can solve this problem by making use of the way the random numbers are actually generated which, as already noted, does not involve true randomness. Actually, we will
always get the same sequence of numbers if we reset the random number generator to some specific
state (“seed”). In R, this is done with set.seed(number), where number is some arbitrary number
that defines the state but has no other meaning. If we set the seed to some arbitrary number, take
a sample, reset the seed to the same state and take another sample, both samples will be the same.
Also, if I draw a sample with that seed it will be equal to the sample you draw if we both start from
the same seed.
Script 1.40 (Random-Numbers.R) demonstrates the workings of set.seed.
Output of Script 1.40: Random-Numbers.R
> # Sample from a standard normal RV with sample size n=5:
> rnorm(5)
[1] 0.05760597 -0.73504289 0.93052842 1.66821097 0.55968789
> # Set the seed of the random number generator and take two samples:
> set.seed(6254137)
> rnorm(5)
[1] 0.6601307 0.5123161 -0.4616180 -1.3161982 0.1811945
> rnorm(5)
[1] -0.2933858 -0.9023692 1.8385493 0.5652698 -1.2848862
> # Reset the seed to the same value to get the same samples again:
> set.seed(6254137)
> rnorm(5)
[1] 0.6601307 0.5123161 -0.4616180 -1.3161982 0.1811945
> rnorm(5)
[1] -0.2933858 -0.9023692 1.8385493 0.5652698 -1.2848862
This “manual” way of calculating the CI is used in the solution to Example C.2. We will see a more
convenient way to calculate the confidence interval together with the corresponding t test in Section
1.8.4. In Section 1.10.3, we will calculate confidence intervals in a simulation experiment to help us
understand the meaning of confidence intervals.
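A minimal sketch of such a manual calculation for a generic sample stored in a vector y (the data and object names here are made up for illustration):
y <- c(1.2, 0.7, 1.5, 0.3, 1.1)        # some made-up sample
n    <- length(y)
ybar <- mean(y)
se   <- sd(y)/sqrt(n)                  # standard error of the sample mean
crit <- qt(0.975, n-1)                 # critical value for a 95% CI
c(ybar - crit*se, ybar + crit*se)      # lower and upper CI bound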
29 The stripped-down textbook for Europe and Africa Wooldridge (2014) does not include the discussion of this material.
> SR88<-c(3,1,5,.5,1.54,1.5,.8,2,.67,1.17,.51,.5,.61,6.7,
+         4,7,19,.2,5,3.83)
30 Note that Wooldridge (2019) has a typo in the discussion of this example, therefore the numbers don’t quite match for the
95% CI.
1.8.2. t Tests
Hypothesis tests are covered in Wooldridge (2019, Appendix C.6). The t test statistic for testing a
hypothesis about the mean µ of a normally distributed random variable Y is shown in Equation
C.35. Given the null hypothesis H0 : µ = µ0 ,
$$t = \frac{\bar{y} - \mu_0}{\mathrm{se}(\bar{y})}. \qquad (1.3)$$
We already know how to calculate the ingredients from Section 1.8.1. Given the calculations shown
there, t for the null hypothesis H0 : µ = 1 would simply be
t <- (ybar-1) / se
The critical value for this test statistic depends on whether the test is one-sided or two-sided.
The value needed for a two-sided test, $c_{\alpha/2}$, was already calculated for the CI; the other values can be generated accordingly. The values for different degrees of freedom $n-1$ and significance levels $\alpha$
are listed in Wooldridge (2019, Table G.2). Script 1.43 (Critical-Values-t.R) demonstrates how
we can calculate our own table of critical values for the example of 19 degrees of freedom.
Output of Script 1.43: Critical-Values-t.R
> # degrees of freedom = n-1:
> df <- 19
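The remaining calculations could look like this (a sketch; the actual script may differ in layout):
# significance levels:
alpha.one.tailed <- c(0.1, 0.05, 0.025, 0.01, 0.005, 0.001)
alpha.two.tailed <- alpha.one.tailed * 2
# critical values and output as a table:
CV <- qt(1 - alpha.one.tailed, df)
cbind(alpha.one.tailed, alpha.two.tailed, CV)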
1.8.3. p Values
The p value for a test is the probability that (under the assumptions needed to derive the distribution
of the test statistic) a different random sample would produce the same or an even more extreme
value of the test statistic.31 The advantage of using p values for statistical testing is that they are
convenient to use. Instead of having to compare the test statistic with critical values which are
implied by the significance level α, we directly compare p with α. For two-sided t tests, the formula
for the p value is given in Wooldridge (2019, Equation C.42):
$$p = 2 \cdot P(T_{n-1} > |t|) = 2 \cdot \left[ 1 - F_{t_{n-1}}(|t|) \right], \qquad (1.4)$$
where $F_{t_{n-1}}(\cdot)$ is the cdf of the $t_{n-1}$ distribution which we know how to calculate from Table 1.5.
Similarly, a one-sided test rejects the null hypothesis only if the value of the estimate is “too high”
or “too low” relative to the null hypothesis. The p values for these types of tests are
$$p = \begin{cases} P(T_{n-1} < t) = F_{t_{n-1}}(t) & \text{for } H_1: \mu < \mu_0 \\ P(T_{n-1} > t) = 1 - F_{t_{n-1}}(t) & \text{for } H_1: \mu > \mu_0 \end{cases} \qquad (1.5)$$
Since we are working on a computer program that knows the cdf of the t distribution as pt,
calculating p values is straightforward: Given we have already calculated the t statistic above, the p
value would simply be one of the following expressions, depending on the type of the alternative hypothesis:
p <- 2 * ( 1 - pt(abs(t), n-1) )   # two-sided test (Equation 1.4)
p <- pt(t, n-1)                    # one-sided test, H1: mu < mu0 (Equation 1.5)
p <- 1 - pt(t, n-1)                # one-sided test, H1: mu > mu0 (Equation 1.5)
31 The p value is often misinterpreted. It is for example not the probability that the null hypothesis is true. For a discussion,
see for example https://www.nature.com/news/scientific-method-statistical-errors-1.14700.
> # p value
> (p <- pt(t,n-1))
[1] 0.02229063
> # p value
> (p <- pt(t,240))
[1] 1.369273e-05
This would implicitly calculate the relevant results for the two-sided test of the null $H_0: \mu_y = \mu_0$, $H_1: \mu_y \neq \mu_0$, where $\mu_0 = 0$ by default. The 95% CI is reported. We can choose different tests using the
options
• alternative="greater" for H0 : µy = µ0 , H1 : µy > µ0
• alternative="less" for H0 : µy = µ0 , H1 : µy < µ0
• mu=value to set µ0 =value instead of µ0 = 0
• conf.level=value to set the confidence level to value·100% instead of conf.level=0.95
To give a comprehensive example: Suppose you want to test H0 : µy = 5 against the one-sided
alternative H1 : µy > 5 and obtain a 99% CI. The command would be
t.test(y, mu=5, alternative="greater", conf.level=0.99)
> SR88<-c(3,1,5,.5,1.54,1.5,.8,2,.67,1.17,.51,.5,.61,6.7,4,7,19,.2,5,3.83)
data: Change
t = -2.1507, df = 19, p-value = 0.04458
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.27803369 -0.03096631
sample estimates:
mean of x
-1.1545
68 1. Introduction
data: Change
t = -2.1507, df = 19, p-value = 0.02229
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -0.2263028
sample estimates:
mean of x
-1.1545
data: audit$y
t = -4.2768, df = 240, p-value = 2.739e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.1939385 -0.0716217
sample estimates:
mean of x
-0.1327801
data: audit$y
t = -4.2768, df = 240, p-value = 1.369e-05
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -0.08151529
sample estimates:
mean of x
-0.1327801
The command t.test is our first example of a function that returns a list. Instead of just
displaying the results as we have done so far, we can store them as an object for further use. Section
1.2.6 described the general workings of these sorts of objects.
If we store the results for example as testres <- t.test(...), the object testres contains
all relevant information about the test results. Like a basic list, the names of all components can be
displayed with names(testres). They include
• statistic = value of the test statistic
• p.value = value of the p value of the test
• conf.int = confidence interval
A single component, for example p.value is accessed as testres$p.value. Script 1.49
(Test-Results-List.R) demonstrates this for the test in Example C.3.
Output of Script 1.49: Test-Results-List.R
> data(audit, package=’wooldridge’)
data: audit$y
t = -4.2768, df = 240, p-value = 2.739e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.1939385 -0.0716217
sample estimates:
mean of x
-0.1327801
> # p-value
> testres$p.value
[1] 2.738542e-05
The general form of conditional execution is if (condition) expression1 else expression2. The condition has to be a single logical value (TRUE or FALSE). If it is TRUE, then expression1 is executed, otherwise expression2, which can also be omitted. A simple example would be
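A sketch consistent with the description below (the names p and decision are taken from the text; the value of p and the threshold are arbitrary):
p <- 0.02    # some p value
if (p <= 0.05) decision <- "reject H0!" else decision <- "don't reject H0!"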
The character object decision will take the respective value depending on the value of the numeric
scalar p. Often, we want to conditionally execute several lines of code. This can easily be achieved by
grouping the expressions in curly braces {...}. Note that the else statement (if it is used) needs
to go on the same line as the closing brace of the if statement. So the structure will look like
if (condition) {
[several...
...lines...
... of code]
} else {
[different...
...lines...
... of code]
}
1.9.2. Loops
For repeatedly executing an expression (which can again be grouped by braces {...}), different
kinds of loops are available. In this book, we will use them for Monte Carlo analyses introduced in
Section 1.10. For our purposes, the for loop is well suited. Its typical structure is as follows:
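Schematically (loopvar, vector, and the bracketed commands are placeholders, as described below):
for (loopvar in vector) {
  [some commands]
}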
The loop variable loopvar will take the value of each element of vector, one after another. For
each of these elements, [some commands] are executed. Often, vector will be a sequence like
1:100.
A nonsense example which combines for loops with an if statement is the following:
for (i in 1:6) {
if (i<4) {
print(i^3)
} else {
print(i^2)
}
}
Note that the print commands are necessary to print any results within expressions grouped by
braces. The reader is encouraged to first form expectations about the output this will generate and
then compare them with the actual results:
[1] 1
[1] 8
[1] 27
[1] 16
[1] 25
[1] 36
R offers more ways to repeat expressions, but we will not present them here. Interested readers
can look up commands like repeat, while, replicate, apply or lapply.
1.9.3. Functions
Functions are special kinds of objects in R. There are many pre-defined functions – the first one we
used was sqrt. Packages provide more functions to expand the capabilities of R. And now, we’re
ready to define our own little function. The command function(arg1, arg2,...) defines a
new function which accepts the arguments arg1, arg2,. . . The function definition follows in arbitrar-
ily many lines of code enclosed in curly braces. Within the function, the command return(stuff)
means that stuff is to be returned as a result of the function call. For example, we can define the
function mysqrt that expects one argument internally named x as
mysqrt <- function(x) {
if(x>=0){
return(sqrt(x))
} else {
return("You fool!")
}
}
Once we have executed this function definition, mysqrt is known to the system and we can use it
just like any other function:
> mysqrt(4)
[1] 2
> mysqrt(-1)
[1] "You fool!"
1.9.4. Outlook
While this section is called “Advanced R”, we have admittedly only scratched the surface of semi-
advanced topics. One topic we defer to Chapter 19 is how R can automatically create formatted
reports and publication-ready documents.
Another advanced topic is the optimization of computational speed. Like most other software
packages used for econometrics, R is an interpreted language. A disadvantage compared to compiled
languages like C++ or Fortran is that the execution speed for computationally intensive tasks is lower.
So an example of seriously advanced topics for the real R geek is how to speed up computations.
Possibilities include compiling R code, integrating C++ or Fortran code, and parallel computing.
Since real R geeks are not the target audience of this book, we will stop short of even mentioning more intimidating possibilities and focus on implementing the most important econometric methods in
the most straightforward and pragmatic way.
> mean(sample)
[1] 9.913197
> mean(sample)
[1] 10.21746
All sample means Ȳ are around the true mean µ = 10 which is consistent with our presumption
formulated in Equation 1.7. It is also not surprising that we don’t get the exact population parameter
– that’s the nature of the sampling noise. According to Equation 1.7, the results are expected to have
a variance of $\frac{\sigma^2}{n} = 0.04$. Three samples of this kind are insufficient to draw strong conclusions
regarding the validity of Equation 1.7. Good Monte Carlo simulation studies should use as many
samples as possible.
In Section 1.9.2, we introduced for loops. While they are not the most powerful technique
available in R to implement a Monte Carlo study, we will stick to them since they are quite
transparent and straightforward. The code shown in Script 1.51 (Simulation-Repeated.R) uses
a for loop to draw 10 000 samples of size n = 100 and calculates the sample average for all of
them. After setting the random seed, a vector ybar is initialized to 10 000 zeros using the numeric
command. We will replace these zeros with the estimates one after another in the loop. In each of
these replications j = 1, 2, . . . , 10 000, a sample is drawn, its average calculated and stored in position
number j of ybar. In this way, we end up with a vector of 10 000 estimates from different samples.
The script Simulation-Repeated.R does not generate any output.
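The beginning of the script could look like this (a sketch; the seed value is arbitrary):
# Set the random seed (the specific number has no meaning):
set.seed(123456)
# initialize ybar to a vector of r zeros:
r <- 10000
ybar <- numeric(r)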
# repeat r times:
for(j in 1:r) {
# Draw a sample and store the sample mean in pos. j=1,2,... of ybar:
sample <- rnorm(100,10,2)
ybar[j] <- mean(sample)
}
(Figure: kernel density estimate of the 10,000 simulated sample means ybar, plot title density.default(x = ybar).)
To summarize, the simulation results confirm the theoretical results in Equation 1.7. Mean, vari-
ance and density are very close and it seems likely that the remaining tiny differences are due to the
fact that we “only” used 10 000 samples.
Remember: for most advanced estimators, such simulations are the only way to study some of
their features since it is impossible to derive theoretical results of interest. For us, the simple exam-
ple hopefully clarified the approach of Monte Carlo simulations and the meaning of the sampling
distribution and prepared us for other interesting simulation exercises.
34 In order to ensure the same scale in each graph, the axis limits were manually set instead of being chosen by R. This was
done using the options xlim=c(8.5,11.5),ylim=c(0,2) in the plot command producing the estimated density.
35 A motivated reader will already have figured out that this graph was generated by curve( dchisq(x,1) ,0,3).
(Figure: simulated density of the sample mean for different sample sizes: (a) n = 10, (b) n = 50, (c) n = 100, (d) n = 1000.)
(Figure: density of the χ²(1) distribution, plotted with dchisq(x, 1) for x between 0 and 3.)
(Figure: simulated densities of the sample mean for increasing sample sizes, including panels (c) n = 100 and (d) n = 10000.)
> table(reject1)
reject1
FALSE TRUE
9492 508
> table(reject2)
reject2
FALSE TRUE
3043 6957
(Figure: simulated confidence intervals for 100 samples (Sample No. on the vertical axis), shown separately for a correct H0 and an incorrect H0.)
Part I.
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (2.2)$$
$$\hat{\beta}_1 = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}. \qquad (2.3)$$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \qquad (2.4)$$
For a given sample, we just need to calculate the four statistics ȳ, x̄, Cov( x, y), and Var( x ) and
plug them into these equations. We already know how to make these calculations in R, see Section
1.6. Let’s do it!
> attach(ceosal1)
> var(roe)
[1] 72.56499
> mean(salary)
[1] 1281.12
> mean(roe)
[1] 17.18421
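Continuing this pedestrian approach, the coefficients follow directly from Equations 2.2 and 2.3 (a sketch; since attach(ceosal1) was called above, roe and salary are available directly):
# manual calculation of the OLS coefficients:
(b1 <- cov(roe, salary) / var(roe))
(b0 <- mean(salary) - b1 * mean(roe))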
While calculating OLS coefficients using this pedestrian approach is straightforward, there is a
more convenient way to do it. Given the importance of OLS regression, it is not surprising that R
has a specialized command to do the calculations automatically.
If the values of the dependent variable are stored in the vector y and those of the regressor are in
the vector x, we can calculate the OLS coefficients as
lm( y ~ x )
The name of the command lm comes from the abbreviation of linear model. Its argument y ~ x is
called a formula in R lingo. Essentially, it means that we want to model a left-hand-side variable
y to be explained by a right-hand-side variable x in a linear fashion. We will discuss more general
model formulae in Section 6.1.
If we have a data frame df with the variables y and x, instead of calling lm( df$y ~ df$x ),
we can use the more elegant version
lm( y ~ x, data=df )
Call:
lm(formula = salary ~ roe, data = ceosal1)
Coefficients:
(Intercept) roe
963.2 18.5
From now on, we will rely on the built-in routine lm instead of doing the calculations manually.
It is not only more convenient for calculating the coefficients, but also for further analyses as we will
see soon.
lm returns its results in a special version of a list.1 We can store these results in an object using
code like
myolsres <- lm( y ~ x )
This will create an object with the name myolsres or overwrite it if it already existed. The name
could of course be anything, for example yummy.chocolate.chip.cookies, but choosing telling
variable names makes our life easier. This object does not only include the vector of OLS coefficients,
but also information on the data source and much more we will get to know and use later on.
Given the results from a regression, plotting the regression line is straightforward. As we have
already seen in Section 1.4.3, the command abline(...) can add a line to a graph. It is clever
enough to understand our objective if we simply supply the regression result object as an argument.
# OLS regression
CEOregres <- lm( salary ~ roe, data=ceosal1 )
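The plot itself could then be produced like this (a sketch; the axis limits in the book's figure may differ):
# scatter plot of the data and the fitted regression line:
plot(ceosal1$roe, ceosal1$salary, ylim=c(0,4000))
abline(CEOregres)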
1 Remember a similar object returned by t.test (Section 1.8.4). General lists were introduced in Section 1.2.6.
(Figure: scatter plot of salary against roe with the OLS regression line.)
wage = β 0 + β 1 education + u.
In Script 2.4 (Example-2-4.R), we analyze the data and find that the OLS regression line is
$$\widehat{\text{wage}} = -0.90 + 0.54 \cdot \text{educ},$$
see the coefficients in the output below. One additional year of education is associated with an increase of the typical wage by about 54 cents an hour.
Call:
lm(formula = wage ~ educ, data = wage1)
Coefficients:
(Intercept) educ
-0.9049 0.5414
The model
$$\text{voteA} = \beta_0 + \beta_1 \, \text{shareA} + u$$
is estimated in Script 2.5 (Example-2-5.R). The OLS regression line turns out to be
$$\widehat{\text{voteA}} = 26.81 + 0.464 \cdot \text{shareA},$$
see the coefficients in the output below. The scatter plot with the regression line generated in the code is shown in Figure 2.2.
Call:
lm(formula = voteA ~ shareA, data = vote1)
Coefficients:
(Intercept) shareA
26.8122 0.4638
> abline(VOTEres)
(Figure 2.2: scatter plot of voteA against shareA with the OLS regression line.)
Another way to interact with objects like this is through generic functions. They accept different
types of arguments and, depending on the type, give appropriate results. As an example, the number
of observations n is returned with nobs(myolsres) if the regression results are stored in the object
myolsres.
Obviously, we are interested in the OLS coefficients. As seen above, they can be obtained as
myolsres$coefficients. An alternative is the generic function coef(myolsres). The co-
efficient vector has names attached to its elements. The name of the intercept parameter β̂ 0 is
"(Intercept)" and the name of the slope parameter β̂ 1 is the variable name of the regressor
x. In this way, we can access the parameters separately by using either the position (1 or 2) or the
name as an index to the coefficients vector. For details, review Section 1.2.4 for a general discussion
of working with vectors.
Given these parameter estimates, calculating the predicted values $\hat{y}_i$ and residuals $\hat{u}_i$ for each observation $i = 1, \dots, n$ is easy:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_i \qquad (2.5)$$
$$\hat{u}_i = y_i - \hat{y}_i \qquad (2.6)$$
If the values of the dependent and independent variables are stored in the vectors y and x, re-
spectively, we can estimate the model and do the calculations of these equations for all observations
jointly using the code
myolsres <- lm( y ~ x )
bhat <- coef(myolsres)
yhat <- bhat["(Intercept)"] + bhat["x"] * x
uhat <- y - yhat
We can also use a more black-box approach which will give exactly the same results using the
generic functions fitted and resid on the regression results object:
myolsres <- lm( y ~ x )
bhat <- coef(myolsres)
yhat <- fitted(myolsres)
uhat <- resid(myolsres)
Wooldridge (2019, Section 2.3) presents and discusses three properties of OLS statistics which we
will confirm for an example.
$$\sum_{i=1}^{n} \hat{u}_i = 0 \quad \Rightarrow \quad \bar{\hat{u}} = 0 \qquad (2.7)$$
$$\sum_{i=1}^{n} x_i \hat{u}_i = 0 \quad \Rightarrow \quad \mathrm{Cov}(x_i, \hat{u}_i) = 0 \qquad (2.8)$$
$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \bar{x} \qquad (2.9)$$
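These properties can be checked numerically, for example for the CEO salary regression (a sketch; the book's script for this example may differ):
data(ceosal1, package='wooldridge')
CEOregres <- lm(salary ~ roe, data=ceosal1)
uhat <- resid(CEOregres)
mean(uhat)                                                  # property (2.7): zero on average
cov(ceosal1$roe, uhat)                                      # property (2.8): zero covariance
mean(ceosal1$salary)                                        # property (2.9): the point of means ...
coef(CEOregres)[1] + coef(CEOregres)[2]*mean(ceosal1$roe)   # ... lies on the regression line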
$$R^2 = \frac{\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)} = 1 - \frac{\mathrm{Var}(\hat{u})}{\mathrm{Var}(y)} \qquad (2.13)$$
We have already come across the command summary as a generic function that produces appro-
priate summaries for very different types of objects. We can also use it to get many interesting results
for a regression. They are introduced one by one in the next sections. If the variable rres contains
a result from a regression, summary(rres) will display
• Some statistics for the residual like the extreme values and the median
• A coefficient table. So far, we only discussed the OLS coefficients shown in the first column.
The next columns will be introduced below.
• Some more information of which only R2 is of interest to us so far. It is reported as Multiple
R-squared.
Call:
lm(formula = voteA ~ shareA, data = vote1)
Residuals:
Min 1Q Median 3Q Max
-16.8919 -4.0660 -0.1682 3.4965 29.9772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.81221 0.88721 30.22 <2e-16 ***
shareA 0.46383 0.01454 31.90 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2.4. Nonlinearities
For the estimation of logarithmic or semi-logarithmic models, the respective formula can be directly
entered into the specification of lm(...) as demonstrated in Examples 2.10 and 2.11. For the
interpretation as percentage effects and elasticities, see Wooldridge (2019, Section 2.4).
Call:
lm(formula = log(wage) ~ educ, data = wage1)
Coefficients:
(Intercept) educ
0.58377 0.08274
Call:
lm(formula = log(salary) ~ log(sales), data = ceosal1)
Coefficients:
(Intercept) log(sales)
4.8220 0.2567
Wooldridge (2019, Section 2.6) discusses models without an intercept. This implies that the regres-
sion line is forced to go through the origin. In R, we can suppress the constant which is otherwise
implicitly added to a formula by specifying
lm(y ~ 0 + x)
instead of lm(y ~ x). The result is a model which only has a slope parameter.
Another topic discussed in this section is a linear regression model without a slope parameter, i.e.
with a constant only. In this case, the estimated constant will be the sample average of the dependent
variable. This can be implemented in R using the code
lm(y ~ 1)
Both special kinds of regressions are implemented in Script 2.12 (SLR-Origin-Const.R) for the
example of the CEO salary and ROE we already analyzed in Example 2.8 and others. The resulting
regression lines are plotted in Figure 2.3 which was generated using the last lines of code shown in
the output.
Call:
lm(formula = salary ~ roe, data = ceosal1)
Coefficients:
(Intercept) roe
963.2 18.5
Call:
lm(formula = salary ~ 0 + roe, data = ceosal1)
Coefficients:
roe
63.54
Call:
lm(formula = salary ~ 1, data = ceosal1)
Coefficients:
(Intercept)
1281
> # average y:
> mean(ceosal1$salary)
[1] 1281.12
(Figure 2.3: ceosal1$salary against ceosal1$roe with the three regression lines: full, through origin, and const only.)
In R, we can obviously do the calculations of Equations 2.14 through 2.16 explicitly. But the output
of the summary command for linear regression results, which we discovered in Section 2.3, already contains these results. We use the following example to calculate the results in both ways to
open the black box of the canned routine and convince ourselves that from now on we can rely on it.
Wooldridge, Example 2.12: Student Math Performance and the School Lunch
Program
Using the data set MEAP93.dta, we regress a math performance score of schools on the share of stu-
dents eligible for a federally funded lunch program. Wooldridge (2019) uses this example to demon-
strate the importance of assumption SLR.4 and warns us against interpreting the regression results in a
causal way. Here, we merely use the example to demonstrate the calculation of standard errors.
Script 2.13 (Example-2-12.R) first calculates the SER manually using the fact that the residuals û are
available as resid(results), see Section 2.2. Then, the SE of the parameters are calculated ac-
cording to Equations 2.15 and 2.16, where the regressor is addressed as the variable in the data frame
df$lnchprg.
Finally, we see the output of the summary command. The SE of the parameters are reported in the
second column of the regression table, next to the parameter estimates. We will look at the other
columns in Chapter 4. The SER is reported as Residual standard error below the table. All three
values are exactly the same as the manual results.
> # SER:
> (SER <- sd(resid(results)) * sqrt((n-1)/(n-2)) )
[1] 9.565938
Call:
lm(formula = math10 ~ lnchprg, data = meap93)
Residuals:
Min 1Q Median 3Q Max
-24.386 -5.979 -1.207 4.865 45.845
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.14271 0.99758 32.221 <2e-16 ***
lnchprg -0.31886 0.03484 -9.152 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.2092 0.4384
> sum((x-mean(x))^2)
[1] 990.4104
> # Graph
> plot(x, y, col="gray", xlim=c(0,8) )
> abline(b0,b1,lwd=2)
> abline(olsres,col="gray",lwd=2)
(Figure: scatter plot of the simulated sample with the population regression line and the OLS regression line.)
Since the SLR assumptions hold in our exercise, Theorems 2.1 and 2.2 of Wooldridge (2019) should
apply. Theorem 2.1 implies for our model that the estimators are unbiased, i.e.
$$\mathrm{E}(\hat{\beta}_0) = \beta_0 = 1 \qquad \mathrm{E}(\hat{\beta}_1) = \beta_1 = 0.5$$
The estimates obtained from our sample are relatively close to their population values. Obviously,
we can never expect to hit the population parameter exactly. If we change the random seed by
specifying a different number in the first line of code of Script 2.14 (SLR-Sim-Sample.R), we get a
different sample and different parameter estimates.
Theorem 2.2 of Wooldridge (2019) states the sampling variance of the estimators conditional on the sample values $\{x_1, \dots, x_n\}$. It involves the average squared value $\overline{x^2} = 16.966$ and the sum of squares $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 990.41$ which we also know from the R output:
$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \, \overline{x^2}}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{4 \cdot 16.966}{990.41} = 0.0685$$
$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{4}{990.41} = 0.0040$$
If Wooldridge (2019) is right, the standard error of $\hat{\beta}_1$ is $\sqrt{0.004} = 0.063$. So getting an estimate of $\hat{\beta}_1 = 0.438$ for one sample doesn't seem unreasonable given $\beta_1 = 0.5$.
2 In Script 2.15 (SLR-Sim-Model.R) shown on page 321, we implement the joint sampling from x and y. The results are
essentially the same.
# repeat r times:
for(j in 1:r) {
# Draw a sample of y:
u <- rnorm(n,0,su)
y <- b0 + b1*x + u
Script 2.17 (SLR-Sim-Results.R) gives descriptive statistics of the r = 10, 000 estimates we got
from our simulation exercise. Wooldridge (2019, Theorem 2.1) claims that the OLS estimators are
unbiased, so we should expect to get estimates which are very close to the respective population
parameters. This is clearly confirmed. The average value of β̂ 0 is very close to β 0 = 1 and the
average value of β̂ 1 is very close to β 1 = 0.5.
The simulated sampling variances are $\widehat{\mathrm{Var}}(\hat{\beta}_0) = 0.069$ and $\widehat{\mathrm{Var}}(\hat{\beta}_1) = 0.004$. These values are also very close to the ones we expected from Theorem 2.2. The last lines of the code produce Figure 2.5.
It shows the OLS regression lines for the first 10 simulated samples together with the population
regression function.
> mean(b1hat)
[1] 0.5000466
> var(b1hat)
[1] 0.004069063
(Figure 2.5: the population regression function and the OLS regression lines from the first 10 simulated samples, plotted as y against x.)
The simulation results are presented in the output of Script 2.19 (SLR-Sim-Results-ViolSLR4.R).
Obviously, the OLS coefficients are now biased: The average estimates are far from the population
parameters β 0 = 1 and β 1 = 0.5. This confirms that Assumption SLR.4 is required to hold for the
unbiasedness shown in Theorem 2.1.
Output of Script 2.19: SLR-Sim-Results-ViolSLR4.R
> # MC estimate of the expected values:
> mean(b0hat)
[1] 0.1985388
> mean(b1hat)
[1] 0.7000466
> var(b1hat)
[1] 0.004069063
$$\mathrm{Var}(u|x) = \frac{4}{e^{4.5}} \cdot e^{x},$$
so SLR.5 is clearly violated since the variance depends on x. We assume exogeneity, so assumption
SLR.4 holds. The factor in front ensures that the unconditional variance is Var(u) = 4. Based on this
unconditional variance only, the sampling variance should not change compared to the results above
and we would still expect Var( β̂ 0 ) = 0.0685 and Var( β̂ 1 ) = 0.0040. But since Assumption SLR.5 is
violated, Theorem 2.2 is not applicable.
Script 2.20 (SLR-Sim-ViolSLR5.R) implements a simulation of this model and is listed in the
appendix (p. 323). Here, we only had to change the line of code for the sampling of u to
varu <- 4/exp(4.5) * exp(x)
u <- rnorm(n, 0, sqrt(varu) )
> mean(b1hat)
[1] 0.4992376
> var(b1hat)
[1] 0.007264373
3. Multiple Regression Analysis: Estimation
Running a multiple regression in R is as straightforward as running a simple regression using the
lm command. Section 3.1 shows how it is done. Section 3.2 opens the black box and replicates the
main calculations using matrix algebra. This is not required for the remaining chapters, so it can be
skipped by readers who prefer to keep black boxes closed.
Section 3.3 should not be skipped since it discusses the interpretation of regression results and the
prevalent omitted variables problems. Finally, Section 3.4 covers standard errors and multicollinear-
ity for multiple regression.
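The general call for a regression with several regressors looks like this (a generic sketch with placeholder names y, x1, x2, x3, and possibly more regressors):
lm( y ~ x1 + x2 + x3 )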
The tilde ~ again separates the dependent variable from the regressors which are now separated
using a + sign. We can add options as before. For example if the data are contained in a data frame
df, we should add the option “data=df”. The constant is again automatically added unless it is
explicitly suppressed using lm(y ~ 0+x1+x2+x3+...).
We are already familiar with the workings of lm: The command creates an object which contains
all relevant information. A simple call like the one shown above will only display the parameter esti-
mates. We can store the estimation results in a variable myres using the code myres <- lm(...)
and then use this variable for further analyses. For a typical regression output including a coefficient
table, call summary(myres). Of course if this is all we want, we can leave out storing the result and
simply call summary( lm(...) ) in one step. Further analyses involving residuals, fitted values
and the like can be used exactly as presented in Chapter 2.
The output of summary includes parameter estimates, standard errors according to Theorem 3.2
of Wooldridge (2019), the coefficient of determination R2 , and many more useful results we cannot
interpret yet before we have worked through Chapter 4.
Call:
lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
Coefficients:
(Intercept) hsGPA ACT
1.286328 0.453456 0.009426
> summary(GPAres)
Call:
lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
Residuals:
Min 1Q Median 3Q Max
-0.85442 -0.24666 -0.02614 0.28127 0.85357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.286328 0.340822 3.774 0.000238 ***
hsGPA 0.453456 0.095813 4.733 5.42e-06 ***
ACT 0.009426 0.010777 0.875 0.383297
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-2.05802 -0.29645 -0.03265 0.28788 1.42809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284360 0.104190 2.729 0.00656 **
educ 0.092029 0.007330 12.555 < 2e-16 ***
exper 0.004121 0.001723 2.391 0.01714 *
tenure 0.022067 0.003094 7.133 3.29e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = prate ~ mrate + age, data = k401k)
Residuals:
Min 1Q Median 3Q Max
-81.162 -8.067 4.787 12.474 18.256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.1191 0.7790 102.85 < 2e-16 ***
mrate 5.5213 0.5259 10.50 < 2e-16 ***
age 0.2432 0.0447 5.44 6.21e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = narr86 ~ pcnv + ptime86 + qemp86, data = crime1)
Residuals:
Min 1Q Median 3Q Max
-0.7118 -0.4031 -0.2953 0.3452 11.4358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.711772 0.033007 21.565 < 2e-16 ***
pcnv -0.149927 0.040865 -3.669 0.000248 ***
ptime86 -0.034420 0.008591 -4.007 6.33e-05 ***
qemp86 -0.104113 0.010388 -10.023 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = narr86 ~ pcnv + avgsen + ptime86 + qemp86, data = crime1)
Residuals:
Min 1Q Median 3Q Max
-0.9330 -0.4247 -0.2934 0.3506 11.4403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.706756 0.033151 21.319 < 2e-16 ***
pcnv -0.150832 0.040858 -3.692 0.000227 ***
avgsen 0.007443 0.004734 1.572 0.115993
ptime86 -0.037391 0.008794 -4.252 2.19e-05 ***
qemp86 -0.103341 0.010396 -9.940 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = log(wage) ~ educ, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-2.21158 -0.36393 -0.07263 0.29712 1.52339
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.583773 0.097336 5.998 3.74e-09 ***
educ 0.082744 0.007567 10.935 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This equation involves three matrix operations which we know how to implement in R from Section
1.2.5:
• Transpose: The expression $X'$ is t(X) in R
• Matrix multiplication: The expression $X'X$ is translated as t(X)%*%X
• Inverse: $(X'X)^{-1}$ is written as solve( t(X)%*%X )
So we can collect everything and translate Equation 3.2 into the somewhat unsightly expression
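Presumably something along these lines, combining the three operations just listed (Equation 3.2 is the usual OLS formula $\hat{\beta} = (X'X)^{-1} X'y$):
bhat <- solve( t(X)%*%X ) %*% t(X)%*%y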
The residuals are
$$\hat{u} = y - X\hat{\beta} \qquad (3.3)$$
and the error variance is estimated as $\hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n-k-1}$, which is equivalent to sigsqhat <- t(uhat) %*% uhat / (n-k-1). For technical reasons, it will be convenient to have this variable as a scalar instead of a 1 × 1 matrix, so we put this expression into the as.numeric function in our actual implementation:
sigsqhat <- as.numeric( t(uhat) %*% uhat / (n-k-1) )
The standard error of the regression (SER) is its square root $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$. The estimated OLS variance-covariance matrix according to Wooldridge (2019, Theorem E.2) is then
$$\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1} \qquad (3.5)$$
Finally, the standard errors of the parameter estimates are the square roots of the main diagonal of $\widehat{\mathrm{Var}}(\hat{\beta})$ which can be expressed in R as
se <- sqrt( diag(Vbetahat) )
Script 3.6 (OLS-Matrices.R) implements this for the GPA regression from Example 3.1. Com-
paring the results to the built-in function (see Script 3.1 (Example-3-1.R)), it is reassuring that we
get exactly the same numbers for the parameter estimates, SER (“Residual standard error”),
and standard errors of the coefficients.
Output of Script 3.6: OLS-Matrices.R
> data(gpa1, package=’wooldridge’)
> # extract y
> y <- gpa1$colGPA
[4,] 1 3.5 27
[5,] 1 3.9 28
[6,] 1 3.4 25
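Putting the pieces together, the full matrix calculation could look like this (a sketch along the lines of Script 3.6; details of the actual script may differ):
data(gpa1, package='wooldridge')
n <- nrow(gpa1); k <- 2
y <- gpa1$colGPA                                  # dependent variable
X <- cbind(1, gpa1$hsGPA, gpa1$ACT)               # regressor matrix incl. constant
bhat <- solve( t(X)%*%X ) %*% t(X)%*%y            # OLS coefficients
uhat <- y - X %*% bhat                            # residuals
sigsqhat <- as.numeric( t(uhat)%*%uhat/(n-k-1) )  # estimated error variance
SER <- sqrt(sigsqhat)                             # standard error of the regression
Vbetahat <- sigsqhat * solve( t(X)%*%X )          # estimated variance-covariance matrix
se <- sqrt( diag(Vbetahat) )                      # standard errors of the coefficients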
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 \qquad (3.6)$$
The parameter $\hat{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit while keeping $x_2$ fixed. In contrast, consider the simple regression including only $x_1$ as a regressor:
$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1. \qquad (3.7)$$
The parameter $\tilde{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit (and NOT keeping $x_2$ fixed). It can be related to $\hat{\beta}_1$ using the formula
$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1 \qquad (3.8)$$
• Each of these $\tilde{\delta}_1$ units leads to an increase of predicted $y$ by $\hat{\beta}_2$ units, giving a total indirect effect of $\tilde{\delta}_1 \hat{\beta}_2$ (see again Equ. 3.6)
• The overall effect $\tilde{\beta}_1$ is the sum of the direct and indirect effects (see Equ. 3.8).
We revisit Example 3.1 to see whether we can demonstrate equation 3.8 in R. Script 3.7
(Omitted-Vars.R) repeats the regression of the college GPA (colGPA) on the achievement test
score (ACT) and the high school GPA (hsGPA). We study the ceteris paribus effect of ACT on colGPA
which has an estimated value of β̂ 1 = 0.0094. The estimated effect of hsGPA is β̂ 2 = 0.453. The slope
parameter of the regression corresponding to Eq. 3.9 is δ̃1 = 0.0389. Plugging these values into Equ.
3.8 gives a total effect of β̃ 1 = 0.0271 which is exactly what the simple regression at the end of the
output delivers.
Output of Script 3.7: Omitted-Vars.R
> data(gpa1, package=’wooldridge’)
> beta.hat
(Intercept) ACT hsGPA
1.286327767 0.009426012 0.453455885
> delta.tilde
(Intercept) ACT
2.46253658 0.03889675
Call:
lm(formula = colGPA ~ ACT, data = gpa1)
Coefficients:
(Intercept) ACT
2.40298 0.02706
In this example, the indirect effect is actually stronger than the direct effect. ACT predicts colGPA
mainly because it is related to hsGPA which in turn is strongly related to colGPA.
These relations hold for the estimates from a given sample. In Section 3.3, Wooldridge (2019)
discusses how to apply the same sort of arguments to the OLS estimators which are random variables
varying over different samples. Omitting relevant regressors causes bias if we are interested in
estimating partial effects. In practice, it is difficult to include all relevant regressors, making omitted variables a prevalent problem. It is important enough to have motivated a vast amount
of methodological and applied research. More advanced techniques like instrumental variables or
panel data methods try to solve the problem in cases where we cannot add all relevant regressors,
for example because they are unobservable. We will come back to this in Part 3.
$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathrm{SST}_j (1 - R_j^2)} = \frac{1}{n} \cdot \frac{\sigma^2}{\mathrm{Var}(x_j)} \cdot \frac{1}{1 - R_j^2}, \qquad (3.10)$$
where $\mathrm{SST}_j = \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)^2 = n \cdot \mathrm{Var}(x_j)$ is the total sum of squares and $R_j^2$ is the usual coefficient of determination from a regression of $x_j$ on all of the other regressors.1
The variance of $\hat{\beta}_j$ consists of four parts:
• $\frac{1}{n}$: The variance is smaller for larger samples.
• $\sigma^2$: The variance is larger if the error term varies a lot, since it introduces randomness into the relationship between the variables of interest.
• $\frac{1}{\mathrm{Var}(x_j)}$: The variance is smaller if the regressor $x_j$ varies a lot since this provides relevant information about the relationship.
• $\frac{1}{1 - R_j^2}$: This variance inflation factor (VIF) accounts for (imperfect) multicollinearity. If $x_j$ is highly related to the other regressors, $R_j^2$ and therefore also $\mathit{VIF}_j$ and the variance of $\hat{\beta}_j$ are large.
Since the error variance $\sigma^2$ is unknown, we replace it with an estimate to come up with an estimated variance of the parameter estimate. Its square root is the standard error
$$\mathrm{se}(\hat{\beta}_j) = \frac{1}{\sqrt{n}} \cdot \frac{\hat{\sigma}}{\mathrm{sd}(x_j)} \cdot \frac{1}{\sqrt{1 - R_j^2}}. \qquad (3.11)$$
It is not directly obvious that this formula leads to the same results as the matrix formula in
Equation 3.5. We will validate this formula by replicating Example 3.1 which we also used for
manually calculating the SE using the matrix formula above. The calculations are shown in Script
3.8 (MLR-SE.R).
We also use this example to demonstrate how to extract results which are reported by the summary
of the lm results. Given its results are stored in variable sures using the results of sures
<- summary(lm(...)), we can easily access the results using sures$resultname where the
resultname can be any of the following:
• coefficients for a matrix of the regression table (including coefficients, SE, ...)
• residuals for a vector of residuals
• sigma for the SER
• r.squared for R2
• and more.2
1 Note that here, we use the population variance formula $\mathrm{Var}(x_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)^2$
2 As with any other list, a full listing of result names can again be obtained by names(sures) if sures stores the results.
> summary(res)
Call:
lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)
Residuals:
Min 1Q Median 3Q Max
-0.85442 -0.24666 -0.02614 0.28127 0.85357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.286328 0.340822 3.774 0.000238 ***
hsGPA 0.453456 0.095813 4.733 5.42e-06 ***
ACT 0.009426 0.010777 0.875 0.383297
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> sdx <- sd(gpa1$hsGPA) * sqrt((n-1)/n) # (Note: sd() uses the (n-1) version)
This is used in Script 3.8 (MLR-SE.R) to extract the SER of the main regression and the R2j from
the regression of hsGPA on ACT which is needed for calculating the VIF for the coefficient of hsGPA.3
The other ingredients of formula 3.11 are straightforward. The standard error calculated this way
is exactly the same as the one of the built-in command and the matrix formula used in Script 3.6
(OLS-Matrices.R).
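To see Equation 3.11 at work for the coefficient of hsGPA, a sketch (mirroring but not necessarily identical to Script 3.8):
data(gpa1, package='wooldridge')
res <- lm(colGPA ~ hsGPA + ACT, data=gpa1)
n   <- nobs(res)
SER <- summary(res)$sigma                               # SER of the main regression
R2j <- summary( lm(hsGPA ~ ACT, data=gpa1) )$r.squared  # R^2 from regressing hsGPA on ACT
VIF <- 1/(1 - R2j)                                      # variance inflation factor
sdx <- sd(gpa1$hsGPA) * sqrt((n-1)/n)                   # population version of sd
1/sqrt(n) * SER/sdx * sqrt(VIF)                         # SE of the hsGPA coefficient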
3 We could have calculated these values manually like in Scripts 2.8 (Example-2-8.R), 2.13 (Example-2-12.R) or 3.6
(OLS-Matrices.R).
A convenient way to automatically calculate variance inflation factors (VIF) is provided by the
package car. Remember from Section 1.1.3 that in order to use this package, we have to install it
once per computer using install.packages("car"). Then we can load it with the command
library(car). Among other useful tools, this package implements the command vif(lmres)
where lmres is a regression result from lm. It delivers a vector of VIF for each of the regressors as
demonstrated in Script 3.9 (MLR-VIF.R).
We extend Example 3.6. and regress individual log wage on education (educ), potential overall
work experience (exper), and the number of years with current employer (tenure). We could
imagine that these three variables are correlated with each other, but the results show no big VIF.
The largest one is for the coefficient of exper. Its variance is higher by a factor of (only) 1.478 than
in a world in which it were uncorrelated with the other regressors. So we don’t have to worry about
multicollinearity here.
Output of Script 3.9: MLR-VIF.R
> data(wage1, package='wooldridge')
Call:
lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-2.05802 -0.29645 -0.03265 0.28788 1.42809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284360 0.104190 2.729 0.00656 **
educ 0.092029 0.007330 12.555 < 2e-16 ***
exper 0.004121 0.001723 2.391 0.01714 *
tenure 0.022067 0.003094 7.133 3.29e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
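As a minimal sketch of the automatic calculation (assuming wage1 is loaded as above):

library(car)
res <- lm(log(wage) ~ educ + exper + tenure, data=wage1)
vif(res)    # one VIF per regressor; the largest one (for exper) is about 1.48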
H0: β_j = a_j,    (4.1)

where a_j is some given number, very often a_j = 0. For the most common case of two-tailed tests, the alternative hypothesis is

H1: β_j ≠ a_j.    (4.2)

These hypotheses can be conveniently tested using a t test which is based on the test statistic

$$t = \frac{\hat{\beta}_j - a_j}{\mathrm{se}(\hat{\beta}_j)}. \qquad (4.4)$$
If H0 is in fact true and the CLM assumptions hold, then this statistic has a t distribution with
n − k − 1 degrees of freedom.
H0: β_j = 0,    H1: β_j ≠ 0,    (4.5)

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \qquad (4.6)$$
The subscript on the t statistic indicates that this is “the” t value for β̂_j for this frequent version of the test. Under H0, it has the t distribution with n − k − 1 degrees of freedom, implying that the probability that |t_{β̂_j}| > c is equal to α if c is the 1 − α/2 quantile of this distribution. If α is our significance level (e.g. α = 5%), then we

reject H0 if |t_{β̂_j}| > c

in our sample. For the typical significance level α = 5%, the critical value c will be around 2 for reasonably large degrees of freedom and approach the counterpart of 1.96 from the standard normal distribution in very large samples.
The p value indicates the smallest value of the significance level α for which we would still reject H0 using our sample. So it is the probability for a random variable T with the respective t distribution that |T| > |t_{β̂_j}|, where t_{β̂_j} is the value of the t statistic in our particular sample. In our two-tailed test, it can be calculated as

$$p_{\hat{\beta}_j} = 2 \cdot F_{t_{n-k-1}}\big(-|t_{\hat{\beta}_j}|\big), \qquad (4.7)$$

where F_{t_{n−k−1}}(·) is the cdf of the t distribution with n − k − 1 degrees of freedom. If our software provides us with the relevant p values, they are easy to use: We

reject H0 if p_{β̂_j} ≤ α.
Since this standard case of a t test is so common, R provides us with the relevant t and p values
directly in the summary of the estimation results we already saw in the previous chapter. The
regression table includes for all regressors and the intercept
• Parameter estimates and standard errors, see Section 3.1.
• The test statistics t_{β̂_j} from Equation 4.6 in the column t value
• The corresponding two-sided p values p_{β̂_j} from Equation 4.7 in the column Pr(>|t|)
For the critical values of the t tests, using the normal approximation instead of the exact t distribution with n − k − 1 = 137 d.f. doesn't make much of a difference:
> # CV for alpha=5% and 1% using the t distribution with 137 d.f.:
> alpha <- c(0.05, 0.01)
> qt(1-alpha/2, 137)   # (approximately 1.98 and 2.61)
> # Critical values for alpha=5% and 1% using the normal approximation:
> qnorm(1-alpha/2)
[1] 1.959964 2.575829
Script 4.1 (Example-4-3.R) presents the standard summary which directly contains all the
information to test the hypotheses in Equation 4.5 for all parameters. The t statistics for all
coefficients except β₂ are larger in absolute value than the critical value c = 2.61 (or c = 2.58 using the normal approximation) for α = 1%. So we would reject H0 for all usual significance levels. By construction, we draw the same conclusions from the p values (or the significance symbols next to them).
In order to confirm that R is exactly using the formulas of Wooldridge (2019), we next
reconstruct the t and p values manually. The whole regression table is stored as
sumres$coefficients, where sumres contains the summary results, see Section 3.4.
We extract the first two columns of it as the coefficients and standard errors, respectively.
Then we simply apply Equations 4.6 and 4.7.
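A minimal sketch of these manual calculations, assuming the summary results are stored in sumres:

regtable <- sumres$coefficients
bhat  <- regtable[, 1]                  # parameter estimates
se    <- regtable[, 2]                  # standard errors
tstat <- bhat / se                      # Equation 4.6
pval  <- 2 * pt(-abs(tstat), 137)       # Equation 4.7 with n-k-1 = 137 d.f.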
Call:
lm(formula = colGPA ~ hsGPA + ACT + skipped, data = gpa1)
Residuals:
Min 1Q Median 3Q Max
-0.85698 -0.23200 -0.03935 0.24816 0.81657
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.38955 0.33155 4.191 4.95e-05 ***
hsGPA 0.41182 0.09367 4.396 2.19e-05 ***
ACT 0.01472 0.01056 1.393 0.16578
skipped -0.08311 0.02600 -3.197 0.00173 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Given the standard regression output like the one in Script 4.1 (Example-4-3.R), including the p value for two-sided tests p_{β̂_j}, we can easily do one-sided t tests for the null hypothesis H0: β_j = 0 in two steps:
• Is β̂_j positive (if H1: β_j > 0) or negative (if H1: β_j < 0)?
– No → do not reject H0 since this cannot be evidence against H0.
– Yes → The relevant p value is half of the reported p_{β̂_j}.
⇒ Reject H0 if p = ½ · p_{β̂_j} < α.
> # CV for alpha=5% and 1% using the t distribution with 522 d.f.:
> alpha <- c(0.05, 0.01)
> qt(1-alpha, 522)   # (approximately 1.65 and 2.33)
> # Critical values for alpha=5% and 1% using the normal approximation:
> qnorm(1-alpha)
[1] 1.644854 2.326348
Script 4.2 (Example-4-1.R) shows the standard regression output. The reported t statistic for the parameter of exper is t_{β̂₂} = 2.391, which is larger than the critical value c = 2.33 for the significance level α = 1%, so we reject H0. By construction, we get the same answer from looking at the p value. As always, the reported p_{β̂_j} value is for a two-sided test, so we have to divide it by 2. The resulting value p = 0.01714/2 = 0.00857 < 0.01, so we reject H0 using an α = 1% significance level.
Call:
lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-2.05802 -0.29645 -0.03265 0.28788 1.42809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284360 0.104190 2.729 0.00656 **
educ 0.092029 0.007330 12.555 < 2e-16 ***
exper 0.004121 0.001723 2.391 0.01714 *
tenure 0.022067 0.003094 7.133 3.29e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
β̂_j ± c · se(β̂_j),    (4.8)
where c is the same critical value for the two-sided t test using a significance level α = 5%.
Wooldridge (2019) shows examples of how to manually construct these CI.
R provides a convenient way to calculate the CI for all parameters: If the regression results are
stored in a variable myres, the command confint(myres) gives a table of 95% confidence inter-
vals. Other levels can be chosen using the option level = value. The 99% CI are for example
obtained as confint(myres,level=0.99).
Script 4.3 (Example-4-8.R) presents the regression results as well as the 95% and 99% CI. See Wooldridge
(2019) for the manual calculation of the CI and comments on the results.
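To see where these intervals come from, the following sketch constructs the 95% CI by hand according to Equation 4.8, assuming the regression results are stored in myres:

bhat <- coef(myres)
se   <- sqrt(diag(vcov(myres)))
crit <- qt(0.975, df=myres$df.residual)     # two-sided critical value for alpha=5%
cbind(bhat - crit*se, bhat + crit*se)       # should coincide with confint(myres)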
Call:
lm(formula = log(rd) ~ log(sales) + profmarg, data = rdchem)
Residuals:
Min 1Q Median 3Q Max
-0.97681 -0.31502 -0.05828 0.39020 1.21783
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.37827 0.46802 -9.355 2.93e-10 ***
log(sales) 1.08422 0.06020 18.012 < 2e-16 ***
profmarg 0.02166 0.01278 1.694 0.101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We want to test whether the performance measures batting average (bavg), home runs per year
(hrunsyr), and runs batted in per year (rbisyr) have an impact on the salary once we control
for the number of years as an active player (years) and the number of games played per year
(gamesyr). So we state our null hypothesis as H0 : β 3 = 0, β 4 = 0, β 5 = 0 versus H1 : H0 is false, i.e.
at least one of the performance measures matters.
The test statistic of the F test is based on the relative difference between the sum of squared residuals in the general (unrestricted) model and in a restricted model in which the hypotheses are imposed, SSR_ur and SSR_r, respectively. In our example, the restricted model is one in which bavg, hrunsyr, and rbisyr are excluded as regressors. If both models involve the same dependent variable, the statistic can also be written in terms of the coefficients of determination in the unrestricted and the restricted model, R²_ur and R²_r, respectively:

$$F = \frac{SSR_r - SSR_{ur}}{SSR_{ur}} \cdot \frac{n-k-1}{q} = \frac{R^2_{ur} - R^2_r}{1 - R^2_{ur}} \cdot \frac{n-k-1}{q}, \qquad (4.10)$$
where q is the number of restrictions (in our example, q = 3). Intuitively, if the null hypothesis is
correct, then imposing it as a restriction will not lead to a significant drop in the model fit and the
F test statistic should be relatively small. It can be shown that under the CLM assumptions and the
null hypothesis, the statistic has an F distribution with the numerator degrees of freedom equal to q
and the denominator degrees of freedom of n − k − 1. Given a significance level α, we will reject H0
if F > c, where the critical value c is the 1 − α quantile of the relevant Fq,n−k−1 distribution. In our
example, n = 353, k = 5, q = 3. So with α = 1%, the critical value is 3.84 and can be calculated using
the qf function as
> # CV for alpha=1% using the F distribution with 3 and 347 d.f.:
> qf(1-0.01, 3, 347)   # (approximately 3.84)
Script 4.4 (F-Test-MLB.R) shows the calculations for this example. The result is F = 9.55 > 3.84, so we clearly reject H0. We also calculate the p value for this test. It is p = 4.47 · 10⁻⁶ = 0.00000447, so we reject H0 for any reasonable significance level.
Output of Script 4.4: F-Test-MLB.R
> data(mlb1, package='wooldridge')
> # R2:
> ( r2.ur <- summary(res.ur)$r.squared )
[1] 0.6278028
> # F statistic:
> ( F <- (r2.ur-r2.r) / (1-r2.ur) * 347/3 )
[1] 9.550254
It should not be surprising that there is a more convenient way to do this in R. The package car
provides a command linearHypothesis which is well suited for these kinds of tests.1 Given the
unrestricted estimation results are stored in a variable res, an F test is conducted with
linearHypothesis(res, myH0)
where myH0 describes the null hypothesis to be tested. It is a vector of length q where each restriction
is described as a text in which the variable name takes the place of its parameter. In our example, H0
is that the three parameters of bavg, hrunsyr, and rbisyr are all equal to zero, which translates
as myH0 <- c("bavg=0","hrunsyr=0","rbisyr=0"). The “=0” can also be omitted since this
is the default hypothesis. Script 4.5 (F-Test-MLB-auto.R) implements this for the same test as the
manual calculations done in Script 4.4 (F-Test-MLB.R) and results in exactly the same F statistic
and p value.
> # F test
> myH0 <- c("bavg","hrunsyr","rbisyr")
Hypothesis:
bavg = 0
hrunsyr = 0
rbisyr = 0
This function can also be used to test more complicated null hypotheses. For example, suppose a sports reporter claims that the batting average plays no role and that the number of home runs has twice the impact of the number of runs batted in. This translates (using variable names instead of numbers as subscripts) as H0: β_bavg = 0, β_hrunsyr = 2 · β_rbisyr. For R, we translate it as myH0 <- c("bavg=0","hrunsyr=2*rbisyr"). The output of Script 4.6 (F-Test-MLB-auto2.R) shows the results of this test. The p value is p = 0.6, so we cannot reject H0.
Hypothesis:
bavg = 0
hrunsyr - 2 rbisyr = 0
If we are interested in testing the null hypothesis that a set of coefficients with similar names are
equal to zero, the function matchCoefs(res,expr) can be handy. It provides the names of all
coefficients in result res which contain the expression expr. Script 4.7 (F-Test-MLB-auto3.R) presents an example of how this works. A more realistic example is given in Section 7.5, where we automatically select all interaction coefficients.
Output of Script 4.7: F-Test-MLB-auto3.R
> # Note: Script "F-Test-MLB-auto.R" has to be run first to create res.ur.
> # Which variables used in res.ur contain "yr" in their names?
> myH0 <- matchCoefs(res.ur,"yr")
> myH0
[1] "gamesyr" "hrunsyr" "rbisyr"
Hypothesis:
gamesyr = 0
hrunsyr = 0
rbisyr = 0
The most important, and at the same time most straightforward, F test is the one for overall significance. The null hypothesis is that all parameters except for the constant are equal to zero. If this null hypothesis holds, the regressors do not have any joint explanatory power for y. The results of such a test are automatically included in the last line of summary(lm(...)). As an example, see Script 4.3 (Example-4-8.R). The null hypothesis that neither the sales nor the profit margin have any relation to R&D spending is clearly rejected with an F statistic of 162.2 and a p value smaller than 10⁻¹⁵.
Script 4.8 (Example-4-10.R) loads the data, generates the new variable b_s = benefits/salary, and runs three regressions with different sets of other factors. The stargazer command is then used to display the results in a clearly arranged table. We choose the option type="text" to request text output (instead of a LaTeX table) and keep.stat=c("n","rsq") to have n and R² reported in the table.
Note that the default translation of p values to stars differs between stargazer() and summary(): one star * here translates to p < 0.1, whereas it means p < 0.05 in the standard summary() output. This is of course arbitrary. The behavior of stargazer can be changed with the option star.cutoffs=c(0.05, 0.01, 0.001).
> stargazer(list(model1,model2,model3),type="text",keep.stat=c("n","rsq"))
==========================================
Dependent variable:
-----------------------------
log(salary)
(1) (2) (3)
------------------------------------------
b_s -0.825*** -0.605*** -0.589***
(0.200) (0.165) (0.165)
droprate -0.0003
(0.002)
gradrate 0.001
(0.001)
------------------------------------------
Observations 408 408 408
R2 0.040 0.353 0.361
==========================================
Note: *p<0.1; **p<0.05; ***p<0.01
5. Multiple Regression Analysis: OLS Asymptotics
Asymptotic theory allows us to relax some assumptions needed to derive the sampling distribution
of estimators if the sample size is large enough. For running a regression in a software package, it
does not matter whether we rely on stronger assumptions or on asymptotic arguments. So we don’t
have to learn anything new regarding the implementation.
Instead, this chapter aims to improve on our intuition regarding the workings of asymptotics by
looking at some simulation exercises in Section 5.1. Section 5.2 briefly discusses the implementation
of the regression-based LM test presented by Wooldridge (2019, Section 5.2).
the samples. For a more detailed discussion of the implementation, see Section 2.7.2 where a very
similar simulation exercise is introduced.
Script 5.1: Sim-Asy-OLS-norm.R
# Note: We’ll have to set the sample size first, e.g. by uncommenting:
# n <- 100
# Set the random seed
set.seed(1234567)
# set true parameters: intercept & slope
b0 <- 1; b1 <- 0.5
# initialize b1hat to store 10000 results:
b1hat <- numeric(10000)
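A sketch of how the simulation loop might continue (assuming the regressor x is drawn once and held fixed over the replications):

# draw a sample of x, fixed over replications:
x <- rnorm(n, 4, 1)
# repeat 10000 times:
for(j in 1:10000) {
  # draw a sample of u (std. normal) and calculate y:
  u <- rnorm(n)
  y <- b0 + b1*x + u
  # regress y on x and store the slope estimate at position j:
  b1hat[j] <- coef(lm(y ~ x))["x"]
}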
This code was run for different sample sizes. The density estimate together with the corresponding
normal density are shown in Figure 5.1. Not surprisingly, all distributions look very similar to the
normal distribution – this is what Theorem 4.1 predicted. The fact that the sampling variance decreases as n rises only becomes apparent if we pay attention to the different scales of the axes.
For each of the same sample sizes used above, we again estimate the slope parameter for 10 000
samples. The densities of β̂ 1 are plotted in Figure 5.3 together with the respective normal distribu-
tions with the corresponding variances. For the small sample sizes, the deviation from the normal
distribution is strong. Note that the dashed normal distributions have the same mean and variance as the simulated densities. The main difference is the kurtosis, which is larger than 8 in the simulations for n = 5, compared to a value of 3 for the normal distribution.
Figure 5.1. Density of β̂₁ with different sample sizes: normal error terms
[Panels: (a) n = 5, (b) n = 10, (c) n = 100, (d) n = 1 000]
[Figure: density of the standardized χ²₁ distribution compared to the standard normal (horizontal axis: u)]
Figure 5.3. Density of β̂₁ with different sample sizes: non-normal error terms
[Panels: (a) n = 5, (b) n = 10, (c) n = 100, (d) n = 1 000]
For larger sample sizes, the sampling distribution of β̂ 1 converges to the normal distribution. For
n = 100, the difference is much smaller but still discernible. For n = 1 000, it cannot be detected
anymore in our simulation exercise. How large the sample needs to be depends among other things
on the severity of the violations of MLR.6. If the distribution of the error terms is not as extremely
non-normal as in our simulations, smaller sample sizes like the rule of thumb n = 30 might suffice
for valid asymptotics.
# repeat r times:
for(j in 1:10000) {
# Draw a sample of x, varying over replications:
x <- rnorm(n,4,1)
# Draw a sample of u (std. normal):
u <- rnorm(n)
# Draw a sample of y:
y <- b0 + b1*x + u
# regress y on x and store slope estimate at position j
bhat <- coef( lm(y~x) )
b1hat[j] <- bhat["x"]
}
Figure 5.4 shows the distribution of the 10 000 estimates generated by Script 5.3
(Sim-Asy-OLS-uncond.R) for n = 5, 10, 100, and 1 000. As we expected from theory, the
distribution is (close to) normal for large samples. For small samples, it deviates quite a bit. The
kurtosis is 8.7 for a sample size of n = 5 which is far away from the kurtosis of 3 of a normal
distribution.
[Figure 5.4: densities of β̂₁ for n = 5, 10, 100, and 1 000, simulated with Script 5.3 (Sim-Asy-OLS-uncond.R)]
5.2. LM Test
As an alternative to the F tests discussed in Section 4.3, LM tests for the same sort of hypotheses can be very useful with large samples. In the linear regression setup, the test statistic is

LM = n · R²_ũ,

where n is the sample size and R²_ũ is the usual R² statistic in a regression of the residual ũ from the restricted model on the unrestricted set of regressors. Under the null hypothesis, it is asymptotically distributed as χ²_q with q denoting the number of restrictions. Details are given in Wooldridge (2019, Section 5.2).
The implementation in R is straightforward if we remember that the residuals can be obtained
with the resid command.
The dependent variable narr86 reflects the number of times a man was arrested and is explained by the proportion of prior arrests that led to a conviction (pcnv), the average sentence length from prior convictions (avgsen), the time spent in prison before 1986 (tottime), the number of months in prison in 1986 (ptime86), and the number of quarters unemployed in 1986 (qemp86).
The joint null hypothesis is
H0: β₂ = β₃ = 0,

so the restricted set of regressors excludes avgsen and tottime. Script 5.4 (Example-5-3.R) shows an implementation of this LM test. The restricted model is estimated and its residuals utilde = ũ are calculated. They are regressed on the unrestricted set of regressors. The R² from this regression is 0.001494, so the LM test statistic is around LM = 0.001494 · 2725 = 4.071. This is smaller than the critical value for a significance level of α = 10%, so we do not reject the null hypothesis. We can also easily calculate the p value in R using the χ² cdf pchisq. It turns out to be 0.1306.
The same hypothesis can be tested using the F test presented in Section 4.3 using the command
linearHypothesis. In this example, it delivers the same p value up to three digits.
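A sketch of these steps, assuming the crime1 data set from the wooldridge package:

data(crime1, package='wooldridge')
# 1. Estimate the restricted model (without avgsen and tottime):
restr <- lm(narr86 ~ pcnv + ptime86 + qemp86, data=crime1)
# 2. Regress its residuals on the unrestricted set of regressors:
LMreg <- lm(resid(restr) ~ pcnv + avgsen + tottime + ptime86 + qemp86, data=crime1)
# 3. LM test statistic and its p value from the chi-squared distribution with 2 d.f.:
r2 <- summary(LMreg)$r.squared
LM <- r2 * nobs(LMreg)
1 - pchisq(LM, 2)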
> # R-squared:
> (r2 <- summary(LMreg)$r.squared )
[1] 0.001493846
> LM
[1] 4.070729
Hypothesis:
avgsen = 0
tottime = 0
Call:
lm(formula = bwght ~ cigs + faminc, data = bwght)
Coefficients:
(Intercept) cigs faminc
116.97413 -0.46341 0.09276
Call:
lm(formula = bwghtlbs ~ cigs + faminc, data = bwght)
Coefficients:
(Intercept) cigs faminc
7.310883 -0.028963 0.005798
Call:
lm(formula = I(bwght/16) ~ cigs + faminc, data = bwght)
Coefficients:
(Intercept) cigs faminc
7.310883 -0.028963 0.005798
Call:
lm(formula = bwght ~ I(cigs/20) + faminc, data = bwght)
Coefficients:
(Intercept) I(cigs/20) faminc
116.97413 -9.26815 0.09276
If the regression model only contains standardized variables, the coefficients have a special inter-
pretation. They measure by how many standard deviations y changes as the respective independent
variable increases by one standard deviation. Inconsistent with the notation used here, they are some-
times referred to as beta coefficients.
In R, we can use the same type of arithmetic transformations as in Section 6.1.1 to subtract the
mean and divide by the standard deviation. But it can also be done more conveniently by using the
function scale directly for all variables we want to standardize. The equation and the corresponding R formula in a model with two standardized regressors would be

z_y = b₁·z_{x1} + b₂·z_{x2} + u    ⇔    scale(y) ~ 0 + scale(x1) + scale(x2)    (6.3)
Call:
lm(formula = scale(price) ~ 0 + scale(nox) + scale(crime) + scale(rooms) +
scale(dist) + scale(stratio), data = hprice2)
Coefficients:
scale(nox) scale(crime) scale(rooms) scale(dist)
-0.3404 -0.1433 0.5139 -0.2348
scale(stratio)
-0.2703
6.1.3. Logarithms
We have already seen in Section 2.4 that we can include the function log directly in formulas to represent logarithmic and semi-logarithmic models. A simple example of a partially logarithmic model and its R formula would be

log(price) = β₀ + β₁·log(nox) + β₂·rooms + u    ⇔    log(price) ~ log(nox) + rooms
Call:
lm(formula = log(price) ~ log(nox) + rooms, data = hprice2)
Coefficients:
(Intercept) log(nox) rooms
9.2337 -0.7177 0.3059
y = β₀ + β₁x + β₂x² + β₃x³ + u    (6.5)

Instead of writing out the powers with I(x^2) and I(x^3), the polynomial can also be specified with poly(x, 3, raw=TRUE). This is more concise with long variable names and/or a high degree of the polynomial. It is also useful since some post-estimation commands like Anova are better able to understand the specification. And without the option raw=TRUE, we specify orthogonal polynomials instead of standard (raw) polynomials.
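As an illustration with hypothetical variables y and x in a data frame mydata, the cubic model (6.5) can be specified in either of two equivalent ways:

lm(y ~ x + I(x^2) + I(x^3), data=mydata)
lm(y ~ poly(x, 3, raw=TRUE), data=mydata)    # same fit, more compact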
For nonlinear models like this, it is often useful to get a graphical illustration of the effects. Section
6.2.3 shows how to conveniently generate these.
Script 6.4 (Example-6-2.R) implements this model and presents detailed results including t statistics and their p values. The quadratic term of rooms has a significantly positive coefficient β̂₄, implying that the semi-elasticity increases with more rooms. The negative coefficient for rooms and the positive coefficient for rooms² imply that for “small” numbers of rooms, the price decreases with the number of rooms and for “large” values, it increases. The number of rooms implying the smallest price can be found as²

$$rooms^* = \frac{-\hat{\beta}_3}{2\hat{\beta}_4} \approx 4.4.$$
> summary(res)
Call:
lm(formula = log(price) ~ log(nox) + log(dist) + rooms + I(rooms^2) +
stratio, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-1.04285 -0.12774 0.02038 0.12650 1.25272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.385477 0.566473 23.630 < 2e-16 ***
log(nox) -0.901682 0.114687 -7.862 2.34e-14 ***
log(dist) -0.086781 0.043281 -2.005 0.04549 *
rooms -0.545113 0.165454 -3.295 0.00106 **
I(rooms^2) 0.062261 0.012805 4.862 1.56e-06 ***
stratio -0.047590 0.005854 -8.129 3.42e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(res)
Call:
lm(formula = log(price) ~ log(nox) + log(dist) + poly(rooms,
2, raw = TRUE) + stratio, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-1.04285 -0.12774 0.02038 0.12650 1.25272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.385477 0.566473 23.630 < 2e-16 ***
log(nox) -0.901682 0.114687 -7.862 2.34e-14 ***
log(dist) -0.086781 0.043281 -2.005 0.04549 *
poly(rooms, 2, raw = TRUE)1 -0.545113 0.165454 -3.295 0.00106 **
poly(rooms, 2, raw = TRUE)2 0.062261 0.012805 4.862 1.56e-06 ***
stratio -0.047590 0.005854 -8.129 3.42e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Hypothesis:
poly(rooms, 2, raw = TRUE)1 = 0
poly(rooms, 2, raw = TRUE)2 = 0
Response: log(price)
Sum Sq Df F value Pr(>F)
log(nox) 4.153 1 61.8129 2.341e-14 ***
log(dist) 0.270 1 4.0204 0.04549 *
poly(rooms, 2, raw = TRUE) 14.838 2 110.4188 < 2.2e-16 ***
stratio 4.440 1 66.0848 3.423e-15 ***
Residuals 33.595 500
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA table also allows us to quickly compare the relevance of the regressors. The first column shows the sum of squared deviations explained by each variable after all the other regressors are controlled for. We see that in this sense, the number of rooms has the highest explanatory power in our example.
ANOVA tables are also convenient if the effect of a variable is captured by several parameters for other reasons. We will give an example when we discuss factor variables in Section 7.3. ANOVA tables of Types I and III are less often of interest. They differ in what other variables are controlled for when testing for the effect of one regressor. Fox and Weisberg (2011, Sections 4.4.3–4.4.4) discuss ANOVA tables in more detail.
y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + u.    (6.6)
Of course, we can implement this in R by defining a new variable containing the product of the two
regressors. But again, a direct specification in the model formula is more convenient. The expression
x1:x2 within a formula adds the interaction term x1 x2 . Even more conveniently, x1*x2 adds not
only the interaction but also both original variables allowing for a very concise syntax. So the model
in equation 6.6 can be specified in R as either of the two formulas
y ~ x1+x2+x1:x2 ⇔ y ~ x1*x2
If one variable x1 is interacted with a set of other variables, they can be grouped by parentheses to allow for a compact syntax. For example, a model equation and its R formula could be

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₁x₂ + β₅x₁x₃ + u    ⇔    y ~ x1*(x2+x3)    (6.7)
Script 6.6 (Example-6-3.R) estimates this model. The effect of attending classes is

$$\frac{\partial\, stndfnl}{\partial\, atndrte} = \beta_1 + \beta_6\, priGPA.$$

For the average priGPA = 2.59, the script estimates this partial effect to be around 0.0078. It tests the null hypothesis that this effect is zero using a simple F test, see Section 4.3. With a p value of 0.0034, this hypothesis can be rejected at all common significance levels.
Call:
lm(formula = stndfnl ~ atndrte * priGPA + ACT + I(priGPA^2) +
I(ACT^2), data = attend)
Coefficients:
(Intercept) atndrte priGPA ACT
2.050293 -0.006713 -1.628540 -0.128039
I(priGPA^2) I(ACT^2) atndrte:priGPA
0.295905 0.004533 0.005586
> linearHypothesis(myres,c("atndrte+2.59*atndrte:priGPA"))
Linear hypothesis test
Hypothesis:
atndrte + 2.59 atndrte:priGPA = 0
6.2. Prediction
In this section, we are concerned with predicting the value of the dependent variable y given certain
values of the regressors x1 , . . . , xk . If these are the regressor values in our estimation sample, we
called these predictions “fitted values” and discussed their calculation in Section 2.2. Now, we
generalize this to arbitrary values and add standard errors, confidence intervals, and prediction
intervals.
θ₀ = E(y | x₁ = c₁, . . . , x_k = c_k) = β₀ + β₁c₁ + β₂c₂ + · · · + β_k c_k.    (6.9)

The natural point estimate is

θ̂₀ = β̂₀ + β̂₁c₁ + β̂₂c₂ + · · · + β̂_k c_k    (6.10)

and can readily be obtained once the parameter estimates β̂₀, . . . , β̂_k are calculated.
Standard errors and confidence intervals are less straightforward to compute. Wooldridge (2019,
Section 6.4) suggests a smart way to obtain these from a modified regression. R provides an even
simpler and more convenient approach.
The command predict can not only automatically calculate θ̂0 but also its standard error and
confidence intervals. Its arguments are
• The regression results. If they are stored in a variable reg by a command like
reg <- lm(y~x1+x2+x3,...), we can just supply the name reg.
• A data frame containing the values c₁, . . . , c_k of the regressors x₁, . . . , x_k with the same variable names as in the data frame used for estimation. If we don't have one yet, it can for example be specified as data.frame(x1=c1, x2=c2, ..., xk=ck) where x1 through xk are the variable names and c1 through ck are the values, which can also be specified as vectors to get predictions at several values of the regressors. See Section 1.3.1 for more on data frames.
• se.fit=TRUE to also request standard errors of the predictions
• interval="confidence" to also request confidence intervals (or for prediction intervals
interval="prediction", see below)
• level=0.99 to choose the 99% confidence interval instead of the default 95%. Of course,
arbitrary other values are possible.
• and more.
If the model formula contains some of the advanced features such as rescaling, quadratic terms and
interactions presented in Section 6.1, predict is clever enough to make the same sort of transfor-
mations for the predictions. Example 6.5 demonstrates some of the features.
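A minimal usage sketch with hypothetical variables y, x1, and x2 in a data frame mydata:

reg <- lm(y ~ x1 + x2, data=mydata)
cvalues <- data.frame(x1=c(1, 2), x2=c(0, 1))
predict(reg, cvalues, se.fit=TRUE, interval="confidence", level=0.99)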
Script 6.7 (Example-6-5.R) shows the implementation of the estimation and prediction. The estima-
tion results are stored as the variable reg. The values of the regressors for which we want to do the
prediction are stored in the new data frame cvalues. Then the command predict is called with these
two arguments. For an SAT score of 1200, a high school percentile of 30 and a high school size of 5
(i.e. 500 students), the predicted college GPA is 2.7. Wooldridge (2019) obtains the same value using
a general but more cumbersome regression approach. The 95% confidence interval is reported with
the next command. With 95% confidence we can say that the expected college GPA for students with
these features is between 2.66 and 2.74.
Finally, we define three types of students with different values of sat, hsperc, and hsize. The data frame cvalues is filled with these numbers and displayed as a table. For these three sets of regressor values, we obtain the 99% confidence intervals.
> reg
Call:
lm(formula = colgpa ~ sat + hsperc + hsize + I(hsize^2), data = gpa2)
Coefficients:
(Intercept) sat hsperc hsize I(hsize^2)
1.492652 0.001492 -0.013856 -0.060881 0.005460
> # Generate data set containing the regressor values for predictions
> cvalues <- data.frame(sat=1200, hsperc=30, hsize=5)
> cvalues
sat hsperc hsize
1 1200 30 5
2 900 20 3
3 1400 5 1
For a better visual understanding of the implications of our model, it is often useful to calculate
predictions for different values of one regressor of interest while keeping the other regressors fixed at
certain values like their overall sample means. By plotting the results against the regressor value, we
get a very intuitive graph showing the estimated ceteris paribus effects of the regressor.
We already know how to calculate predictions and their confidence intervals from Section 6.2.1.
Script 6.9 (Effects-Manual.R) repeats the regression from Example 6.2 and creates an effects plot
for the number of rooms manually. The number of rooms is varied between 4 and 8 and the other
variables are set to their respective sample means for all predictions. The regressor values and the
implied predictions are shown in a table and then plotted using matplot for automatically including
the confidence bands. The resulting graph is shown in Figure 6.1(a).
The package effects provides the convenient command effect. It creates the same kind of
plots we just generated, but it is more convenient to use and the result is nicely formatted. After
storing the regression results in variable res, Figure 6.1(b) is produced with the simple command
plot( effect("rooms",res) )
The full code including loading the data and running the regression is in Script 6.10
(Effects-Automatic.R). We see the minimum at a number of rooms of around 4.4. We
also see the observed values of rooms as ticks on the axis. Obviously nearly all observations are in
the area right of the minimum where the slope is positive.
Output of Script 6.9: Effects-Manual.R
> # Repeating the regression from Example 6.2:
> data(hprice2, package='wooldridge')
> # Plot
> matplot(X$rooms, pred, type="l", lty=c(1,2,2))
[Figure 6.1: predicted log(price) against rooms; (a) Manual Calculations (Script 6.9), (b) Automatic Calculations (Script 6.10)]
Call:
lm(formula = wage ~ female + educ + exper + tenure, data = wage1)
Coefficients:
(Intercept) female educ exper tenure
-1.5679 -1.8109 0.5715 0.0254 0.1410
> lm(log(wage)~married*female+educ+exper+I(exper^2)+tenure+I(tenure^2),
> data=wage1)
Call:
lm(formula = log(wage) ~ married * female + educ + exper + I(exper^2) +
tenure + I(tenure^2), data = wage1)
Coefficients:
(Intercept) married female educ
0.3213781 0.2126757 -0.1103502 0.0789103
exper I(exper^2) tenure I(tenure^2)
0.0268006 -0.0005352 0.0290875 -0.0005331
married:female
-0.3005931
Instead of transforming logical variables into dummies, they can be directly used as regressors.
The coefficient is then named varnameTRUE. Script 7.3 (Example-7-1-logical.R) repeats the
analysis of Example 7.1 with the regressor female being coded as a logical instead of a 0/1 dummy
variable.
Output of Script 7.3: Example-7-1-logical.R
> data(wage1, package='wooldridge')
> table(wage1$female)
FALSE TRUE
274 252
Call:
lm(formula = wage ~ female + educ + exper + tenure, data = wage1)
Coefficients:
(Intercept) femaleTRUE educ exper tenure
-1.5679 -1.8109 0.5715 0.0254 0.1410
In real-world data sets, qualitative information is often not readily coded as logical or dummy
variables, so we might want to create our own regressors. Suppose a qualitative variable OS takes
one of the four string values “Android”, “iOS”, “Windows”, or “other”. We can manually define
the three relevant logical variables with “Android” as the reference category with
iOS <- OS=="iOS"
wind <- OS=="Windows"
oth <- OS=="other"
The package dummies provides convenient functions to automatically generate dummy variables. But an even more convenient and elegant way to deal with qualitative variables in R is to use factor variables, discussed in the next section.
1 Remember that packages have to be installed once before we can use them. With an active internet connection, the command to automatically do this is install.packages("AER").
male female
289 245
> table(CPS1985$occupation)
Call:
lm(formula = log(wage) ~ education + experience + gender + occupation,
data = CPS1985)
Coefficients:
(Intercept) education experience
0.97629 0.07586 0.01188
genderfemale occupationtechnical occupationservices
-0.22385 0.14246 -0.21004
occupationoffice occupationsales occupationmanagement
-0.05477 -0.20757 0.15254
Call:
lm(formula = log(wage) ~ education + experience + gender + occupation,
data = CPS1985)
Coefficients:
(Intercept) education experience
0.90498 0.07586 0.01188
gendermale occupationworker occupationtechnical
0.22385 -0.15254 -0.01009
occupationservices occupationoffice occupationsales
-0.36259 -0.20731 -0.36011
> # Regression
> res <- lm(log(wage) ~ education+experience+gender+occupation, data=CPS1985)
Response: log(wage)
Sum Sq Df F value Pr(>F)
education 10.981 1 56.925 2.010e-13 ***
experience 9.695 1 50.261 4.365e-12 ***
gender 5.414 1 28.067 1.727e-07 ***
occupation 7.153 5 7.416 9.805e-07 ***
Residuals 101.269 525
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = log(salary) ~ rankcat + LSAT + GPA + log(libvol) +
log(cost), data = lawsch85)
Coefficients:
(Intercept) rankcat(0,10] rankcat(10,25] rankcat(25,40]
9.1652952 0.6995659 0.5935434 0.3750763
rankcat(40,60] rankcat(60,100] LSAT GPA
0.2628191 0.1315950 0.0056908 0.0137255
log(libvol) log(cost)
0.0363619 0.0008412
Response: log(salary)
Sum Sq Df F value Pr(>F)
rankcat 1.86887 5 50.9630 < 2e-16 ***
LSAT 0.02532 1 3.4519 0.06551 .
GPA 0.00025 1 0.0342 0.85353
log(libvol) 0.01433 1 1.9534 0.16467
log(cost) 0.00001 1 0.0011 0.97336
Residuals 0.92411 126
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Model with full interactions with female dummy (only for spring data)
> reg<-lm(cumgpa~female*(sat+hsperc+tothrs), data=gpa3, subset=(spring==1))
> summary(reg)
Call:
lm(formula = cumgpa ~ female * (sat + hsperc + tothrs), data = gpa3,
subset = (spring == 1))
Residuals:
Min 1Q Median 3Q Max
-1.51370 -0.28645 -0.02306 0.27555 1.24760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4808117 0.2073336 7.142 5.17e-12 ***
female -0.3534862 0.4105293 -0.861 0.38979
sat 0.0010516 0.0001811 5.807 1.40e-08 ***
hsperc -0.0084516 0.0013704 -6.167 1.88e-09 ***
tothrs 0.0023441 0.0008624 2.718 0.00688 **
female:sat 0.0007506 0.0003852 1.949 0.05211 .
female:hsperc -0.0005498 0.0031617 -0.174 0.86206
female:tothrs -0.0001158 0.0016277 -0.071 0.94331
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # F-Test from package "car". H0: the interaction coefficients are zero
> # matchCoefs(...) selects all coeffs with names containing "female"
> library(car)
Hypothesis:
female = 0
female:sat = 0
female:hsperc = 0
female:tothrs = 0
We can estimate the same model parameters by running two separate regressions, one for females and one for males, see Script 7.8 (Dummy-Interact-Sep.R). We see that in the joint model, the parameters without interactions ((Intercept), sat, hsperc, and tothrs) apply to the males, and the interaction parameters reflect the differences of the females relative to the males.
To reconstruct the parameters for females from the joint model, we need to add the two respective
parameters. The intercept for females is 1.4808117 − 0.3534862 = 1.127325 and the coefficient of sat
for females is 0.0010516 + 0.0007506 = 0.0018022.
Output of Script 7.8: Dummy-Interact-Sep.R
> data(gpa3, package='wooldridge')
Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs, data = gpa3, subset = (spring ==
1 & female == 0))
Coefficients:
(Intercept) sat hsperc tothrs
1.480812 0.001052 -0.008452 0.002344
Call:
lm(formula = cumgpa ~ sat + hsperc + tothrs, data = gpa3, subset = (spring ==
1 & female == 1))
Coefficients:
(Intercept) sat hsperc tothrs
1.127325 0.001802 -0.009001 0.002228
8. Heteroscedasticity
The homoscedasticity assumptions SLR.5 for the simple regression model and MLR.5 for the multiple
regression model require that the variance of the error terms is unrelated to the regressors, i.e.
Var(u | x₁, . . . , x_k) = σ².    (8.1)
Unbiasedness and consistency (Theorems 3.1, 5.1) do not depend on this assumption, but the sam-
pling distribution (Theorems 3.2, 4.1, 5.2) does. If homoscedasticity is violated, the standard errors
are invalid and all inferences from t, F and other tests based on them are unreliable. Also the
(asymptotic) efficiency of OLS (Theorems 3.4, 5.3) depends on homoscedasticity. Generally, ho-
moscedasticity is difficult to justify from theory. Different kinds of individuals might have different
amounts of unobserved influences in ways that depend on regressors.
We cover three topics: Section 8.1 shows how the formula of the estimated variance-covariance matrix can be adjusted so that it does not require homoscedasticity. In this way, we can use OLS to get unbiased
and consistent parameter estimates and draw inference from valid standard errors and tests. Section
8.2 presents tests for the existence of heteroscedasticity. Section 8.3 discusses weighted least squares
(WLS) as an alternative to OLS. This estimator can be more efficient in the presence of heteroscedas-
ticity.
1 The package sandwich provides the same functionality as hccm using the specification vcovHC and can be used more
flexibly for advanced analyses.
For a convenient regression table with coefficients, standard errors, t statistics and their p values
based on arbitrary variance-covariance matrices, the command coeftest from the package lmtest
is useful. In addition to the regression results reg, it expects either a readily calculated variance-
covariance matrix or the function (such as hccm) to calculate it. The syntax is
• coeftest(reg) for the default homoscedasticity-based standard errors
• coeftest(reg, vcov=hccm) for the refined version of White’s robust SE
• coeftest(reg, vcov=hccm(reg,type="hc0")) for the classical version of White’s ro-
bust SE. Other versions can be chosen accordingly.
For general F-tests, we have repeatedly used the command linearHypothesis from the package
car. The good news is that it also accepts alternative variance-covariance specifications and is also
compatible with hccm. To perform F tests of the joint hypothesis described in myH0 for an estimated
model reg, the syntax is2
• linearHypothesis(reg, myH0) for the default homoscedasticity-based covariance matrix
• linearHypothesis(reg, myH0, vcov=hccm) for the refined version of White’s robust
covariance matrix
• linearHypothesis(reg, myH0, vcov=hccm(reg,type="hc0")) for the classical ver-
sion of White’s robust covariance matrix. Again, other types can be chosen accordingly.
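As a compact usage sketch (assuming a fitted model reg, e.g. the one used in the output below, and the two hypotheses tested there):

library(lmtest); library(car)
coeftest(reg, vcov=hccm)                                   # regression table with robust SE
linearHypothesis(reg, c("black=0","white=0"), vcov=hccm)   # robust F test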
t test of coefficients:
t test of coefficients:
Hypothesis:
black = 0
white = 0
Hypothesis:
black = 0
white = 0
Res.Df Df F Pr(>F)
1 361
2 359 2 0.6725 0.5111
Hypothesis:
black = 0
white = 0
Res.Df Df F Pr(>F)
1 361
2 359 2 0.7478 0.4741
> reg
Call:
lm(formula = price ~ lotsize + sqrft + bdrms, data = hprice1)
Coefficients:
(Intercept) lotsize sqrft bdrms
-21.770308 0.002068 0.122778 13.852522
> bptest(reg)
data: reg
BP = 14.092, df = 3, p-value = 0.002782
Call:
lm(formula = resid(reg)^2 ~ lotsize + sqrft + bdrms, data = hprice1)
Residuals:
Min 1Q Median 3Q Max
-9044 -2212 -1256 -97 42582
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.523e+03 3.259e+03 -1.694 0.09390 .
lotsize 2.015e-01 7.101e-02 2.838 0.00569 **
sqrft 1.691e+00 1.464e+00 1.155 0.25128
bdrms 1.042e+03 9.964e+02 1.046 0.29877
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The White test is a variant of the BP test where, in the second stage, we do not regress the squared first-stage residuals on the original regressors only. Instead, we add interactions and polynomials of them, or we regress the squared residuals on the fitted values ŷ and ŷ². This can easily be done in a manual second-stage regression, remembering that the fitted values can be obtained with the fitted function.
Conveniently, we can also use the bptest command to do the calculations of the LM version of
the test including the p values automatically. All we have to do is to explain that in the second stage
we want a different set of regressors. Given the original regression results are stored as reg, this is
done by specifying
bptest(reg, ~ regressors)
In the “special form” of the White test, the regressors are the fitted values and their squares, so the command can be compactly written as
bptest(reg, ~ fitted(reg) + I(fitted(reg)^2))
Wooldridge, Example 8.5: BP and White test in the Log Housing Price Equation
Script 8.4 (Example-8-5.R) implements the BP and the White test for a model that now contains loga-
rithms of the dependent variable and two independent variables. The LM versions of both the BP and
the White test do not reject the null hypothesis at conventional significance levels with p values of 0.238
and 0.178, respectively.
> reg
Call:
lm(formula = log(price) ~ log(lotsize) + log(sqrft) + bdrms,
data = hprice1)
Coefficients:
(Intercept) log(lotsize) log(sqrft) bdrms
-1.29704 0.16797 0.70023 0.03696
> # BP test
> library(lmtest)
> bptest(reg)
data: reg
BP = 4.2232, df = 3, p-value = 0.2383
data: reg
BP = 3.4473, df = 2, p-value = 0.1784
Call:
lm(formula = nettfa ~ inc + I((age - 25)^2) + male + e401k, data = k401ksubs,
subset = (fsize == 1))
Coefficients:
(Intercept) inc I((age - 25)^2) male
-20.98499 0.77058 0.02513 2.47793
e401k
6.88622
> # WLS
> lm(nettfa ~ inc + I((age-25)^2) + male + e401k, weight=1/inc,
> data=k401ksubs, subset=(fsize==1))
Call:
lm(formula = nettfa ~ inc + I((age - 25)^2) + male + e401k, data = k401ksubs,
subset = (fsize == 1), weights = 1/inc)
Coefficients:
(Intercept) inc I((age - 25)^2) male
-16.70252 0.74038 0.01754 1.84053
e401k
5.18828
We can also use heteroscedasticity-robust statistics from Section 8.1 to account for the fact that our
variance function might be misspecified. Script 8.6 (WLS-Robust.R) repeats the WLS estimation
of Example 8.6 but reports non-robust and robust standard errors and t statistics. It replicates
Wooldridge (2019, Table 8.2) with the only difference that we use a refined version of the robust
SE formula. There is nothing special about the implementation. The fact that we used weights is
correctly accounted for in the following calculations.
Output of Script 8.6: WLS-Robust.R
> data(k401ksubs, package='wooldridge')
> # WLS
> wlsreg <- lm(nettfa ~ inc + I((age-25)^2) + male + e401k,
> weight=1/inc, data=k401ksubs, subset=(fsize==1))
> coeftest(wlsreg)
t test of coefficients:
t test of coefficients:
The assumption made in Example 8.6 that the variance is proportional to a regressor is usually
hard to justify. Typically, we don't know the variance function and have to estimate it. This
feasible GLS (FGLS) estimator replaces the (allegedly) known variance function with an estimated
one.
We can estimate the relation between variance and regressors using a linear regression with the log of the squared residuals from an initial OLS regression, log(û²), as the dependent variable. Wooldridge (2019, Section 8.4) suggests two versions for the selection of regressors:
• the regressors x1 , . . . , xk from the original model similar to the BP test
• ŷ and ŷ2 from the original model similar to the White test
As the estimated error variance, we can use $\exp\big(\widehat{\log(\hat{u}^2)}\big)$, i.e. the exponentiated fitted values from this regression. Its inverse can then be used as a weight in WLS estimation.
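A sketch of these FGLS steps for the smoking example shown in the output below (the names logu2 and w match the variables used there):

data(smoke, package='wooldridge')
olsreg <- lm(cigs ~ log(income)+log(cigpric)+educ+age+I(age^2)+restaurn, data=smoke)
logu2  <- log(resid(olsreg)^2)                 # log of squared OLS residuals
varreg <- lm(logu2 ~ log(income)+log(cigpric)+educ+age+I(age^2)+restaurn, data=smoke)
w <- 1/exp(fitted(varreg))                     # FGLS weights
lm(cigs ~ log(income)+log(cigpric)+educ+age+I(age^2)+restaurn, weight=w, data=smoke)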
> # OLS
> olsreg<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,
> data=smoke)
> olsreg
Call:
lm(formula = cigs ~ log(income) + log(cigpric) + educ + age +
I(age^2) + restaurn, data = smoke)
Coefficients:
(Intercept) log(income) log(cigpric) educ age
-3.639826 0.880268 -0.750862 -0.501498 0.770694
I(age^2) restaurn
-0.009023 -2.825085
> # BP test
> library(lmtest)
> bptest(olsreg)
data: olsreg
BP = 32.258, df = 6, p-value = 1.456e-05
> varreg<-lm(logu2~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,
> data=smoke)
> lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,
> weight=w ,data=smoke)
Call:
lm(formula = cigs ~ log(income) + log(cigpric) + educ + age +
I(age^2) + restaurn, data = smoke, weights = w)
Coefficients:
(Intercept) log(income) log(cigpric) educ age
5.635463 1.295239 -2.940312 -0.463446 0.481948
I(age^2) restaurn
-0.005627 -3.461064
9. More on Specification and Data Issues
This chapter covers different topics of model specification and data problems. Section 9.1 asks how statistical tests can help us specify the “correct” functional form given the numerous options we have seen in Chapters 6 and 7. Section 9.2 shows some simulation results regarding the effects of measurement errors in dependent and independent variables. Section 9.3 covers missing values and how R deals with them. In Section 9.4, we briefly discuss outliers, and in Section 9.5, the LAD estimator is presented.
> RESETreg
Call:
lm(formula = price ~ lotsize + sqrft + bdrms + I(fitted(orig)^2) +
I(fitted(orig)^3), data = hprice1)
Coefficients:
(Intercept) lotsize sqrft
1.661e+02 1.537e-04 1.760e-02
bdrms I(fitted(orig)^2) I(fitted(orig)^3)
2.175e+00 3.534e-04 1.546e-06
Hypothesis:
I(fitted(orig)^2) = 0
I(fitted(orig)^3) = 0
> resettest(orig)
RESET test
data: orig
RESET = 4.6682, df1 = 2, df2 = 82, p-value = 0.01202
Wooldridge (2019, Section 9.1-b) also discusses tests of non-nested models. As an example, a test
of both models against a comprehensive model containing all the regressors is mentioned. Such
a test can conveniently be implemented in R using the command encomptest from the package
lmtest. Script 9.3 (Nonnested-Test.R) shows this test in action for a modified version of the
Example 9.2.
The two alternative models for the housing price are
The output shows the “encompassing model” E with all variables. Both models are rejected against
this comprehensive model.
y* = β₀ + β₁x + u,    y = y* + e₀.    (9.3)

The assumption is that we do not observe the true values of the dependent variable y*, but our measure y is contaminated with a measurement error e₀.
In the simulation, the parameter estimates using both the correct y∗ and the mismeasured y are
stored as the variables b1hat and b1hat.me, respectively. As expected, the simulated mean of both
variables is close to the expected value of β 1 = 0.5. The variance of b1hat.me is around 0.002 which
is twice as high as the variance of b1hat. This was expected since in our simulation, u and e0 are
both independent standard normal variables, so Var(u) = 1 and Var(u + e0 ) = 2:
If an explanatory variable is mismeasured, the consequences are usually more dramatic. Even in
the classical errors-in-variables case where the measurement error is unrelated to the regressors, the
parameter estimates are biased and inconsistent. This model is
y = β₀ + β₁x* + u,    x = x* + e₁    (9.4)
where the measurement error e₁ is independent of both x* and u. Wooldridge (2019, Section 9.4) shows that if we regress y on x instead of x*,

$$\operatorname{plim} \hat{\beta}_1 = \beta_1 \cdot \frac{\operatorname{Var}(x^*)}{\operatorname{Var}(x^*) + \operatorname{Var}(e_1)}. \qquad (9.5)$$
The simulation in Script 9.5 (Sim-ME-Explan.R) draws 10 000 samples of size n = 1 000 from this
model.
Since in this simulation, Var(x*) = Var(e₁) = 1, equation 9.5 implies that plim β̂₁ = ½ · β₁ = 0.25.
This is confirmed by the simulation results. While the mean of the estimate b1hat using the correct
regressor again is around 0.5, the mean parameter estimate using the mismeasured regressor is about
0.25:
> # Mean with and without ME
> c( mean(b1hat), mean(b1hat.me) )
[1] 0.5003774 0.2490821
> data.frame(x,logx,invx,ncdf,isna)
x logx invx ncdf isna
1 -1 NaN -1 0.1586553 FALSE
2 0 -Inf Inf 0.5000000 FALSE
3 1 0 1 0.8413447 FALSE
4 NA NA NA NA TRUE
5 NaN NaN NaN NaN TRUE
6 -Inf NaN 0 0.0000000 FALSE
7 Inf Inf 0 1.0000000 FALSE
Depending on the data source, real-world data sets can have different rules for indicating missing
information. Sometimes, impossible numeric values are used. For example, a survey including the
number of years of education as a variable educ might have a value like “9999” to indicate missing
information. For any software package, it is highly recommended to change these to proper missing-
value codes early in the data-handling process. Otherwise, we take the risk that some statistical
method interprets those values as “this person went to school for 9999 years” producing highly
nonsensical results. For the education example, if the variable educ is in the data frame mydata this
can be done with
mydata$educ[mydata$educ==9999] <- NA
We can also create logical variables indicating missing values using the function is.na(variable). It will generate a new logical vector of the same length which is TRUE whenever variable is either NA or NaN. The function can also be used on data frames: is.na(mydata) returns a logical matrix with the same dimensions and variable names, indicating for each cell whether the value is missing. It is useful to count the missing values for each variable in a data frame with
colSums(is.na(mydata))
The function complete.cases(mydata) generates one logical vector indicating the rows of the
data frame that don’t have any missing information.
Script 9.7 (Missings.R) demonstrates these commands for the data set LAWSCH85.dta which contains data on law schools. Of the 156 schools, 6 do not report median LSAT scores. Looking at all variables, the most missing values are found for the age of the school – we don't know it for 45 schools. For only 90 of the 156 schools, we have the full set of variables; for the other 66, one or more variables are missing.
Output of Script 9.7: Missings.R
> data(lawsch85, package='wooldridge')
> table(compl)
compl
FALSE TRUE
66 90
The question of how to deal with missing values is not trivial and depends on many things. R offers different strategies. The strictest approach is used by default for basic statistical functions such as
mean. If we don’t know all the numbers, we cannot calculate their average. So by default, mean and
other commands return NA if at least one value is missing.
In many cases, this is overly pedantic. A widely used strategy is to simply remove the observations
with missing values and do the calculations for the remaining ones. For commands like mean, this
is requested with the option na.rm=TRUE. Regression commands like lm do this by default. If
observations are excluded due to missing values, the summary of the results contain a line stating
(XXX observations deleted due to missingness)
Script 9.8 (Missings-Analyses.R) gives examples of these features. There are more advanced
methods for dealing with missing data implemented in R, for example package mi provides multiple
imputation algorithms. But these methods are beyond the scope of this book.
Output of Script 9.8: Missings-Analyses.R
> data(lawsch85, package='wooldridge')
> mean(lawsch85$LSAT,na.rm=TRUE)
[1] 158.2933
Call:
lm(formula = log(salary) ~ LSAT + cost + age, data = lawsch85)
Residuals:
Min 1Q Median 3Q Max
-0.40989 -0.09438 0.00317 0.10436 0.45483
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.384e+00 6.781e-01 6.465 4.94e-09 ***
LSAT 3.722e-02 4.501e-03 8.269 1.06e-12 ***
cost 1.114e-05 4.321e-06 2.577 0.011563 *
age 1.503e-03 4.354e-04 3.453 0.000843 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Regression
> reg <- lm(rdintens~sales+profmarg, data=rdchem)
> max(studres)
[1] 4.555033
[Figure: histogram of the studentized residuals (studres)]
=================================================
Dependent variable:
-----------------------------
rdintens
OLS quantile
regression
(1) (2)
-------------------------------------------------
I(sales/1000) 0.053 0.019
(0.044) (0.059)
-------------------------------------------------
Observations 32 32
R2 0.076
Adjusted R2 0.012
Residual Std. Error 1.862 (df = 29)
F Statistic 1.195 (df = 2; 29)
=================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Part II. Regression Analysis with Time Series Data
Call:
lm(formula = i3 ~ inf + def, data = intdef)
Residuals:
Min 1Q Median 3Q Max
-3.9948 -1.1694 0.1959 0.9602 4.7224
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73327 0.43197 4.012 0.00019 ***
inf 0.60587 0.08213 7.376 1.12e-09 ***
def 0.51306 0.11838 4.334 6.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The options of this command describe the time structure of the data. The most important ones are
• start: Time of first observation. Examples:
– start=1: Time units are numbered starting at 1 (the default if left out).
Figure 10.1. Time series plot: Imports of barium chloride from China
[impts plotted against Time]
Once we have defined this time series object, we can conveniently do additional analyses. A time
series plot is simply generated with
plot(impts)
and is shown in Figure 10.1. The time axis is automatically formatted appropriately. The full R Script
10.2 (Example-Barium.R) for these calculations is shown in the appendix on page 338.
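A sketch of how such a ts object might be defined (assuming the barium data set from the wooldridge package with monthly data starting in February 1978):

data(barium, package='wooldridge')
impts <- ts(barium$chnimp, start=c(1978, 2), frequency=12)
plot(impts)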
[Figure 10.2: time series plot of the interest rate zoodata$i3]
The zoo objects are very useful for both regular and irregular time series. Because the data are
not necessarily equispaced, each observation needs a time stamp provided in another vector. They
can be measured in arbitrary time units such as years. For high frequency data, standard units such
as the POSIX system are useful for pretty graphs and other outputs. Details are provided by Zeileis
and Grothendieck (2005) and Ryan and Ulrich (2008).
We have already used the data set INTDEF.dta in example 10.2. It contains yearly data on interest
rates and related time series. In Script 10.3 (Example-zoo.R), we define a zoo object containing
all data using the variable year as the time measure. Simply plotting the variable i3 gives the time
series plot shown in Figure 10.2.
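A sketch of how such a zoo object might be defined:

library(zoo)
zoodata <- zoo(intdef, order.by=intdef$year)   # time stamps taken from the variable year
plot(zoodata$i3)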
Output of Script 10.3: Example-zoo.R
> data(intdef, package='wooldridge')
Daily financial data sets are important examples of irregular time series. Because of weekends and
bank holidays, these data are not equispaced and each data point contains a time stamp - usually the
date. To demonstrate this, we will briefly look at the package quantmod which implements various
tools for financial modelling.1 It can also automatically download financial data from Yahoo Finance
and other sources. In order to do so, we must know the ticker symbol of the stock or whatever we
are interested in. It can be looked up at
http://finance.yahoo.com/lookup
For example, the symbol for the Dow Jones Industrial Average is ^DJI, Apple stocks have the
symbol AAPL, and the Ford Motor Company is simply abbreviated as F. The package quantmod can then, for example, automatically download daily data on the Ford stock using
getSymbols("F", auto.assign=TRUE)
The results are automatically assigned to an xts object named after the symbol F. It includes infor-
mation on opening, closing, high, and low prices as well as the trading volume and the adjusted (for
events like stock splits and dividend payments) closing prices. We demonstrate this with the Ford
stocks in Script 10.4 (Example-quantmod.R). We download the data, print the first and last 6 rows
of data, and plot the adjusted closing prices over time.
> tail(F)
F.Open F.High F.Low F.Close F.Volume F.Adjusted
2020-05-08 4.96 5.25 4.95 5.24 101333800 5.24
2020-05-11 5.18 5.19 5.05 5.12 75593900 5.12
2020-05-12 5.15 5.22 4.97 4.98 70965200 4.98
2020-05-13 5.00 5.01 4.66 4.72 100192300 4.72
2020-05-14 4.64 4.92 4.52 4.89 108061100 4.89
2020-05-15 4.80 4.94 4.75 4.90 80502100 4.90
1 See http://www.quantmod.com for more details on the tools and the package.
Figure 10.3. Time series plot: Stock prices of Ford Motor Company (F$F.Adjusted, 2007-01-03 to 2020-05-15)
Wooldridge (2019, Section 10.2) discusses the specification and interpretation of such models. For
the implementation, it is convenient not to have to generate the q additional variables that reflect the
lagged values zt−1 , . . . , zt−q but directly specify them in the model formula using dynlm instead of
lm.
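As a sketch, a finite distributed lag model with two lags could be specified as follows (variable names from the fertility example below; further regressors are omitted, and tsdata is assumed to be the ts version of the data set):
library(dynlm)
# FDL model: effect of pe and its first two lags on gfr (a sketch)
res <- dynlm(gfr ~ pe + L(pe) + L(pe, 2), data=tsdata)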
> coeftest(res)
t test of coefficients:
Hypothesis:
pe = 0
L(pe) = 0
L(pe, 2) = 0
The long-run propensity (LRP) of FDL models measures the cumulative effect of a change in the
independent variable z on the dependent variable y over time and is simply equal to the sum of the
respective parameters
LRP = δ0 + δ1 + · · · + δq .
We can estimate it directly from the estimated parameter vector coef(). For testing whether it is
different from zero, we can again use the convenient linearHypothesis command.
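A sketch of both calculations for the example with two lags (object names as above):
# long-run propensity: sum of the estimated lag coefficients
b <- coef(res)
b["pe"] + b["L(pe)"] + b["L(pe, 2)"]
# F test of H0: LRP = 0
library(car)
linearHypothesis(res, "pe + L(pe) + L(pe, 2) = 0")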
Hypothesis:
pe + L(pe) + L(pe, 2) = 0
10.3.3. Trends
As pointed out by Wooldridge (2019, Section 10.5), deterministic linear (and exponential) time trends
can be accounted for by adding the time measure as another independent variable. In a regression
with dynlm, this can easily be done using the expression trend(tsobj) in the model formula with
the time series object tsobj.
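For the housing investment example reported below, the two columns could be produced by calls like these (a sketch; tsdata is assumed to be the ts version of the data):
res1 <- dynlm(log(invpc) ~ log(price), data=tsdata)
res2 <- dynlm(log(invpc) ~ log(price) + trend(tsdata), data=tsdata)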
=================================================================
Dependent variable:
---------------------------------------------
log(invpc)
(1) (2)
-----------------------------------------------------------------
log(price) 1.241*** -0.381
(0.382) (0.679)
trend(tsdata) 0.010***
(0.004)
-----------------------------------------------------------------
Observations 42 42
R2 0.208 0.341
Adjusted R2 0.189 0.307
Residual Std. Error 0.155 (df = 40) 0.144 (df = 39)
F Statistic 10.530*** (df = 1; 40) 10.080*** (df = 2; 39)
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
10.3.4. Seasonality
To account for seasonal effects, we can add dummy variables for all but one (the reference) “season”.
So with monthly data, we can include eleven dummies, see Chapter 7 for a detailed discussion.
The command dynlm automatically creates and adds the appropriate dummies when using the
expression season(tsobj) in the model formula with the time series object tsobj.
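A sketch for the barium imports example whose regressors appear in the output below (tsdata again denotes the ts version of the data):
res <- dynlm(log(chnimp) ~ log(chempi) + log(gas) + log(rtwex) +
             befile6 + affile6 + afdec6 + season(tsdata), data=tsdata)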
> coeftest(res)
t test of coefficients:
========================================================================
Dependent variable:
-----------------------------------------------------------
return
(1) (2) (3)
------------------------------------------------------------------------
L(return) 0.059 0.060 0.061
(0.038) (0.038) (0.038)
L(return, 3) 0.031
(0.038)
------------------------------------------------------------------------
Observations 689 688 687
R2 0.003 0.005 0.006
Adjusted R2 0.002 0.002 0.001
F Statistic 2.399 (df = 1; 687) 1.659 (df = 2; 685) 1.322 (df = 3; 683)
========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
We can do a similar analysis for daily data. The getSymbols command from the package
quantmod introduced in Section 10.2.2 allows us to directly download daily stock prices from Yahoo
Finance. Script 11.2 (Example-EffMkts.R) downloads daily stock prices of Apple (ticker symbol
AAPL) and stores them as a xts object. From the prices pt , daily returns rt are calculated using the
standard formula rt = log(pt) − log(pt−1) ≈ (pt − pt−1)/pt−1.
Note that in the script, we calculate the difference using the function diff. It calculates the difference
from trading day to trading day, ignoring the fact that some of them are separated by weekends or
holidays. Figure 11.1 plots the returns of the Apple stock. Even though we now have n = 2266
observations of daily returns, we cannot find any relation between current and past returns which
supports (this version of) the efficient markets hypothesis.
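A sketch of the return calculation in Script 11.2 (ticker and column names as provided by quantmod):
library(quantmod)
getSymbols("AAPL", auto.assign=TRUE)
# daily log returns; diff simply works from one trading day to the next
ret <- diff( log(AAPL$AAPL.Adjusted) )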
===========================================================================
Dependent variable:
--------------------------------------------------------------
ret
(1) (2) (3)
---------------------------------------------------------------------------
L(ret) -0.003 -0.004 -0.003
(0.021) (0.021) (0.021)
L(ret, 3) 0.005
(0.021)
---------------------------------------------------------------------------
Observations 2,266 2,265 2,264
R2 0.00001 0.001 0.001
Adjusted R2 -0.0004 -0.00004 -0.0004
F Statistic 0.027 (df = 1; 2264) 0.955 (df = 2; 2262) 0.728 (df = 3; 2260)
===========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Figure 11.1. Time series plot: Daily stock returns 2008–2016, Apple Inc.
yt = yt−1 + et (11.1)
   = y0 + e1 + e2 + · · · + et−1 + et (11.2)
where the shocks e1, . . . , et are i.i.d. with a zero mean. It is a special case of a unit root process.
Random walk processes are strongly dependent and nonstationary, violating assumption TS1’ required for the consistency of OLS parameter estimates. As Wooldridge (2019, Section 11.3) shows, the variance of yt (conditional on y0) increases linearly with t: Var(yt |y0) = σe² · t.
This can be easily seen in a simulation exercise. Script 11.3 (Simulate-RandomWalk.R) draws
30 realizations from a random walk process with i.i.d. standard normal shocks et . After initializing
the random number generator, an empty figure with the right dimensions is produced. Then, the
realizations of the time series are drawn in a loop.1 In each of the 30 draws, we first obtain a sample
of the n = 50 shocks e1 , . . . , e50 . The random walk is generated as the cumulative sum of the shocks
according to Equation 11.2 with an initial value of y0 = 0. The respective time series are then added
to the plot. In the resulting Figure 11.2, the increasing variance can be seen easily.
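A minimal sketch of such a simulation (the seed value is arbitrary and Script 11.3 may differ in details; adding a constant α0 inside cumsum gives the drift variant of Script 11.4):
set.seed(348546)
# empty plot with appropriate dimensions
plot(c(0,50), c(-18,18), type="n", xlab="t", ylab="y")
for (r in 1:30) {
  e <- rnorm(50)     # 50 i.i.d. standard normal shocks
  y <- cumsum(e)     # random walk with y0 = 0
  lines(y, col=gray(0.6))
}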
[Figure 11.2: 30 simulated random walk processes]
yt = α0 + yt−1 + et (11.4)
   = y0 + α0 · t + e1 + e2 + · · · + et−1 + et (11.5)
Script 11.4 (Simulate-RandomWalkDrift.R) simulates such a process with α0 = 2 and i.i.d. stan-
dard normal shocks et . The resulting time series are plotted in Figure 11.3. The values fluctuate
around the expected value α0 · t. But unlike weakly dependent processes, they do not tend towards
their mean, so the variance increases like for a simple random walk process.
[Figure 11.3: simulated random walk processes with drift α0 = 2]
An obvious question is whether a given sample is from a unit root process such as a random walk.
We will cover tests for unit roots in Section 18.2.
Figure 11.4. Simulations of a random walk process with drift: first differences
yt = α0 + yt−1 + et (11.6)
∆yt = yt − yt−1 = α0 + et (11.7)
============================================================
Dependent variable:
----------------------------------------
d(gfr)
(1) (2)
------------------------------------------------------------
d(pe) -0.043 -0.036
(0.028) (0.027)
L(d(pe)) -0.014
(0.028)
L(d(pe), 2) 0.110***
(0.027)
------------------------------------------------------------
Observations 71 69
R2 0.032 0.232
Adjusted R2 0.018 0.197
Residual Std. Error 4.221 (df = 69) 3.859 (df = 65)
F Statistic 2.263 (df = 1; 69) 6.563*** (df = 3; 65)
============================================================
Note: *p<0.1; **p<0.05; ***p<0.01
12. Serial Correlation and Heteroscedasticity in Time Series Regressions
In Chapter 8, we discussed the consequences of heteroscedasticity in cross-sectional regressions. In
the time series setting, similar consequences and strategies apply to both heteroscedasticity (with
some specific features) and serial correlation of the error term. Unbiasedness and consistency of the
OLS estimators are unaffected. But the OLS estimators are inefficient and the usual standard errors
and inferences are invalid.
We first discuss how to test for serial correlation in Section 12.1. Section 12.2 introduces efficient
estimation using feasible GLS estimators. As an alternative, we can still use OLS and calculate stan-
dard errors that are valid under both heteroscedasticity and autocorrelation as discussed in Section
12.3. Finally, Section 12.4 covers heteroscedasticity and autoregressive conditional heteroscedasticity
(ARCH) models.
Suppose we want to test whether the error terms of a regression model are serially correlated. A straightforward and intuitive testing approach is described by Wooldridge
(2019, Section 12.3). It is based on the fitted residuals ût = yt − β̂ 0 − β̂ 1 xt1 − · · · − β̂ k xtk which can
be obtained in R with the function resid, see Section 2.2.
To test for AR(1) serial correlation under strict exogeneity, we regress ût on their lagged values
ût−1 . If the regressors are not necessarily strictly exogenous, we can adjust the test by adding the
original regressors xt1 , . . . , xtk to this regression. Then we perform the usual t test on the coefficient
of ût−1 .
For testing for higher order serial correlation, we add higher order lags ût−2 , ût−3 , . . . as explana-
tory variables and test the joint hypothesis that they are all equal to zero using either an F test or a
Lagrange multiplier (LM) test. Especially the latter version is often called Breusch-Godfrey test.
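A sketch of this manual test for a regression res estimated with dynlm on a ts object (dynlm and lmtest are assumed to be loaded):
residual <- resid(res)
# t test on the coefficient of the lagged residual
coeftest( dynlm(residual ~ L(residual)) )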
inft = β0 + β1 unemt + ut
∆inft = β0 + β1 unemt + ut .
Script 12.1 (Example-12-2.R) shows the analyses. After the estimation, the residuals are calculated
with resid and regressed on their lagged values. We report standard errors and t statistics using the
coeftest command. While there is strong evidence for autocorrelation in the static equation with a t
statistic of 4.93, the null hypothesis of no autocorrelation cannot be rejected in the second model with
a t statistic of −0.29.
t test of coefficients:
t test of coefficients:
This class of tests can also be performed automatically using the command bgtest from the package lmtest. Given the regression results are stored in a variable res, the LM version of the test for AR(1) serial correlation can simply be run with
bgtest(res)
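Higher-order serial correlation and the F version of the test can be requested with the order and type arguments of bgtest, for example
bgtest(res, order=3, type="F")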
> linearHypothesis(resreg,
> c("L(residual)","L(residual, 2)","L(residual, 3)"))
Linear hypothesis test
Hypothesis:
L(residual) = 0
L(residual, 2) = 0
L(residual, 3) = 0
data: reg
LM test = 5.1247, df1 = 3, df2 = 121, p-value = 0.002264
Another popular test is the Durbin-Watson test for AR(1) serial correlation. While the test statistic
is pretty straightforward to compute, its distribution is non-standard and depends on the data.
Package lmtest offers the command dwtest. It is convenient because it reports p values which can
be interpreted in the standard way (given the necessary CLM assumptions hold).
Script 12.3 (Example-DWtest.R) repeats Example 12.2 but conducts DW tests instead of the t
tests. The conclusions are the same: For the static model, the null hypothesis of no serial correlation is clearly rejected with a test statistic of DW = 0.8027 and p < 10−6. For the expectations-augmented Phillips curve, the null hypothesis is not rejected at usual significance levels (DW = 1.7696, p = 0.1783).
Output of Script 12.3: Example-DWtest.R
> library(dynlm);library(lmtest)
> # DW tests
> dwtest(reg.s)
Durbin-Watson test
data: reg.s
DW = 0.8027, p-value = 7.552e-07
alternative hypothesis: true autocorrelation is greater than 0
> dwtest(reg.ea)
Durbin-Watson test
data: reg.ea
DW = 1.7696, p-value = 0.1783
alternative hypothesis: true autocorrelation is greater than 0
Call:
dynlm(formula = log(chnimp) ~ log(chempi) + log(gas) + log(rtwex) +
befile6 + affile6 + afdec6, data = tsdata)
number of interaction: 8
rho 0.293362
Durbin-Watson statistic
(original): 1.45841 , p-value: 1.688e-04
(transformed): 2.06330 , p-value: 4.91e-01
coefficients:
(Intercept) log(chempi) log(gas) log(rtwex) befile6
-37.322241 2.947434 1.054858 1.136918 -0.016372
affile6 afdec6
-0.033082 -0.577158
t test of coefficients:
t test of coefficients:
As the equation suggests, we can estimate α0 and α1 by an OLS regression of the squared residuals û²t on û²t−1.
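A sketch of this auxiliary regression for the residuals of some dynlm regression reg (object names match those of Script 12.7 below):
residual.sq <- resid(reg)^2
ARCHreg <- dynlm(residual.sq ~ L(residual.sq))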
> coeftest(ARCHreg)
t test of coefficients:
As a second example, let us reconsider the daily stock returns from Script 11.2
(Example-EffMkts.R). We again download the daily Apple stock prices from Yahoo Finance and
calculate their returns. Figure 11.1 on page 200 plots them. They show a very typical pattern for
an ARCH-type of model: there are periods with high (such as fall 2008) and other periods with
low volatility (fall 2010). In Script 12.7 (Example-ARCH.R), we estimate an AR(1) process for the
squared residuals. The t statistic is larger than 8, so there is very strong evidence for autoregressive
conditional heteroscedasticity.
Output of Script 12.7: Example-ARCH.R
> library(zoo);library(quantmod);library(dynlm);library(stargazer)
> summary(ARCHreg)
Call:
dynlm(formula = residual.sq ~ L(residual.sq))
Residuals:
Min 1Q Median 3Q Max
-0.002745 -0.000346 -0.000280 -0.000045 0.038809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.453e-04 2.841e-05 12.155 <2e-16 ***
L(residual.sq) 1.722e-01 2.071e-02 8.318 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are many generalizations of ARCH models. The packages tseries and rugarch provide
automated maximum likelihood estimation for many models of this class.
Part III.
Advanced Topics
13. Pooling Cross-Sections Across Time: Simple Panel Data Methods
Pooled cross sections consist of random samples from the same population at different points in
time. Section 13.1 introduces this type of data set and how to use it for estimating changes over
time. Section 13.2 covers difference-in-differences estimators, an important application of pooled
cross-sections for identifying causal effects.
Panel data resemble pooled cross sectional data in that we have observations at different points in
time. The key difference is that we observe the same cross-sectional units, for example individuals
or firms. Panel data methods require the data to be organized in a systematic way, as discussed
in Section 13.3. This allows specific calculations used for panel data analyses that are presented in
Section 13.4. Section 13.5 introduces the first panel data method, first differenced estimation.
Note that we divide exper² by 100 and thereby multiply β3 by 100 compared to the results reported in Wooldridge (2019). The parameter β1 measures the return to education in 1978 and δ1 is the difference of the return to education in 1985 relative to 1978. Likewise, β5 is the gender wage gap in 1978 and δ5 is the change of the wage gap.
Script 13.1 (Example-13-2.R) estimates the model. The return to education is estimated to have increased by δ̂1 = 0.018 and the gender wage gap decreased in absolute value from β̂5 = −0.317 to β̂5 + δ̂5 = −0.232, even though this change is only marginally significant. The interpretation and implementation of interactions were covered in more detail in Section 6.1.6.
Call:
lm(formula = lwage ~ y85 * (educ + female) + exper + I((exper^2)/100) +
union, data = cps78_85)
Residuals:
Min 1Q Median 3Q Max
-2.56098 -0.25828 0.00864 0.26571 2.11669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.458933 0.093449 4.911 1.05e-06 ***
y85 0.117806 0.123782 0.952 0.3415
educ 0.074721 0.006676 11.192 < 2e-16 ***
female -0.316709 0.036621 -8.648 < 2e-16 ***
exper 0.029584 0.003567 8.293 3.27e-16 ***
I((exper^2)/100) -0.039943 0.007754 -5.151 3.08e-07 ***
union 0.202132 0.030294 6.672 4.03e-11 ***
y85:educ 0.018461 0.009354 1.974 0.0487 *
y85:female 0.085052 0.051309 1.658 0.0977 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
13.2. Difference-in-Differences
Wooldridge (2019, Section 13.2) discusses an important type of applications for pooled cross-sections.
Difference-in-differences (DiD) estimators estimate the effect of a policy intervention (in the broadest
sense) by comparing the change over time of an outcome of interest between an affected and an
unaffected group of observations.
In a regression framework, we regress the outcome of interest on a dummy variable for the affected
(“treatment”) group, a dummy indicating observations after the treatment and an interaction term
between both. The coefficient of this interaction term can then be a good estimator for the effect of
interest, controlling for initial differences between the groups and contemporaneous changes over
time.
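For the garbage incinerator example below, a minimal DiD regression could look like this (a sketch; y81 is assumed to mark observations from 1981 and nearinc the houses near the incinerator):
DiD <- lm(log(rprice) ~ nearinc*y81, data=kielmc)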
> # Separate regressions for 1978 and 1981: report coefficients only
> coef( lm(rprice~nearinc, data=kielmc, subset=(year==1978)) )
(Intercept) nearinc
82517.23 -18824.37
t test of coefficients:
> library(stargazer)
> stargazer(DiD,DiDcontr,type="text")
====================================================================
Dependent variable:
------------------------------------------------
log(rprice)
(1) (2)
--------------------------------------------------------------------
nearinc -0.340*** 0.032
(0.055) (0.047)
age -0.008***
(0.001)
I(age2) 0.00004***
(0.00001)
log(intst) -0.061*
(0.032)
log(land) 0.100***
(0.024)
log(area) 0.351***
(0.051)
rooms 0.047***
(0.017)
baths 0.094***
(0.028)
--------------------------------------------------------------------
Observations 321 321
R2 0.246 0.733
Adjusted R2 0.239 0.724
Residual Std. Error 0.338 (df = 317) 0.204 (df = 310)
F Statistic 34.470*** (df = 3; 317) 84.915*** (df = 10; 310)
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
where the double subscript now indicates values for individual (or other cross-sectional unit) i at time
t. We could estimate this model by OLS, essentially ignoring the panel structure. But at least the
assumption that the error terms are unrelated is very hard to justify since they contain unobserved
individual traits that are likely to be constant or at least correlated over time. Therefore, we need
specific methods for panel data.
For the calculations used by panel data methods, we have to make sure that the data set is sys-
tematically organized and the estimation routines understand its structure. Usually, a panel data set
comes in a “long” form where each row of data corresponds to one combination of i and t. We have
to define which observations belong together by introducing an index variable for the cross-sectional
units i and preferably also the time index t.
The package plm (for panel linear models) is a comprehensive collection of commands deal-
ing with panel data. Similar to specific data types for time series, it offers a data type named
pdata.frame. It essentially corresponds to a standard data.frame but has additional attributes
that describe the individual and time dimensions. Suppose we have our data in a standard data
frame named mydf. It includes a variable ivar indicating the cross-sectional units and a variable
tvar indicating the time. Then we can create a panel data frame with the command
mypdf <- pdata.frame( mydf, index=c("ivar","tvar") )
If we have a balanced panel (i.e. the same number of observations T for each “individual” i =
1, . . . , n) and the observations are first sorted by i and then by t, we can alternatively call
mypdf <- pdata.frame( mydf, index=n )
In this case, the new variables id and time are generated as the index variables.
Once we have defined our data set, we can check the dimensions with pdim(mypdf). It will
report whether the panel is balanced, the number of cross-sectional units n, the number of time units
T, and the total number of observations N (which is n · T in balanced panels).
Let’s apply this to the data set CRIME2.dta discussed by Wooldridge (2019, Section 13.3). It is a
balanced panel of 46 cities, properly sorted. Script 13.4 (PDataFrame.R) imports the data set. We
define our new panel data frame crime2.p and check its dimensions. Apparently, R understood us
correctly and reports a balanced panel with two observations on 46 cities each. We also display the
first six rows of data for the new id and time index variables and other selected variables. Now
we’re ready to work with this data set.
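The definition in Script 13.4 presumably looks like this (a sketch, using the integer index variant since the data set contains no explicit index variables):
library(plm)
data(crime2, package="wooldridge")
crime2.p <- pdata.frame(crime2, index=46)   # 46 cities, 2 years each
pdim(crime2.p)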
> # Observation 1-6: new "id" and "time" and some other variables:
> crime2.p[1:6,c("id","time","year","pop","crimes","crmrte","unem")]
id time year pop crimes crmrte unem
1-1 1 1 82 229528 17136 74.65756 8.2
1-2 1 2 87 246815 17306 70.11729 3.7
2-1 2 1 82 814054 75654 92.93487 8.1
2-2 2 2 87 933177 83960 89.97221 5.4
3-1 3 1 82 374974 31352 83.61113 9.0
3-2 3 2 87 406297 31364 77.19476 5.9
Script 13.5 (Example-PLM-Calcs.R) demonstrates these functions. The data set CRIME4.dta
has data on 90 counties for seven years. The data set includes the index variables county and year
which are used in the definition of our pdata.frame. We calculate lags, differences, between and
within transformations of the crime rate (crmrte). The results are stored back into the panel data
frame. The first rows of data are then presented for illustration.
The lagged variable cr.l is just equal to crmrte but shifted down one row. The difference between these two variables is cr.d. The average crmrte within the first seven rows (i.e. for county 1) is given as the first seven values of cr.B, and cr.W is the difference between crmrte and cr.B.
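A sketch of these calculations (object names chosen to match the description above):
crime4.p$cr.l <- lag(crime4.p$crmrte)        # lagged value
crime4.p$cr.d <- diff(crime4.p$crmrte)       # first difference
crime4.p$cr.B <- Between(crime4.p$crmrte)    # individual averages over time
crime4.p$cr.W <- Within(crime4.p$crmrte)     # demeaned (within-transformed) values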
which differs from Equation 13.1 in that it explicitly involves an unobserved effect ai that is constant
over time (since it has no t subscript). If it is correlated with one or more of the regressors xit1 , . . . , xitk ,
we cannot simply ignore ai , leave it in the composite error term vit = ai + uit and estimate the
equation by OLS. The error term vit would be related to the regressors, violating assumption MLR.4
(and MLR.4’) and creating biases and inconsistencies. Note that this problem is not unique to panel
data, but possible solutions are.
The first differenced (FD) estimator is based on the first difference of the whole equation:
∆yit = β1 ∆xit1 + · · · + βk ∆xitk + ∆uit
Note that we cannot evaluate this equation for the first observation t = 1 for any i since the lagged
values are unknown for them. The trick is that ai drops out of the equation by differencing since it
does not change over time. No matter how badly it is correlated with the regressors, it cannot hurt
the estimation anymore. This estimating equation is then analyzed by OLS. We simply regress the
differenced dependent variable ∆yit on the differenced independent variables ∆xit1 , . . . , ∆xitk .
Script 13.6 (Example-FD.R) opens the data set CRIME2.dta already used above. Within a
pdata.frame, we use the function diff to calculate first differences of the dependent variable
crime rate (crmrte) and the independent variable unemployment rate (unem) within our data set.
A list of the first six observations reveals that the differences are unavailable (NA) for the first year
of each city. The other differences are also calculated as expected. For example the change of the
crime rate for city 1 is 70.11729 − 74.65756 = −4.540268 and the change of the unemployment rate
for city 2 is 5.4 − 8.1 = −2.7.
The FD estimator can now be calculated by simply applying OLS to these differenced values.
The observations for the first year with missing information are automatically dropped from the
estimation sample. The results show a significantly positive relation between unemployment and
crime.
t test of coefficients:
t test of coefficients:
Generating the differenced values and using lm on them is actually unnecessary. Package plm provides the versatile command plm which implements FD and other estimators, some of which we will use in Chapter 14. It works just like lm but is directly applied to the original variables and
does the necessary calculations internally. With the option model="pooling", the pooled OLS
estimator is requested, option model="fd" produces the FD estimator. As the output of Script 13.6
(Example-FD.R) shows, the parameter estimates are exactly the same as our pedestrian calculations.
We will repeat this example with “robust” standard errors in Section 14.4.
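A sketch of both estimators for the crime example (the original script may also include a year dummy, which is omitted here):
library(plm); library(lmtest)
coeftest( plm(crmrte ~ unem, data=crime2.p, model="pooling") )
coeftest( plm(crmrte ~ unem, data=crime2.p, model="fd") )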
> pdim(crime4.p)
Balanced Panel: n = 90, T = 7, N = 630
t test of coefficients:
where ȳi is the average of yit over time for cross-sectional unit i and for the other variables accord-
ingly. The within transformation subtracts these individual averages from the respective observa-
tions yit . We already know how to conveniently calculate these demeaned variables like ÿit using the
command Within from Section 13.4.
The fixed effects (FE) estimator simply estimates the demeaned Equation 14.1 using pooled OLS.
Instead of applying the within transformation to all variables and running lm, we can simply use plm
on the original data with the option model="within". This has the additional advantage that the
degrees of freedom are adjusted to the demeaning and the variance-covariance matrix and standard
errors are adjusted accordingly. We will come back to different ways to get the same estimates in
Section 14.3.
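A sketch of the estimation that produces the output below (wage panel example):
library(plm)
data(wagepan, package="wooldridge")
wagepan.p <- pdata.frame(wagepan, index=c("nr","year"))
reg.fe <- plm(lwage ~ married + union + factor(year)*educ,
              data=wagepan.p, model="within")
summary(reg.fe)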
> pdim(wagepan.p)
Balanced Panel: n = 545, T = 8, N = 4360
Call:
plm(formula = lwage ~ married + union + factor(year) * educ,
data = wagepan.p, model = "within")
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.152111 -0.125630 0.010897 0.160800 1.483401
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
married 0.0548205 0.0184126 2.9773 0.002926 **
union 0.0829785 0.0194461 4.2671 2.029e-05 ***
factor(year)1981 -0.0224158 0.1458885 -0.1537 0.877893
factor(year)1982 -0.0057611 0.1458558 -0.0395 0.968495
factor(year)1983 0.0104297 0.1458579 0.0715 0.942999
factor(year)1984 0.0843743 0.1458518 0.5785 0.562965
factor(year)1985 0.0497253 0.1458602 0.3409 0.733190
factor(year)1986 0.0656064 0.1458917 0.4497 0.652958
factor(year)1987 0.0904448 0.1458505 0.6201 0.535216
factor(year)1981:educ 0.0115854 0.0122625 0.9448 0.344827
factor(year)1982:educ 0.0147905 0.0122635 1.2061 0.227872
factor(year)1983:educ 0.0171182 0.0122633 1.3959 0.162830
factor(year)1984:educ 0.0165839 0.0122657 1.3521 0.176437
factor(year)1985:educ 0.0237085 0.0122738 1.9316 0.053479 .
factor(year)1986:educ 0.0274123 0.0122740 2.2334 0.025583 *
factor(year)1987:educ 0.0304332 0.0122723 2.4798 0.013188 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> pdim(wagepan.p)
Balanced Panel: n = 545, T = 8, N = 4360
==========================================
Dependent variable:
-----------------------------
lwage
OLS RE FE
(1) (2) (3)
------------------------------------------
educ 0.091*** 0.092***
(0.005) (0.011)
------------------------------------------
Observations 4,360 4,360 4,360
R2 0.189 0.181 0.181
==========================================
Note: *p<0.1; **p<0.05; ***p<0.01
The RE estimator needs stronger assumptions to be consistent than the FE estimator. On the other
hand, it is more efficient if these assumptions hold and we can include time constant regressors. A
widely used test of this additional assumption is the Hausman test. It is based on the comparison
between the FE and RE parameter estimates. Package plm offers the simple command phtest for
automated testing. It expects both estimates and reports test results including the appropriate p
values.
Script 14.4 (Example-HausmTest.R) uses the estimates obtained in Script 14.3 (Example-14-4-2.R)
and stored in variables reg.re and reg.fe to run the Hausman test for this model. With the p
value of 0.0033, the null hypothesis that the RE model is consistent is clearly rejected at conventional significance levels like α = 5% or α = 1%.
Output of Script 14.4: Example-HausmTest.R
> # Note that the estimates "reg.fe" and "reg.re" are calculated in
> # Example 14.4. The scripts have to be run first.
>
> # Hausman test of RE vs. FE:
> phtest(reg.fe, reg.re)
Hausman Test
If ri is uncorrelated with the regressors, we can consistently estimate the parameters of this model
using the RE estimator. In addition to the original regressors, we include their averages over time.
Remember from Section 13.4 that these averages are computed with the function Between.
Script 14.5 (Example-Dummy-CRE-1.R) uses WAGEPAN.dta again. We estimate the FE parame-
ters using the within transformation (reg.fe), the dummy variable approach (reg.dum), and the
CRE approach (reg.cre). We also estimate the RE version of this model (reg.re). Script 14.6
(Example-Dummy-CRE-2.R) produces the regression table using stargazer. The results confirm
that the first three methods deliver exactly the same parameter estimates, while the RE estimates
differ.
Script 14.5: Example-Dummy-CRE-1.R
library(plm);library(stargazer)
data(wagepan, package=’wooldridge’)
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )
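The rest of the script estimates the four variants compared in the table below. A sketch of what these calls look like (year dummies and further controls are omitted for brevity, so this is not the exact specification behind the table):
reg.fe  <- plm(lwage ~ married + union, data=wagepan.p, model="within")
reg.dum <-  lm(lwage ~ married + union + factor(nr), data=wagepan.p)
reg.cre <- plm(lwage ~ married + union + Between(married) + Between(union),
               data=wagepan.p, model="random")
reg.re  <- plm(lwage ~ married + union, data=wagepan.p, model="random")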
====================================================
Dependent variable:
-----------------------------------
lwage
Within Dummies CRE RE
(1) (2) (3) (4)
----------------------------------------------------
married 0.055*** 0.055*** 0.055*** 0.078***
(0.018) (0.018) (0.018) (0.017)
Between(married) 0.127***
(0.044)
Between(union) 0.160***
(0.050)
----------------------------------------------------
Observations 4,360 4,360 4,360 4,360
R2 0.171 0.616 0.174 0.170
====================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Given we have estimated the CRE model, it is easy to test the null hypothesis that the RE estimator
is consistent. The additional assumptions needed are γ1 = · · · = γk = 0. They can easily be tested
using an F test as demonstrated in Script 14.7 (Example-CRE-test-RE.R). Like the Hausman test,
we clearly reject the null hypothesis that the RE model is appropriate with a tiny p value of about
0.00005.
Output of Script 14.7: Example-CRE-test-RE.R
> # Note that the estimates "reg.cre" are calculated in
> # Script "Example-Dummy-CRE-1.R" which has to be run first.
>
> # RE test as an F test on the "Between" coefficients
> library(car)
Hypothesis:
Between(married) = 0
Between(union) = 0
Another advantage of the CRE approach is that we can add time-constant regressors to the model.
Since we cannot control for average values x̄ij for these variables, they have to be uncorrelated with ai
for consistent estimation of their coefficients. For the other coefficients of the time-varying variables,
we still don’t need these additional RE assumptions.
Script 14.8 (Example-CRE2.R) estimates another version of the wage equation using the CRE
approach. The variables married and union vary over time, so we can control for their between
effects. The variables educ, black, and hisp do not vary. For a causal interpretation of their
coefficients, we have to rely on uncorrelatedness with ai . Given ai includes intelligence and other
labor market success factors, this uncorrelatedness is more plausible for some variables (like gender
or race) than for other variables (like education).
> summary(plm(lwage~married+union+educ+black+hisp+Between(married)+
> Between(union), data=wagepan.p, model="random"))
Oneway (individual) effect Random Effect Model
(Swamy-Arora’s transformation)
Call:
plm(formula = lwage ~ married + union + educ + black + hisp +
Between(married) + Between(union), data = wagepan.p, model = "random")
Effects:
var std.dev share
idiosyncratic 0.1426 0.3776 0.577
individual 0.1044 0.3231 0.423
theta: 0.6182
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-4.530129 -0.161868 0.026625 0.202817 1.648168
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 0.6325629 0.1081545 5.8487 4.954e-09 ***
married 0.2416845 0.0176735 13.6750 < 2.2e-16 ***
union 0.0700438 0.0207240 3.3798 0.0007253 ***
educ 0.0760374 0.0087787 8.6616 < 2.2e-16 ***
black -0.1295162 0.0488981 -2.6487 0.0080802 **
hisp 0.0116700 0.0428188 0.2725 0.7852042
Between(married) -0.0797386 0.0442674 -1.8013 0.0716566 .
Between(union) 0.1918545 0.0506522 3.7877 0.0001521 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1 Don’t confuse this with vcovHC from the package sandwich which only gives heteroscedasticity-robust results and unfortunately has the same name.
t test of coefficients:
t test of coefficients:
t test of coefficients:
β̂1IV = Cov(z, y)/Cov(z, x). (15.2)
A valid instrument is correlated with the regressor x (“relevant”), so the denominator of Equation
15.2 is nonzero. It is also uncorrelated with the error term u (“exogenous”). Wooldridge (2019,
Section 15.1) provides more discussion and examples.
To implement IV regression in R, the package AER offers the convenient command ivreg. It works
similarly to lm. In the formula specification, the regressor(s) are separated from the instruments with
a vertical line | (like in “conditional on z”):
ivreg( y ~ x | z )
Note that we can easily deal with heteroscedasticity: Results obtained by ivreg can be directly used
with robust standard errors from hccm (Package car) or vcovHC (package sandwich), see Section
8.1.
> # IV automatically
> reg.iv <- ivreg(log(wage) ~ educ | fatheduc, data=oursample)
===================================================================
Dependent variable:
------------------------------------
log(wage)
OLS instrumental
variable
(1) (2)
-------------------------------------------------------------------
educ 0.109*** 0.059*
(0.014) (0.035)
-------------------------------------------------------------------
Observations 428 428
R2 0.118 0.093
Adjusted R2 0.116 0.091
Residual Std. Error (df = 426) 0.680 0.689
F Statistic 56.929*** (df = 1; 426)
===================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
> # OLS
> ols<-lm(log(wage)~educ+exper+I(exper^2)+black+smsa+south+smsa66+reg662+
> reg663+reg664+reg665+reg666+reg667+reg668+reg669, data=card)
> # IV estimation
> iv <-ivreg(log(wage)~educ+exper+I(exper^2)+black+smsa+south+smsa66+
> reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669
> | nearc4+exper+I(exper^2)+black+smsa+south+smsa66+
> reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669
> , data=card)
=============================================
Dependent variable:
--------------------------------
educ log(wage)
OLS OLS instrumental
variable
(1) (2) (3)
---------------------------------------------
nearc4 0.320***
(0.088)
---------------------------------------------
Observations 3,010 3,010 3,010
R2 0.477 0.300 0.238
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
=============================================
Dependent variable:
------------------------------
educ log(wage)
OLS OLS instrumental
variable
(1) (2) (3)
---------------------------------------------
fitted(stage1) 0.061*
(0.033)
educ 0.061*
(0.031)
motheduc 0.158***
(0.036)
fatheduc 0.190***
(0.034)
---------------------------------------------
Observations 428 428 428
R2 0.211 0.050 0.136
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
t test of coefficients:
> # IV regression
> summary( res.2sls <- ivreg(log(wage) ~ educ+exper+I(exper^2)
> | exper+I(exper^2)+motheduc+fatheduc,data=oursample) )
Call:
ivreg(formula = log(wage) ~ educ + exper + I(exper^2) | exper +
I(exper^2) + motheduc + fatheduc, data = oursample)
Residuals:
Min 1Q Median 3Q Max
-3.0986 -0.3196 0.0551 0.3689 2.3493
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0481003 0.4003281 0.120 0.90442
educ 0.0613966 0.0314367 1.953 0.05147 .
exper 0.0441704 0.0134325 3.288 0.00109 **
I(exper^2) -0.0008990 0.0004017 -2.238 0.02574 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # IV FD regression
> summary( plm(log(scrap)~hrsemp|grant, model="fd",data=jtrain.p) )
Oneway (individual) effect First-Difference Model
Instrumental variable estimation
(Balestra-Varadharajan-Krishnakumar’s transformation)
Call:
plm(formula = log(scrap) ~ hrsemp | grant, data = jtrain.p, model = "fd")
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-2.3088292 -0.2188848 -0.0089255 0.2674362 2.4305637
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) -0.0326684 0.1269512 -0.2573 0.79692
hrsemp -0.0141532 0.0079147 -1.7882 0.07374 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimation of each equation separately by 2SLS is straightforward once we have set up the system and ensured identification. The excluded regressors in each equation serve as instrumental variables. As shown in Chapter 15, the command ivreg from the package AER provides convenient 2SLS estimation.
Call:
ivreg(formula = hours ~ log(wage) + educ + age + kidslt6 + nwifeinc |
educ + age + kidslt6 + nwifeinc + exper + I(exper^2), data = oursample)
Residuals:
Min 1Q Median 3Q Max
-4570.13 -654.08 -36.94 569.86 8372.91
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2225.662 574.564 3.874 0.000124 ***
log(wage) 1639.556 470.576 3.484 0.000545 ***
educ -183.751 59.100 -3.109 0.002003 **
age -7.806 9.378 -0.832 0.405664
kidslt6 -198.154 182.929 -1.083 0.279325
nwifeinc -10.170 6.615 -1.537 0.124942
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
ivreg(formula = log(wage) ~ hours + educ + exper + I(exper^2) |
educ + age + kidslt6 + nwifeinc + exper + I(exper^2), data = oursample)
Residuals:
Min 1Q Median 3Q Max
-3.49800 -0.29307 0.03208 0.36486 2.45912
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6557254 0.3377883 -1.941 0.0529 .
hours 0.0001259 0.0002546 0.494 0.6212
educ 0.1103300 0.0155244 7.107 5.08e-12 ***
exper 0.0345824 0.0194916 1.774 0.0767 .
I(exper^2) -0.0007058 0.0004541 -1.554 0.1209
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
systemfit results
method: 2SLS
The output of systemfit provides additional information, see the output of Script 16.3 (Example-16-5-systemfit.R). An interesting piece of information is the correlation between the residuals of the equations. In the example, it is reported to be a substantial −0.90. We can account for the correlation between the error terms to derive a potentially more
efficient parameter estimator than 2SLS. Without going into details here, the three stage least
squares (3SLS) estimator adds another stage to 2SLS by estimating the correlation and accounting
for it using a FGLS approach. For a detailed discussion of this and related methods, see for example
Wooldridge (2010, Chapter 8).
Using 3SLS in R is simple: Option method="3SLS" of systemfit is all we need to do as the
output of Script 16.4 (Example-16-5-3sls.R) shows.
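A sketch of the 3SLS call based on the two equations and the instrument set shown above:
library(systemfit)
eq.hours <- hours ~ log(wage) + educ + age + kidslt6 + nwifeinc
eq.wage  <- log(wage) ~ hours + educ + exper + I(exper^2)
instrum  <- ~ educ + age + kidslt6 + nwifeinc + exper + I(exper^2)
summary( systemfit(list(eq.hours, eq.wage), inst=instrum,
                   method="3SLS", data=oursample) )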
systemfit results
method: 3SLS
So when we study the conditional mean, it makes sense to think about it as the probability of
outcome y = 1. Likewise, the predicted value ŷ should be thought of as a predicted probability.
y = β0 + β1 x1 + · · · + βk xk (17.2)
and make the usual assumptions, especially MLR.4: E(u|x) = 0, this implies for the conditional mean (which is the probability that y = 1) and the predicted probabilities
P(y = 1|x) = E(y|x) = β0 + β1 x1 + · · · + βk xk (17.3)
P̂(y = 1|x) = ŷ = β̂0 + β̂1 x1 + · · · + β̂k xk (17.4)
t test of coefficients:
One problem with linear probability models is that P(y = 1|x) is specified as a linear function of
the regressors. By construction, there are (more or less realistic) combinations of regressor values
that yield ŷ < 0 or ŷ > 1. Since these are probabilities, this does not really make sense.
As an example, Script 17.2 (Example-17-1-2.R) calculates the predicted values for two women
(see Section 6.2 for how to predict after OLS estimation): Woman 1 is 20 years old, has no work
experience, 5 years of education, two children below age 6 and has additional family income of
100,000 USD. Woman 2 is 52 years old, has 30 years of work experience, 17 years of education, no
children and no other source of income. The predicted “probability” for woman 1 is −41%, the
probability for woman 2 is 104% as can also be easily checked with a calculator.
Output of Script 17.2: Example-17-1-2.R
> # predictions for two "extreme" women (run Example-17-1-1.R first!):
> xpred <- list(nwifeinc=c(100,0),educ=c(5,17),exper=c(0,30),
> age=c(20,52),kidslt6=c(2,0),kidsge6=c(0,0))
> predict(linprob,xpred)
1 2
-0.4104582 1.0428084
where the “link function” G (z) always returns values between 0 and 1. In the statistics literature,
this type of models is often called generalized linear model (GLM) because a linear part xβ shows
up within the nonlinear function G.
For binary response models, by far the most widely used specifications for G are
• the probit model with G(z) = Φ(z), the standard normal cdf, and
• the logit model with G(z) = Λ(z) = exp(z)/(1 + exp(z)), the cdf of the logistic distribution.
Wooldridge (2019, Section 17.1) provides useful discussions of the derivation and interpretation of
these models. Here, we are concerned with the practical implementation. In R, many generalized
linear models can be estimated with the command glm which works similar to lm. It accepts the
additional option
• family=binomial(link=logit) for the logit model or
• family=binomial(link=probit) for the probit model.
Maximum likelihood estimation (MLE) of the parameters is done automatically and the summary
of the results contains the most important regression table and additional information. Scripts
17.3 (Example-17-1-3.R) and 17.4 (Example-17-1-4.R) implement this for the logit and probit
model, respectively. The log likelihood value L ( β̂) is not reported by default but can be requested
with the function logLik. Instead, a statistic called Residual deviance is reported in the stan-
dard output. It is simply defined as D(β̂) = −2L(β̂). Null deviance means D0 = −2L0 where L0 is the log likelihood of a model with an intercept only.
The two deviance statistics can be accessed for additional calculations from a stored result res
with res$deviance and res$null.deviance. Scripts 17.3 (Example-17-1-3.R) and 17.4
(Example-17-1-4.R) demonstrate the calculation of different statistics derived from these results.
McFadden’s pseudo R-squared can be calculated as
pseudo R² = 1 − L(β̂)/L0 = 1 − D(β̂)/D0 . (17.6)
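As a sketch, the logit estimation for the labor force participation example and the corresponding pseudo R-squared can be obtained as follows (the object name is an assumption):
data(mroz, package="wooldridge")
logitres <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
                family=binomial(link=logit), data=mroz)
logLik(logitres)                                    # log likelihood
1 - logitres$deviance/logitres$null.deviance        # McFadden's pseudo R-squared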
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = logit), data = mroz)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1770 -0.9063 0.4473 0.8561 2.4032
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.425452 0.860365 0.495 0.62095
nwifeinc -0.021345 0.008421 -2.535 0.01126 *
educ 0.221170 0.043439 5.091 3.55e-07 ***
exper 0.205870 0.032057 6.422 1.34e-10 ***
I(exper^2) -0.003154 0.001016 -3.104 0.00191 **
age -0.088024 0.014573 -6.040 1.54e-09 ***
kidslt6 -1.443354 0.203583 -7.090 1.34e-12 ***
kidsge6 0.060112 0.074789 0.804 0.42154
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) + age +
kidslt6 + kidsge6, family = binomial(link = probit), data = mroz)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2156 -0.9151 0.4315 0.8653 2.4553
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2700736 0.5080782 0.532 0.59503
nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
educ 0.1309040 0.0253987 5.154 2.55e-07 ***
exper 0.1233472 0.0187587 6.575 4.85e-11 ***
I(exper^2) -0.0018871 0.0005999 -3.145 0.00166 **
age -0.0528524 0.0084624 -6.246 4.22e-10 ***
kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
kidsge6 0.0360056 0.0440303 0.818 0.41350
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
17.1.3. Inference
The summary output of fitted glm results contains a standard regression table with parameters
and (asymptotic) standard errors. The next column is labeled z value instead of t value in the
output of lm. The interpretation is the same. The difference is that the standard errors only have
an asymptotic foundation and the distribution used for calculating p values is the standard normal
distribution (which is equal to the t distribution with very large degrees of freedom). The bottom
line is that tests for single parameters can be done as before, see Section 4.1.
For testing multiple hypotheses similar to the F test (see Section 4.3), the likelihood ratio test is
popular. It is based on comparing the log likelihood values of the unrestricted and the restricted
model. The test statistic is
LR = 2(Lur − Lr ) = Dr − Dur (17.7)
where Lur and Lr are the log likelihood values of the unrestricted and restricted model, respectively,
and Dur and Dr are the corresponding reported deviance statistics. Under H0 , the LR test statistic is
asymptotically distributed as χ2 with the degrees of freedom equal to the number of restrictions to
be tested. The test of overall significance is a special case just like with F-tests. The null hypothesis
is that all parameters except the constant are equal to zero. With the notation above, the test statistic
is
LR = 2(L(β̂) − L0) = D0 − D(β̂). (17.8)
Translated to R with fitted model results stored in res, this corresponds to
LR = res$null.deviance - res$deviance
The package lmtest also offers the LR test as the function lrtest including the convenient
calculation of p values. The syntax is
• lrtest(res) for a test of overall significance for model res
• lrtest(restr, unrestr) for a test of the restricted model restr vs. the unrestricted
model unrestr
Script 17.5 (Example-17-1-5.R) implements the test of overall significance for the probit model
using both manual and automatic calculations. It also tests the joint null hypothesis that experience
and age are irrelevant by first estimating the restricted model and then running the automated LR
test. Output of Script 17.5: Example-17-1-5.R
> ################################################################
> # Test of overall significance:
> # Manual calculation of the LR test statistic:
> probitres$null.deviance - probitres$deviance
[1] 227.142
> lrtest(probitres)
Likelihood ratio test
> ################################################################
> # Test of H0: experience and age are irrelevant
> restr <- glm(inlf~nwifeinc+educ+ kidslt6+kidsge6,
> family=binomial(link=probit),data=mroz)
> lrtest(restr,probitres)
Likelihood ratio test
17.1.4. Predictions
The command predict can calculate predicted values for the estimation sample (“fitted values”)
or arbitrary sets of regressor values also for binary response models estimated with glm. Given the
results are stored in variable res, we can calculate
• xi β̂ for the estimation sample with predict(res)
• xi β̂ for the regressor values stored in xpred with predict(res, xpred)
• ŷ = G (xi β̂) for the estimation sample with predict(res, type = "response")
• ŷ = G (xi β̂) for the regressor values stored in xpred with predict(res, xpred, type =
"response")
The predictions for the two hypothetical women introduced in Section 17.1.1 are repeated for the
linear probability, logit, and probit models in Script 17.6 (Example-17-1-6.R). Unlike the linear
probability model, the predicted probabilities from the logit and probit models remain between 0
and 1.
Output of Script 17.6: Example-17-1-6.R
> # Predictions from linear probability, probit and logit model:
> # (run 17-1-1.R through 17-1-4.R first to define the variables!)
> predict(linprob, xpred,type = "response")
1 2
-0.4104582 1.0428084
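The corresponding logit and probit predictions remain bounded between 0 and 1; a sketch of these calls (object names assumed to match Scripts 17.3 and 17.4):
predict(logitres,  xpred, type="response")
predict(probitres, xpred, type="response")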
[Figure 17.1: predicted probabilities ŷ from the linear probability, logit, and probit models (simulated data)]
If we only have one regressor, predicted values can nicely be plotted against it. Figure 17.1 shows
such a figure for a simulated data set. For interested readers, the script used for generating the data
and the figure is printed as Script 17.7 (Binary-Predictions.R) in Appendix IV (p. 351). In
this example, the linear probability model clearly predicts probabilities outside of the “legal” area
between 0 and 1. The logit and probit models yield almost identical predictions. This is a general
finding that holds for most data sets.
∂ŷ/∂xj = ∂G(β̂0 + β̂1 x1 + · · · + β̂k xk)/∂xj (17.9)
       = β̂j · g(β̂0 + β̂1 x1 + · · · + β̂k xk), (17.10)
where g(z) is the derivative of the link function G(z); for the probit and logit models, it equals φ(z) and λ(z), the pdfs of the standard normal and the logistic distribution, respectively.
The partial effect depends on the value of x β̂. The pdfs have the famous bell-shape with highest
values in the middle and values close to zero in the tails. This is already obvious from Figure 17.1.
Figure 17.2. Partial effects for binary response models (simulated data)
Depending on the value of x, the slope of the probability differs. For our simulated data set, Figure
17.2 shows the estimated partial effects for all 100 observed x values. Interested readers can see the
complete code for this as Script 17.8 (Binary-Margeff.R) in Appendix IV (p. 352).
The fact that the partial effects differ by regressor values makes it harder to present the results in
a concise and meaningful way. There are two common ways to aggregate the partial effects:
• Partial effects at the average: PEA = β̂j · g(x̄ β̂), where x̄ is the vector of sample averages of the regressors
• Average partial effects: APE = (1/n) ∑i β̂j · g(xi β̂), i.e. β̂j times the sample average of g evaluated at the individual linear indices xi β̂
Both measures multiply each coefficient β̂j with a constant factor.
Script 17.9 (Example-17-1-7.R) implements the APE calculations for our labor force participa-
tion example using already known R functions:
1. The linear indices xi β̂ are calculated using predict
2. The factors g(x β̂) are calculated by using the pdf functions dlogis and dnorm and then
averaging over the sample with mean.
3. The APEs are calculated by multiplying the coefficient vector obtained with coef with the corresponding factor; see the sketch below. Note that for the linear probability model, the partial effects are constant and simply equal to the coefficients.
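A sketch of these three steps (object names assumed; compare the factors with the output below):
xb.log  <- predict(logitres)                 # step 1: linear indices x*b (logit)
xb.prob <- predict(probitres)                #          ... and probit
factor.log  <- mean( dlogis(xb.log) )        # step 2: average density factors
factor.prob <- mean( dnorm(xb.prob) )
APE.log  <- coef(logitres)  * factor.log     # step 3: APEs
APE.prob <- coef(probitres) * factor.prob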
The results for the constant do not have a direct meaningful interpretation. The APEs for the other
variables don’t differ too much between the models. As a general observation, as long as we are
interested in APEs only and not in individual predictions or partial effects and as long as not too
many probabilities are close to 0 or 1, the linear probability model often works well enough.
> cbind(factor.log,factor.prob)
factor.log factor.prob
[1,] 0.1785796 0.3007555
A convenient package for calculating PEA and APE is mfx. Among others, it provides the com-
mands logitmfx and probitmfx. They estimate the corresponding model and display a regression
table not with parameter estimates but with PEAs with the option atmean=TRUE and APEs with the
option atmean=FALSE. Script 17.10 (Example-17-1-8.R) demonstrates this for the logit model of
our labor force participation example. The reported APEs are the same as those manually calculated
in Script 17.9 (Example-17-1-7.R).
> logitmfx(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
> data=mroz, atmean=FALSE)
Call:
logitmfx(formula = inlf ~ nwifeinc + educ + exper + I(exper^2) +
age + kidslt6 + kidsge6, data = mroz, atmean = FALSE)
Marginal Effects:
dF/dx Std. Err. z P>|z|
nwifeinc -0.00381181 0.00153898 -2.4769 0.013255 *
educ 0.03949652 0.00846811 4.6641 3.099e-06 ***
exper 0.03676411 0.00655577 5.6079 2.048e-08 ***
I(exper^2) -0.00056326 0.00018795 -2.9968 0.002728 **
age -0.01571936 0.00293269 -5.3600 8.320e-08 ***
kidslt6 -0.25775366 0.04263493 -6.0456 1.489e-09 ***
kidsge6 0.01073482 0.01339130 0.8016 0.422769
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
∂E(y|x)/∂xj = βj · exβ = βj · E(y|x) (17.13)
⇔ βj = (1/E(y|x)) · ∂E(y|x)/∂xj (17.14)
If xj increases by one unit (and the other regressors remain the same), E(y|x) will increase roughly by 100 · βj percent (the exact value is once again 100 · (eβj − 1)).
A problem with the Poisson model is that it is quite restrictive. The Poisson distribution implicitly
restricts the variance of y to be equal to its mean. If this assumption is violated but the conditional
mean is still correctly specified, the Poisson parameter estimates are consistent, but the standard
errors and all inferences based on them are invalid. A simple solution is to interpret the Poisson
estimators as quasi-maximum likelihood estimators (QMLE). Similar to the heteroscedasticity-robust
inference for OLS discussed in Section 8.1, the standard errors can be adjusted.
Estimating Poisson regression models in R is straightforward. They also belong to the class of
generalized linear models (GLM) and can be estimated using glm. The option to specify a Pois-
son model is family=poisson. For the more robust QMLE standard errors, we simply specify
family=quasipoisson. For implementing more advanced count data models, see Kleiber and
Zeileis (2008, Section 5.3).
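A sketch of the estimation behind the table below (crime1 data; only pcnv is shown as a regressor here, the original script includes more):
data(crime1, package="wooldridge")
lm.res       <- lm(narr86 ~ pcnv, data=crime1)
Poisson.res  <- glm(narr86 ~ pcnv, family=poisson,      data=crime1)
QPoisson.res <- glm(narr86 ~ pcnv, family=quasipoisson, data=crime1)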
> stargazer(lm.res,Poisson.res,QPoisson.res,type="text",keep.stat="n")
==================================================
Dependent variable:
-------------------------------------
narr86
OLS Poisson glm: quasipoisson
link = log
(1) (2) (3)
--------------------------------------------------
pcnv -0.132*** -0.402*** -0.402***
(0.040) (0.085) (0.105)
--------------------------------------------------
Observations 2,725 2,725 2,725
==================================================
Note: *p<0.1; **p<0.05; ***p<0.01
[Figure: simulated data for the Tobit model, showing the latent variable y*, the observed (censored) variable y, and their conditional means E(y*) and E(y)]
> summary(TobitRes)
Call:
censReg(formula = hours ~ nwifeinc + educ + exper + I(exper^2) +
age + kidslt6 + kidsge6, data = mroz)
Observations:
Total Left-censored Uncensored Right-censored
753 325 428 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 965.30528 446.43631 2.162 0.030599 *
nwifeinc -8.81424 4.45910 -1.977 0.048077 *
educ 80.64561 21.58324 3.736 0.000187 ***
exper 131.56430 17.27939 7.614 2.66e-14 ***
I(exper^2) -1.86416 0.53766 -3.467 0.000526 ***
age -54.40501 7.41850 -7.334 2.24e-13 ***
kidslt6 -894.02174 111.87803 -7.991 1.34e-15 ***
kidsge6 -16.21800 38.64139 -0.420 0.674701
logSigma 7.02289 0.03706 189.514 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Another alternative for estimating Tobit models is the command survreg from package
survival. It is less straightforward to use but more flexible. We cannot comprehensively discuss
all features but just show how to reproduce the same results for Example 17.2 in Script 17.15
(Example-17-2-survreg.R). We will come back to this command in the next section.
Output of Script 17.15: Example-17-2-survreg.R
> # Estimate Tobit model using survreg:
> library(survival)
> res <- survreg(Surv(hours, hours>0, type="left") ~ nwifeinc+educ+exper+
>          I(exper^2)+age+kidslt6+kidsge6, data=mroz, dist="gaussian")
> summary(res)
Call:
survreg(formula = Surv(hours, hours > 0, type = "left") ~ nwifeinc +
educ + exper + I(exper^2) + age + kidslt6 + kidsge6, data = mroz,
dist = "gaussian")
Value Std. Error z p
(Intercept) 965.3053 446.4361 2.16 0.03060
nwifeinc -8.8142 4.4591 -1.98 0.04808
educ 80.6456 21.5832 3.74 0.00019
exper 131.5643 17.2794 7.61 2.7e-14
I(exper^2) -1.8642 0.5377 -3.47 0.00053
age -54.4050 7.4185 -7.33 2.2e-13
kidslt6 -894.0217 111.8780 -7.99 1.3e-15
kidsge6 -16.2180 38.6414 -0.42 0.67470
Log(scale) 7.0229 0.0371 189.51 < 2e-16
Scale= 1122
Gaussian distribution
Loglik(model)= -3819.1 Loglik(intercept only)= -3954.9
Chisq= 271.59 on 7 degrees of freedom, p= 7e-55
Number of Newton-Raphson Iterations: 4
n= 753
1 Wooldridge (2019, Section 17.4) uses the notation w instead of y and y instead of y*.
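The next output excerpt applies the same command to right-censored recidivism durations. The estimation step is again not shown; consistent with the Call below, it corresponds to a sketch like the following (the event indicator uncensored is an assumption, e.g. derived from the censoring variable in recid):

# Sketch: right-censored regression for log recidivism durations
library(survival)
data(recid, package="wooldridge")
recid$uncensored <- recid$cens == 0        # assumed definition of the event indicator
res <- survreg(Surv(log(durat), uncensored, type="right") ~ workprg+priors+tserved+
                 felon+alcohol+drugs+black+married+educ+age,
               data=recid, dist="gaussian")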
> # Output:
> summary(res)
Call:
survreg(formula = Surv(log(durat), uncensored, type = "right") ~
workprg + priors + tserved + felon + alcohol + drugs + black +
married + educ + age, data = recid, dist = "gaussian")
Value Std. Error z p
(Intercept) 4.099386 0.347535 11.80 < 2e-16
workprg -0.062572 0.120037 -0.52 0.6022
priors -0.137253 0.021459 -6.40 1.6e-10
tserved -0.019331 0.002978 -6.49 8.5e-11
felon 0.443995 0.145087 3.06 0.0022
alcohol -0.634909 0.144217 -4.40 1.1e-05
drugs -0.298160 0.132736 -2.25 0.0247
black -0.542718 0.117443 -4.62 3.8e-06
married 0.340684 0.139843 2.44 0.0148
educ 0.022920 0.025397 0.90 0.3668
age 0.003910 0.000606 6.45 1.1e-10
Log(scale) 0.593586 0.034412 17.25 < 2e-16
Scale= 1.81
Gaussian distribution
Loglik(model)= -1597.1 Loglik(intercept only)= -1680.4
Chisq= 166.74 on 10 degrees of freedom, p= 1.3e-30
Number of Newton-Raphson Iterations: 4
n= 1445
Truncation is a more serious problem than censoring since our observations are more severely affected. If the true latent variable y* is above or below a certain threshold, the individual is not sampled at all, so we have no information on such cases. Classical truncated regression models rely on parametric and distributional assumptions to correct for this problem. In R, they are available in the package truncreg.
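A minimal sketch of both estimators, assuming a data frame sample in which y is observed only if y > 0 (the names follow the simulation script):

# Sketch: OLS vs. truncated regression on the truncated sample
library(truncreg)
ols.res   <- lm(y ~ x, data=sample)
trunc.res <- truncreg(y ~ x, point=0, direction="left", data=sample)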
Figure 17.4 shows results for a simulated data set. Because it is simulated, we actually know the values for everybody (hollow and solid dots). In our sample, we only observe those with y > 0 (solid dots). When applying OLS to this sample, we get a downward biased slope (dashed line). Truncated regression fixes this problem and gives a consistent slope estimator (solid line). Script 17.17 (TruncReg-Simulation.R), which generated the data set and the graph, is shown in Appendix IV (p. 354).
[Figure 17.4: simulated truncated data set, showing all points, the observed points (y > 0), the OLS fit, and the truncated regression fit.]
> # ts data:
> hseinv.ts <- ts(hseinv)
=========================================
Dependent variable:
----------------------------
linv.detr
(1) (2)
-----------------------------------------
gprice 3.095*** 3.256***
(0.933) (0.970)
L(gprice) -2.936***
(0.973)
-----------------------------------------
Observations 41 40
Adjusted R2 0.375 0.504
=========================================
Note: *p<0.1; **p<0.05; ***p<0.01
∆y_t = α + θ y_{t−1} + γ_1 ∆y_{t−1} + δ_t t + e_t .

We already know how to implement such a regression. The different terms and their equivalents in dynlm syntax are:
• ∆y = d(y)
• yt−1 = L(y)
• ∆yt−1 = L(d(y))
• t = trend(data)
The relevant test statistic is t = −2.421 and the critical values are given in Wooldridge (2019, Table 18.3).
More conveniently, the script also uses the automatic command adf.test which reports a p value of
0.41. So the null hypothesis of a unit root cannot be rejected with any reasonable significance level.
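A sketch of both approaches for Example 18.4, where the series of interest is (presumably) the log of US GDP from the data set inven; treat the variable construction as an assumption:

# Sketch: manual ADF regression with dynlm and automated test with adf.test
library(dynlm); library(tseries)
data(inven, package="wooldridge")
inven$y  <- log(inven$gdp)                 # series to be tested for a unit root
inven.ts <- ts(inven)
summary( dynlm(d(y) ~ L(y) + L(d(y)) + trend(inven.ts), data=inven.ts) )
adf.test(inven$y, k=1)                     # ADF test with one lagged difference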
Script 18.3 (Example-18-4-urca.R) repeats the same analysis but uses the package urca.
Call:
dynlm(formula = d(y) ~ L(y) + L(d(y)) + trend(inven.ts), data = inven.ts)
Residuals:
Min 1Q Median 3Q Max
-0.046332 -0.012563 0.004026 0.013572 0.030789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.650922 0.666399 2.477 0.0189 *
L(y) -0.209621 0.086594 -2.421 0.0215 *
L(d(y)) 0.263751 0.164739 1.601 0.1195
trend(inven.ts) 0.005870 0.002696 2.177 0.0372 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
data: inven$y
Dickey-Fuller = -2.4207, Lag order = 1, p-value = 0.4092
alternative hypothesis: stationary
###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################
Call:
lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-0.046332 -0.012563 0.004026 0.013572 0.030789
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.656792 0.669068 2.476 0.0189 *
z.lag.1 -0.209621 0.086594 -2.421 0.0215 *
tt 0.005870 0.002696 2.177 0.0372 *
z.diff.lag 0.263751 0.164739 1.601 0.1195
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
x_t = x_{t−1} + a_t
y_t = y_{t−1} + e_t ,

where a_t and e_t are i.i.d. random innovations. If we want to test whether they are related from a
random sample, we could simply regress y on x. A t test should reject the (true) null hypothesis that
the slope coefficient is equal to zero with a probability of α, for example 5%. The phenomenon of
spurious regression implies that this happens much more often.
Script 18.4 (Simulate-Spurious-Regression-1.R) simulates this model for one sample. Re-
member from Section 11.2 how to simulate a random walk in a simple way: with a starting value of
zero, it is just the cumulative sum of the innovations. The time series for this simulated sample of
size n = 50 is shown in Figure 18.1. When we regress y on x, the t statistic for the slope parameter is
larger than 4 with a p value much smaller than 1%. So we would reject the (correct) null hypothesis
that the variables are unrelated.
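A minimal sketch of such a simulation (the seed is arbitrary):

# Sketch: two independent random walks and a spurious regression
set.seed(29846)                 # arbitrary seed
n <- 50
x <- cumsum(rnorm(n))           # random walk x_t = x_{t-1} + a_t
y <- cumsum(rnorm(n))           # independent random walk y_t = y_{t-1} + e_t
summary( lm(y ~ x) )            # t test on the slope rejects far too often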
[Figure 18.1: the two simulated random walks x (solid line) and y (dashed line), plotted against the observation index 1 to 50.]
> # plot
> plot(x,type="l",lty=1,lwd=1)
> # Regression of y on x
> summary( lm(y~x) )
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.5342 -1.4938 -0.2549 1.4803 4.6198
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.15050 0.56498 -5.576 1.11e-06 ***
x 0.29588 0.06253 4.732 2.00e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We know that by definition, a valid test should reject a true null hypothesis with a probabil-
ity of α, so maybe we were just unlucky with the specific sample we took. We therefore re-
peat the same analysis with 10,000 samples from the same data generating process in Script 18.5
(Simulate-Spurious-Regression-2.R). For each of the samples, we store the p value of the
slope parameter in a vector named pvals. After these simulations are run, we simply check how
often we would have rejected H0 : β 1 = 0 by comparing these p values with 0.05.
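The core of such a simulation loop might look like this sketch (the seed and the way the p value is extracted are illustrative):

# Sketch: repeat the spurious regression 10,000 times and store the p values
set.seed(29846)
pvals <- numeric(10000)
for (j in 1:10000) {
  x <- cumsum(rnorm(50))
  y <- cumsum(rnorm(50))
  regsum   <- summary( lm(y ~ x) )
  pvals[j] <- regsum$coefficients["x", "Pr(>|t|)"]   # p value of the slope
}
table(pvals <= 0.05)            # how often H0: beta1 = 0 is rejected at 5%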
We find that we rejected H0 in 6,626 of the samples, i.e. in 66% of them instead of α = 5%. So the t test seriously screws up the statistical inference because of the unit roots.
FALSE TRUE
3374 6626
y_t = β_0 + β_1 x_t + u_t ,
the error term u does not have a unit root, while both y and x do. A test for cointegration can
be based on this finding: We first estimate this model by OLS and then test for a unit root in the
residuals û. Again, we have to adjust the distribution of the test statistic and critical values. This
approach is called Engle-Granger test in Wooldridge (2019, Section 18.4) or Phillips–Ouliaris (PO)
test. It is implemented in package tseries as po.test and in package urca as ca.po.
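For two I(1) series y and x (for instance the simulated random walks above), minimal sketches of both implementations are:

# Sketch: Phillips-Ouliaris / Engle-Granger type cointegration tests
library(tseries); library(urca)
po.test( cbind(y, x) )                               # tseries version
summary( ca.po( cbind(y, x), demean="constant" ) )   # urca version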
If we find cointegration, we can estimate error correction models. In the Engle-Granger proce-
dure, these models can be estimated in a two-step procedure using OLS. There are also powerful
commands that automatically estimate different types of error correction models. Package urca
provides ca.jo and for structural models, package vars offers the command SVEC.
18.5. Forecasting
One major goal of time series analysis is forecasting. Given the information we have today, we want to give our best guess about the future and also quantify our uncertainty. Given a time series model for y, the best guess for y_{t+1} given the information I_t is the conditional mean E(y_{t+1} | I_t). For a model like

y_t = δ_0 + α_1 y_{t−1} + γ_1 z_{t−1} + u_t ,

suppose we are at time t, know both y_t and z_t, and want to predict y_{t+1}. Also suppose that E(u_t | I_{t−1}) = 0. Then,

E(y_{t+1} | I_t) = δ_0 + α_1 y_t + γ_1 z_t                      (18.6)

and our prediction from an estimated model would be ŷ_{t+1} = δ̂_0 + α̂_1 y_t + γ̂_1 z_t.
We already know how to get in-sample and (hypothetical) out-of-sample predictions including
forecast intervals from linear models using the command predict. It can also be used for our
purposes.
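As a sketch, for an estimated model res and a data frame newobs holding the regressor values for the forecast period (both names are hypothetical), a point forecast together with a 95% forecast interval is obtained with:

# Sketch: out-of-sample forecast with a forecast interval from an lm-type model
predict(res, newdata=newobs, interval="prediction")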
There are several ways in which the performance of forecast models can be evaluated. It makes a lot of sense not to look at the model fit within the estimation sample but at the out-of-sample forecast performance. Suppose we have used observations y_1, . . . , y_n for estimation and additionally have observations y_{n+1}, . . . , y_{n+m}. For this set of observations, we obtain out-of-sample forecasts f_{n+1}, . . . , f_{n+m} and calculate the m forecast errors

e_t = y_t − f_t     for t = n + 1, . . . , n + m.               (18.7)
We want these forecast errors to be as small (in absolute value) as possible. Useful measures are
the root mean squared error (RMSE) and the mean absolute error (MAE):
RMSE = √( (1/m) · Σ_{h=1}^{m} e²_{n+h} )                        (18.8)

MAE = (1/m) · Σ_{h=1}^{m} | e_{n+h} |                           (18.9)
===================================================
Dependent variable:
-------------------------------
unem
(1) (2)
---------------------------------------------------
unem_1 0.732*** 0.647***
(0.097) (0.084)
inf_1 0.184***
(0.041)
---------------------------------------------------
Observations 48 48
Adjusted R2 0.544 0.677
Residual Std. Error 1.049 (df = 46) 0.883 (df = 45)
===================================================
Note: *p<0.1; **p<0.05; ***p<0.01
> e1<- y - f1
> e2<- y - f2
> # RMSE:
> sqrt(mean(e1^2))
[1] 0.5761199
> sqrt(mean(e2^2))
[1] 0.5217543
> # MAE:
> mean(abs(e1))
[1] 0.542014
> mean(abs(e2))
[1] 0.4841945
[Figure: actual unemployment ("Unempl.") and the two out-of-sample forecasts ("Forecast 1" and "Forecast 2") for 1997 through 2003.]
If a project requires many and/or time-consuming calculations, it might be useful to separate them
into several R scripts. For example, we could have four different scripts corresponding to the steps
listed above:
• data.R
• descriptives.R
• estimation.R
• results.R
So once the potentially time-consuming data cleaning is done, we don’t have to repeat it every time
we run regressions. To avoid confusion, it is highly advisable to document interdependencies. Both
descriptives.R and estimation.R should at the beginning have a comment like
# Depends on data.R
Somewhere, we will have to document the whole workflow. The best way to do this is a master script that calls the separate scripts to reproduce the whole analysis from raw data to tables and figures. This can be done using the command source(scriptfile).
For generating the familiar output, we should add the option echo=TRUE. To avoid abbreviated
output, set max.deparse.length=1000 or another large number. For our example, a master file
could look like
Script 19.2: projecty-master.R
########################################################################
# Bachelor Thesis Mr. Z
# "Best Practice in Using R Scripts"
#
# R Script "master"
# Date of this version: 2020-08-13
########################################################################
# Some preparations:
setwd("~/bscthesis/r")
rm(list = ls())
# Call R scripts
source("data.R" ,echo=TRUE,max=1000) # Data import and cleaning
source("descriptives.R",echo=TRUE,max=1000) # Descriptive statistics
source("estimation.R" ,echo=TRUE,max=1000) # Estimation of model
source("results.R" ,echo=TRUE,max=1000) # Tables and Figures
All output between starting the log file with sink("logfile.txt") and stopping it with sink()
will be written to the file logfile.txt in the current working directory. We can of course use a
different directory e.g. with sink("~/mydir/logfile.txt"). Note that comments, commands,
and messages are not written to the file by default. The next section describes a more advanced way
to store and display R results.
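A minimal sketch (the file name is arbitrary):

# Sketch: log all subsequent output to a text file
sink("logfile.txt")
summary(cars)        # output produced here goes into logfile.txt
sink()               # stop logging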
19.3.1. Basics
R Markdown can be used with the R package rmarkdown. An R Markdown file is a standard text
file but should have the file name extension .Rmd. It includes normal text, formatting instructions,
and R code. It is processed by R and generates a formatted document. As a simple example, let’s
turn Script 19.1 into a basic R Markdown document. The file looks like this:1
File ultimate-calcs-rmd.Rmd
1 This file can be downloaded along with all other files presented here at http://www.URfIE.net.
The file starts with a header between the two --- that specifies a few standard properties like the author and the date. Then we see basic text and a URL. The only line that actually involves R code is framed by a ```{r} at the beginning and a ``` at the end.
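Since the original listing is not reproduced in this excerpt, the following rough sketch merely illustrates that structure; the title, author, date, text, and calculation are assumptions rather than the actual content of the file:

---
title: "The Ultimate Calculation"
author: "F. Heiss"
date: "August 13, 2020"
output: html_document
---

More information on R Markdown: http://rmarkdown.rstudio.com

```{r}
sqrt(1764)
```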
Instead of running this file through R directly, we process it with tools from the package rmarkdown. If ultimate-calcs-rmd.Rmd is in the current working directory (otherwise, we need to add the path), we can simply create an HTML document with
render("ultimate-calcs-rmd.Rmd")
The HTML document can be opened in any web browser, but also in word processors. Instead of
HTML documents, we can create Microsoft Word documents with
render("ultimate-calcs-rmd.Rmd",output_format="word_document")
If the computer has a working LaTeX system installed, we can create a PDF file with
render("ultimate-calcs-rmd.Rmd",output_format="pdf_document")
With RStudio, R Markdown is even easier to use: when editing an R Markdown document, there is a Knit HTML button on top of the editor window. It will render the document properly. By default, the documents are created in the same directory and with the same file name (except for the extension). We can also choose a different file name and/or a different directory with the options output_file=... and output_dir=..., respectively.
All three formatted documents look similar to each other and are displayed in Figure 19.1 (HTML, Word, and PDF output).
Different formatting options for text and code chunks are demonstrated in the following R Markdown script. Its HTML output is shown in Figure 19.2.
File rmarkdown-examples.Rmd
The file ultimate-calcs-knitr.Rnw contains standard LaTeX code. It also includes an R code chunk which is started with <<>>= and ended with @. This file is processed (“knitted”) by the knitr package using the command
knit("ultimate-calcs-knitr.Rnw")
to produce a pure LaTeX file ultimate-calcs-knitr.tex. This file can in the next step be processed using a standard LaTeX installation. R can also call any command line / shell commands appropriate for the operating system using the function shell("some OS command"). With a working pdflatex command installed on the system, we can therefore produce a .pdf from a .Rnw file with the R commands
knit("ultimate-calcs-knitr.Rnw")
shell("pdflatex ultimate-calcs-knitr.tex")
If we are using LaTeX references and the like, pdflatex might have to be called repeatedly.
RStudio can be used to conveniently work with knitr, including syntax highlighting for the LaTeX code. By default, RStudio is set to work with Sweave instead, at least at the time of writing this.
To use knitr, change the option Tools → Global Options → Sweave → Weave Rnw files
using from Sweave to knitr. Then we can produce a .pdf file from a .Rnw file with a click of a
“Compile PDF” button.
The R code chunks in a knitr document can be customized with options by starting the chunk with <<chunk-name, option 1, option 2, ...>>= to change the way the R results are displayed; a short example follows the list below. Important examples include
• echo=FALSE: Don’t print the R commands
• results="hide": Don’t print the R output
• results="asis": The results are LaTeX code, for example generated by xtable or stargazer.
• error=FALSE, warning=FALSE, message=FALSE: Don’t print any errors, warnings, or mes-
sages from R.
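For illustration, a chunk that hides the R commands and passes a regression table through as LaTeX code could look like this (the chunk name and the model objects res1 and res2 are hypothetical):

<<regtable, echo=FALSE, results="asis">>=
stargazer(res1, res2, header=FALSE)
@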
The following .Rnw file demonstrates some of these features. After running this file through knit
and pdflatex, the resulting PDF file is shown in Figure 19.3. For more details on knitr, see Xie
(2015).
File knitr-example.Rnw
[Figure 19.3: the generated PDF document, reporting the number of observations (141), a frequency table of gender (67 female, 74 male), and a regression table with colGPA as the dependent variable.]
# Number of obs.
sink("numb-n.txt"); cat(nrow(gpa1)); sink()
# generate frequency table in file "tab-gender.txt"
gender <- factor(gpa1$male,labels=c("female","male"))
sink("tab-gender.txt")
xtable( table(gender) )
sink()
# b1 hat
sink("numb-b1.txt"); cat(round(coef(res1)[2],3)); sink()
After this script was run, the four text files have the following content:2
File numb-n.txt
141
File numb-b1.txt
0.482
File tab-gender.txt
% latex table generated in R 4.0.0 by xtable 1.8-4 package
% Mon May 18 16:31:00 2020
\begin{table}[ht]
\centering
\begin{tabular}{rr}
\hline
& gender \\
\hline
female & 67 \\
male & 74 \\
\hline
\end{tabular}
\end{table}
File tab-regr.txt
% Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University.
% E-mail: hlavac at fas.harvard.edu
% Date and time: Mon, May 18, 2020 - 4:25:47 PM
\begin{table}[!htbp] \centering
\caption{Regression Results}
\label{t:reg}
\begin{tabular}{@{\extracolsep{5pt}}lccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{3}{c}{\textit{Dependent variable:}} \\
\cline{2-4}
\\[-1.8ex] & \multicolumn{3}{c}{colGPA} \\
\\[-1.8ex] & (1) & (2) & (3)\\
\hline \\[-1.8ex]
hsGPA & 0.482$^{***}$ & & 0.453$^{***}$ \\
& (0.090) & & (0.096) \\
& & & \\
ACT & & 0.027$^{**}$ & 0.009 \\
& & (0.011) & (0.011) \\
& & & \\
Constant & 1.415$^{***}$ & 2.403$^{***}$ & 1.286$^{***}$ \\
& (0.307) & (0.264) & (0.341) \\
& & & \\
\hline \\[-1.8ex]
Observations & 141 & 141 & 141 \\
R$^{2}$ & 0.172 & 0.043 & 0.176 \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05;
$^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
2 Make sure to use setwd first to choose the correct directory where we want to store the results.
Now we write a LaTeX file with the appropriate \input{...} commands to put tables and numbers into the right place. A file that generates the same document as the one in Figure 19.3 is the following:
File LaTeXwithR.tex
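The original listing is not reproduced in this excerpt. In the spirit described above, a stripped-down sketch of such a file could look as follows; all wording is illustrative, only the \input file names correspond to the files generated by the R script:

\documentclass{article}
\begin{document}

Our data set has \input{numb-n.txt} observations.
The distribution of gender is the following:

\input{tab-gender.txt}

The regression results are reported in Table \ref{t:reg}. The estimated
coefficient of hsGPA in specification (1) is \input{numb-b1.txt}.

\input{tab-regr.txt}

\end{document}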
Whenever we update the calculations, we rerun the R script and create updated tables, numbers, and graphs. Whenever we update the text in our document, LaTeX will use the latest version of the results to generate a publication-ready PDF document.
We have automatically generated exactly the same PDF document in two different ways in this and the previous section. Which one is better? It depends. In smaller projects with little and fast R computations, knitr is convenient because it combines everything in one file. This is also the ideal in terms of reproducibility. For larger projects with many or time-consuming R calculations, it is more convenient to separate the calculations from the text, since knitr requires redoing all calculations whenever we compile the LaTeX code. This book was done in the separated spirit described in this section.
Part IV.
Appendices
R Scripts
1. Scripts Used in Chapter 01
Script 1.1: R-as-a-Calculator.R
1+1
5*(4-1)^2
sqrt( log(10) )
# Basic functions:
sort(a)
length(a)
min(a)
max(a)
sum(a)
prod(a)
# Logical vectors:
( a <- c(7,2,6,9,4,1,3) )
# Indices by number:
avgs[2]
avgs[1:4]
# Indices by name:
avgs["Jackson"]
# Logical indices:
avgs[ avgs>=0.35 ]
# Transpose:
(C <- t(B) )
# Matrix multiplication:
(D <- A %*% C )
# Inverse:
solve(D)
# Vector of names:
names(mylist)
# Result:
sales
sales
head(wdi_raw)
tail(wdi_raw)
# Graph
library(ggplot2)
ggplot(ourdata, aes(year, LE_fem)) +
geom_line() +
theme_light() +
labs(title="Life expectancy of females in the US",
subtitle="World Bank: World Development Indicators",
x = "Year",
y = "Life expectancy [years]"
)
tail(le_data)
tail(ctryinfo)
# Join:
alldata <- left_join(le_data, ctryinfo)
tail(alldata)
# First 6 rows:
tail(avgdata)
# plot
ggplot(avgdata, aes(year, LE_avg, color=income)) +
geom_line() +
scale_color_grey()
data(affairs, package="wooldridge")
# sample average:
mean(ceosal1$salary)
# sample median:
median(ceosal1$salary)
#standard deviation:
sd(ceosal1$salary)
# summary information:
summary(ceosal1$salary)
# Table(matrix) of values:
cbind(x, fx)
# Plot
plot(x, fx, type="h")
# Set the seed of the random number generator and take two samples:
set.seed(6254137)
rnorm(5)
rnorm(5)
# Reset the seed to the same value to get the same samples again:
set.seed(6254137)
rnorm(5)
rnorm(5)
# Ingredients to CI formula
(avgCh<- mean(Change))
(n <- length(Change))
(sdCh <- sd(Change))
(se <- sdCh/sqrt(n))
(c <- qt(.975, n-1))
# Confidence interval:
c( avgCh - c*se, avgCh + c*se )
# Ingredients to CI formula
(avgy<- mean(audit$y))
(n <- length(audit$y))
(sdy <- sd(audit$y))
(se <- sdy/sqrt(n))
(c <- qnorm(.975))
# p value
(p <- pt(t,n-1))
# p value
(p <- pt(t,240))
# print results:
testres
# p-value
testres$p.value
# repeat r times:
for(j in 1:r) {
# Draw a sample and store the sample mean in pos. j=1,2,... of ybar:
sample <- rnorm(100,10,2)
ybar[j] <- mean(sample)
}
# Simulated mean:
mean(ybar)
# Simulated variance:
var(ybar)
# Simulated density:
plot(density(ybar))
curve( dnorm(x,10,sqrt(.04)), add=TRUE,lty=2)
# repeat r times:
for(j in 1:r) {
# Draw a sample
sample <- rnorm(100,10,2)
# test the (correct) null hypothesis mu=10:
testres1 <- t.test(sample,mu=10)
# store CI & p value:
CIlower[j] <- testres1$conf.int[1]
CIupper[j] <- testres1$conf.int[2]
pvalue1[j] <- testres1$p.value
# test the (incorrect) null hypothesis mu=9.5 & store the p value:
pvalue2[j] <- t.test(sample,mu=9.5)$p.value
}
# OLS regression
lm( salary ~ roe, data=ceosal1 )
# OLS regression
CEOregres <- lm( salary ~ roe, data=ceosal1 )
# OLS regression:
lm(wage ~ educ, data=wage1)
# average y:
mean(ceosal1$salary)
# Number of obs.
( n <- nobs(results) )
# SER:
(SER <- sd(resid(results)) * sqrt((n-1)/(n-2)) )
# SE of b0hat & b1hat, respectively:
SER / sd(meap93$lnchprg) / sqrt(n-1) * sqrt(mean(meap93$lnchprg^2))
SER / sd(meap93$lnchprg) / sqrt(n-1)
# Automatic calculations:
summary(results)
# Graph
plot(x, y, col="gray", xlim=c(0,8) )
abline(b0,b1,lwd=2)
abline(olsres,col="gray",lwd=2)
legend("topleft",c("pop. regr. fct.","OLS regr. fct."),
lwd=2,col=c("black","gray"))
# repeat r times:
for(j in 1:r) {
# Draw a sample of size n:
x <- rnorm(n,4,1)
u <- rnorm(n,0,su)
y <- b0 + b1*x + u
# repeat r times:
for(j in 1:r) {
# Draw a sample of y:
u <- rnorm(n,0,su)
y <- b0 + b1*x + u
# repeat r times:
for(j in 1:r) {
# Draw a sample of y:
u <- rnorm(n, (x-4)/5, su)
y <- b0 + b1*x + u
# repeat r times:
for(j in 1:r) {
# Draw a sample of y:
varu <- 4/exp(4.5) * exp(x)
u <- rnorm(n, 0, sqrt(varu) )
y <- b0 + b1*x + u
# OLS regression:
summary( lm(log(wage) ~ educ+exper+tenure, data=wage1) )
# OLS regression:
summary( lm(prate ~ mrate+age, data=k401k) )
# OLS regression:
summary( lm(log(wage) ~ educ, data=wage1) )
# extract y
y <- gpa1$colGPA
# Parameter estimates:
( bhat <- solve( t(X)%*%X ) %*% t(X)%*%y )
# OLS regression:
lmres <- lm(log(wage) ~ educ+exper+tenure, data=wage1)
# Regression output:
summary(lmres)
# Reproduce t statistic
( tstat <- bhat / se )
# Reproduce p value
( pval <- 2*pt(-abs(tstat),137) )
# OLS regression:
summary( lm(log(wage) ~ educ+exper+tenure, data=wage1) )
# OLS regression:
myres <- lm(log(rd) ~ log(sales)+profmarg, data=rdchem)
# Regression output:
summary(myres)
# 95% CI:
confint(myres)
# 99% CI:
confint(myres, level=0.99)
# R2:
( r2.ur <- summary(res.ur)$r.squared )
( r2.r <- summary(res.r)$r.squared )
# F statistic:
( F <- (r2.ur-r2.r) / (1-r2.ur) * 347/3 )
# F test
myH0 <- c("bavg","hrunsyr","rbisyr")
linearHypothesis(res.ur, myH0)
set.seed(1234567)
# set true parameters: intercept & slope
b0 <- 1; b1 <- 0.5
# initialize b1hat to store 10000 results:
b1hat <- numeric(10000)
# repeat r times:
for(j in 1:10000) {
# Draw a sample of x, varying over replications:
x <- rnorm(n,4,1)
# Draw a sample of u (std. normal):
u <- rnorm(n)
# Draw a sample of y:
y <- b0 + b1*x + u
# regress y on x and store slope estimate at position j
bhat <- coef( lm(y~x) )
b1hat[j] <- bhat["x"]
}
# Basic model:
lm( bwght ~ cigs+faminc, data=bwght)
# Packs of cigarettes:
lm( bwght ~ I(cigs/20) +faminc, data=bwght)
# Using poly(...):
res <- lm(log(price)~log(nox)+log(dist)+poly(rooms,2,raw=TRUE)+
stratio,data=hprice2)
summary(res)
# Plot
matplot(X$rooms, pred, type="l", lty=c(1,2,2))
lm(log(wage)~married*female+educ+exper+I(exper^2)+tenure+I(tenure^2),
data=wage1)
# Rerun regression:
lm(log(wage) ~ education+experience+gender+occupation, data=CPS1985)
# Regression
res <- lm(log(wage) ~ education+experience+gender+occupation, data=CPS1985)
# ANOVA table
car::Anova(res)
# Display frequencies
table(lawsch85$rankcat)
# Run regression
(res <- lm(log(salary)~rankcat+LSAT+GPA+log(libvol)+log(cost), data=lawsch85))
# ANOVA table
car::Anova(res)
# Model with full interactions with female dummy (only for spring data)
reg<-lm(cumgpa~female*(sat+hsperc+tothrs), data=gpa3, subset=(spring==1))
summary(reg)
# F-Test from package "car". H0: the interaction coefficients are zero
# matchCoefs(...) selects all coeffs with names containing "female"
library(car)
linearHypothesis(reg, matchCoefs(reg, "female"))
# Estimate model
reg <- lm(price~lotsize+sqrft+bdrms, data=hprice1)
reg
# Automatic BP test
library(lmtest)
bptest(reg)
# Estimate model
reg <- lm(log(price)~log(lotsize)+log(sqrft)+bdrms, data=hprice1)
reg
# BP test
library(lmtest)
bptest(reg)
# White test
bptest(reg, ~ fitted(reg) + I(fitted(reg)^2) )
# WLS
lm(nettfa ~ inc + I((age-25)^2) + male + e401k, weight=1/inc,
data=k401ksubs, subset=(fsize==1))
# WLS
# non-robust results
library(lmtest); library(car)
coeftest(wlsreg)
# OLS
olsreg<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,
data=smoke)
olsreg
# BP test
library(lmtest)
bptest(olsreg)
# FGLS: WLS
w <- 1/exp(fitted(varreg))
lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,
weight=w ,data=smoke)
# RESET test
library(lmtest)
resettest(orig)
data.frame(x,logx,invx,ncdf,isna)
# extract LSAT
lsat <- lawsch85$LSAT
# Frequencies of indicator
table(missLSAT)
# Regression
reg <- lm(rdintens~sales+profmarg, data=rdchem)
# OLS Regression
ols <- lm(rdintens ~ I(sales/1000) +profmarg, data=rdchem)
# LAD Regression
library(quantreg)
lad <- rq(rdintens ~ I(sales/1000) +profmarg, data=rdchem)
# regression table
library(stargazer)
stargazer(ols,lad, type = "text")
# Download data
getSymbols("F", auto.assign=TRUE)
# Plot returns
plot(ret)
Dy <- diff(y)
# Add line to graph
lines(Dy, col=gray(.6))
}
# Pedestrian test:
residual <- resid(reg)
resreg <- dynlm(residual ~ L(residual)+L(residual,2)+L(residual,3)+
log(chempi)+log(gas)+log(rtwex)+befile6+
affile6+afdec6, data=tsdata )
linearHypothesis(resreg,
c("L(residual)","L(residual, 2)","L(residual, 3)"))
# Automatic test:
bgtest(reg, order=3, type="F")
# DW tests
dwtest(reg.s)
dwtest(reg.ea)
# OLS estimation
olsres <- dynlm(log(chnimp)~log(chempi)+log(gas)+log(rtwex)+
befile6+affile6+afdec6, data=tsdata)
# Cochrane-Orcutt estimation
cochrane.orcutt(olsres)
# OLS regression
reg<-dynlm(log(prepop)~log(mincov)+log(prgnp)+log(usgnp)+trend(tsdata),
data=tsdata )
# results with usual SE
coeftest(reg)
# results with HAC SE
coeftest(reg, vcovHAC)
# squared residual
residual.sq <- resid(reg)^2
# squared residual
residual.sq <- resid(reg)^2
# Panel dimensions:
pdim(crime2.p)
# Observation 1-6: new "id" and "time" and some other variables:
crime2.p[1:6,c("id","time","year","pop","crimes","crmrte","unem")]
# Generate pdata.frame:
crime4.p <- pdata.frame(crime4, index=c("county","year") )
# Estimate FD model:
coeftest( plm(log(crmrte)~d83+d84+d85+d86+d87+lprbarr+lprbconv+
lprbpris+lavgsen+lpolpc,data=crime4.p, model="fd") )
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )
pdim(wagepan.p)
# Estimate FE model
summary( plm(lwage~married+union+factor(year)*educ,
data=wagepan.p, model="within") )
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )
pdim(wagepan.p)
reg.ols<- (plm(lwage~educ+black+hisp+exper+I(exper^2)+married+union+yr,
data=wagepan.p, model="pooling") )
reg.re <- (plm(lwage~educ+black+hisp+exper+I(exper^2)+married+union+yr,
data=wagepan.p, model="random") )
reg.fe <- (plm(lwage~ I(exper^2)+married+union+yr,
data=wagepan.p, model="within") )
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )
# Generate pdata.frame:
wagepan.p <- pdata.frame(wagepan, index=c("nr","year") )
# Generate pdata.frame:
crime4.p <- pdata.frame(crime4, index=c("county","year") )
# Estimate FD model:
reg <- ( plm(log(crmrte)~d83+d84+d85+d86+d87+lprbarr+lprbconv+
lprbpris+lavgsen+lpolpc,data=crime4.p, model="fd") )
# Regression table with standard SE
coeftest(reg)
# Regression table with "clustered" SE (default type HC0):
coeftest(reg, vcovHC)
# Regression table with "clustered" SE (small-sample correction)
# This is the default version used by Stata and reported by Wooldridge:
coeftest(reg, vcovHC(reg, type="sss"))
# OLS automatically
reg.ols <- lm(log(wage) ~ educ, data=oursample)
# IV automatically
reg.iv <- ivreg(log(wage) ~ educ | fatheduc, data=oursample)
# 2nd stage
man.2SLS<-lm(log(wage)~fitted(stage1)+exper+I(exper^2), data=oursample)
# 2nd stage
stage2<-lm(log(wage)~educ+exper+I(exper^2)+resid(stage1),data=oursample)
# IV regression
summary( res.2sls <- ivreg(log(wage) ~ educ+exper+I(exper^2)
| exper+I(exper^2)+motheduc+fatheduc,data=oursample) )
# Auxiliary regression
res.aux <- lm(resid(res.2sls) ~ exper+I(exper^2)+motheduc+fatheduc
, data=oursample)
# IV FD regression
summary( plm(log(scrap)~hrsemp|grant, model="fd",data=jtrain.p) )
# 2SLS regressions
summary( ivreg(hours~log(wage)+educ+age+kidslt6+nwifeinc
|educ+age+kidslt6+nwifeinc+exper+I(exper^2), data=oursample))
summary( ivreg(log(wage)~hours+educ+exper+I(exper^2)
|educ+age+kidslt6+nwifeinc+exper+I(exper^2), data=oursample))
summary(systemfit(eq.system,inst=instrum,data=oursample,method="3SLS"))
################################################################
# Test of H0: experience and age are irrelevant
restr <- glm(inlf~nwifeinc+educ+ kidslt6+kidsge6,
family=binomial(link=probit),data=mroz)
lrtest(restr,probitres)
# Estimation
linpr.res <- lm(y~x)
logit.res <- glm(y~x,family=binomial(link=logit))
probit.res<- glm(y~x,family=binomial(link=probit))
# Graph
plot(x,y)
lines(xp,linpr.p, lwd=2,lty=1)
lines(xp,logit.p, lwd=2,lty=2)
lines(xp,probit.p,lwd=1,lty=1)
legend("topleft",c("linear prob.","logit","probit"),
lwd=c(2,2,1),lty=c(1,2,1))
# Graph
plot( x,linpr.eff, pch=1,ylim=c(0,.7),ylab="partial effect")
points(x,logit.eff, pch=3)
points(x,probit.eff,pch=18)
legend("topright",c("linear prob.","logit","probit"),pch=c(1,3,18))
# Table of APEs
cbind(APE.lin, APE.log, APE.prob)
# Conditional means
Eystar <- xb
Ey <- pnorm(xb/1)*xb+1*dnorm(xb/1)
# Graph
plot(x,ystar,ylab="y", pch=3)
points(x,y, pch=1)
lines(x,Eystar, lty=2,lwd=2)
lines(x,Ey , lty=1,lwd=2)
abline(h=0,lty=3) # horizontal line at 0
legend("topleft",c(expression(y^"*"),"y",expression(E(y^"*")),"E(y)"),
lty=c(NA,NA,2,1),pch=c(3,1,NA,NA),lwd=c(1,1,2,2))
# Predictions
pred.OLS <- predict( lm(y~x, data=sample) )
pred.trunc <- predict( truncreg(y~x, data=sample) )
# Graph
plot( compl$x, compl$y, pch= 1,xlab="x",ylab="y")
points(sample$x,sample$y, pch=16)
lines( sample$x,pred.OLS, lty=2,lwd=2)
lines( sample$x,pred.trunc,lty=1,lwd=2)
abline(h=0,lty=3) # horizontal line at 0
legend("topleft", c("all points","observed points","OLS fit",
"truncated regression"),
lty=c(NA,NA,2,1),pch=c(1,16,NA,NA),lwd=c(1,1,2,2))
# LRP rationalDL:
b <- coef(rDL)
(b["gprice"]+b["L(gprice)"]) / (1-b["L(linv.detr)"])
x <- cumsum(a)
y <- cumsum(e)
# plot
plot(x,type="l",lty=1,lwd=1)
lines(y ,lty=2,lwd=2)
legend("topright",c("x","y"), lty=c(1,2), lwd=c(1,2))
# Regression of y on x
summary( lm(y~x) )
# Forecast errors:
e1<- y - f1
e2<- y - f2
# RMSE:
sqrt(mean(e1^2))
sqrt(mean(e2^2))
# MAE:
mean(abs(e1))
mean(abs(e2))
# Some preparations:
setwd("~/bscthesis/r")
rm(list = ls())
# Call R scripts
source("data.R" ,echo=TRUE,max=1000) # Data import and cleaning
source("descriptives.R",echo=TRUE,max=1000) # Descriptive statistics
source("estimation.R" ,echo=TRUE,max=1000) # Estimation of model
source("results.R" ,echo=TRUE,max=1000) # Tables and Figures
# Number of obs.
sink("numb-n.txt"); cat(nrow(gpa1)); sink()
# generate frequency table in file "tab-gender.txt"
gender <- factor(gpa1$male,labels=c("female","male"))
sink("tab-gender.txt")
xtable( table(gender) )
sink()
# b1 hat
sink("numb-b1.txt"); cat(round(coef(res1)[2],3)); sink()
Wickham, H. and G. Grolemund (2016): R for Data Science: Import, Tidy, Transform, Visualize,
and Model Data, O’Reilly.
Wooldridge, J. M. (2010): Econometric Analysis of Cross Section and Panel Data, MIT Press.
——— (2013): Introductory Econometrics: A Modern Approach, Cengage Learning, 5th ed.
——— (2014): Introduction to Econometrics, Cengage Learning.
——— (2019): Introductory Econometrics: A Modern Approach, Cengage Learning, 7th ed.
Xie, Y. (2015): Dynamic Documents with R and knitr, CRC Press, Chapman & Hall.
Zeileis, A. (2004): “Econometric Computing with HC and HAC Covariance Matrix Estimators,”
Journal of Statistical Software, 11.
Zeileis, A. and G. Grothendieck (2005): “zoo: S3 Infrastructure for Regular and Irregular
Time Series,” Journal of Statistical Software, 14.
List of Wooldridge (2019) Examples
Index
ggplot, 35
ggplot2 package, 8, 34
ggsave, 39
glm, 255, 264
GPA1.dta, 119
graph
    export, 33
gray, 29
group_by, 46
Hausman test of RE vs. FE, 229
haven package, 25
hccm, 161
head, 23
Heckman selection model, 271
help, 7
heteroscedasticity, 161
    autoregressive conditional (ARCH), 211
hist, 51
histogram, 51
HSEINV.dta, 194
HTML documents, 287
I(...), 137
if, 70
import, 25
Inf, 178
influence.measures, 181
inner_join, 44
install.packages, 7
instrumental variables, 237
INTDEF.dta, 185, 188
interactions, 144
is.na, 178
ivreg, 237, 248
JTRAIN.dta, 243
kernel density plot, 51
KIELMC.dta, 217
knitr package, 8, 293
lag, 220
lapply, 71
LaTeX, 293
law of large numbers, 75
LAWSCH85.dta, 156, 179
least absolute deviations (LAD), 182
left_join, 44
legend, 31
length, 13
library, 8
likelihood ratio (LR) test, 258
linear probability model, 253
linearHypothesis, 124, 135, 142, 158, 162, 191
lines, 30
list, 20, 69, 85, 249
lm, 84, 105, 113
LM Test, 135
lmtest package, 8, 162, 165, 173, 175, 258
lmtest, 207, 208, 210
load, 23
log, 11
log files, 287
logarithmic model, 93
logical variable, 152
logical vector, 15
logit, 255
logitmfx, 262
logLik, 255
long-run propensity (LRP), 193, 273
lrtest, 258
lrtest(res), 258
lrtest(restr, unrestr), 258
ls, 12
magrittr package, 43
maps package, 8
margEff, 266
margin=1, 49
marginal effect, 260
matchCoefs, 126, 158
matplot, 31, 149
Matrix package, 19
matrix, 17
    multiplication, 19
matrix algebra, 19
max, 13
maximum likelihood estimation (MLE), 255
mean, 54, 179
mean absolute error (MAE), 281
MEAP93.dta, 96
measurement error, 175
median, 54
mfx package, 8, 262
mi package, 180
ungroup, 46
unit root, 200, 275
unobserved effects model, 222
ur.df, 275
urca package, 9, 275
var, 54
variance inflation factor (VIF), 113
vars package, 9, 280
vcovHAC, 210
vcovHC, 161, 234
vector, 12
vif, 115
VOTE1.dta, 87
WAGE1.dta, 86
WAGEPAN.dta, 225, 227, 230
WDI package, 9, 41, 44
WDI, 41
WDIsearch, 41
Weighted Least Squares (WLS), 168
while, 71
White standard errors, 161
White test for heteroscedasticity, 166