Econometrics in R
Grant V. Farnsworth∗
August 1, 2005
∗ This paper was originally written as part of a teaching assistantship and has subsequently become a personal
reference. I learned most of this stuff by trial and error, so it may contain inefficiencies, inaccuracies, or incomplete
explanations. If you find something here suboptimal or have suggestions, please let me know. Until at least 2008 I
can be contacted at [email protected].
Contents
1 Are You Ready for R?
  1.1 What is R?
  1.2 How is R Better Than Other Packages?
  1.3 Obtaining R
  1.4 Using R Interactively and Writing Scripts
  1.5 Getting Help
4 Special Regressions
  4.1 Models With Factors/Groups
  4.2 Logit/Probit
    4.2.1 Multinomial Logit
    4.2.2 Ordered Logit/Probit
  4.3 Tobit and Censored Regression
  4.4 Robust Regression - M Estimators
  4.5 Nonlinear Least Squares
  4.6 Two Stage Least Squares on a Single Structural Equation
  4.7 Systems of Equations
    4.7.1 Seemingly Unrelated Regression
    4.7.2 Two Stage Least Squares on a System
  5.7 Time Series Tests
    5.7.1 Durbin-Watson Test for Autocorrelation
    5.7.2 Box-Pierce and Breusch-Godfrey Tests for Autocorrelation
    5.7.3 Dickey-Fuller Test for Unit Root
  5.8 Vector Autoregressions (VAR)
6 Plotting
  6.1 Plotting Empirical Distributions
  6.2 Adding Legends and Stuff
  6.3 Multiple Plots
  6.4 Saving Plots
7 Statistics
  7.1 Working with Common Statistical Distributions
  7.2 P-Values
8 Math in R
  8.1 Matrix Operations
    8.1.1 Matrix Algebra and Inversion
    8.1.2 Factorizations
  8.2 Writing Functions
  8.3 Numerical Optimization
9 Programming
  9.1 Looping
  9.2 Conditionals
    9.2.1 Binary Operators
  9.3 The Ternary Operator
10 Changing Configurations
  10.1 Default Options
    10.1.1 Significant Digits
    10.1.2 What to do with NAs
    10.1.3 How to Handle Errors
    10.1.4 Suppressing Warnings
12 Conclusion
1.3 Obtaining R
The R installation program can be downloaded free of charge from http://www.r-project.org.
Because R is a programming language and not just an econometrics program, most of the functions
we will be interested in are available through libraries (sometimes called packages) obtained from
the R website. To obtain a library that does not come with the standard installation, follow the
CRAN link on the above website. Under contrib you will find a list of compressed libraries ready
for download. Click on the one you need and save it somewhere you can find it later. If you are
using a GUI, start R and click install package from local directory under the package menu, then
select the file that you downloaded. The package will then be available for use in the future. If you
are using R under Linux, install new libraries by issuing the following command at the command
prompt: “R CMD INSTALL packagename”
Alternately you can download and install packages at once from inside R by issuing a command
like
> install.packages(c("car","systemfit"),repo="http://cran.stat.ucla.edu",dep=TRUE)
which installs the car and systemfit libraries. The repo parameter is usually auto-configured, so
there is normally no need to specify it. The dep (dependencies) parameter indicates that R should
also download the packages that these depend on, and is recommended. Note: you must have
administrator (or root) privileges on your computer to install the program and packages.
Contributed Packages Mentioned in this Paper and Why
(* indicates package is included by default)
car Regression tests and robust standard errors
sem Two stage least squares
MASS Robust regression, ordered logit/probit
lmtest Breusch-Pagan and Breusch-Godfrey tests
sandwich (and zoo) Heteroskedasticity and autocorrelation robust covariance
tseries Garch, ARIMA, and other time series functions
MNP Multinomial probit via MCMC
Hmisc LaTeX export
xtable Alternative LaTeX export
systemfit SUR and 2SLS on systems of equations
fracdiff Fractionally integrated ARIMA models
survival Tobit and censored regression
nlme Nonlinear fixed and random effects models
nnet Multinomial logit/probit
ts* Time series manipulation functions
nls* Nonlinear least squares
foreign* Loading and saving data from other programs
zoo Required by the sandwich package
A script file containing R commands can be executed using the source() command
> source("mcmc.R")
One way to run R is to have a script file open in an external text editor and run periodically from
the R window. Commands executed from a script file may not print as much output to the screen as
they do when run interactively. If we want interactive-level verbosity, we can use the echo argument
> source("mcmc.R",echo=TRUE)
If no path is specified to the script file, R assumes that the file is located in the current working
directory. The working directory can be viewed or changed via R commands
> getwd()
[1] "/home/gvfarns/r"
> setwd("/home/gvfarns")
> getwd()
[1] "/home/gvfarns"
or under Windows by using the menu item change working directory. Also note that under Windows
the slashes should be replaced with double backslashes.
> getwd()
[1] "C:\\Program Files\\R\\rw1051\\bin"
> setwd("C:\\Program Files\\R\\scripts")
> getwd()
[1] "C:\\Program Files\\R\\scripts"
We can also run R in batch (noninteractive) mode under Linux by issuing the command: “R CMD
BATCH scriptname.R” The output will be saved in a file named scriptname.Rout. Batch mode is
also available under Windows using Rcmd.exe instead of Rgui.exe.
Since every command we will use is a function that is stored in one of the libraries, we will often
have to load libraries before working. Many of the common functions are in the library base, which
is loaded by default. For access to any other function, however, we have to load the appropriate
library.
> library(foreign)
will load the library that contains the functions for reading and writing data that is formatted for
other programs, such as SAS and Stata. Alternately (under Windows), we can pull down the package
menu and select library.
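The commands themselves were lost in extraction; presumably something like
> f <- c(7.5,6,5)
> F <- t(f)
The first line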
uses the c() (concatenate) command to create a COLUMN vector with values 7.5, 6, and 5. c()
is a generic function that can be used on multiple types of data. The t() command transposes f
to make a row vector. The two data objects f and F are separate because of the case sensitivity of
R. The command cbind() concatenates the objects given it side by side: into an array if they are
vectors, and into a single dataframe if they are columns of named data.
> dat <- cbind(c(7.5,6,5),c(1,2,3))
Similarly, rbind() concatenates objects by rows (one above the other).
Elements in vectors and similar data types are indexed using square brackets. R uses one-based
indexing.
> f
[1] 7.5 6.0 5.0
> f[2]
[1] 6
Notice that for multidimensional data types, such as matrices and dataframes, leaving an index
blank refers to the whole column or row corresponding to that index. Thus if foo is a 4x5 array of
numbers,
> foo
will print the whole array to the screen,
> foo[1,]
will print the first row,
> foo[,3]
will print the third column, etc. We can get summary statistics on the data in foo using the
summary() command, and we can determine its dimensionality using the NROW() and NCOL()
commands. More generally, we can use the dim() command to learn the dimensions of an R object.
If we wish to extract or print only certain rows or columns, we can use the concatenation operator.
> oddfoo <- foo[,c(1,3,5)]
makes a 4x3 array out of columns 1, 3, and 5 of foo and saves it in oddfoo. By prepending the
subtraction operator, we can remove certain rows or columns
> nooddfoo <- foo[,-c(1,3,5)]
makes a 4x2 array out of columns 2 and 4 of foo (i.e., it removes columns 1, 3, and 5). We can also
use comparison operators to extract certain columns or rows.
> smallfoo <- foo[ foo[,1]<1 ,]
compares each entry in the first column of foo to one and inserts the row corresponding to each
match into smallfoo. We can also reorder data. If wealth is a dataframe with columns year,gdp,
and gnp, we could sort the data by year using order() or extract a period of years using the colon
operator
> wealth <- wealth[ order(wealth$year),]
> firstten <- wealth[1:10,]
> eighty <- wealth[wealth$year==1980,]
This sorts by year and puts the first ten years of data in firstten. All rows from year 1980 are stored
in eighty (notice the double equals sign).
Using double instead of single brackets for indexing changes the behavior slightly. Basically it
doesn’t allow referencing multiple objects using a vector of indices, as the single bracket case does.
For example,
> w[[1:10]]
does not return a vector of the first ten elements of w, as it would in the single bracket case. Also,
it strips off attributes and types. If the variable is a list, indexing it with single brackets yields a list
containing the data, double brackets return the (vector of) data itself.
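For example, with a hypothetical list mylist:
> mylist <- list(a=c(1,2,3), b="hello")
> mylist[1]    # a list containing the vector a
> mylist[[1]]  # the vector 1 2 3 itself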
Occasionally we have data in the incorrect form (i.e., as a dataframe when we would prefer to
have a matrix). In this case we can use the as. functionality. If all the values in goo are numeric,
we could put them into a matrix named mgoo with the command
> mgoo <- as.matrix(goo)
Other data manipulation operations can be found in the standard R manual and online. There
are a lot of them.
2.2.2 Dataframes
Most econometric data will be in the form of a dataframe. A dataframe is a collection of columns
containing data, which need not all be of the same type, but each column must have the same
number of elements. Each column has a title by which the whole column may be addressed. If goo
is a 3x4 data frame with titles age, gender, education, and salary, then we can print the salary
column with the command
> goo$salary
or view the names of the columns in goo
> names(goo)
Most mathematical operations affect multidimensional data elementwise. From the previous
example,
> salarysq <- (goo$salary)^2
creates a vector whose entries are the squares of the corresponding entries in goo$salary. Most
mathematical operations behave as one would expect.
Output from actions can also be saved in the original variable, for example,
> salarysq <- sqrt(salarysq)
replaces each of the entries in salarysq with its square root.
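Similarly, a command along the lines of this hedged sketch
> goo$lnsalary <- log(goo$salary)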
adds a column named lnsalary to goo, containing the log of the salary.
2.2.3 Lists
A list is more general than a dataframe. It is essentially a bunch of data objects bound together,
optionally with a name given to each. These data objects may be scalars, strings, dataframes, or
any other type. Functions that return many elements of data (like summary()) generally bind the
returned data together as a list, since functions return only one data object. As with dataframes,
we can see what objects are in a list (by name if they have them) using the names() command.
Now mydata is a dataframe with named columns, ready for analysis. Note that R assumes that
there are no labels on the columns, and gives them default values, if you omit the header=TRUE
argument. Now let’s suppose that instead of blah.dat we have blah.dta, a Stata file.
> library(foreign)
> mydata <- read.dta("C:/WINDOWS/Desktop/blah.dta")
Alternately, we could “attach” the dataframe, which makes its columns available as regular
variables
> attach(byu)
> lm(salary ~ age + exper)
Notice the syntax of the model argument (using the tilde). The above command would correspond
to the linear model
salary = β0 + β1 age + β2 exper + ε    (1)
Using lm() by itself results in an abbreviated summary being sent to the screen, giving only the β
coefficient estimates. For more exhaustive analysis, we can save the results as a fitted-model object
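The commands were elided; presumably something like
> result <- lm(salary ~ age + exper)
> summary(result)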
The summary() command, run on raw data such as byu$age, gives statistics such as the mean
and median (these are also available through their own functions, mean() and median()). When run
on an lm object, summary() gives important statistics about the regression, such as p-values and the R².
The residuals and several other pieces of data can also be extracted from result, for use in other
computations. The variance-covariance matrix (of the beta coefficients) is accessible through the
vcov() command.
Notice that more complex formulae are allowed, including interaction terms (specified by mul-
tiplying two data members) and functions such as log() and sqrt(). Unfortunately, in order to
include a power term, such as age squared, we must either first compute the values and then run
the regression, or use the I() operator, which forces computation of its argument before evaluation
of the formula.
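The original examples were elided; a hedged sketch of the two approaches:
> agesq <- byu$age^2
> result <- lm(salary ~ age + agesq, data=byu)
or
> result <- lm(salary ~ age + I(age^2), data=byu)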
In order to run a regression without an intercept, we simply specify the intercept explicitly,
traditionally with a zero.
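A hedged example:
> lm(salary ~ age + exper + 0, data=byu)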
Where SSR is the residual sum of squares, LL is the log likelihood statistic, Yhat is the vector of
fitted values, Resid is the vector of residuals, s is the estimated standard deviation of the errors
(assuming homoskedasticity), CovMatrix is the variance-covariance matrix of the coefficients (also
available via vcov()), and other statistics are as named.
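The test command producing the following output was elided; presumably something like this
(assuming the lmtest library and a fitted model named unrestricted):
> library(lmtest)
> bptest(unrestricted)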
Breusch-Pagan test
data: unrestricted
BP = 44.5465, df = 1, p-value = 2.484e-11
This performs the “studentized” version of the test. In order to be consistent with some other
software (including ncv.test()) we can specify studentize=FALSE.
> unrestricted <- lm(y~x1+x2+x3+x4)
> rhs <- c(0,1)
> hm <- rbind(c(1,0,0,0,0),c(0,0,1,1,0))
> linear.hypothesis(unrestricted,hm,rhs)
Notice that if unrestricted is an lm object, an F test is performed by default; if it is a glm
object, a Wald χ² test is done instead. The type of test can be modified through the type argument.
Also, if we want to perform the test using heteroskedasticity or autocorrelation robust standard
errors, we can either specify white.adjust=TRUE to use White standard errors, or we can supply
our own covariance matrix using the vcov parameter. For example, if we had wished to use the
Newey-West corrected covariance matrix above, we could have specified
> linear.hypothesis(unrestricted,hm,rhs,vcov=NeweyWest(unrestricted))
See the section on heteroskedasticity robust covariance matrices for information about the NeweyWest()
function. We should remember that the specification white.adjust=TRUE corrects for heteroskedas-
ticity using an improvement to the White estimator. To use the classic White estimator, we can
specify white.adjust="hc0".
4 Special Regressions
4.1 Models With Factors/Groups
There is a separate datatype for qualitative factors in R. When a variable included in a regression is
of type factor, the requisite dummy variables are automatically created. For example, if we wanted
to regress the adoption of personal computers (pc) on the number of employees in the firm (emple)
and include a dummy for each state (where state is a vector of two letter abbreviations), we could
simply run the regression
> summary(lm(pc~emple+state))
Call:
lm(formula = pc ~ emple + state)
Residuals:
Min 1Q Median 3Q Max
-1.7543 -0.5505 0.3512 0.4272 0.5904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.572e-01 6.769e-02 8.232 <2e-16 ***
emple 1.459e-04 1.083e-05 13.475 <2e-16 ***
stateAL -4.774e-03 7.382e-02 -0.065 0.948
stateAR 2.249e-02 8.004e-02 0.281 0.779
stateAZ -7.023e-02 7.580e-02 -0.926 0.354
stateDE 1.521e-01 1.107e-01 1.375 0.169
...
4.2 Logit/Probit
There are several ways to do logit and probit regressions in R. The simplest way may be to use the
glm() command with the family option.
> h <- glm(c~y, family=binomial(link="logit"))
or replace logit with probit for a probit regression. The glm() function produces an object similar
to that of lm(), so it can be analyzed using the summary() command. In order to extract the
log likelihood statistic, use the logLik() command.
> logLik(h)
‘log Lik.’ -337.2659 (df=1)
There is also a special package for binary dependent variable regressions called boolean. The
boolean framework generally requires that a boolean data object be prepared using boolprep() and
passed to boolean(). It also includes functions to plot and do tests.
> g <- pc*1 + inetacc*10 + iapp*100
> multinom(factor(g)~pc.subsidy+inet.subsidy+iapp.subsidy+emple+msamissing)
The second argument to the Surv() function specifies whether each observation has been censored
or not (one indicating that it was observed and zero that it was censored). The third argument
indicates on which side the data was censored. Since it was the lower tail of this distribution that
got censored, we specify left. The dist option passed to the survreg is necessary in order to get a
classical Tobit model.
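The commands were elided; a hedged sketch of a classical Tobit fit for data censored from below
at zero, assuming the survival library and variable names of our own choosing:
> library(survival)
> tfit <- survreg(Surv(y, y>0, type="left") ~ x1 + x2, dist="gaussian")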
Y = F(X; β) + ε    (2)
Notice that the error term must be additive in the functional form. If it is not, transform the model
equation so that it is. The R function for nonlinear least squares is nls() and has a syntax similar
to lm(). Consider the following nonlinear example.
Y = ε / (1 + e^(β1 X1 + β2 X2))    (3)
log(Y) = −log(1 + e^(β1 X1 + β2 X2)) + log(ε)    (4)
The second equation is the transformed version that we will use for the estimation. nls() takes
the formula as its first argument and also requires starting estimates for the parameters. The entire
formula should be specified, including the parameters. R looks at the starting values to see which
parameters it will estimate.
> result <- nls(log(Y)~-log(1+exp(a*X1+b*X2)),start=list(a=1,b=1),data=mydata)
stores estimates of a and b in an nls object called result. Estimates can be viewed using the
summary() command. In the most recent versions of R, the nls() command is part of the base
package, but in older versions, we may have to load the nls library.
4.7.2 Two Stage Least Squares on a System
Instruments can be used as well in order to do a two stage least squares on the above system. We
create a model object (with no left side) to specify the instruments that we will use and specify the
2SLS option
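The code was elided; a hedged sketch using the systemfit library, with hypothetical equation and
instrument names (the argument order follows the older systemfit interface):
> library(systemfit)
> inst <- ~ z1 + z2 + z3
> system <- list(demand=demandEq, supply=supplyEq)
> fit2sls <- systemfit("2SLS", system, inst=inst)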
There are also routines for three stage least squares, weighted two stage least squares, and a host of
others.
Most time-series related functions automatically coerce the data into ts format, so this command is
often not necessary.
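The commands were lost in extraction; presumably something like
> ysmall <- y[-1]
> ylag <- y[-NROW(y)]
which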
produce a once lagged version of y relative to ysmall. This way of generating lags can get awkward
if we are trying combinations of lags in regressions because for each lagged version of the variable,
conformability requires that we have a corresponding version of the original data that has the first
few observations removed.
Another way to lag data is to convert it to a time series object and use the lag() function. It
is very important to remember that this function does not actually change the data, it changes an
attribute of a time series object that indicates where the series starts. This allows for more flexibility
with time series functions, but it can cause confusion for general functions such as lm() that do
not understand time series attributes. Notice that lag() only works usefully on time series
objects. For example, the code snippet
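(the snippet itself was lost in extraction; presumably something like)
> d <- a - lag(a)
which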
creates a vector of zeros named d if a is a normal vector, but returns a ts object with the first
difference of the series if a is a ts object. There is no warning issued if lag() is used on regular data,
so care should be exercised.
In order to use lagged data in a regression, we can use time series functions to generate a
dataframe with various lags of the data and NA characters stuck in the requisite leading and trailing
positions. In order to do this, we use the ts.union() function. Suppose X and Y are vectors of
ordinary data and we want to include a three times lagged version of X in the regression, then
> y <- ts(Y)
> x <- ts(X)
> x3 <- lag(x,-3)
> d <- ts.union(y,x,x3)
converts the vectors to ts data and forms a multivariate time series object with columns y_t, x_t,
and x_{t−3}. Again, remember that data must be converted to time series format before lagging or
binding together with the union operator in order to get the desired offset. The ts.union() function
automatically decides on a title for each column, much as the data.frame() command does. We
can also do the lagging inside the union and assign our own titles
> y <- ts(Y)
> x <- ts(X)
> d <- ts.union(y,x,x1=lag(x,-1),x2=lag(x,-2),x3=lag(x,-3))
It is critical to note that the lag operator works in the opposite direction of what one
might expect: positive lag values result in leads and negative lag values result in lags.
When the resulting multivariate time series object is converted to a data frame (as it is when
read by lm(), for example), the offset will be preserved. Then
> lm(y~x3,data=d)
will regress y_t on x_{t−3}.
Also note that by default observations that have a missing value (NA) are omitted. This is
what we want. If the default setting has somehow been changed, we should include the argument
na.action=na.omit in the lm() call. In order to get the right omission behavior, it is generally
necessary to bind all the data we want to use (dependent and independent variables) together in a
single union.
In summary, in order to use time series data, convert all data to type ts, lag it appropriately
(using the strange convention that positive lags are leads), and bind it all together using ts.union().
Then proceed with the regressions and other operations.
5.2 Filters
5.2.1 Canned AR and MA filters
One can pass data through filters constructed by polynomials in the lag operator using the filter()
command. It handles two main types of filters: autoregressive or “recursive” filters and moving
average or “convolution” filters. The first type is of the form
y = (1 + a_1 L + a_2 L^2 + … + a_p L^p) x
and the second has the same form except that it does not include the implied unit coefficient on the
zero lag. Further, for convolution filters, if we specify sides=2 the filter coefficients will be centered
about lag zero (including as many leads as lags) unless there is an even number of coefficients, in which
case one more lead than lag is included.
When we use the filter() command, we supply the vector of coefficients a as follows
> y <- filter(x,c(.2,-.35,.1),method="recursive")
The data vector x may be a time series object or a normal vector of data, and the output y will be
a ts object.
This is equivalent to constructing the filtered series manually using lag(), except that the filter()
command by default inserts zeros (or a pre-specified vector) for missing beginning data, whereas the
manual approach omits the observations for which lagged data is unavailable. Notice that the manual
approach will only work if x is a ts object.
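The next passage concerns the Hodrick-Prescott filter, whose function definition was elided; a
standard implementation (a sketch, assuming x is the series to smooth) is
> hpfilter <- function(x, lambda=1600){
+ eye <- diag(length(x))
+ solve(eye + lambda*crossprod(diff(eye, lag=1, differences=2)), x)
+ }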
where lambda is the standard tuning parameter, often set to 1600 for macroeconomic data. Passing
a series to this function will return the smoothed series.
This filter is also a special case of the smooth.spline() function in which the parameter
all.knots=TRUE has been passed. Unfortunately, the tuning parameter for the smooth.spline()
function, spar, is different from the lambda above and we have not figured out how to convert from
spar to lambda. If we knew the appropriate value of spar to use, the filter would be
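(a hedged sketch; myspar stands in for that value)
> hptrend <- smooth.spline(x, all.knots=TRUE, spar=myspar)$y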
Otherwise we could generate this list manually. Recall that a state space model can be written
y_t = Z′ a_t + η_t
a_{t+1} = T a_t + R e_t
where η and e are normally distributed disturbances, a is the unobserved state vector, and y is the
observed data vector. The elements of the list represent the coefficients here. For more information
on generating a state-space model, see help on KalmanLike.
Once we have the state-space list, we can use KalmanLike, KalmanSmooth, and KalmanPredict
to get estimated likelihoods, state estimates, and predicted values, respectively.
5.3 ARIMA/ARFIMA
The arima() command from the ts library can fit time series data using an autoregressive inte-
grated moving average model
Δ^d y_t = φ_1 Δ^d y_{t−1} + … + φ_p Δ^d y_{t−p} + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q}    (5)
where
Δy_t = y_t − y_{t−1}    (6)
The parameters p, d, and q specify the order of the arima model. These values are passed as a
vector c(p,d,q) to arima(). Notice that the model used by R makes no assumption about the sign
of the θ terms, so the sign of the corresponding coefficients may differ from those of other software
packages (such as S+).
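The fitting commands were elided; presumably something like
> ar1 <- arima(y, order=c(1,0,0))
> ma1 <- arima(y, order=c(0,0,1))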
Data-members ar1 and ma1 contain estimated coefficients obtained by fitting y with an AR(1) and
MA(1) model, respectively. They also contain the log likelihood statistic and estimated standard
errors.
If we are modeling a simple autoregressive model, we could also use the ar() command, from the
ts package, which either takes as an argument the order of the model or picks a reasonable default
order.
To fit a fractionally integrated ARIMA (ARFIMA) model, we can use the fracdiff library
> library(fracdiff)
> fracdiff(y,nar=2,nma=1)
5.4 ARCH/GARCH
R can numerically fit data using a generalized autoregressive conditional heteroskedasticity model
GARCH(p,q), written
σ²_t = α_0 + δ_1 σ²_{t−1} + … + δ_p σ²_{t−p} + α_1 ε²_{t−1} + … + α_q ε²_{t−q}    (7)
Setting p = 0 we obtain the ARCH(q) model. The R command garch() comes from the tseries
library. Its syntax is
> archoutput <- garch(y,order=c(0,3))
> garchoutput <- garch(y,order=c(2,3))
so that archoutput is the result of modeling an ARCH(3) model and garchoutput is the result
of modeling a GARCH(2,3). Notice that the first value in the order argument is p, the number
of delta (GARCH) terms, and the second is q, the number of alpha (ARCH) terms. The resulting
coefficient estimates will be named a0, a1, etc. for the alpha parameters and b1, b2, etc. for the
delta parameters.
5.5 Correlograms
It is common practice when analyzing time series data to plot the autocorrelation and partial autocor-
relation functions in order to try to guess the functional form of the data. To plot the autocorrelation
and partial autocorrelation functions, use the ts library functions acf() and pacf(), respectively.
The following commands plot the ACF and PACF on the same graph, one above (not on top of)
the other. See section on plotting for more details.
> par(mfrow=c(2,1))
> acf(y)
> pacf(y)
[Figure: ACF (top panel) and partial ACF (bottom panel) of the series y, for lags 0 through 25]
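The prediction command itself was elided; presumably something like (assuming a fitted arima
model named ar1)
> predict(ar1, n.ahead=5)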
returns predictions on five periods following the data in y, along with corresponding standard error
estimates.
5.7 Time Series Tests
5.7.1 Durbin-Watson Test for Autocorrelation
The Durbin-Watson test for autocorrelation can be administered using the durbin.watson() func-
tion from the car library. It takes as its argument an lm object (the output from an lm() command)
and returns the autocorrelation, DW statistic, and an estimated p-value. The number of lags can
be specified using the max.lag argument. See help file for more details.
> library(car)
> results <- lm(Y ~ x1 + x2)
> durbin.watson(results,max.lag=2)
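The command producing the Box-Pierce output below was elided; presumably something like
(assuming a fitted model a whose residuals are stored in a$resid)
> Box.test(a$resid)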
Box-Pierce test
data: a$resid
X-squared = 18.5114, df = 1, p-value = 1.689e-05
would lead us to believe that the model may not be correctly specified, since we soundly re-
ject the Box-Pierce null. If we want the Ljung-Box test instead, we include the parameter
type="Ljung-Box".
For an appropriate model, this test is asymptotically equivalent to the Breusch-Godfrey test,
which is available in the lmtest library as bgtest(). It takes a fitted lm object instead of a
vector of data as an argument.
> library(tseries)
> adf.test(y)
data: y
Dickey-Fuller = -2.0135, Lag order = 7, p-value = 0.5724
alternative hypothesis: stationary
To fit a vector autoregression on several series, we need only bind the vectors together as a
dataframe and give that dataframe as an argument to ar(). Notice that ar() by default uses AIC
to determine how many lags to use, so it may be necessary to specify aic=FALSE and/or an
order.max parameter. Remember that if aic is TRUE (the default), the function uses AIC to choose
a model using up to the number of lags specified by order.max.
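A minimal sketch consistent with this description (series names hypothetical):
> y <- data.frame(gdp, m1)
> var2 <- ar(y, aic=FALSE, order.max=2)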
6 Plotting
One of R’s strongest points is its graphical ability. It provides both high level plotting commands
and the ability to edit even the smallest details of the plots.
The plot() command opens a new window and plots the series of data given it. By default
a single vector is plotted as a time series line. If two vectors are given to plot(), the values are
plotted in the x-y plane using small circles. The type of plot (scatter, lines, histogram, etc.) can be
determined using the type argument. Strings for the main, x, and y labels can also be passed to
plot().
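The example command was elided; presumably something like
> plot(x, y, type="l", main="Y against X", xlab="x", ylab="y")
which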
plots a line in the x-y plane, for example. Colors, symbols, and many other options can be passed
to plot(). For more detailed information, see the help system entries for plot() and par().
After a plotting window is open, if we wish to superimpose another plot on top of what we
already have, we use the lines() command or the points() command, which draw connected lines
and scatter plots, respectively. Many of the same options that apply to plot() apply to lines()
and a host of other graphical functions.
We can plot a line, given its coefficients, using the abline() command. This is often useful in
visualizing the placement of a regression line after a bivariate regression
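For example, a minimal sketch:
> plot(x, y)
> abline(lm(y ~ x))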
[Figure: kernel density estimate of Y; N = 25, bandwidth = 1.386]
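A figure like the one above could be produced with something like (a hedged sketch, assuming a
data vector y)
> d <- density(y)
> plot(d, main="Kernel Density Estimate of Y")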
We can also plot the empirical CDF of a set of data using the ecdf() command from the stepfun
library, which is included in the default distribution. We could then plot the estimated CDF using
plot().
> library(stepfun)
> d <- ecdf(y)
> plot(d,main="Empirical CDF of Y")
[Figure: “Predicted vs True” - the true and predicted values plotted one above the other]
Notice that we saved the current settings in op before plotting so that we could restore them after
our plotting and that we must set the no.readonly attribute while doing this.
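The plotting code was elided; a hedged reconstruction of the pattern described (series names
hypothetical):
> op <- par(no.readonly=TRUE)
> par(mfrow=c(2,1))
> plot(truevals, main="true values")
> plot(predicted, main="predicted")
> par(op)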
> png("myplot.png")
> plot(x,y,main="A Graph Worth Saving")
> dev.off()
creates a png file of the plot of x and y. In the case of a postscript file, if we intend to include the
graphics in another file (like a LaTeX document), we could modify the default postscript settings
controlling the paper size and orientation. Notice that when the special paper size is used, the
width and height must be specified. Actually with LaTeX we often resize the image explicitly, so the
resizing may not be that important.
> postscript("myplot.eps",paper="special",width=4,height=4,horizontal=FALSE)
> plot(x,y,main="A Graph Worth Including in LaTeX")
> dev.off()
One more thing to notice is that the default paper size is a4, which is the European standard. For
8.5x11 paper, we use paper="letter". When using images that have been generated as a postscript,
then converted to pdf, incorrect paper specifications are a common problem.
There is also a pdf() command that works the same way the postscript command does, except
that by default its paper size is special with a height and width of 6 inches.
7 Statistics
R has extensive statistical functionality. The functions mean(), sd(), min(), max(), and var()
operate on data as we would expect2 .
> rnorm(1,mean=2,sd=3)
[1] 2.418665
> pnorm(2.418665,mean=2,sd=3)
[1] 0.5554942
> dnorm(2.418665,mean=2,sd=3)
[1] 0.1316921
> qnorm(.5554942,mean=2,sd=3)
[1] 2.418665
These functions generate a random number from the N(2,9) distribution, calculate its cdf and pdf
value, and then verify that the cdf value corresponds to the original observation. If we had not
specified the mean and standard deviation, R would have assumed standard normal. Note that we
could replace norm with binom, nbinom, chisq, t, f, or other distribution names if appropriate.
Command Meaning
rX() Generate random vector from distribution X
dX() Return the value of the PDF of distribution X
pX() Return the value of the CDF of distribution X
qX() Return the number at which the CDF hits input value [0,1]
2 Note: the functions pmax() and pmin() work like max() and min() but operate elementwise on vectors or matrices.
7.2 P-Values
By way of example, in order to calculate the p-value of 3.6 using an f (4, 43) distribution, we would
use the command
> 1-pf(3.6,4,43)
[1] 0.01284459
and find that we fail to reject at the 1% level, but we would be able to reject at the 5% level.
Remember, if the p-value is smaller than the alpha value, we are able to reject. Also recall that the
p-value should be multiplied by two if we are doing a two tailed test. For example, the one and
two tailed tests of a t statistic of 2.8 with 21 degrees of freedom would be, respectively
> 1-pt(2.8,21)
[1] 0.005364828
> 2*(1-pt(2.8,21))
[1] 0.01072966
So we would reject the null hypothesis of insignificance at the 1% level if it were a one tailed
test (remember, small p-value, more evidence in favor of rejection), but we would fail to reject in
the sign-agnostic case.
8 Math in R
8.1 Matrix Operations
8.1.1 Matrix Algebra and Inversion
Most R commands work with multiple types of data. Most standard mathematical functions and
operators (including multiplication, division, and powers) operate on each component of multidi-
mensional objects. Thus the operation A*B, where A and B are matrices, multiplies corresponding
components. In order to do matrix multiplication or inner products, use the %*% operator. Notice
that in the case of matrix-vector multiplication, R will automatically make the vector a row or
column vector, whichever is conformable. Matrix inversion is obtained via the solve() function.
(Note: if solve() is passed a matrix and a vector, it solves the corresponding linear problem) The
t() function transposes its argument. Thus the OLS estimator
β = (X′X)⁻¹ X′Y    (8)
can be computed as in the hedged sketch below (X the regressor matrix, Y the response vector)
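> bhat <- solve(t(X) %*% X) %*% t(X) %*% Y
or more efficiently
> bhat <- solve(t(X) %*% X, t(X) %*% Y)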
The Kronecker product is also supported and is specified by the %x% operator.
8.1.2 Factorizations
R can compute the standard matrix factorizations. The Cholesky factorization of a symmetric
positive definite matrix is available via chol(). It should be noted that chol() does not check for
symmetry in its argument, so the user must be careful.
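For example, a minimal sketch:
> A <- matrix(c(4,2,2,3), 2, 2)  # symmetric positive definite
> R <- chol(A)                   # upper triangular factor
> t(R) %*% R                     # recovers A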
We can also extract the eigenvalue decomposition of a symmetric matrix using eigen(). By
default this routine checks the input matrix for symmetry; the parameter symmetric=TRUE may
be specified to skip this check if we know the matrix is symmetric by construction.
If the more general singular value decomposition is desired, we use instead svd().
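8.2 Writing Functions
A function is defined using function(arglist){body}. The example definition discussed below was
lost in extraction; a hypothetical stand-in consistent with the calls shown is
> f <- function(x, Alpha){
+ sin(x) + Alpha
+ }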
Notice that R changes the prompt to a “+” sign to remind us that we are inside brackets.
Because R does not distinguish what kind of data object a variable in the parameter list is, we
should be careful how we write our functions. If x is a vector, the above functions would return a
vector of the same dimension. Also, notice that if an argument has a long name, it can be abbreviated
as long as the abbreviation is unique. Thus the following two statements are equivalent
> f(c(2,4,1),Al=3)
> f(c(2,4,1),Alpha=3)
Variables that are not passed in as arguments are not available within functions and variables
defined within functions are unavailable outside of the function. Changing the value of a passed-in
argument within a function does not change its value outside of the function. In other words, R
passes arguments by value and variable scoping applies.
The nlm() command minimizes an arbitrary user-written function numerically. The first argument
of the user-defined function should be the parameter(s) over which R will minimize the function;
additional arguments to the function (constants) should be specified by name in the nlm call. In
order to maximize a function, multiply the function by -1 and minimize it.
> g <- function(x,A,B){
+ out <- sin(x[1])-sin(x[2]-A)+x[3]^2+B
+ out
+ }
> results <- nlm(g,c(1,2,3),A=4,B=2)
> results$min
[1] 6.497025e-13
> results$est
[1] -1.570797e+00 -7.123895e-01 -4.990333e-07
This function uses a matrix-secant method that numerically approximates the gradient, but if the
return value of the function contains an attribute called gradient, it will use a quasi-newton method.
The gradient based optimization corresponding to the above would be
> g <- function(x,A,B){
+ out <- sin(x[1])-sin(x[2]-A)+x[3]^2+B
+ grad <- function(x,A){
+ c(cos(x[1]),-cos(x[2]-A),2*x[3])
+ }
+ attr(out,"gradient") <- grad(x,A)
+ return(out)
+ }
> results <- nlm(g,c(1,2,3),A=4,B=2)
Other optimization functions which may be of interest are optimize() for one-dimensional min-
imization, uniroot() for root finding, and deriv() for computing symbolic derivatives of simple
expressions.
9 Programming
9.1 Looping
Looping is performed using the for command. Its syntax is as follows
> for (i in 1:20){
+ cat(i)
> }
where cat() may be replaced with the block of code we wish to repeat. Instead of 1:20, a vector
or matrix of values can be used; the index variable will take on each value in the vector or matrix
and run the code contained in the curly brackets.
If we simply want a loop to run until something happens to stop it, we could use the repeat
loop and a break
> repeat {
+ g <- rnorm(1)
+ if (g > 2.0) break
+ cat(g);cat("\n")
> }
Notice that the second cat() command issues a newline character, so the output is not squashed onto
one line. The semicolon lets R know where one command ends and the next begins when we put
several commands on a line. For example, the above is equivalent to
> repeat {g <- rnorm(1);if (g>2.0) break;cat(g);cat("\n");}
9.2 Conditionals
9.2.1 Binary Operators
Conditionals, like the rest of R, are highly vectorized. The comparison
> x < 3
returns a vector of TRUE/FALSE values if x is a vector. This vector can then be used in compu-
tations. For example, we could set all x values that are less than 3 to zero with one command
> x[x<3] <- 0
The conditional within the brackets evaluates to a TRUE/FALSE vector. Wherever the value is
TRUE, the assignment is made. Of course, the same computation could be done using a for loop
and the if command.
> for (i in 1:NROW(x)){
+ if (x[i] < 3) {
+ x[i] <- 0
+ }
+ }
Because R is highly vectorized, the latter code works much more slowly than the former. It is good
programming practice to avoid loops and if statements whenever possible when writing in any
scripting language.
The Boolean Operators
! x NOT x
x & y x and y elementwise
x && y x and y total object
x | y x or y elementwise
x || y x or y total object
xor(x, y) x xor y (true if one and only one argument is true)
10 Changing Configurations
10.1 Default Options
A number of runtime options relating to R’s behavior are governed by the options() function.
Running this function with no arguments returns a list of the current options. One can change the
value of a single option by passing the option name and a new value. For temporary changes, the
option list may be saved and then reused.
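For example, a minimal sketch:
> oldopts <- options()   # save all current settings
> options(warn=-1)       # temporarily suppress warnings
> options(oldopts)       # restore the saved settings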
> options(digits=10)
> options(error=recover)
> source("log.R")
Error: subscript out of bounds
1: source("log.R")
2: eval.with.vis(ei, envir)
3: eval.with.vis(expr, envir, enclos)
4: mypredict(v12, newdata = newdata)
Selection: 4
Called from: eval(expr, envir, enclos)
Browse[1]> i
[1] 1
Browse[1]> j
[1] 301
Pressing enter while in browse mode takes the user back to the menu. After debugging, we can set
error to NULL again.
> sink("myoutput.txt")
> source("rcode.R",echo=T)
> sink()
R can save plots and graphs as image files as well. Under Windows, simply click once on the
graph so that it is in the foreground and then go to file/Save as and save it as jpeg or png. There are
also ways to save as an image or postscript file from the command line, as described in the plotting
section.
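The export command was elided; presumably something like (assuming the Hmisc library and a
fitted model named result)
> library(Hmisc)
> latex(summary(result)$coefficients, file="summary.tex")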
produces a file named “summary.tex” that produces the following when included in a LaTeX source
file3
summary        Estimate      Std. Error    t value     Pr(>|t|)
(Intercept)    17.2043926    0.088618337   194.140323  0.00000e+00
exper          −0.4126387    0.008851445   −46.618227  0.00000e+00
south          −0.7098870    0.074707431   −9.502228   4.05227e−21
which we see is pretty much what we want. The table lacks a title and the math symbols in the
p-value column are not contained in $ characters. Fixing these by hand we get
OLS regression of educ on exper and south
summary        Estimate      Std. Error    t value     Pr(>|t|)
(Intercept)    17.2043926    0.088618337   194.140323  0.00000e+00
exper          −0.4126387    0.008851445   −46.618227  0.00000e+00
south          −0.7098870    0.074707431   −9.502228   4.05227e−21
Notice that the latex() command takes matrices, summaries, regression output, dataframes,
and many other data types. Another option, which may be more flexible, is the xtable() function
from the xtable library.
12 Conclusion
R provides an effective platform for econometric computation and research. It has built in function-
ality sufficiently advanced for professional research and has a reasonably steep learning curve (if you
put knowledge on the y axis and effort on the x). Because R is a programming language as well as
an econometrics program, it allows for more complex, tailored computations and simulations than
one would get in a prepackaged system. On the other hand, it takes some time to become familiar
with the syntax and reasoning of the language. I hope that this guide eases the burden of learning to
program and do standard data analysis in the finest statistical environment available. Don’t forget
to let me know if you feel like I didn’t do this or have a suggestion about how I could do it better.
3 Under Linux, at least, the latex() command also pops up a window showing how the output will look.
13 Appendix: Code Examples
13.1 Monte Carlo Simulation
The following block of code creates a vector of randomly distributed data X with 25 members. It
then creates a y vector that is conditionally distributed as
y = 2 + 3x + ε.    (9)
It then does a regression of y on x and stores the slope coefficient. The generation of y and calculation
of the slope coefficient are repeated 500 times. The mean and sample variance of the slope coefficient
are then calculated. A comparison of the sample variance of the estimated coefficient with the
analytic solution for the variance of the slope coefficient is then possible.
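The code listing itself was lost in extraction; a minimal sketch consistent with the description:
> x <- rnorm(25)
> slopes <- rep(0, 500)
> for (i in 1:500){
+ y <- 2 + 3*x + rnorm(25)
+ slopes[i] <- lm(y ~ x)$coefficients[2]
+ }
> mean(slopes)
> var(slopes)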
13.2 The Haar Wavelet
The following code defines a function that returns the value of the Haar wavelet (which is −1/√2
on (−1,0), 1/√2 on (0,1), and 0 elsewhere) of the scalar or vector passed to it. Notice that a better
version of this code would use a vectorized comparison, but this is an example of conditionals,
including the else statement. The interested student could rewrite this function without using a loop.
> haar <- function(x){
+ y <- x*0
+ for(i in 1:NROW(y)){
+ if(x[i]<0 && x[i]>-1){
+ y[i]=-1/sqrt(2)
+ } else if (x[i]>0 && x[i]<1){
+ y[i]=1/sqrt(2)
+ }
+ }
+ y
+ }
Notice also the use of the logical ‘and’ operator, &&, in the if statement. The logical ‘or’ operator is
the double vertical bar, ||. These operators examine only the first element of each operand and
return a single TRUE or FALSE, which is what an if statement expects. For elementwise comparisons
on whole vectors, use the single & and | operators.
13.3 Maximum Likelihood Estimation
Now we consider code to find the likelihood estimator of the coefficients in a nonlinear model. Let
us assume a normal distribution on the additive errors
y = a L^b K^c + ε    (11)
Notice that the best way to solve this problem is a nonlinear least squares regression using nls().
We do the maximum likelihood estimation anyway. First we write a function that returns the log
likelihood value (actually the negative of it, since minimization is more convenient) then we optimize
using nlm(). Notice that Y, L, and K are vectors of data and a, b, and c are the parameters we wish
to estimate.
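The listing was lost in extraction; a sketch under the stated assumptions, with the parameters a,
b, c and the error standard deviation collected in a vector theta:
> mloglik <- function(theta, Y, L, K){
+ n <- length(Y)
+ e <- Y - theta[1] * L^theta[2] * K^theta[3]
+ (n/2)*log(2*pi) + n*log(theta[4]) + sum(e^2)/(2*theta[4]^2)
+ }
> mlem <- nlm(mloglik, c(1, 0.75, 0.25, 1), Y=Y, L=L, K=K)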