Econometrics in R
Grant V. Farnsworth∗
June 26, 2006
∗ This paper was originally written as part of a teaching assistantship and has subsequently become a personal reference.
I learned most of this stuff by trial and error, so it may contain inefficiencies, inaccuracies, or incomplete explanations. If
you find something here suboptimal or have suggestions, please let me know. Until at least 2008 I can be contacted at
[email protected].
Contents

1 Introductory Comments
1.1 What is R?
1.2 How is R Better Than Other Packages?
1.3 Obtaining R
1.4 Using R Interactively and Writing Scripts
1.5 Getting Help

4 Special Regressions
4.1 Fixed/Random Effects Models
4.1.1 Fixed Effects
4.1.2 Random Effects
4.2 Qualitative Response
4.2.1 Logit/Probit
4.2.2 Multinomial Logit
4.2.3 Ordered Logit/Probit
4.3 Tobit and Censored Regression
4.4 Quantile Regression
4.5 Robust Regression - M Estimators
4.6 Nonlinear Least Squares
4.7 Two Stage Least Squares on a Single Structural Equation
4.8 Systems of Equations
4.8.1 Seemingly Unrelated Regression
4.8.2 Two Stage Least Squares on a System

5 Time Series Regression
5.1 Differences and Lags
5.2 Filters
5.2.1 Canned AR and MA filters
5.2.2 Manual Filtration
5.2.3 Hodrick Prescott Filter
5.2.4 Kalman Filter
5.3 ARIMA/ARFIMA
5.4 ARCH/GARCH
5.4.1 Basic GARCH–garch()
5.4.2 Advanced GARCH–garchFit()
5.4.3 Miscellaneous GARCH–Ox G@RCH
5.5 Correlograms
5.6 Predicted Values
5.7 Time Series Tests
5.7.1 Durbin-Watson Test for Autocorrelation
5.7.2 Box-Pierce and Breusch-Godfrey Tests for Autocorrelation
5.7.3 Dickey-Fuller Test for Unit Root
5.8 Vector Autoregressions (VAR)

6 Plotting
6.1 Plotting Empirical Distributions
6.2 Contour Plots
6.3 Adding Legends and Stuff
6.4 Adding Arrows, Text, and Markers
6.5 Multiple Plots
6.6 Saving Plots—png, jpg, eps, pdf, xfig
6.7 Adding Greek Letters and Math Symbols to Plots

7 Statistics
7.1 Working with Common Statistical Distributions
7.2 P-Values
7.3 Sampling from Data

8 Math in R
8.1 Matrix Operations
8.1.1 Matrix Algebra and Inversion
8.1.2 Factorizations
8.2 Numerical Optimization
8.3 Numerical Integration

9 Programming
9.1 Writing Functions
9.2 Looping
9.3 Avoiding Loops
9.3.1 Applying a Function to an Array (or a Cross Section of it)
9.3.2 Replicating
9.4 Conditionals
9.4.1 Binary Operators
9.5 The Ternary Operator
9.6 Outputting Text
9.7 Pausing/Getting Input
9.8 Timing Blocks of Code
9.9 Calling C functions from R
9.9.1 How to Write the C Code
9.9.2 How to Use the Compiled Functions
9.10 Calling R Functions from C

10 Changing Configurations
10.1 Default Options
10.1.1 Significant Digits
10.1.2 What to do with NAs
10.1.3 How to Handle Errors
10.1.4 Suppressing Warnings

12 Conclusion
1.3 Obtaining R
The R installation program can be downloaded free of charge from http://www.r-project.org. Because
R is a programming language and not just an econometrics program, most of the functions we will be
interested in are available through libraries (sometimes called packages) obtained from the R website. To
obtain a library that does not come with the standard installation follow the CRAN link on the above
website. Under contrib you will find a list of compressed libraries ready for download. Click on the one
you need and save it somewhere you can find it later. If you are using a GUI, start R and click install package
from local directory under the package menu. Then select the file that you downloaded. Now the package
will be available for use in the future. If you are using R under Linux, install new libraries by issuing the
following command at the command prompt: "R CMD INSTALL packagename"
Alternately you can download and install packages at once from inside R by issuing a command like
> install.packages(c("car","systemfit"),repos="http://cran.stat.ucla.edu",dep=TRUE)
which installs the car and systemfit libraries. The repos parameter names the mirror from which to
download; it is usually auto-configured, so there is normally no need to specify it. The dependencies
or dep parameter indicates that R should download
packages that these depend on as well, and is recommended. Note: you must have administrator (or root)
privileges to your computer to install the program and packages.
Contributed Packages Mentioned in this Paper and Why
(* indicates package is included by default)
adapt Multivariate numerical integration
car Regression tests and robust standard errors
dse1 State space models, Kalman filtration, and Vector ARMA
fSeries Garch models with nontrivial mean equations
fracdiff Fractionally integrated ARIMA models
foreign* Loading and saving data from other programs
graphics* Contour graphs and arrows for plots
Hmisc LaTeX export
lattice An alternate type of contour plot
lmtest Breusch-Pagan and Breusch-Godfrey tests
MASS* Robust regression, ordered logit/probit
Matrix Matrix norms and other matrix algebra stuff
MCMCpack Inverse gamma distribution
MNP Multinomial probit via MCMC
nlme* Nonlinear fixed and random effects models
nls* Nonlinear least squares
nnet Multinomial logit/probit
quantreg Quantile Regressions
R.matlab Read matlab data files
sandwich (and zoo) Heteroskedasticity and autocorrelation robust covariance
sem Two stage least squares
survival* Tobit and censored regression
systemfit SUR and 2SLS on systems of equations
ts* Time series manipulation functions
tseries Garch, ARIMA, and other time series functions
xtable Alternative LaTeX export
zoo Required in order to have the sandwich package
From time to time we can get updates of the installed packages by running update.packages().
> setwd("/home/gvfarns")
> getwd()
[1] "/home/gvfarns"
or under Windows using the menu item change working directory. Also note that when using older versions
of R under Windows the slashes must be replaced with double backslashes.
> getwd()
[1] "C:\\Program Files\\R\\rw1051\\bin"
> setwd("C:\\Program Files\\R\\scripts")
> getwd()
[1] "C:\\Program Files\\R\\scripts"
We can also run R in batch (noninteractive) mode under Linux by issuing the command: "R CMD BATCH
scriptname.R". The output will be saved in a file named scriptname.Rout. Batch mode is also available under
Windows using Rcmd.exe instead of Rgui.exe.
Since every command we will use is a function that is stored in one of the libraries, we will often have
to load libraries before working. Many of the common functions are in the library base, which is loaded by
default. For access to any other function, however, we have to load the appropriate library.
> library(foreign)
will load the library that contains the functions for reading and writing data that is formatted for other
programs, such as SAS and Stata. Alternately (under Windows), we can pull down the package menu and
select library.
Entering the name of the object typically echos its data to the screen. In fact, a function is just another
data member in R. We can see the function’s code by typing its name without parenthesis.
The command for creating and/or assigning a value to a data object is the less-than sign followed by the
minus sign.
> g <- 7.5
creates a numeric object called g, which contains the value 7.5. True vectors in R (as opposed to one
dimensional matrices) are treated as COLUMN vectors, when the distinction needs to be made.
> f <- c(7.5,6,5)
> F <- t(f)
uses the c() (concatenate) command to create a vector with values 7.5, 6, and 5. c() is a generic function
that can be used on multiple types of data. The t() command transposes f to create a 1x3 matrix—because
“vectors” are always column vectors. The two data objects f and F are separate because of the case sensitivity
of R. The command cbind() concatenates the objects given it side by side: into an array if they are vectors,
and into a single dataframe if they are columns of named data.
Similarly, rbind() concatenates objects by rows—one above the other—and assumes that vectors given it
are ROW vectors (see 2.3.1).
Elements in vectors and similar data types are indexed using square brackets. R uses one-based indexing.
> f
[1] 7.5 6.0 5.0
> f[2]
[1] 6
Notice that for multidimensional data types, such as matrices and dataframes, leaving an index blank refers
to the whole column or row corresponding to that index. Thus if foo is a 4x5 array of numbers,
> foo
> foo[1,]
> foo[,3]
will print the third column, etc. We can get summary statistics on the data in foo using the summary()
command, and we can determine its dimensionality using the NROW() and NCOL() commands. More generally,
we can use the dim() command to learn the dimensions of many R objects.
If we wish to extract or print only certain rows or columns, we can use the concatenation operator.
> oddfoo <- foo[,c(1,3,5)]
makes a 4x3 array out of columns 1, 3, and 5 of foo and saves it in oddfoo. By prepending the subtraction
operator, we can remove certain columns
> foo[,-c(1,3,5)]
makes a 4x2 array out of columns 2 and 4 of foo (i.e., it removes columns 1, 3, and 5). We can also use
comparison operators to extract certain columns or rows.
> smallfoo <- foo[foo[,1]==1,]
compares each entry in the first column of foo to one and inserts the row corresponding to each match into
smallfoo.
smallfoo. We can also reorder data. If wealth is a dataframe with columns year, gdp, and gnp, we could
sort the data by year using order() or extract a period of years using the colon operator
> wealth <- wealth[order(wealth$year),]
> firstten <- wealth[1:10,]
> eighty <- wealth[wealth$year==1980,]
This sorts by year and puts the first ten years of data in firstten. All rows from year 1980 are stored in
eighty (notice the double equals sign).
Using double instead of single brackets for indexing changes the behavior slightly. Basically it doesn’t
allow referencing multiple objects using a vector of indices, as the single bracket case does. For example,
> w[[1:10]]
does not return a vector of the first ten elements of w, as it would in the single bracket case. Also, it strips
off attributes and types. If the variable is a list, indexing it with single brackets yields a list containing the
data, double brackets return the (vector of) data itself. Most often, when getting data out of lists, double
brackets are wanted, otherwise single brackets are more common.
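For example, consider a small list made up for illustration:
> li <- list(first=c(7,5,3),second="some text")
> li["first"]
$first
[1] 7 5 3
> li[["first"]]
[1] 7 5 3
Single brackets return a list containing the named element, while double brackets return the element itself.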
Occasionally we have data in the incorrect form (i.e., as a dataframe when we would prefer to have a
matrix). In this case we can use the as. functionality. If all the values in goo are numeric, we could put
them into a matrix named mgoo with the command
> mgoo <- as.matrix(goo)
Other data manipulation operations can be found in the standard R manual and online. There are a lot
of them.
2.3 Important Data Types
2.3.1 Vectors
The most fundamental numeric data type in R is an unnamed vector. A scalar is, in fact, a 1-vector. Vectors
are more abstract than one dimensional matrices because they do not contain information about whether
they are row or column vectors—although when the distinction must be made, R usually assumes that
vectors are columns.
The vector abstraction away from rows/columns is a common source of confusion for R users who are
familiar with matrix oriented languages, such as matlab. The confusion associated with this abstraction
can be shown by an example
> a <- c(1,2,3)
> b <- c(4,6,8)
Now we can make a matrix by stacking vertically or horizontally and R assumes that the vectors are either
rows or columns, respectively.
> cbind(a,b)
a b
[1,] 1 4
[2,] 2 6
[3,] 3 8
> rbind(a,b)
[,1] [,2] [,3]
a 1 2 3
b 4 6 8
One function assumed that these were column vectors and the other that they were row vectors. The take
home lesson is that a vector is not a one dimensional matrix, so don't expect them to work as they do in a
linear algebra world. To convert a vector to an Nx1 matrix for use in linear algebra-type operations (a
column vector) use as.matrix().
Note that t() returns a matrix, so that the object t(t(a)) is not the same as a.
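A brief illustration of the distinction:
> a <- c(1,2,3)
> dim(a)
NULL
> dim(as.matrix(a))
[1] 3 1
> dim(t(a))
[1] 1 3
The vector itself has no dimension attribute at all; as.matrix() produces the column vector and t() the row vector.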
2.3.3 Dataframes
Most econometric data will be in the form of a dataframe. A dataframe is a collection of vectors (we think
of them as columns) containing data, which need not all be of the same type, but each column must have
the same number of elements. Each column has a title by which the whole vector may be addressed. If goo
is a 3x4 data frame with titles age, gender, education, and salary, then we can print the salary column
with the command
> goo$salary
or view the names of the columns in goo
> names(goo)
Most mathematical operations affect multidimensional data elementwise (unlike some mathematical lan-
guages, such as matlab). From the previous example,
> salarysq <- (goo$salary)^2
creates a vector with entries equal to the square of the corresponding entries in goo$salary. Output from
operations can also be saved in the original variable, for example,
> goo$lnsalary <- log(goo$salary)
adds a column named lnsalary to goo, containing the log of the salary.
2.3.4 Lists
A list is more general than a dataframe. It is essentially a bunch of data objects bound together, optionally
with a name given to each. These data objects may be scalars, strings, dataframes, or any other type.
Functions that return many elements of data (like summary()) generally bind the returned data together as
a list, since functions return only one data object. As with dataframes, we can see what objects are in a
list (by name if they have them) using the names() command and refer to them either by name (if existent)
using the $ symbol or by number using brackets. Remember that referencing a member of a list is
generally done using double, not single, brackets (see section 2.1). Sometimes we would like to
simplify a list into a vector. For example, the function strsplit() returns a list containing substrings of
its argument. In order to make them into a vector of strings, we must change the list to a vector using
unlist(). Lists sometimes get annoying, so unlist() is a surprisingly useful function.
2.3.5 S3 Classes
Many functions return an object containing many types of data, like a list, but would like R to know
something about what type of object it is. A list with a "class" attribute designating its type is an S3
class. When an object carrying a class attribute is passed to a generic function, R first searches for a
method named after that class. For example, lm() produces output that should mean something to R. When
we print an lm object, a method written especially for fitted models is called instead of the standard routine.
> print(x)
calls print() if x is an arbitrary data type, but if it is of class lm, print.lm() is called instead.
S3 lists are quite simple to use. They are really just lists with an extra attribute. We can create them
either using the class() function or by adding the class attribute after creation of a list.
If we were to call print() without assigning the class, we would get a different result.
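A minimal sketch (the class name myclass and its contents are made up for illustration):
> h <- list(data=c(1,5,2),description="test object")
> class(h) <- "myclass"
> print.myclass <- function(x,...){ cat("myclass object:",x$description,"\n") }
> print(h)
myclass object: test object
Calling unclass(h) strips the class attribute, after which print() falls back to ordinary list printing.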
2.3.6 S4 Classes
S4 classes are a recent addition to R. They generically hold data and functions, just like S3 classes, but
have some technical advantages, which transcend the scope of this document. For our purposes, the most
important difference between an S3 and S4 class is that the data members (slots) of the latter are referenced
using @ instead of $ and that an S4 object can only be created using the new() command.
> g <- garchFit(~arma(0,1),~garch(2,3),y)
> fitvalues <- g@fit
2.4 Working with Dates
The standard way of storing dates internally in R is as an object of class Date. This allows for such things as
subtraction of one date from another yielding the number of days between them. To convert data to dates,
we use as.Date(). This function takes as input a character string and a format. If the given vector of dates
is stored as a numeric format (like 20050627) it should be converted to a string using as.character() first.
The format argument informs the code what part of the string corresponds to what part of the date. Four
digit year is %Y, two digit year is %y, numeric month is %m, alphabetic (abbreviated) month is %b, alphabetic
(full) month is %B, day is %d. For other codes, see the help files on strptime. For example, if d is a vector
of dates formatted like "2005-Jun-27", we could use
> g<-as.Date(d,format="%Y-%b-%d")
Internally, Date objects are numeric quantities, so they don’t take up very much memory.
We can perform the reverse operation of as.Date()—formatting or extracting parts of a Date object—
using format(). For example, given a column of numbers like 20040421, we can extract a character string
representing the year using
> year<-format(as.Date(as.character(v$DATE),format="%Y%m%d"),format="%Y")
Dataframes that share a common column can be combined using merge(). If dataframes B and E both
contain a column named date, we can merge them by date using
> OUT<-merge(B,E)
If the date column was named date in B but day in E, the command would instead be
> OUT<-merge(B,E,by.x="date",by.y="day")
> mydata <- read.table("C:/WINDOWS/Desktop/blah.dat",header=TRUE)
Now mydata is a dataframe with named columns, ready for analysis. Note that R assumes that there are
no labels on the columns, and gives them default values, if you omit the header=TRUE argument. Now let's
suppose that instead of blah.dat we have blah.dta, a stata file.
> library(foreign)
> mydata <- read.dta("C:/WINDOWS/Desktop/blah.dta")
2.7 Working With Very Large Data Files
R objects can be as big as our computer will allow. If, however, the objects we are loading are so big that
they exhaust the system RAM, they are swapped to disk. This means that extremely large objects can
slow everything down tremendously. The read.table() family of routines assume that we are not working
with very large data sets and so are not careful to conserve on memory. They load everything at once and
probably make at least one copy of it. A better way to work with huge datasets is to read the file a line (or
group of lines) at a time. We do this using connections. A connection is an R object that points to a file
or other input/output stream. Each time we read from a connection the location in the file from which we
read moves forward.
Before we can use a connection, we must create it using file() if our data source is a file or url() for an
online source (there are other types of connections too). Then we use open() to open it. Now we can read
one or many lines of text using readLines(), read fields of data using scan(), or write data using cat() or
one of the write.table() family of routines. When we are done we close using close().
Reading fields of data from a huge file is a very common task, so we give it special attention. The most
important argument to scan() is what, which specifies what type of data to read in. If the file contains
columns of data, what should be a list, with each member of the list representing a column of data. For
example, if the file contains a name followed by a comma separator and an age, we could read a single line
using
> a <- scan(f,what=list(name="",age=0),sep=",",nlines=1)
where f is an open connection. Now a is a list with fields name and age. Example 13.4 shows how to read
from a large data file. Notice that reading one line at a time is not the fastest way to do things. R can
comfortably read 100, 1000, or more lines at a time. Increasing how many lines are read per iteration can
speed up large reads considerably. With large files, we could read lines 1000 at a time, transform them, and
then write 1000 at a time to another open connection, thereby keeping system memory free.
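A sketch of such a chunked read and write (the file names, field layout, and doubling transformation are all hypothetical):
> f <- file("huge.csv","r")
> out <- file("transformed.csv","w")
> repeat {
+ a <- scan(f,what=list(name="",age=0),sep=",",nlines=1000,quiet=TRUE)
+ if (length(a$name)==0) break
+ cat(paste(a$name,2*a$age,sep=","),sep="\n",file=out)
+ }
> close(f)
> close(out)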
If all of the data is of the same type and belong in the same object (a 2000x2000 numeric matrix, for
example) we can use scan() without including the nlines argument and get tremendously faster reads. The
resulting vector would need only to be converted to type matrix.
Any changes we make (including changing the type of data contained in a column) will be reflected in the
variable a immediately. If we want to save the changed data to a different variable, we could have used
> b <- de(a)
Notice that both de() and data.entry() return a variable of type list. If what we wanted was a dataframe,
for example, we need to convert it back after editing.
The function edit() works like de() but for many different data types. In practice, it calls either de()
or the system default text editor (which is set using options()).
A similar useful function is fix(), which edits an R object in place. fix() operates on any kind of data:
dataframes are opened with de() and functions are opened with a text editor. This can be useful for quick
edits either to data or code.
For example, we could regress salary on age and exper from the dataframe byu by referring to each
column explicitly
> lm(byu$salary ~ byu$age + byu$exper)
or alternately:
> lm(salary ~ age + exper,data=byu)
as a third alternative, we could “attach” the dataframe, which makes its columns available as regular variables
> attach(byu)
> lm(salary ~ age + exper)
Notice the syntax of the model argument (using the tilde). The above command would correspond to
the linear model
salary = β0 + β1 age + β2 exper + ε    (1)
Using lm() alone results in an abbreviated summary being sent to the screen, giving only the β coefficient
estimates. For more exhaustive analysis, we can save the results as a data member or "fitted model"
> result <- lm(salary ~ age + exper,data=byu)
The summary() command, run on raw data, such as byu$age, gives statistics, such as the mean and
median (these are also available through their own functions, mean and median). When run on an ols
object, summary gives important statistics about the regression, such as p-values and the R2 .
The residuals and several other pieces of data can also be extracted from result, for use in other computa-
tions. The variance-covariance matrix (of the beta coefficients) is accessible through the vcov() command.
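For example, using the fitted model object result from above:
> b <- coef(result)        # estimated coefficients
> r <- residuals(result)   # regression residuals
> v <- vcov(result)        # variance-covariance matrix of the coefficients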
Notice that more complex formulae are allowed, including interaction terms (specified by multiplying
two data members) and functions such as log() and sqrt(). Unfortunately, in order to include a power
term, such as age squared, we must either first compute the values, then run the regression, or use the I()
operator, which forces computation of its argument before evaluation of the formula
> byu$agesq <- byu$age^2
> result <- lm(salary ~ age + agesq + log(exper) + age*log(exper),data=byu)
or
> result <- lm(salary ~ age + I(age^2) + log(exper) + age*log(exper),data=byu)
In order to run a regression without an intercept, we simply specify the intercept explicitly, traditionally
with a zero.
> result <- lm(salary ~ 0 + age + exper,data=byu)
The Akaike information criterion of a fitted model is available through the AIC() command and is
computed as

AIC = −2 log L(p) + 2p

where p is the number of estimated parameters and L(p) is the likelihood. Some econometricians prefer to
call AIC/N the information criterion. To obtain the Bayesian Information Criterion (or Schwartz Bayesian
Criterion) we use AIC but specify a different "penalty" parameter as follows
> sbc <- AIC(result,k=log(NROW(smokerdata)))
which means

SBC = −2 log L(p) + p log(N )
We can check for the presence of heteroskedasticity using the Breusch-Pagan test from the lmtest package
> library(lmtest)
> bptest(unrestricted)

Breusch-Pagan test

data: unrestricted
BP = 44.5465, df = 1, p-value = 2.484e-11
This performs the “studentized” version of the test. In order to be consistent with some other software
(including ncv.test()) we can specify studentize=FALSE.
3.3.2 Heteroskedasticity (Autocorrelation) Robust Covariance Matrix
In the presence of heteroskedasticity, the ols estimates remain unbiased, but the ols estimates of the variance
of the beta coefficients are no longer correct. In order to compute the heteroskedasticity consistent covariance
matrix1 we use the hccm() function (from the car library) instead of vcov(). The diagonal entries are
variances and off diagonals are covariance terms.
This functionality is also available via the vcovHC() command in the sandwich package. Also in that
package is the heteroskedasticity and autocorrelation robust Newey-West estimator, available in the function
vcovHAC() or the function NeweyWest().
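A brief sketch, assuming result is a fitted lm object like the one above:
> library(car)
> hccm(result)
> library(sandwich)
> vcovHC(result)
> NeweyWest(result)
Each returns a covariance matrix that can be used in place of the vcov() output when testing hypotheses.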
We can test a linear hypothesis such as

H0 : β0 = 0, β3 + β4 = 1

using linear.hypothesis() from the car library. The restrictions are encoded in a hypothesis matrix hm,
with one row per restriction and one column per coefficient, and a right hand side vector rhs. For a model
with coefficients β0 through β4 the above hypothesis corresponds to
> hm <- rbind(c(1,0,0,0,0),c(0,0,0,1,1))
> rhs <- c(0,1)
> linear.hypothesis(unrestricted,hm,rhs,vcov=NeweyWest(unrestricted))
See the section on heteroskedasticity robust covariance matrices for information about the NeweyWest() func-
tion. We should remember that the specification white.adjust=TRUE corrects for heteroskedasticity using an
improvement to the white estimator. To use the classic white estimator, we can specify white.adjust="hc0".
Generalized least squares is available through the lm.gls() command in the MASS library. It takes a
formula, weighting matrix, and (optionally) a dataframe from which to get the data as arguments.
The glm() command provides access to a plethora of generalized linear models (logit, probit, Poisson,
and others). See the help file for more details.
1 obtaining the White standard errors, or rather, their squares.
3.6 Models With Factors/Groups
There is a separate datatype for qualitative factors in R. When a variable included in a regression is of type
factor, the requisite dummy variables are automatically created. For example, if we wanted to regress the
adoption of personal computers (pc) on the number of employees in the firm (emple) and include a dummy
for each state (where state is a vector of two letter abbreviations), we could simply run the regression
> summary(lm(pc~emple+state))
Call:
lm(formula = pc ~ emple + state)
Residuals:
Min 1Q Median 3Q Max
-1.7543 -0.5505 0.3512 0.4272 0.5904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.572e-01 6.769e-02 8.232 <2e-16 ***
emple 1.459e-04 1.083e-05 13.475 <2e-16 ***
stateAL -4.774e-03 7.382e-02 -0.065 0.948
stateAR 2.249e-02 8.004e-02 0.281 0.779
stateAZ -7.023e-02 7.580e-02 -0.926 0.354
stateDE 1.521e-01 1.107e-01 1.375 0.169
...
4 Special Regressions
4.1 Fixed/Random Effects Models
Warning: The definitions of fixed and random effects models are not standardized across disciplines. I
describe fixed and random effects estimation as these terms are generally used by econometricians. The
terms “fixed” and “random” have historical roots and are econometrically misleading.
Within the context of economics, fixed and random effects estimators are panel data models that account
for cross sectional variation in the intercept. Letting i denote the cross sectional index (or the one by which
data is grouped) and t the time index (or the index that varies within a group), a standard fixed effects
model can be written
yit = α + ui + βXit + εit .    (2)
Essentially, each individual has a different time-invariant intercept (α + ui ). Usually we are interested in β
but not any of the ui . A random effects model has the same mean equation, but imposes the additional
restriction that the individual specific effect is uncorrelated with the explanatory variables Xit . That is,
E[ui Xit ] = 0. Econometrically this is a more restrictive version of the fixed effects estimator (which allows
for arbitrary correlation between the “effect” and exogenous variables). One should not let the unfortunate
nomenclature confuse the relationship between these models.
> lm(y~factor(index)+x)
will do a fixed effects estimation and will report the correct standard errors on β. Unfortunately, in cases
where there are many individuals in the sample and we are not interested in the value of their fixed effects,
the lm() results are awkward to deal with and the estimation of a large number of ui coefficients could
render the problem numerically intractable.
A more common way to estimate fixed effects models is to remove the fixed effect by time demeaning
each variable (the so called within estimator). Then equation (2) becomes

(yit − ȳi ) = β(Xit − X̄i ) + (εit − ε̄i )
Most econometric packages (for example, stata’s xtreg) use this method when doing fixed effects estimation.
Using R, one can manually time demean the independent and dependent variables. If d is a dataframe
containing year, firm, profit, and predictor and we wish to time demean each firm’s profit and predictor,
we could run something like
> g<-d
> for (i in unique(d$firm)){
+ timemean<-mean(d[d$firm==i,])
+ g$profit[d$firm==i]<-d$profit[d$firm==i]-timemean["profit"]
+ g$predictor[d$firm==i]<-d$predictor[d$firm==i]-timemean["predictor"]
+ }
> output<-lm(profit~predictor,data=g)
Notice that the standard errors reported by this regression will be biased downward. The σ̂² reported by
lm() is computed using

σ̂² = SSR / (N T − K)

whereas the true standard errors in this fixed effects regression come from

σ̂² = SSR / (N (T − 1) − K)
4.1.2 Random Effects
The package nlme contains functions for doing random effects regression (but not fixed effects—the docu-
mentation refers to the statistics interpretation of the term “fixed effect”) in a linear or nonlinear framework.
Suppose we had a linear model with a random effect on the sic3 code.
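A minimal sketch (the dataframe mydata and its columns y, x, and sic3 are hypothetical):
> library(nlme)
> result <- lme(y~x,data=mydata,random=~1|sic3)
> summary(result)
The random=~1|sic3 argument specifies a random intercept for each sic3 group.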
4.3 Tobit and Censored Regression
In order to estimate a model in which the values of some of the data have been censored, we use the survival
library. The function survreg() performs this type of regression, and takes as its dependent variable a Surv
object. The best way to see how to do this type of regression is by example. Suppose we want to regress y
on x and z, but a number of y observations were censored on the left and set to zero.
> result <- survreg(Surv(y,y>0,type='left') ~ x + z, dist='gaussian')
The second argument to the Surv() function specifies whether each observation has been censored or not
(one indicating that it was observed and zero that it was censored). The third argument indicates on which
side the data was censored. Since it was the lower tail of this distribution that got censored, we specify left.
The dist option passed to survreg() is necessary in order to get a classical Tobit model.
4.7 Two Stage Least Squares on a Single Structural Equation
For single equation two stage least squares, the easiest function is probably tsls() from the sem library.
If we want to find the effect of education on wage while controlling for marital status but think educ is
endogenous, we could use motheduc and fatheduc as instruments by running
> library(sem)
> outputof2sls <- tsls(lwage~educ+married,~married+motheduc+fatheduc)
The first argument is the structural equation we want to estimate and the second is a tilde followed by all
the instruments and exogenous variables from the structural equation—everything we need for the Z matrix
in the 2SLS estimator β̃ = (X′Z(Z′Z)⁻¹Z′X)⁻¹X′Z(Z′Z)⁻¹Z′y.
The resulting output can be analyzed using summary() and other ols analysis functions. Note that since
this command produces a two stage least squares object, the summary statistics, including standard errors,
will be correct. Recall that if we were to do this using an actual two stage approach, the resulting standard
errors would be bogus.
There are also routines for three stage least squares, weighted two stage least squares, and a host of others.
Most time-series related functions automatically coerce the data into ts format, so an explicit call to the
ts() conversion command is often not necessary.
5.1 Differences and Lags
We can compute differences of a time series object using the diff() operator, which takes as optional
arguments which difference to use and how much lag should be used in computing that difference. For
example, to take the first difference with a lag of two, so that wt = vt − vt−2 , we would use
> w <- diff(v,lag=2,differences=1)
We can also lag a vector by hand, by dropping observations from its ends. The commands
> ysmall <- y[-1]
> ylag <- y[-NROW(y)]
produce a once lagged version of y relative to ysmall. This way of generating lags can get awkward if we
are trying combinations of lags in regressions because for each lagged version of the variable, conformability
requires that we have a corresponding version of the original data that has the first few observations removed.
Another way to lag data is to convert it to a time series object and use the lag() function. It is very
important to remember that this function does not actually change the data, it changes an attribute of a time
series object that indicates where the series starts. This allows for more flexibility with time series functions,
but it can cause confusion for general functions such as lm() that do not understand time series attributes.
Notice that lag() only works usefully on time series objects. For example, the code snippet
> d <- a - lag(a)
creates a vector of zeros named d if a is a normal vector, but returns a ts object with the first difference of
the series if a is a ts object. There is no warning issued if lag() is used on regular data, so care should be
exercised.
In order to use lagged data in a regression, we can use time series functions to generate a dataframe with
various lags of the data and NA characters stuck in the requisite leading and trailing positions. In order
to do this, we use the ts.union() function. Suppose X and Y are vectors of ordinary data and we want to
include a three times lagged version of X in the regression, then
> y <- ts(Y)
> x <- ts(X)
> x3 <- lag(x,-3)
> d <- ts.union(y,x,x3)
converts the vectors to ts data and forms a multivariate time series object with columns yt , xt , and xt−3 .
Again, remember that data must be converted to time series format before lagging or binding together with
the union operator in order to get the desired offset. The ts.union() function automatically decides on
a title for each column, much as the data.frame() command does. We can also do the lagging inside the
union and assign our own titles
> d <- ts.union(y=ts(Y),x=ts(X),x3=lag(ts(X),-3))
It is critical to note that the lag operator works in the opposite direction of what one might
expect: positive lag values result in leads and negative lag values result in lags.
When the resulting multivariate time series object is converted to a data frame (as it is when read by lm(),
for example), the offset will be preserved. Then
> lm(y~x3,data=d)
will regress yt on xt−3 .
Also note that by default observations that have a missing value (NA) are omitted. This is what we want.
If the default setting has somehow been changed, we should include the argument na.action=na.omit in
the lm() call. In order to get the right omission behavior, it is generally necessary to bind all the data we
want to use (dependent and independent variables) together in a single union.
In summary, in order to use time series data, convert all data to type ts, lag it appropriately (using the
strange convention that positive lags are leads), and bind it all together using ts.union(). Then proceed
with the regressions and other operations.
5.2 Filters
5.2.1 Canned AR and MA filters
One can pass data through filters constructed by polynomials in the lag operator using the filter() com-
mand. It handles two main types of filters: moving average or “convolution” filters and autoregressive or
“recursive” filters. Convolution filters have the form
y = (a0 + a1 L + . . . + ap Lp )x

and autoregressive or recursive filters have the form

y = (1 − a1 L − a2 L2 − . . . − ap Lp )−1 x
When we use the filter() command, we supply the {an } vector as follows
> y <- filter(x,c(1,.2,-.35,.1),method="convolution",sides=1)
The data vector x may be a time series object or a regular vector of data, and the output y will be a ts
object. It is necessary to specify sides=1 for a convolution filter, otherwise the software tries to center the
filter (in positive and negative lags), which is not usually what the econometrician wants. The recursive
filter ignores the sides keyword. Notice that (except for data loss near the beginning of the series) we can
recover the data from y above using a recursive filter
> X<-filter(y,c(-.2,.35,-.1),method="recursive")
The recursive filter would otherwise require a for loop to reproduce by hand.

5.2.2 Manual Filtration
Filters can also be applied manually, using lag() and arithmetic on ts objects. Remember that this lag
method of filtering will only work if x is a ts object.
5.2.3 Hodrick Prescott Filter
The Hodrick Prescott filter does not come canned with R, but it can be written by hand. For example
> hpfilter <- function(x,lambda=1600){
+ eye <- diag(length(x))
+ result <- solve(eye+lambda*crossprod(diff(eye,lag=1,d=2)),x)
+ return(result)
+ }
where lambda is the standard tuning parameter, often set to 1600 for macroeconomic data. Passing a series
to this function will return the smoothed series.
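For example, assuming x holds a quarterly macroeconomic series (hypothetical), the cyclical component is
the deviation from the smoothed trend
> trend <- hpfilter(x)
> cyc <- x - trend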
This filter is also a special case of the smooth.spline() function in which the parameter all.knots=TRUE
has been passed. Unfortunately, the tuning parameter for the smooth.spline() function, spar, is different
from the lambda above and we have not figured out how to convert from spar to lambda. If the reader
knows how to use smooth.spline() to do HP filtration, please let me know.
5.2.4 Kalman Filter
Consider the state space model

yt = Ft′ θt + vt
θt = Gt θt−1 + wt

where

vt ∼ N (0, Vt ),  wt ∼ N (0, Wt ).

θt is the unobserved state vector, and yt is the observed data vector. For Kalman filtering, the initial value
of θ is drawn from N (m0 , C0 ).
Because of the possibly time varying nature of this general model, the coefficient matrices must be given
as functions. One caveat to remember is that the input Fmat is the transpose of Ft . After writing functions to
input Ft , Gt , Vt , Wt and generating the input variables φ (a vector of parameters to be used by the functions
that generate Ft , Gt , Vt , and Wt ), m0 , and C0 , we run SS() to create the state space model. Then we
run kfilter() on the model object to filter and obtain the log likelihood—it returns a copy of the model
with estimated parameters included. If we want to run the Kalman smoother, we take the output model of
kfilter() and run smoother() on it.
The functions in package dse1 appear more flexible but more complicated to initialize.
5.3 ARIMA/ARFIMA
The arima() command from the ts library can fit time series data using an autoregressive integrated
moving average model, in which the d times differenced series follows an ARMA(p,q) process, where

∆yt = yt − yt−1 .    (8)
The parameters p, d, and q specify the order of the arima model. These values are passed as a vector
c(p,d,q) to arima(). Notice that the model used by R makes no assumption about the sign of the θ terms,
so the sign of the corresponding coefficients may differ from those of other software packages (such as S+).
> ar1 <- arima(y,order=c(1,0,0))
> ma1 <- arima(y,order=c(0,0,1))
Data-members ar1 and ma1 contain estimated coefficients obtained by fitting y with an AR(1) and MA(1)
model, respectively. They also contain the log likelihood statistic and estimated standard errors.
If we are modeling a simple autoregressive model, we could also use the ar() command, from the ts
package, which either takes as an argument the order of the model or picks a reasonable default order.
A fractionally integrated ARFIMA(p,d,q) model can be estimated using the fracdiff library
> library(fracdiff)
> fracdiff(y,nar=2,nma=1)
finds the optimal d value using p=2 and q=1. Then it estimates the resulting ARFIMA(p,d,q) model.
5.4 ARCH/GARCH
5.4.1 Basic GARCH–garch()
R can numerically fit data using a generalized autoregressive conditional heteroskedasticity model GARCH(p,q),
written
yt = C + εt    (9)
σ²t = α0 + δ1 σ²t−1 + . . . + δp σ²t−p + α1 ε²t−1 + . . . + αq ε²t−q    (10)

Setting p = 0 we obtain the ARCH(q) model. The R command garch() comes from the tseries library. Its
syntax is
> archoutput <- garch(y,order=c(0,3))
> garchoutput <- garch(y,order=c(2,3))
so that archoutput is the result of modeling an ARCH(3) model and garchoutput is the result of modeling
a GARCH(2,3). Notice that the first value in the order argument is p, the number of delta (GARCH)
parameters, and the second is q, the number of alpha (ARCH) parameters. The resulting coefficient
estimates will be named a0, a1, . . . for the alpha and b1, b2, . . . for the delta parameters. Estimated values
of the conditional standard
deviation process are available via
fitted(garchoutput)
5.4.2 Advanced GARCH–garchFit()
The garchFit() function from the fSeries library estimates GARCH models with nontrivial mean equations.
For example
> g <- garchFit(~arma(0,1),~garch(2,3),y)
fits the same model as above, but the mean equation now is an MA(1).
This function returns an S4 class object, which means that to access the data inside we use the @ operator
(see section 2.3.6). Among the values returned, h gives the estimated σ² process, not σ, so it is the square
of the "fitted" values from garch(), above.
As of this writing, garchFit() produces copious output as it estimates the parameters. Some of this can
be avoided by passing the parameter trace=FALSE.
5.4.3 Miscellaneous GARCH–Ox G@RCH
fSeries also provides an interface to some functions from the Ox G@RCH2 software, which is not free in the same
sense as R is, although as of this writing it was free for academic use. That interface provides routines for
estimation of EGARCH, GJR, APARCH, IGARCH, FIGARCH, FIEGARCH, FIAPARCH and HYGARCH
models, according to the documentation.
5.5 Correlograms
It is common practice when analyzing time series data to plot the autocorrelation and partial autocorre-
lation functions in order to try to guess the functional form of the data. To plot the autocorrelation and
partial autocorrelation functions, use the ts library functions acf() and pacf(), respectively. The following
commands plot the ACF and PACF on the same graph, one above (not on top of) the other. See section 6.5
for more details on arranging multiple graphs on the canvas.
> par(mfrow=c(2,1))
> acf(y)
> pacf(y)
These functions also return the numeric values of the ACF and PACF functions along with some other
output. Plotting can be suppressed by passing plot=F.
[Figure: the ACF (top panel) and PACF (bottom panel) of the series y, each plotted against lags 0 to 25.]
5.6 Predicted Values
Forecasts from a fitted time series model can be generated with predict(). For example
> predict(ar1,n.ahead=5)
returns predictions on five periods following the data in y, along with corresponding standard error estimates.
2 Ox and G@RCH are distributed by Timberlake Consultants Ltd. Timberlake Consultants can be contacted through the
5.7 Time Series Tests
5.7.1 Durbin-Watson Test for Autocorrelation
The Durbin-Watson test for autocorrelation can be administered using the durbin.watson() function from
the car library. It takes as its argument an lm object (the output from an lm() command) and returns
the autocorrelation, DW statistic, and an estimated p-value. The number of lags can be specified using the
max.lag argument. See help file for more details.
> library(car)
> results <- lm(Y ~ x1 + x2)
> durbin.watson(results,max.lag=2)
5.7.2 Box-Pierce and Breusch-Godfrey Tests for Autocorrelation
We can check the residuals of a fitted time series model for autocorrelation using the Box-Pierce test
> library(ts)
> a <- arima(y,order=c(1,1,0))
> Box.test(a$resid)
Box-Pierce test
data: a$resid
X-squared = 18.5114, df = 1, p-value = 1.689e-05
would lead us to believe that the model may not be correctly specified, since we soundly reject the Box-Pierce
null. If we want the Ljung-Box test instead, we include the parameter type="Ljung-Box".
For an appropriate model, this test is asymptotically equivalent to the Breusch-Godfrey test, which is
available in the lmtest library as bgtest(). It takes a fitted lm object instead of a vector of data as an
argument.
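For example, a sketch using the results object fitted in section 5.7.1:
> library(lmtest)
> bgtest(results,order=2)
tests for autocorrelation of order up to two in the residuals.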
5.7.3 Dickey-Fuller Test for Unit Root
The augmented Dickey-Fuller test for a unit root is available in the tseries library
> library(tseries)
> adf.test(y)

Augmented Dickey-Fuller Test

data: y
Dickey-Fuller = -2.0135, Lag order = 7, p-value = 0.5724
alternative hypothesis: stationary
5.8 Vector Autoregressions (VAR)
The ar() command will fit a vector autoregression if it is given a multivariate time series as its argument.
Unfortunately, the ar() approach does not have built in functionality for such things as predictions and
impulse response functions. The reader may have to code those up by hand if necessary.
Alternately, the ARMA() function in the dse1 library can fit multivariate time series regression in great
generality, but the programming overhead is correspondingly great.
6 Plotting
One of R’s strongest points is its graphical ability. It provides both high level plotting commands and the
ability to edit even the smallest details of the plots.
The plot() command opens a new window and plots the series of data given it. By default a single
vector is plotted as a time series line. If two vectors are given to plot(), the values are plotted in the x-y
plane using small circles. The type of plot (scatter, lines, histogram, etc.) can be determined using the type
argument. Strings for the main, x, and y labels can also be passed to plot.
> plot(x,y,type="l")
plots a line in the x-y plane, for example. Colors, symbols, and many other options can be passed to plot().
For more detailed information, see the help system entries for plot() and par().
After a plotting window is open, if we wish to superimpose another plot on top of what we already have,
we use the lines() command or the points() command, which draw connected lines and scatter plots,
respectively. Many of the same options that apply to plot() apply to lines() and a host of other graphical
functions.
We can plot a line, given its coefficients, using the abline() command. This is often useful in visualizing
the placement of a regression line after a bivariate regression
> results <- lm(y ~ x)
> plot(x,y)
> abline(results$coef)
abline() can also be used to plot a vertical or horizontal line at a particular value using the parameters v
or h respectively.
To draw a nonlinear deterministic function, like f (x) = x3 , we don’t need to generate a bunch of data
that lie on the function and then connect those dots. We can plot the function directly using the curve()
function. If we want to lay the function on top of an already created plot, we can pass the add=TRUE
parameter.
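For example, a minimal sketch:
> plot(x,y)
> curve(x^3,add=TRUE)
superimposes f (x) = x³ on the scatter plot of x and y. Note that the expression given to curve() must be
written in terms of the variable x.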
6.1 Plotting Empirical Distributions
A kernel density estimate of a data vector y can be plotted using
> plot(density(y),main="Kernel Density Estimate of Y")
[Figure: the resulting kernel density plot (N = 25, bandwidth = 1.386).]
We can also plot the empirical CDF of a set of data using the ecdf() command from the stepfun library,
which is included in the default distribution. We could then plot the estimated CDF using plot().
> library(stepfun)
> d <- ecdf(y)
> plot(d,main="Empirical CDF of Y")
[Figure: "Predicted vs True" plot comparing the true values and predicted series.]
6.6 Saving Plots—png, jpg, eps, pdf, xfig
To save a plot to a file, we open a device for the desired format, issue the plotting commands, and then
close the device. For example
> png("myplot.png")
> plot(x,y,main="A Graph Worth Saving")
> dev.off()
creates a png file of the plot of x and y. In the case of the postscript file, if we intend to include the graphics
in another file (like in a LaTeX document), we could modify the default postscript settings controlling the
paper size and orientation. Notice that when the special paper size is used, the width and height must be
specified. Actually with LaTeX we often resize the image explicitly, so the resizing may not be that important.
> postscript("myplot.eps",paper="special",width=4,height=4,horizontal=FALSE)
> plot(x,y,main="A Graph Worth Including in LaTeX")
> dev.off()
One more thing to notice is that the default paper size is a4, which is the European standard. For 8.5x11
paper, we use paper="letter". When using images that have been generated as a postscript, then converted
to pdf, incorrect paper specifications are a common problem.
There is also a pdf() command that works the same way the postscript command does, except that
by default its paper size is special with a height and width of 6 inches.
Finally, many scientific diagrams are written using the free software xfig. R has the capability to export
to xfig format, which allows us complete flexibility in adding to and altering our plots. If we want to use R
to make a generic plot (like indifference curves), we remove the axis numbers and other extraneous marks
from the figure.
> plot(x,y,type="l",xaxt="n",yaxt="n")
The xaxt and yaxt parameters remove the numbers and tic marks from the axes.
6.7 Adding Greek Letters and Math Symbols to Plots
Plot titles and annotations can contain mathematical notation, built using substitute(). For example
> plot(x,y,main=substitute(y==Psi*z-sum(beta^gamma)),type="l")
> text(3,40,substitute(Delta[K]==1))
> text(0.6,20,substitute(Delta[K]==epsilon))
[Figure: the resulting plot, titled y = Ψz − Σ β^γ, with the annotations ∆K = 1 and ∆K = ε placed at the
specified coordinates.]
Capitalizing the first letter of the Greek symbol results in the “capital” version of the symbol. Notice
that to get the equal sign in the expression, one must use the double equal sign, as above. Brackets indicate
subscripts. We can optionally pass variables to substitute() to include their value in the formula. For
example
> for (g in seq(.1,1,.1)){
+ plot(f(g),main=substitute(gamma==v,list(v=g)))
+ }
will make ten plots, in each of which the title will reflect the value of γ that was passed to f ().
generating mathematical expressions are available through the help for plotmath, which is the mathematical
typesetting engine used in R plots.
To mix text and symbols, use the paste() command inside of substitute()
plot(density(tstats),main=substitute(paste("t-stat of ",beta[0])))
7 Statistics
R has extensive statistical functionality. The functions mean(), sd(), min(), max(), and var() operate on
data as we would expect3 . If our data is a matrix and we would like to find the mean or sum of each row or
column, the fastest and best way is to use one of rowMeans(), colMeans(), rowSums(), colSums().
7.1 Working with Common Statistical Distributions
Random draws from, and the density, cumulative distribution, and quantile functions of, the standard
distributions are built in. For example
> rnorm(1,mean=2,sd=3)
[1] 2.418665
> pnorm(2.418665,mean=2,sd=3)
[1] 0.5554942
> dnorm(2.418665,mean=2,sd=3)
[1] 0.1316921
> qnorm(.5554942,mean=2,sd=3)
[1] 2.418665
These functions generate a random number from the N(2,9) distribution, calculate its cdf and pdf value,
and then verify that the cdf value corresponds to the original observation. If we had not specified the mean
and standard deviation, R would have assumed standard normal.
Command Meaning
rX() Generate random vector from distribution X
dX() Return the value of the PDF of distribution X
pX() Return the value of the CDF of distribution X
qX() Return the number at which the CDF hits input value [0,1]
Note that we could replace norm with one of the following standard distribution names
3 Note: the functions pmax() and pmin() function like max and min but elementwise on vectors or matrices.
Distribution R Name Possible Arguments
beta beta shape1, shape2, ncp
binomial binom size, prob
Cauchy cauchy location, scale
chi-squared chisq df, ncp
exponential exp rate
F f df1, df2, ncp
gamma gamma shape, scale
geometric geom prob
hypergeometric hyper m, n, k
log-normal lnorm meanlog, sdlog
logistic logis location, scale
negative binomial nbinom size, prob
normal norm mean, sd
Poisson pois lambda
Student's t t df, ncp
uniform unif min, max
Weibull weibull shape, scale
Wilcoxon wilcox m, n
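For example, the critical value of a two tailed t test at the 5% level with 21 degrees of freedom is given by
the quantile function
> qt(.975,df=21)
[1] 2.079614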
The mvtnorm package provides the multivariate normal and t distributions with names mvnorm and
mvt, respectively. Other distributions are found in other packages. For example, invgamma is available in
MCMCpack.
7.2 P-Values
By way of example, in order to calculate the p-value of 3.6 using an f (4, 43) distribution, we would use the
command
> 1-pf(3.6,4,43)
[1] 0.01284459
and find that we fail to reject at the 1% level, but we would be able to reject at the 5% level. Remember,
if the p-value is smaller than the alpha value, we are able to reject. Also recall that the p-value should be
multiplied by two if we are doing a two tailed test. For example, the one and two tailed tests of a t statistic
of 2.8 with 21 degrees of freedom would be, respectively
> 1-pt(2.8,21)
[1] 0.005364828
> 2*(1-pt(2.8,21))
[1] 0.01072966
So we would reject the null hypothesis of insignificance at the 1% level if it were a one tailed test
(remember, small p-value, more evidence in favor of rejection), but we would fail to reject in the sign-
agnostic case.
8 Math in R
8.1 Matrix Operations
8.1.1 Matrix Algebra and Inversion
Most R commands work with multiple types of data. Most standard mathematical functions and operators
(including multiplication, division, and powers) operate on each component of multidimensional objects.
Thus the operation A*B, where A and B are matrices, multiplies corresponding components. In order to
do matrix multiplication or inner products, use the %*% operator. Notice that in the case of matrix-vector
multiplication, R will automatically make the vector a row or column vector, whichever is conformable.
Matrix inversion is obtained via the solve() function. (Note: if solve() is passed a matrix and a vector,
it solves the corresponding linear problem.) The t() function transposes its argument. Thus

β = (X′X)−1 X′Y    (11)

can be computed as
> bhat <- solve(t(X)%*%X)%*%t(X)%*%Y
8.1.2 Factorizations
R can compute the standard matrix factorizations. The Cholesky factorization of a symmetric positive
definite matrix is available via chol(). It should be noted that chol() does not check for symmetry in its
argument, so the user must be careful.
We can also extract the eigenvalue decomposition of a symmetric matrix using eigen(). By default this
routine checks the input matrix for symmetry, but it is probably better to specify whether the matrix is
symmetric by construction or not using the parameter symmetric.
If the more general singular value decomposition is desired, we use instead svd(). For the QR factoriza-
tion, we use qr(). The Matrix package provides the lu() and Schur() decompositions—just remember to
convert the matrix to type Matrix (not matrix) before using them.
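A brief sketch (the matrix A is constructed here just for illustration):
> A <- crossprod(matrix(rnorm(9),3,3))+diag(3)   # symmetric positive definite
> R <- chol(A)                 # upper triangular, so t(R)%*%R recovers A
> e <- eigen(A,symmetric=TRUE) # eigenvalues in e$values, eigenvectors in e$vectors
> s <- svd(A)                  # singular value decomposition
> q <- qr(A)                   # QR factorization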
8.2 Numerical Optimization
The nlm() command minimizes an arbitrary function, taking as arguments the function to be minimized
and a vector of starting values for the parameters. By default nlm() uses a matrix-secant method that
numerically approximates the gradient, but if the return value of the function contains an attribute called
gradient, it will use a quasi-newton method. A gradient based optimization therefore requires the objective
function to attach its gradient to the value it returns, as in the sketch below.
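A minimal sketch (the quadratic objective is made up for illustration):
> f <- function(x){
+ value <- sum((x-3)^2)
+ attr(value,"gradient") <- 2*(x-3)
+ value
+ }
> results <- nlm(f,p=c(0,0))
> results$estimate
Because the returned value carries a gradient attribute, nlm() switches to its quasi-newton method.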
If function maximization is wanted one should multiply the function by -1 and minimize. If optim() is
used, one can instead pass the parameter control=list(fnscale=-1), which indicates a multiplier for the
objective function and gradient. It can also be used to scale up functions that are nearly flat so as to avoid
numerical inaccuracies.
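For example, to maximize a concave objective with optim() (an illustrative function):
> g <- function(theta) -sum((theta-2)^2)   # concave; maximized at theta = c(2,2)
> result <- optim(c(0,0), g, control=list(fnscale=-1))
> result$par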
Other optimization functions which may be of interest are optimize() for one-dimensional minimization,
uniroot() for root finding, and deriv() for computing derivatives of simple expressions. Also, constrOptim() provides
constrained optimization functionality.
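For numerical integration of a function of one variable we use integrate(). For example, to integrate the standard normal density over the positive half line (an illustrative choice):
> intresult <- integrate(dnorm, 0, Inf)
> intresult$value
[1] 0.5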
We notice that integrate() returns additional information, such as error bounds, so we extract the value
using $value. Also, in addition to a function name, integrate() takes limits of integration, which—as in
this case—may be infinite. For multidimensional integration we instead use adapt() from the adapt library,
which does not allow infinite bounds.
9 Programming
9.1 Writing Functions
A function can be treated as any other object in R. It is created with the assignment operator and
function(), which is passed an argument list (use the equal sign to denote default arguments; all other
arguments will be required at runtime). The code that will operate on the arguments follows, surrounded
by curly brackets if it comprises more than one line.
If an expression or variable is evaluated within a function, it will not echo to the screen. However, if it
is the last evaluation within the function, it will act as the return value. This means the following functions
are equivalent
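(a minimal sketch; the function name f and its argument Alpha are illustrative, chosen to match the calls below)
> f <- function(x, Alpha=1){
+ x^Alpha
+ }
> f <- function(x, Alpha=1){
+ y <- x^Alpha
+ y
+ }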
Notice that R changes the prompt to a “+” sign to remind us that we are inside brackets.
Because R does not distinguish what kind of data object a variable in the parameter list is, we should
be careful how we write our functions. If x is a vector, the above functions would return a vector of the
same dimension. Also, notice that if an argument has a long name, it can be abbreviated as long as the
abbreviation is unique. Thus the following two statements are equivalent
> f(c(2,4,1),Al=3)
> f(c(2,4,1),Alpha=3)
Function parameters are passed by value, so changing them inside the function does not change them
outside of it. Similarly, variables defined within functions are unavailable outside of the function. If a
variable is referenced inside of a function, first the function scope is checked for that variable, then the scope
above it, etc. In other words, variables outside of the function are available to code inside for reading, but
changes made to a variable defined outside a function are lost when the function terminates. For example,
> a<-c(1,2)
> k<-function(){
+ cat("Before: ",a,"\n")
+ a<-c(a,3)
+ cat("After: ",a,"\n")
+ }
> k()
Before: 1 2
During: 1 2 3
> a
[1] 1 2
If a function wishes to write to a variable defined in the scope above it, it can use the “superassignment”
operator <<-. The programmer should think twice about his or her program structure before using this
operator. Its need can easily be the result of bad programming practices.
9.2 Looping
Looping is performed using the for command. Its syntax is as follows
> for (i in 1:20){
+ cat(i)
+ }
where cat() may be replaced with the block of code we wish to repeat. Instead of 1:20, a vector or matrix
of values can be used. The index variable will take on each value in the vector or matrix and run the code
contained in curly brackets.
If we simply want a loop to run until something happens to stop it, we could use the repeat loop and a
break
> repeat {
+ g <- rnorm(1)
+ if (g > 2.0) break
+ cat(g);cat("\n")
+ }
Notice that the second cat command issues a newline character so the output is not squashed onto one line.
The semicolon tells R where one command ends when we put several commands on a single line.
For example, the above is equivalent to
> repeat {g <- rnorm(1);if (g>2.0) break;cat(g);cat("\n");}
In addition to the break keyword, R provides the next keyword for dealing with loops. This termi-
nates the current iteration of the for or repeat loop and proceeds to the beginning of the next iteration
(programmers experienced in other languages sometimes expect this keyword to be continue but it is not).
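9.3 Avoiding Loops
9.3.1 Applying a Function to an Array
The apply() function executes a supplied function over one dimension of an array. For example, if X is a 50x10 matrix of uniform draws (an illustrative setup), we can compute the mean of each row using
> X <- matrix(runif(500), nrow=50)
> mu <- apply(X, 1, mean)
where the second argument, 1, indicates that the function should be applied over rows; 2 would indicate columns.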
Of course, an array may be more than two dimensional, so the second argument (the dimension over which
to apply the function) may go above 2. A better way to do this particular example, of course, would be to
use rowMeans() or colMeans().
We can use apply() to apply a function to every element of an array individually by specifying more
than one dimension. In the above example, we could return a 50x10 matrix of normal quantiles using
> apply(X,c(1,2),qnorm,mean=3,sd=4)
After the required three arguments, any additional arguments are passed to the inside function, qnorm in
this case.
In order to execute a function on each member of a list, vector, or other R object, we can use the function
lapply(). The result will be a new list, each of whose elements is the result of executing the supplied function
on the list element. The syntax is the same as apply() except that the dimension argument is not needed. If a
different data type is passed, such as a vector or matrix, each of its elements becomes an element of the resulting list.
To get the results of these function evaluations as a vector instead of a list, we can use sapply().
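For example (an illustrative list):
> sapply(list(a=1:3, b=4:6), mean)
a b
2 5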
9.3.2 Replicating
If the programmer wishes to run a loop to generate values (as in a simulation) that do not depend on the
index, the function replicate() provides a convenient and fast interface. To generate a vector of 50000
draws from a user defined function called GetEstimate() which could, for example, generate simulated data
and return a corresponding parameter estimate, the programmer could execute
> estimates<-replicate(50000,GetEstimate(alpha=1.5,beta=1))
If GetEstimate() returns a scalar, replicate() generates a vector. If it returns a vector, replicate()
will column bind the results into a matrix. Notice that replicate() always calls its argument function with the
same parameters, in this case alpha=1.5 and beta=1.
9.4 Conditionals
9.4.1 Binary Operators
Conditionals, like the rest of R, are highly vectorized. The comparison
> x < 3
returns a vector of TRUE/FALSE values if x is a vector. This vector can then be used in computations.
For example, we could set all x values that are less than 3 to zero with one command
> x[x<3] <- 0
The conditional within the brackets evaluates to a TRUE/FALSE vector. Wherever the value is TRUE, the
assignment is made. Of course, the same computation could be done using a for loop and the if command.
> for (i in 1:NROW(x)){
+ if (x[i] < 3) {
+ x[i] <- 0
+ }
+ }
Because R is highly vectorized, the latter code works much more slowly than the former. It is good program-
ming practice to avoid loops and if statements whenever possible when writing in any scripting language.
The Boolean Operators
! x NOT x
x & y x and y elementwise
x && y x and y total object
x | y x or y elementwise
x || y x or y, first elements only (single result)
xor(x, y) x xor y (true if one and only one argument is true)
9.5 The Ternary Operator
Since code segments of the form
> if (x) {
+ y } else {
+ z }
come up very often in programming, R includes a ternary operator that performs this in one line
> ifelse(x,y,z)
If x evaluates to TRUE, then y is returned. Otherwise z is returned. This turns out to be helpful because of
the vectorized nature of R programming. For example, x could be a vector of TRUE/FALSE values, whereas
the long form would have to be in a loop or use a roundabout coding method to achieve the same result.
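For example (an illustrative vector):
> x <- c(1, 5, 2, 8)
> ifelse(x < 3, 0, x)
[1] 0 5 0 8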
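The system.time() function times the evaluation of an R expression. For example, to time a call to a long-running function (slowfunction() and myinput are illustrative stand-ins):
> mytime <- system.time(myoutput <- slowfunction(myinput))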
The output of slowfunction() is stored in myoutput and the elements of mytime are user, system and total
times (in seconds), followed by totals of user and system times of child processes spawned by the expression.
I often find it inconvenient to pass code to system.time(), so instead we can call proc.time() directly,
which reports how much time the current R session has consumed, and subtract:
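> mystart <- proc.time()
> myoutput <- slowfunction(myinput)   # the same illustrative call as above
> mytime <- proc.time() - mystart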
We are generally interested only in the first number returned by system.time() or proc.time().
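9.9 Calling C Functions from R
9.9.1 How to Write the C Code
Speed-critical computations can be written in C, compiled into a shared library, and called from R. A function to be called through the .C() interface should have return type void and take pointer arguments; results are passed back to R through those arguments.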
No header files need to be included unless more advanced functionality is required, such as passing complex
numbers (which are passed as a particular C structure). In that case, include the file R.h.
Do not use malloc() or free() in C code to be used with R. Instead use the R functions Calloc() and
Free(). R does its own memory management, and mixing it with the default C memory stuff is a bad idea.
Outputting from inside C code should be done using Rprintf(), warning(), or error(). These functions
have the same syntax as the regular C command printf(), which should not be used.
It should also be noted that long computations in compiled code cannot be interrupted by the user. In
order to check for an interrupt signal from the user, we include
#include <R_ext/Utils.h>
...
R_CheckUserInterrupt();
Writing compiled code is worth the trouble only for extremely demanding problems. If possible, I suggest working directly in R. It's quite fast, as interpreted languages go.
9.9.2 How to Use the Compiled Functions
To compile the library, from the command line (not inside of R) use the command
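R CMD SHLIB mycode.c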
This will generate a shared library called mycode.so. To call a function from this library we load the library
using dyn.load() and then call the function using the .C() command. This command makes a copy of each
of its arguments and passes them all by reference to C, then returns them as a list. For example, suppose we
have written and compiled a C function gdot() that computes the dot product of two vectors; we could call it using
> x<-c(1,4,6,2)
> y<-c(3,2.4,1,9)
> dyn.load("mycode.so")
> product<-.C("gdot",myx=as.double(x),myy=as.double(y),myn=as.integer(NROW(x)),myoutput=numeric())
> product$myoutput
[1] 36.6
Notice that when .C() was called, names were given to the arguments only for convenience (so the resulting
list would have names too). The names are not passed to C. It is good practice (and often necessary) to use
as.double() or as.integer() around each parameter passed to .C(). If compiled code does not work or
works incorrectly, this should be checked first.
It is important to create within R any vectors that will be passed to .C() before making the call. If
the data being passed to .C() is large and making a copy for passing is not desirable, we can instruct .C()
to edit the data in place by passing the parameter DUP=FALSE. The programmer should be very wary when
doing this, because any variable changed in the C code will be changed in R also and there are subtle caveats
associated with this. The help file for .C() or online documentation give more information.
There is also a .Fortran() function. Notice that .C() and .Fortran() are the simple ways to call
functions from these languages; they do not handle NA values or complicated R objects. A more flexible and
powerful way of calling compiled functions is .Call(), which handles many more types of R objects but adds
significantly to the complexity of the programming. The .Call() function is a relatively recent addition to
R, so most of the language was written using the simple but inflexible .C().
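For example, the following C function fills in an AR(1) series y with persistence parameter rho, drawing standard normal innovations from R's internal random number generator: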
#include<Rmath.h>
void ar1(double *y,double *rho,double *N){
int i;
GetRNGstate();
for (i=1;i<N[0];i++){
y[i]=rho[0]*y[i-1]+rnorm(0.0,1.0);
}
PutRNGstate();
}
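Assuming this file is saved as ar1.c and compiled with R CMD SHLIB ar1.c, we can generate a series of 5000 observations with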
> dyn.load("ar1.so")
> X<-.C("ar1",x=double(len=5000),rho=as.double(.9),n=as.integer(5000))$x
Most common mathematical operations, such as sqrt(), are also available through the C interface.
Actual R expressions can also be called from within C, but this is not recommended since it invokes a
new instance of R and is slower than terminating the C code, doing the computation in R, and calling another
C function. The method for doing this is the (now deprecated) call_R() function.
10 Changing Configurations
10.1 Default Options
A number of runtime options relating to R’s behavior are governed by the options() function. Running
this function with no arguments returns a list of the current options. One can change the value of a single
option by passing the option name and a new value. For temporary changes, the option list may be saved
and then reused.
> options(digits=10)
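To change an option temporarily, we can save the full list and restore it afterwards (a minimal sketch):
> oldops <- options()   # save the current settings
> options(digits=10)
> # ... computations requiring extra digits ...
> options(oldops)       # restore the saved settings
One particularly useful option is error, which governs R's behavior when an error occurs. Setting it to recover drops us into an interactive menu whenever an error is raised, from which we can select a frame from the call stack and browse its variables. For example, consider debugging an error raised while sourcing a script: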
> options(error=recover)
> source("log.R")
Error: subscript out of bounds
1: source("log.R")
2: eval.with.vis(ei, envir)
3: eval.with.vis(expr, envir, enclos)
4: mypredict(v12, newdata = newdata)
Selection: 4
Called from: eval(expr, envir, enclos)
Browse[1]> i
[1] 1
Browse[1]> j
[1] 301
Pressing enter while in browse mode takes the user back to the menu. After debugging, we can set error to
NULL again.
11.3 Saving as LATEX
R objects can also be saved as LATEX tables using the latex() command from the Hmisc library. The most
common use we have had for this command is to save a table of the coefficients and estimates of a regression.
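For example (a minimal sketch; lm.result is assumed to be the output of a call to lm()):
> library(Hmisc)
> sumobj <- summary(lm.result)
> latex(sumobj$coefficients, file="summary.tex")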
This produces a file named “summary.tex” containing a tabular version of the coefficient matrix, suitable for inclusion in a LATEX source file5
12 Conclusion
R provides an effective platform for econometric computation and research. It has built-in functionality
sufficiently advanced for professional research and has a reasonably steep learning curve (if you put knowledge
on the y axis and effort on the x). Because R is a programming language as well as an econometrics program,
it allows for more complex, tailored computations and simulations than one would get in a prepackaged
system. On the other hand, it takes some time to become familiar with the syntax and reasoning of the
language. I hope that this guide eases the burden of learning to program and do standard data analysis in
the finest statistical environment available. Don’t forget to let me know if you feel like I didn’t do this or
have a suggestion about how I could do it better.
5 Under Linux, at least, the latex() command also pops up a window showing how the output will look.
13 Appendix: Code Examples
13.1 Monte Carlo Simulation
The following block of code creates a vector of randomly distributed data X with 25 members. It then creates
a y vector that is conditionally distributed as
$y = 2 + 3x + \epsilon$. (12)
It then regresses y on x and stores the slope coefficient. The generation of y and calculation of
the slope coefficient are repeated 500 times. The mean and sample variance of the slope coefficient are then
calculated. A comparison of the sample variance of the estimated coefficient with the analytic solution for
the variance of the slope coefficient is then possible.
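A sketch of such a simulation (a reconstruction; details of the original code may have differed):
> A <- numeric(500)            # storage for the slope estimates
> x <- rnorm(25)               # the fixed regressor vector
> for (i in 1:500){
+ y <- 2 + 3*x + rnorm(25)     # draw y according to the model
+ A[i] <- lm(y~x)$coef[2]      # save the estimated slope
+ }
> mean(A)
> var(A)
13.2 The Haar Wavelet
The Haar mother wavelet is
$\psi(x) = 1$ for $0 \le x < \frac{1}{2}$, $\psi(x) = -1$ for $\frac{1}{2} \le x < 1$, and $\psi(x) = 0$ otherwise. (13)
The function below evaluates it using a loop and if statements (a reconstruction; the original code may have differed):
> haar <- function(x){
+ y <- 0*x
+ for (i in 1:NROW(y)){
+ if (x[i] >= 0 && x[i] < 0.5){
+ y[i] <- 1
+ } else if (x[i] >= 0.5 && x[i] < 1){
+ y[i] <- -1
+ }
+ }
+ y
+ }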
This function returns the value of the wavelet for the scalar or vector passed to it. Notice that a better
version of this code would use a vectorized comparison, but this version serves as an example of conditionals,
including the else statement. The interested student could rewrite this function without using a loop.
Notice also the use of the logical ‘and’ operator, &&, in the if statement. The logical ‘or’ operator is
the double vertical bar, ||. These operators are intended for single TRUE/FALSE conditions, such as the
condition of an if statement, and they examine only the first element of each argument. For elementwise
comparisons of whole vectors, use the single & and | operators.
13.3 Maximum Likelihood Estimation
Now we consider code to find the maximum likelihood estimator of the coefficients in a nonlinear model. Let us assume
a normal distribution on the additive errors
$y = aL^bK^c + \epsilon$ (14)
Notice that the best way to solve this problem is a nonlinear least squares regression using nls(). We do
the maximum likelihood estimation anyway. First we write a function that returns the log likelihood value
(actually the negative of it, since minimization is more convenient) then we optimize using nlm(). Notice
that Y, L, and K are vectors of data and a, b, and c are the parameters we wish to estimate.
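A sketch of the estimation follows (a reconstruction; the function name, starting values, and log-variance parameterization are illustrative):
> mloglik <- function(theta, Y, L, K){
+ n <- NROW(Y)
+ e <- Y - theta[1]*L^theta[2]*K^theta[3]   # residuals at these parameter values
+ s2 <- exp(theta[4])                       # log-variance keeps the variance positive
+ (n/2)*log(2*pi*s2) + sum(e^2)/(2*s2)      # negative of the log likelihood
+ }
> mle <- nlm(mloglik, c(1, 0.5, 0.5, 0), Y=Y, L=L, K=K)
> mle$estimate
13.4 Extracting Information from a Large File
Consider a large comma-separated file of option price data, the first few lines of which look like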
date,exdate,cp_flag,strike_price,best_bid,best_offer,volume,impl_volatility,optionid,cfadj,ss_flag
04JAN1997,20JAN1997,C,500000,215.125,216.125,0,,12225289,1,0
04JAN1997,20JAN1997,P,500000,0,0.0625,0,,11080707,1,0
04JAN1997,20JAN1997,C,400000,115.375,116.375,0,,11858328,1,0
Reading this file on my (relatively slow) computer is all but impossible using read.csv().
> LENGTH<-600000
> myformat<-list(date="",exdate="",cp="",strike=0,bid=0,ask=0,
+ volume=0,impvolat=0,id=0,cjadj=0,ss=0)
> date=character(LENGTH)
> exdate=character(LENGTH)
> price=numeric(LENGTH)
> strike=numeric(LENGTH)
> f<-file("301328226.csv")
> open(f,open="r")
> titles<-readLines(f,n=1) # skip the first line
> i<-1
> repeat{
+ b<-scan(f,what=myformat,sep=",",nlines=1,quiet=T)
+ if (length(b$date)==0) break
+ if (b$cp=="P"){
+ date[i]<-b$date
+ exdate[i]<-b$exdate
+ price[i]<-(b$bid+b$ask)/2
+ strike[i]<-b$strike
+ i<-i+1
+ }
+ }
> close(f)
This read took about 5 minutes. Notice that I created the vectors ahead of time in order to prevent having
to reallocate every time we do a read. I had previously determined that there were fewer than 600000 puts in
the file. The variable i tells me how many were actually used. If there were more than 600000, the program
would still run, but it would reallocate the vectors at every iteration (which is very slow).
This probably could have been sped up considerably by reading many rows at a time, and memory could
have been saved by converting the date strings to dates using as.Date() at each iteration (see section 2.4).
I welcome suggestions on improvements to this example.
[Figure: contour plot of the posterior, with the posterior mode marked; the vertical axis is labeled “sigma squared”.]
A useful page of R tips is available at http://www.ku.edu/~pauljohn/R/Rtips.html.
Other manuals and books (both official and contributed) can be downloaded from the R website. If a
common econometric problem is not addressed in this text, I would like to hear about it so I can improve
the book.