14.170: Programming For Economists: Melissa Dell Matt Notowidigdo Paul Schrimpf

14.
170: Programming for

Economists
1/12/2009-1/16/2009
Melissa Dell
Matt Notowidigdo
Paul Schrimpf
Lecture 2, Intermediate Stata
Warm-up and review

Before going to new material, lets talk about the
exercises
exercise1a.do (privatization DD)
TMTOWTDI!
You had to get all of the different exp* variables
into one variable. You had many different
solutions, many of them very creative.
Replace missing values with 0, and then you
added up all the exp* variables
You used the rsum() command in egen (this
treats missing values as zeroes which is
strange, but it works)
You hard-coded everything
Intermediate Stata overview slide

Quick tour of other built-in commands: non-parametric
estimation, quantile regression, etc.
If youre not sure its in there, ask someone. And then consult
reference manuals. And (maybe) e-mail around. Dont re-invent
the wheel! If its not too hard to do, its likely that someone has
already done it.
Examples:
Proportional hazard models (streg, stcox)

Generalized linear models (glm)
Kernel density (kdensity)
Conditional fixed-effects poisson (xtpoisson)
Arellano-Bond dynamic panel estimation (xtabond)
But sometimes newer commands dont (yet) have exactly

what you want, and you will need to implement it yourself
e.g. xtpoisson doesnt have clustered standard errors
Monte carlo simulations in Stata
You should be able to do this based on what you learned last

lecture (you know how to set variables and use control
structures). Just need some matrix syntax.
More with Stata matrix syntax
Precursor to Mata, Stata matrix languae has many useful built-in

matrix algebra functions
Intermediate Stata commands
Hazard models (streg, stcox)

Non-parametric estimation (kdensity)
Quantile regression (qreg)
Conditional fixed-effects poisson (xtpoisson)
Arellano-Bond dynamic panel estimation (xtabond)
I have found these commands easy to use, but the

econometrics behind them is not always simple. Make
sure to understand what you are doing when you are
running them. Its easy to get results, but with many of
these commands, the results are sometimes hard to
interpret.
But first, a quick review and an easy warm-up
Quick review, FE and IV

clear
set obs 10000
gen id = floor( (_n - 1) / 2000)
bys id: gen fe = invnorm(uniform()) if _n == 1
by id: replace fe = fe[1]
gen
gen
gen
gen
gen
gen
spunk
= invnorm(uniform())
z
schooling = invnorm(uniform()) + z + spunk + fe
ability
= invnorm(uniform()) + spunk
e
y = schooling + ability + e + 5*fe
reg y schooling
xtreg y schooling , i(id) fe
xi: reg y schooling i.id
xi i.id
reg y schooling _I*
areg y schooling, absorb(id)
ivreg y (schooling = z) _I*
xtivreg y (schooling = z), i(id)
xtivreg y (schooling = z), i(id) fe
Data check
Results
Results, cont
Results, cont
Results,
cont
Results,
cont
Fixed effects in Stata

Many ways to do fixed effects in Stata. Which is
best?
xi: regress y x i.id is almost always inefficient
xi i.id creates the fixed effects as variables (as
_Iid0, _Iid1, etc.), so assuming you have the space
this lets you re-use them for other commands (e.g.
further estimation, tabulation, etc.)
areg is great for large data sets; it avoids creating
the fixed effect variables because it demeans the data
by group (i.e. it is purely a within estimator). But it is
not straightforward to get the fixed effect estimates
themselves (help areg postestimation)
xtreg is an improved version of areg. It should
probably be used instead (although requires panel id
variable to be integer, cant have a string)
What if you want state-by-year FE in a large data set?

E[y] = g(X*B) + e
g() is called the link function. Statas glm command
supports log, logit, probit, log-log, power, and negative
binomial link functions
Can also make distribution of e non-Gaussian and
make a different parametric assumption on the error
term (Bernoulli, binomial, Poisson, negative binomial,
gamma are supported)
Note that not all combinations make sense (i.e. cant
have Gaussian errors in a probit link function)
This is implemented in Statas ML language (more on
this next lecture)
If link function or error distribution you want isnt in there,
it is very easy to write in Statas ML langauge (again, we
will see this more next lecture)
See Finkelstein (QJE 2007) for an example and
discussion of this technique.
glm digression
Manning (1998)
In many analyses of expenditures on health care, the expenditures
for users are subject to a log transform to reduce, if not eliminate,
the skewness inherent in health expenditure data In such cases,
estimates based on logged models are often much more precise
and robust than direct analysis of the unlogged original dependent
variable. Although such estimates may be more precise and robust,
no one is interested in log model results on the log scale per se.
Congress does not appropriate log dollars. First Bank will not
cash a check for log dollars. Instead, the log scale results must be
retransformed to the original scale so that one can comment on the
average or total response to a covariate x. There is a very real
danger that the log scale results may provide a very misleading,
incomplete, and biased estimate of the impact of covariates on the
untransformed scale, which is usually the scale of ultimate interest.
glm
clear
set obs 100
gen
gen
gen
gen
x = invnormal(uniform())
e = invnormal(uniform())
y = exp(x) + e
log_y = log(y)
reg y x
reg log_y x, robust
glm y x, link(log) family(gaussian)
glm, cont
- Regression in levels
produces coefficient
that is too large,
while regression in
logs produces
coefficient that is too
low (which we expect
since distribution of y
is skewed)
Non-parametric estimation
Stata has built-in support for kernel densities. Often a useful
descriptive tool to display smoothed distributions of data
Can also non-parametrically estimate probability density functions of
interest.
Example: Guerre, Perrigne & Vuong (EMA, 2000) estimation of firstprice auctions with risk-neutral bidders and iid private values:
Estimate distribution of bids non-parametrically
Use observed bids and this estimated distribution to construct
distribution of values
Assume values are distributed according to following CDF:
F (v ) 1 e v
Then you can derive the following bidding function for N=3 bidders
(v 0.5)e 2v 2(1 v)e v 1.5

b
1 2e v e 2 v
QUESTION: Do bidders shade their bids for all values?
GPV with kdensity

clear
set mem 100m
set seed 14170
set obs 5000
local N = 3
gen value = -log(1-uniform())
gen bid = ( (value+0.5)*exp(-2*value)-2*(1+value)*exp(-value)+1.5 ) ///
/ (1-2*exp(-value)+exp(-2*value))
sort bid
gen cdf_G = _n / _N
kdensity bid, width(0.2) generate(b pdf_g) at(bid)
** pseudo-value backed-out from bid distribution
gen pseudo_v = bid + (1/(`N'-1))*cdf_G/pdf_g
twoway
(kdensity value, width(0.2) ) (kdensity pseudo_v, width(0.2) ), ///

title("Kernel densities of actual values and pseudo-values") ///
scheme(s2mono) ylabel(, nogrid) graphregion(fcolor(white)) ///
legend(region(style(none))) ///
legend(label(1 "Actual values")) ///
legend(label(2 "Pseudo-values")) ///
legend(cols(1)) ///
xtitle("valuation")
graph export gpv.eps, replace
GPV with kdensity
Quantile regression (qreg)

qreg log_wage age female edhsg edclg black other _I*, quantile(.1)
matrix temp_betas = e(b)
matrix betas = (nullmat(betas) \ temp_betas)
QUESTIONS:
- What does it mean if the coefficient on edclg differs by quantile?
- What are we learning when the coefficients are different? (HINT: What does it
tell us if the coefficient is nearly the same in every regression)
- What can you do if education is endogenous?
Non-linear least squares (NLLS)

clear
set obs 50
global alpha = 0.65
gen k=exp(invnormal(uniform()))
gen l=exp(invnormal(uniform()))
gen e=invnormal(uniform())
gen y=2.0*(k^($alpha)*l^(1-$alpha))+e
nl (y = {b0} * l^{b2} * k^{b3})
Non-linear least squares (NLLS)

NLLS: minimize the sum of squared
residuals to get parameter estimates
QUESTION: How are the standard errors
calculated?
The nl method provides built-in support for

NLLS minimization, and also provides robust
and clustered standard errors.
The syntax allows for many types of
nonlinear functions of data and parameters
Will see examples of NLLS in Stata ML later
More NLLS (CES)

clear
set obs 100
set seed 14170
n 1/ n
Y A (1 d ) K (d ) L
n
global d = 0.6
global n = 4.0
global A = 2.0
gen k = exp(invnormal(uniform()))
gen l = exp(invnormal(uniform()))
gen e = 0.1 * invnormal(uniform())
** CES production function
gen y = ///
$A*( (1-$d)*k^($n) + $d*l^($n) )^(1/$n) + e
nl (y = {b0}*( (1-{b1})*k^({b2}) + ///
{b1}*l^({b2}) )^(1/{b2}) ), ///
init(b0 1 b1 0.5 b2 1.5) robust
More NLLS
Conditional FE Poisson (xtpoisson)

Useful for strongly skewed count data (e.g. days
absent), especially when there are a lot of zeroes
(since otherwise a log transformation would probably
be fine in practice)
xtpoisson provides support for fixed and random
effects
xtpoisson days_absent gender math reading, i(id) fe
See Acemoglu-Linn (QJE 2004) for a use of this

technique (using number of approved drugs as the
count dependent variable)
Note that they report clustered standard errors, which
are NOT built into Stata
NOTE: this command is implemented in Statas ML
language
Arellano-Bond estimator (xtabond)

Dynamic panel data estimator using GMM
Specification is lagged dependent variable and
use excluded lags as instruments for the other
lags
Example of a GMM implementation in Stata
Syntax:
tsset state year
xtabond log_payroll miningXoilprice _I*, lags(2)
tsset is standard command to tell Stata you

have a time-series data set (the panel variable is
optional for some commands, but for xtabond it
is required)
Other important commands

The following commands
are commonly used and
you should be aware of
them (since they are all
ML estimators, we will
see some of them
tomorrow)
probit
tobit
logit
clogit
ivprobit
ivtobit
I will also not be

discussing these useful
commands:
heckman
cnsreg
mlogit
mprobit
ologit
oprobit
You should look these up

on your own, especially
after Lecture 3
Stata matrix language

Before Mata (Lecture 4), Stata had built-in
matrix language. Still useful even with Mata
because Mata syntax is somewhat
cumbersome
When to use Stata matrix language:
Adding standard errors to existing estimators
Writing new estimators from scratch (when such
estimators are naturally implemented using
matrix algebra)
Storing bootstrapping and Monte Carlo results
(simulations)
Monte Carlo in Stata

There is a simulate command that is supposed
to make your life easier. I dont think it does, but
you should decide for yourself. (help simulate)
Monte Carlo simulations can clarify intuition
when the math isnt obvious.
EXAMPLE: We will use a simulation to
demonstrate the importance of using a robust
variance-covariance matrix in the presence of
heteroskedasticity.
clear
set more off
set mem 100m
set matsize 1000
local B = 1000
matrix Bvals = J(`B', 1, 0)
matrix pvals = J(`B', 2, 0)
forvalues b = 1/`B' {
drop _all
quietly set obs 200
gen cons = 1
gen x = invnormal(uniform())
gen e = x*x*invnormal(uniform())
gen y = 0*x + e
qui regress y x cons, nocons
matrix betas = e(b)
matrix Bvals[`b',1] = betas[1,1]
qui testparm x
matrix pvals[`b',1] = r(p)
qui regress y x cons , robust nocons
qui testparm x
matrix pvals[`b',2] = r(p)
}
drop _all
svmat Bvals
svmat pvals
summ *, det
Monte Carlo
in Stata,
cont
Monte Carlo in Stata, cont

set more off
On UNIX, this will keep the buffer from locking
set matsize 1000
Sets default matrix size
matrix Bvals = J(`B', 1, 0)
Creates `B-by-1 matrix
drop _all
Unlike clear, this only drops the data (NOT

matrices!)
quietly set obs 200
Suppresses output
qui regress y x cons, nocons
qui is abbreviation; nocons means constant not

included
matrix betas = e(b)
e() stores the return values from regression;

e(b) is betas
matrix Bvals[`b',1] = betas[1,1]
syntax to set matrix values
qui testparm x
performs a Wald test to see if x is

statistically significant
qui regress y x cons , robust nocons

svmat Bvals
uses robust standard errors
writes out matrix as a data column
Monte Carlo in Stata, cont
OLS by hand
clear
set obs 10
set seed 14170
gen x1 = invnorm(uniform())
gen y = 1 + x1 + x2 + 0.1 * invnorm(uniform())
gen cons = 1
mkmat x1 x2 cons, matrix(X)
mkmat y, matrix(y)
matrix list X
matrix list y
(X ' X ) X ' y
1
matrix beta_ols = invsym(X'*X) * (X'*y)

matrix e_hat = y - X * beta_ols
matrix V = (e_hat' * e_hat) * invsym(X'*X) / (rowsof(X) - colsof(X))
matrix beta_se = (vecdiag(V))'
local rows = rowsof(V)
forvalues i = 1/`rows' {
matrix beta_se[ì',1] = sqrt(beta_se[ì',1])
}
matrix ols_results = [beta_ols, beta_se]
matrix list ols_results
reg y x1 x2
OLS by hand
clear
set obs 100000
set seed 14170
gen y = 1 + x1 + x2 + 0.1 * invnorm(uniform())
gen cons = 1
mkmat x1 x2 cons, matrix(X)
mkmat y, matrix(y)
matrix list X
matrix list y
matrix beta_ols = invsym(X'*X) * (X'*y)
matrix e_hat = y - X * beta_ols
matrix V = (e_hat' * e_hat) * invsym(X'*X) /
(rowsof(X) - colsof(X))
}
reg y x1 x2
(help set matsize; maximum matrix size is

11,000 on Stata/SE and Stata/MP)
clear
set obs 100000
set seed 14170
gen y = 1 + x1 + x2 + 100 * invnorm(uniform())
OLS by hand
v2.0
global xlist = "x1 x2"

matrix accum XpX = $xlist
matrix vecaccum Xpy = y $xlist
matrix beta_ols = invsym(XpX) * Xpy'
matrix list beta_ols
gen e_hat = y
local i = 1
foreach var of varlist $xlist {
replace e_hat = e_hat - beta_ols[ì',1] * `var'
local i = ì' + 1
}
** constant term!
replace e_hat = e_hat - beta_ols[ì',1]
matrix accum e2 = e_hat, noconstant
matrix V = invsym(XpX) * e2[1,1] / (_N - colsof(XpX))
}
reg y x1 x2
clear
set obs 1000
program drop _all
program add_stat, eclass
ereturn scalar `1' = `2'
end
helper programs
gen z = invnorm(uniform())
gen v = invnorm(uniform())
gen x = .1*invnorm(uniform()) + 2.0*z + 10.0*v
gen y = 3.0*x + (10.0*v + .1*invnorm(uniform()))
reg y x
estimates store ols
reg x z
test z
return list
add_stat "F_stat" r(F)
estimates store fs
reg y z
estimates store rf
ivreg y (x = z)
estimates store iv
estout * using baseline.txt, drop(_cons) ///
stats(F_stat r2 N, fmt(%9.3f %9.3f %9.0f)) modelwidth(15) ///
cells(b( fmt(%9.3f)) se(par fmt(%9.3f)) p(par([ ]) fmt(%9.3f)) ) ///
style(tab) replace notype mlabels(, numbers )
helper programs
helper programs
ADO files in Stata

**
** Monte Carlo to investigate heteroskedasticity-robust s.e.'s
**
clear
set more off
set seed 14170
local count = 0
global B = 1000
forvalues i = 1(1)$B {
quietly {
clear
set obs 2000
gen x = invnorm(uniform())
gen y = 0*x + abs(0.1*x)*invnorm(uniform())
regress y x
test x
if (r(p) < 0.05) {
local count = `count' + 1
}
}
}
local rate = `count' / $B
di "Rejection rate (at alpha=0.05): `rate'"
0.236
ADO files in Stata

**
**
clear
set more off
set seed 14170
local count = 0
global B = 1000
quietly {
clear
set obs 2000
regress y x, robust
test x
if (r(p) < 0.05) {
}
}
}
0.048
ADO files in Stata

(robust_regress.ado file)
program define robust_regress, eclass
syntax varlist
gettoken depvar varlist: varlist
quietly regress `depvar' `varlist'
predict resid, residuals
gen esample = e(sample)
local obs = e(N)
matrix betas = e(b)
matrix accum XpX = `varlist'
gen all = _n
sort all
matrix opaccum W = `varlist', opvar(resid) group(all)
matrix V = invsym(XpX) * W * invsym(XpX)
ereturn post betas V, dep(`depvar') o(òbs') esample(esample)
ereturn display
end
N
Vrobust
x
)
(
x
)
(
X
X
)
i i
i i
i 1
( X X ) 1
ADO files in Stata

**
**
clear
set more off
set seed 14170
local count = 0
global B = 1000
quietly {
clear
set obs 2000
robust_regress y x
test x
if (r(p) < 0.05) {
}
}
}
0.049
ADO files in Stata

clear
set obs 2000
robust_regress y x
regress y x, robust
ADO files in Stata

(robust_regress.ado file)
program define robust_regress, eclass
syntax varlist
gettoken depvar varlist: varlist
quietly reg `depvar' `varlist', robust
predict resid, residuals
gen esample = e(sample)
local obs = e(N)
matrix betas = e(b)
matrix accum XpX = `varlist'
gen all = _n
sort all
matrix opaccum W = `varlist', opvar(resid) group(all)
matrix V = (_N/(_N-colsof(XpX))) * invsym(XpX) * W * invsym(XpX)
ereturn post betas V, dep(`depvar') o(òbs') esample(esample)
ereturn display
end
Vrobust
( X X ) 1
N K
1
(
x
)
(
x
)
i i
i i (X X )
i 1
Reflections on Trusting Trust

Ken Thompson Turing Award speech:
( http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf )
You can't trust code that you did not totally

create yourself. (Especially code from
companies that employ people like me).
(an aside) More on trusting trust

clear
set obs 2000
set seed 14170
gen id = 1+floor((_n - 1)/50)
gen y = x + ///
abs(x)*invnorm(uniform())+id
areg y x, ///
cluster(id) absorb(id)
STATA v9.1
STATA v10.0
Exercises
Go to following URL:
http://web.mit.edu/econ-gea/14.170/exercises/
Download each DO file
No DTA files! All data files loaded from the web (see help
webuse)
1 exercise (increasing difficulty)
A. Monte carlo test of OLS/GLS with serially correlated data
B. Heckman two-step with bootstrapped standard errors
C. Correcting for measurement error of known form

14.170: Programming For Economists: Melissa Dell Matt Notowidigdo Paul Schrimpf

Uploaded by

Copyright:

Available Formats

14.170: Programming For Economists: Melissa Dell Matt Notowidigdo Paul Schrimpf

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

14.170: Programming For Economists: Melissa Dell Matt Notowidigdo Paul Schrimpf

Uploaded by

Copyright:

Available Formats

14.

170: Programming for

Lecture 2, Intermediate Stata

Warm-up and review

Intermediate Stata overview slide

Proportional hazard models (streg, stcox)

But sometimes newer commands dont (yet) have exactly

Monte carlo simulations in Stata

You should be able to do this based on what you learned last

More with Stata matrix syntax

Precursor to Mata, Stata matrix languae has many useful built-in

Intermediate Stata commands

Hazard models (streg, stcox)

I have found these commands easy to use, but the

Quick review, FE and IV

Fixed effects in Stata

Generalized linear models (glm)

(v 0.5)e 2v 2(1 v)e v 1.5

QUESTION: Do bidders shade their bids for all values?

GPV with kdensity

(kdensity value, width(0.2) ) (kdensity pseudo_v, width(0.2) ), ///

GPV with kdensity

Quantile regression (qreg)

Non-linear least squares (NLLS)

Non-linear least squares (NLLS)

The nl method provides built-in support for

More NLLS (CES)

Conditional FE Poisson (xtpoisson)

See Acemoglu-Linn (QJE 2004) for a use of this

Arellano-Bond estimator (xtabond)

tsset is standard command to tell Stata you

Other important commands

I will also not be

You should look these up

Stata matrix language

Monte Carlo in Stata

Monte Carlo in Stata, cont

On UNIX, this will keep the buffer from locking

set matsize 1000

Sets default matrix size

matrix Bvals = J(`B', 1, 0)

Creates `B-by-1 matrix

Unlike clear, this only drops the data (NOT

quietly set obs 200

qui regress y x cons, nocons

qui is abbreviation; nocons means constant not

matrix betas = e(b)

e() stores the return values from regression;

matrix Bvals[`b',1] = betas[1,1]

syntax to set matrix values

performs a Wald test to see if x is

qui regress y x cons , robust nocons

uses robust standard errors

writes out matrix as a data column

Monte Carlo in Stata, cont

matrix beta_ols = invsym(X'*X) * (X'*y)

(help set matsize; maximum matrix size is

global xlist = "x1 x2"

matrix beta_ols = invsym(X'X) (X'*y)