
ECON 293/MGTECON 634:

Machine Learning and Causal Inference

Stefan Wager
Stanford University

Lecture 2: Average Treatment Effects

12 April 2018
A central goal of machine learning is to understand what usually
happens in a given situation, e.g.,
I Given today’s weather, what’s the chance tomorrow’s air
pollution levels will be dangerously high?
Most economists want to predict what would happen if we changed
the system, e.g.,
I How does the answer to the above question change if we
reduce the number of cars on the road?
This class is about the interface of causal inference and machine
learning, with both terms understood broadly:
I Our discussion of causal inference draws from a long
tradition in economics and epidemiology on which questions
about counterfactuals can be answered using a given type of
data, and how these estimands can be interpreted.
I We use the term machine learning to describe an engineering
heavy approach to data analysis. Given a well-defined task in
which good performance can be empirically validated, we do
not shy away from computationally heavy tools or
potentially heuristic approaches (e.g., decision trees, neural
networks, non-convex optimization).
Today’s lecture is about average treatment effects:
I The potential outcomes model for causal inference in
randomized experiments.
I Observational studies and the propensity score.
I Double robustness, or how to use machine learning for
principled treatment effect estimation.
The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple


(Xi , Yi , Wi ), comprised of
I A feature vector Xi ∈ Rp ,
I A response Yi ∈ R, and
I A treatment assignment Wi ∈ {0, 1}.

Following the potential outcomes framework (Neyman, 1923;


Rubin, 1974), we posit the existence of quantities Yi (0) and Yi (1),
such that Yi = Yi (Wi ).
I Potential outcomes correspond to the response we would
have measured given that the i-th subject received
treatment (Wi = 1) or no treatment (Wi = 0).
I The causal effect of the treatment is Yi (1) − Yi (0).
The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple


(Xi , Yi , Wi ), comprised of
I A feature vector Xi ∈ Rp ,
I A response Yi ∈ R, and
I A treatment assignment Wi ∈ {0, 1}.

Our first goal is to estimate the average treatment effect (ATE)

τ = E [Yi (1) − Yi (0)] .

Of course, we only get to see Yi = Yi (Wi ). This “missing data”


issue is a fundamental problem in causal inference.
The potential outcomes framework

The simplest way to identify the ATE in the potential outcomes
framework is via a randomized trial:

{Yi (0), Yi (1)} ⊥⊥ Wi .

In a randomized trial, we can check that:

τ = E [Yi (1)] − E [Yi (0)]
  = E [Yi (1) | Wi = 1] − E [Yi (0) | Wi = 0]
  = E [Yi | Wi = 1] − E [Yi | Wi = 0] ,

where the last line only has observable moments.

Thus, although we never observe τi = Yi (1) − Yi (0), we can


consistently estimate τ = E [τi ] in a randomized trial.
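As a concrete (invented) illustration of this identity, the following pure-Python sketch simulates an RCT and estimates τ by a difference in mean outcomes. The baseline level (130), noise scales, and effect size (−10) are made up, not taken from the lecture.

```python
# Invented illustration: simulate potential outcomes, randomize treatment,
# and estimate tau from observed outcomes only (Yi = Yi(Wi)).
import random

random.seed(0)
n = 10_000
tau_true = -10.0  # hypothetical average effect of the driving restriction

y_treat, y_control = [], []
for _ in range(n):
    y0 = random.gauss(130.0, 20.0)             # potential outcome Y(0)
    y1 = y0 + tau_true + random.gauss(0, 5.0)  # potential outcome Y(1)
    w = random.random() < 0.5                  # randomized assignment
    (y_treat if w else y_control).append(y1 if w else y0)

# Difference in means over treated and control units.
tau_hat = sum(y_treat) / len(y_treat) - sum(y_control) / len(y_control)
```

With n this large, tau_hat lands close to the (unobservable) average of the unit-level effects.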
Example: The outcome Yi is daily air quality index. The
treatment imposes restrictions on driving to reduce traffic.

     Yi (0)   Yi (1)     τi
     154.68   153.49   -1.20
     135.67   120.40  -15.27
     103.46   117.68   14.23
     117.62    95.08  -22.54
     161.11   146.73  -14.39
     117.89   105.05  -12.84
      84.00    75.59   -8.41
      73.32    65.68   -7.63
     100.07    93.80   -6.28
     103.81    82.30  -21.51
        ···      ···     ···
     111.68   101.47  -10.21

[Figures: scatter plot of Y(1) against Y(0), and histogram of the
unit-level effects (Frequency vs. Y(1) − Y(0)).]
Example: The outcome Yi is daily air quality index. The
treatment imposes restrictions on driving to reduce traffic.

     Yi (0)   Yi (1)    τi
     154.68     —       —
     135.67     —       —
        —     117.68    —
        —      95.08    —
        —     146.73    —
     117.89     —       —
        —      75.59    —
        —      65.68    —
     100.07     —       —
        —      82.30    —
       ···      ···    ···
     110.59   100.52    —

I In practice, we only ever observe a single potential outcome.
I However, in an RCT, we can use averages over the treated and
  controls to estimate the ATE.
I We estimate τ̂ as 100.52 − 110.59 = −10.07.
ATE estimation in randomized trials

We have used the potential outcomes framework to justify the


classical estimator of an average treatment effect:
τ̂ = Σ_{i:Wi =1} Yi / |{i : Wi = 1}|  −  Σ_{i:Wi =0} Yi / |{i : Wi = 0}| .

This estimator is unbiased, consistent, asymptotically


Gaussian, and also very simple. But is it the best we can do?
I If one has access to covariates Xi and can estimate
  E[Yi | Xi , Wi ] accurately, then one can improve the
  precision of the above estimator.
I Any black-box predictor can be used for this (e.g., a forest,
boosted trees, a deep net); the improvement in precision
depends on mean-squared error.
ATE estimation in randomized trials
The simplest ATE estimator in an RCT is
τ̂ = Σ_{i:Wi =1} Yi / |{i : Wi = 1}|  −  Σ_{i:Wi =0} Yi / |{i : Wi = 0}| .

How could we possibly improve on this?


I In the air quality example, weather has an effect on ozone
(hot days have higher levels), independently of treatment.
I If we randomly assign treatment to more hot days and control
  to more cold days, our estimates will exaggerate the
  treatment effect, and vice-versa.
I In large samples these effects cancel out, but in small
samples they matter. If we could predict and eliminate the
effect of weather, we’d improve accuracy.
The traditional approach to this is via stratified sampling; here,
we’ll discuss an automatic approach that only assumes the
existence of a good predictor.
ATE estimation in randomized trials

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple


(Xi , Yi , Wi ), comprised of
I A feature vector Xi ∈ Rp ,
I A response Yi ∈ R, and
I A treatment assignment Wi ∈ {0, 1}.
Define the conditional response surfaces as

µ(w ) (x) = E[Yi | Xi = x, Wi = w ].

In the potential outcomes model, an oracle who knew the µ(w ) (x)
could use
τ̄ = (1/n) Σ_{i=1}^n ( µ(1) (Xi ) − µ(0) (Xi ) ) .

Our approach starts by seeking to imitate this oracle.


ATE estimation via prediction
In the potential outcomes model, an oracle who knew the µ(w ) (x)
could use
τ̄ = (1/n) Σ_{i=1}^n ( µ(1) (Xi ) − µ(0) (Xi ) ) .

A first, naive approach simply sets


τ̂ = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) ) .

This is good if µ̂(w ) (x) is obtained via OLS. But it breaks down if
we use regularization.
Example. Suppose that p ≫ n, but the true model is sparse,

E[Y | X = x, W = w ] = 2X1 + 0.1W X2 .

A lasso might set the coefficient on WX2 to 0, and estimate τ̂ = 0!


ATE estimation via prediction

A better estimator needs to correct for regularization bias:


τ̂ = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) )           (optimistic plug-in)

   + Σ_{i:Wi =1} (Yi − µ̂(1) (Xi )) / |{i : Wi = 1}|    (bias correction for µ̂(1) (·))

   − Σ_{i:Wi =0} (Yi − µ̂(0) (Xi )) / |{i : Wi = 0}|    (bias correction for µ̂(0) (·))

Modulo technical details, this is justified for any µ̂(w ) (x). If
µ̂(w ) (x) can predict Yi at all, it can improve over the basic
estimator. If µ̂(w ) (x) is consistent, i.e.,
E[(µ̂(W ) (X ) − µ(W ) (X ))²] → 0, then this estimator is optimal
in large samples.


Details: Wager et al. High-Dim. Regression Adjust. in RCTs. PNAS, 113(45), 2016.
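A sketch of the bias-corrected estimator above on invented data, with a deliberately shrunken least-squares fit standing in for a regularized predictor; the data-generating process and the shrinkage factor (0.5) are hypothetical, not from the paper.

```python
# Bias-corrected regression adjustment in an RCT (invented data):
# plug-in term plus per-arm residual-mean corrections.
import random

random.seed(1)
n = 4000
X, W, Y = [], [], []
for _ in range(n):
    x = random.gauss(0, 1)        # e.g., temperature
    w = random.random() < 0.5     # randomized assignment
    y = 2.0 * x + (1.0 if w else 0.0) + random.gauss(0, 1)  # true tau = 1
    X.append(x); W.append(w); Y.append(y)

def fit_shrunken(idx):
    """Least-squares fit on a subsample, with the slope shrunk toward 0
    to mimic a regularized (hence biased) predictor."""
    xs = [X[i] for i in idx]; ys = [Y[i] for i in idx]
    mx = sum(xs) / len(xs); my = sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    slope *= 0.5                  # heavy shrinkage -> biased predictor
    return lambda x: my + slope * (x - mx)

treated = [i for i in range(n) if W[i]]
controls = [i for i in range(n) if not W[i]]
mu1, mu0 = fit_shrunken(treated), fit_shrunken(controls)

plug_in = sum(mu1(x) - mu0(x) for x in X) / n               # optimistic plug-in
corr1 = sum(Y[i] - mu1(X[i]) for i in treated) / len(treated)
corr0 = sum(Y[i] - mu0(X[i]) for i in controls) / len(controls)
tau_hat = plug_in + corr1 - corr0   # bias-corrected estimate, near tau = 1
```

The residual-mean corrections cancel the shrinkage bias of the plug-in term, which is the point of the slide's construction.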
ATE estimation via prediction
Example: We have n = 1000, p = 400, and P [W = 1] = 0.4, with

E[Y | X = x, W = w ] = 2X1 + 0.1W X2 ,    Xij ~iid U ([0, 2]) .

Predictions made via a cross-validated lasso (no intercept).

Consider 3 estimators, with mean-square errors for τ :
I basic: 0.105.
I bias-corrected: 0.092.
I plug-in: 0.210.

[Figure: distribution of estimates for the basic, corrected, and
plug-in estimators.]
Today’s lecture is about average treatment effects:
I The potential outcomes model for causal inference in
randomized experiments.
I Observational studies and the propensity score.
I Double robustness, or how to use machine learning for
principled treatment effect estimation.
Beyond randomized trials

The simplest way to move beyond randomized controlled trials is to


let randomization probabilities depend on covariate information.
I We are interested in giving teenagers cash incentives to
discourage them from smoking.
I A random subset of ∼ 5% of teenagers in Palo Alto, CA,
and a random subset of ∼ 20% of teenagers in Geneva,
Switzerland are eligible for the study.

Palo Alto Non-S. Smoker Geneva Non-S. Smoker


Treat.      152       6     Treat.      581      395
Control    2362     122     Control    2278     1979

This is not a randomized controlled study, because Genevans


are both more likely to smoke whether or not they get treated, and
more likely to get treated.
Beyond randomized trials

The Palo Alto experiment and Geneva experiment are both


individually randomized controlled studies—and looking at the
numbers clearly shows that the treatment helps prevent smoking.
Palo Alto Non-S. Smoker Geneva Non-S. Smoker
Treat. 152 6 Treat. 581 395
Control 2362 122 Control 2278 1979

Looking at aggregate data is misleading, and makes it look like the


treatment hurts.
Palo Alto + Geneva Non-Smoker Smoker
Treatment 733 401
Control 4640 2101

This phenomenon is an example of Simpson’s “paradox”.


Beyond randomized trials

Formally, we have covariates Xi ∈ {Palo Alto, Geneva}, and know


that the treatment assignment was random conditionally on Xi :

{Yi (0), Yi (1)} ⊥⊥ Wi | Xi .

We then estimate the overall average treatment effect as:


τ̂ = Σ_{x∈X} ( |{i : Xi = x}| / n ) τ̂ (x),

τ̂ (x) = Σ_{i:Xi =x, Wi =1} Yi / |{i : Xi = x, Wi = 1}|
       − Σ_{i:Xi =x, Wi =0} Yi / |{i : Xi = x, Wi = 0}| .
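The stratified estimator can be applied directly to the smoking-study counts on the earlier slides; the short script below recovers the within-city (negative) effects on the smoking rate and shows that the pooled comparison flips sign (Simpson's paradox).

```python
# Counts from the slides: (treated smokers, treated total,
# control smokers, control total), per city.
strata = {
    "Palo Alto": (6, 158, 122, 2484),
    "Geneva": (395, 976, 1979, 4257),
}

n = sum(t_n + c_n for _, t_n, _, c_n in strata.values())

# Stratified estimate: within-stratum effects, weighted by stratum size.
tau_strat = 0.0
for t_s, t_n, c_s, c_n in strata.values():
    tau_x = t_s / t_n - c_s / c_n          # effect on P(smoker) within stratum
    tau_strat += (t_n + c_n) / n * tau_x

# Pooled (misleading) comparison, ignoring X:
t_s = sum(v[0] for v in strata.values()); t_n = sum(v[1] for v in strata.values())
c_s = sum(v[2] for v in strata.values()); c_n = sum(v[3] for v in strata.values())
tau_pooled = t_s / t_n - c_s / c_n

print(round(tau_strat, 3), round(tau_pooled, 3))  # roughly -0.044 vs +0.042
```

The stratified estimate is negative in both cities (treatment reduces smoking), while the pooled contrast is positive because Geneva has both more smokers and more treated units.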
Covariates and unconfoundedness

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple


(Xi , Yi , Wi ), comprised of
I A feature vector Xi ∈ Rp ,
I A response Yi ∈ R, and
I A treatment assignment Wi ∈ {0, 1}.
We assume that the treatment is unconfounded (aka selection on
observables) (Rosenbaum & Rubin, 1983):

{Yi (0), Yi (1)} ⊥⊥ Wi | Xi .

We seek the ATE τ = E [Yi (1) − Yi (0)]. If Xi is discrete, we


can stratify: estimate an ATE for each x separately, and
aggregate. But what if X is continuous and/or high-dimensional?
Covariates and unconfoundedness

Given unconfoundedness {Yi (0), Yi (1)} ⊥⊥ Wi | Xi , we have

τ = E [Yi (1) − Yi (0)]
  = E [ E[Yi (1) | Xi ] − E[Yi (0) | Xi ] ]
  = E [ E[Yi | Xi , Wi = 1] − E[Yi | Xi , Wi = 0] ]
  = E [ µ(1) (Xi ) − µ(0) (Xi ) ] ,

where µ(w ) (x) = E[Yi | Xi = x, Wi = w ]. This suggests an
estimator based on a regression adjustment:
τ̂ = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) ) ,

where µ̂(w ) (Xi ) is obtained by regressing Yi on Xi using those
observations with Wi = w .
Is this going to work? Yes with OLS, no in general.
Review: ATE estimation via OLS

A classical approach to the ATE involves estimating µ(0) (x) and


µ(1) (x) via ordinary least-squares regression (OLS). Specifically,
in R notation, we first run two separate regressions (recall that lm
is the R command for running linear regression):

β̂(0) ← lm (Yi ∼ Xi , subset Wi = 0) ,


β̂(1) ← lm (Yi ∼ Xi , subset Wi = 1) .

We then make predictions µ̂(w ) (x) = β̂(w ) x, and obtain a


treatment effect estimate as
τ̂ = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) ) = (β̂(1) − β̂(0) ) · X̄ ,

where X̄ = (1/n) Σ_{i=1}^n Xi . Note that X̄ implicitly includes an intercept.
Review: ATE estimation via OLS
library(sandwich) # for robust standard errors
treat_data = read.table("...../nswre74_treated.txt")
control_data = read.table("...../psid3_controls.txt")
combined = rbind(treat_data, control_data)
X = combined[,2:9]; Y = combined[,10]; W = combined[,1]

# First center the X, then run OLS with full W:X


# interactions. With this construction, the
# W-coefficient can be interpreted as ATE.
X.centered = scale(X, center = TRUE, scale = FALSE)
ols.fit = lm(Y ~ W * X.centered)

# Use robust standard errors


tau.hat = coef(ols.fit)["W"]
tau.se = sqrt(sandwich::vcovHC(ols.fit)["W", "W"])
print(paste0("95% CI: ", round(tau.hat),
" +/- ", round(1.96 * tau.se)))
"95% CI: 2107 +/- 2379"
ATE estimation via the lasso?
OLS is optimal for learning linear models in low dimensions, i.e.,
with p predictors using n samples when p ≪ n. In many modern
applications, however, p may be of comparable size (or larger than)
the sample size n. In this case, we often use the lasso:
β̂lasso = argmin_β { (1/n) Σ_{i=1}^n (Yi − Xi β)² + λ ‖β‖1 } .

By analogy to the low-dimensional case, one may be tempted to use

β̂(0) ← lasso (Yi ∼ Xi , subset Wi = 0) ,
β̂(1) ← lasso (Yi ∼ Xi , subset Wi = 1) ,

τ̂ = (β̂(1) − β̂(0) ) · X̄ .

Is this any good? The fundamental difference between the lasso


and OLS is that the lasso has bias (and, in fact, any method in
high dimensions must have bias).
Imposing Sparsity: LASSO Crash Course

We are in a p-dimensional linear model with n samples. Assume


that there are at most a fixed number k of non-zero coefficients:
 
Yi = Xi β + εi ,    E[εi | Xi ] = 0,

such that ‖β‖0 ≤ k.


Lasso theory provides results on when we can consistently
estimate β, even when the number of features p may be much
larger than n.
The strength of the results depends on the amount of sparsity:
The smaller k is, the better the guarantees we can get.
Imposing Sparsity: LASSO Crash Course
Suppose X ∈ Rn×p satisfies a restricted eigenvalue condition: no
small group of variables is nearly collinear. Then we can show:
‖β̂ − β‖2 = OP (√(k log(p)/n)) ,    ‖β̂ − β‖1 = OP (k √(log(p)/n)) .

At a high level, this error arises because the lasso shrinks each
coefficient on the order of √(log(p)/n).
If k ≪ n/ log(p), we say the problem is sparse, and the lasso will
make accurate predictions. For example, if

Xij ~iid N (0, 1) ,   then   E[(X β̂ − X β)²] = ‖β̂ − β‖2² .

If k ≪ √n/ log(p), we say the problem is ultra-sparse, and we can
build confidence intervals for βj via the debiased lasso.
Why Lasso Regression Adjustments Don’t Work

We have a design with no treatment effect, but with a different


covariate distribution for the treated and controls. Specifically:
   
E[X | W = 0] = 0,    E[X | W = 1] = 1,

such that Y (w ) = X β(w ) + noise, Var[Y (w ) | X ] = σ², and

(β(0) )j = (β(1) )j = 1 ({j ≤ k}) σ √(log(p)/n),   and so

τ = E [X ] · (β(1) − β(0) ) = 0.

The lasso fits µ̂(w ) (x) = â(w ) + x β̂(w ) , with intercept â(w ) .
Why Lasso Regression Adjustments Don’t Work

In order to zero-out noise terms, the lasso must eliminate all
signals smaller than σ √(2 log(p)/n). Here, all β are smaller than
this cutoff, so the lasso pushes them to 0.
Thus, with high probability,

â(0)lasso ≈ 0,   β̂(0)lasso = 0,   and
â(1)lasso ≈ kσ √(log(p)/n),   β̂(1)lasso = 0.

Combining these into an ATE estimate, we get

τ̂ lasso = (â(1)lasso − â(0)lasso ) + X̄ · (β̂(1)lasso − β̂(0)lasso ) ≈ kσ √(log(p)/n).

Thus, the lasso has a bias on the order of kσ √(log(p)/n), despite
the fact that the true treatment effect is 0.
Why Lasso Regression Adjustments Don’t Work

I The lasso only looks for strong relationships between X and


the outcome Y .
I But, for estimating the ATE, it is also important to capture
  variables that are strongly related to the treatment W .
I Strong variables in the propensity model can leak confounding
effects, even if the corresponding X -Y effect is so small the
lasso ignores it.
I This was not a problem with OLS, because OLS is unbiased
(so it tries to fit every coefficient accurately, even if it’s close
to 0).
Improving the Properties of ATE Estimation in High
Dimensions: A “Double-Selection” Method
Belloni, Chernozhukov, and Hansen (2014) propose a simple fix to
this problem.
I Run a LASSO of W on X . Select variables with non-zero
coefficients at a selected λ (e.g. cross-validation).
I Run a LASSO of Y on X on both the treated and control
  samples. Select variables with non-zero coefficients at a
  selected λ (may be different from the first λ).
I Run OLS of Y on W interacted with the union of selected
variables. Conclude as in the regular OLS case.
The third step above is not as good at purely predicting Y as
using only the second set of variables. But it is more accurate for the ATE.
Result: under “approximate sparsity” of BOTH the propensity and
outcome models, and constant treatment effects, estimated ATE is
asymptotically normal and estimation is efficient.
Single v. Double Selection in BCH Algorithm
Recap: ATE estimation via the lasso
The lasso can be used to estimate the conditional response functions
µ(w ) (x) = E[Yi (w ) | Xi = x] as

µ̂(w )lasso (x) = â(w )lasso + x β̂(w )lasso .

Because we’re in high dimensions, we need to regularize, and this


leads to bias.
I The lasso is calibrated to make good predictions, but not
necessarily to make good ATE estimates.
I The simple lasso regression adjustment
  τ̂ = (1/n) Σ_{i=1}^n (µ̂(1)lasso (Xi ) − µ̂(0)lasso (Xi )) may fail badly.
The BCH method provides a fix to this problem, by
“un-regularizing” features that are important in the propensity
model.
I Does this idea generalize beyond linear models?
I What if the propensity model isn’t sparse?
The propensity score

The confounding effects of Xi can alternatively be captured via the


propensity score,

e(x) = P[Wi = 1 | Xi = x].

The key fact about the propensity score is that


 
Wi Yi (1 − Wi )Yi
τ =E − .
e(Xi ) 1 − e(Xi )

The same idea underlies importance weighting,


Horvitz-Thompson sampling, etc.
The propensity score
Inverse-propensity weighting is unbiased because:

τ = E [Yi (1) − Yi (0)]
  = E [ E[Yi (1) | Xi ] − E[Yi (0) | Xi ] ]
  = E [ E[Wi | Xi ] E[Yi (1) | Xi ] / e(Xi ) − E[1 − Wi | Xi ] E[Yi (0) | Xi ] / (1 − e(Xi )) ]
  = E [ E[Wi Yi (1) | Xi ] / e(Xi ) − E[(1 − Wi ) Yi (0) | Xi ] / (1 − e(Xi )) ]
  = E [ Wi Yi / e(Xi ) − (1 − Wi )Yi / (1 − e(Xi )) ] .

The last equality depends on consistency of the potential
outcomes (i.e., Yi = Yi (Wi )), and the second-to-last equality
relies on unconfoundedness,

{Yi (0), Yi (1)} ⊥⊥ Wi | Xi .
Inverse-propensity weighting
We know that the average treatment effect is
 
Wi Yi (1 − Wi )Yi
τ =E − .
e(Xi ) 1 − e(Xi )

A natural idea is to estimate ê(·) via some machine learning


method (e.g., an L1 -penalized logistic regression in high
dimensions), and then use
τ̂ = (1/n) Σ_{i=1}^n ( Wi Yi / ê(Xi ) − (1 − Wi )Yi / (1 − ê(Xi )) ) .

This strategy has several pitfalls, however:


I Getting properly calibrated ê(·) estimates is hard.
I Regularization bias is still a problem.
The rest of this lecture is about a strategy to overcome this issue.
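A minimal sketch of the IPW estimator, using the *true* propensity score on invented data; in practice ê(·) must be estimated, which is exactly where the pitfalls above arise.

```python
# IPW with a known, bounded propensity score (invented data-generating
# process): confounded assignment, constant treatment effect tau = 1.
import random

random.seed(2)
n = 20_000
acc = 0.0
for _ in range(n):
    x = random.random()                  # single confounder in [0, 1]
    e = 0.2 + 0.6 * x                    # true propensity, in [0.2, 0.8]
    w = random.random() < e
    y = 3.0 * x + (1.0 if w else 0.0) + random.gauss(0, 1)
    if w:
        acc += y / e                     # Wi Yi / e(Xi)
    else:
        acc -= y / (1.0 - e)             # (1 - Wi) Yi / (1 - e(Xi))
tau_hat = acc / n
```

A naive difference in means would be badly confounded here (treated units have larger x); the propensity weights remove that bias.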
Augmented Inverse-Propensity Weighting

There is a more flexible approach to using machine learning


methods for ATE estimation that relies on both outcome
regression and the propensity score
   
µ(w ) (x) = E[Yi | Xi = x, Wi = w ],    e(x) = P[Wi = 1 | Xi = x].

Suppose that we have estimates µ̂(w ) (x) from any machine learning
method, and also have propensity estimate ê(x). AIPW then uses:
τ̂ = (1/n) Σ_{i=1}^n [ µ̂(1) (Xi ) − µ̂(0) (Xi )
        + Wi (Yi − µ̂(1) (Xi )) / ê(Xi )
        − (1 − Wi ) (Yi − µ̂(0) (Xi )) / (1 − ê(Xi )) ] .

In considerable generality, this is a good estimator of the ATE.
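A sketch of AIPW's robustness on invented data: the outcome models are fit well by per-arm least squares, while the propensity "estimate" is a deliberately crude constant, yet the combined estimate stays near the truth because only one nuisance needs to be accurate. The Γ̂-style variance and interval at the end are illustrative.

```python
# AIPW with good outcome models and a crude constant propensity
# (invented data; true tau = 1, confounded assignment).
import random

random.seed(3)
n = 8000
X, W, Y = [], [], []
for _ in range(n):
    x = random.random()
    w = random.random() < 0.2 + 0.6 * x          # confounded assignment
    y = 3.0 * x + (1.0 if w else 0.0) + random.gauss(0, 1)
    X.append(x); W.append(w); Y.append(y)

def ols(idx):
    """One-covariate least squares on a subsample; returns a predictor."""
    xs = [X[i] for i in idx]; ys = [Y[i] for i in idx]
    mx = sum(xs) / len(xs); my = sum(ys) / len(ys)
    b = (sum((a - mx) * (c - my) for a, c in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
    return lambda x: my + b * (x - mx)

mu1 = ols([i for i in range(n) if W[i]])
mu0 = ols([i for i in range(n) if not W[i]])
e_hat = sum(W) / n                               # deliberately crude propensity

gamma = [
    mu1(X[i]) - mu0(X[i])
    + (W[i] / e_hat) * (Y[i] - mu1(X[i]))
    - ((1 - W[i]) / (1 - e_hat)) * (Y[i] - mu0(X[i]))
    for i in range(n)
]
tau_hat = sum(gamma) / n
var_hat = sum((g - tau_hat) ** 2 for g in gamma) / (n * (n - 1))
ci = (tau_hat - 1.96 * var_hat ** 0.5, tau_hat + 1.96 * var_hat ** 0.5)
```

Because the residuals Yi − µ̂(Wi)(Xi) have (approximately) mean zero given Xi and Wi, the weighting errors from the bad ê wash out to first order.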


Augmented Inverse-Propensity Weighting

To interpret AIPW, it is helpful to write it as

τ̂AIPW = D + R,

D = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) ),

R = (1/n) Σ_{i=1}^n ( Wi (Yi − µ̂(1) (Xi )) / ê(Xi ) − (1 − Wi ) (Yi − µ̂(0) (Xi )) / (1 − ê(Xi )) ) .

D is the direct regression adjustment estimator using µ̂(w ) (x),


and R is an IPW estimator applied to the residuals Yi − µ̂(Wi ) (Xi ).
Qualitatively, AIPW uses propensity weighting on the residuals to
debias the direct estimate.
A Simple Example
Consider an example with Xi ∼ N (0, I ), n = 1, 000 and p = 20:

e(x) = 1/(1 + e −x1 ), µ(0) (x) = (x1 + x2 )+ , µ(1) (x) = (x1 + x3 )+ .

Here, we need to model µ(w ) (x) non-parametrically, but there is
not quite enough data to nail the functional form exactly.

τ̂AIPW = D + R,

D = (1/n) Σ_{i=1}^n ( µ̂(1) (Xi ) − µ̂(0) (Xi ) ),

R = (1/n) Σ_{i=1}^n ( Wi (Yi − µ̂(1) (Xi )) / ê(Xi ) − (1 − Wi ) (Yi − µ̂(0) (Xi )) / (1 − ê(Xi )) ) .

Here, the direct regression adjustment D is on average 0.18, the


correction R is on average -0.12, and AIPW gives us on average
0.06, which is closer to the correct answer τ = 0.
Understanding Augmented Inverse-Propensity Weighting
To understand why AIPW works, we can compare it to an oracle
that gets to use the true values of µ(w ) (x) and e(x):
τ̃AIPW = (1/n) Σ_{i=1}^n [ µ(1) (Xi ) − µ(0) (Xi )
        + Wi (Yi − µ(1) (Xi )) / e(Xi )
        − (1 − Wi ) (Yi − µ(0) (Xi )) / (1 − e(Xi )) ] .

“Theorem.” If the first-stage function estimates satisfy

E[(ê(X ) − e(X ))²]^(1/2) · E[(µ̂(w ) (X ) − µ(w ) (X ))²]^(1/2) = oP (1/√n),

and we also have overlap, then τ̂AIPW and τ̃AIPW satisfy

√n (τ̂AIPW − τ̃AIPW ) →p 0.

In other words, τ̂AIPW and τ̃AIPW are first-order equivalent.


Understanding Augmented Inverse-Propensity Weighting
The upshot of this result is that we can study τ̃AIPW instead of
τ̂AIPW . Because τ̃AIPW is just an average of independent terms, a
direct application of the central limit theorem implies that

√n (τ̃AIPW − τ ) ⇒ N (0, V ∗ ) ,

V ∗ = Var[ µ(1) (X ) − µ(0) (X ) ] + E[ Var[Yi (1) | Xi ] / e(Xi ) ]
      + E[ Var[Yi (0) | Xi ] / (1 − e(Xi )) ] .

Because τ̂AIPW and τ̃AIPW are equivalent on the √n-scale, we then
immediately get, whenever the result from the previous slide holds,

√n (τ̂AIPW − τ ) ⇒ N (0, V ∗ ) .

Moreover, it can be shown that this behavior is optimal for any


ATE estimator, assuming a generic non-parametric setup.
Inference with Augmented Inverse-Propensity Weighting
We are considering τ̂AIPW = (1/n) Σ_{i=1}^n Γ̂i , with

Γ̂i = µ̂(1) (Xi ) − µ̂(0) (Xi ) + Wi (Yi − µ̂(1) (Xi )) / ê(Xi )
      − (1 − Wi ) (Yi − µ̂(0) (Xi )) / (1 − ê(Xi )) .
If the first-stage estimates are reasonably accurate, this estimator
is to first order not affected by errors in ê(·) and µ̂(w ) (·).
As a consequence, for inference, we can act as though τ̂ were the
average of independent terms Γ̂i , with variance

Var[τ̂AIPW ] ≈ V̂n := (1/(n(n − 1))) Σ_{i=1}^n (Γ̂i − τ̂AIPW )² .

We can use this to build Gaussian confidence intervals:


P[ τ ∈ { τ̂AIPW ± z1−α/2 V̂n^(1/2) } ] → 1 − α.
Details #1: The assumptions
Our result relies on the assumption that

E[(ê(X ) − e(X ))²]^(1/2) · E[(µ̂(w ) (X ) − µ(w ) (X ))²]^(1/2) = oP (1/√n).
One simple way to achieve this condition is if both µ̂ and ê are
o(1/n^(1/4))-consistent in root-mean squared error.
I In other words, our final estimate τ̂AIPW can be an order of
  magnitude more accurate than either nuisance component
  (1/n^(1/2) vs 1/n^(1/4)).
I The reason for this phenomenon is that, to first order, the
errors in the µ̂ and ê regressions cancel out.
I This is known as the orthogonal moments construction, and
plays a key role in semiparametric statistics.
Of course, o(1/n^(1/4))-consistency is still a strong assumption, and
may not always hold. When we can get o(1/n^(1/4))-consistency, and
when we can improve on the assumptions stated above, is the topic
of a large literature.
Details #2: Cross-fitting
To get good behavior out of AIPW, we recommend cross-fitting
τ̂ = (1/n) Σ_{i=1}^n [ µ̂(1)^(−i) (Xi ) − µ̂(0)^(−i) (Xi )
        + Wi (Yi − µ̂(1)^(−i) (Xi )) / ê^(−i) (Xi )
        − (1 − Wi ) (Yi − µ̂(0)^(−i) (Xi )) / (1 − ê^(−i) (Xi )) ] .

In other words, when estimating e(Xi ) and µ(w ) (Xi ), use models
that did not have access to the i-th example during training.
I A simple approach is to cut the data into K folds. Then, for
each k = 1, ..., K , train a model on all but the k-th fold, and
evaluate its predictions on the k-th fold.
I With forests, leave-one-out estimation is natural, i.e.,
ê (−i) (Xi ) is trained on all but the i-th sample.
Chernozhukov et al. (2017) emphasize the role of cross-fitting in
proving flexible efficiency results for AIPW.
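The K-fold scheme above can be sketched in a few lines; here the "model" is just a fold-free mean of W, a stand-in for whatever fit/predict method one actually uses (data and fold count are invented).

```python
# Cross-fitting skeleton: each unit's nuisance prediction comes from a
# model trained on the folds that exclude that unit.
import random

random.seed(4)
n, K = 20, 5
data = [(random.random(), random.random() < 0.5) for _ in range(n)]  # (x, w)

folds = [i % K for i in range(n)]        # fold assignment per index
e_hat = [None] * n
for k in range(K):
    train = [data[i] for i in range(n) if folds[i] != k]
    # Placeholder "model": mean of W on the training folds. In practice
    # this is any ML fit (forest, lasso, ...), fit on `train` only.
    p = sum(w for _, w in train) / len(train)
    for i in range(n):
        if folds[i] == k:
            e_hat[i] = p                 # out-of-fold prediction for unit i
```

With forests, the out-of-bag predictions shown in the grf snippet below play the same role as these out-of-fold values.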
Details #2: Cross-fitting

Example from tutorial, with n = 9750 and p = 21.


library(grf)
propensity_fit = regression_forest(Xmod, Wmod)

# If you ask a forest to predict without giving it a test


# set, it automatically does OOB on the training set.

ehat_oob = predict(propensity_fit)$predictions
ehat_naive = predict(propensity_fit,
newdata = Xmod)$predictions

c(OOB=mean(Wmod / ehat_oob), NAIVE=mean(Wmod / ehat_naive))

OOB NAIVE
1.0173325 0.9093602
Details #2: Cross-fitting
Calibration plots run a single non-parametric regression of Wi
against ê(Xi ), and are a good way to assess quality of a propensity
fit. Ideally, the calibration curve should be close to the diagonal.

[Calibration plots: fraction treated vs. Estimated Propensity,
without crossfitting (left) and with crossfitting (right).]


Details #3: Overlap

Overlap means that propensity scores are bounded away from 0


and 1:

η ≤ P[Wi = 1 | Xi = x] ≤ 1 − η,    η > 0,

for all possible values of x. The proof assumes overlap, and even
the limiting variance degrades as overlap degrades:
" #
Var [Yi (1)] Xi
V ∗ = Var µ(1) (X ) − µ(0) (X ) + E
 
e(Xi )
" #
Var [Yi (0)] Xi
+E .
1 − e(Xi )

In applications, it is important to check overlap.


The role of overlap
Note that we need e(x) ∈ (0, 1) to be able to calculate treatment
effects for all x.
I Intuitively, how could you possibly infer E[Y (0) | Xi = x] if
  e(x) = 1?
I Note that for discrete x, the variance of the ATE estimate is
  infinite when e(x) = 0 or e(x) = 1.
I “Moving the goalposts”: Crump, Hotz, Imbens, Miller (2009)
analyze trimming, which entails dropping observations where
e(x) is too extreme. Typical approaches entail dropping
bottom and top 5% or 10%.
I Approaches that don’t directly require propensity score
  weighting may seem to avoid the need for this, but it is
  important to understand the role of extrapolation.
I If we subset the data, we need to be mindful of what the
  estimand is.
Propensity Score Plots: Assessing Overlap

The causal inference literature has developed a variety of


conventions, broadly referred to as “supplementary analysis,” for
assessing credibility of empirical studies. One of the most prevalent
conventions is to plot the propensity scores of treated and control
groups to assess overlap.
I Idea: for each q ∈ (0, 1), plot the fraction of observations in
  the treatment group with e(x) ≈ q, and likewise for the
  control group.
I Even if there is overlap, when there are large imbalances, this
is a sign that it may be difficult to get an accurate estimate of
the treatment effect.
Propensity Score Plots: Assessing Overlap

Example: Athey, Levin and Seira analysis of timber auctions.


I The paper studies consequences of awarding contracts to
harvest timber via first price sealed auction or open ascending
auction.
I Assignment to first price sealed auction or open ascending
auction:
I In Idaho, auction mechanism is randomized for subset of tracts
with different probabilities in different geographies;
I In California, auction mechanism is determined by small v.
large sales (with cutoffs varying by geography).
I So W = 1 if auction is sealed, and X represents geography,
size and year.
Propensity Score Plots: Assessing Overlap in ID
Very few observations with extreme propensity scores
Propensity Score Plots: Assessing Overlap in CA

Untrimmed v. trimmed so that e(x) ∈ [.025, .975]


Estimation Strategies with Poor Overlap

Lalonde (1986) collected a classical dataset for program


evaluation. For now, we consider the following sample:
I A sample of n1 = 297 treated people, randomly selected
among participants in the National Supported Work
Demonstration.
I A sample of n0 = 2490 control people, collected via the
  Panel Study of Income Dynamics (PSID).
For some values of x, we essentially know the person is a control
and so e(x) ≈ 0. (For example, if the person was already employed
at the start of the study.)
Overlap in the Lalonde data
[Histogram: count vs. e.hat, colored by status (control, treatment).]

Overlap on the Lalonde dataset, with full set of PSID controls.


Many of the controls have essentially 0 propensity, but there is no
overlap problem near 1.
Overlap in the Lalonde data
[Histogram: count vs. e.hat, colored by status (control, treatment).]

We can find propensity-matches for the treated units, but not for
all the controls. A simple way to trim away the overlap problem is
to estimate an average treatment effect on the treated. Here,
this may also be better conceptually justified.
Estimands for Causal Inference
The average treatment effect on the treated (ATT),
E[Yi (1) − Yi (0) | Wi = 1], often has a simple interpretation.

[Figure: matching treated units to controls. Image credit: J. Zubizarreta.]


Average treatment effect on the treated
Recall that the average treatment effect on the treated is
 
τATT = E[Yi (1) − Yi (0) | Wi = 1].

As usual, we can estimate it via several strategies. The direct


regression adjustment estimator fits a model µ̂(0) (x) to the
controls, and then uses it to impute what would have happened to
the treated units (on average) in the control condition

τ̂ATT = (1/n1 ) Σ_{i:Wi =1} (Yi − µ̂(0) (Xi )) .

The propensity-weighted estimator uses


τ̂ATT = (1/n1 ) Σ_{i:Wi =1} Yi
       − Σ_{i:Wi =0} [ê(Xi ) / (1 − ê(Xi ))] Yi  /  Σ_{i:Wi =0} [ê(Xi ) / (1 − ê(Xi ))] .

We never divide by ê(x), so propensities near 0 aren’t a problem.
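A sketch of the propensity-weighted ATT estimator with the true propensity on invented data; note the weights e/(1 − e) vanish, rather than explode, for controls with e(x) near 0.

```python
# Propensity-weighted ATT (invented data): constant effect tau = 1 for
# everyone, so the ATT is also 1; some propensities sit near 0.
import random

random.seed(5)
n = 20_000
treated_y, ctrl_y, ctrl_w = [], [], []
for _ in range(n):
    x = random.random()
    e = 0.05 + 0.5 * x                    # propensities down to 0.05
    w = random.random() < e
    y = 2.0 * x + (1.0 if w else 0.0) + random.gauss(0, 1)
    if w:
        treated_y.append(y)
    else:
        ctrl_y.append(y)
        ctrl_w.append(e / (1.0 - e))      # weight reweights controls to
                                          # the treated covariate distribution

att_hat = (sum(treated_y) / len(treated_y)
           - sum(wt * y for wt, y in zip(ctrl_w, ctrl_y)) / sum(ctrl_w))
```

Controls that are "obviously controls" (e ≈ 0) simply contribute weight ≈ 0, which is why the ATT is estimable even when the full ATE is not.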


Average treatment effect on the treated
The augmented propensity-weighted estimator combines both
τ̂ATT = (1/n1 ) Σ_{i:Wi =1} (Yi − µ̂(0) (Xi ))
       − Σ_{i:Wi =0} [ê(Xi ) / (1 − ê(Xi ))] (Yi − µ̂(0) (Xi ))  /  Σ_{i:Wi =0} [ê(Xi ) / (1 − ê(Xi ))] .

Again, this estimator is asymptotically optimal via an orthogonal
moments argument. Also, we can write τ̂ATT = (1/n) Σ_{i=1}^n Γ̂i , with

Γ̂i / n = Wi (Yi − µ̂(0) (Xi )) / n1
         − (1 − Wi ) [ê(Xi ) / (1 − ê(Xi ))] (Yi − µ̂(0) (Xi ))  /  Σ_{j:Wj =0} [ê(Xj ) / (1 − ê(Xj ))] ,

and, by the same argument as before, we can estimate the variance via
V̂n = Σ_{i=1}^n (Γ̂i − τ̂ATT )² / (n(n − 1)) to build confidence intervals.
Overlap in the Lalonde data
library(grf) # for random forests
treat_data = read.table("...../nswre74_treated.txt")
control_data = read.table("...../psid_controls.txt")
combined = rbind(treat_data, control_data)
X = combined[,2:9]; Y = combined[,10]; W = combined[,1]
cf = causal_forest(X, Y, W)

ate.hat = average_treatment_effect(cf,
target.sample = "all")
print(paste0("95% CI: ", round(ate.hat["estimate"]),
" +/- ", round(1.96 * ate.hat["std.err"])))
Warning: Estimated treatment propensities go as low as 0.003...
[1] "95% CI: -3039 +/- 6388"

att.hat = average_treatment_effect(cf,
target.sample = "treated")
print(paste0("95% CI: ", round(att.hat["estimate"]),
" +/- ", round(1.96 * att.hat["std.err"])))
[1] "95% CI: 1142 +/- 1510"
Addressing failures in overlap

If there are some observations with propensities very near 0 and


some very near 1, we need more aggressive methods:
I One idea is to fit a model for ê(x), throw away all
observations with ê(Xi ) ≤ 0.1 or ê(Xi ) ≥ 0.9, and estimate an
ATE on the rest.
I Another idea is to weight observations by ê(Xi )(1 − ê(Xi )),
  so that observations with extreme propensities are discounted
  strongly enough not to inflate the variance.
In both cases, interpretation requires care.
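The trimming idea can be sketched mechanically: drop units with ê outside [0.1, 0.9] and keep track of which subpopulation remains, since the estimand now refers only to the trimmed sample. Data and thresholds below are illustrative, not a recommendation.

```python
# Hypothetical propensity trimming: keep only units with moderate
# estimated propensities before running any ATE procedure on them.
import random

random.seed(6)
n = 5000
rows = []
for _ in range(n):
    x = random.random()
    e = 0.02 + 0.96 * x                  # propensities spanning (0.02, 0.98)
    w = random.random() < e
    y = x + (1.0 if w else 0.0) + random.gauss(0, 1)
    rows.append((e, w, y))

kept = [(e, w, y) for e, w, y in rows if 0.1 <= e <= 0.9]
share_kept = len(kept) / n
# Caution: any ATE computed on `kept` (with a proper adjustment method)
# is an ATE for the trimmed subpopulation, not the full population.
```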
Recap

Treatment effects are important in many scientific analyses.


Once we have identified treatment effects via unconfoundedness,
we can estimate them by combining flexible machine learning
methods with augmented IPW.
I Formally, AIPW yields semiparametrically efficient
estimates of the treatment effect, provided the inputs from
machine learning methods are accurate enough.
I In practice, AIPW makes our procedure robust to
regularization bias.
I AIPW allows for simple confidence intervals that do not
  depend on which specific machine learning method we used.
AIPW lets machine learning focus on what it’s good at (i.e.,
accurate predictions), and then uses its outputs for efficient
treatment effect estimation.
Recap
The AIPW estimator can be written as τ̂ = (1/n) Σ_{i=1}^n Γ̂i ,

Γ̂i = µ̂(1) (Xi ) − µ̂(0) (Xi ) + Wi (Yi − µ̂(1) (Xi )) / ê(Xi )
      − (1 − Wi ) (Yi − µ̂(0) (Xi )) / (1 − ê(Xi )) .

We can form level-α confidence intervals Iα as follows (recall
that the validity of these confidence intervals is not trivial, and
relies on an orthogonal moments argument):

Iα = ( τ̂ ± z1−α/2 V̂ ^(1/2) ),    V̂ = (1/(n(n − 1))) Σ_{i=1}^n (Γ̂i − τ̂ )² .

The grf package has a forest-based implementation of AIPW:


cf = causal_forest(X, Y, W)
ate.hat = average_treatment_effect(cf)
