Lecture 2: Average Treatment Effects
Stefan Wager
Stanford University
12 April 2018
A central goal of machine learning is to understand what usually
happens in a given situation, e.g.,
I Given today’s weather, what’s the chance tomorrow’s air
pollution levels will be dangerously high?
Most economists want to predict what would happen if we changed
the system, e.g.,
I How does the answer to the above question change if we
reduce the number of cars on the road?
This class is about the interface of causal inference and machine
learning, with both terms understood broadly:
I Our discussion of causal inference draws from a long
tradition in economics and epidemiology on which questions
about counterfactuals can be answered using a given type of
data, and on how the resulting estimands can be interpreted.
I We use the term machine learning to describe an
engineering-heavy approach to data analysis. Given a
well-defined task on which good performance can be
empirically validated, we do not shy away from
computationally heavy tools or potentially heuristic
approaches (e.g., decision trees, neural networks,
non-convex optimization).
Today’s lecture is about average treatment effects:
I The potential outcomes model for causal inference in
randomized experiments.
I Observational studies and the propensity score.
I Double robustness, or how to use machine learning for
principled treatment effect estimation.
The potential outcomes framework
[Figure: simulated potential outcomes; scatter plot with Y(1) on the
vertical axis, values ranging from roughly 50 to 200.]
Yi(0)     Yi(1)     τi
84.00     75.59     -8.41
73.32     65.68     -7.63
100.07    93.80     -6.28
103.81    82.30     -21.51
···       ···       ···
111.68    101.47    -10.21

[Figure: histogram of the individual treatment effects τi
(Frequency axis, 0 to 400).]
Yi(0)     Yi(1)     τi
154.68    —         —
135.67    —         —
—         117.68    —
—         95.08     —
—         146.73    —
117.89    —         —
—         75.59     —
—         65.68     —
100.07    —         —
—         82.30     —
···       ···       ···
110.59    100.52    —

I In practice, we only ever observe a single potential outcome.
I However, in an RCT, we can use averages over the treated and
controls to estimate the ATE.
I We estimate τ̂ as 110.59 − 100.52 = 10.07.
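For concreteness, this difference-in-means estimate is one line of R.
A minimal sketch, assuming Y and W are vectors holding the observed
outcomes and the 0/1 treatment assignments:

# Difference-in-means ATE estimate in a randomized trial.
tau.hat = mean(Y[W == 1]) - mean(Y[W == 0])
# Standard error for a two-sample comparison of means.
se.hat = sqrt(var(Y[W == 1]) / sum(W) + var(Y[W == 0]) / sum(1 - W))
c(estimate = tau.hat, std.err = se.hat)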
ATE estimation in randomized trials
In the potential outcomes model, an oracle who knew the µ(w)(x)
could use

τ̄ = (1/n) Σ_{i=1}^n [ µ(1)(Xi) − µ(0)(Xi) ].
The feasible plug-in analogue, with estimates µ̂(w)(x) substituted
for µ(w)(x), is good if µ̂(w)(x) is obtained via OLS. But it breaks
down if we use regularization.
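For reference, here is a minimal sketch of the OLS adjustment in R
(the names X, Y, W are assumptions, with X a covariate data.frame):

# OLS regression adjustment: fit separate linear models in each arm,
# then average the predicted contrasts over the whole sample.
fit1 = lm(Y ~ ., data = X, subset = (W == 1))
fit0 = lm(Y ~ ., data = X, subset = (W == 0))
mu1.hat = predict(fit1, newdata = X)
mu0.hat = predict(fit0, newdata = X)
tau.hat = mean(mu1.hat - mu0.hat)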
Example. Suppose that p ≫ n, but the true model is sparse,

E[ Y | X = x, W = w ] = 2 x1 + 0.1 w x2.
Distribution of estimates; mean-squared errors for τ̂:

I basic: 0.105.
I bias-corrected: 0.092.
I plug-in: 0.210.

[Figure: histograms of the three estimators (basic, corrected,
plugin) across simulation replications, roughly over the range
−0.2 to 0.2.]
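A small simulation in this spirit is easy to set up. The sketch below
uses illustrative parameter choices (not those behind the figure
above), with the bias-corrected estimator implemented as the IPW
residual correction at the known propensity e(x) = 0.5:

# Sparse linear model with p >> n; true ATE is E[0.1 * X[,2]] = 0.
library(glmnet)
set.seed(1)
n = 200; p = 1000
X = matrix(rnorm(n * p), n, p)
W = rbinom(n, 1, 0.5)
Y = 2 * X[, 1] + 0.1 * W * X[, 2] + rnorm(n)
# Lasso fits by treatment arm, with cross-validated penalty.
fit1 = cv.glmnet(X[W == 1, ], Y[W == 1])
fit0 = cv.glmnet(X[W == 0, ], Y[W == 0])
mu1.hat = predict(fit1, newx = X)
mu0.hat = predict(fit0, newx = X)
tau.basic = mean(Y[W == 1]) - mean(Y[W == 0])  # difference in means
tau.plugin = mean(mu1.hat - mu0.hat)           # plug-in lasso adjustment
tau.corrected = tau.plugin +                   # IPW residual correction, e = 0.5
  mean(W * (Y - mu1.hat) / 0.5 - (1 - W) * (Y - mu0.hat) / 0.5)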
Why lasso regression adjustments don't work

Suppose the potential outcomes follow a sparse linear model with
coefficients

β(0)_j = β(1)_j = 1({j ≤ k}) σ √(log(p)/n),

and so

τ = E[X] · (β(1) − β(0)) = 0.

The lasso fits µ̂(w)(x) = â(w) + x β̂(w), with intercept â(w).

Beyond randomized trials
Suppose that we have estimates µ̂(w)(x) from any machine learning
method, and also a propensity estimate ê(x). AIPW then uses:

τ̂ = (1/n) Σ_{i=1}^n [ µ̂(1)(Xi) − µ̂(0)(Xi)
        + Wi/ê(Xi) · (Yi − µ̂(1)(Xi))
        − (1 − Wi)/(1 − ê(Xi)) · (Yi − µ̂(0)(Xi)) ].
τ̂AIPW = D + R, where

D = (1/n) Σ_{i=1}^n [ µ̂(1)(Xi) − µ̂(0)(Xi) ],

R = (1/n) Σ_{i=1}^n [ Wi/ê(Xi) · (Yi − µ̂(1)(Xi))
        − (1 − Wi)/(1 − ê(Xi)) · (Yi − µ̂(0)(Xi)) ].
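In code, the AIPW estimate and a plug-in standard error are immediate
given the nuisance estimates. A sketch, assuming mu1.hat, mu0.hat,
and ehat are (ideally cross-fitted) vectors of predictions:

# AIPW scores, point estimate, and standard error.
Gamma.hat = mu1.hat - mu0.hat +
  W / ehat * (Y - mu1.hat) -
  (1 - W) / (1 - ehat) * (Y - mu0.hat)
tau.hat = mean(Gamma.hat)
se.hat = sqrt(var(Gamma.hat) / length(Gamma.hat))
c(estimate = tau.hat, std.err = se.hat)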
Details #2: Cross-fitting

When estimating e(Xi), use a model that did not have access to the
i-th training example during training.
I A simple approach is to cut the data into K folds. Then, for
each k = 1, ..., K, train a model on all but the k-th fold, and
evaluate its predictions on the k-th fold (sketched below).
I With forests, leave-one-out estimation is natural, i.e.,
ê(−i)(Xi) is trained on all but the i-th sample.
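A minimal sketch of the fold-based approach, using logistic
regression as a stand-in for whichever learner one prefers (X is
assumed to be a covariate data.frame, W a 0/1 vector):

# K-fold cross-fitting for the propensity score: each ehat[i] comes
# from a model that never saw observation i.
K = 10
n = nrow(X)
fold = sample(rep(1:K, length.out = n))  # random fold assignment
ehat = rep(NA, n)
df = data.frame(W = W, X)
for (k in 1:K) {
  fit = glm(W ~ ., data = df[fold != k, ], family = binomial)
  ehat[fold == k] = predict(fit, newdata = df[fold == k, ],
                            type = "response")
}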
Chernozhukov et al. (2017) emphasize the role of cross-fitting in
proving flexible efficiency results for AIPW.
# propensity_fit is, e.g., a grf regression forest of W on Xmod.
ehat_oob = predict(propensity_fit)$predictions    # out-of-bag predictions
ehat_naive = predict(propensity_fit,
                     newdata = Xmod)$predictions  # in-sample (naive) predictions

      OOB     NAIVE
1.0173325 0.9093602
Calibration plots run a single non-parametric regression of Wi
against ê(Xi), and are a good way to assess the quality of a
propensity fit. Ideally, the calibration curve should be close to
the diagonal.
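One way to draw such a plot in base R, assuming W and a vector of
propensity estimates ehat:

# Calibration check: non-parametric regression of W on ehat,
# compared against the 45-degree line.
cal = loess(W ~ ehat)
ord = order(ehat)
plot(ehat[ord], fitted(cal)[ord], type = "l",
     xlab = "Estimated Propensity", ylab = "Average Treatment")
abline(0, 1, lty = 2)  # ideal calibration: the diagonal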
[Figure: calibration plots (two panels) of Average Treatment against
Estimated Propensity, over the range 0.1 to 0.7.]
[Figure: histograms of estimated propensities by status (control vs.
treatment); counts up to roughly 900.]
We can find propensity matches for the treated units, but not for
all the controls. A simple way around the overlap problem is to
estimate an average treatment effect on the treated instead. Here,
this may also be better conceptually justified.
Estimands for Causal Inference

The average treatment effect on the treated (ATT),

τATT = E[ Yi(1) − Yi(0) | Wi = 1 ],

often has a simple interpretation. A natural estimate contrasts the
treated outcomes with control-based predictions:

τ̂ATT = (1/n1) Σ_{Wi=1} ( Yi − µ̂(0)(Xi) ).

The corresponding doubly robust scores are

Γ̂i = Wi (Yi − µ̂(0)(Xi)) / (n1/n)
    − (1 − Wi) [ ê(Xi)/(1 − ê(Xi)) ] (Yi − µ̂(0)(Xi))
        / [ (1/n) Σ_{Wi=0} ê(Xi)/(1 − ê(Xi)) ],

so that τ̂ATT = (1/n) Σ_{i=1}^n Γ̂i, and, by the same argument as
before, we can estimate the variance via

V̂n = Σ_{i=1}^n ( Γ̂i − τ̂ATT )² / (n(n − 1))

to build confidence intervals.
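A sketch of the corresponding computation, assuming cross-fitted
vectors mu0.hat and ehat along with Y and W as before:

# Doubly robust ATT scores, estimate, and standard error.
n = length(Y); n1 = sum(W)
odds = ehat / (1 - ehat)
Gamma.hat = W * (Y - mu0.hat) / (n1 / n) -
  (1 - W) * odds * (Y - mu0.hat) / (sum(odds[W == 0]) / n)
tau.att = mean(Gamma.hat)
V.hat = sum((Gamma.hat - tau.att)^2) / (n * (n - 1))
c(estimate = tau.att, std.err = sqrt(V.hat))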
Overlap in the Lalonde data
library(grf)  # for random forests
treat_data = read.table("...../nswre74_treated.txt")
control_data = read.table("...../psid_controls.txt")
combined = rbind(treat_data, control_data)
X = combined[, 2:9]; Y = combined[, 10]; W = combined[, 1]
cf = causal_forest(X, Y, W)
ate.hat = average_treatment_effect(cf, target.sample = "all")
print(paste0("95% CI: ", round(ate.hat["estimate"]),
             " +/- ", round(1.96 * ate.hat["std.err"])))
Warning: Estimated treatment propensities go as low as 0.003...
[1] "95% CI: -3039 +/- 6388"
att.hat = average_treatment_effect(cf, target.sample = "treated")
print(paste0("95% CI: ", round(att.hat["estimate"]),
             " +/- ", round(1.96 * att.hat["std.err"])))
[1] "95% CI: 1142 +/- 1510"
Addressing failures in overlap
Γ̂i = µ̂(1)(Xi) − µ̂(0)(Xi)
     + Wi/ê(Xi) · (Yi − µ̂(1)(Xi))
     − (1 − Wi)/(1 − ê(Xi)) · (Yi − µ̂(0)(Xi)).
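When ê(Xi) approaches 0 or 1, the weights in these scores explode.
One common pragmatic response is to inspect the propensity
distribution and trim extreme observations; a sketch, where the 0.05
cutoff is an illustrative choice rather than a recommendation:

summary(ehat)                     # inspect the propensity distribution
keep = ehat > 0.05 & ehat < 0.95  # trim observations with extreme propensities
mean(keep)                        # fraction of the sample retained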