
Demystifying and avoiding the OLS “weighting problem”:

Unmodeled heterogeneity and straightforward solutions


Chad Hazlett¹,² and Tanvi Shinkre¹
¹ Department of Statistics and Data Science, UCLA
² Department of Political Science, UCLA
arXiv:2403.03299v2 [stat.ME] 3 Oct 2024

October 7, 2024

Abstract
Researchers have long run regressions of an outcome variable (Y ) on a treatment (D) and covariates
(X) to estimate treatment effects. Even absent unobserved confounding, the regression coefficient on D
in this setup reports a conditional variance weighted average of strata-wise average effects, not generally
equal to the average treatment effect (ATE). Numerous proposals have been offered to cope with this
“weighting problem”, including interpretational tools to help characterize the weights and diagnostic aids
to help researchers assess the potential severity of this problem. We make two contributions that together
suggest an alternative direction for researchers and this literature. Our first contribution is conceptual,
demystifying these weights. Simply put, under heterogeneous treatment effects (and varying probability
of treatment), the linear regression of Y on D and X will be misspecified. The “weights” of regression
offer one characterization for the coefficient from regression that helps to clarify how it will depart from
the ATE. We also derive a more general expression for the weights than what is usually referenced.
Our second contribution is practical: as these weights simply characterize misspecification bias, we
suggest simply avoiding them through approaches that tolerate heterogeneous effects. A wide range
of longstanding alternatives (regression-imputation/g-computation, interacted regression, and balancing
weights) relax specification assumptions to allow heterogeneous effects. We make explicit the assumption
of “separate linearity”, under which each potential outcome is separately linear in X. This relaxation
of conventional linearity offers a common justification for all of these methods and avoids the weighting
problem, at an efficiency cost that will be small when there are few covariates relative to sample size.

1 Introduction
When estimating the effect of a treatment on an outcome of interest by adjusting for covariates (X),
researchers typically hope to interpret their result as a well-defined causal quantity, such as the average of
strata-wise effects, i.e. the average treatment effect (ATE). For a binary treatment (D) and outcome (Y ),
a simple bivariate regression of Y on D gives the difference in means estimate (the mean of Y when D = 1
minus the mean of Y when D = 0). However, this equivalence breaks down when covariates X are included in
the regression (as in Y ∼ D + X), if the treatment effect may vary across values of X. Instead, the coefficient
produced by this regression represents a weighted average of strata-specific difference in means estimates,
where the weights are larger for strata with probability of treatment closer to 50% (Angrist, 1998; Angrist
and Pischke, 2009). Such a weighted average is not typically of direct interest, and under severe enough
heterogeneity in treatment effects, this can lead to wildly incorrect substantive conclusions, as demonstrated
below. Despite the introduction of many more flexible estimation procedures, regression adjustment remains
the “workhorse approach” when adjusting for covariates to make causal claims (Aronow and Samii, 2016).1
Accordingly, authors have proposed diagnostics that index the potential for bias due to these weights or
tools to aid interpretation given these weights (Aronow and Samii, 2016; Sloczyński, 2022; Chattopadhyay
et al., 2023; Chattopadhyay and Zubizarreta, 2023), and have analyzed the behavior of proposed regression
estimators in different settings in light of these weights (e.g. Chernozhukov et al., 2013).

¹ Keele et al. (2010) found that regression adjustment was used in analyzing 95% of experiments in American Political Science
Review, 95% in Journal of Politics, and 74% in American Journal of Political Science. The ubiquity of regression adjustment
Beginning from this state of affairs, we make two contributions. The first is a conceptual contribution that
aims to explain and demystify these weights. Under heterogeneous treatment effects (and varying probability
of treatment), the linear regression of Y on D and X will be misspecified. The “weights” of regression are
not a mysterious property of regression, but rather offer one characterization of what the coefficient is equal
to. The weights thus simply allow us to see how, under heterogeneous effects and varying probability of
treatment, the coefficient diverges from the ATE. Technically, we offer a direct derivation of these weights by
relying on the Frisch-Waugh-Lovell (FWL) theorem (Frisch and Waugh, 1933; Lovell, 1963). Our expression
does not rely on the usual assumption (as in Angrist (1998) and later sources) that D is linear in X. However,
it reduces to that of Angrist (1998) in the special case where D is linear in X, and further shows that the
weights cannot be known precisely in empirical practice absent this assumption.
Our second contribution clarifies an assumption that justifies a range of approaches and leads to our
recommendation to practitioners. Since the weights simply characterize misspecification bias due to heterogeneity,
we propose the straightforward option of addressing that misspecification and avoiding the weighting problem
altogether, rather than cope with it through various proposed interpretational and diagnostic tools. We
consider here the minimal change to practice that handles this concern. Specifically, for investigators satisfied
to seek an approximation to effects rooted in linearity, we describe the separate linearity assumption in
which each potential outcome (Y (D = 1) and Y (D = 0)) is linear in X. This stands in contrast to,
and relaxes slightly, the standard assumption of “single linearity”, by which Y is presumed to be linear
in D and X. Estimation within the separate linearity framework does not require any innovation, as
at least three well-known estimation approaches can be justified under separate linearity: (i) regression
imputation/g-computation/T-learner (Robins, 1986; Künzel et al., 2019), also recently referred to as
“multi-regression” in Chattopadhyay and Zubizarreta (2023, 2024), who note the roots of this approach in
Peters (1941) and Belson (1956); (ii) regression of Y on D and X and their interaction (Lin, 2013), and (iii)
balancing/calibration weights that achieve mean balance on X in the treated and control groups (e.g.
Hainmueller, 2012). Approaches (i) and (ii) are, as we show, identical in the context of OLS models without
regularization. Mean balancing weights (iii), when constructed to target the ATE, can be justified by
the same assumption of separate linearity, but use a different estimation approach and produce different
point estimates. All three of these strategies produce unbiased ATE estimates under separate linearity,
without the “weighting problem” suffered under the single linearity regression. These benefits do come at an
efficiency cost. The researcher comparing these approaches to a single OLS model thus faces an efficiency-
bias, or efficiency-interpretation tradeoff. However, as we show, the efficiency cost of avoiding the “weighting
problem” is low in settings where there are few covariates relative to sample size. These considerations
apply to observational research in which an assumption of no unobserved confounding is required for causal
claims. However, they can also apply even to the analysis of randomized experiments, concurring with Lin
(2013). In particular, we discuss these implications and consequent recommendations for effect estimates
from block-randomized experiments in which treatment probability may vary by block.

2 Background: OLS weights

2.1 Setting and notation


We consider settings where we are interested in estimating the average effect of binary treatment D on
outcome Y , while accounting for confounder X. The average treatment effect is defined as

τ_ATE = E[Y_i(1) − Y_i(0)]    (1)


approaches is echoed by other authors as well (Angrist and Pischke, 2009; Humphreys, 2009; Chattopadhyay and Zubizarreta,
2023). This is perhaps due to familiarity, ease of use, the efficiency of OLS and its suitably good performance in many contexts
(Hoffmann, 2023; Green and Aronow, 2011; Kang and Schafer, 2007), and well-established uncertainty estimation considerations
and tools.

where Yi (1) and Yi (0) denote the potential outcomes under treatment and control, respectively (Neyman,
1923; Rubin, 1974). In order to estimate this average treatment effect using observed data, we require an
absence of unobserved confounders, concisely expressed as the conditional ignorability assumption,

D_i ⊥⊥ {Y_i(1), Y_i(0)} | X_i    (2)

For some discrete variables X that satisfy conditional ignorability, and given consistency of the potential
outcomes, the ATE can be identified according to:
τ_ATE = Σ_{x∈X} E[Y(1) − Y(0) | X = x] P(X = x)    (3)
      = Σ_{x∈X} (E[Y | D = 1, X = x] − E[Y | D = 0, X = x]) P(X = x)
      = Σ_x DIM_x P(X = x)    (4)

where we suppress the i subscripts for legibility and DIM_x is the estimand for the difference in means in
the stratum where X = x. In a given sample, this is approximated using the analog estimator,²

τ̂_ATE = Σ_x DIM̂_x P̂(X = x)    (5)

The term P̂ (X = x) gives the empirical proportion of units in each stratum, and can be thought of
as the “natural” strata-wise weights, as they marginalize over the stratum according to the fraction of
units falling in that stratum. It would be natural to form an analog estimator for this through sub-
classification/stratification, simply computing the difference in means in each stratum of X and compiling
them per expression 5. However, suppose we instead attempt to estimate the treatment effect by fitting a
regression according to the model
Y = β0 + β1 X + τreg D + ϵ (6)

While regression-based estimation of treatment effects has been widely used in practice across disciplines for
decades, as Angrist (1998) and many scholars since have emphasized, τ̂_reg does not in general yield τ̂_ATE,
even when conditional ignorability holds. Rather, it can be understood as a version of Expression 5 but with
weights on each DIM̂_x that differ from P̂(X = x). We now turn to a ground-up analysis of these strata-wise
weights, intended to demystify them and to recognize them as the natural consequence of misspecification of
the linear model under heterogeneous effects.

2.2 The unit-level weighting representation of OLS

While our plan is to (re-)derive strata-wise weights on the DIM̂_x components, we start with “unit-level
weights”, which arise simply because the regression coefficient is a weighted sum of Y_i values. Specifically,
let d̂(X_i) = p̂(D_i = 1 | X_i) = X̃_i β̂, where X̃_i = [1  X_i] and β̂ is the vector of estimated coefficients
from a linear regression of D on X. The unit-wise weighting formula that corresponds to the OLS estimate
is defined according to

τ̂_reg = Σ_i w_i Y_i    (7)
² This quantity could also properly be labeled as the (estimate of) the sample average treatment effect (SATE) rather than
the ATE. However, we maintain the ATE notation for simplicity, as its expectation over samples is still the ATE, presuming
the sample in question is a probability sample from the population of interest.

for some weights w_i. By the FWL theorem (Frisch and Waugh, 1933; Lovell, 1963),

τ̂_reg = Ĉov(Y, D^⊥X) / V̂(D^⊥X) = Σ_i Y_i (D_i − d̂(X_i)) / Σ_i (D_i − d̂(X_i))²

which already takes the form of Equation 7, with weights given by

w_i = (D_i − d̂(X_i)) / Σ_i (D_i − d̂(X_i))²    (8)

As the denominator is a scalar, the behavior of these weights is revealed through the numerator, D_i − d̂(X_i).
This is simply 1 − d̂(X_i) for treated units and −d̂(X_i) for control units. Compare this to inverse-propensity
score weights (IPW), which are proportional to 1/d̂(X_i) for treated units and 1/(1 − d̂(X_i)) for control units.
Like the IPW weights, these unit-level weights in Expression 8 place greater weight on units whose
treatment status is more “surprising” given the covariate values. Unlike IPW, the linear regression weights
do not require constructing a ratio whose denominator can become close to zero, which can create
explosive weights under IPW.³
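To make this representation concrete, the following numpy sketch (simulated data; the data-generating process is ours, chosen only for illustration) verifies that the coefficient on D from regressing Y on D and X equals the weighted sum of outcomes in Expression 7, with the unit-level weights of Expression 8 built from the residuals of a linear regression of D on X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
D = (rng.uniform(size=n) < 1 / (1 + np.exp(-X))).astype(float)
Y = 1.0 + 2.0 * D + 0.5 * X + rng.normal(size=n)

# coefficient on D from the regression of Y on [1, X, D]
Z = np.column_stack([np.ones(n), X, D])
tau_reg = np.linalg.lstsq(Z, Y, rcond=None)[0][2]

# FWL step: fitted values d_hat from a linear regression of D on [1, X]
W = np.column_stack([np.ones(n), X])
d_hat = W @ np.linalg.lstsq(W, D, rcond=None)[0]

resid = D - d_hat
w = resid / np.sum(resid**2)      # unit-level weights (Expression 8)

# the coefficient is exactly the weighted sum of outcomes (Expression 7)
assert np.isclose(tau_reg, np.sum(w * Y))
```

The equality is an algebraic identity (FWL), so it holds to machine precision regardless of the simulated DGP.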

Negative weights. The weights w_i can be both positive and negative because Equation 7 is structured
as a single weighted sum/average. An isomorphic but useful way of signing the weights is to instead seek
the weights that would appear in a “weighted difference in means” estimator,

τ̂_wdim = Σ_{i:D_i=1} w̃_i Y_i − Σ_{i:D_i=0} w̃_i Y_i    (9)

where w̃_i = D_i w_i + (1 − D_i)(−w_i), which simply changes the sign for control units in order to accommodate
the subtraction rather than the summation in the form of Expression 7.
The weights w̃ signed as in Expression 9 will often be positive, and any positive weight may be naturally
interpreted as the relative contribution of each observation. However, they can be negative as well.
Chattopadhyay and Zubizarreta (2024) note that negative weights “translate into forming effect estimates
that may lie outside the support or range of the observed data”, or that they “[leave] room for biases due to
extrapolation from an incorrectly specified model.”
We note that the meaning of negative weights is closely tied to estimates one would obtain from a linear
probability model regressing the treatment indicator on X, even though that model is not run. Specifically,
weights will be negative for treated units when d̂(X_i) > 1, and for control units when d̂(X_i) < 0. If a treated
unit lies at an X position “so unique to treated units” that a linearly-fitted model produces a predicted value
greater than 1 there, it will have a negative weight. This can be thought of as indicating that a treated unit
is “more like treated units than any control units”, so that it cannot be well compared to control units. The
symmetric argument holds for control units with a negative value.
In this way, negative weights do indicate extreme non-overlap, but in a peculiar, model-dependent sense,
not in any non-parametric way that corresponds to broader notions of common support. These weights
are also not a sure diagnostic for poor overlap or extreme counterfactuals: one could easily imagine a situation
where all units above a certain value of X are treated and all those below another value of X are controls,
indicating poor overlap. However, there can be many (or all) units outside the area of common support
but with positive weights, because the fitted linear probability model for D given X does not exceed 1 (for
³ Equivalent unit-level weights are explored in depth by Chattopadhyay and Zubizarreta (2023) and in Chattopadhyay and
Zubizarreta (2024), where the authors usefully employ these weights to analogize how OLS implicitly compares (weighted)
mean outcomes for the treated to (weighted) mean outcomes for the controls, sharpening the analogy to the difference-in-means
estimator one might apply in a randomized experiment and discussing implications for qualities such as effective balance and
sample size.

treated units) or drop below 0 (for control units). Thus, we emphasize that negative weights can provide
a warning that “some units look too much like treated units to have valid comparisons” (and likewise for
controls), but a positive weight is not a guarantee of meaningful comparability in the sense of finding units
of the opposite treatment value nearby in the covariate space.
Finally, it may be theoretically clarifying to note that, if one uses a suitable logistic or probit regression for
the propensity score, this of course avoids predictions for D that fall outside of [0, 1], and the related inverse
propensity score weights will never be negative. However, the transformation simply changes the problem
to one of extreme up-weighting of treated units that fall in areas where the propensity score is close to zero
(or of control units in areas where the propensity score is close to 1). In this sense, a price is paid for poor
overlap, in the sense of predicted treatment values, whether one uses weights that imply a linear probability
model or a propensity score based on a non-linear model for treatment.
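A small constructed example (hypothetical data, chosen only to force d̂(X_i) > 1 for one treated unit) illustrates how a negative weight arises from the implied linear probability fit:

```python
import numpy as np

# controls at X = 0, 1, 2; treated at X = 3, 4, and one extreme unit at X = 20
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 20.0])
D = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# implied linear probability fit of D on [1, X]
W = np.column_stack([np.ones_like(X), X])
d_hat = W @ np.linalg.lstsq(W, D, rcond=None)[0]

resid = D - d_hat
w = resid / np.sum(resid**2)

print(d_hat[-1])   # exceeds 1 for the extreme treated unit (8/7 here)
print(w[-1])       # hence that treated unit receives a negative weight
```

Note that the sign of the weight depends only on the linear fit of D on X, not on Y, matching the text's point that the diagnostic is tied to an implicit linear probability model.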

2.3 From individual to strata-wise weights


While the analysis of unit-level weights is interesting, it serves here as a stepping stone to constructing the
strata-level weights. These weights are interesting because they allow us to see how the regression coefficient
compares to the ATE as a combination of average treatment effects by stratum as in Equation 5. To do so we
consider the case of discrete X and the strata-wise weights, meaning those that can be expressed as,

τ̂_reg = Σ_x w_x τ̂_x    (10)

where w_x is the weight for the stratum in which X = x and τ̂_x = Ê[Y_i(1) − Y_i(0) | X_i = x] is the conditional
average treatment effect for subgroup X_i = x. The resulting τ̂_reg corresponds to the ATE only when w_x =
P(X = x). The best known expression for such strata-wise weights of OLS arises from Angrist (1998),

τ̂_reg = [ Σ_x τ̂_x d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x) ] / [ Σ_x d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x) ]    (11)

This form shows that regression weights the strata-wise DIM̂_x components not according to P̂(X_i = x), as
required to represent the ATE, but proportionally to d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x). Such a regression puts
the most weight on strata in which the conditional variance of treatment status is largest, i.e. when d̂(X_i) is
nearest to 50%. One intuition is that OLS seeks to minimize squared error, and has the opportunity
to learn the most from strata with middling probabilities of treatment (Angrist and Pischke, 2009). Absent
treatment effect heterogeneity, this is unproblematic. However, if there are high levels of effect heterogeneity
in our data, then depending on how they correspond to strata of X and the probability of treatment in those
strata, these weights could move the regression coefficient far from the average treatment effect (Aronow and
Samii, 2016; Humphreys, 2009; Angrist and Pischke, 2009; Sloczyński, 2022).
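The variance-weighted representation in Expression 11 can be checked numerically. The sketch below uses a simulated binary covariate (our own illustrative DGP), so the linear fit of D on X is saturated and D is exactly linear in X, the case in which Angrist's expression applies exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.integers(0, 2, size=n).astype(float)   # binary covariate: linear fit is saturated
p = np.where(X == 1.0, 0.9, 0.5)               # P(D = 1 | X) varies by stratum
D = (rng.uniform(size=n) < p).astype(float)
Y = D * (1.0 + 3.0 * X) + rng.normal(size=n)   # tau_0 = 1, tau_1 = 4; ATE = 2.5

# coefficient on D from Y ~ 1 + X + D
Z = np.column_stack([np.ones(n), X, D])
tau_reg = np.linalg.lstsq(Z, Y, rcond=None)[0][2]

# Angrist (1998) representation: variance weights on strata difference-in-means
num = den = 0.0
for x in (0.0, 1.0):
    m = X == x
    d_hat = D[m].mean()                        # saturated fit: stratum treated share
    dim = Y[m & (D == 1.0)].mean() - Y[m & (D == 0.0)].mean()
    wt = d_hat * (1.0 - d_hat) * m.mean()      # d(1 - d) times P_hat(X = x)
    num += wt * dim
    den += wt

assert np.isclose(tau_reg, num / den)   # Expression 11 holds exactly here
# num / den sits below the ATE of 2.5, because the low-variance (p = 0.9)
# stratum, where the effect is largest, is down-weighted
```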
These weights, however, are obtained under the additional assumption that the probability of treatment
is linear in X, and not just modeled that way. Angrist and Pischke (2009) satisfy this by assuming the
corresponding outcome regression can be saturated in X. This is often a reasonable assumption with discrete,
low-dimensional X. Nevertheless we may wish to have a more general expression for the weights that does
not rely on such an assumption, either for completeness or in service of generalizing to the case where X is not
discrete or is otherwise infeasible to include in a saturating form (i.e. X with numerous multiple dimensions
and levels). To obtain a more general expression for the weights, we simply organize the individual-level

weights above into strata,

τ̂_reg = Σ_i w_i Y_i    (12)
      = Σ_i Y_i (D_i − d̂(X_i)) / Σ_i (D_i − d̂(X_i))²    (13)
      = Σ_x P̂(X_i = x) Ê[(D_i − d̂(X_i)) Y_i | X_i ] / Σ_x P̂(X_i = x) Ê[(D_i − d̂(X_i))² | X_i ]    (14)
b

Rearranging these terms in the form τ̂_reg = Σ_x w_x τ_x is not in general possible, as the expected outcome
under treatment and the expected outcome under control are weighted differently:

Ê[(D_i − d̂(X_i)) Y_i | X_i = x] = Ê[D_i (1 − d̂(X_i)) Y_i − (1 − D_i) d̂(X_i) Y_i | X_i = x]
  = (1 − d̂(X_i)) Ê[D_i Y_i | X_i = x] − d̂(X_i) Ê[(1 − D_i) Y_i | X_i = x]
  = (1 − d̂(X_i)) π_x Ê[Y_i | X_i = x, D_i = 1] − d̂(X_i)(1 − π_x) Ê[Y_i | X_i = x, D_i = 0]    (15)

Thus the strata-wise weighting generally imposed by OLS involves a combination of the true probability
of treatment given X and the probability of treatment given X estimated using a linear model. Let
π_x = P(D_i = 1 | X_i = x), and consider the discrepancy, a_x, between the true and linearly-approximated
probability of treatment given X, so that d̂(X_i) = π_x + a_x. The general strata-wise weights corresponding
to OLS are then

τ̂_reg = [ Σ_x P̂(X_i = x) ( π_x(1 − π_x) τ_x − a_x E[Y_i | X_i = x] ) ] / [ Σ_x P̂(X_i = x) ( π_x(1 − π_x − a_x) + a_x(π_x + a_x) ) ]    (16)

Appendix A.1 gives additional details. Notice that this expression is infeasible to compute when D is not
linear in X, as πx is unknown. However, if ax = 0 (i.e. D is truly linear in X), this representation reduces to
the variance-weighted representation of the regression coefficient given by Angrist (1998). For many purposes,
then, the Angrist (1998) representation provides a clear and intuitive conception of the weights. Nevertheless,
we found it necessary to have this more complete formulation in order to obtain the correct answer, as our
simulations below show.
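The in-sample analog of the strata-wise decomposition in Expression 14, which needs only the fitted d̂ rather than the unknown π_x, can be verified directly even when P(D|X) is not linear in X. A numpy sketch with a simulated three-level covariate (an illustrative DGP of our own, with a step-function treatment probability):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9_000
X = rng.choice([-1.0, 0.0, 1.0], size=n)
p = np.where(X < 0, 0.1, 0.5)                 # P(D = 1 | X) is a step: NOT linear in X
D = (rng.uniform(size=n) < p).astype(float)
Y = X + D * (2.0 + 3.0 * X) + rng.normal(size=n)

# coefficient on D from Y ~ 1 + X + D
Z = np.column_stack([np.ones(n), X, D])
tau_reg = np.linalg.lstsq(Z, Y, rcond=None)[0][2]

# d_hat from the (misspecified) linear fit of D on [1, X]
W = np.column_stack([np.ones(n), X])
d_hat = W @ np.linalg.lstsq(W, D, rcond=None)[0]

# strata-wise decomposition: in-sample analog of Expression 14
num = den = 0.0
for x in (-1.0, 0.0, 1.0):
    m = X == x
    num += m.mean() * np.mean((D[m] - d_hat[m]) * Y[m])
    den += m.mean() * np.mean((D[m] - d_hat[m]) ** 2)

assert np.isclose(tau_reg, num / den)   # exact, whatever the form of P(D | X)
```

The grouping-by-strata step is pure algebra on the FWL expression, which is why the identity holds exactly here even though d̂ is a badly misspecified model of P(D|X).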

2.4 From single linearity to separate linearity


The simple insight we emphasize, both for its demystifying value and to motivate our proposal, is that the
strata-wise weights just described are merely a description of the regression coefficient obtained by regressing
Y on D and X. An OLS model regressing Y on D and X alone would be correctly specified if the true conditional
expectation function, E[Y |X, D] is linear in D and X as in E[Y |X, D] = β0 + β1 X + β2 D. We call such an
assumption the single linearity assumption, because it requires a single assumption about Y being linear in
some terms. Note that this allows for random variation in treatment effects not correlated with X, but it
does not accommodate treatment effect variation that is correlated with X.
The value of the strata-wise weighting representation is simply that it characterizes the regression coefficient
in a way that readily shows how un-modeled effect heterogeneity in the regression of Y on D and X combines
with varying treatment probability to produce a coefficient that diverges from the ATE.
Accordingly, if the ATE rather than such a coefficient is the target of inference, a natural solution is not to
characterize or diagnose or bound the difference between the coefficient and the ATE, but rather to avoid
the underlying misspecification problem.
Here we consider the minimal change to regression practice that an investigator otherwise comfortable with
a linear model could employ to avoid this misspecification concern and the resulting weighting behavior.
We first define the separate linearity assumption, requiring that each of the potential outcomes is separately
linear in X,

E[Y(0) | X] = α_0 + αX    (17)
E[Y(1) | X] = γ_0 + γX    (18)

This assumption is slightly weaker than single linearity, which effectively forces α = γ. While we label
this assumption for easy reference and to distinguish it from single linearity, it is not intended to be novel,
and further can be understood as the (often implicit) motivation for a number of longstanding approaches
described next.
We turn next to a variety of known and straightforward estimation approaches that are in keeping with
this assumption and that wholly avoid the awkwardness of “regression’s weighting problem” rather than
employing diagnostic or interpretational aids.

2.5 Estimation approaches suitable for separate linearity


Interactions, imputation, g-computation, and stratification. Fortunately, a number of existing
approaches are suggested by such an assumption. One alternative estimation approach, proposed by Lin
(2013) initially to mitigate bias in covariate-adjusted estimates from randomized experiments, is to use a
regression model that includes an interaction term for the treatment and confounder. The interacted model
provides unbiased estimates of the ATE when there is covariate imbalance regardless of the sample size. The
treatment effect estimate, τ̂interact , is the estimated coefficient on the treatment variable Di in the regression
of Yi on Di , Xi − X̄, and Di (Xi − X̄). The centering of X in these terms is useful for interpretation: while
the fitted models with and without centering X are isomorphic, centering X allows the coefficient on D to
represent the average effect “when X is at its mean”. By linearity, this is also exactly the average marginal
effect of D on Y taken across observations at their observed values of X.4
Another straightforward approach to estimate the ATE under separate linearity is to run separate regressions
of Y on X for the treatment group and the control group, and use the results from each regression to predict
the unobserved outcomes in the other group. Then, using these predicted outcomes, we can estimate the
individual treatment effect for each subject, and take the average of the estimated individual treatment
effects to get the ATE. Let μ̂_0(X_i) be the model fit to the control units and μ̂_1(X_i) be the model fit to the
treated units,

τ̂_imp = (1/n) [ Σ_{D_i=1} (Y_i − μ̂_0(X_i)) + Σ_{D_i=0} (μ̂_1(X_i) − Y_i) ]    (19)
      = (1/n) [ Σ_{D_i=1} ((μ̂_1(X_i) + ϵ_1i) − μ̂_0(X_i)) + Σ_{D_i=0} (μ̂_1(X_i) − (μ̂_0(X_i) + ϵ_0i)) ]    (20)
      = (1/n) Σ_i (μ̂_1(X_i) − μ̂_0(X_i))    (21)

Because this explicitly puts weights of 1/N on every unit, the estimate does not suffer from the “weighting
problem”.5 Further, the Lin estimate and the regression imputation estimate are easily shown to be identical
⁴ This centering procedure and the resulting interpretation is complicated when one level of X (or the intercept) must be
dropped, as when X represents block indicators or, more generally, group fixed effects. We describe this concern and propose a
solution for this in Section 3.3.
⁵ We note the equality of Expression 19 to Expression 21 above implies that one may either (i) compare each unit’s observed
outcome under the realized treatment status to the modeled outcome for that unit under the opposite, or (ii) for each unit,
compare the modeled outcome under treatment to the modeled outcome under control, without using the observed outcome.
This is a result of relying on OLS for each outcome model, since the average fitted value from a given model will be precisely
equal to the average observed outcome over the same group. Such a property does not hold with estimators that cannot
guarantee mean-zero residuals.

in this context. Specifically, the Lin estimate models E[Y | D, X] as β_0 + β_1 D + β_2 X + β_3 DX, and thus implies

E[Y | X, D = 0] = β_0 + β_2 X    (22)
E[Y | X, D = 1] = (β_0 + β_1) + (β_2 + β_3) X    (23)

Meanwhile, the imputation estimate is based on two regression models, one for the untreated group and one
for the treated group:

μ̂_0(X_i) = E[Y | X, D = 0] = α_0 + α_1 X    (24)
μ̂_1(X_i) = E[Y | X, D = 1] = γ_0 + γ_1 X    (25)
Comparing this to the above equations, we can prove α0 = β0 , α1 = β2 , γ0 = β0 + β1 , and γ1 = β2 + β3
by showing the equivalency of the minimization problems in question (see Appendix A.2 for details). Using
these equivalencies, we can show that the ATE estimate from regression imputation is equivalent to the
estimate from the interacted regression,
τ̂_imp = (1/n) Σ_{i=1}^n (γ_0 + γ_1 X_i − (α_0 + α_1 X_i))    (26)
      = (γ_0 − α_0) + (γ_1 − α_1) X̄    (27)
      = (β_0 + β_1 − β_0) + (β_2 + β_3 − β_2) X̄    (28)
      = β_1 = τ̂_interact    (29)

Since these two estimation methods are identical under OLS, they can be used interchangeably. One can
directly show the unbiasedness of either under assumptions of consistency and conditional ignorability,

" N
#
1 X
E[τ̂imp ] = E (µ̂1 (Xi ) − µ̂0 (Xi )) (30)
N i=1
= E[E[Yi |Di = 1, Xi ]] − E[E[Yi |Di = 0, Xi ]] (31)
= E[Yi (1) − Yi (0)] (32)

Both are also identical to the stratification estimate when X is discrete,

τ̂_interact = τ̂_imp = (1/N) Σ_{i=1}^N (μ̂_1(X_i) − μ̂_0(X_i))    (33)
           = Σ_{x∈X} P̂(X = x) Ê[μ̂_1(X_i) − μ̂_0(X_i) | X_i = x]    (34)

This is also known as g-computation, meaning that an estimate from g-computation will also be unbiased
for the ATE in settings with high levels of heterogeneity in treatment effect and probability of treatment
assignment (Robins, 1986; Snowden et al., 2011). It has also been referred to more recently as
“multi-regression” by Chattopadhyay and Zubizarreta (2024), who also note earlier uses of this approach as
far back as Peters (1941) and Belson (1956).
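The numerical identity of the interacted (Lin) estimator and regression imputation under OLS can be confirmed in a few lines (simulated data and helper names are ours; the equivalence itself is the one proven above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=n)
D = (rng.uniform(size=n) < 0.3 + 0.4 * (X > 0)).astype(float)
Y = 1.0 + 2.0 * D + X + 1.5 * D * X + rng.normal(size=n)

# (i) Lin / interacted regression: Y ~ 1 + D + (X - Xbar) + D*(X - Xbar)
Xc = X - X.mean()
Z = np.column_stack([np.ones(n), D, Xc, D * Xc])
tau_interact = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# (ii) regression imputation / g-computation: separate OLS fits by arm
def arm_fit(mask):
    A = np.column_stack([np.ones(mask.sum()), X[mask]])
    return np.linalg.lstsq(A, Y[mask], rcond=None)[0]

a0, a1 = arm_fit(D == 0)      # control model mu_hat_0
g0, g1 = arm_fit(D == 1)      # treated model mu_hat_1
tau_imp = np.mean((g0 + g1 * X) - (a0 + a1 * X))

assert np.isclose(tau_interact, tau_imp)   # identical point estimates
```

Because the interacted model is saturated in D, it is equivalent to fitting the two arms separately, so the two estimates agree to machine precision on any dataset, not just this simulated one.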

Mean balancing weights. We can also connect the separate linearity assumption to a justification for
calibration/balancing weights. Consider mean balancing weights for both the treated and control units, so
that the weighted average of covariates for each is equal to the overall unweighted average of covariates,

Σ_{i:D_i=0} w_i X_i = Σ_{i:D_i=1} w_i X_i = (1/N) Σ_{i=1}^N X_i    (35)
While many choices of weights can satisfy these constraints (when feasible), it is desirable to minimize some
measure of their variation. In the case of maximum entropy weights (Hainmueller, 2012), this is done by
minimizing the entropy divergence Σ_i w_i log(w_i /q_i ) relative to base weights q_i. The key idea is that by
achieving equal means on X in the treated and control groups (both equal to the full sample’s mean on X),
any linear function of X (which includes Y(1) and Y(0) if separate linearity holds) will also have equal
means in these groups. A difference in means estimator using these weights is then unbiased for the ATE,
without requiring any appeal to the relationship between this weighting procedure and propensity score
modeling.⁶ An interesting feature of this approach is that, while it is justified by the same assumptions as
regression imputation or the interacted regression above, its estimation strategy is different. Balancing
weights do not require estimating the (nuisance) coefficients of any model. This may improve tolerance to
misspecification (i.e. non-linear conditional expectations for each potential outcome) at the cost of variance,
as borne out in the simulations below.
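As an illustration of the balancing idea, the sketch below solves for minimum-variance weights under the mean-balance constraints of Expression 35. This is a closed-form stand-in for the maximum-entropy weights of Hainmueller (2012), chosen for brevity: the DGP and the `balance_weights` helper are ours, and unlike entropy weights these are not constrained to be non-negative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=n)
D = (rng.uniform(size=n) < 0.2 + 0.6 * (X > 0)).astype(float)
Y = 1.0 + 2.0 * D + X + 1.5 * D * X + rng.normal(size=n)

def balance_weights(x_arm, target):
    """Min-variance weights summing to 1 with weighted mean of x_arm = target."""
    m = len(x_arm)
    base = np.full(m, 1.0 / m)
    A = np.vstack([np.ones(m), x_arm])    # constraint matrix: sum and mean
    b = np.array([1.0, target])
    # Lagrangian solution of min ||w - base||^2  s.t.  A w = b
    return base + A.T @ np.linalg.solve(A @ A.T, b - A @ base)

target = X.mean()                          # balance both arms to the full-sample mean
w1 = balance_weights(X[D == 1], target)
w0 = balance_weights(X[D == 0], target)

# weighted means of X now match the full-sample mean in both arms
assert np.isclose(w1 @ X[D == 1], target)
assert np.isclose(w0 @ X[D == 0], target)

tau_bal = w1 @ Y[D == 1] - w0 @ Y[D == 0]  # weighted difference in means
```

Under separate linearity, balancing the mean of X in each arm to the full-sample mean makes the weighted difference in means unbiased for the ATE, which is exactly the logic of the text above.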

3 Simulations
To verify that these estimators behave as our analysis claims, we explore the performance of different
estimation approaches under three different data generating processes, first with a discrete covariate X
and then with a continuous one.

3.1 Discrete covariate


The first three DGPs involve a binary treatment variable D, a discrete covariate X taking integer values in [−3, 3], and
an outcome Y which depends on D and X. For each simulation setting, the tables in Figure 1 show the
possible values of X, the corresponding probability of treatment, probability that X = x, and the average
treatment effect τx for the subgroup where X = x. In all simulations, noise is added to the outcome to
achieve an R2 of 0.33 between the systematic (noiseless) portion of Y and the final Y with noise.
As expected, when there is a high level of heterogeneity in treatment effect and in probability of treatment,
the regression adjustment estimate will have a substantial amount of bias. Figure 1 shows the effect estimates
when there is heterogeneity in both treatment effect and probability of treatment between subgroups. In the
first two settings, the potential outcomes are linear in X. Here we see that the OLS estimate (reg) is heavily
biased, and would lead investigators to conclude there is a statistically significant positive effect, though
the true ATE is zero. Notably, slightly more extreme simulations would even make it possible for OLS to
produce the incorrect sign for the treatment effect estimate. Meanwhile, regression imputation (impute),
the Lin estimator (interact), mean balancing (meanbal), and matching (match) all successfully address this
concern. We recommend the use of regression imputation or equivalently the Lin/interacted adjustment as
a simple way to improve on conventional OLS estimation.
Naturally, in the case where the potential outcomes are not linear in the treatment and covariates, the
estimates from the interacted model, imputation, and mean balancing are all biased for the ATE. This is
expected, as the assumptions of both single and separate linearity are violated. Addressing such non-linearity
requires non-linear estimators, such as matching. We also note that the variance-weighted estimate (using
Expression 11) does not always reproduce the actual OLS estimate, as in the first simulation setting, where
P (D = 1|X) is not linear in X. This is easily avoided by fully saturating the model in X, although such a
maneuver would not work for the continuous cases below.
The more general (but infeasible) weighting representation (Expression 16) reproduces the OLS estimate
exactly regardless of the functional form of P (Di |Xi ).
6 See Appendix A.3 for proof, with similar results targeting the ATT in Hazlett (2020). See Zhao and Percival (2017) for a detailed analysis of identification and double-robustness of mean balancing weights.
DGP1: P (D|X) increasing in X; {Y (1), Y (0)} linear in X

  X    P (D|X)   P (X = x)   τx
 -3    0.100     0.143       -9
 -2    0.100     0.143       -6
 -1    0.100     0.143       -3
  0    0.500     0.143        0
  1    0.500     0.143        3
  2    0.500     0.143        6
  3    0.500     0.143        9

[Boxplots of effect estimates from stratify, reg, impute, interact, meanbal, and match.]

DGP2: P (D|X) increasing linearly in X; {Y (1), Y (0)} linear in X

  X    P (D|X)   P (X = x)   τx
 -3    0.100     0.143       -9
 -2    0.200     0.143       -6
 -1    0.300     0.143       -3
  0    0.400     0.143        0
  1    0.500     0.143        3
  2    0.600     0.143        6
  3    0.700     0.143        9

[Boxplots of effect estimates from stratify, reg, impute, interact, meanbal, and match.]

DGP3: P (D|X) increasing in X; nonlinear Y (1), Y (0)

  X    P (D|X)   P (X = x)   τx
 -3    0.100     0.143        5
 -2    0.200     0.143        5
 -1    0.300     0.143        0
  0    0.500     0.143       -5
  1    0.700     0.143        0
  2    0.800     0.143        5
  3    0.900     0.143        5

[Boxplots of effect estimates from stratify, reg, impute, interact, meanbal, and match.]

Figure 1: Estimates over 500 iterations of size n = 1000. The dashed red line indicates the true ATE. The
dashed blue and green lines show the regression coefficient expected using the simplified weights of Angrist
and Krueger (1999) (blue) and our more general weights (green).

Standard errors

While these results show the expected variability in estimates under resampling from the given DGP, for an
investigator working with one observed dataset, some form of estimated standard error is vitally important
to inference. Table 1 reports the average analytically estimated standard errors for DGP1 above, still with
discrete X. Results are similar in other settings.
For stratification, we take a weighted sum of strata-specific Neyman variances. Here and below we write
expressions in the plug-in/analog sample estimator form.
\widehat{V}(\hat{\tau}_{strat}) = \sum_{x \in \mathcal{X}} \hat{p}(X = x)^2 \left( \frac{\widehat{V}(Y_i \mid D = 1, X = x)}{n_{1x}} + \frac{\widehat{V}(Y_i \mid D = 0, X = x)}{n_{0x}} \right)    (36)
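A plug-in implementation of the stratified estimator and Expression 36 takes only a few lines. This is a sketch on hypothetical data (the DGP here is our own), using strata-specific treated and control counts.

```python
import numpy as np

def stratified_dim(X, D, Y):
    """Stratified difference-in-means and its Neyman-style variance
    (plug-in form of Expression 36), for a discrete covariate X."""
    tau_hat, var_hat = 0.0, 0.0
    for x in np.unique(X):
        s = X == x
        y1, y0 = Y[s & (D == 1)], Y[s & (D == 0)]
        p_x = s.mean()  # \hat{p}(X = x)
        tau_hat += p_x * (y1.mean() - y0.mean())
        var_hat += p_x**2 * (y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return tau_hat, var_hat

# Hypothetical data with a true ATE of zero, for illustration only
rng = np.random.default_rng(0)
X = rng.integers(-3, 4, size=20_000).astype(float)
D = rng.binomial(1, np.where(X < 0, 0.1, 0.5))
Y = 2 * X + D * 3 * X + rng.normal(0, 5, size=X.size)
tau_hat, var_hat = stratified_dim(X, D, Y)
```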

For the simple and interacted regression, we calculate the HC2 standard error of the coefficient on the
treatment variable. For regression imputation, we calculate the standard error again in the Neyman style,

\widehat{V}\left( \frac{1}{n} \sum_i (\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)) \right) = \widehat{V}\left( \frac{1}{n} \sum_i (\gamma X_i - \alpha X_i) \right)    (37)
    = \widehat{V}(\gamma \bar{X} - \alpha \bar{X})    (38)
    = \bar{X}^T \hat{\Sigma}_1 \bar{X} + \bar{X}^T \hat{\Sigma}_0 \bar{X}    (39)

where Σ̂1 and Σ̂0 are the estimated variance-covariance matrices for the treatment model and the control
model, respectively.
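A sketch of this Neyman-style imputation standard error (Expressions 37-39) appears below, using an HC2 sandwich estimator for each group's coefficient covariance; the data are hypothetical and the single-covariate design is an assumption of the sketch.

```python
import numpy as np

def hc2_cov(Z, resid):
    """HC2 sandwich covariance: (Z'Z)^-1 Z' diag(e_i^2 / (1 - h_i)) Z (Z'Z)^-1."""
    bread = np.linalg.inv(Z.T @ Z)
    h = np.einsum('ij,jk,ik->i', Z, bread, Z)  # leverage values h_i
    meat = Z.T @ (Z * (resid**2 / (1.0 - h))[:, None])
    return bread @ meat @ bread

def impute_se(X, D, Y):
    """SE of the regression-imputation ATE: sqrt(xbar' Sigma1 xbar + xbar' Sigma0 xbar),
    fitting a separate OLS model among treated and among controls."""
    xbar = np.array([1.0, X.mean()])  # mean design row [1, Xbar]
    total = 0.0
    for d in (1, 0):
        Z = np.column_stack([np.ones(np.sum(D == d)), X[D == d]])
        beta = np.linalg.lstsq(Z, Y[D == d], rcond=None)[0]
        total += xbar @ hc2_cov(Z, Y[D == d] - Z @ beta) @ xbar
    return np.sqrt(total)

rng = np.random.default_rng(0)
X = rng.integers(-3, 4, size=2_000).astype(float)
D = rng.binomial(1, np.where(X < 0, 0.1, 0.5))
Y = 2 * X + D * 3 * X + rng.normal(0, 5, size=X.size)
se = impute_se(X, D, Y)
```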
For mean balancing we show two types of analytical standard errors. First, we use the (HC2) standard
error from the weighted regression of just Y on D. Second, meanbal-adj uses the HC2 standard errors from
a regression of Y on D and X, again with the estimated weights. Both methods produce identical point
estimates (when perfect mean balance is achieved by the weights), but the analytical standard errors of the
meanbal-adj approach benefit from partialing out the X. This is akin to how conventional OLS standard
errors, under a fixed design, partial X out of Y so that the estimates are built on the conditional/residual
variance of Y rather than the total variance. For matching we use the Abadie-Imbens standard error (Abadie
and Imbens, 2006).

bias rmse avg analytical SE empirical SE


stratify -0.021 1.431 1.478 1.432
reg 2.021 2.376 1.264 1.250
impute 0.005 1.315 1.321 1.316
interact 0.005 1.315 1.321 1.316
meanbal -0.001 1.328 1.812 1.329
meanbal-adj -0.002 1.328 1.379 1.329
match -0.005 1.434 1.514 1.436

Table 1: Simulation results under DGP1. The bias and rmse columns repeat information shown graphically
in Figure 1. The empirical SE gives the actual standard deviation of the estimates across resamples. The
average analytical SE is the average of the standard errors that would have been computed analytically in
each iteration, using the estimators described in the text.

We find, first, that the empirical standard errors are extremely similar across the methods relying on single
or separate linearity (reg, impute, interact, meanbal, meanbal-adj). The approaches not relying on single
or separate linearity (stratify and match) show somewhat larger standard errors, as expected given their

11
greater flexibility. Second, each method's average analytical standard error is within 5% of the empirical
standard error, with the exception of meanbal, whose average analytical standard error is almost 40% larger
than the empirical value. This is repaired, however, by using meanbal-adj.
Finally, some investigators may be concerned with efficiency in the sense of sampling variability in the
estimate and its consequences for inference. This can be understood as a bias-variance tradeoff. The
comparison between OLS (reg) and the Lin/interaction approach (interact) offers a simple starting point,
since the latter adds one additional parameter to the regression for every covariate dimension. Because
the number of observations is large relative to the number of covariates, this has very little impact on the
estimate's variability across resamples. For example, in DGP1, Table 1 shows that the empirical SE is only
about 5% larger for interact than for reg. The behavior of impute is of course identical. The empirical SEs
from meanbal and meanbal-adj are similarly only 6% larger than from reg. If an investigator is primarily
concerned with the root mean square error (RMSE) of the estimates around the true value, RMSE values fall
by nearly half for each of these methods relative to reg. That is, the small loss of efficiency in these settings
(increased variance) is more than made up for by the reduction in bias as they factor into the RMSE.
It is important to recognize, however, the favorable nature of our simulation setting in this regard. If the
number of covariates were large enough relative to the sample size, if treatment probability varied little by
stratum of X, and/or if efficiency were a greater concern than bias or RMSE, then investigators might have
cause to prefer OLS, adopting its weighted ATE as the target estimand for their inferential purposes.

3.2 Continuous covariate


While we have considered discrete X thus far for the sake of intuition regarding strata, the lessons apply
to settings with continuous X as well. In the simulations shown in Figures 2a and 2b, X is sampled
uniformly from [−3, 3], and Y (1) = Y (0) + 3X. However, P (D|X) takes on a different form in each setting.
For the first continuous specification, P (D|X) changes when X crosses 0. In the second, P (D|X) increases
linearly with X. These are analogous to the discrete cases above. For the third specification, P (D|X)
increases linearly in X, but Y (1) is nonlinear in X.
In each of these settings, we see that the OLS estimate is biased for the true ATE. As before, the interaction,
imputation, and mean balancing estimators perform well, except in 2c where even separate linearity fails.
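A mean balancing estimator in the setting of panel (b) can be sketched with simple linear-calibration weights. This is an illustration rather than the exact implementation used in our simulations: the control response Y(0) = X is assumed here, and these calibration weights may be negative (entropy balancing would constrain them to be positive).

```python
import numpy as np

def calib_weights(x_g, target_mean):
    """Linear calibration weights for one group: they sum to 1 and reweight the
    group's mean of x exactly to target_mean (a simple mean-balancing choice)."""
    n = x_g.size
    centered = x_g - x_g.mean()
    shift = (target_mean - x_g.mean()) / centered.var()  # var() uses 1/n, matching the identity
    return (1.0 + shift * centered) / n

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-3, 3, size=n)
D = rng.binomial(1, 0.1 * X + 0.4)                  # P(D|X) = 0.1X + 0.4, as in panel (b)
Y = X + D * 3 * X + rng.normal(0, 2, size=n)        # Y(0) = X assumed; true ATE = E[3X] = 0

w1 = calib_weights(X[D == 1], X.mean())             # balance treated to the full-sample mean
w0 = calib_weights(X[D == 0], X.mean())             # balance controls to the full-sample mean
tau_meanbal = np.sum(w1 * Y[D == 1]) - np.sum(w0 * Y[D == 0])
tau_dim = Y[D == 1].mean() - Y[D == 0].mean()       # naive difference in means, biased here
```

Under separate linearity, the weighted difference in means recovers the ATE while the unweighted difference in means does not.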

(a) X sampled from Unif[−3, 3]; P (D|X) = 0.1 when X < 0 and P (D|X) = 0.5 when X ≥ 0; Y (1) = Y (0) + 3X.
(b) X sampled from Unif[−3, 3]; P (D|X) = 0.1X + 0.4; Y (1) = Y (0) + 3X.
(c) X sampled from Unif[−3, 3]; P (D|X) = 0.1X + 0.4; Y (1) = Y (0) + X².

Figure 2: Effect estimates for 200 samples of size n = 1000 under data generating processes where X is
continuous, for the estimators reg, impute, interact, meanbal, and match. The dashed red line is set at the
true ATE. The dashed blue line is set at the variance-weighted estimate using the true subgroup-level
treatment effects.

3.3 Covariate adjustment in experiments, block randomization, and baseline-free average marginal effects
These lessons may at first seem to be directed at researchers working with observational data, for whom
conditioning on X is a requirement. However, they also apply in settings where conditioning on X is used to
improve precision or to minimize finite sample differences from the expected result. Our advice essentially echoes
that of Lin (2013) and a broad subsequent literature calling for a model that interacts treatment with the
(centered) covariates. As noted above, this is identical to analogous imputation/g-computation/T-learner
approaches when using OLS for the underlying models.
Consider first the cases where investigators have a randomized experiment, but adjust for covariates in the
analysis to improve precision. When D is fully randomized, the probability of treatment across values of X
will typically not vary greatly, except by chance. This implies that the bias due to regression’s weighting
behavior will typically be small. Nevertheless, any discrepancy between the ATE and coefficient on account
of these weights can be avoided entirely.
Second, block randomization is a powerful tool often employed by experimentalists to reduce finite sample
deviations from expectation. For the estimates to reflect the reduced variance this affords, it is standard
to regress the outcome (Y ) on the treatment (D) and block indicators (B) as fixed effect covariates. The
weights implied by regression of Y on D and block indicators B would be given by

\hat{\tau}_{BFE} = \frac{\sum_{b=1}^{P} \hat{\tau}_b P(D_i = 1 \mid B_i = b)(1 - P(D_i = 1 \mid B_i = b)) P(B_i = b)}{\sum_{b=1}^{P} P(D_i = 1 \mid B_i = b)(1 - P(D_i = 1 \mid B_i = b)) P(B_i = b)}    (40)

where it is innocuous to assume that P (D = 1|B) is linear in the block indicators, as this regression would

be fully saturated in B. If all blocks have the same probability of treatment, the weights are equal, so the
regression coefficient will represent the difference in means per block, averaged over the size of each block, i.e.
the ATE. This will often be the case. However, if the treatment probability varies by block, either by design
or otherwise, then the resulting coefficient estimate would not in general be the ATE. Rather, regression
will put higher weight on blocks with probabilities of treatment nearer to 50%. As above, this is simply a
consequence of misspecification generated by heterogeneous treatment effects across blocks. Including
interactions between the treatment and the block fixed effects would address this, under the separate linearity
assumption,

E[Y_i(1) \mid B_i] = \gamma_0 + \gamma_1 1\{B_i = 1\} + \ldots + \gamma_P 1\{B_i = P\}    (41)

E[Y_i(0) \mid B_i] = \alpha_0 + \alpha_1 1\{B_i = 1\} + \ldots + \alpha_P 1\{B_i = P\}    (42)
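Because the regression of Y on D and block indicators is saturated in B, the identity in Expression 40 holds exactly in-sample once the probabilities are replaced by their sample analogs. A quick numerical check on a hypothetical block-randomized DGP (the block probabilities and effects below are our own choices, loosely following Figure 3):

```python
import numpy as np

rng = np.random.default_rng(2)
n_blocks, per = 6, 2_000
B = np.repeat(np.arange(n_blocks), per)
p = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5])[B]
D = rng.binomial(1, p).astype(float)
Y = 2 + D * 4 * p + rng.normal(0, 1, size=B.size)   # tau_b = 4 * P(D|B)

# Coefficient on D from the block fixed-effects regression: Y ~ 1 + D + block dummies
Z = np.column_stack([np.ones(B.size), D,
                     (B[:, None] == np.arange(1, n_blocks)).astype(float)])
tau_bfe = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# Expression 40 with sample analogs: pi_b = mean(D | B=b), p_b = n_b/n, tau_b = blockwise DIM
num = den = 0.0
for b in range(n_blocks):
    s = B == b
    pi_b, p_b = D[s].mean(), s.mean()
    tau_b = Y[s & (D == 1)].mean() - Y[s & (D == 0)].mean()
    num += tau_b * pi_b * (1 - pi_b) * p_b
    den += pi_b * (1 - pi_b) * p_b

print(tau_bfe, num / den)  # the two agree up to floating-point error
```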

One complication when applying the interacted approach here is that care must be taken regarding the
interpretation due to the interaction. In typical usage, one block indicator will be omitted to avoid co-
linearity with the intercept. If no centering/de-meaning is done on the block indicators, the coefficient
estimate would represent the estimated effect (difference in means) in whichever block had its indicator
omitted. The solution of centering covariates (Lin, 2013) as utilized above is now complicated by this
omission. It is possible to omit the intercept, rather than the indicator for one block, and utilize the
centering. However, a more general solution is to avoid any consideration of centering and omitting one
level/the intercept, and works when dealing with one or multiple categorical variables. This is to simply
compute the marginal effect estimate for each observation, in whatever block it is in, \widehat{\partial Y / \partial D}\,|_{B=b}. Averaging
these across observations (giving equal weight to each observation) produces the average marginal effect
(AME),

\hat{\tau}_{AME} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\frac{\partial Y}{\partial D}}\bigg|_{B = b_i}    (43)

where \widehat{\partial Y / \partial D}\,|_{B=b} is the estimated marginal effect in block b, the block in which unit i is found. For example,
in the regression

Y_i = \beta_0 + \tau D_i + \beta_1 1\{B_i = 1\} + \beta_2 1\{B_i = 2\} + \ldots + \beta_P 1\{B_i = P\}
    + \alpha_1 1\{B_i = 1\} D_i + \alpha_2 1\{B_i = 2\} D_i + \ldots + \alpha_P 1\{B_i = P\} D_i + \epsilon_i

the estimated marginal effect in block B = b is \hat{\tau} + \hat{\alpha}_b, and so the average marginal effect of interest over
the whole sample is simply

\hat{\tau}_{AME} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\tau} + \sum_{b=1}^{P} \hat{\alpha}_b 1\{B_i = b\} \right)    (44)

Using this approach to interpret the fitted regression with interactions always produces an estimate with
the desired interpretation of “the estimated marginal effect, which can differ by block, averaged over blocks
according to how many observations fall in each.” In the case of block indicators or other categorical variables,
this approach avoids errors in relation to choices about what levels to omit in the regression, whether to
omit the intercept, or having to center these variables. The variance for this estimator is
V(\hat{\tau}_{AME}) = V(\hat{\tau}) + \sum_{p=1}^{P} P(B_i = p)^2 V(\hat{\alpha}_p) + 2 \sum_{k<j} P(B_i = k) P(B_i = j) Cov(\hat{\alpha}_k, \hat{\alpha}_j)
    + 2 \sum_{p=1}^{P} P(B_i = p) Cov(\hat{\tau}, \hat{\alpha}_p)
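A sketch of the AME computation in Expressions 43-44 under a hypothetical block-randomized DGP resembling Figure 3, with one block indicator omitted and no centering required:

```python
import numpy as np

rng = np.random.default_rng(3)
n_blocks, per = 6, 20_000
B = np.repeat(np.arange(n_blocks), per)
p = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5])[B]
tau = 4.0 * p                                      # tau_b = 4 * P(D|B); true ATE = mean(tau)
D = rng.binomial(1, p).astype(float)
Y = 2.0 + D * tau + rng.normal(0, 1, size=B.size)

dummies = (B[:, None] == np.arange(1, n_blocks)).astype(float)  # block 0 omitted
Z = np.column_stack([np.ones(B.size), D, dummies, dummies * D[:, None]])
beta = np.linalg.lstsq(Z, Y, rcond=None)[0]

tau_hat = beta[1]                                  # marginal effect in the omitted block
alpha_hat = np.concatenate([[0.0], beta[2 + (n_blocks - 1):]])  # interaction shifts; 0 for block 0
ame = tau_hat + alpha_hat[B].mean()                # Expression 44: average the blockwise effects
```

Averaging the blockwise marginal effects over units recovers the ATE regardless of which level (or the intercept) was omitted.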

For comparison, we also consider two other approaches. One practice is to weight the blocked fixed effects
regression by the (stabilized) inverse probability of treatment assignment for each block,

w_i^{IPW} = D_i \frac{P(D_i = 1)}{P(D_i = 1 \mid B = b_i)} + (1 - D_i) \frac{P(D_i = 0)}{P(D_i = 0 \mid B = b_i)}    (45)

These weights are constructed so that within each block, the treated and control units make up equal
weighted proportions. They therefore neutralize any differences in the P (D = 1|B) across blocks. If we
perform a weighted regression of Y on D and the block dummy variables using these weights, the coefficient
estimate for D will be unbiased for the ATE.7
Finally, we also consider the stratification estimator (Expression 5), but with blocks as strata,
\hat{\tau}_{blockDIM} = \sum_{b=1}^{|B|} \hat{P}(B = b) (\bar{Y}_{D=1,B=b} - \bar{Y}_{D=0,B=b}).    (46)

Figure 3 compares estimates from the “plain” block fixed effects (block FE) regression to the interacted
regression with the average marginal effect interpretation (block FE interact), the IPW-weighted regression
(block FE IPW), and the block DIM average (mean block DIM) in a simulation setting where treatment
probability varies by block and there is heterogeneity in treatment effect. As expected, the block fixed
effects OLS model suffers from the weighting problem. All three alternative approaches produce unbiased
and identical estimates. We recommend the use of one of these alternative estimators to improve upon
estimation of the ATE under block randomization.8
[Boxplots of effect estimates from block FE, block FE interact, block FE IPW, and mean block DiM.]

Figure 3: Effect estimates for 500 iterations of size n = 1200 each. D is assigned by (complete) randomization
within each block with probability P (D|B), and Y = 2 + Dτ. For blocks 1-6, P (D|B) = (0.25, 0.25, 0.5,
0.25, 0.25, 0.5) respectively, and τ = 4 · P (D|B). The black line indicates the true ATE.

7 A weighted difference in means (rather than regression including B) with these weights would produce the same point
estimate, but the improvement in efficiency obtained under the block randomization design will not be fully reflected in the
estimated standard error. This occurs because the block indicators will be orthogonal to treatment (once weighted) and so do
not affect the coefficient estimate on treatment, but the omission of B prevents the model from reducing the residuals that
enter the standard error.
8 These results are consistent with those demonstrated in https://declaredesign.org/blog/posts/biased-fixed-effects.html,
which use the DeclareDesign simulation approach (Blair et al., 2019) and conclude that the mean blockwise DiM/stratification
approach is suitable.

4 Conclusion
Given the wide use of OLS regression of Y on D and X in practice, and the potential gap between
the coefficient and target quantities like the ATE, prior work has provided several forms of guidance to
practitioners regarding this weighting problem. Humphreys (2009) notes the conditions under which the
OLS result will be nearer to ATT or ATC, and observes monotonicity conditions under which the coefficient
estimate will fall between these two endpoints. Aronow and Samii (2016) suggest characterizing the “effective
sample” for which the regression coefficient would be an unbiased estimate of the ATE by applying the
variance weights to the covariate distribution. Other papers have provided diagnostics for quantifying the
severity of this bias. Chattopadhyay and Zubizarreta (2023) characterize the implied unit-level weights of
regression adjustment (as we do as an intermediate result), and propose diagnostics that use these weights to
analyze covariate balance, extrapolation outside the support, variance of the estimator, and effective sample
size. These ideas relate closely to those of Aronow and Samii (2016) in that they compare the covariate
distribution targeted by the estimator to that of the target population. Sloczyński (2022) derives a diagnostic
related to variation in treatment probability,

\delta = \frac{\rho^2 V[P(D|X) \mid D = 1] - (1 - \rho)^2 V[P(D|X) \mid D = 0]}{\rho V[P(D|X) \mid D = 1] + (1 - \rho) V[P(D|X) \mid D = 0]}    (47)

where ρ is the unconditional probability of treatment. The bias due to heterogeneity will be δ(τ_ATC − τ_ATT),
where τ_ATT and τ_ATC are the average treatment effects among the treated and among the controls,
respectively. Thus δ captures one aspect of the potential for bias, namely variation in treatment probability,
but reasoning about the severity of bias requires knowledge of the heterogeneity in treatment effects, which
is unknown to the investigator.
Our argument is that in many cases, investigators could consider different modeling choices that avoid this
“weighting problem” altogether. This begins by understanding the weighting problem as a symptom of
misspecification when the treatment effect (and probability of treatment) vary with X. We provide a new,
more general expression for the “weights of regression” by algebraically manipulating a representation of
the OLS coefficient. The key fact is that the apparent “weights of regression” simply account for the way
misspecification alters the coefficient away from the ATE. Second, viewed this way, a natural proposal
for research practice is to rely instead on specifications that will not necessarily be violated by effect
heterogeneity. To focus on minimal changes in practice, we propose that investigators at least relax the
“single linearity” assumption (that Y is linear in D and X) to “separate linearity” (treatment and control
potential outcome are separately linear in X). Moreover, a number of existing, straightforward estimation
approaches produce unbiased ATE estimates under this assumption. First, regression imputation (including
g-estimation and the T-learner) with OLS as well as including the interaction of X and D (as in Lin, 2013)
are all suitable, and identical to each other in this setting. This longstanding approach dates back to at
least Peters (1941) and Belson (1956), as noted by Chattopadhyay and Zubizarreta (2024) who label it as
“multi-regression”. In addition, “mean balancing” estimators applied to the ATE (those that choose
weights to achieve the same mean of X for the treated and control groups as in the full sample) are also
unbiased under separate linearity. These are not identical to the above, as they avoid fitting any model, and
our simulations suggest this property can partially mitigate bias even when the separate linearity assumption
fails.
Any of these approaches can be useful not only in observational studies (so far as investigators can claim
conditional ignorability), but also when using covariate adjustment after randomization, including the special
case of analyzing block randomized experiments.
While we recommend avoiding the weights through these other procedures, this is not without limits. The
principal cost of these approaches relative to OLS is the potential loss in efficiency. The interact/imputation/g-
computation/T-learner approach adds an additional nuisance parameter per covariate. The circumstances
investigated here—with just one covariate, a high degree of treatment effect heterogeneity, and large variation
in treatment probability—clearly favor these approaches, where they eliminate bias and reduce RMSE by
roughly half, while increasing the standard error by only about 5%. However there may be settings in which
the improved bias does not so obviously dominate the efficiency cost. For example, where there are many

covariates relative to the sample size, little difference in the probability of treatment by stratum, or little
theoretical reason to expect treatment effect heterogeneity, investigators may have cause to prefer OLS if
efficiency concerns are of paramount interest. In those cases, diagnostics or interpretational aids that gauge
the severity of the misspecification bias (weights) may be useful.
Finally, when the linearity in potential outcomes assumption is not satisfied, none of these methods can
guarantee unbiasedness. Our work here regards only the relaxation from single linearity to separate linearity.
Investigators not satisfied with the linear approximation to treatment effects in this way could reasonably
consider more flexible approaches—though doing so may incur additional uncertainty costs as illustrated in
Table 1. Nevertheless, we agree with assessments such as Keele et al. (2010) and Aronow and Samii (2016)
that current practice largely remains reliant on linear approximations to adjust for covariates while hoping
to interpret the result as a meaningful causal effect. Our analysis suggests that, especially where there
are enough observations relative to covariates to support imputation/g-computation/T-learner, interactive
regression, or mean balancing, this may be preferable to suffering regression’s “weighting problem”, as it
avoids this form of bias under slightly weaker assumptions and requires only a small change to current
practice.

References
Abadie, A. and Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average
Treatment Effects. Econometrica, 74(1):235–267.
Angrist, J. D. (1998). Estimating the Labor Market Impact of Voluntary Military Service Using Social
Security Data on Military Applicants. Econometrica, 66(2):249–288.
Angrist, J. D. and Krueger, A. B. (1999). Empirical strategies in labor economics. In Handbook of Labor
Economics, volume 3, Part A, pages 1277–1366. Elsevier.
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton University Press.
Aronow, P. M. and Samii, C. (2016). Does Regression Produce Representative Estimates of Causal Effects?
American Journal of Political Science, 60(1):250–267.
Belson, W. A. (1956). A technique for studying the effects of a television broadcast. Journal of the Royal
Statistical Society: Series C (Applied Statistics), 5(3):195–202.
Blair, G., Cooper, J., Coppock, A., and Humphreys, M. (2019). Declaring and diagnosing research designs.
American Political Science Review, 113(3):838–859.
Chattopadhyay, A., Greifer, N., and Zubizarreta, J. R. (2023). lmw: Linear Model Weights for Causal
Inference. arXiv:2303.08790 [stat].
Chattopadhyay, A. and Zubizarreta, J. R. (2023). On the implied weights of linear regression for causal
inference. Biometrika, 110(3):615–629.
Chattopadhyay, A. and Zubizarreta, J. R. (2024). Causation, Comparison, and Regression. Harvard Data
Science Review, 6(1). https://hdsr.mitpress.mit.edu/pub/1ybwbmlw.
Chernozhukov, V., Fernández-Val, I., Hahn, J., and Newey, W. (2013). Average and quantile effects in
nonseparable panel models. Econometrica, 81(2):535–580.
Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual trends.
Econometrica, 1(4):387–401.
Green, D. P. and Aronow, P. M. (2011). Analyzing Experimental Data Using Regression: When is Bias a
Practical Concern?

Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method
to Produce Balanced Samples in Observational Studies. Political Analysis, 20(1):25–46.
Hazlett, C. (2020). Kernel balancing. Statistica Sinica, 30(3):1155–1189.

Hoffmann, N. I. (2023). Double robust, flexible adjustment methods for causal inference: An overview and
an evaluation. URL: https://osf.io/preprints/socarxiv/dzayg.
Humphreys, M. (2009). Bounds on least squares estimates of causal effects in the presence of heterogeneous
assignment probabilities.
Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying Double Robustness: A Comparison of Alternative
Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science, 22(4):523–539.
Keele, L., White, I. K., and McConnaughy, C. (2010). Adjusting experimental data: Models versus design.
In APSA 2010 Annual Meeting Paper.

Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous
treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s
critique. The Annals of Applied Statistics, 7(1):295–318.

Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal
of the American Statistical Association, 58(304):993–1010.
Neyman, J. (1923). Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des
principes. Roczniki Nauk Rolniczych, 10(1):1–51.

Peters, C. C. (1941). A method of matching groups for experiment with no loss of population. The Journal
of Educational Research, 34(8):606–612.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure
period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–
1512.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of educational Psychology, 66(5):688.
Snowden, J. M., Rose, S., and Mortimer, K. M. (2011). Implementation of G-Computation on a Simulated
Data Set: Demonstration of a Causal Inference Technique. American Journal of Epidemiology, 173(7):731–
738.

Sloczyński, T. (2022). Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller
Groups Get Larger Weights. The Review of Economics and Statistics, pages 1–9.
Zhao, Q. and Percival, D. (2017). Entropy balancing is doubly robust. Journal of Causal Inference,
5(1):20160010.

A Appendix

A.1 Derivation of more general weights


The above weights rely on the assumption that E[D_i | X_i] is linear. We can derive more general weights that
do not involve this assumption by using a more general representation of D^{\perp X}:

\hat{\tau}_{reg} = \frac{\widehat{Cov}(Y_i, D_i^{\perp X})}{\widehat{V}(D_i^{\perp X})}    (48)

D^{\perp X} = D - \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T D = D - \mathbf{X}\theta    (49)

where \mathbf{X} = [\vec{1} \;\; \vec{X}]. We first derive the unit-wise weights, which are somewhat trivial.

\widehat{Cov}(Y_i, D_i^{\perp X}) = \hat{E}[Y_i D_i^{\perp X}] - \hat{E}[Y_i]\hat{E}[D_i^{\perp X}]
    = \hat{E}[Y_i D_i^{\perp X}]
    = \sum_i Y_i (D_i - X_i \theta)
    = \sum_i Y_i (D_i - \hat{d}(X_i))

where \hat{d}(X_i) = X_i \theta with X_i = [1 \; X_i], the second equality uses \hat{E}[D_i^{\perp X}] = 0, and we drop the 1/n factors, which cancel in the ratio below.

\widehat{V}(D_i^{\perp X}) = \hat{E}[(D_i^{\perp X})^2] - \hat{E}[D_i^{\perp X}]^2
    = \hat{E}[(D_i^{\perp X})^2]
    = \sum_i (D_i - X_i \theta)^2
    = \sum_i (D_i - \hat{d}(X_i))^2

So, the unit-wise weighted representation is as follows:

\hat{\tau}_{reg} = \frac{\widehat{Cov}(Y_i, D_i^{\perp X})}{\widehat{V}(D_i^{\perp X})} = \frac{\sum_i Y_i (D_i - \hat{d}(X_i))}{\sum_i (D_i - \hat{d}(X_i))^2}    (50)

Let \pi_x = P(D_i = 1 \mid X_i = x). We can manipulate the above representation to get the strata-wise weights:
\widehat{Cov}(Y_i, D_i^{\perp X}) = \sum_x P(X = x) \hat{E}[Y_i (D_i - \hat{d}(X_i)) \mid X_i]
    = \sum_x P(X = x) \left( \hat{E}[Y_i D_i \mid X_i] - \hat{d}(X_i) \hat{E}[Y_i \mid X_i] \right)
    = \sum_x P(X = x) \left( P(D_i = 1 \mid X_i) \hat{E}[Y_i \mid D_i = 1, X_i] - \hat{d}(X_i) \hat{E}[Y_i \mid X_i] \right)
    = \sum_x P(X = x) \left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] - \hat{d}(X_i)\left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] + (1 - \pi_x) \hat{E}[Y_i \mid D_i = 0, X_i] \right) \right)
    = \sum_x P(X = x) \left( \pi_x (1 - \hat{d}(X_i)) \hat{E}[Y_i \mid D_i = 1, X_i] - \hat{d}(X_i)(1 - \pi_x) \hat{E}[Y_i \mid D_i = 0, X_i] \right)

This can be rewritten in a couple of different ways, where \tau_x is the ATE given X = x:

\widehat{Cov}(Y_i, D_i^{\perp X}) = \sum_x P(X = x) \left( \hat{E}[Y_i \mid D_i = 1, X_i] \, \pi_x (1 - \hat{d}(X_i)) - \hat{E}[Y_i \mid D_i = 0, X_i] \, \hat{d}(X_i)(1 - \pi_x) \right)
    = \sum_x P(X = x) \left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] - \hat{d}(X_i) \hat{E}[Y_i \mid D_i = 0, X_i] - \hat{d}(X_i) \pi_x \tau_x \right)
    = \sum_x P(X = x) \, \pi_x (1 - \hat{d}(X_i)) \left( \hat{E}[Y_i \mid D_i = 1, X_i] - \frac{\hat{d}(X_i)}{1 - \hat{d}(X_i)} \cdot \frac{1 - \pi_x}{\pi_x} \, \hat{E}[Y_i \mid D_i = 0, X_i] \right)

Next, to get the full expression for \hat{\tau}_{reg}, we calculate \widehat{V}(D_i^{\perp X}) (using D_i^2 = D_i):

\widehat{V}(D_i^{\perp X}) = \sum_x P(X = x) \hat{E}[(D_i^{\perp X})^2 \mid X]
    = \sum_x P(X = x) \left( \hat{E}[D_i \mid X_i] - 2 X_i \theta \, \hat{E}[D_i \mid X_i] + (X_i \theta)^2 \right)
    = \sum_x P(X = x) \left( \pi_x - 2 \hat{d}(X_i) \pi_x + \hat{d}(X_i)^2 \right)
    = \sum_x P(X = x) \left( \pi_x (1 - \hat{d}(X_i)) + \hat{d}(X_i)(\hat{d}(X_i) - \pi_x) \right)

So we end up with the following expression for \hat{\tau}_{reg}:

\hat{\tau}_{reg} = \frac{\sum_x P(X = x) \left( \hat{E}[Y_i \mid D_i = 1, X_i] \, \pi_x (1 - \hat{d}(X_i)) - \hat{E}[Y_i \mid D_i = 0, X_i] \, \hat{d}(X_i)(1 - \pi_x) \right)}{\sum_x P(X = x) \left( \pi_x (1 - \hat{d}(X_i)) + \hat{d}(X_i)(\hat{d}(X_i) - \pi_x) \right)}    (51)
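The unit-wise representation in Expression 50 (of which Expression 51 is a strata-wise rearrangement) can be checked numerically against the OLS coefficient. A sketch with simulated data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
X = rng.integers(-3, 4, size=n).astype(float)
D = rng.binomial(1, np.where(X < 0, 0.1, 0.5)).astype(float)
Y = 2 * X + D * 3 * X + rng.normal(0, 5, size=n)

ones = np.ones(n)
XX = np.column_stack([ones, X])

# dhat(X_i) = X_i theta: the linear projection of D on [1, X]
theta = np.linalg.lstsq(XX, D, rcond=None)[0]
dhat = XX @ theta

# Expression 50: sum Y_i (D_i - dhat_i) / sum (D_i - dhat_i)^2
tau_weights = np.sum(Y * (D - dhat)) / np.sum((D - dhat) ** 2)

# Coefficient on D from Y ~ 1 + D + X; Frisch-Waugh-Lovell guarantees equality
tau_ols = np.linalg.lstsq(np.column_stack([ones, D, X]), Y, rcond=None)[0][1]
```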

Suppose we let \hat{d}(X_i) = \pi_x + a_x, where a_x is the difference between the probability of treatment estimated
by the linear model and the true probability of treatment given X = x. We can then write \widehat{Cov}(Y_i, D_i^{\perp X}) in
terms of a_x:
\widehat{Cov}(Y_i, D_i^{\perp X}) = \sum_x P(X = x) \left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] - \hat{d}(X_i) \hat{E}[Y_i \mid D_i = 0, X_i] - \hat{d}(X_i) \pi_x \tau_x \right)
    = \sum_x P(X = x) \left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] - (\pi_x + a_x) \hat{E}[Y_i \mid D_i = 0, X_i] - (\pi_x + a_x) \pi_x \tau_x \right)
    = \sum_x P(X = x) \left( \pi_x \hat{E}[Y_i \mid D_i = 1, X_i] - \pi_x \hat{E}[Y_i \mid D_i = 0, X_i] - a_x \hat{E}[Y_i \mid D_i = 0, X_i] - \pi_x^2 \tau_x - a_x \pi_x \tau_x \right)
    = \sum_x P(X = x) \left( \pi_x (1 - \pi_x) \tau_x - a_x \left( \hat{E}[Y_i \mid D_i = 0, X_i] + \pi_x \tau_x \right) \right)
    = \sum_x P(X = x) \left( \pi_x (1 - \pi_x) \tau_x - a_x \hat{E}[Y_i \mid X_i = x] \right)

Using this, we can write the weighted representation of the regression coefficient in terms of a_x:

\hat{\tau}_{reg} = \frac{\sum_x P(X = x) \left( \pi_x (1 - \pi_x) \tau_x - a_x \hat{E}[Y_i \mid X_i = x] \right)}{\sum_x P(X = x) \left( \pi_x (1 - \pi_x - a_x) + a_x(\pi_x + a_x) \right)}    (52)

A.2 Equivalency of impute and interact


We can show that the regression imputation estimate is equivalent to the Lin/interacted regression estimate
by showing that the minimization problems solved by the two methods are equivalent.
We can start with the Lin estimate. We want to estimate the coefficients \beta in the model:

E[Y \mid D, X] = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D X

We do this by minimizing the least squares error; in other words, \hat{\beta} is the solution to the minimization
problem:

\min_{\beta} \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 D_i + \beta_2 X_i + \beta_3 D_i X_i) \right)^2

Now let's look at the regressions involved in the imputation estimate. The imputation estimate is

\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{Y}_i(1) - \hat{Y}_i(0) \right) = \frac{1}{N} \sum_{i=1}^{N} \left( \gamma_0 + \gamma_1 X_i - (\alpha_0 + \alpha_1 X_i) \right).

Here, \gamma and \alpha are found by solving the following minimization problems:

\min_{\gamma} \sum_{D_i = 1} \left( Y_i - (\gamma_0 + \gamma_1 X_i) \right)^2

\min_{\alpha} \sum_{D_i = 0} \left( Y_i - (\alpha_0 + \alpha_1 X_i) \right)^2
This is equivalent to solving the single minimization problem:

\min_{\gamma,\alpha} \sum_{D_i = 1} \left( Y_i - (\gamma_0 + \gamma_1 X_i) \right)^2 + \sum_{D_i = 0} \left( Y_i - (\alpha_0 + \alpha_1 X_i) \right)^2
    = \min_{\gamma,\alpha} \sum_{i=1}^{n} D_i \left( Y_i - (\gamma_0 + \gamma_1 X_i) \right)^2 + (1 - D_i) \left( Y_i - (\alpha_0 + \alpha_1 X_i) \right)^2
    = \min_{\gamma,\alpha} \sum_{i=1}^{n} D_i \left( Y_i^2 - 2 Y_i (\gamma_0 + \gamma_1 X_i) + (\gamma_0 + \gamma_1 X_i)^2 \right) + (1 - D_i) \left( Y_i^2 - 2 Y_i (\alpha_0 + \alpha_1 X_i) + (\alpha_0 + \alpha_1 X_i)^2 \right)
    = \min_{\gamma,\alpha} \sum_{i=1}^{n} Y_i^2 - 2 Y_i \left( \alpha_0 + \alpha_1 X_i + (\gamma_0 - \alpha_0) D_i + (\gamma_1 - \alpha_1) D_i X_i \right) + \left( \alpha_0 + \alpha_1 X_i + (\gamma_0 - \alpha_0) D_i + (\gamma_1 - \alpha_1) D_i X_i \right)^2
    = \min_{\gamma,\alpha} \sum_{i=1}^{n} \left( Y_i - (\alpha_0 + \alpha_1 X_i + (\gamma_0 - \alpha_0) D_i + (\gamma_1 - \alpha_1) D_i X_i) \right)^2
    = \min_{\beta} \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 D_i + \beta_2 X_i + \beta_3 D_i X_i) \right)^2

where \beta_0 = \alpha_0, \beta_1 = \gamma_0 - \alpha_0, \beta_2 = \alpha_1, and \beta_3 = \gamma_1 - \alpha_1.
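This algebraic equivalence is easy to confirm numerically: fitting separate OLS models by arm and averaging \hat{\mu}_1 - \hat{\mu}_0 matches the coefficient on D from the interacted regression with centered X exactly. A sketch with simulated data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
X = rng.normal(size=n)
D = rng.binomial(1, 0.3 + 0.1 * (X > 0)).astype(float)
Y = 1 + X + D * (2 + X) + rng.normal(size=n)

def fit(Z, y):
    """OLS coefficients for design matrix Z."""
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Imputation: separate models by arm, then average the predicted contrasts
m1, m0 = D == 1, D == 0
g = fit(np.column_stack([np.ones(m1.sum()), X[m1]]), Y[m1])  # treated model (gamma)
a = fit(np.column_stack([np.ones(m0.sum()), X[m0]]), Y[m0])  # control model (alpha)
tau_impute = np.mean((g[0] + g[1] * X) - (a[0] + a[1] * X))

# Interacted (Lin) regression with centered X: coefficient on D
Xc = X - X.mean()
tau_interact = fit(np.column_stack([np.ones(n), D, Xc, D * Xc]), Y)[1]
```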

A.3 Unbiasedness of weighted DiM using mean balancing weights


Assume the potential outcomes are separately linear in X:

E[Y_i(1) \mid X_i] = \gamma X_i    (53)

E[Y_i(0) \mid X_i] = \alpha X_i    (54)

Then we can show that the ATE estimate from using mean balancing weights is unbiased:

" #
X X
E [τ̂meanbal ] = E Yi wi − Yi wi (55)
Di =1 Di =0
X X
= E[Yi wi |Di = 1] − E[Yi wi |Di = 0] (56)
Di =1 Di =0
X X
= E[Yi (1)wi |Di = 1] − E[Yi (0)wi |Di = 0] (57)
Di =1 Di =0
X X
= E[(γXi + ϵi )wi |Di = 1] − E[(αXi + ϵi )wi |Di = 0] (58)
Di =1 Di =0
X X
=γ E[Xi wi |Di = 1] − α E[Xi wi |Di = 0] (59)
Di =1 Di =0
" # " #
X X
= γE Xi wi − αE Xi wi (60)
Di =1 Di =0
   
= γE X̄ − αE X̄ (61)
= E[Yi (1)] − E[Yi (0)] (62)
= E[Yi (1) − Yi (0)] (63)

1
PN
where X = N i=1 Xi is the sample mean of the covariates.

A.4 Unbiasedness of impute and interact
We can show that both the regression imputation method and the interacted regression are unbiased for
the ATE. We start with regression imputation. We have one model for Y among the treated, µ̂1 , and one
among the controls, µ̂0 .
Then,

\hat{\tau}_{imp} = \frac{1}{n} \left( \sum_{D_i = 1} (Y_i - \hat{\mu}_0(X_i)) + \sum_{D_i = 0} (\hat{\mu}_1(X_i) - Y_i) \right)    (64)
    = \frac{1}{n} \sum_i \left( \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right)    (65)
    = \sum_{x \in \mathcal{X}} \frac{\sum_i 1\{X_i = x\}}{n} \left( \hat{\mu}_1(x) - \hat{\mu}_0(x) \right)    (66)
    = \sum_{x \in \mathcal{X}} \hat{P}(X = x) \left( \hat{\mu}_1(x) - \hat{\mu}_0(x) \right)    (67)
    = \sum_{x \in \mathcal{X}} \hat{P}(X = x) \, DIM_x    (68)

We see that the imputation approach weights each stratum by the probability of inclusion in that stratum;
these weights result in an estimate that is unbiased for the ATE.

