Demystifying and Avoiding The OLS
October 7, 2024
Abstract
Researchers have long run regressions of an outcome variable (Y ) on a treatment (D) and covariates
(X) to estimate treatment effects. Even absent unobserved confounding, the regression coefficient on D
in this setup reports a conditional variance weighted average of strata-wise average effects, not generally
equal to the average treatment effect (ATE). Numerous proposals have been offered to cope with this
“weighting problem”, including interpretational tools to help characterize the weights and diagnostic aids
to help researchers assess the potential severity of this problem. We make two contributions that together
suggest an alternative direction for researchers and this literature. Our first contribution is conceptual,
demystifying these weights. Simply put, under heterogeneous treatment effects (and varying probability
of treatment), the linear regression of Y on D and X will be misspecified. The “weights” of regression
offer one characterization for the coefficient from regression that helps to clarify how it will depart from
the ATE. We also derive a more general expression for the weights than what is usually referenced.
Our second contribution is practical: as these weights simply characterize misspecification bias, we
suggest simply avoiding them through approaches that tolerate heterogeneous effects. A wide range
of longstanding alternatives (regression-imputation/g-computation, interacted regression, and balancing
weights) relax specification assumptions to allow heterogeneous effects. We make explicit the assumption
of “separate linearity”, under which each potential outcome is separately linear in X. This relaxation
of conventional linearity offers a common justification for all of these methods and avoids the weighting
problem, at an efficiency cost that will be small when there are few covariates relative to sample size.
1 Introduction
When estimating the effect of a treatment on an outcome of interest by adjusting for covariates (X),
researchers typically hope to interpret their result as a well-defined causal quantity, such as an average of
strata-specific effects, most commonly the average treatment effect (ATE). For a binary treatment (D) and outcome (Y),
a simple bivariate regression of Y on D gives the difference in means estimate (the mean of Y when D = 1
minus the mean of Y when D = 0). However, this equivalence breaks down when covariates X are included in
the regression (as in Y ∼ D + X), if the treatment effect may vary across values of X. Instead, the coefficient
produced by this regression represents a weighted average of strata-specific difference in means estimates,
where the weights are larger for strata with probability of treatment closer to 50% (Angrist, 1998; Angrist
and Pischke, 2009). Such a weighted average is not typically of direct interest, and under severe enough
heterogeneity in treatment effects, this can lead to wildly incorrect substantive conclusions, as demonstrated
below. Despite the introduction of many more flexible estimation procedures, regression adjustment remains
the “workhorse approach” when adjusting for covariates to make causal claims (Aronow and Samii, 2016).1
Accordingly, authors have proposed diagnostics that index the potential for bias due to these weights or
tools to aid interpretation given these weights (Aronow and Samii, 2016; Sloczyński, 2022; Chattopadhyay
et al., 2023; Chattopadhyay and Zubizarreta, 2023), and have analyzed the behavior of proposed regression
estimators in different settings in light of these weights (e.g. Chernozhukov et al., 2013).

1 Keele et al. (2010) found that regression adjustment was used in analyzing 95% of experiments in American Political Science Review, 95% in Journal of Politics, and 74% in American Journal of Political Science. The ubiquity of regression adjustment
Beginning from this state of affairs, we make two contributions. The first is a conceptual contribution that
aims to explain and demystify these weights. Under heterogeneous treatment effects (and varying probability
of treatment), the linear regression of Y on D and X will be misspecified. The “weights” of regression are
not a mysterious property of regression, but rather offer one characterization of what the coefficient is equal
to. The weights thus simply allow us to see how, under heterogeneous effects and varying probability of
treatment, the coefficient diverges from the ATE. Technically, we offer a direct derivation of these weights by
relying on the Frisch-Waugh-Lovell (FWL) theorem (Frisch and Waugh, 1933; Lovell, 1963). Our expression
does not rely on the usual assumption (as in Angrist (1998) and later sources) that D is linear in X. However,
it reduces to that of Angrist (1998) in the special case where D is linear in X, and further shows that the
weights cannot be known precisely in empirical practice absent this assumption.
Our second contribution clarifies an assumption that justifies a range of approaches and leads to our
recommendation to practitioners. Since the weights simply characterize misspecification bias due to heterogeneity,
we propose the straightforward option of addressing that misspecification and avoiding the weighting problem
altogether, rather than coping with it through various proposed interpretational and diagnostic tools. We
consider here the minimal change to practice that handles this concern. Specifically, for investigators satisfied
to seek an approximation to effects rooted in linearity, we describe the separate linearity assumption in
which each potential outcome (Y (D = 1) and Y (D = 0)) is linear in X. This stands in contrast to,
and relaxes slightly, the standard assumption of “single linearity”, by which Y is presumed to be linear
in D and X. Estimation within the separate linearity framework does not require any innovation, as
at least three well-known estimation approaches can be justified under separate linearity: (i) regression
imputation/g-computation/T-learner (Robins, 1986; Künzel et al., 2019), also recently referred to as “multi-regression”
(Chattopadhyay and Zubizarreta, 2023, 2024), which notes the roots of this approach in Peters (1941)
and Belson (1956); (ii) regression of Y on D and X and their interaction (Lin, 2013), and (iii)
balancing/calibration weights that achieve mean balance on X in the treated and control groups (e.g.
Hainmueller, 2012). Approaches (i) and (ii) are, as we show, identical in the context of OLS models without
regularization. Mean balancing weights (iii), when constructed to target the ATE, can be justified by
the same assumption of separate linearity, but use a different estimation approach and produce different
point estimates. All three of these strategies produce unbiased ATE estimates under separate linearity,
without the “weighting problem” suffered under the single linearity regression. These benefits do come at an
efficiency cost. The researcher comparing these approaches to a single OLS model thus faces an efficiency-
bias, or efficiency-interpretation tradeoff. However, as we show, the efficiency cost of avoiding the “weighting
problem” is low in settings where there are few covariates relative to sample size. These considerations
apply to observational research in which an assumption of no unobserved confounding is required for causal
claims. However, they also apply to the analysis of randomized experiments, concurring with Lin
(2013). In particular, we discuss these implications and consequent recommendations for effect estimates
from block-randomized experiments in which treatment probability may vary by block.
τ_ATE = E[Y_i(1) − Y_i(0)]    (1)

where Y_i(1) and Y_i(0) denote the potential outcomes under treatment and control, respectively (Neyman,
1923; Rubin, 1974). In order to estimate this average treatment effect using observed data, we require an
absence of unobserved confounders, concisely expressed as the conditional ignorability assumption,

{Y_i(1), Y_i(0)} ⊥ D_i | X_i    (2)

For some discrete variables X that satisfy conditional ignorability, and given consistency of the potential
outcomes, the ATE can be identified according to:
τ_ATE = Σ_{x∈X} E[Y(1) − Y(0) | X = x] P(X = x)    (3)
      = Σ_{x∈X} ( E[Y | D = 1, X = x] − E[Y | D = 0, X = x] ) P(X = x)
      = Σ_{x∈X} DIM_x P(X = x)    (4)
where we suppress the i subscripts for legibility and DIMx is the estimand for the difference in means in
the stratum where X = x. In a given sample, this is approximated using the analog estimator,2
τ̂_ATE = Σ_x DIM̂_x P̂(X = x)    (5)
The term P̂ (X = x) gives the empirical proportion of units in each stratum, and can be thought of
as the “natural” strata-wise weights, as they marginalize over the stratum according to the fraction of
units falling in that stratum. It would be natural to form an analog estimator for this through sub-
classification/stratification, simply computing the difference in means in each stratum of X and compiling
them per expression 5. However, suppose we instead attempt to estimate the treatment effect by fitting a
regression according to the model
Y = β0 + β1 X + τreg D + ϵ (6)
While regression-based estimation of treatment effects has been widely used in practice across disciplines for
decades, as Angrist (1998) and many scholars since then have emphasized, τ̂_reg does not in general equal τ̂_ATE,
even when conditional ignorability holds. Rather, it can be understood as a version of Expression 5 but with
weights on each DIM̂_x that differ from P̂(X = x). We now turn to a ground-up analysis of these strata-wise
weights intended to demystify them and to recognize them as the natural consequence of misspecification of
the linear model under heterogeneous effects.
the ATE. However, we maintain the ATE notation for simplicity as its expectation over samples is still the ATE, presuming
the sample in question is a probability sample from the population of interest.
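To make the contrast concrete, the following sketch (not drawn from the paper's simulations; all numerical values are hypothetical, and only numpy is assumed) computes the stratified estimator of Expression 5 and the coefficient on D from the single regression in Model 6 on the same simulated data.

```python
# Sketch: stratified ATE estimator (Expression 5) vs. coefficient on D from Y ~ D + X (Model 6).
# Hypothetical discrete-X data-generating values chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 3, size=n)                  # three strata of X
p_treat = np.array([0.1, 0.5, 0.5])[x]          # treatment probability varies by stratum
d = rng.binomial(1, p_treat)
tau_x = np.array([-9.0, 3.0, 9.0])[x]           # heterogeneous stratum effects; ATE = 1
y = x + tau_x * d + rng.normal(size=n)

# Stratified estimator: sum over strata of DIM_x * Phat(X = x)
tau_strat = sum(
    (y[(x == k) & (d == 1)].mean() - y[(x == k) & (d == 0)].mean()) * np.mean(x == k)
    for k in np.unique(x)
)

# Single regression Y ~ D + X: coefficient on D
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), d, x]), y, rcond=None)

print(tau_strat, beta[1])   # tau_strat ~ 1 (the ATE); the regression coefficient is pulled
                            # toward strata whose treatment probability is nearer to 0.5
```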
τ̂_reg = Σ_i w_i Y_i    (7)

for some weights w_i. By the FWL theorem (Frisch and Waugh, 1933; Lovell, 1963),

τ̂_reg = Ĉov(Y_i, D_i^⊥X) / V̂(D_i^⊥X) = [ Σ_i Y_i (D_i − d̂(X_i)) ] / [ Σ_i (D_i − d̂(X_i))² ]    (8)

where D_i^⊥X denotes the residual from the linear projection of D_i on (1, X_i), and d̂(X_i) is the fitted value from that projection.
As the denominator is a scalar, the behavior of these weights is revealed through the numerator term D_i − d̂(X_i).
This is simply 1 − d̂(X_i) for treated units and −d̂(X_i) for control units. Compare this to inverse propensity
score weights (IPW), which are proportional to 1/d̂(X_i) for treated units and 1/(1 − d̂(X_i)) for control units.
Like the IPW weights, these unit-level weights in Expression 8 place greater weight (in magnitude) on units
whose treatment status is more “surprising” given their covariate values. Unlike IPW, the linear regression weights
do not require constructing a ratio whose denominator can become close to zero, which can create
explosive weights under IPW.3
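The FWL representation in Expression 8 can be checked numerically. The sketch below (hypothetical data; only numpy assumed) regresses D on (1, X) to obtain d̂(X_i), forms the residual weights, and confirms that the residual-weighted sum of Y reproduces the coefficient on D from the full regression.

```python
# Sketch: Expression 8 reproduces the coefficient on D from Y ~ D + X. Hypothetical data.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))        # treatment probability varies with X
y = 1 + x + (2 + x) * d + rng.normal(size=n)     # heterogeneous effect 2 + X

ones = np.ones(n)
theta, *_ = np.linalg.lstsq(np.column_stack([ones, x]), d, rcond=None)
d_hat = theta[0] + theta[1] * x                  # linear projection of D on (1, X)
resid = d - d_hat                                # D_i - d_hat(X_i): the FWL residual

tau_fwl = np.sum(y * resid) / np.sum(resid ** 2)           # Expression 8
beta, *_ = np.linalg.lstsq(np.column_stack([ones, d, x]), y, rcond=None)

print(tau_fwl, beta[1])                          # identical up to floating-point error
```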
Negative weights. The weights w_i necessarily take both positive and negative values because Equation 7
is structured as a single weighted sum/average over all units rather than a difference of group averages. An
isomorphic but useful way of signing the weights is to instead seek the weights that would appear in a
“weighted difference in means” estimator,

τ̂_wdim = Σ_{i: D_i = 1} w̃_i Y_i − Σ_{i: D_i = 0} w̃_i Y_i    (9)

where w̃_i = D_i w_i + (1 − D_i)(−w_i), which simply changes the sign for control units in order to accommodate
the subtraction rather than the summation in the form of Expression 7.
The weights w̃ signed as in Expression 9 will often be positive, and any positive weight may be naturally
interpreted as the relative contribution of each observation. However, they can be negative as well. Chattopadhyay
and Zubizarreta (2024) note that negative weights “translate into forming effect estimates that may lie outside
the support or range of the observed data”, or that they “[leave] room for biases due to extrapolation from
an incorrectly specified model.”
We note that the meaning of negative weights is closely tied to the estimates one would obtain from a linear
probability model regressing the treatment indicator on X, even though that model is never explicitly run. Specifically,
weights will be negative for a treated unit when d̂(X_i) > 1, and for a control unit when d̂(X_i) < 0. If a treated
unit lies at an X position “so unique to treated units” that the linearly-fitted model produces a predicted value
greater than 1 there, it will have a negative weight. This can be thought of as indicating that a treated unit
is “more like treated units than any control units”, so that it cannot be well compared to control units. The
symmetric argument holds for control units with a negative predicted value.
In this way, negative weights do indicate extreme non-overlap, but in a peculiar, model-dependent sense
rather than in any non-parametric sense that corresponds to broader notions of common support. These weights
are also not a sure diagnostic for poor overlap or extreme counterfactuals: one could easily imagine a situation
where all units above a certain value of X are treated and all those below another value of X are controls,
indicating poor overlap. However, there can be many (or all) units outside the area of common support
that still receive positive weights, because the fitted linear probability model for D given X need not exceed 1 (for
treated units) or drop below 0 (for control units). Thus, we emphasize that negative weights can provide
a warning that “some units look too much like treated units to have valid comparisons” (and likewise for
controls), but a positive weight is not a guarantee of meaningful comparability in the sense of finding units
of the opposite treatment value nearby in the covariate space.

3 Equivalent unit-level weights are explored in depth by Chattopadhyay and Zubizarreta (2023) and in Chattopadhyay and Zubizarreta (2024), where the authors usefully employ these weights to analogize how OLS implicitly compares (weighted) mean outcomes for the treated to (weighted) mean outcomes for the controls, sharpening the analogy to the difference-in-means estimator one might apply in a randomized experiment and discussing implications for quantities such as effective balance and sample size.
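A small sketch (hypothetical data; only numpy assumed) illustrating that the signed weights of Expression 9 turn negative exactly when the implicit linear probability fit d̂(X_i) leaves the unit interval:

```python
# Sketch: signed weights (Expression 9) are negative exactly where the linear-probability
# fit exceeds 1 (treated units) or falls below 0 (control units). Hypothetical data;
# whether any such units appear depends on the particular draw.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
x = rng.uniform(-3, 3, size=n)
d = rng.binomial(1, np.clip(0.25 * x + 0.5, 0.02, 0.98))

theta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), d, rcond=None)
d_hat = theta[0] + theta[1] * x

w = (d - d_hat) / np.sum((d - d_hat) ** 2)   # unit-level regression weights
w_tilde = np.where(d == 1, w, -w)            # signed as in Expression 9

neg_treated = (d == 1) & (w_tilde < 0)
neg_control = (d == 0) & (w_tilde < 0)
print(neg_treated.sum(), neg_control.sum())
print(np.all(d_hat[neg_treated] > 1), np.all(d_hat[neg_control] < 0))  # both True
```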
Finally, it may be theoretically clarifying to note that if one uses a suitable logistic or probit regression for
the propensity score, this of course avoids predictions for D that fall outside of [0, 1], and the resulting inverse
propensity score weights will never be negative. However, the transformation simply changes the problem
to one of extreme up-weighting of treated units that fall in areas where the propensity score is close to zero
(or of control units in areas where the propensity score is close to 1). In this sense, a price is paid for poor
overlap, in the sense of predicted treatment values, whether one uses weights that imply a linear probability
model or a propensity score based on a non-linear model for treatment.
The coefficient τ̂_reg can also be written as a weighted combination of strata-specific effects,

τ̂_reg = Σ_x w_x τ̂_x    (10)

where w_x is the weight for the stratum in which X = x and τ̂_x = Ê[Y_i(1) − Y_i(0) | X_i = x] is the estimated conditional
average treatment effect for the subgroup X_i = x. The resulting τ̂_reg corresponds to the ATE only when w_x =
P(X = x). The best known expression for such strata-wise weights of OLS arises from Angrist (1998),

τ̂_reg = [ Σ_x τ̂_x d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x) ] / [ Σ_x d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x) ]    (11)
This form shows that regression weights the strata-wise DIM̂_x components not according to P̂(X_i = x), as
required to represent the ATE, but proportionally to d̂(X_i)(1 − d̂(X_i)) P̂(X_i = x). Such a regression puts
the most weight on strata in which the conditional variance of treatment status is largest, i.e. where d̂(X_i) is
nearest to 50%. An intuition for this is that OLS seeks to minimize squared error, and it has the opportunity
to learn the most about the treatment coefficient from strata with middling probabilities of treatment (Angrist and Pischke, 2009). Absent
treatment effect heterogeneity, this is unproblematic. However, if there are high levels of effect heterogeneity
in our data, then depending on how they correspond to strata of X and the probability of treatment in those
strata, these weights could move the regression coefficient far from the average treatment effect (Aronow and
Samii, 2016; Humphreys, 2009; Angrist and Pischke, 2009; Sloczyński, 2022).
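The following sketch illustrates Expression 11 with hypothetical stratum-level quantities, contrasting the conditional-variance weighting with the P(X = x) weighting that the ATE requires:

```python
# Sketch: conditional-variance weighting (Expression 11) vs. P(X = x) weighting.
# Hypothetical stratum-level treatment probabilities, shares, and effects.
import numpy as np

p_d   = np.array([0.1, 0.3, 0.5])      # d(x): probability of treatment by stratum
p_x   = np.array([1/3, 1/3, 1/3])      # P(X = x)
tau_x = np.array([-9.0, 0.0, 9.0])     # stratum-specific effects

ate = np.sum(tau_x * p_x)              # 0.0
var_w = p_d * (1 - p_d) * p_x          # d(x)(1 - d(x)) P(X = x)
tau_variance_weighted = np.sum(tau_x * var_w) / np.sum(var_w)

print(ate, tau_variance_weighted)      # 0.0 vs. a value pulled toward the P(D|X) ~ 0.5 stratum
```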
These weights, however, are obtained under the additional assumption that the probability of treatment
is truly linear in X, and not merely modeled that way. Angrist and Pischke (2009) satisfy this by assuming the
corresponding outcome regression can be saturated in X, which is often a reasonable assumption with discrete,
low-dimensional X. Nevertheless, we may wish to have a more general expression for the weights that does
not rely on such an assumption, either for completeness or in service of generalizing to the case where X is not
discrete or is otherwise infeasible to include in saturated form (e.g. X with many dimensions or many
levels). To obtain a more general expression for the weights, we simply organize the individual-level
weights above into strata,

τ̂_reg = Σ_i w_i Y_i    (12)
      = [ Σ_i Y_i (D_i − d̂(X_i)) ] / [ Σ_i (D_i − d̂(X_i))² ]    (13)
      = [ Σ_x P̂(X_i = x) Ê[(D_i − d̂(X_i)) Y_i | X_i = x] ] / [ Σ_x P̂(X_i = x) Ê[(D_i − d̂(X_i))² | X_i = x] ]    (14)
Rearranging these terms in the form τ̂_reg = Σ_x w_x τ_x is not in general possible, as the expected outcome
under treatment and the expected outcome under control are weighted differently:

Ê[(D_i − d̂(X_i)) Y_i | X_i = x] = Ê[D_i (1 − d̂(X_i)) Y_i − (1 − D_i) d̂(X_i) Y_i | X_i = x]
   = (1 − d̂(X_i)) Ê[D_i Y_i | X_i = x] − d̂(X_i) Ê[(1 − D_i) Y_i | X_i = x]
   = (1 − d̂(X_i)) π_x Ê[Y_i | X_i = x, D_i = 1] − d̂(X_i)(1 − π_x) Ê[Y_i | X_i = x, D_i = 0]    (15)
Thus the strata-wise weighting generally imposed by OLS involves a combination of the true probability
of treatment given X and the probability of treatment given X estimated using a linear model. Let
π_x = P(D_i = 1 | X_i = x), and consider the discrepancy, a_x, between the true and linearly-approximated
probability of treatment given X, so that d̂(X_i) = π_x + a_x. The general strata-wise weights corresponding
to OLS are then

τ̂_reg = [ Σ_x P̂(X_i = x) ( π_x (1 − π_x) τ_x − a_x Ê[Y_i | X_i = x] ) ] / [ Σ_x P̂(X_i = x) ( π_x (1 − π_x − a_x) + a_x (π_x + a_x) ) ]    (16)
Appendix A.1 gives additional details. Notice that this expression is infeasible to compute when D is not
linear in X, as πx is unknown. However, if ax = 0 (i.e. D is truly linear in X), this representation reduces to
the variance-weighted representation of the regression coefficient given by Angrist (1998). For many purposes,
then, the Angrist (1998) representation provides a clear and intuitive conception of the weights. Nevertheless,
we found it necessary to use this more complete formulation in order to obtain the correct answer, as our
simulations below show.
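Expression 16 can also be verified numerically with sample analogs. In the sketch below (hypothetical data-generating values; only numpy assumed), π̂_x is the stratum share treated, d̂ is the in-sample linear fit of D on (1, X), a_x = d̂_x − π̂_x, and τ̂_x is the stratum difference in means. The general expression reproduces the OLS coefficient exactly even though D is not linear in X, while substituting π̂_x into the Angrist (1998) form does not.

```python
# Sketch: numerical check of Expression 16 with sample analogs. Hypothetical DGP.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.integers(0, 4, size=n).astype(float)
p_true = np.array([0.1, 0.2, 0.3, 0.8])[x.astype(int)]    # NOT linear in x
d = rng.binomial(1, p_true)
y = x + (3 * x - 4) * d + rng.normal(size=n)              # heterogeneous effects

ones = np.ones(n)
beta, *_ = np.linalg.lstsq(np.column_stack([ones, d, x]), y, rcond=None)
tau_ols = beta[1]                                         # coefficient from Y ~ D + X
theta, *_ = np.linalg.lstsq(np.column_stack([ones, x]), d, rcond=None)

num = den = angrist_num = angrist_den = 0.0
for k in np.unique(x):
    m = x == k
    p_hat = m.mean()                                      # Phat(X = x)
    pi = d[m].mean()                                      # pi_x (stratum share treated)
    d_x = theta[0] + theta[1] * k                         # d_hat at x (linear fit)
    a = d_x - pi                                          # a_x
    tau = y[m & (d == 1)].mean() - y[m & (d == 0)].mean() # tau_hat_x
    ybar = y[m].mean()                                    # E_hat[Y | X = x]
    num += p_hat * (pi * (1 - pi) * tau - a * ybar)
    den += p_hat * (pi * (1 - pi - a) + a * (pi + a))
    angrist_num += p_hat * pi * (1 - pi) * tau
    angrist_den += p_hat * pi * (1 - pi)

print(tau_ols, num / den, angrist_num / angrist_den)
# Expression 16 matches the OLS coefficient; the Angrist form differs when a_x != 0.
```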
The separate linearity assumption states that each potential outcome is linear in X,

E[Y(0) | X] = α_0 + αX    (17)
E[Y(1) | X] = γ_0 + γX    (18)
This assumption is slightly weaker than single linearity, which effectively forces α = γ. While we label
this assumption for easy reference and to distinguish it from single linearity, it is not intended to be novel,
and further can be understood as the (often implicit) motivation for a number of longstanding approaches
described next.
We turn next to a variety of known and straightforward estimation approaches that are in keeping with
this assumption and that wholly avoid the awkwardness of “regression’s weighting problem” rather than
employing diagnostic or interpretational aids.
Because this explicitly puts weights of 1/N on every unit, the estimate does not suffer from the “weighting
problem”.5 Further, the Lin estimate and the regression imputation estimate are easily shown to be identical
4 This centering procedure and the resulting interpretation is complicated when one level of X (or the intercept) must be
dropped, as when X represents block indicators or, more generally, group fixed effects. We describe this concern and propose a
solution for this in Section 3.3.
5 We note the equality of Expression 19 to Expression 21 above implies that one may either (i) compare each unit’s observed
outcome under the realized treatment status to the modeled outcome for that unit under the opposite treatment status, or (ii) for each unit,
compare the modeled outcome under treatment to the modeled outcome under control, without using the observed outcome.
This is a result of relying on OLS for each outcome model, since the average fitted value from a given model will be precisely
equal to the average observed outcome over the same group. Such a property does not hold for estimators that cannot
guarantee that the residuals average to zero.
in this context. Specifically, the Lin estimate models E[Y | D, X] as β_0 + β_1 D + β_2 X + β_3 DX, and thus implies
Since these two estimation methods are identical under OLS, they can be used interchangeably. One can
directly show the unbiasedness of either under assumptions of consistency and conditional ignorability,
" N
#
1 X
E[τ̂imp ] = E (µ̂1 (Xi ) − µ̂0 (Xi )) (30)
N i=1
= E[E[Yi |Di = 1, Xi ]] − E[E[Yi |Di = 0, Xi ]] (31)
= E[Yi (1) − Yi (0)] (32)
This estimator is also known as g-computation, and so an estimate from g-computation will likewise be unbiased
for the ATE in settings with high levels of heterogeneity in the treatment effect and in the probability of treatment
assignment (Robins, 1986; Snowden et al., 2011). It has more recently been referred to as “multi-regression”
by Chattopadhyay and Zubizarreta (2024), who also note earlier uses of this approach as far back
as Peters (1941) and Belson (1956).
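As a sketch of the equivalence (hypothetical data; only numpy assumed), the following computes the regression-imputation estimate from two arm-specific OLS fits and the coefficient on D from the interacted regression with centered X; the two point estimates coincide.

```python
# Sketch: regression imputation (separate OLS fits by arm) and the interacted (Lin-style)
# regression with centered X give the same ATE estimate. Hypothetical data.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.normal(size=n)
d = rng.binomial(1, np.clip(0.1 * x + 0.5, 0.05, 0.95))
y = 1 + x + (2 + 3 * x) * d + rng.normal(size=n)   # separate linearity holds

ones = np.ones(n)

# (i) Regression imputation / T-learner with OLS
a, *_ = np.linalg.lstsq(np.column_stack([ones[d == 0], x[d == 0]]), y[d == 0], rcond=None)
g, *_ = np.linalg.lstsq(np.column_stack([ones[d == 1], x[d == 1]]), y[d == 1], rcond=None)
tau_impute = np.mean((g[0] + g[1] * x) - (a[0] + a[1] * x))

# (ii) Interacted regression with centered X (Lin 2013): coefficient on D
xc = x - x.mean()
b, *_ = np.linalg.lstsq(np.column_stack([ones, d, xc, d * xc]), y, rcond=None)

print(tau_impute, b[1])   # identical up to floating-point error
```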
Mean balancing weights. We can also connect the separate linearity assumption to a justification for
calibration/balancing weights. Consider mean balancing weights for both the treated and control units, so
that the weighted average of covariates in each group is equal to the overall unweighted average of covariates,
Σ_{i: D_i = 0} w_i X_i = Σ_{i: D_i = 1} w_i X_i = (1/N) Σ_{i=1}^{N} X_i    (35)
While many choices of weights can satisfy these constraints (when feasible), it is desirable to minimize some
measure of their variation. In the case of maximum entropy weights (Hainmueller, 2012), this is done by
minimizing the divergence Σ_i w_i log(w_i / q_i) from a set of base weights q_i (equivalently, maximizing entropy
when the q_i are uniform). The key idea is that by achieving equal means on X in the treated
and control group (both equal to the full sample's mean of X), any linear function of X—which includes
E[Y(1) | X] and E[Y(0) | X] if separate linearity holds—will also have equal means in these groups. A difference in
means estimator using these weights is then unbiased for the ATE, without requiring any appeal to the
relationship between this weighting procedure and propensity score modeling.6 An interesting feature of this
approach is that while it is justified by the same assumptions as regression imputation or the interacted
regression above, its estimation strategy is different. Balancing weights do not require estimating
the (nuisance) coefficients of any model. This may improve tolerance to misspecification (i.e. non-linear
conditional expectations for each potential outcome) at the cost of variance, as borne out in simulations
below.
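The following sketch (hypothetical data) constructs mean-balancing weights of this kind by solving the exponential-tilting dual of the maximum-entropy problem with scipy. It is only one of several ways such weights could be obtained, and it assumes the full-sample covariate means lie inside each group's covariate convex hull so that exact mean balance is feasible.

```python
# Sketch: mean-balancing weights via exponential tilting (the maximum-entropy dual).
# Each arm's weights are calibrated so the weighted mean of X equals the full-sample mean;
# the weighted difference in means then targets the ATE. Hypothetical data.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(5)
n = 5_000
x = rng.normal(size=(n, 2))
d = rng.binomial(1, np.clip(0.2 * x[:, 0] + 0.5, 0.05, 0.95))
y = x @ np.array([1.0, -1.0]) + (2 + x @ np.array([1.0, 0.5])) * d + rng.normal(size=n)

target = x.mean(axis=0)                       # full-sample covariate means

def calibrate(x_group):
    c = x_group - target                      # moment discrepancies to zero out
    dual = lambda lam: logsumexp(c @ lam)     # convex dual objective
    lam = minimize(dual, np.zeros(c.shape[1])).x
    w = np.exp(c @ lam)
    return w / w.sum()                        # weights sum to 1; weighted mean of x ~ target

w1 = calibrate(x[d == 1])
w0 = calibrate(x[d == 0])
tau_meanbal = np.sum(w1 * y[d == 1]) - np.sum(w0 * y[d == 0])
print(tau_meanbal)                            # close to the true ATE of 2 under separate linearity
```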
3 Simulations
To verify that these estimators behave as our analysis claims, we explore the performance of different
estimation approaches under three different data generating processes, first with a discrete covariate X
and then with a continuous one.
DGP1: P(D|X) increasing in X; {Y(1), Y(0)} linear in X

X    P(D|X)   P(X = x)   τ_x
-3   0.100    0.143      -9
-2   0.100    0.143      -6
-1   0.100    0.143      -3
 0   0.500    0.143       0
 1   0.500    0.143       3
 2   0.500    0.143       6
 3   0.500    0.143       9

[Figure 1, first panel: boxplots of estimates under DGP1 for stratify, reg, impute, interact, meanbal, and match.]

DGP2:

X    P(D|X)   P(X = x)   τ_x
-3   0.100    0.143      -9
-2   0.200    0.143      -6
-1   0.300    0.143      -3
 0   0.400    0.143       0
 1   0.500    0.143       3
 2   0.600    0.143       6
 3   0.700    0.143       9

[Figure 1, second panel: boxplots of estimates under DGP2 for stratify, reg, impute, interact, meanbal, and match.]

DGP3:

X    P(D|X)   P(X = x)   τ_x
-3   0.100    0.143       5
-2   0.200    0.143       5
-1   0.300    0.143       0
 0   0.500    0.143      -5
 1   0.700    0.143       0
 2   0.800    0.143       5
 3   0.900    0.143       5

[Figure 1, third panel: boxplots of estimates under DGP3 for stratify, reg, impute, interact, meanbal, and match.]
Figure 1: Estimates over 500 iterations of size n = 1000. The dashed red line indicates the true ATE. The
dashed blue and green lines show the regression coefficient expected using the simplified weights of Angrist
and Krueger (1999) (blue) and our more general weights (green).
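A minimal sketch of a simulation in the spirit of DGP1, using the P(D|X), P(X = x), and τ_x values from the table above (the baseline relationship Y(0) = X plus standard normal noise is an assumption made only for illustration):

```python
# Sketch: repeated draws in the spirit of DGP1, comparing the single regression with the
# interacted regression. Baseline outcome Y(0) = X + noise is assumed for illustration.
import numpy as np

rng = np.random.default_rng(6)
x_vals  = np.arange(-3, 4)                          # X in {-3, ..., 3}, P(X = x) = 1/7
p_d     = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5])
tau_val = 3.0 * x_vals                              # tau_x = 3X, so the ATE is 0

def one_draw(n=1000):
    idx = rng.integers(0, 7, size=n)
    x, p, tau = x_vals[idx].astype(float), p_d[idx], tau_val[idx]
    d = rng.binomial(1, p)
    y = x + tau * d + rng.normal(size=n)
    ones, xc = np.ones(n), x - x.mean()
    b_reg, *_ = np.linalg.lstsq(np.column_stack([ones, d, x]), y, rcond=None)
    b_int, *_ = np.linalg.lstsq(np.column_stack([ones, d, xc, d * xc]), y, rcond=None)
    return b_reg[1], b_int[1]

draws = np.array([one_draw() for _ in range(500)])
print(draws.mean(axis=0))   # reg is biased away from the ATE of 0; interact is centered near 0
```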
Standard errors
While these results show the expected variability in estimates under resampling from the given DGP, for an
investigator working with one observed dataset, some form of estimated standard error is vitally important
to inference. Table 1 reports the average analytically estimated standard errors for DGP1 above, still with
discrete X. Results are similar in other settings.
For stratification, we take a weighted sum of strata-specific Neyman variances. Here and below we write
expressions in the plug-in/analog sample estimator form.
V̂(τ̂_strat) = Σ_{x∈X} p̂(X = x)² ( V̂(Y_i | D = 1, X = x) / n_{1x} + V̂(Y_i | D = 0, X = x) / n_{0x} )    (36)

where n_{1x} and n_{0x} are the numbers of treated and control units in stratum x.
For the simple and interacted regression, we calculate the HC2 standard error of the coefficient on the
treatment variable. For regression imputation, we calculate the standard error again in the Neyman style,

V̂( (1/n) Σ_i ( μ̂_1(X_i) − μ̂_0(X_i) ) ) = V̂( (1/n) Σ_i ( γ X_i − α X_i ) )    (37)
                                        = V̂( γ X̄ − α X̄ )    (38)
                                        = X̄ᵀ Σ̂_1 X̄ + X̄ᵀ Σ̂_0 X̄    (39)
where Σ̂1 and Σ̂0 are the estimated variance-covariance matrices for the treatment model and the control
model, respectively.
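A sketch of this Neyman-style variance calculation for regression imputation (Expressions 37-39), using HC0 coefficient covariance matrices as a simple stand-in for Σ̂_1 and Σ̂_0 (the paper's simulations use HC2 for the regression-based standard errors); data are hypothetical.

```python
# Sketch: Neyman-style SE for regression imputation (Expressions 37-39), with HC0
# sandwich covariances for the two arm-specific OLS fits as a stand-in. Hypothetical data.
import numpy as np

def ols_with_cov(X, y):
    """OLS coefficients and an HC0 (sandwich) covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = X.T @ (X * (e ** 2)[:, None])
    return beta, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)
d = rng.binomial(1, np.clip(0.1 * x + 0.5, 0.05, 0.95))
y = 1 + x + (2 + x) * d + rng.normal(size=n)

X1 = np.column_stack([np.ones((d == 1).sum()), x[d == 1]])
X0 = np.column_stack([np.ones((d == 0).sum()), x[d == 0]])
g, S1 = ols_with_cov(X1, y[d == 1])       # treated-arm model
a, S0 = ols_with_cov(X0, y[d == 0])       # control-arm model

xbar = np.array([1.0, x.mean()])          # mean design row (intercept, mean of X)
tau_impute = xbar @ g - xbar @ a
se_impute = np.sqrt(xbar @ S1 @ xbar + xbar @ S0 @ xbar)   # Expression 39
print(tau_impute, se_impute)
```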
For mean balancing we show two types of analytical standard errors. First, we use the (HC2) standard
error from the weighted regression of just Y on D. Second, meanbal-adj uses the HC2 standard errors from
a regression of Y on D and X, again with the estimated weights. Both methods produce identical point
estimates (when perfect mean balance is achieved by the weights), but the analytical standard errors of the
meanbal-adj approach benefit from partialing out the X. This is akin to how conventional OLS standard
errors, under a fixed design, partial X out of Y so that the estimates are built on the conditional/residual
variance of Y rather than the total variance. For matching we use the Abadie-Imbens standard error (Abadie
and Imbens, 2006).
Table 1: Simulation results under DGP1. The bias and rmse columns repeat information shown graphically
in Figure 1. The empirical SE gives the actual standard deviation of the estimates across resamples. The
average analytical SE is the average of the standard errors that would have been computed analytically in
each iteration, using the estimators described in the text.
We find, first, that the empirical standard errors are extremely similar across the methods relying on single
or separate linearity (reg, impute, interact, meanbal, meanbal-adj). The approaches not relying on single
or separate linearity (stratify and match) show somewhat larger standard errors, as expected given their
greater flexibility. Second, each method’s average analytical standard error is within 5% of the empirical standard
error, with the exception of meanbal, whose average analytical standard error is almost 40% larger than the
empirical value. This is repaired, however, by using meanbal-adj.
Finally, some investigators may be concerned with efficiency in the sense of sampling variability in the
estimate and its consequences for inference. This can be understood as a bias-variance tradeoff. The
comparison between OLS (reg) and the Lin/interaction approach (interact) offers a simple starting point,
since the latter will add one additional parameter to the regression for every covariate dimension. Because
the number of observations is large relative to the number of covariates, this has very little impact on the
estimate’s variability across resamples. For example in DGP 1, Table 1 shows that the empirical SE is only
about 5% larger for interact than for reg. The behavior of impute is of course identical. The empirical SE
from meanbal and meanbal-adj are similarly only 6% larger than from reg. If an investigator is primarily
concerned with root mean square error (RMSE) of the estimates around the true value, RMSE values fall
by nearly half for each of these methods relative to reg. That is, the small loss of efficiency in these settings
(increasing variance) is more than made up for by reductions in bias as they factor into the RMSE.
It is important to recognize, however, the favorable nature of our simulation setting in this regard. If the
number of covariates were large enough relative to the sample size, if treatment probability varied little by
stratum of X, and/or if efficiency were a greater concern than bias or RMSE, then investigators might have
cause to prefer OLS, adopting its weighted ATE as the target estimand for their inferential purposes.
[Figure 2: boxplots of effect estimates for reg, impute, interact, meanbal, and match under three continuous-X designs.]
(a) X is randomly sampled from Unif[-3, 3]; P(D|X) = 0.1 when X < 0 and P(D|X) = 0.5 when X ≥ 0; Y(1) = Y(0) + 3X.
(b) X is randomly sampled from Unif[-3, 3]; P(D|X) = 0.1X + 0.4; Y(1) = Y(0) + 3X.
(c) X is randomly sampled from Unif[-3, 3]; P(D|X) = 0.1X + 0.4; Y(1) = Y(0) + X².
Figure 2: Effect estimates for 200 samples of size n = 1000 under data generating processes where X is
continuous, for different estimation strategies. The dashed red line is set at the true ATE. The dashed blue
line is set at the variance-weighted estimate using the true subgroup-level treatment effects.
In a block-randomized experiment, the coefficient on D from the block fixed effects regression of Y on D and the block indicators B can be written analogously as

τ̂_BFE = [ Σ_{b=1}^{P} τ̂_b P(D_i = 1 | B_i = b)(1 − P(D_i = 1 | B_i = b)) P(B_i = b) ] / [ Σ_{b=1}^{P} P(D_i = 1 | B_i = b)(1 − P(D_i = 1 | B_i = b)) P(B_i = b) ]    (40)

where it is innocuous to assume that P(D = 1 | B) is linear in the block indicators, as this regression would
be fully saturated in B. If all blocks have the same probability of treatment, the weights are equal, so the
regression coefficient will represent the difference in means per block, averaged over the size of each block, i.e.
the ATE. This will often be the case. However, if the treatment probability varies by block, whether by design
or otherwise, then the resulting coefficient estimate will not in general be the ATE. Rather, the regression
will put higher weight on blocks with probabilities of treatment nearer to 50%. As above, this is simply a
consequence of misspecification generated by heterogeneous treatment effects across blocks. Including
interactions between the treatment and the block fixed effects would address this under the separate linearity
assumption.
One complication when applying the interacted approach here is that care must be taken regarding the
interpretation due to the interaction. In typical usage, one block indicator will be omitted to avoid
collinearity with the intercept. If no centering/de-meaning is done on the block indicators, the coefficient
estimate on the treatment will represent the estimated effect (difference in means) in whichever block had its
indicator omitted. The solution of centering covariates (Lin, 2013) as utilized above is now complicated by this
omission. It is possible to omit the intercept, rather than the indicator for one block, and utilize the
centering. However, a more general solution avoids any consideration of centering or of which level/intercept
to omit, and works when dealing with one or multiple categorical variables. It is to simply
compute the marginal effect estimate for each observation, in whatever block it is in, ∂Y/∂D |_{B=b}. Averaging
these across observations (giving equal weight to each observation) produces the average marginal effect
(AME),
τ̂_AME = (1/n) Σ_{i=1}^{n} ∂Ŷ/∂D |_{B=b_i}    (43)

where ∂Ŷ/∂D |_{B=b} is the estimated marginal effect in block b, the block in which a given unit is found. For example,
in the regression of Y on D, the block indicators, and their interactions with D (with interaction coefficients α_b),
the estimated marginal effect in block B = b is τ̂ + α̂_b, and so the average marginal effect of interest over
the whole sample is simply

τ̂_AME = (1/n) Σ_{i=1}^{n} [ τ̂ + Σ_{b=1}^{P} α̂_b 1{B_i = b} ]    (44)
Using this approach to interpret the fitted regression with interactions always produces an estimate with
the desired interpretation of “the estimated marginal effect, which can differ by block, averaged over blocks
according to how many observations fall in each.” In the case of block indicators or other categorical variables,
this approach avoids errors in relation to choices about what levels to omit in the regression, whether to
omit the intercept, or having to center these variables. The variance for this estimator is
V(τ̂_AME) = V(τ̂) + Σ_{b=1}^{P} P(B_i = b)² V(α̂_b) + 2 Σ_{k<j} P(B_i = k) P(B_i = j) Cov(α̂_k, α̂_j)
          + 2 Σ_{b=1}^{P} P(B_i = b) Cov(τ̂, α̂_b)
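A sketch of the AME computation from the interacted block regression (Expressions 43-44), using block probabilities and block-specific effects matching the design described for Figure 3 below; Bernoulli assignment within block stands in for complete randomization, and the implementation details are illustrative rather than the paper's code.

```python
# Sketch: average marginal effect from a block-interacted regression (Expressions 43-44).
# One block indicator is omitted; the marginal effect in each block is tau_hat plus the
# relevant interaction coefficient. Block probabilities/effects follow the Figure 3 design.
import numpy as np

rng = np.random.default_rng(8)
n_blocks, n_per = 6, 200
p_d   = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5])
tau_b = 4.0 * p_d                                    # block-specific effects; ATE = 4/3
block = np.repeat(np.arange(n_blocks), n_per)
d = rng.binomial(1, p_d[block])                      # Bernoulli stand-in for within-block randomization
y = 2 + tau_b[block] * d + rng.normal(size=block.size)

# Design: intercept, D, block dummies (block 0 omitted), D x block dummies
B = (block[:, None] == np.arange(1, n_blocks)).astype(float)
X = np.column_stack([np.ones(block.size), d, B, d[:, None] * B])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

tau_hat = beta[1]                                    # marginal effect in the omitted block
alpha = np.concatenate([[0.0], beta[2 + (n_blocks - 1):]])   # interaction terms; 0 for omitted block
ame = np.mean(tau_hat + alpha[block])                # Expression 44
print(ame, np.mean(tau_b))                           # AME estimate vs. the true ATE
```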
For comparison, we also consider two other approaches. One practice is to weight the block fixed effects
regression by the (stabilized) inverse probability of treatment assignment for each block,

w_i^IPW = D_i · P(D_i = 1) / P(D_i = 1 | B = b_i) + (1 − D_i) · P(D_i = 0) / P(D_i = 0 | B = b_i)    (45)
These weights are constructed so that within each block, the treated and control units make up equal
weighted proportions. They therefore neutralize any differences in the P (D = 1|B) across blocks. If we
perform a weighted regression of Y on D and the block dummy variables using these weights, the coefficient
estimate for D will be unbiased for the ATE.7
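A companion sketch constructing the stabilized weights of Expression 45 (with block treatment probabilities estimated from the data) and using them in a weighted regression of Y on D and the block indicators, again under the Figure 3-style design:

```python
# Sketch: stabilized inverse-probability weights by block (Expression 45), followed by a
# weighted least-squares fit of Y on D and block dummies. Hypothetical block data.
import numpy as np

rng = np.random.default_rng(9)
p_d = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5])
block = np.repeat(np.arange(6), 200)
d = rng.binomial(1, p_d[block])
y = 2 + 4.0 * p_d[block] * d + rng.normal(size=block.size)

p1 = d.mean()                                              # unconditional P(D = 1)
p1_b = np.array([d[block == b].mean() for b in range(6)])  # P(D = 1 | B = b), estimated
w = np.where(d == 1, p1 / p1_b[block], (1 - p1) / (1 - p1_b[block]))

B = (block[:, None] == np.arange(1, 6)).astype(float)
X = np.column_stack([np.ones(block.size), d, B])
sw = np.sqrt(w)                                            # WLS via square-root weighting
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(beta[1])   # close to the true ATE of about 4/3 in this hypothetical design
```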
Finally, we also consider the stratification estimator (Expression 5) but with blocks as strata,

τ̂_blockDIM = Σ_{b=1}^{|B|} P̂(B = b) ( Ȳ_{D=1,B=b} − Ȳ_{D=0,B=b} ).    (46)
Figure 3 compares estimates from the “plain” block fixed effects (block FE) regression to the interacted
regression with the average marginal effect interpretation (block FE interact), the IPW-weighted regression
(block FE IPW), and the block DIM average (mean block DIM) in a simulation setting where treatment
probability varies by block and there is heterogeneity in treatment effect. As expected, the block fixed
effects OLS model suffers from the weighting problem. All three alternative approaches produce unbiased
and identical estimates. We recommend the use of one of these alternative estimators to improve upon
estimation of the ATE under block randomization.8
Figure 3: Effect estimates for 500 iterations of size n = 1200 each. D is assigned by (complete) randomization
within each block with probability P(D|B), and Y = 2 + D·τ. For blocks 1–6, P(D|B) = (0.25, 0.25, 0.5,
0.25, 0.25, 0.5) respectively, and τ = 4·P(D|B). The black line indicates the true ATE.
7 A weighted difference in means (rather than regression including B) with these weights would produce the same point
estimate, but the improvement in efficiency obtained under the block randomization design will not be fully reflected in the
estimated standard error. This occurs because the block indicators will be orthogonal to treatment (once weighted) and so do
not affect the coefficient estimate on treatment, but the omission of B prevents the model from reducing the residuals that
enter the standard error.
8 These results are consistent with those demonstrated in https://declaredesign.org/blog/posts/biased-fixed-effects.html,
which use the DeclareDesign simulation approach (Blair et al., 2019) and conclude that the mean blockwise DiM/ stratification
approach is suitable.
4 Conclusion
Given the wide use of OLS regression of Y on D and X in practice, and the potential gap between
the coefficient and target quantities like the ATE, prior work has provided several forms of guidance to
practitioners regarding this weighting problem. Humphreys (2009) notes the conditions under which the
OLS result will be nearer to ATT or ATC, and observes monotonicity conditions under which the coefficient
estimate will fall between these two endpoints. Aronow and Samii (2016) suggest characterizing the “effective
sample” for which the regression coefficient would be an unbiased estimate of the ATE by applying the
variance weights to the covariate distribution. Other papers have provided diagnostics for quantifying the
severity of this bias. Chattopadhyay and Zubizarreta (2023) characterize the implied unit-level weights of
regression adjustment (as we do as an intermediate result), and propose diagnostics that use these weights to
analyze covariate balance, extrapolation outside the support, variance of the estimator, and effective sample
size. These ideas relate closely to those of Aronow and Samii (2016) in that they compare the covariate
distribution targeted by the estimator to that of the target population. Sloczyński (2022) derives a diagnostic
related to variation in treatment probability,
where ρ is the unconditional probability of treatment. The bias due to heterogeneity will be δ(τ_ATC − τ_ATT),
where τ_ATT and τ_ATC are the average treatment effects among the treated and among the controls, respectively. Thus
δ captures one aspect of the potential for bias—variation in treatment probability—but reasoning about
the severity of bias requires knowledge of the heterogeneity in treatment effects, which is unknown to the
investigator.
Our argument is that in many cases, investigators could consider different modeling choices that avoid this
“weighting problem” altogether. This begins by understanding the weighting problem as a symptom of
misspecification when the treatment effect (and probability of treatment) vary with X. We provide a new,
more general expression for the “weights of regression” by algebraically manipulating a representation of
the OLS coefficient. The key fact is that the apparent “weights of regression” simply account for the way
misspecification alters the coefficient away from the ATE. Second, viewed this way, a natural proposal
for research practice is to rely instead on specifications that will not necessarily be violated by effect
heterogeneity. To focus on minimal changes in practice, we propose that investigators at least relax the
“single linearity” assumption (that Y is linear in D and X) to “separate linearity” (treatment and control
potential outcome are separately linear in X). Moreover, a number of existing, straightforward estimation
approaches produce unbiased ATE estimates under this assumption. First, regression imputation (including
g-computation and the T-learner) with OLS, as well as the regression including the interaction of X and D (as in Lin, 2013),
are all suitable, and identical to each other in this setting. This longstanding approach dates back to at
least Peters (1941) and Belson (1956), as noted by Chattopadhyay and Zubizarreta (2024) who label it as
“multi-regression”. In addition, “mean balancing” estimators applied to the ATE—those that choose
weights to achieve the same mean of X for the treated and control groups as in the full sample—are also
unbiased under separate linearity. These are not identical to the above as they avoid fitting any model, and
our simulations suggest this property can partially mitigate bias even when the separate linearity assumption
fails.
Any of these approaches can be useful not only in observational studies (so far as investigators can claim
conditional ignorability), but also when using covariate adjustment after randomization, including the special
case of analyzing block randomized experiments.
While we recommend avoiding the weights through these other procedures, this is not without limits. The
principal cost of these approaches relative to OLS is the potential loss in efficiency. The interact/imputation/g-
computation/T-learner approach adds an additional nuisance parameter per covariate. The circumstances
investigated here—with just one covariate, a high degree of treatment effect heterogeneity, and large variation
in treatment probability—clearly favor these approaches, where they eliminate bias and reduce RMSE by
roughly half, while increasing the standard error by only about 5%. However there may be settings in which
the improved bias does not so obviously dominate the efficiency cost. For example, where there are many
covariates relative to the sample size, little difference in the probability of treatment by stratum, or little
theoretical reason to expect treatment effect heterogeneity, investigators may have cause to prefer OLS if
efficiency concerns are of paramount interest. In those cases, diagnostics or interpretational aids that gauge
the severity of the misspecification bias (weights) may be useful.
Finally, when the linearity in potential outcomes assumption is not satisfied, none of these methods can
guarantee unbiasedness. Our work here regards only the relaxation from single linearity to separate linearity.
Investigators not satisfied with the linear approximation to treatment effects in this way could reasonably
consider more flexible approaches—though doing so may incur additional uncertainty costs as illustrated in
Table 1. Nevertheless, we agree with assessments such as Keele et al. (2010) and Aronow and Samii (2016)
that current practice largely remains reliant on linear approximations to adjust for covariates while hoping
to interpret the result as a meaningful causal effect. Our analysis suggests that, especially where there
are enough observations relative to covariates to support imputation/g-computation/T-learner, interactive
regression, or mean balancing, this may be preferable to suffering regression’s “weighting problem”, as it
avoids this form of bias under slightly weaker assumptions and requires only a small change to current
practice.
References
Abadie, A. and Imbens, G. W. (2006). Large Sample Properties of Matching Estimators for Average
Treatment Effects. Econometrica, 74(1):235–267.
Angrist, J. D. (1998). Estimating the Labor Market Impact of Voluntary Military Service Using Social
Security Data on Military Applicants. Econometrica, 66(2):249–288.
Angrist, J. D. and Krueger, A. B. (1999). Empirical strategies in labor economics. In Handbook of Labor
Economics, volume 3, Part A, pages 1277–1366. Elsevier.
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton University Press.
Aronow, P. M. and Samii, C. (2016). Does Regression Produce Representative Estimates of Causal Effects?
American Journal of Political Science, 60(1):250–267.
Belson, W. A. (1956). A technique for studying the effects of a television broadcast. Journal of the Royal
Statistical Society: Series C (Applied Statistics), 5(3):195–202.
Blair, G., Cooper, J., Coppock, A., and Humphreys, M. (2019). Declaring and diagnosing research designs.
American Political Science Review, 113(3):838–859.
Chattopadhyay, A., Greifer, N., and Zubizarreta, J. R. (2023). lmw: Linear Model Weights for Causal
Inference. arXiv:2303.08790 [stat].
Chattopadhyay, A. and Zubizarreta, J. R. (2023). On the implied weights of linear regression for causal
inference. Biometrika, 110(3):615–629.
Chattopadhyay, A. and Zubizarreta, J. R. (2024). Causation, Comparison, and Regression. Harvard Data
Science Review, 6(1). https://hdsr.mitpress.mit.edu/pub/1ybwbmlw.
Chernozhukov, V., Fernández-Val, I., Hahn, J., and Newey, W. (2013). Average and quantile effects in
nonseparable panel models. Econometrica, 81(2):535–580.
Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual trends.
Econometrica, 1(4):387–401.
Green, D. P. and Aronow, P. M. (2011). Analyzing Experimental Data Using Regression: When is Bias a
Practical Concern?
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method
to Produce Balanced Samples in Observational Studies. Political Analysis, 20(1):25–46.
Hazlett, C. (2020). Kernel balancing. Statistica Sinica, 30(3):1155–1189.
Hoffmann, N. I. (2023). Double robust, flexible adjustment methods for causal inference: An overview and
an evaluation. URL: https://osf.io/preprints/socarxiv/dzayg.
Humphreys, M. (2009). Bounds on least squares estimates of causal effects in the presence of heterogeneous
assignment probabilities.
Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying Double Robustness: A Comparison of Alternative
Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science, 22(4):523–539.
Keele, L., White, I. K., and McConnaughy, C. (2010). Adjusting experimental data: Models versus design.
In APSA 2010 Annual Meeting Paper.
Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous
treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s
critique. The Annals of Applied Statistics, 7(1):295–318.
Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal
of the American Statistical Association, 58(304):993–1010.
Neyman, J. (1923). Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des
principes. Roczniki Nauk Rolniczych, 10(1):1–51.
Peters, C. C. (1941). A method of matching groups for experiment with no loss of population. The Journal
of Educational Research, 34(8):606–612.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure
period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–
1512.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of educational Psychology, 66(5):688.
Snowden, J. M., Rose, S., and Mortimer, K. M. (2011). Implementation of G-Computation on a Simulated
Data Set: Demonstration of a Causal Inference Technique. American Journal of Epidemiology, 173(7):731–
738.
Sloczyński, T. (2022). Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller
Groups Get Larger Weights. The Review of Economics and Statistics, pages 1–9.
Zhao, Q. and Percival, D. (2017). Entropy balancing is doubly robust. Journal of Causal Inference,
5(1):20160010.
A Appendix
A.1 Derivation of the unit-level and strata-wise weights

τ̂_reg = Ĉov(Y_i, D_i^⊥X) / V̂(D_i^⊥X)    (48)

where X = [⃗1  ⃗X]. We first derive the unit-wise weights, which are somewhat trivial. For the numerator,

Ĉov(Y_i, D_i^⊥X) = Ê[Y_i D_i^⊥X]
                 = Σ_i Y_i (D_i − X_i θ̂)
                 = Σ_i Y_i (D_i − d̂(X_i))

where d̂(X_i) = X_i θ̂ with X_i = (1, X_i).
For the denominator,

V̂(D_i^⊥X) = Ê[(D_i^⊥X)²] − Ê[D_i^⊥X]²
          = Ê[(D_i^⊥X)²]
          = Σ_i (D_i − X_i θ̂)²
          = Σ_i (D_i − d̂(X_i))²

so that

τ̂_reg = Ĉov(Y_i, D_i^⊥X) / V̂(D_i^⊥X) = [ Σ_i Y_i (D_i − d̂(X_i)) ] / [ Σ_i (D_i − d̂(X_i))² ]    (50)
Let π_x = P(D_i = 1 | X_i = x), and recall that d̂(X_i) = X_i θ̂ with X_i = (1, X_i), which takes a single value
within each stratum X_i = x. We can try to manipulate the above representation to get the strata-wise weights:
Ĉov(Y_i, D_i^⊥X) = Σ_x P(X = x) Ê[Y_i (D_i − d̂(X_i)) | X_i = x]
                 = Σ_x P(X = x) ( Ê[Y_i D_i | X_i = x] − d̂(X_i) Ê[Y_i | X_i = x] )
                 = Σ_x P(X = x) ( P(D_i = 1 | X_i = x) Ê[Y_i | D_i = 1, X_i = x] − d̂(X_i) Ê[Y_i | X_i = x] )
                 = Σ_x P(X = x) ( π_x Ê[Y_i | D_i = 1, X_i = x] − d̂(X_i)( π_x Ê[Y_i | D_i = 1, X_i = x] + (1 − π_x) Ê[Y_i | D_i = 0, X_i = x] ) )
                 = Σ_x P(X = x) ( π_x (1 − d̂(X_i)) Ê[Y_i | D_i = 1, X_i = x] − d̂(X_i)(1 − π_x) Ê[Y_i | D_i = 0, X_i = x] )
This can be rewritten in a couple of different ways, where τ_x is the ATE given X = x:

Ĉov(Y_i, D_i^⊥X) = Σ_x P(X = x) ( Ê[Y_i | D_i = 1, X_i = x] π_x (1 − d̂(X_i)) − Ê[Y_i | D_i = 0, X_i = x] d̂(X_i)(1 − π_x) )
                 = Σ_x P(X = x) ( π_x Ê[Y_i | D_i = 1, X_i = x] − d̂(X_i) Ê[Y_i | D_i = 0, X_i = x] − d̂(X_i) π_x τ_x )
                 = Σ_x P(X = x) π_x (1 − d̂(X_i)) ( Ê[Y_i | D_i = 1, X_i = x] − [ d̂(X_i) / (1 − d̂(X_i)) ] [ (1 − π_x) / π_x ] Ê[Y_i | D_i = 0, X_i = x] )
Ĉov(Y_i, D_i^⊥X) = Σ_x P(X = x) ( π_x Ê[Y_i | D_i = 1, X_i = x] − d̂(X_i) Ê[Y_i | D_i = 0, X_i = x] − d̂(X_i) π_x τ_x )
                 = Σ_x P(X = x) ( π_x Ê[Y_i | D_i = 1, X_i = x] − (π_x + a_x) Ê[Y_i | D_i = 0, X_i = x] − (π_x + a_x) π_x τ_x )
                 = Σ_x P(X = x) ( π_x Ê[Y_i | D_i = 1, X_i = x] − π_x Ê[Y_i | D_i = 0, X_i = x] − a_x Ê[Y_i | D_i = 0, X_i = x] − π_x² τ_x − a_x π_x τ_x )
                 = Σ_x P(X = x) ( π_x (1 − π_x) τ_x − a_x ( Ê[Y_i | D_i = 0, X_i = x] + π_x τ_x ) )
                 = Σ_x P(X = x) ( π_x (1 − π_x) τ_x − a_x Ê[Y_i | X_i = x] )

Using this, we can write the weighted representation of the regression coefficient in terms of a_x:

τ̂_reg = [ Σ_x P(X = x) ( π_x (1 − π_x) τ_x − a_x Ê[Y_i | X_i = x] ) ] / [ Σ_x P(X = x) ( π_x (1 − π_x − a_x) + (π_x + a_x)(π_x + a_x − π_x) ) ]    (52)
A.2 Equivalence of the interacted regression and regression imputation

The interacted (Lin) regression estimates its coefficients by minimizing the least squares error; in other words, β̂ is the solution to the minimization problem

min_β Σ_{i=1}^{n} ( Y_i − (β_0 + β_1 D_i + β_2 X_i + β_3 D_i X_i) )²
Now let's look at the regressions involved in the imputation estimate. The imputation estimate is
θ̂ = (1/N) Σ_{i=1}^{N} ( Ŷ_i(1) − Ŷ_i(0) ) = (1/N) Σ_{i=1}^{N} ( γ_0 + γ_1 X_i − (α_0 + α_1 X_i) ). Here, γ and α are found
by solving the following minimization problems:

min_γ Σ_{i: D_i = 1} ( Y_i − (γ_0 + γ_1 X_i) )²

min_α Σ_{i: D_i = 0} ( Y_i − (α_0 + α_1 X_i) )²
This is equivalent to solving the single minimization problem:

min_{γ,α} Σ_{i: D_i = 1} ( Y_i − (γ_0 + γ_1 X_i) )² + Σ_{i: D_i = 0} ( Y_i − (α_0 + α_1 X_i) )²
= min_{γ,α} Σ_{i=1}^{n} D_i ( Y_i − (γ_0 + γ_1 X_i) )² + (1 − D_i)( Y_i − (α_0 + α_1 X_i) )²
= min_{γ,α} Σ_{i=1}^{n} D_i ( Y_i² − 2 Y_i (γ_0 + γ_1 X_i) + (γ_0 + γ_1 X_i)² ) + (1 − D_i)( Y_i² − 2 Y_i (α_0 + α_1 X_i) + (α_0 + α_1 X_i)² )
= min_{γ,α} Σ_{i=1}^{n} Y_i² − 2 Y_i ( α_0 + α_1 X_i + (γ_0 − α_0) D_i + (γ_1 − α_1) D_i X_i ) + ( α_0 + α_1 X_i + (γ_0 − α_0) D_i + (γ_1 − α_1) D_i X_i )²
= min_{γ,α} Σ_{i=1}^{n} ( Y_i − ( α_0 + α_1 X_i + (γ_0 − α_0) D_i + (γ_1 − α_1) D_i X_i ) )²
= min_β Σ_{i=1}^{n} ( Y_i − (β_0 + β_1 D_i + β_2 X_i + β_3 D_i X_i) )²

where β_0 = α_0, β_1 = γ_0 − α_0, β_2 = α_1, and β_3 = γ_1 − α_1.
A.3 Unbiasedness of mean balancing

We can show that the ATE estimate from using the mean balancing weights of Expression 35 is unbiased under separate linearity:

E[τ̂_meanbal] = E[ Σ_{D_i = 1} Y_i w_i − Σ_{D_i = 0} Y_i w_i ]    (55)
             = Σ_{D_i = 1} E[ Y_i w_i | D_i = 1 ] − Σ_{D_i = 0} E[ Y_i w_i | D_i = 0 ]    (56)
             = Σ_{D_i = 1} E[ Y_i(1) w_i | D_i = 1 ] − Σ_{D_i = 0} E[ Y_i(0) w_i | D_i = 0 ]    (57)
             = Σ_{D_i = 1} E[ (γ X_i + ϵ_i) w_i | D_i = 1 ] − Σ_{D_i = 0} E[ (α X_i + ϵ_i) w_i | D_i = 0 ]    (58)
             = γ Σ_{D_i = 1} E[ X_i w_i | D_i = 1 ] − α Σ_{D_i = 0} E[ X_i w_i | D_i = 0 ]    (59)
             = γ E[ Σ_{D_i = 1} X_i w_i ] − α E[ Σ_{D_i = 0} X_i w_i ]    (60)
             = γ E[ X̄ ] − α E[ X̄ ]    (61)
             = E[ Y_i(1) ] − E[ Y_i(0) ]    (62)
             = E[ Y_i(1) − Y_i(0) ]    (63)

where X̄ = (1/N) Σ_{i=1}^{N} X_i is the sample mean of the covariates.
A.4 Unbiasedness of impute and interact
We can show that both the regression imputation method and the interacted regression are unbiased for
the ATE. We start with regression imputation. We have one model for Y among the treated, µ̂1 , and one
among the controls, µ̂0 .
Then,

τ̂_imp = (1/n) ( Σ_{D_i = 1} ( Y_i − μ̂_0(X_i) ) + Σ_{D_i = 0} ( μ̂_1(X_i) − Y_i ) )    (64)
      = (1/n) Σ_i ( μ̂_1(X_i) − μ̂_0(X_i) )    (65)
      = Σ_{x∈X} [ ( Σ_i 1{X_i = x} ) / n ] ( μ̂_1(x) − μ̂_0(x) )    (66)
      = Σ_{x∈X} P̂(X = x) ( μ̂_1(x) − μ̂_0(x) )    (67)
      = Σ_{x∈X} P̂(X = x) DIM̂_x    (68)

We see that the imputation approach weights each stratum by the probability of inclusion in that stratum;
these weights result in an estimate that is unbiased for the ATE.