Design-based Uncertainty in
Regression Analysis ∗
arXiv:1706.01778v1 [math.ST] 6 Jun 2017
Abstract
Consider a researcher estimating the parameters of a regression function based on data for
all 50 states in the United States or on data for all visits to a website. What is the inter-
pretation of the estimated parameters and the standard errors? In practice, researchers
typically assume that the sample is randomly drawn from a large population of interest
and report standard errors that are designed to capture sampling variation. This is com-
mon practice, even in applications where it is difficult to articulate what that population
of interest is, and how it differs from the sample. In this article, we explore an alternative
approach to inference, which is partly design-based. In a design-based setting, the values of
some of the regressors can be manipulated, perhaps through a policy intervention. Design-
based uncertainty emanates from lack of knowledge about the values that the regression
outcome would have taken under alternative interventions. We derive standard errors that
account for design-based uncertainty instead of, or in addition to, sampling-based uncer-
tainty. We show that our standard errors in general are smaller than the infinite-population
sampling-based standard errors and provide conditions under which they coincide.
∗ We are grateful for comments by Daron Acemoglu, Joshua Angrist, Matias Cattaneo, Jim Poterba, Tymon Sloczyński, Bas Werker, and seminar participants at Microsoft Research, Michigan, Brown University, MIT, Stanford, Princeton, NYU, Columbia, Tilburg University, the Tinbergen Institute, American University, Montreal, Michigan State, Maryland, Pompeu Fabra, Carlos III and University College London, three referees, a co-editor, and especially for discussions with Gary Chamberlain. An earlier version of this paper circulated under the title "Finite Population Causal Standard Errors" (Abadie et al. (2014)).
† Professor of Economics, Massachusetts Institute of Technology, and NBER, abadie@mit.edu.
‡ Professor of Economics, Graduate School of Business, Stanford University, and NBER, athey@stanford.edu.
§ Professor of Economics, Graduate School of Business, and Department of Economics, Stanford University, and NBER, imbens@stanford.edu.
¶ University Distinguished Professor, Department of Economics, Michigan State University, wooldri1@msu.edu.
1 Introduction
The dominant approach to inference in regression analysis in the social sciences takes a sampling
perspective on uncertainty. This perspective relies on the assumption that the observed units
can be viewed as a sample drawn randomly from a large population of interest. In many cases
this random sampling perspective is a natural and attractive one. For example, if one analyzes
individual-level data from the U.S. Current Population Survey, the Panel Study of Income
Dynamics, or the 1% public use sample from the U.S. Census, it is natural to regard the sample
as a small random subset of the population of interest. In many other settings, however, this
sampling perspective is less attractive. For example, suppose that the data set to be analyzed
contains information on all 50 states of the United States, all the countries in the world, or all
visits to a website. If, for all units in this data set, we observe an outcome and some attributes
at a single point in time, and we ask how the average outcome varies across two subpopulations
defined by these attributes, the answer is a quantity that is known with certainty. Hence,
the standard error should be zero. However, researchers analyzing this type of data typically
report standard errors that are formally justified by the random sampling perspective. This
widespread practice implicitly forces the object of interest to be a data generating process, or
superpopulation, from which the actual population is drawn at random. In such a setting,
uncertainty arises from lack of observability of the superpopulation. While this may be an
appealing framework in some instances, it is clearly not so in cases where the interest resides in
an actual finite population and, in any event, a researcher may want to first define the object
of interest and then use an appropriate mode of inference, rather than allowing the mode of
inference to implicitly define the object of interest of her/his investigation.
In this article, we provide an alternative framework for the interpretation of uncertainty in
regression analysis regardless of whether a fraction of the population or the entire population
is included in the sample. While our framework accommodates sampling-based uncertainty, it
also takes into account design-based uncertainty, which arises when the parameter of interest is
defined in terms of the unobserved outcomes that some units would attain under a certain inter-
vention. Design-based uncertainty is often explicitly accounted for in the analysis of randomized
experiments where it is the basis of randomization inference (Neyman, 1923; Rosenbaum, 2002;
Imbens and Rubin, 2015), but it is rarely explicitly acknowledged in regression analyses or, more
generally, in observational studies (exceptions in special cases include Samii and Aronow, 2012;
Freedman, 2008a and 2008b; Lin, 2013).
To illustrate the differences between sampling-based inference and design-based inference, we
present two examples in Tables 1 and 2. In the example of Table 1, there is a finite population
consisting of n units, each characterized by the values of a pair of variables, $Y_i$ and $Z_i$. Here,
we can define an estimand as a function of the pairs $\{(Y_i, Z_i)\}_{i=1}^{n}$ for the entire population. For
example, the estimand could be the difference in the population average value of the outcome,
Yi , by values of the attribute, Zi . Uncertainty about the estimand exists when we observe the
values (Yi , Zi ) only for a subset of the population, the sample. In Table 1 inclusion of unit i
in a sample is coded by the binary variable Ri . In this setting, an estimator can be naturally
defined as the difference in the average value of the outcome, Yi , by values of the attribute, Zi ,
in the sample. Sampling-based inference uses information about the process that determines
R1 , . . . , Rn , to assess the variability of estimators across different potential samples.
Table 1: Sampling-based uncertainty. Each group of three columns ($Y_i$, $Z_i$, $R_i$) corresponds to one potential sample; "X" marks an observed value and "?" a missing one.

Unit   Yi Zi Ri    Yi Zi Ri    Yi Zi Ri
  1     X  X  1     ?  ?  0     ?  ?  0   ...
  2     ?  ?  0     ?  ?  0     ?  ?  0   ...
  3     ?  ?  0     X  X  1     X  X  1   ...
  4     ?  ?  0     X  X  1     ?  ?  0   ...
 ...
  n     X  X  1     ?  ?  0     ?  ?  0   ...
Table 2 depicts a different scenario. We encounter again a finite population of size n. For each
population unit we now observe the value of one of two variables, either Yi (1) or Yi (0), but not
both. Yi (1) and Yi (0) represent the potential outcomes that unit i would attain under exposure
or lack of exposure to a certain intervention (or treatment) of interest. In Table 2, exposure to the
intervention is coded by the binary treatment variable, Xi . We observe Yi (1) if Xi = 1, and Yi (0)
if Xi = 0. The estimand is a function of the full set of pairs $\{(Y_i(1), Y_i(0))\}_{i=1}^{n}$, for example,
the average causal effect $(1/n)\sum_{i=1}^{n}\left(Y_i(1) - Y_i(0)\right)$. As in the first example, the estimator is a
function of the observed data, e.g., the difference in the average of observed values of Yi (1) and
Yi (0). Design-based inference uses information about the process that determines X1 , . . . , Xn ,
to assess the variability of estimators across different potential samples. Notice that, under this
mode of inference, uncertainty about the estimand remains even when we observe the entire
population, as in Table 2.
Table 2: Design-based uncertainty. Each group of three columns ($Y_i(1)$, $Y_i(0)$, $X_i$) corresponds to one potential assignment; "X" marks an observed value and "?" a missing one.

Unit  Yi(1) Yi(0) Xi    Yi(1) Yi(0) Xi    Yi(1) Yi(0) Xi
  1     X     ?   1       X     ?   1       ?     X   0   ...
  2     ?     X   0       ?     X   0       ?     X   0   ...
  3     ?     X   0       X     ?   1       X     ?   1   ...
  4     ?     X   0       ?     X   0       X     ?   1   ...
 ...
  n     X     ?   1       ?     X   0       ?     X   0   ...
More generally, of course, we can have complex missing data processes that combine features
of these two examples, with some units not included in the sample at all, and with one of the
two potential outcomes not observed for the sample units. The inferential procedures proposed
in this article address both sources of variability. As the examples in Tables 1 and 2 illustrate,
articulating the exact nature of the estimand of interest and the source of uncertainty that
makes the estimator stochastic are crucial steps to valid inference. For this purpose, it will be
useful to distinguish between descriptive estimands, where uncertainty stems solely from not
observing all units in the population of interest, and causal estimands, where the uncertainty
stems, at least partially, from unobservability of potential outcomes.
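The two-source setting just described is easy to mimic in a small simulation. The sketch below (in Python; the population, effect size, and the sampling and assignment rates are all our own illustrative choices, not taken from the paper) holds the potential outcomes fixed and draws both a sampling indicator and an assignment; the estimator varies across draws even though the population never changes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y0 = rng.normal(0.0, 1.0, n)              # fixed potential outcome without treatment
y1 = y0 + 0.5 + rng.normal(0.0, 0.3, n)   # fixed potential outcome with treatment

def estimate_once(rng):
    x = rng.integers(0, 2, n)             # design-based draw: random assignment
    r = rng.random(n) < 0.3               # sampling-based draw: random sampling
    y = np.where(x == 1, y1, y0)          # realized outcomes
    return y[r & (x == 1)].mean() - y[r & (x == 0)].mean()

draws = np.array([estimate_once(rng) for _ in range(2000)])
theta_causal = (y1 - y0).mean()           # population average causal effect
print(round(float(draws.mean()), 3), round(float(theta_causal), 3))
```

Setting the sampling indicator to one for every unit would eliminate the sampling-based variation but not the design-based variation: the draws of the assignment alone keep the estimator stochastic.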
The main formal contribution of this article is to generalize the results for the approximate
variance for multiple linear regression estimators associated with the work by Eicker (1967),
Huber (1967), and White (1980a,b, 1982), EHW from hereon, in two directions. First, we allow
sampling from a finite population and, second, we allow for design-based uncertainty in addition
to, or instead of, the sampling-based uncertainty that the EHW results are based on. The first
generalization decreases the variance, and the second increases the variance. Incorporating these
generalizations requires developing a new framework for regression analysis with assumptions
that differ from the standard ones. This framework nests as special cases the Neyman (1923),
Samii and Aronow (2012), Freedman (2008a), Freedman (2008b), and Lin (2013) regression
analyses for data from randomized experiments. We show that in large samples the widely
used EHW robust standard errors are conservative. Moreover, we show that the presence of
attributes – that is, immutable characteristics of the units – can be exploited to improve on the
EHW variance estimator, and we propose variance estimators that do so. Finally, we show that
in some special cases, in particular the case where the regression function is correctly specified,
the EHW standard errors are asymptotically correct.
One important practical advantage of our framework is that it justifies non-zero standard
errors in cases where we observe all units in the population but design-based uncertainty remains.
A second advantage of the formal separation into sampling-based and design-based uncertainty is
that it allows us to discuss the distinction between internal and external validity (Shadish et al.,
2002; Manski, 2013; Deaton, 2010) in terms of these two sources of uncertainty. For internal
validity there are no assumptions required on the sampling process, and conversely, for external
validity there are no assumptions required on the design.
2 A Simple Example
In this section we set the stage for the problems discussed in the current article by discussing
least squares estimation in a simple example with a single binary regressor. We make four
points. First, we show how design-based uncertainty affects the variance of regression estimators.
Second, we show that the standard Eicker-Huber-White (EHW) variance estimator remains
conservative when we take into account design-based uncertainty. Third, we show that there is a
simple finite-population correction to the EHW variance estimator for descriptive estimands but
not for causal estimands. Fourth, we discuss the relation between the two sources of uncertainty
and the notions of internal and external validity of the estimand.
We focus on a setting with a finite population of size n. We sample N units from this
population, with Ri ∈ {0, 1} indicating whether a unit was sampled (Ri = 1) or not (Ri = 0),
so that $N = \sum_{i=1}^{n} R_i$. There is a single binary regressor, $X_i \in \{0, 1\}$, and $n_x$ (resp. $N_x$) are
the number of units in the population (resp. the sample) with Xi = x. To make the discussion
specific, suppose the binary regressor Xi is an indicator for a state regulation, say the state
having a minimum wage higher than the federal minimum wage. We view the regressor not as
a fixed attribute or characteristic of each unit, but instead as a cause or policy variable whose
value could have been different from the observed value. This generates missing data of the
type shown in Table 2, where only some of the states of the world are observed, implying that
there is design-based uncertainty. Formally, using the Rubin causal model or potential outcome
framework (Neyman, 1923; Rubin, 1974; Holland, 1986; Imbens and Rubin, 2015), we postulate
the existence of two potential outcomes for each unit, denoted by Yi (1) and Yi (0), for state
average earnings without and with a state minimum wage, with Yi , the realized outcome, given
the actual or prevailing minimum wage, defined as:
$$Y_i = Y_i(X_i) = \begin{cases} Y_i(1) & \text{if } X_i = 1, \\ Y_i(0) & \text{if } X_i = 0. \end{cases}$$
These potential outcomes are viewed as non-stochastic attributes for unit i, irrespective of the
realized value of Xi . They, as well as the additional observed attributes, Zi , remain fixed in
repeated sampling thought experiments, whereas Ri and Xi are stochastic and, as a result, so are
the realized outcomes, Yi. In the current section we abstract from the presence of fixed observed
attributes, Zi , which will play an important role in Section 3. Let Y , Y (1), Y (0), R, and X be
the n-vectors with i-th element equal to Yi , Yi (1), Yi (0), Ri , and Xi respectively. For sampled
units (units with Ri = 1) we observe Xi , and Yi .
In general, estimands are functions of the full set of population values (Y (1), Y (0), X, R).
We consider two types of estimands, descriptive and causal. If an estimand can be written as a
function of (Y , X), free of dependence on R and on the potential outcomes beyond the realized
outcome, we label it a descriptive estimand. Intuitively a descriptive estimand is an estimand
whose value would be known with certainty if we observe all the realized values of all variables
for all units in the population. If an estimand cannot be written as a function of (Y , X, R)
because it depends on the potential outcomes Y (1) and Y (0), then we label it a causal estimand.
We now consider in our binary regressor example three closely related estimands, one de-
scriptive and two causal. The first estimand is the difference in population averages by the
prevailing minimum wage,
$$\theta^{\mathrm{descr}} = \theta^{\mathrm{descr}}(Y, X) = \frac{1}{n_1}\sum_{i=1}^{n} X_i Y_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - X_i) Y_i.$$
is the sample average causal effect,
$$\theta^{\mathrm{causal,sample}} = \theta^{\mathrm{causal,sample}}(Y(1), Y(0), R) = \frac{1}{N}\sum_{i=1}^{n} R_i \left(Y_i(1) - Y_i(0)\right).$$
We start by studying the first moment of the estimator, conditional on (N1 , N0 ), and only
for the cases where N1 ≥ 1 and N0 ≥ 1 (and thus n1 ≥ 1 and n0 ≥ 1). We leave this
latter conditioning implicit in the notation throughout this section. A supplementary appendix
contains proofs of the results in this section. First, taking the expectation only over the random
sampling, under Assumption 1:
$$E\left[\hat\theta \mid X, N_1, N_0\right] = \theta^{\mathrm{descr}}. \qquad (2.1)$$
Notice that this result does not require random assignment. Second, taking the expectation
only over the random assignment, under Assumption 2:
$$E\left[\hat\theta \mid R, N_1, N_0\right] = \theta^{\mathrm{causal,sample}}. \qquad (2.2)$$
This equality does not require random sampling. Third, taking the expectation over both the
sampling and the assignment, maintaining both Assumptions 1 and 2:
$$E\left[\hat\theta \mid N_1, N_0\right] = E\left[\theta^{\mathrm{descr}} \mid N_1, N_0\right] = E\left[\theta^{\mathrm{causal,sample}} \mid N_1, N_0\right] = \theta^{\mathrm{causal}}.$$
Next we look at the variance of the estimator. Here we maintain both the random as-
signment and random sampling assumption. From Equations (2.1) and (2.2), it follows that
var(θb | X, N1 , N0 ) measures dispersion with respect to θdescr , while var(θb | R, N1 , N0 ) measures
dispersion with respect to θcausal,sample . By the law of total variance, we can decompose:
$$\mathrm{var}\left(\hat\theta \mid N_1, N_0\right) = E\left[\mathrm{var}\left(\hat\theta \mid X, N_1, N_0\right) \mid N_1, N_0\right] + \mathrm{var}\left(\theta^{\mathrm{descr}} \mid N_1, N_0\right)$$
$$= E\left[\mathrm{var}\left(\hat\theta \mid R, N_1, N_0\right) \mid N_1, N_0\right] + \mathrm{var}\left(\theta^{\mathrm{causal,sample}} \mid N_1, N_0\right). \qquad (2.3)$$
Let
$$S_1^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i(1) - \frac{1}{n}\sum_{j=1}^{n} Y_j(1)\right)^2.$$
S02 and Sθ2 are analogously defined for Y1 (0), . . . , Yn (0) and θ1 , . . . , θn , respectively, where θi =
Yi (1) − Yi (0). The variance of θb can be expressed as
$$V^{\mathrm{total}}(N_1, N_0, n_1, n_0) = \mathrm{var}\left(\hat\theta \mid N_1, N_0\right) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_\theta^2}{n}, \qquad (2.4)$$
which is a variant of the result in Neyman (1923).
For the first decomposition in equation (2.3), the sampling-based component of the total
variance is
$$V^{\mathrm{sampling}}(N_1, N_0, n_1, n_0) = E\left[\mathrm{var}\left(\hat\theta \mid X, N_1, N_0\right) \mid N_1, N_0\right] = \frac{S_1^2}{N_1}\left(1 - \frac{N_1}{n_1}\right) + \frac{S_0^2}{N_0}\left(1 - \frac{N_0}{n_0}\right),$$
and the design-based component, beyond the sampling-based component, is
$$V^{\mathrm{design|sampling}}(N_1, N_0, n_1, n_0) = \mathrm{var}\left(\theta^{\mathrm{descr}} \mid N_1, N_0\right) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\theta^2}{n}.$$
For the second decomposition in equation (2.3), the design-based component of the variance is
$$V^{\mathrm{design}}(N_1, N_0, n_1, n_0) = E\left[\mathrm{var}\left(\hat\theta \mid R, N_1, N_0\right) \mid N_1, N_0\right] = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_\theta^2}{N},$$
and the sampling-based component, beyond the design-based component, is
$$V^{\mathrm{sampling|design}}(N_1, N_0, n_1, n_0) = \mathrm{var}\left(\theta^{\mathrm{causal,sample}} \mid N_1, N_0\right) = \frac{S_\theta^2}{N}\left(1 - \frac{N}{n}\right).$$
Comment 1. Causal versus Descriptive Estimands
A key comparison is between the sampling variance for the estimator for the descriptive estimand
and the design variance for the estimator for the sample average causal effect
$$V^{\mathrm{sampling}}(N_1, N_0, n_1, n_0) = \frac{S_1^2}{N_1}\left(1 - \frac{N_1}{n_1}\right) + \frac{S_0^2}{N_0}\left(1 - \frac{N_0}{n_0}\right),$$
versus
$$V^{\mathrm{design}}(N_1, N_0, n_1, n_0) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_\theta^2}{N}.$$
In general these variances cannot be ranked: the sampling variance can be very close to zero if
the sampling rate N/n is close to one, but it can also be larger than the design variance if the
sampling rate is small and the variance of the treatment effect, Sθ2 , is substantial.
Comment 2. Finite Population Correction
If the estimand is θcausal or θdescr , ignoring the fact that the population is finite generally leads
to an overstatement of the variance:
$$V^{\mathrm{total}}(N_1, N_0, \infty, \infty) - V^{\mathrm{total}}(N_1, N_0, n_1, n_0) = \frac{S_\theta^2}{n} \ge 0,$$
$$V^{\mathrm{sampling}}(N_1, N_0, \infty, \infty) - V^{\mathrm{sampling}}(N_1, N_0, n_1, n_0) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} \ge 0.$$
If the estimand is θcausal,sample , however, the population size is irrelevant:
In other words, if the population is large relative to the sample, it is sufficient to consider
the sampling-based variance. This can be viewed as the implicit justification for the common
practice of ignoring design-based uncertainty. If, at the other extreme, the sample is equal to
the population, the sampling-based variance component is zero and the design-based component
is equal to the total variance:
assumption and is not affected by the assumptions on the assignment process. However, for θb
to be a good estimator for θcausal , which is often the most interesting estimand, we need both
internal and external validity, and thus both random assignment and random sampling.
For the binary regressor example the EHW variance estimator can be written as
$$\hat V^{\mathrm{ehw}} = \frac{N_1 - 1}{N_1^2}\,\hat S_1^2 + \frac{N_0 - 1}{N_0^2}\,\hat S_0^2, \qquad \tilde V^{\mathrm{ehw}} = \frac{\hat S_1^2}{N_1} + \frac{\hat S_0^2}{N_0},$$
with expectation equal to the sampling variance in the infinite population case,
$$V^{\mathrm{ehw}} = E\left[\tilde V^{\mathrm{ehw}}\right] = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} = V^{\mathrm{sampling}}(N_1, N_0, \infty, \infty).$$
This variance is also the one proposed by Neyman (1923). Bootstrapping the estimator would
approximately give the same variance.
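A small Monte Carlo sketch (our own simulated population, not code from the paper) illustrates the conservativeness: with the full population observed, the EHW/Neyman-style estimate $\hat S_1^2/N_1 + \hat S_0^2/N_0$ exceeds the true design-based variance whenever the treatment effects are heterogeneous, because it ignores the $S_\theta^2/n$ term:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n1 = 200, 100
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + rng.normal(1.0, 1.0, n)   # heterogeneous effects: S_theta^2 > 0

est, vhat = [], []
for _ in range(4000):
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, n1, replace=False)] = 1          # completely randomized
    y = np.where(x == 1, y1, y0)
    est.append(y[x == 1].mean() - y[x == 0].mean())
    vhat.append(y[x == 1].var(ddof=1) / n1 + y[x == 0].var(ddof=1) / (n - n1))

design_var = float(np.var(est))   # simulated truth: variance over assignments
ehw_var = float(np.mean(vhat))    # average EHW/Neyman-style variance estimate
print(round(design_var, 4), round(ehw_var, 4))
```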
3.1 Set Up
Consider a sequence of finite populations indexed by population size, n. Unit i in population
n is characterized by a set of fixed attributes Zn,i (including an intercept) and by a potential
outcome function, Yn,i (), which maps causes, Un,i , into outcomes, Yn,i = Yn,i (Un,i ). Zn,i and Un,i
are real-valued column vectors, while Yn,i is scalar. We do not place restrictions on the types of
the variables: they can be continuous, discrete, or mixed.
There is a sequence of samples associated with the population sequence. We will use Rn,i = 1
to indicate that unit i of population n is sampled, and Rn,i = 0 to indicate that it is not sampled.
For each unit in sample n, we observe the triple, (Yn,i , Un,i , Zn,i ).
A key feature of the analysis in this section relative to Section 2 is that we now allow for
more complicated assignment mechanisms. In particular, we relax the assumption that the
causes have identical distributions.
Assumption 3. (Assignment Mechanism) The assignments Un,1 , . . . , Un,n are jointly inde-
pendent, and independent of Rn,1 , . . . , Rn,n , but not (necessarily) identically distributed (i.n.i.d.).
For what follows, it is convenient to work with a transformation Xn,1 , . . . , Xn,n of Un,1 , . . . , Un,n
such that
" n # n
X X
′ ′
E Xn,i Zn,i = E[Xn,i ]Zn,i = 0. (3.1)
i=1 i=1
This can be accomplished in the following way. We assume that the population matrix $\sum_{i=1}^{n} Z_{n,i} Z_{n,i}'$ is full-rank. Then, equation (3.1) holds for
$$X_{n,i} = U_{n,i} - \Lambda_n Z_{n,i}, \qquad (3.2)$$
where
$$\Lambda_n = \left(\sum_{i=1}^{n} E[U_{n,i}]\, Z_{n,i}'\right)\left(\sum_{i=1}^{n} Z_{n,i} Z_{n,i}'\right)^{-1}.$$
It is important to notice that, because Λn Zn,i is deterministic in our setting and Un,1 , . . . , Un,n
are i.n.i.d., the variables Xn,1 , . . . , Xn,n are i.n.i.d. too.
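The recentering in (3.1)-(3.2) is easy to verify numerically. In this sketch the attribute matrix, the functional form of $E[U_{n,i}]$, and all constants are our own assumptions; $E[U_{n,i}]$ is taken as known by construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
Z = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])  # attributes w/ intercept
EU = 0.3 + 0.7 * Z[:, 1]**2               # E[U_{n,i}], known in this sketch
Lam = (EU @ Z) @ np.linalg.inv(Z.T @ Z)   # Lambda_n (a row vector here)
EX = EU - Z @ Lam                         # E[X_{n,i}] after the recentering (3.2)
check = EX @ Z                            # sum_i E[X_i] Z_i' -- should vanish
print(np.round(check, 10))
```

Note that $E[U_{n,i}]$ was deliberately chosen non-linear in $Z_{n,i}$; the recentering itself does not require linearity, which is only needed later (Assumption 7) for the interpretation results.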
For population n, let Y n , X n , Z n , Rn , and Y n () be matrices that collect outcomes, causes,
attributes, sampling indicators, and potential outcome functions, where each population unit
has the same row index in each of the matrices. In our setting, the sampling indicators Rn and
the causes X n are stochastic. The attributes Z n and the potential outcome functions Y n () are
taken as fixed. Expectations are taken over the distribution of (Rn , X n ).
We analyze the properties of the estimator $\hat\theta_n$ obtained by minimizing the sum of squared errors in the sample:
$$(\hat\theta_n, \hat\gamma_n) = \operatorname*{argmin}_{(\theta,\gamma)} \sum_{i=1}^{n} R_{n,i}\left(Y_{n,i} - X_{n,i}'\theta - Z_{n,i}'\gamma\right)^2.$$
The properties of the population regression residuals, $e_{n,i} = Y_{n,i} - X_{n,i}'\theta_n - Z_{n,i}'\gamma_n$, depend on
the exact nature of the estimands, (θn , γn ). In what follows, we will consider alternative target
parameters, which in turn will imply different properties for en,i . Notice also that, although the
transformation in (3.2) is typically infeasible (because the values of $E[U_{n,i}]$ may not be known), $\hat\theta_n$ is not affected by the transformation, in the sense that the least squares estimators $(\tilde\theta_n, \tilde\gamma_n)$, defined as
$$(\tilde\theta_n, \tilde\gamma_n) = \operatorname*{argmin}_{(\theta,\gamma)} \sum_{i=1}^{n} R_{n,i}\left(Y_{n,i} - U_{n,i}'\theta - Z_{n,i}'\gamma\right)^2,$$
satisfy $\hat\theta_n = \tilde\theta_n$ (although, in general, $\hat\gamma_n \neq \tilde\gamma_n$). As a result, we can analyze the properties of $\hat\theta_n$
focusing on the properties of the regression on Xn,1 , . . . , Xn,n instead of on Un,1 , . . . , Un,n .
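This invariance can also be checked numerically. Since $E[U_{n,i}]$ is unknown in practice, the sketch below (toy data of our own) uses a sample projection as a stand-in for $\Lambda_n$; the coefficient on the cause is unchanged for any fixed matrix used in the recentering, because the column space of the regressors is the same:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
Z = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])
U = rng.normal(0.0, 1.0, n) + Z[:, 1]
Y = 2.0 * U + Z @ np.array([1.0, -0.5]) + rng.normal(0.0, 0.1, n)

Lam = np.linalg.lstsq(Z, U, rcond=None)[0]   # stand-in for Lambda_n
X = U - Z @ Lam                              # recentered cause

coef_U = np.linalg.lstsq(np.column_stack([U, Z]), Y, rcond=None)[0]
coef_X = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0]
print(round(float(coef_U[0]), 6), round(float(coef_X[0]), 6))
```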
We assume random sampling with some conditions on the sampling rate to ensure that the
sample size increases with the population size.
for all n-vectors r with i-th element ri ∈ {0, 1}. (ii) The sequence of sampling rates, ρn , satisfies
nρn → ∞ and ρn → ρ ∈ [0, 1].
Assumption 4(i) states that each population unit is sampled with probability ρn independently
of the others. The first part of Assumption 4(ii) guarantees that as the population size increases,
the (expected) sample size also increases. The second part of Assumption 4(ii) allows for the
possibility that asymptotically the sample size is a negligible fraction of the population size so
that the EHW results, corresponding to ρ = 0, are included as a special case of our results.
The next assumption is a regularity condition bounding moments.
Assumption 5. (Moments) There exists some δ > 0 such that the sequences
$$\frac{1}{n}\sum_{i=1}^{n} E\left[|Y_{n,i}|^{4+\delta}\right], \qquad \frac{1}{n}\sum_{i=1}^{n} E\left[\|X_{n,i}\|^{4+\delta}\right], \qquad \frac{1}{n}\sum_{i=1}^{n} \|Z_{n,i}\|^{4+\delta}$$
are bounded.
Let
$$W_n = \frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}', \qquad \Omega_n = \frac{1}{n}\sum_{i=1}^{n} E\left[\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}'\right].$$
So Ωn = E[Wn ], where the expectation is taken over the distribution of X n . We will consider
also sample counterparts of Wn and Ωn :
$$\widetilde W_n = \frac{1}{N}\sum_{i=1}^{n} R_{n,i}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}', \qquad \widetilde\Omega_n = \frac{1}{N}\sum_{i=1}^{n} R_{n,i}\, E\left[\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}\begin{pmatrix} Y_{n,i} \\ X_{n,i} \\ Z_{n,i}\end{pmatrix}'\right],$$
where $\widetilde\Omega_n = E[\widetilde W_n \mid R_n]$. We will use superscripts to indicate submatrices. For example,
$$W_n = \begin{pmatrix} W_n^{YY} & W_n^{YX} & W_n^{YZ} \\ W_n^{XY} & W_n^{XX} & W_n^{XZ} \\ W_n^{ZY} & W_n^{ZX} & W_n^{ZZ} \end{pmatrix},$$
with analogous partitions for $\Omega_n$, $\widetilde W_n$, and $\widetilde\Omega_n$. Notice that the transformation in (3.2) implies that $\Omega_n^{XZ}$ and $\Omega_n^{ZX}$ are matrices with all zero entries.
We first obtain convergence results for the sample objects, $\widetilde W_n$ and $\widetilde\Omega_n$.

Lemma 1. Suppose Assumptions 3–5 hold. Then, $\widetilde W_n - \Omega_n \xrightarrow{\,p\,} 0$, $\widetilde\Omega_n - \Omega_n \xrightarrow{\,p\,} 0$, and $\widetilde W_n - W_n \xrightarrow{\,p\,} 0$.
(i) Estimands are functionals of (Y n (), X n , Z n , Rn ), exchangeable in the rows of the argu-
ments.
(ii) Descriptive estimands are estimands that can be written in terms of Y n , X n , and Z n ,
free of dependence on Rn , and free of dependence on Y n () beyond dependence on Y n .
(iii) Causal estimands are estimands that cannot be written in terms of Y n , X n , Z n , and Rn ,
because they depend on the potential outcome functions Y n () beyond the realized outcomes,
Y n.
Causal estimands depend on the values of potential outcomes beyond the values that can
be inferred from the realized outcomes. Given a sample, the only reason we may not be able to
infer the value of a descriptive estimand is that we may not see all the units in the population.
In contrast, even if we observe all units in a population, we may not be able to infer the value
of a causal estimand because its value depends on potential outcomes.
We define three estimands of interest,
$$\begin{pmatrix}\theta_n^{\mathrm{descr}} \\ \gamma_n^{\mathrm{descr}}\end{pmatrix} = \begin{pmatrix} W_n^{XX} & W_n^{XZ} \\ W_n^{ZX} & W_n^{ZZ}\end{pmatrix}^{-1}\begin{pmatrix} W_n^{XY} \\ W_n^{ZY}\end{pmatrix}, \qquad (3.3)$$
$$\begin{pmatrix}\theta_n^{\mathrm{causal,sample}} \\ \gamma_n^{\mathrm{causal,sample}}\end{pmatrix} = \begin{pmatrix} \widetilde\Omega_n^{XX} & \widetilde\Omega_n^{XZ} \\ \widetilde\Omega_n^{ZX} & \widetilde\Omega_n^{ZZ}\end{pmatrix}^{-1}\begin{pmatrix} \widetilde\Omega_n^{XY} \\ \widetilde\Omega_n^{ZY}\end{pmatrix}, \qquad (3.4)$$
and
$$\begin{pmatrix}\theta_n^{\mathrm{causal}} \\ \gamma_n^{\mathrm{causal}}\end{pmatrix} = \begin{pmatrix} \Omega_n^{XX} & \Omega_n^{XZ} \\ \Omega_n^{ZX} & \Omega_n^{ZZ}\end{pmatrix}^{-1}\begin{pmatrix} \Omega_n^{XY} \\ \Omega_n^{ZY}\end{pmatrix}. \qquad (3.5)$$
Alternatively, the estimands in (3.3) to (3.5) can be defined as the coefficients that correspond to the orthogonality conditions in terms of the residuals $e_{n,i} = Y_{n,i} - X_{n,i}'\theta_n - Z_{n,i}'\gamma_n$,
$$\frac{1}{n}\sum_{i=1}^{n}\begin{pmatrix}X_{n,i} \\ Z_{n,i}\end{pmatrix} e_{n,i} = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} R_{n,i}\, E\left[\begin{pmatrix}X_{n,i} \\ Z_{n,i}\end{pmatrix} e_{n,i}\right] = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} E\left[\begin{pmatrix}X_{n,i} \\ Z_{n,i}\end{pmatrix} e_{n,i}\right] = 0,$$
respectively. We will study the properties of the least squares estimator, $\hat\theta_n$, defined by
$$\begin{pmatrix}\hat\theta_n \\ \hat\gamma_n\end{pmatrix} = \begin{pmatrix} \widetilde W_n^{XX} & \widetilde W_n^{XZ} \\ \widetilde W_n^{ZX} & \widetilde W_n^{ZZ}\end{pmatrix}^{-1}\begin{pmatrix} \widetilde W_n^{XY} \\ \widetilde W_n^{ZY}\end{pmatrix},$$
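Computationally, the estimands (3.3)-(3.5) and the estimator all solve the same partitioned system, only with different moment matrices plugged in. A toy version with a binary cause, intercept-only attributes, and known unit-level assignment probabilities (all numbers and the helper function are our own choices, not the paper's notation) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
p = np.clip(rng.random(n), 0.2, 0.8)     # unit-level assignment probabilities
y0 = rng.normal(0.0, 1.0, n)             # fixed potential outcomes
y1 = y0 + rng.normal(1.0, 0.5, n)

U = (rng.random(n) < p).astype(float)    # realized binary cause
R = rng.random(n) < 0.5                  # sampling indicators
Y = np.where(U == 1.0, y1, y0)           # realized outcomes

def theta_from_moments(m_uu, m_u, m_uy, m_y):
    # coefficient on the cause when regressing on (cause, intercept)
    A = np.array([[m_uu, m_u], [m_u, 1.0]])
    return np.linalg.solve(A, np.array([m_uy, m_y]))[0]

# realized population moments -> descriptive estimand, as in (3.3)
th_descr = theta_from_moments((U * U).mean(), U.mean(), (U * Y).mean(), Y.mean())
# expected moments (for a binary cause: E[U^2]=E[U]=p, E[UY]=p*y1,
# E[Y]=p*y1+(1-p)*y0) -> causal estimand, as in (3.5)
th_causal = theta_from_moments(p.mean(), p.mean(), (p * y1).mean(),
                               (p * y1 + (1 - p) * y0).mean())
# same expected moments averaged over sampled units only, as in (3.4)
N = R.sum()
th_causal_sample = theta_from_moments((R * p).sum() / N, (R * p).sum() / N,
                                      (R * p * y1).sum() / N,
                                      (R * (p * y1 + (1 - p) * y0)).sum() / N)
print(round(th_descr, 3), round(th_causal, 3), round(th_causal_sample, 3))
```

The three quantities differ only in whether realized or expected moments are used, and over which units they are averaged.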
Assumption 7. (Linearity of the Expected Assignment) There exists a sequence of
real matrices Bn such that
E[Un,i ] = Bn Zn,i .
and, with probability approaching one,
$$\theta_n^{\mathrm{causal,sample}} = \left(\sum_{i=1}^{n} R_{n,i}\, E\left[W_{n,i}^{XX}\right]\right)^{-1} \sum_{i=1}^{n} R_{n,i}\, E\left[W_{n,i}^{XX}\right] \theta_{n,i},$$
where $W_{n,i}^{XX} = X_{n,i} X_{n,i}'$.
The linearity in Assumption 8 is a strong restriction in many settings. In other settings, in particular when the causal variable is binary or, more generally, when the causal variable takes on only a finite number of values, it is straightforward to enforce this assumption by including in $U_{n,i}$ indicator variables representing all but one of the possible values of the cause. Assumption 8 can be relaxed at the cost of introducing additional complications in the interpretation of the estimands.
Theorem 2. Suppose that Assumptions 3–7 hold. Moreover, assume that $X_{n,1}, \ldots, X_{n,n}$ are continuous random variables with convex and compact supports, and that the potential outcome functions, $Y_{n,i}()$, are continuously differentiable. Then, there exist random variables $\varphi_{n,1}, \ldots, \varphi_{n,n}$ such that, for n sufficiently large,
$$\theta_n^{\mathrm{causal}} = \left(\sum_{i=1}^{n} E\left[W_{n,i}^{XX}\right]\right)^{-1} \sum_{i=1}^{n} E\left[W_{n,i}^{XX}\right] \varphi_{n,i},$$
and
$$\theta_n^{\mathrm{causal,sample}} = \left(\sum_{i=1}^{n} R_{n,i}\, E\left[W_{n,i}^{XX}\right]\right)^{-1} \sum_{i=1}^{n} R_{n,i}\, E\left[W_{n,i}^{XX}\right] \varphi_{n,i},$$
Comment 7. Here, we provide a simple example that shows how the results in Theorems 1 and 2 may not hold in the absence of Assumption 7. Consider the population with three units described in Table 3 (where, for simplicity, we drop the subscript n). In this example, $E[U_i] = 3bZ_i^2 - 2b$
is a non-linear function of $Z_i$. Notice that
$$\sum_{i=1}^{3} E[U_i]/3 = \sum_{i=1}^{3} E[U_i]\, Z_i/3 = 0,$$
so that Xi = Ui . Therefore, E[Xi2 ] = E[Ui2 ]. Also, because potential outcomes do not depend
on $X_i$, it follows that $E[X_i Y_i] = E[X_i]\,Y_i = E[U_i]\,Y_i$. As a result,
$$\theta^{\mathrm{causal}} = \left(\sum_{i=1}^{3} E[X_i^2]\right)^{-1} \sum_{i=1}^{3} E[X_i Y_i] = \frac{ab}{2b^2 + 1},$$
which is different from zero as long as ab ≠ 0. In this example all the potential outcome functions
Yi (x) are flat as a function of x, so all unit-level causal effects of the type Yi (x) − Yi (x′ ) are zero,
and yet the causal least squares estimand can be positive or negative depending on the values
of a and b.
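The same phenomenon is easy to reproduce by simulation. The population below is our own construction (not the one in Table 3, whose values we do not reproduce): the outcome does not depend on the cause at all, so every unit-level effect is zero, yet because $E[U_i]$ is non-linear in $Z_i$ the least squares estimand on the cause is clearly non-zero:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 2000, 200
Z = rng.normal(0.0, 1.0, n)
Y = Z**2                    # flat in the cause: all unit-level effects are zero
EU = Z**2 - 1.0             # E[U_i] non-linear in Z_i (violates Assumption 7)

thetas = []
for _ in range(reps):
    U = EU + rng.normal(0.0, 0.5, n)           # draw causes around their means
    D = np.column_stack([U, np.ones(n), Z])    # regress Y on cause, intercept, Z
    thetas.append(np.linalg.lstsq(D, Y, rcond=None)[0][0])
theta_hat = float(np.mean(thetas))
print(round(theta_hat, 3))
```

Controlling for $Z_i$ linearly does not remove the bias here, because the component of $E[U_i]$ correlated with the outcome is a non-linear function of $Z_i$.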
$$\varepsilon_{n,i} = Y_{n,i} - X_{n,i}'\theta_n^{\mathrm{causal}} - Z_{n,i}'\gamma_n^{\mathrm{causal}}. \qquad (3.6)$$
Comment 8. The definition of the residuals, εn,1, . . . , εn,n , mirrors that in conventional regres-
sion analysis, but their properties are conceptually different. For instance, the residuals need
not be stochastic. If they are stochastic, they are so because of their dependence on X n .
Under the assumption that the Xn,i are jointly independent (but not necessarily identically
distributed), the n products Xn,i εn,i are jointly independent but not identically distributed.
Most importantly, in general the expectations E[Xn,i εn,i ] may vary across i, and need not all
be zero. However, as shown in Section 3.2, the averages of these expectations over the entire
population are guaranteed to be zero by the definition of (θncausal , γncausal ). Define the limits of
the population variance,
$$\Delta^{\mathrm{cond}} = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathrm{var}\left(X_{n,i}\varepsilon_{n,i}\right),$$
The difference between ∆ehw and ∆cond is the limit of the average outer product of the means,
$$\Delta^{\mu} = \Delta^{\mathrm{ehw}} - \Delta^{\mathrm{cond}} = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E[X_{n,i}\varepsilon_{n,i}]\, E[X_{n,i}\varepsilon_{n,i}]',$$
Assumption 9. (Existence of Limits) ∆cond and ∆ehw exist and are positive definite.
Theorem 3. Suppose Assumptions 3–9 hold, and let $H = \Omega^{XX} = \lim_{n\to\infty}\Omega_n^{XX}$. Then,
(i) $\sqrt{N}\left(\hat\theta_n - \theta_n^{\mathrm{causal}}\right) \xrightarrow{\,d\,} \mathcal{N}\left(0,\; H^{-1}\left(\rho\,\Delta^{\mathrm{cond}} + (1-\rho)\,\Delta^{\mathrm{ehw}}\right)H^{-1}\right),$
(ii) $\sqrt{N}\left(\hat\theta_n - \theta_n^{\mathrm{causal,sample}}\right) \xrightarrow{\,d\,} \mathcal{N}\left(0,\; H^{-1}\Delta^{\mathrm{cond}}H^{-1}\right),$
(iii) $\sqrt{N}\left(\hat\theta_n - \theta_n^{\mathrm{descr}}\right) \xrightarrow{\,d\,} \mathcal{N}\left(0,\; (1-\rho)\,H^{-1}\Delta^{\mathrm{ehw}}H^{-1}\right).$
Comment 9. For both the population causal and the descriptive estimand, the asymptotic variance in the case with ρ = 0 reduces to the standard EHW variance, $H^{-1}\Delta^{\mathrm{ehw}}H^{-1}$. If the sample size is non-negligible as a fraction of the population size, ρ > 0, the difference between the EHW variance and the finite population causal variance is positive semi-definite and equal to $\rho H^{-1}\left(\Delta^{\mathrm{ehw}} - \Delta^{\mathrm{cond}}\right)H^{-1}$.
3.5 The Variance Under Correct Specification
Consider a constant treatment effect assumption, which is required for a correct specification of
a linear regression function as a function that describes potential outcomes.
$$Y_{n,i} = U_{n,i}'\theta_n + \xi_{n,i},$$
we obtain that equation (3.6) holds for $\gamma_n^{\mathrm{causal}} = \Lambda_n'\theta_n + \lambda_n$ and $\varepsilon_{n,i} = \xi_{n,i} - Z_{n,i}'\lambda_n$. In this case, the residuals, $\varepsilon_{n,i}$, are non-stochastic. As a result, $E[X_{n,i}\varepsilon_{n,i}] = E[X_{n,i}]\,\varepsilon_{n,i} = 0$, which implies $\Delta^{\mu} = \Delta^{\mathrm{ehw}} - \Delta^{\mathrm{cond}} = 0$. This leads to the following result.
Notice that the result of the theorem applies also with θncausal,sample replacing θncausal because
the two parameter vectors are identical (with probability approaching one) under Assumption
10.
Comment 10. The key insight in this theorem is that the asymptotic variance of θbn does not
depend on the ratio of the sample to the population size when the regression function is correctly
specified. Therefore, it follows that the usual EHW variance matrix is correct for θbn under these
assumptions. For the case with Xn,i binary and no attributes beyond the intercept, this result
can be inferred directly from Neyman’s results for randomized experiments (Neyman, 1923).
In that case, the result of Theorem 4 follows from the restriction of constant treatment effects,
Yn,i (1) − Yn,i (0) = θn , which is extended to the more general case of non-binary regressors in
Assumption 10. The asymptotic variance of $\hat\gamma_n$, the least squares estimator of the coefficients on the attributes, still depends on the ratio of sample to population size, and it can be shown that the conventional robust EHW estimator continues to over-estimate the variance of $\hat\gamma_n$.
Then one can estimate H as the average of the matrix of outer products over the sample:
$$\hat H_n = \frac{1}{N}\sum_{i=1}^{n} R_{n,i}\left(U_{n,i} - \hat\Lambda_n Z_{n,i}\right)\left(U_{n,i} - \hat\Lambda_n Z_{n,i}\right)'.$$
It is also straightforward to estimate $\Delta^{\mathrm{ehw}}$. First we estimate the residuals for the units in the sample, $\hat\varepsilon_{n,i} = Y_{n,i} - (U_{n,i} - \hat\Lambda_n Z_{n,i})'\hat\theta_n - Z_{n,i}'\hat\gamma_n$, and then we estimate $\Delta^{\mathrm{ehw}}$ as:
$$\hat\Delta_n^{\mathrm{ehw}} = \frac{1}{N}\sum_{i=1}^{n} R_{n,i}\left(U_{n,i} - \hat\Lambda_n Z_{n,i}\right)\hat\varepsilon_{n,i}^2\left(U_{n,i} - \hat\Lambda_n Z_{n,i}\right)'.$$
$$\hat V_n^{\mathrm{ehw}} = \hat H_n^{-1}\,\hat\Delta_n^{\mathrm{ehw}}\,\hat H_n^{-1}, \qquad \hat V_n^{\mathrm{ehw}} \xrightarrow{\,p\,} V^{\mathrm{ehw}}.$$
Alternatively one can use resampling methods such as the bootstrap (e.g., Efron, 1987).
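Putting the pieces together, a minimal implementation of $\hat H_n$, $\hat\Delta_n^{ehw}$, and the sandwich $\hat V_n^{ehw}$ might look as follows. This is a sketch on simulated data with a scalar cause; estimating the plug-in $\hat\Lambda_n$ from the sampled units is our own simplification:

```python
import numpy as np

rng = np.random.default_rng(8)
n, rho = 500, 0.6
Z = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])   # attributes
U = rng.normal(0.0, 1.0, n) + 0.5 * Z[:, 1]                  # scalar cause
Y = 1.5 * U + Z @ np.array([0.5, -1.0]) + rng.normal(0.0, 1.0, n)
R = rng.random(n) < rho
N = int(R.sum())

Zs, Us, Ys = Z[R], U[R], Y[R]
Lam_hat = np.linalg.lstsq(Zs, Us, rcond=None)[0]   # plug-in for Lambda_n
Xs = Us - Zs @ Lam_hat                             # recentered cause
coef = np.linalg.lstsq(np.column_stack([Xs, Zs]), Ys, rcond=None)[0]
theta_hat, gamma_hat = coef[0], coef[1:]
eps_hat = Ys - Xs * theta_hat - Zs @ gamma_hat     # estimated residuals

H_hat = float((Xs * Xs).mean())                    # scalar version of H_hat_n
Delta_ehw = float((Xs**2 * eps_hat**2).mean())     # scalar Delta_hat^ehw
V_ehw = Delta_ehw / H_hat**2                       # sandwich
se = float(np.sqrt(V_ehw / N))
print(round(float(theta_hat), 3), round(se, 3))
```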
It is more challenging to estimate ∆cond. The reason is the same that makes it impossible to obtain unbiased estimates of the variance of the estimator for the average treatment effect in the example in Section 2. In that case there are three terms in the expression for the variance in equation (2.4). The first two are straightforward to estimate, but the third one, S²θ/n, cannot be estimated consistently because we do not observe both potential outcomes for the same units. Often, researchers use the conservative estimator based on ignoring S²θ/n. If we proceed in the same fashion for the regression context of Section 3, we obtain the conservative estimator V̂ehw, based on ignoring ∆µ. We show, however, that in the presence of attributes we can
improve the variance estimator. We build on Abadie and Imbens (2008), Abadie et al. (2014), and Fogarty (2016), who, in contexts different from the one studied in this article, have used the explanatory power of attributes to improve variance estimators. While Abadie and Imbens (2008) and Abadie et al. (2014) use nearest-neighbor techniques, here we follow Fogarty (2016) and
apply linear regression techniques. The proposed estimator replaces the expectations E[Xn,i εn,i], which cannot be consistently estimated, with predictors from a linear least squares projection of estimates of Xn,i εn,i on the attributes, Zn,i. Let X̂n,i = Un,i − Λ̂n Zn,i, and

Ĝn = ( (1/N) Σ_{i=1}^n Rn,i X̂n,i ε̂n,i Z′n,i ) ( (1/N) Σ_{i=1}^n Rn,i Zn,i Z′n,i )⁻¹.
Assumption 11. The average

(1/n) Σ_{i=1}^n E[Xn,i εn,i] Z′n,i

has a limit.

Define the estimator

∆̂n^Z = (1/N) Σ_{i=1}^n Rn,i (X̂n,i ε̂n,i − Ĝn Zn,i)(X̂n,i ε̂n,i − Ĝn Zn,i)′,
which uses Ĝn Zn,i in lieu of a consistent estimator of E[Xn,i εn,i]. Notice that we do not assume that E[Xn,i εn,i] is linear in Zn,i. However, we will show that, as long as the attributes can linearly explain some of the variance in X̂n,i ε̂n,i, the estimator ∆̂n^Z is smaller (in a matrix sense) than ∆̂n^ehw. Moreover, ∆̂n^Z remains conservative in large samples. These results are provided in
Lemma 3. Suppose Assumptions 3-7, 9 and 11 hold with δ = 4. Then, 0 ≤ ∆̂n^Z ≤ ∆̂n^ehw, and ∆̂n^Z →p ∆Z, where ∆cond ≤ ∆Z ≤ ∆ehw (all inequalities are to be understood in a matrix sense).
Variance estimators follow immediately from Lemma 3 by replacing ∆cond with the estimate ∆̂n^Z in the asymptotic variance formulas of Theorem 3. These estimators are not larger (and typically smaller) than estimators based on ∆̂n^ehw, and they remain conservative in large samples. For simplicity, Lemma 3 is based on a linear predictor for E[Xn,i εn,i]. Modifications that accommodate nonlinear predictors are immediate, at the cost of additional assumptions.
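The first claim of Lemma 3, 0 ≤ ∆̂n^Z ≤ ∆̂n^ehw, holds by construction, since ∆̂n^Z subtracts a positive semi-definite projection term from ∆̂n^ehw. A minimal numerical sketch (the data and names here are illustrative stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
Z = np.column_stack([np.ones(N), rng.normal(size=N)])   # attributes
Xhat = rng.normal(size=(N, 2))                          # stands in for residualized regressors
eps = rng.normal(size=N)                                # stands in for estimated residuals

M = Xhat * eps[:, None]                                 # rows are Xhat_i * eps_hat_i

# G_hat: least-squares projection of Xhat_i eps_hat_i on Z_i
G = np.linalg.solve(Z.T @ Z, Z.T @ M).T

Delta_ehw = M.T @ M / N                                 # unadjusted middle matrix
Resid = M - Z @ G.T                                     # projection residuals
Delta_Z = Resid.T @ Resid / N                           # adjusted middle matrix

# Delta_ehw - Delta_Z equals the projection term, which is positive semi-definite
eigs = np.linalg.eigvalsh(Delta_ehw - Delta_Z)
```

The eigenvalues of ∆̂n^ehw − ∆̂n^Z are non-negative (up to floating-point error), which is exactly the matrix-sense inequality in Lemma 3.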
Comment 11. A special case of the adjusted variance estimate is an estimate obtained from stratifying the sample on the basis of attributes Zn,i. In particular, if Zn,i includes exhaustive, mutually exclusive dummy variables – or, if we reduce the information in Zn,i down to such indicators – then ∆̂n^Z reduces to the middle of the sandwich in a commonly used estimator in the context of standard stratified sampling (see, for example, Wooldridge (2010), Section 20.2.2). Then, the residuals from regressing X̂n,i ε̂n,i on Zn,i are simply stratum-specific demeaned versions of X̂n,i ε̂n,i. Such a variance estimator is easy to obtain using standard software packages that support regression with survey samples.
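The claim that, with exhaustive and mutually exclusive dummies, the projection residuals coincide with stratum-specific demeaning can be checked directly; in this sketch the scalar v stands in for X̂n,i ε̂n,i, and the three-stratum design is ours:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
strata = rng.integers(0, 3, size=N)      # three strata (illustrative)
Z = np.eye(3)[strata]                    # exhaustive, mutually exclusive dummies
v = rng.normal(size=N)                   # stands in for Xhat_i * eps_hat_i (scalar case)

# Least-squares residuals of v on the stratum dummies ...
coef = np.linalg.solve(Z.T @ Z, Z.T @ v)
resid_ls = v - Z @ coef

# ... coincide with stratum-specific demeaning
resid_demeaned = v.copy()
for s in range(3):
    mask = strata == s
    resid_demeaned[mask] -= v[mask].mean()
```

With dummy regressors, Z′Z is the diagonal matrix of stratum counts and the fitted values are the stratum means, which is why the two residual vectors agree.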
intended to inform policy, the object of interest depends on future, not simply on past, outcomes.
This creates substantial problems for inference. Here we discuss some of the complications, but
much of this is left for future work. Our two main points are, first, that it is important to
be explicit about the estimand, and second, that the conventional robust standard errors were
not designed to solve these problems and do not do so without strong, typically implausible,
assumptions.
Formally, questions that involve future values of outcomes for countries could be formulated in terms of a population of interest that includes as its units each country in a variety of different states of the world that might be realized in future years. This population is large if there are many possible realizations of states of the world (e.g., rainfall, local political conditions, natural resource discoveries, etc.), with a potentially complex dependence structure. Given such a population, the researcher may wish to estimate, say, the difference in average 2020 outcomes for two sets of countries, and calculate standard errors based on values for the outcomes for the same set of countries in an earlier year, say 2013. A natural estimator for the difference
in average values for Northern and Southern countries in 2020 would be the corresponding
difference in average values in 2013. However, even though such data would allow us to infer
without uncertainty the difference in average outcomes for Northern and Southern countries in
2013, there would be uncertainty regarding the true value of that difference in the year 2020. In
order to construct confidence intervals for the difference in 2020, the researcher must make some
assumptions about how country outcomes will vary from year to year. An extreme assumption is
that outcomes in 2013 and 2020 for the same country are independent conditional on attributes,
which would justify the conventional EHW variance estimator. However, assuming that there
is no correlation between outcomes for the same country in successive years appears highly
implausible. In fact, in the absence of direct information in the form of panel data, any assumption about the magnitude of this correlation would appear to be controversial. Such
assumptions would also depend heavily on the future year for which we would wish to estimate
the difference in averages, again highlighting the importance of being precise about the estimand.
Although in this case there is uncertainty regarding the difference in average outcomes in 2020 despite the fact that the researcher observes (some) information on all countries in the population of interest, we emphasize that the assumptions required to validate the application of EHW standard errors in this setting are strong and arguably implausible. Moreover, researchers rarely formally state the population of interest, let alone state and justify the assumptions that justify inference.
Generally, if future predictions are truly the primary question of interest, it seems prudent
to explicitly state the assumptions that justify particular calculations for standard errors. Es-
pecially in the absence of panel data, the results are likely to be sensitive to such assumptions.
With panel data the researcher may be able to estimate the dynamic process underlying the
potential outcomes in order to obtain standard errors for the future predictions. In practice
it may be useful to report standard errors for various estimands. For example, if the primary
estimand is an average causal effect in the future, it may still be useful to report estimates
and standard errors for the same contemporaneous average causal effect, in combination with
estimates and standard errors for the future average causal effect, in order to understand the
additional uncertainty that comes with predictions for a future period. We leave this direction
for future work.
6 Conclusion
In this article we study the interpretation of standard errors in regression analysis when the
assumption that the sample is drawn randomly from a large population of interest is not at-
tractive. The conventional robust standard errors justified by the random sampling assumption
do not necessarily apply in this case. We show that, by viewing covariates as potential causes
in a Rubin Causal Model or potential outcome framework, we can provide a coherent interpre-
tation for standard errors that allows for uncertainty coming from both random sampling and
from conditional random assignment. The proposed standard errors may be different from the
conventional ones.
In the current article we focus exclusively on regression models, and we provide a full analysis
of inference for only a certain class of regression models with some of the covariates causal and
some attributes. Thus, this article is only a first step in a broader research program. The
concerns we have raised in this article arise in many other settings and for other kinds of
hypotheses, and the implications would need to be worked out for those settings. Section 5
suggests some directions we think are particularly natural to consider.
Appendix
I. A Bayesian Approach
Given that we are advocating for a different conceptual approach to modeling inference, it is useful to
look at the problem from more than one perspective. In this section we consider a Bayesian perspective
and re-analyze the example from Section 2. Using a simple parametric model we show that in a Bayesian
approach the same issues arise in the choice of estimand. Viewing the problem from a Bayesian
perspective reinforces the point that formally modeling the population and the sampling process leads
to the conclusion that inference is different for descriptive and causal questions. Note that in this
discussion the notation will necessarily be slightly different from the rest of the article; notation and
assumptions introduced in this subsection apply only within this subsection.
Define Yⁿ(1) and Yⁿ(0) to be the n-vectors with typical elements Yi(1) and Yi(0), respectively. We view the n-vectors Yⁿ(1), Yⁿ(0), Rⁿ, and Xⁿ as random variables, some observed and some unobserved. We assume the rows of the n × 4 matrix [Yⁿ(1), Yⁿ(0), Rⁿ, Xⁿ] are exchangeable. Then, by appealing to DeFinetti's theorem, we model this, with no essential loss of generality (for large n), as the product of n independent and identically distributed random quadruples (Yi(1), Yi(0), Ri, Xi) given some unknown parameter β:

f(Yⁿ(1), Yⁿ(0), Rⁿ, Xⁿ | β) = Π_{i=1}^n f(Yi(1), Yi(0), Ri, Xi | β).

Inference then proceeds by specifying a prior distribution for β, say p(β). To make this specific, consider the following model. Let Xi and Ri have binomial distributions with parameters q and ρ:

Pr(Xi = 1 | Yi(1), Yi(0), Ri) = q,   Pr(Ri = 1 | Yi(1), Yi(0)) = ρ.

The pairs (Yi(1), Yi(0)) are assumed to be jointly normally distributed:

(Yi(1), Yi(0))′ | µ1, µ0, σ1², σ0², κ ∼ N( (µ1, µ0)′ , ( σ1² , κσ1σ0 ; κσ1σ0 , σ0² ) ).
It is interesting to compare these estimands to an additional estimand, the super-population average
treatment effect,
θ causal = µ1 − µ0 .
In general these three estimands are distinct, with their own posterior distributions, but in some cases,
notably when n is large, the three posterior distributions are similar.
It is instructive to consider a very simple case in which analytic solutions for the posterior distributions of θn^descr, θn^causal, and θ^causal are available. Suppose σ1², σ0², κ and q are known, so that the only unknown parameters are the two means µ1 and µ0. Finally, let us use independent, diffuse (improper) prior distributions for µ1 and µ0.
Then, a standard result is that the posterior distribution for (µ1, µ0) given (Rⁿ, Xⁿ, Ỹⁿ) is

(µ1, µ0)′ | Rⁿ, Xⁿ, Ỹⁿ ∼ N( (Ȳ1, Ȳ0)′ , diag( σ1²/N1 , σ0²/N0 ) ),

where N1 is the number of units with Ri = 1 and Xi = 1, and N0 is the number of units with Ri = 1 and Xi = 0. This directly leads to the posterior distribution for θ^causal:

θ^causal | Rⁿ, Xⁿ, Ỹⁿ ∼ N( Ȳ1 − Ȳ0 , σ1²/N1 + σ0²/N0 ).

A longer calculation leads to the posterior distribution for the descriptive estimand:

θn^descr | Rⁿ, Xⁿ, Ỹⁿ ∼ N( Ȳ1 − Ȳ0 , (σ1²/N1)(1 − N1/n1) + (σ0²/N0)(1 − N0/n0) ).
The implied posterior interval for θn^descr is very similar to the corresponding confidence interval based on the normal approximation to the sampling distribution for Ȳ1 − Ȳ0. If n1 and n0 are large, this posterior distribution is close to the posterior distribution of the causal estimand. If, on the other hand, N1 = n1 and N0 = n0, then the posterior distribution of the descriptive estimand becomes degenerate and centered at Ȳ1 − Ȳ0.
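The two limiting cases just described can be checked numerically from the posterior variance of θn^descr; the function name and the parameter values below are ours:

```python
import numpy as np

def var_theta_descr(s1, s0, N1, N0, n1, n0):
    # Posterior variance of the descriptive estimand theta_n^descr:
    # (sigma1^2/N1)(1 - N1/n1) + (sigma0^2/N0)(1 - N0/n0)
    return (s1**2 / N1) * (1 - N1 / n1) + (s0**2 / N0) * (1 - N0 / n0)

# Large population relative to the sample: close to the posterior
# variance of the causal estimand, sigma1^2/N1 + sigma0^2/N0
v_large = var_theta_descr(2.0, 1.5, N1=50, N0=70, n1=10**6, n0=10**6)

# Full enumeration (N1 = n1, N0 = n0): the posterior degenerates
v_degenerate = var_theta_descr(2.0, 1.5, N1=50, N0=70, n1=50, n0=70)
```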
A somewhat longer calculation for θn^causal leads to

θn^causal | Rⁿ, Xⁿ, Ỹⁿ ∼ N( Ȳ1 − Ȳ0 ,
  (N0/n²) σ1² (1 − κ²) + (N1/n²) σ0² (1 − κ²)
  + ((n − N)/n²) σ1² + ((n − N)/n²) σ0² − 2 ((n − N)/n²) κσ1σ0
  + (σ1²/N1) (1 − (1 − κ σ0/σ1) N1/n)² + (σ0²/N0) (1 − (1 − κ σ1/σ0) N0/n)² ).
Consider the special case of constant treatment effects, where Yi (1) − Yi (0) = µ1 − µ0 . Then, κ = 1,
and σ1 = σ0 , and the posterior distribution of θncausal is the same as the posterior distribution of θ causal .
The same posterior distribution arises in the limit if n goes to infinity, regardless of the values of κ, σ1 ,
and σ0 .
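The consistency claims of the last two paragraphs — equality under constant treatment effects, and convergence as n goes to infinity — can be verified by coding the posterior variances directly. The function names and numerical values below are ours, and the θn^causal expression is transcribed from the display above:

```python
import numpy as np

def var_theta_causal(s1, s0, N1, N0):
    # Posterior variance of the super-population estimand theta^causal
    return s1**2 / N1 + s0**2 / N0

def var_theta_causal_n(s1, s0, kappa, n, N1, N0):
    # Posterior variance of the finite-population estimand theta_n^causal
    N = N1 + N0
    v = (N0 / n**2) * s1**2 * (1 - kappa**2) + (N1 / n**2) * s0**2 * (1 - kappa**2)
    v += ((n - N) / n**2) * (s1**2 + s0**2 - 2 * kappa * s1 * s0)
    v += (s1**2 / N1) * (1 - (1 - kappa * s0 / s1) * N1 / n) ** 2
    v += (s0**2 / N0) * (1 - (1 - kappa * s1 / s0) * N0 / n) ** 2
    return v

# Constant treatment effects (kappa = 1, sigma1 = sigma0): the two variances coincide
v_const = var_theta_causal_n(2.0, 2.0, kappa=1.0, n=1000, N1=50, N0=70)

# n -> infinity: theta_n^causal inherits the super-population posterior variance
v_limit = var_theta_causal_n(2.0, 1.5, kappa=0.3, n=10**8, N1=50, N0=70)
```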
To sum up, if the population is large, relative to the sample, the posterior distributions of θn^descr, θn^causal and θ^causal agree. However, if the population is small, the three posterior distributions differ, and the researcher needs to be precise in defining the estimand. In such cases, simply focusing on the super-population estimand θ^causal = µ1 − µ0 is arguably not appropriate, and the posterior inferences for such estimands will differ from those for other estimands such as θn^causal or θn^descr.
II. Proofs
Proof of Theorem 1: Under the stated conditions, the matrices Σ_{i=1}^n Zn,i Z′n,i and Σ_{i=1}^n Rn,i Zn,i Z′n,i are invertible with probability approaching one. As a result, with probability approaching one,

Bn = ( Σ_{i=1}^n Rn,i E[Un,i] Z′n,i ) ( Σ_{i=1}^n Rn,i Zn,i Z′n,i )⁻¹ = ( Σ_{i=1}^n E[Un,i] Z′n,i ) ( Σ_{i=1}^n Zn,i Z′n,i )⁻¹ = Λn.
As a result, we obtain

θn^causal = ( Σ_{i=1}^n E[Xn,i X′n,i] )⁻¹ Σ_{i=1}^n E[Xn,i Yn,i],

and

θn^causal,sample = ( Σ_{i=1}^n Rn,i E[Xn,i X′n,i] )⁻¹ Σ_{i=1}^n Rn,i E[Xn,i Yn,i].

Now,

E[Xn,i Yn,i] = E[Xn,i U′n,i] θn,i + E[Xn,i] ξn,i = E[Xn,i X′n,i] θn,i.
Proof of Theorem 2: Let ∇Yn,i(·) be the gradient of Yn,i(·). By the mean value theorem there exist sets Tn,i ⊆ [0, 1] such that for any tn,i ∈ Tn,i, we have Yn,i(Un,i) = Yn,i(Bn Zn,i) + X′n,i ∇Yn,i(Bn Zn,i + tn,i Xn,i). We define ϕn,i = ∇Yn,i(vn,i), where vn,i = Bn Zn,i + t̄n,i Xn,i and t̄n,i = sup Tn,i. Now, E[Xn,i Yn,i] = E[Xn,i] Yn,i(Bn Zn,i) + E[Xn,i X′n,i ϕn,i] = E[Xn,i X′n,i ϕn,i]. The rest of the proof is as for Theorem 1.
Lemma A.1. Let Vn,i be a row-wise independent triangular array and µn,i = E[Vn,i]. Suppose that Rn,1, . . . , Rn,n are independent of Vn,1, . . . , Vn,n and that Assumption 4 holds. Moreover, assume that

(1/n) Σ_{i=1}^n E[|Vn,i|^{2+δ}]

is bounded for some δ > 0,

Σ_{i=1}^n µn,i = 0,   (A.1)

(1/n) Σ_{i=1}^n var(Vn,i) → σ²,

and

(1/n) Σ_{i=1}^n µ²n,i → κ²,

and

var(N/(nρn)) = nρn(1 − ρn)/(nρn)² → 0.

Let

s²n = (1/n) Σ_{i=1}^n [ var(Vn,i) + (1 − ρn) µ²n,i ].

and

var(Rn,i Vn,i − ρn µn,i) = ρn E[V²n,i] − ρ²n µ²n,i = ρn ( var(Vn,i) + (1 − ρn) µ²n,i ).

Therefore,

var( (1/(sn √(nρn))) Σ_{i=1}^n (Rn,i Vn,i − ρn µn,i) ) = 1.
Using ρn ≤ ρn^{1/(2+δ)}, |µn,i|^{2+δ} ≤ E[|Vn,i|^{2+δ}], and Minkowski's inequality, we obtain:

Σ_{i=1}^n E[ | (Rn,i Vn,i − ρn µn,i) / (sn √(nρn)) |^{2+δ} ]
  ≤ ( 1 / (sn^{2+δ} (nρn)^{1+δ/2}) ) Σ_{i=1}^n ( ρn^{1/(2+δ)} E[|Vn,i|^{2+δ}]^{1/(2+δ)} + ρn |µn,i| )^{2+δ}
  ≤ ( 2^{2+δ} ρn / (sn^{2+δ} (nρn)^{1+δ/2}) ) Σ_{i=1}^n E[|Vn,i|^{2+δ}]
  = ( 2^{2+δ} / (sn^{2+δ} (nρn)^{δ/2}) ) ( (1/n) Σ_{i=1}^n E[|Vn,i|^{2+δ}] ) → 0.
Lemma A.2. Suppose Assumptions 3-9 hold, and let ∆µ = ∆ehw − ∆cond, ε̃n,i = Yn,i − X′n,i θn^causal,sample − Z′n,i γn^causal,sample, and νn,i = Yn,i − X′n,i θn^descr − Z′n,i γn^descr. Then,

(i)  (1/√N) Σ_{i=1}^n Rn,i Xn,i εn,i →d N(0, ∆cond + (1 − ρ)∆µ),

(ii)  (1/√N) Σ_{i=1}^n Rn,i Xn,i ε̃n,i →d N(0, ∆cond),

(iii)  (1/√N) Σ_{i=1}^n Rn,i Xn,i νn,i →d N(0, (1 − ρ)∆ehw).
Proof of Lemma A.2: To prove (i), consider Vn,i = a′Xn,i εn,i for a ∈ R^k. We will verify the conditions of Lemma A.1. Notice that

(1/n) Σ_{i=1}^n E[|Vn,i|^{2+δ}] ≤ (‖a‖^{2+δ}/n) Σ_{i=1}^n E[ ‖Xn,i‖^{2+δ} ( |Yn,i| + ‖Xn,i‖‖θn‖ + ‖Zn,i‖‖γn‖ )^{2+δ} ].

By Minkowski's inequality and Assumption 5, the right-hand side of the last equation is bounded. In addition,

Σ_{i=1}^n µn,i = a′ Σ_{i=1}^n E[Xn,i εn,i] = 0.

Let a ≠ 0. Then,

(1/n) Σ_{i=1}^n var(Vn,i) = a′ ( (1/n) Σ_{i=1}^n var(Xn,i εn,i) ) a → a′ ∆cond a > 0,

and

(1/n) Σ_{i=1}^n µ²n,i = a′ ( (1/n) Σ_{i=1}^n E[Xn,i εn,i] E[εn,i X′n,i] ) a → a′ ∆µ a.

This implies

a′ ( (1/√N) Σ_{i=1}^n Rn,i Xn,i εn,i ) →d N(0, a′(∆cond + (1 − ρ)∆µ)a).
Therefore,

√N ( θ̂n − θn^causal ; γ̂n − γn^causal )
  = [ (1/N) Σ_{i=1}^n Rn,i ( Xn,i X′n,i , Xn,i Z′n,i ; Zn,i X′n,i , Zn,i Z′n,i ) ]⁻¹ (1/√N) Σ_{i=1}^n Rn,i ( Xn,i εn,i ; Zn,i εn,i )
  = ( Ωn^XX , Ωn^XZ ; Ωn^ZX , Ωn^ZZ )⁻¹ (1/√N) Σ_{i=1}^n Rn,i ( Xn,i εn,i ; Zn,i εn,i ) + rn,

where

rn = [ ( W̃n^XX , W̃n^XZ ; W̃n^ZX , W̃n^ZZ )⁻¹ − ( Ωn^XX , Ωn^XZ ; Ωn^ZX , Ωn^ZZ )⁻¹ ] (1/√N) Σ_{i=1}^n Rn,i ( Xn,i εn,i ; Zn,i εn,i ).

Because (i) Ωn^XZ = 0, (ii) the first factor of rn is op(1), and (iii) (1/√N) Σ_{i=1}^n Rn,i Xn,i εn,i is Op(1) (under the conditions stated above), (1/√N) Σ_{i=1}^n Rn,i Zn,i εn,i = Op(1) would imply

√N (θ̂n − θn^causal) = (Ωn^XX)⁻¹ (1/√N) Σ_{i=1}^n Rn,i Xn,i εn,i + op(1).
By Markov's inequality, it is enough to show that the second moment of (1/√N) Σ_{i=1}^n Rn,i Zn,i εn,i is uniformly bounded. As before, we will assign an arbitrary value of zero to this quantity for the case N = 0. Therefore,

E[ ‖ (1/√N) Σ_{i=1}^n Rn,i Zn,i εn,i ‖² ] = Σ_{i=1}^n E[ Rn,i/N | N > 0 ] Z′n,i E[ε²n,i] Zn,i.

Notice that

E[ Rn,i/N | N > 0 ] = Σ_{m=1}^n (m/n) Pr(N = m) / (m Pr(N > 0)) = 1/n,

so the second moment equals (1/n) Σ_{i=1}^n Z′n,i E[ε²n,i] Zn,i, which is uniformly bounded, as implied by Assumption 5. The proofs of (ii) and (iii) are analogous.
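The identity E[Rn,i/N | N > 0] = 1/n used in the last display can be illustrated with a quick Monte Carlo check; the Bernoulli sampling design and parameter values below are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho, sims = 10, 0.5, 200_000

# Draw R_i ~ Bernoulli(rho) independently, sims times
R = rng.random((sims, n)) < rho
Ntot = R.sum(axis=1)
keep = Ntot > 0                          # condition on N > 0

# Monte Carlo estimate of E[R_1 / N | N > 0]
est = (R[keep, 0] / Ntot[keep]).mean()
```

The estimate is close to 1/n, as the symmetry argument predicts: the Rn,i/N sum to one on the event N > 0, and each term has the same conditional expectation.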
Proof of Theorem 4: The result follows directly from E[Xn,i εn,i] = 0.
Proof of Lemma 2: First, notice that (with probability approaching one) Λn exists and is equal to Bn. This implies

Λ̂n − Λn = ( (1/N) Σ_{i=1}^n Rn,i Xn,i Z′n,i ) ( (1/N) Σ_{i=1}^n Rn,i Zn,i Z′n,i )⁻¹,

which converges to zero in probability by Lemma 1 and Assumption 6. Direct calculations yield

Ĥn − W̃n^XX = (Λ̂n − Λn) W̃n^ZZ (Λ̂n − Λn)′ − W̃n^XZ (Λ̂n − Λn)′ − (Λ̂n − Λn) W̃n^ZX →p 0.

Now, Lemma 1 and Assumption 6 imply Ĥn →p H, where H is full rank. Theorem 3 directly implies θ̂n − θn^causal →p 0, and γ̂n − γn^causal →p 0 follows from Lemma 1. Let

∆̆n^ehw = (1/N) Σ_{i=1}^n Rn,i Xn,i ε̂²n,i X′n,i,   ∆̃n^ehw = (1/N) Σ_{i=1}^n Rn,i Xn,i ε²n,i X′n,i,

and

∆n^ehw = (1/n) Σ_{i=1}^n E[Xn,i ε²n,i X′n,i].
Let α be a multi-index of dimension equal to the length of Tn,i = (Yn,i : X′n,i : Z′n,i). In addition, let

T̃n^α = (1/N) Σ_{i=1}^n Rn,i Tn,i^α,

and

Ψn^α = (1/n) Σ_{i=1}^n E[Tn,i^α].

Using the same argument as in the proof of Lemma 1 and given that Assumption 5 holds with δ = 4, it follows that T̃n^α − Ψn^α →p 0 for |α| ≤ 4. This result directly implies ∆̃n^ehw − ∆n^ehw →p 0. By the same argument plus convergence of θ̂n and γ̂n, it follows that ∆̂n^ehw − ∆̆n^ehw →p 0 and ∆̆n^ehw − ∆̃n^ehw →p 0. Now, the result follows from

∆̂n^ehw − ∆ehw = (∆̂n^ehw − ∆̆n^ehw) + (∆̆n^ehw − ∆̃n^ehw) + (∆̃n^ehw − ∆n^ehw) + (∆n^ehw − ∆ehw) →p 0,

where the last difference goes to zero by Assumption 9.
Proof of Lemma 3: Notice that

∆̂n^Z = ∆̂n^ehw − ∆̂n^proj,   where ∆̂n^proj = (1/N) Σ_{i=1}^n Rn,i Ĝn Zn,i Z′n,i Ĝ′n,

so that ∆̂n^Z is no larger than ∆̂n^ehw in a matrix sense. Let

Gn = ( (1/n) Σ_{i=1}^n E[Xn,i εn,i] Z′n,i ) ( (1/n) Σ_{i=1}^n Zn,i Z′n,i )⁻¹,

and let

∆n^µ = (1/n) Σ_{i=1}^n E[Xn,i εn,i] E[εn,i X′n,i].

Notice that

∆n^µ − ∆n^proj = (1/n) Σ_{i=1}^n E[Xn,i εn,i] E[εn,i X′n,i] − ( (1/n) Σ_{i=1}^n E[Xn,i εn,i] Z′n,i ) ( (1/n) Σ_{i=1}^n Zn,i Z′n,i )⁻¹ ( (1/n) Σ_{i=1}^n Zn,i E[εn,i X′n,i] ),

which can be written as

∆n^µ − ∆n^proj = A′n (In − Dn (D′n Dn)⁻¹ D′n) An,

where An and Dn have i-th rows E[εn,i X′n,i]/√n and Z′n,i/√n, respectively. The matrix In − Dn(D′n Dn)⁻¹D′n is a projection and therefore positive semi-definite, so 0 ≤ ∆n^proj ≤ ∆n^µ, and hence

∆n^cond ≤ ∆n^Z ≤ ∆n^ehw,

where the inequalities are to be understood in a matrix sense. Now, it follows from Assumption 11 that Gn and, therefore, ∆n^proj and ∆n^Z have limits. Then,

∆cond ≤ ∆Z ≤ ∆ehw.
References
Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge (2014). Finite population causal standard
errors. Technical report, National Bureau of Economic Research.
Abadie, A. and G. W. Imbens (2008). Estimation of the conditional variance in paired experiments.
Annales d’Economie et de Statistique, 175–187.
Abadie, A., G. W. Imbens, and F. Zheng (2014). Inference for misspecified models with fixed regressors.
Journal of the American Statistical Association 109 (508), 1601–1614.
Angrist, J. and S. Pischke (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Angrist, J. D. (1998). Estimating the labor market impact of voluntary military service using social
security data on military applicants. Econometrica 66 (2), 249–288.
Aronow, P. M. and C. Samii (2016). Does regression produce representative estimates of causal effects?
American Journal of Political Science 60 (1), 250–267.
Davidson, J. (1994). Stochastic Limit Theory: An Introduction for Econometricians. Advanced Texts
in Econometrics. Oxford University Press.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic
Literature 48 (2), 424–455.
Efron, B. (1987). The Jackknife, the Bootstrap, and Other Resampling Plans, Volume 38 of CBMS-NSF
Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In Proceedings
of the fifth Berkeley symposium on mathematical statistics and probability, Volume 1, pp. 59–82.
Fogarty, C. B. (2016). Regression assisted inference for the average treatment effect in paired experi-
ments. arXiv preprint arXiv:1612.05179 .
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Associa-
tion 81 (396), 945–970.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In
Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Volume 1,
pp. 221–233.
Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences.
Cambridge University Press.
Lin, W. (2013). Agnostic notes on regression adjustments for experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics 7 (1), 295–318.
Manski, C. F. (2013). Public policy in an uncertain world: analysis and decisions. Harvard University
Press.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational
studies for causal effects. Biometrika 70 (1), 41–55.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of Educational Psychology 66 (5), 688–701.
Shadish, W. R., T. D. Cook, and D. T. Campbell (2002). Experimental and quasi-experimental designs
for generalized causal inference. Houghton, Mifflin and Company.
Sloczyński, T. (2017). A general weighted average representation of the ordinary and two-stage least
squares estimands.
White, H. (1980a). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48 (4), 817–838.
White, H. (1980b). Using least squares to approximate unknown regression functions. International
Economic Review 21 (1), 149–170.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT press.