Scott Cunningham, Causal Inference (2020), Part Two


Probability and Regression Review

Numbers is hardly real and they never have feelings. But you push too hard, even numbers got limits.
Mos Def

Basic probability theory. In practice, causal inference is based on statistical models that range from the very simple to the extremely advanced. And building such models requires some rudimentary knowledge of probability theory, so let's begin with some definitions. A random process is a process that can be repeated many times with different outcomes each time. The sample space is the set of all the possible outcomes of a random process. We distinguish between discrete and continuous random processes (Table 1 below). Discrete processes produce integers, whereas continuous processes produce fractions as well.
We define independent events two ways. The first refers to logical independence. For instance, two events occur but there is no reason to believe that the two events affect each other. When it is assumed that they do affect each other, this is a logical fallacy called post hoc ergo propter hoc, which is Latin for "after this, therefore because of this." Calling this a fallacy recognizes that the temporal ordering of events is not sufficient to be able to say that the first thing caused the second.
Table 1. Examples of discrete and continuous random processes.

The second definition of an independent event is statistical independence. We'll illustrate the latter with an example from the idea of sampling with and without replacement. Let's use a randomly shuffled deck of cards for an example. For a deck of 52 cards, what is the probability that the first card will be an ace?

There are 52 possible outcomes in the sample space, or the set of all possible outcomes of the random process. Of those 52 possible outcomes, we are concerned with the frequency of an ace occurring. There are four aces in the deck, so $\Pr(\text{Ace}) = \frac{4}{52} = 0.077$.
Assume that the first card was an ace. Now we ask the question again. If we shuffle the deck, what is the probability the next card drawn is also an ace? It is no longer $\frac{1}{13}$ because we did not sample with replacement. We sampled without replacement. Thus the new probability is

$$\Pr\big(\text{Ace}_2 \mid \text{Ace}_1\big) = \frac{3}{51} = 0.059$$

Under sampling without replacement, the two events—ace on Card 1 and an ace on Card 2 if Card 1 was an ace—aren't independent events. To make the two events independent, you would have to put the ace back and shuffle the deck. So two events, A and B, are independent if and only if:

$$\Pr(A \mid B) = \Pr(A)$$
An example of two independent events would be rolling a 5 with one die


after having rolled a 3 with another die. The two events are independent, so
the probability of rolling a 5 is always 0.17 regardless of what we rolled on
the first die.1
But what if we want to know the probability of some event occurring that requires that multiple events first occur? For instance, let's say we're talking about the Cleveland Cavaliers winning the NBA championship. In 2016, the Golden State Warriors were up 3–1 in a best-of-seven playoff. What had to happen for the Warriors to lose the playoff? The Cavaliers had to win three in a row. In this instance, to find the probability, we take the product of all marginal probabilities, or $\Pr(\cdot)^n$, where $\Pr(\cdot)$ is the marginal probability of one event occurring, and n is the number of repetitions of that one event. If the unconditional probability of a Cleveland win is 0.5, and each game is independent, then the probability that Cleveland could come back from a 3–1 deficit is the product of each game's probability of winning:

$$\Pr(\text{Win})^3 = 0.5 \times 0.5 \times 0.5 = 0.125$$
Another example may be helpful. In Texas Hold'em poker, each player is dealt two cards facedown. When you are holding two of a kind, you say you have two "in the pocket." So, what is the probability of being dealt pocket aces? It's $\frac{4}{52} \times \frac{3}{51} = 0.0045$. That's right: it's 0.45%.
Let's formalize what we've been saying for a more generalized case. For independent events, to calculate joint probabilities, we multiply the marginal probabilities:

$$\Pr(A, B) = \Pr(A)\Pr(B)$$

where Pr(A,B) is the joint probability of both A and B occurring, and Pr(A) is the marginal probability of the A event occurring.
Now, for a slightly more difficult application. What is the probability of rolling a 7 using two six-sided dice, and is it the same as the probability of rolling a 3? To answer this, let's compare the two probabilities. We'll use a table to help explain the intuition. First, let's look at all the ways to get a 7 using two six-sided dice. There are 36 total possible outcomes ($6^2 = 36$) when rolling two dice. In Table 2 we see that there are six different ways to roll a 7 using only two dice. So the probability of rolling a 7 is 6/36 = 16.67%. Next, let's look at all the ways to roll a 3 using two six-sided dice. Table 3 shows that there are only two ways to get a 3 rolling two six-sided dice. So the probability of rolling a 3 is 2/36 = 5.56%. So, no, the probabilities of rolling a 7 and rolling a 3 are different.
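To see the counting argument in code, here is a minimal sketch in Stata (the variable names are mine) that enumerates all 36 equally likely rolls and tabulates the sums:

    clear
    set obs 36
    * Enumerate every ordered pair (die1, die2) from observations 1-36
    gen die1 = ceil(_n/6)
    gen die2 = mod(_n - 1, 6) + 1
    gen total = die1 + die2
    * 6 of 36 rolls sum to 7 (16.67%); 2 of 36 sum to 3 (5.56%)
    tab total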
Table 2. Total number of ways to get a 7 with two six-sided dice.

Table 3. Total number of ways to get a 3 using two six-sided dice.

Events and conditional probability. First, before we talk about the three ways of representing a probability, I'd like to introduce some new terminology and concepts: events and conditional probabilities. Let A be some event. And let B be some other event. For two events, there are four possibilities.
1. A and B: Both A and B occur.
2. ∼ A and B: A does not occur, but B occurs.
3. A and ∼ B: A occurs, but B does not occur.
4. ∼ A and ∼ B: Neither A nor B occurs.

I’ll use a couple of different examples to illustrate how to represent a


probability.

Probability tree. Let's think about a situation in which you are trying to get your driver's license. Suppose that in order to get a driver's license, you have to pass the written exam and the driving exam. However, if you fail the written exam, you're not allowed to take the driving exam. We can represent these two events in a probability tree.
Probability trees are intuitive and easy to interpret.2 First, we see that the probability of passing the written exam is 0.75 and the probability of failing the exam is 0.25. Second, at every branching off from a node, the probabilities associated with that node's branches sum to 1.0. The joint probabilities also all sum to 1.0. This is called the law of total probability, and it says that the marginal probability of A equals the sum of the joint probabilities of A and all the $B_n$ events:

$$\Pr(A) = \sum_n \Pr(A \cap B_n)$$

We also see the concept of a conditional probability in the driver's license tree. For instance, the probability of failing the driving exam, conditional on having passed the written exam, is represented as Pr(Fail | Pass) = 0.45.
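To make this concrete with the tree's own numbers: passing the driving exam conditional on passing the written exam must be 1 − 0.45 = 0.55, so the three terminal joint probabilities sum to one, as the law of total probability requires:

$$\Pr(\text{Pass}, \text{Pass}) + \Pr(\text{Pass}, \text{Fail}) + \Pr(\text{Fail}) = 0.75 \times 0.55 + 0.75 \times 0.45 + 0.25 = 1.0$$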

Venn diagrams and sets. A second way to represent multiple events occurring is with a Venn diagram. Venn diagrams were first conceived by John Venn in 1880. They are used to teach elementary set theory, as well as to express set relationships in probability and statistics. This example will involve two sets, A and B.
The University of Texas's football coach has been on the razor's edge with the athletic director and regents all season. After several mediocre seasons, his future with the school is in jeopardy. If the Longhorns don't make it to a great bowl game, he likely won't be rehired. But if they do, then he likely will be rehired. Let's discuss elementary set theory using this coach's situation as our guiding example. But before we do, let's remind ourselves of our terms. A and B are events, and U is the universal set of which A and B are subsets. Let A be the event that the Longhorns get invited to a great bowl game and B be the event that their coach is rehired. Let Pr(A) = 0.6 and let Pr(B) = 0.8. Let the probability that both A and B occur be Pr(A,B) = 0.5.
Note that A + ∼A = U, where ∼A is the complement of A. The complement means that it is everything in the universal set that is not A. The same is said of B. The sum of B and ∼B = U. Therefore:

$$A + \sim A = B + \sim B = U$$
We can rewrite out the following definitions:

$$A = A \cap B + A \cap \sim B$$
$$B = A \cap B + \sim A \cap B$$

Whenever we want to describe a set of events in which either A or B could occur, we write A ∪ B, which is pronounced "A union B." The union is the new set that contains every element from A and every element from B. Any element that is in either set A or set B is also in the new union set. And whenever we want to describe a set of events that occurred together—the joint set—we write A ∩ B, which is pronounced "A intersect B." This new set contains every element that is in both the A and B sets. That is, only things inside both A and B get added to the new set.
Now let’s look closely at a relationship involving the set A.

Notice what this is saying: there are two ways to identify the A set. First,
you can look at all the instances where A occurs with ψ. But then what

about the rest of A that is not in ψ? →ell, that’s the A ψ situation, which
covers the rest of the A set.
A similar style of reasoning can help you understand the following
expression.

To get the A intersect ψ, we need three objects: the set of A units outside of
ψ, the set of ψ units outside A, and their joint set. You get all those, and you
have A ∩ ψ.
Now it is just simple addition to find all missing values. Recall that A is the Longhorns being invited to a great bowl game and Pr(A) = 0.6. And B is the probability that the coach is rehired, Pr(B) = 0.8. Also, Pr(A,B) = 0.5, which is the probability of both A and B occurring. Then we have:

$$\Pr(A, \sim B) = \Pr(A) - \Pr(A, B) = 0.1$$
$$\Pr(\sim A, B) = \Pr(B) - \Pr(A, B) = 0.3$$
$$\Pr(\sim A, \sim B) = 1 - 0.5 - 0.1 - 0.3 = 0.1$$

When working with sets, it is important to understand that probability is calculated by considering the share of the set (for example A) made up by the subset (for example A ∩ B). When we write down the probability that A ∩ B occurs at all, it is with regard to U. But what if we were to ask the question "What share of A is due to A ∩ B?" Notice, then, that we would need to do this:

$$? = \frac{\Pr(A, B)}{\Pr(A)} = \frac{0.5}{0.6} = 0.83$$
Table 4. Two-way contingency table.

I left this intentionally undefined on the left side so as to focus on the calculation itself. But now let's define what we are wanting to calculate: in a world where A has occurred, what is the probability that B will also occur? This is:

$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)} = \frac{0.5}{0.6} = 0.83$$

Notice that these conditional probabilities are not as easy to see in the Venn diagram. We are essentially asking what percentage of a subset—e.g., Pr(A)—is due to the joint set, for example, Pr(A,B). This reasoning is the very same reasoning used to define the concept of a conditional probability.
Contingency tables. Another way that we can represent events is with a contingency table. Contingency tables are also sometimes called two-way tables. Table 4 is an example of a contingency table. We continue with our example about the worried Texas coach.
Recall that Pr(A) = 0.6, Pr(B) = 0.8, and Pr(A,B) = 0.5. Note that to calculate conditional probabilities, we must know the frequency of the element in question (e.g., Pr(A,B)) relative to some other larger event (e.g., Pr(A)). So if we want to know the conditional probability of B given A, then it's:

$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)} = \frac{0.5}{0.6} = 0.83$$

But asking for the frequency of A ∩ B in a world where B occurs is to ask the following:

$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} = \frac{0.5}{0.8} = 0.63$$
So, we can use what we have done so far to write out a definition of joint probability. Let's start with a definition of conditional probability first. Given two events, A and B:

$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \quad (2.1)$$
$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)} \quad (2.2)$$

Using equations 2.1 and 2.2, I can simply write down a definition of joint probabilities:

$$\Pr(A, B) = \Pr(A \mid B)\Pr(B) \quad (2.3)$$
$$\Pr(B, A) = \Pr(B \mid A)\Pr(A) \quad (2.4)$$

And this is the formula for joint probability. Given equation 2.3, and using the definitions of Pr(A,B) and Pr(B,A) (which are equal), I can rearrange terms, make a substitution, and rewrite it as:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)} \quad (2.8)$$

Equation 2.8 is sometimes called the naive version of Bayes's rule. We will now decompose this equation more fully, though, by substituting equation 2.5, the decomposition $\Pr(B) = \Pr(A, B) + \Pr(\sim A, B)$, into equation 2.8:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(A, B) + \Pr(\sim A, B)} \quad (2.9)$$

Substituting equation 2.6, the joint-probability definition $\Pr(A, B) = \Pr(B \mid A)\Pr(A)$, into the denominator of equation 2.9 yields:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(\sim A, B)} \quad (2.10)$$

Finally, we note that using the definition of joint probability, Pr(B, ∼A) = Pr(B | ∼A)Pr(∼A), which we substitute into the denominator of equation 2.10 to get:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \sim A)\Pr(\sim A)} \quad (2.11)$$
That’s a mouthful of substitutions, so what does equation 2.11 mean? This


is the Bayesian decomposition version of Bayes’s rule. Let’s use our
example again of Texas making a great bowl game. A is Texas making a
great bowl game, and ψ is the coach getting rehired. And A∩ψ is the joint
probability that both events occur. →e can make each calculation using the
contingency tables. The questions here is this: If the Texas coach is rehired,
what’s the probability that the Longhorns made a great bowl game? Or
formally, Pr(A | ψ). →e can use the Bayesian decomposition to find this
probability.

Check this against the contingency table using the definition of joint
probability:

So, if the coach is rehired, there is a 63 percent chance we made a great


bowl game.3

Monty Hall example. Let's use a different example, the Monty Hall example. This is a fun one, because most people find it counterintuitive. It is even used to stump mathematicians and statisticians.4 But Bayes's rule makes the answer very clear—so clear, in fact, that it's somewhat surprising that Bayes's rule was actually once controversial [McGrayne, 2012].
Let’s assume three closed doors: door 1 (D1), door 2 (D2), and door 3
(D3). Behind one of the doors is a million dollars. Behind each of the other
two doors is a goat. Monty Hall, the game-show host in this example, asks
the contestants to pick a door. After they pick the door, but before he opens
the door they picked, he opens one of the other doors to reveal a goat. He
then asks the contestant, "Would you like to switch doors?"
A common response to Monty Hall’s offer is to say it makes no sense to
change doors, because there’s an equal chance that the million dollars is
behind either door. Therefore, why switch? There’s a 50–50 chance it’s
behind the door picked and there’s a 50–50 chance it’s behind the remaining
door, so it makes no rational sense to switch. Right? Yet, a little intuition
should tell you that’s not the right answer, because it would seem that when
Monty Hall opened that third door, he made a statement. But what exactly
did he say?
Let’s formalize the problem using our probability notation. Assume that
you chose door 1, D1. The probability that D1 had a million dollars when
you made that choice is Pr(D1 = 1 million) = 1/3. →e will call that event A1.
And the probability that D1 has a million dollars at the start of the game is
1/3 because the sample space is 3 doors, of which one has a million dollars
behind it. Thus, Pr(A1) = 1/3. Also, by the law of total probability, Pr(∼ A1)
= 2/3. Let’s say that Monty Hall had opened door 2, D2, to reveal a goat.
Then he asked, "Would you like to change to door number 3?"
We need to know the probability that door 3 has the million dollars and compare that to door 1's probability. We will call the opening of door 2 event B. We will call the probability that the million dollars is behind door i, Ai. We now write out the question just asked formally and decompose it using the Bayesian decomposition. We are ultimately interested in knowing what the probability is that door 1 has a million dollars (event A1) given that Monty Hall opened door 2 (event B), which is a conditional probability question. Let's write out that conditional probability using the Bayesian decomposition from equation 2.11:

$$\Pr(A_1 \mid B) = \frac{\Pr(B \mid A_1)\Pr(A_1)}{\Pr(B \mid A_1)\Pr(A_1) + \Pr(B \mid A_2)\Pr(A_2) + \Pr(B \mid A_3)\Pr(A_3)} \quad (2.12)$$
There are basically two kinds of probabilities on the right side of the equation. There's the marginal probability that the million dollars is behind a given door, Pr(Ai). And there's the conditional probability that Monty Hall would open door 2 given that the million dollars is behind door i, Pr(B | Ai).
The marginal probability that door i has the million dollars behind it without our having any additional information is 1/3. We call this the prior probability, or prior belief. It may also be called the unconditional probability.
The conditional probability, Pr(B | Ai), requires a little more careful thinking. Take the first conditional probability, Pr(B | A1). If door 1 has the million dollars behind it, what's the probability that Monty Hall would open door 2?
Let's think about the second conditional probability: Pr(B | A2). If the money is behind door 2, what's the probability that Monty Hall would open door 2?
And then the last conditional probability, Pr(B | A3). In a world where the money is behind door 3, what's the probability Monty Hall will open door 2?
Each of these conditional probabilities requires thinking carefully about the feasibility of the events in question. Let's examine the easiest question: Pr(B | A2). If the money is behind door 2, how likely is it for Monty Hall to open that same door, door 2? Keep in mind: this is a game show. So that gives you some idea about how the game-show host will behave. Do you think Monty Hall would open a door that had the million dollars behind it? It makes no sense to think he'd ever open a door that actually had the money behind it—he will always open a door with a goat. So don't you think he's only opening doors with goats? Let's see what happens if we take that intuition to its logical extreme and conclude that Monty Hall never opens a door if it has a million dollars. He only opens a door if the door has a goat. Under that assumption, we can proceed to estimate Pr(A1 | B) by substituting values for Pr(B | Ai) and Pr(Ai) into the right side of equation 2.12.
What then is Pr(B | A1)? That is, in a world where you have chosen door 1, and the money is behind door 1, what is the probability that he would open door 2? There are two doors he could open if the money is behind door 1—he could open either door 2 or door 3, as both have a goat behind them. So Pr(B | A1) = 0.5.
What about the second conditional probability, Pr(B | A2)? If the money is behind door 2, what's the probability he will open it? Under our assumption that he never opens the door if it has a million dollars, we know this probability is 0.0. And finally, what about the third probability, Pr(B | A3)? What is the probability he opens door 2 given that the money is behind door 3? Now consider this one carefully—the contestant has already chosen door 1, so he can't open that one. And he can't open door 3, because that has the money behind it. The only door, therefore, he could open is door 2. Thus, this probability is 1.0. Furthermore, all marginal probabilities, Pr(Ai), equal 1/3, allowing us to solve for the conditional probability on the left side through substitution, multiplication, and division:

$$\Pr(A_1 \mid B) = \frac{\tfrac{1}{2} \cdot \tfrac{1}{3}}{\tfrac{1}{2} \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3}} = \frac{1/6}{1/2} = \frac{1}{3}$$
Aha. Now isn’t that just a little bit surprising? The probability that the
contestant chose the correct door is 1/3, just as it was before Monty Hall
opened door 2.
But what about the probability that door 3, the door you’re holding, has
the million dollars? Have your beliefs about that likelihood changed now
that door 2 has been removed from the equation? Let’s crank through our
Bayesian decomposition and see whether we learned anything.
Interestingly, while your beliefs about the door you originally chose
haven’t changed, your beliefs about the other door have changed. The prior
probability, Pr(A3)= 1/3, increased through a process called updating to a
new probability of Pr(A3 | ψ) = 2/3. This new conditional probability is
called the posterior probability, or posterior belief. And it simply means
that having witnessed ψ, you learned information that allowed you to form a
new belief about which door the money might be behind.
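If the algebra doesn't convince you, simulation may. Here is a minimal Stata sketch of the switching strategy (my own construction, not from the text):

    clear
    set seed 1
    set obs 100000
    * Door hiding the money is uniform over {1,2,3}; contestant picks door 1
    gen prize = ceil(3*runiform())
    gen choice = 1
    * The host only ever opens a goat door, so switching wins exactly when
    * the prize is NOT behind the original choice
    gen win_switch = (prize != choice)
    summarize win_switch    // mean should be near 2/3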
As was mentioned in footnote 4 regarding the controversy around vos Savant's correct reasoning about the need to switch doors, deductions based on Bayes's rule are often surprising even to smart people—probably because we lack coherent ways to correctly incorporate information into probabilities. Bayes's rule shows us how to do that in a way that is logical and accurate. But besides being insightful, Bayes's rule also opens the door for a different kind of reasoning about cause and effect. Whereas most of this book has to do with estimating effects from known causes, Bayes's rule reminds us that we can form reasonable beliefs about causes from known effects.

Summation operator. The tools we use to reason about causality rest atop a bedrock of probabilities. We are often working with mathematical tools and concepts from statistics such as expectations and probabilities. One of the most common tools we will use in this book is the linear regression model, but before we can dive into that, we have to build out some simple notation.5 We'll begin with the summation operator. The Greek letter Σ (the capital Sigma) denotes the summation operator. Let x1, x2, . . . , xn be a sequence of numbers. We can compactly write a sum of numbers using the summation operator as:

$$\sum_{i=1}^{n} x_i \equiv x_1 + x_2 + \dots + x_n$$
The letter i is called the index of summation. Other letters, such as j or k, are sometimes used as indices of summation. The subscript variable simply represents a specific value of a random variable, x. The numbers 1 and n are the lower limit and the upper limit, respectively, of the summation. The expression $\sum_{i=1}^{n} x_i$ can be stated in words as "sum the numbers xi for all values of i from 1 to n." An example can help clarify:

$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3$$
The summation operator has three properties. The first property is called the constant rule. Formally, for any constant c, it is:

$$\sum_{i=1}^{n} c = nc$$

Let's consider an example. Say that we are given:

$$\sum_{i=1}^{3} 5 = 5 + 5 + 5 = 15$$

A second property of the summation operator is:

$$\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$$

Again let's use an example. Say we are given:

$$\sum_{i=1}^{3} 5 x_i = 5x_1 + 5x_2 + 5x_3 = 5 \sum_{i=1}^{3} x_i$$

We can apply both of these properties to get the following third property:

$$\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i$$

Before leaving the summation operator, it is useful to also note things which are not properties of this operator. First, the summation of a ratio is not the ratio of the summations themselves:

$$\sum_{i=1}^{n} \frac{x_i}{y_i} \neq \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}$$
Second, the summation of some squared variable is not equal to the squaring of its summation:

$$\sum_{i=1}^{n} x_i^2 \neq \left(\sum_{i=1}^{n} x_i\right)^2$$
We can use the summation indicator to make a number of calculations, some of which we will do repeatedly over the course of this book. For instance, we can use the summation operator to calculate the average:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ is the average (mean) of the random variable $x_i$. Another calculation we can make is a random variable's deviations from its own mean. The sum of the deviations from the mean is always equal to 0:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

Table 5. Sum of deviations equaling 0.

You can see this in Table 5.

Consider two sequences of numbers {y1, y2, . . . , yn} and {x1, x2, . . . , xm}. Now we can consider double summations over possible values of x's and y's. For example, consider the case where n = m = 2. Then $\sum_{i=1}^{2}\sum_{j=1}^{2} x_i y_j$ is equal to x1y1 + x1y2 + x2y1 + x2y2. This is because

$$\sum_{i=1}^{2}\sum_{j=1}^{2} x_i y_j = \sum_{i=1}^{2} x_i (y_1 + y_2) = x_1(y_1 + y_2) + x_2(y_1 + y_2) = x_1y_1 + x_1y_2 + x_2y_1 + x_2y_2$$
One result that will be very useful throughout the book is:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2$$

An overly long, step-by-step proof is below. Note that the summation index is suppressed after the first line for easier reading.

$$\begin{aligned} \sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum \big(x_i^2 - 2x_i\bar{x} + \bar{x}^2\big) \\ &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\ &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\ &= \sum x_i^2 - n\bar{x}^2 \end{aligned}$$

A more general version of this result is:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i(y_i - \bar{y})$$

Or:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n(\bar{x}\,\bar{y})$$
Expected value. The expected value of a random variable, also called the expectation and sometimes the population mean, is simply the weighted average of the possible values that the variable can take, with the weights being given by the probability of each value occurring in the population. Suppose that the variable X can take on values x1, x2, . . . , xk, each with probability f(x1), f(x2), . . . , f(xk), respectively. Then we define the expected value of X as:

$$E(X) = x_1 f(x_1) + x_2 f(x_2) + \dots + x_k f(x_k) = \sum_{j=1}^{k} x_j f(x_j)$$

Let's look at a numerical example. If X takes on values of −1, 0, and 2, with probabilities 0.3, 0.3, and 0.4, respectively,6 then the expected value of X equals:

$$E(X) = (-1)(0.3) + (0)(0.3) + (2)(0.4) = 0.5$$

In fact, you could take the expectation of a function of that variable, too, such as X2. Note that X2 takes only the values 1, 0, and 4, with probabilities 0.3, 0.3, and 0.4. Calculating the expected value of X2 therefore is:

$$E(X^2) = (-1)^2(0.3) + (0)^2(0.3) + (2)^2(0.4) = 1.9$$
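A quick Stata sketch (mine, not the book's) verifies both weighted averages:

    clear
    input x p
    -1 0.3
     0 0.3
     2 0.4
    end
    gen xp  = x*p        // contributions to E(X)
    gen x2p = x^2*p      // contributions to E(X^2)
    * Totals should be 0.5 and 1.9
    total xp x2p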

The first property of expected value is that for any constant c, E(c) = c. The second property is that for any two constants a and b, E(aX + b) = E(aX) + E(b) = aE(X) + b. And the third property is that if we have numerous constants, a1, . . . , an, and many random variables, X1, . . . , Xn, then the following is true:

$$E(a_1 X_1 + \dots + a_n X_n) = a_1 E(X_1) + \dots + a_n E(X_n)$$

We can also express this using the expectation operator:

$$E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i)$$

And in the special case where ai = 1, then

$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i)$$
Variance. The expectation operator, E(·), is a population concept. It refers to the whole group of interest, not just to the sample available to us. Its meaning is somewhat similar to that of the average of a random variable in the population. Some additional properties for the expectation operator can be explained assuming two random variables, W and H:

$$E(W + H) = E(W) + E(H)$$
$$E\big(W - E(W)\big) = 0$$

Consider the variance of a random variable, W:

$$V(W) = \sigma^2 = E\big[(W - E(W))^2\big]$$

We can show

$$V(W) = E(W^2) - E(W)^2$$

In a given sample of data, we can estimate the variance by the following calculation:

$$S^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where we divide by n − 1 because we are making a degrees-of-freedom adjustment from estimating the mean. But in large samples, this degrees-of-freedom adjustment has no practical effect on the value of S2, where S2 is the average (after the degrees-of-freedom correction) of the sum of all squared deviations from the mean.7
A few more properties of variance. First, the variance of a line is:

$$V(aX + b) = a^2 V(X)$$

And the variance of a constant is 0 (i.e., V(c) = 0 for any constant c). The variance of the sum of two random variables is equal to:

$$V(X + Y) = V(X) + V(Y) + 2\big(E(XY) - E(X)E(Y)\big) \quad (2.22)$$

If the two variables are independent, then E(XY) = E(X)E(Y), and V(X + Y) is equal to the sum V(X) + V(Y).
Covariance. The last part of equation 2.22 is called the covariance. The covariance measures the amount of linear dependence between two random variables. We represent it with the C(X,Y) operator. The expression C(X,Y) > 0 indicates that two variables move in the same direction, whereas C(X,Y) < 0 indicates that they move in opposite directions. Thus we can rewrite equation 2.22 as:

$$V(X + Y) = V(X) + V(Y) + 2C(X, Y)$$

While it's tempting to say that a zero covariance means that two random variables are unrelated, that is incorrect. They could have a nonlinear relationship. The definition of covariance is

$$C(X, Y) = E(XY) - E(X)E(Y)$$

As we said, if X and Y are independent, then C(X,Y) = 0 in the population. The covariance between two linear functions is:

$$C(a_1 + b_1 X,\ a_2 + b_2 Y) = b_1 b_2 C(X, Y)$$

The two constants, a1 and a2, zero out because their mean is themselves and so the difference equals 0.
Interpreting the magnitude of the covariance can be tricky. For that, we are better served by looking at correlation. We define correlation as follows. Let $W = \frac{X - E(X)}{\sqrt{V(X)}}$ and $Z = \frac{Y - E(Y)}{\sqrt{V(Y)}}$. Then:

$$\mathrm{Corr}(X, Y) = \frac{C(X, Y)}{\sqrt{V(X)\,V(Y)}}$$
The correlation coefficient is bounded by −1 and 1. A positive (negative)


correlation indicates that the variables move in the same (opposite) ways.
The closer the coefficient is to 1 or −1, the stronger the linear relationship
is.
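As an illustration, here is a minimal Stata sketch (the data-generating values are my assumptions) that computes sample covariances and the correlation on simulated data:

    clear
    set seed 1
    set obs 1000
    gen x = rnormal()
    gen y = 2*x + rnormal()   // linear dependence => positive covariance
    correlate x y, covariance // sample covariance matrix
    correlate x y             // correlation, bounded by -1 and 1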

Population model. We begin with cross-sectional analysis. We will assume that we can collect a random sample from the population of interest. Assume that there are two variables, x and y, and we want to see how y varies with changes in x.8
There are three questions that immediately come up. One, what if y is affected by factors other than x? How will we handle that? Two, what is the functional form connecting these two variables? Three, if we are interested in the causal effect of x on y, then how can we distinguish that from mere correlation? Let's start with a specific model:

$$y = \beta_0 + \beta_1 x + u \quad (2.25)$$
This model is assumed to hold in the population. Equation 2.25 defines a linear bivariate regression model. For models concerned with capturing causal effects, the terms on the left side are usually thought of as the effect, and the terms on the right side are thought of as the causes.
Equation 2.25 explicitly allows for other factors to affect y by including a random variable called the error term, u. This equation also explicitly models the functional form by assuming that y is linearly dependent on x. We call the $\beta_0$ coefficient the intercept parameter, and we call the $\beta_1$ coefficient the slope parameter. These describe a population, and our goal in empirical work is to estimate their values. We never directly observe these parameters, because they are not data (I will emphasize this throughout the book). What we can do, though, is estimate these parameters using data and assumptions. To do this, we need credible assumptions to accurately estimate these parameters with data. We will return to this point later. In this simple regression framework, all unobserved variables that determine y are subsumed by the error term u.
First, we make a simplifying assumption without loss of generality. Let the expected value of u be zero in the population. Formally:

$$E(u) = 0$$

where E(·) is the expected value operator discussed earlier. Normalizing the mean of the u random variable to be 0 is of no consequence. Why? Because the presence of $\beta_0$ (the intercept term) always allows us this flexibility. If the average of u is different from 0—for instance, say that it's $\alpha_0$—then we adjust the intercept. Adjusting the intercept has no effect on the $\beta_1$ slope parameter, though. For instance:

$$y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0)$$

where $\alpha_0 = E(u)$. The new error term is $u - \alpha_0$, and the new intercept term is $\beta_0 + \alpha_0$. But while those two terms changed, notice what did not change: the slope, $\beta_1$.

Mean independence. An assumption that meshes well with our elementary treatment of statistics involves the mean of the error term for each "slice" of the population determined by values of x:

$$E(u \mid x) = E(u) \ \text{for all values of } x \quad (2.27)$$

where E(u | x) means the "expected value of u given x." If equation 2.27 holds, then we say that u is mean independent of x.
An example might help here. Let's say we are estimating the effect of schooling on wages, and u is unobserved ability. Mean independence requires that E(ability | x = 8) = E(ability | x = 12) = E(ability | x = 16), so that the average ability is the same in the different portions of the population with an eighth-grade education, a twelfth-grade education, and a college education. Because people choose how much schooling to invest in based on their own unobserved skills and attributes, equation 2.27 is likely violated—at least in our example.
But let’s say we are willing to make this assumption. Then combining
this new assumption, E(u | x) = E(u) (the nontrivial assumption to make),
with E(u)=0 (the normalization and trivial assumption), and you get the
following new assumption:

Equation 2.28 is called the zero conditional mean assumption and is a key
identifying assumption in regression models. Because the conditional
expected value is a linear operator, E(u | x)=0 implies that

which shows the population regression function is a linear function of x, or


what Angrist and Pischke [2009] call the conditional expectation function.9
This relationship is crucial for the intuition of the parameter, 1, as a causal
parameter.

Ordinary least squares. Given data on x and y, how can we estimate the population parameters, $\beta_0$ and $\beta_1$? Let {(xi, yi) : i = 1, 2, . . . , n} be a random sample of size n from the population. Plug any observation into the population equation:

$$y_i = \beta_0 + \beta_1 x_i + u_i$$

where i indicates a particular observation. We observe yi and xi but not ui. We just know that ui is there. We then use the two population restrictions that we discussed earlier:

$$E(u) = 0$$
$$C(x, u) = 0$$
to obtain estimating equations for $\beta_0$ and $\beta_1$. We talked about the first condition already. The second one means that the mean value of u does not change with different slices of x. This mean independence, together with E(u) = 0, implies E(xu) = 0 and C(x,u) = 0. Notice that independence of x and u would imply C(x,u) = 0, though zero covariance alone is not enough for independence.10 Next we plug in for u, which is equal to $y - \beta_0 - \beta_1 x$:

$$E(y - \beta_0 - \beta_1 x) = 0$$
$$E\big(x\,(y - \beta_0 - \beta_1 x)\big) = 0$$

These are the two conditions in the population that effectively determine $\beta_0$ and $\beta_1$. And again, note that the notation here describes population concepts. We don't have access to populations, though we do have their sample counterparts:

$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0 \quad (2.29)$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0 \quad (2.30)$$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimates from the data.11 These are two linear equations in the two unknowns $\hat{\beta}_0$ and $\hat{\beta}_1$. Recall the properties of the summation operator as we work through the following sample properties of these two equations. We begin with equation 2.29 and pass the summation operator through:

$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, which is the average of the n numbers {yi : 1, . . . , n}. For emphasis we will call ȳ the sample average. We have already shown that the first equation equals zero (equation 2.29), so this implies $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$. So we now use this equation to write the intercept in terms of the slope:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
We now plug $\hat{\beta}_0$ into the second equation, $\sum_{i=1}^{n} x_i\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0$. This gives us the following (with some simple algebraic manipulation):

$$\sum_{i=1}^{n} x_i\big(y_i - [\bar{y} - \hat{\beta}_1\bar{x}] - \hat{\beta}_1 x_i\big) = 0$$
$$\sum_{i=1}^{n} x_i(y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i(x_i - \bar{x})$$

So the equation to solve is12

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
The previous formula for $\hat{\beta}_1$ is important because it shows us how to take data that we have and compute the slope estimate. The estimate, $\hat{\beta}_1$, is commonly referred to as the ordinary least squares (OLS) slope estimate. It can be computed whenever the sample variance of xi isn't 0. In other words, it can be computed if xi is not constant across all values of i. The intuition is that the variation in x is what permits us to identify its impact on y. This also means, though, that we cannot determine the slope in a relationship if we observe a sample in which everyone has the same years of schooling, or whatever causal variable we are interested in.
Once we have calculated $\hat{\beta}_1$, we can compute the intercept value, $\hat{\beta}_0$, as $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$. This is the OLS intercept estimate because it is calculated using sample averages. Notice that it is straightforward because $\hat{\beta}_0$ is linear in $\hat{\beta}_1$. With computers and statistical programming languages and software, we let our computers do these calculations because even when n is small, these calculations are quite tedious.
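Still, it is worth seeing once how short the calculation is to code. Here is a sketch (my own, with an assumed data-generating process) that applies the two formulas by hand and then checks them against Stata's regress:

    clear
    set seed 1
    set obs 100
    gen x = rnormal()
    gen y = 1 + 2*x + rnormal()
    quietly summarize x
    scalar xbar = r(mean)
    quietly summarize y
    scalar ybar = r(mean)
    gen num = (x - xbar)*(y - ybar)
    gen den = (x - xbar)^2
    quietly summarize num
    scalar sxy = r(sum)
    quietly summarize den
    scalar sxx = r(sum)
    scalar b1 = sxy/sxx            // slope: cross products over SST_x
    scalar b0 = ybar - b1*xbar     // intercept from the sample averages
    display "slope = " b1 "   intercept = " b0
    regress y x                    // matches the hand-computed values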
For any candidate estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, we define a fitted value for each i as:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Recall that i = {1, . . . , n}, so we have n of these equations. This is the value we predict for yi given that x = xi. But there is prediction error because $\hat{y}_i \neq y_i$. We call that mistake the residual, and here use the $\hat{u}_i$ notation for it. So the residual equals:

$$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$
While both the residual and the error term are represented with a u, it is important that you know the difference. The residual is the prediction error based on our fitted $\hat{y}$ and the actual y. The residual is therefore easily calculated with any sample of data. But u without the hat is the error term, and it is by definition unobserved by the researcher. Whereas the residual will appear in the data set once generated from a few steps of regression and manipulation, the error term will never appear in the data set. It is all of the determinants of our outcome not captured by our model. This is a crucial distinction, and strangely enough it is so subtle that even some seasoned researchers struggle to express it.
Suppose we measure the size of the mistake, for each i, by squaring it. Squaring it will, after all, eliminate all negative values of the mistake so that everything is a positive value. This becomes useful when summing the mistakes if we don't want positive and negative values to cancel one another out. So let's do that: square the mistake and add them all up to get the sum of squared residuals:

$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big)^2$$

This equation is called the sum of squared residuals because the residual is $\hat{u}_i = y_i - \hat{y}_i$. But the residual is based on estimates of the slope and the intercept. We can imagine any number of estimates of those values. But what if our goal is to minimize the sum of squared residuals by choosing $\hat{\beta}_0$ and $\hat{\beta}_1$? Using calculus, it can be shown that the solutions to that problem yield parameter estimates that are the same as what we obtained before.
Once we have the numbers $\hat{\beta}_0$ and $\hat{\beta}_1$ for a given data set, we write the OLS regression line:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
Let’s consider a short simulation.


Let’s look at the output from this. First, if you summarize the data, you’ll
see that the fitted values are produced both using Stata’s Predict command
and manually using the Generate command. I wanted the reader to have a
chance to better understand this, so did it both ways. But second, let’s look
at the data and paste on top of it the estimated coefficients, the y-intercept
and slope on x in Figure 3. The estimated coefficients in both are close to
the hard coded values built into the data-generating process.
Figure 3. Graphical representation of bivariate regression from y on x.

Once we have the estimated coefficients and we have the OLS regression line, we can predict y (outcome) for any (sensible) value of x. So plug in certain values of x, and we can immediately calculate what y will probably be with some error. The value of OLS here lies in how large that error is: OLS minimizes the error for a linear function. In fact, it is the best such guess at y for all linear estimators because it minimizes the prediction error. There's always prediction error, in other words, with any estimator, but OLS is the least worst.
Notice that the intercept is the predicted value of y if and when x = 0. In this sample, that value is −0.0750109.13 The slope allows us to predict changes in y for any reasonable change in x according to:

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x$$

And if Δx = 1, then x increases by one unit, and so $\Delta\hat{y} = 5.598296$ in our numerical example, because $\hat{\beta}_1 = 5.598296$.
Now that we have calculated $\hat{\beta}_0$ and $\hat{\beta}_1$, we get the OLS fitted values by plugging xi into the following equation for i = 1, . . . , n:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

The OLS residuals are also calculated by:

$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$
Most residuals will be different from 0 (i.e., they do not lie on the
regression line). You can see this in Figure 3. Some are positive, and some
are negative. A positive residual indicates that the regression line (and
hence, the predicted values) underestimates the true value of yi. And if the
residual is negative, then the regression line overestimates the true value.
Recall that we defined the fitted value as $\hat{y}_i$ and the residual, $\hat{u}_i$, as $y_i - \hat{y}_i$. Notice that the scatter-plot relationship between the residuals and the fitted values creates a spherical pattern, suggesting that they are not correlated (Figure 4). This is mechanical—least squares produces residuals which are uncorrelated with fitted values. There's no magic here, just least squares.

Algebraic Properties of OLS. Remember how we obtained $\hat{\beta}_0$ and $\hat{\beta}_1$? When an intercept is included, we have:

$$\sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\big) = 0$$

The OLS residuals always add up to zero, by construction:

$$\sum_{i=1}^{n} \hat{u}_i = 0$$
Sometimes seeing is believing, so let's look at this together. Type the following into Stata verbatim.
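The listing itself was lost in this copy; a minimal sketch that reproduces the exercise (the specific data-generating values are my assumptions) is:

    clear
    set seed 1234
    set obs 10
    gen x = 9*rnormal()
    gen u = 36*rnormal()
    gen y = 3 + 2*x + u
    reg y x
    predict yhat, xb
    predict uhat, residual
    * Column sums: u and yhat do not sum to zero, but uhat does
    collapse (sum) x u y yhat uhat
    list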
Figure 4. Distribution of residuals around regression line.
Output from this can be summarized as in the following table (Table 6). Notice the difference between the u, $\hat{u}$, and $\hat{y}$ columns. When we sum these ten lines, neither the error term nor the fitted values of y sum to zero. But the residuals do sum to zero. This is, as we said, one of the algebraic properties of OLS—the coefficients were optimally chosen to ensure that the residuals sum to zero.
Because $y_i = \hat{y}_i + \hat{u}_i$ by definition (which we can also see in Table 6), we can take the sample average of both sides:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i + \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i$$

and so $\bar{y} = \bar{\hat{y}}$ because the residuals sum to zero. Similarly, the way that we obtained our estimates yields

$$\sum_{i=1}^{n} x_i \hat{u}_i = 0$$

The sample covariance (and therefore the sample correlation) between the explanatory variables and the residuals is always zero (see Table 6).
Table 6. Simulated data showing the sum of residuals equals zero.

Because the $\hat{y}_i$ are linear functions of the xi, the fitted values and residuals are uncorrelated too (see Table 6):

$$\sum_{i=1}^{n} \hat{y}_i \hat{u}_i = 0 \quad (2.34)$$

Both properties hold by construction. In other words, $\hat{\beta}_0$ and $\hat{\beta}_1$ were selected to make them true.14
A third property is that if we plug in the average for x, we predict the sample average for y. That is, the point $(\bar{x}, \bar{y})$ is on the OLS regression line, or:

$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$$
Goodness-of-fit. For each observation, we write

$$y_i = \hat{y}_i + \hat{u}_i$$

Define the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) as

$$SST \equiv \sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$SSE \equiv \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
$$SSR \equiv \sum_{i=1}^{n} \hat{u}_i^2$$

These are sample variances when divided by n − 1.15 $\frac{SST}{n-1}$ is the sample variance of $y_i$, $\frac{SSE}{n-1}$ is the sample variance of $\hat{y}_i$, and $\frac{SSR}{n-1}$ is the sample variance of $\hat{u}_i$. With some simple manipulation, rewrite the first equation in deviations-from-mean form:

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + \hat{u}_i$$

Since equation 2.34 shows that the fitted values are uncorrelated with the residuals, squaring and summing both sides gives us the following equation:

$$SST = SSE + SSR$$

Assuming SST > 0, we can define the fraction of the total variation in yi that is explained by xi (or the OLS regression line) as

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$
which is called the R-squared of the regression. It can be shown to be equal to the square of the correlation between $y_i$ and $\hat{y}_i$. Therefore 0 ≤ R2 ≤ 1. An R-squared of zero means no linear relationship between yi and xi, and an R-squared of one means a perfect linear relationship (e.g., yi = xi + 2). As R2 increases, the yi are closer and closer to falling on the OLS regression line.
I would encourage you not to fixate on R-squared in research projects
where the aim is to estimate some causal effect, though. It’s a useful
summary measure, but it does not tell us about causality. Remember, you
aren’t trying to explain variation in y if you are trying to estimate some
causal effect. The R2 tells us how much of the variation in yi is explained by
the explanatory variables. But if we are interested in the causal effect of a
single variable, R2 is irrelevant. For causal inference, we need equation
2.28.

Expected value of OLS. Up until now, we motivated simple regression using a population model. But our analysis has been purely algebraic, based on a sample of data. So residuals always average to zero when we apply OLS to a sample, regardless of any underlying model. But now our job gets tougher: we have to study the statistical properties of the OLS estimator, referring to a population model and assuming random sampling.16
The field of mathematical statistics is concerned with questions such as: How do estimators behave across different samples of data? On average, for instance, will we get the right answer if we repeatedly sample? We need to find the expected value of the OLS estimators—in effect, the average outcome across all possible random samples—and determine whether we are right, on average. This leads naturally to a characteristic called unbiasedness, which is desirable of all estimators.

Remember, our objective is to estimate $\beta_1$, which is the slope population parameter that describes the relationship between y and x. Our estimate, $\hat{\beta}_1$, is an estimator of that parameter obtained for a specific sample. Different samples will generate different estimates ($\hat{\beta}_1$) of the "true" (and unobserved) $\beta_1$. Unbiasedness means that if we could take as many random samples on Y as we want from the population and compute an estimate each time, the average of the estimates would be equal to $\beta_1$.
There are several assumptions required for OLS to be unbiased. The first assumption is called linear in the parameters. Assume a population model

$$y = \beta_0 + \beta_1 x + u$$

where $\beta_0$ and $\beta_1$ are the unknown population parameters. We view x and u as outcomes of random variables generated by some data-generating process. Thus, since y is a function of x and u, both of which are random, y is also random. Stating this assumption formally shows that our goal is to estimate $\beta_0$ and $\beta_1$.
Our second assumption is random sampling. We have a random sample of size n, {(xi, yi) : i = 1, . . . , n}, following the population model. We know how to use this data to estimate $\beta_0$ and $\beta_1$ by OLS. Because each i is a draw from the population, we can write, for each i:

$$y_i = \beta_0 + \beta_1 x_i + u_i$$

Notice that ui here is the unobserved error for observation i. It is not the residual that we compute from the data.
The third assumption is called sample variation in the explanatory
variable. That is, the sample outcomes on xi are not all the same value. This
is the same as saying that the sample variance of x is not zero. In practice,
this is no assumption at all. If the xi all have the same value (i.e., are
constant), we cannot learn how x affects y in the population. Recall that
OLS is the covariance of y and x divided by the variance in x, and so if x is
constant, then we are dividing by zero, and the OLS estimator is undefined.
With the fourth assumption our assumptions start to have real teeth. It is called the zero conditional mean assumption and is probably the most critical assumption in causal inference. In the population, the error term has zero mean given any value of the explanatory variable:

$$E(u \mid x) = E(u) = 0$$

This is the key assumption for showing that OLS is unbiased, with the zero value being of no importance once we assume that E(u | x) does not change with x. Note that we can compute OLS estimates whether or not this assumption holds, even if there is an underlying population model.
So, how do we show that $\hat{\beta}_1$ is an unbiased estimate of $\beta_1$, i.e., that $E(\hat{\beta}_1) = \beta_1$ (equation 2.37)? We need to show that under the four assumptions we just outlined, the expected value of $\hat{\beta}_1$, when averaged across random samples, will center on the true value of $\beta_1$. This is a subtle yet critical concept. Unbiasedness in this context means that if we repeatedly sample data from a population and run a regression on each new sample, the average over all those estimated coefficients will equal the true value of $\beta_1$. We will discuss the answer as a series of steps.
Step 1: Write down a formula for $\hat{\beta}_1$. It is convenient to use the form:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Let's get rid of some of this notational clutter by defining $\sum_{i=1}^{n}(x_i - \bar{x})^2 = SST_x$ (i.e., the total variation in the xi) and rewrite this as:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{SST_x}$$
Step 2: Replace each yi with $y_i = \beta_0 + \beta_1 x_i + u_i$, which uses the first linearity assumption and the fact that we have sampled data (our second assumption). The numerator becomes:

$$\sum_{i=1}^{n}(x_i - \bar{x})\,y_i = \beta_1 SST_x + \sum_{i=1}^{n}(x_i - \bar{x})\,u_i$$

Note, we used $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})\,x_i = SST_x$ to do this.17
We have shown that:

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,u_i}{SST_x}$$

Note that the last piece is the slope coefficient from the OLS regression of ui on xi, i : 1, . . . , n.18 We cannot do this regression because the ui are not observed. Now define $w_i = \frac{(x_i - \bar{x})}{SST_x}$ so that we have the following:

$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i$$
This has shown us the following: First, $\hat{\beta}_1$ is a linear function of the unobserved errors, ui, where the wi are all functions of {x1, . . . , xn}. Second, the random difference between $\beta_1$ and the estimate of it, $\hat{\beta}_1$, is due to this linear function of the unobservables.
Step 3: Find $E(\hat{\beta}_1)$. Under the random sampling assumption and the zero conditional mean assumption, E(ui | x1, . . . , xn) = 0. That means, conditional on each of the x variables:

$$E(w_i u_i \mid x_1, \dots, x_n) = w_i\,E(u_i \mid x_1, \dots, x_n) = 0$$

because wi is a function of {x1, . . . , xn}. This would not be true if in the population u and x were correlated.
Now we can complete the proof: conditional on {x1, . . . , xn},

$$E(\hat{\beta}_1) = E\left(\beta_1 + \sum_{i=1}^{n} w_i u_i\right) = \beta_1 + \sum_{i=1}^{n} E(w_i u_i) = \beta_1$$

Remember, $\beta_1$ is the fixed constant in the population. The estimator, $\hat{\beta}_1$, varies across samples and is the random outcome: before we collect our data, we do not know what $\hat{\beta}_1$ will be. Under the four aforementioned assumptions, $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.
I find it helpful to be concrete when we work through exercises like this. So let's visualize this. Let's create a Monte Carlo simulation. We have the following population model:

$$y = 3 + 2x + u \quad (2.38)$$

where x ∼ Normal(0, 9) and u ∼ Normal(0, 36). Also, x and u are independent. The following Monte Carlo simulation will estimate OLS on a sample of data 1,000 times. The true parameter $\beta_1$ equals 2. But what will the average $\hat{\beta}_1$ equal when we use repeated sampling?
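The simulation code was dropped from this copy; here is a Stata sketch consistent with the description (the sample size and seed are my assumptions):

    clear all
    program define ols, rclass
        clear
        set obs 10000
        gen x = 3*rnormal()     // Normal(0,9): standard deviation 3
        gen u = 6*rnormal()     // Normal(0,36): standard deviation 6
        gen y = 3 + 2*x + u     // true slope is 2
        regress y x
    end

    simulate beta = _b[x], reps(1000) seed(1): ols
    summarize beta              // mean near 2; sd is the standard error
    histogram beta              // the distribution shown in Figure 5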
Table 7 gives us the mean value of $\hat{\beta}_1$ over the 1,000 repetitions (repeated sampling). Your results will differ from mine here only because of the randomness involved in the simulation. But your results should be similar to what is shown here. While each sample had a different estimated slope, the average for $\hat{\beta}_1$ over all the samples was 1.998317, which is close to the true value of 2 (see equation 2.38). The standard deviation in this estimator was 0.0398413, which is close to the standard error recorded in the regression itself.19 Thus, we see that the estimate is the mean value of the coefficient from repeated sampling, and the standard error is the standard deviation from that repeated estimation. We can see the distribution of these coefficient estimates in Figure 5.
The problem is, we don't know which kind of sample we have. Do we have one of the "almost exactly 2" samples, or do we have one of the "pretty different from 2" samples? We can never know whether we are close to the population value. We hope that our sample is "typical" and produces a slope estimate close to $\beta_1$, but we can't know. Unbiasedness is a property of the procedure, of the rule. It is not a property of the estimate itself. For example, say we estimated an 8.2% return on schooling. It is tempting to say that 8.2% is an unbiased estimate of the return to schooling, but that's technically incorrect. The rule used to get $\hat{\beta}_1 = 0.082$ is unbiased (if we believe that u is unrelated to schooling), not the actual estimate itself.
Table 7. Monte Carlo simulation of OLS.

Figure 5. Distribution of coefficients from Monte Carlo simulation.

Law of iterated expectations. The conditional expectation function (CEF) is the mean of some outcome y with some covariate x held fixed. Let's focus more intently on this function.20 Let's get the notation and some of the syntax out of the way. As noted earlier, we write the CEF as E(yi | xi). Note that the CEF is explicitly a function of xi. And because xi is random, the CEF is random—although sometimes we work with particular values for xi, like E(yi | xi = 8 years schooling) or E(yi | xi = Female). When there are treatment variables, the CEF takes on two values: E(yi | di = 0) and E(yi | di = 1). But these are special cases only.
An important complement to the CEF is the law of iterated expectations (LIE). This law says that an unconditional expectation can be written as the unconditional average of the CEF. In other words, E(yi) = E{E(yi | xi)}. This is a fairly simple idea: if you want to know the unconditional expectation of some random variable y, you can simply calculate the weighted sum of all conditional expectations with respect to some covariate x. Let's look at an example. Let's say that average GPA for females is 3.5, average GPA for males is 3.2, half the population is female, and half is male. Then:

$$E(GPA) = E\{E(GPA_i \mid \text{Gender}_i)\} = 0.5 \times 3.5 + 0.5 \times 3.2 = 3.35$$
You probably use LIE all the time and didn't even know it. The proof is not complicated. Let xi and yi each be continuously distributed. The joint density is defined as fxy(u,t). The conditional distribution of y given x = u is defined as fy(t | xi = u). The marginal densities are gy(t) and gx(u).
Check out how easy this proof is. The first line uses the definition of expectation. The second line uses the definition of conditional expectation. The third line switches the integration order. The fourth line uses the definition of joint density. The fifth line replaces the prior line with the subsequent expression. The sixth line integrates the joint density over the support of x, which is equal to the marginal density of y. So, restating the law of iterated expectations: E(yi) = E{E(y | xi)}.
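The displayed proof was lost in this copy; a sketch of the standard argument that those six described lines walk through is:

$$\begin{aligned}
E\{E(y \mid x)\} &= \int E(y \mid x = u)\, g_x(u)\, du \\
&= \int \left[\int t\, f_y(t \mid x = u)\, dt\right] g_x(u)\, du \\
&= \int\int t\, f_y(t \mid x = u)\, g_x(u)\, du\, dt \\
&= \int t \left[\int f_y(t \mid x = u)\, g_x(u)\, du\right] dt \\
&= \int t \left[\int f_{xy}(u, t)\, du\right] dt \\
&= \int t\, g_y(t)\, dt = E(y)
\end{aligned}$$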

CEF decomposition property. The first property of the CEF we will discuss is the CEF decomposition property. The power of LIE comes from the way it breaks a random variable into two pieces—the CEF and a residual with special properties. The CEF decomposition property states that

$$y_i = E(y_i \mid x_i) + \varepsilon_i$$

where (i) $\varepsilon_i$ is mean independent of $x_i$, that is,

$$E(\varepsilon_i \mid x_i) = 0$$

and (ii) $\varepsilon_i$ is not correlated with any function of $x_i$.
The theorem says that any random variable yi can be decomposed into a piece that is explained by xi (the CEF) and a piece that is left over and orthogonal to any function of xi. I'll prove the (i) part first. Recall that $\varepsilon_i = y_i - E(y_i \mid x_i)$, which we substitute in the second step below:

$$E(\varepsilon_i \mid x_i) = E\big(y_i - E(y_i \mid x_i) \mid x_i\big) = E(y_i \mid x_i) - E(y_i \mid x_i) = 0$$

The second part of the theorem states that $\varepsilon_i$ is uncorrelated with any function of $x_i$. Let $h(x_i)$ be any function of $x_i$. Then $E(h(x_i)\varepsilon_i) = E\{h(x_i)E(\varepsilon_i \mid x_i)\}$ by LIE. The second term in the interior product is equal to zero by mean independence.21

CEF prediction property. The second property is the CEF prediction property. This states that $E(y_i \mid x_i) = \arg\min_{m(x_i)} E\big[(y_i - m(x_i))^2\big]$, where $m(x_i)$ is any function of $x_i$. In words, the CEF is the minimum mean squared error predictor of yi given xi. By adding $E(y_i \mid x_i) - E(y_i \mid x_i) = 0$ to the right side we get

$$\big(y_i - m(x_i)\big)^2 = \Big(\big[y_i - E(y_i \mid x_i)\big] + \big[E(y_i \mid x_i) - m(x_i)\big]\Big)^2$$

I personally find this easier to follow with simpler notation. So replace this expression with the following terms:

$$(a + b)^2 = a^2 + 2ab + b^2$$

Distribute the terms, rearrange them, and replace the terms with their original values until you get the following:

$$\big(y_i - m(x_i)\big)^2 = \big(y_i - E(y_i \mid x_i)\big)^2 + 2\big(E(y_i \mid x_i) - m(x_i)\big)\big(y_i - E(y_i \mid x_i)\big) + \big(E(y_i \mid x_i) - m(x_i)\big)^2$$

Now minimize the function with respect to $m(x_i)$. When minimizing this function with respect to $m(x_i)$, note that the first term, $(y_i - E(y_i \mid x_i))^2$, doesn't matter because it does not depend on $m(x_i)$. So it will zero out. The second and third terms, though, do depend on $m(x_i)$. So rewrite $2(E(y_i \mid x_i) - m(x_i))$ as $h(x_i)$. Also set $\varepsilon_i$ equal to $[y_i - E(y_i \mid x_i)]$ and substitute. Minimizing, the middle term becomes

$$E\big(h(x_i)\,\varepsilon_i\big)$$

which equals zero by the decomposition property.

ANOVA theory. The final property of the CEF that we will discuss is the analysis of variance theorem, or ANOVA. According to this theorem, the unconditional variance in some random variable is equal to the variance in the conditional expectation plus the expectation of the conditional variance, or

$$V(y_i) = V\big(E(y_i \mid x_i)\big) + E\big(V(y_i \mid x_i)\big)$$

where V is the variance and V(yi | xi) is the conditional variance.

Linear CEF theorem. As you probably know by now, the use of least squares in applied work is extremely common. That's because regression has several justifications. We discussed one—unbiasedness under certain assumptions about the error term. But I'd like to present some slightly different arguments. Angrist and Pischke [2009] argue that linear regression may be useful even if the underlying CEF itself is not linear, because regression is a good approximation of the CEF. So keep an open mind as I break this down a little bit more.
Angrist and Pischke [2009] give several arguments for using regression, and the linear CEF theorem is probably the easiest. Let's assume that we are sure that the CEF itself is linear. So what? Well, if the CEF is linear, then the linear CEF theorem states that the population regression is equal to that linear CEF. And if the CEF is linear, and if the population regression equals it, then of course you should use the population regression to estimate the CEF. If you need a proof for what could just as easily be considered common sense, I provide one. If $E(y_i \mid x_i)$ is linear, then $E(y_i \mid x_i) = x'\beta^*$ for some vector $\beta^*$. By the decomposition property, you get:

$$E\big(x\,(y - x'\beta^*)\big) = 0$$

And then when you solve this, you get $\beta^* = \beta$. Hence $E(y \mid x) = x'\beta$.
Best linear predictor theorem. There are a few other linear theorems that are worth bringing up in this context. For instance, recall that the CEF is the minimum mean squared error predictor of y given x in the class of all functions, according to the CEF prediction property. Given this, the population regression function is the best that we can do in the class of all linear functions.22

Regression CEF theorem. I would now like to cover one more attribute of regression. The function $X\beta$ provides the minimum mean squared error linear approximation to the CEF. That is,

$$\beta = \operatorname*{arg\,min}_{b}\; E\Big\{\big[E(y_i \mid x_i) - x_i'b\big]^2\Big\}$$
So? Let’s try and back up for a second, though, and get the big picture, as
all these linear theorems can leave the reader asking, “So what?” I’m telling
you all of this because I want to present to you an argument that regression
is appealing; even though it’s linear, it can still be justified when the CEF
itself isn’t. And since we don’t know with certainty that the CEF is linear,
this is actually a nice argument to at least consider. Regression is ultimately
nothing more than a crank turning data into estimates, and what I’m saying
here is that crank produces something desirable even under bad situations.
Let’s look a little bit more at this crank, though, by reviewing another
theorem which has become popularly known as the regression anatomy
theorem.

Regression anatomy theorem. In addition to our discussion of the CEF and regression theorems, we now dissect the regression itself. Here we discuss the regression anatomy theorem. The regression anatomy theorem is based on earlier work by Frisch and Waugh [1933] and Lovell [1963].23 I find the theorem more intuitive when I think through a specific example and offer up some data visualization. In my opinion, the theorem helps us interpret the individual coefficients of a multiple linear regression model. Say that we are interested in the causal effect of family size on labor supply. We want to regress labor supply on family size:

$$Y = \beta_0 + \beta_1 X + u$$

where Y is labor supply, and X is family size.
If family size is truly random, then the number of kids in a family is uncorrelated with the unobserved error term.24 This implies that when we regress labor supply on family size, our estimate, $\hat{\beta}_1$, can be interpreted as the causal effect of family size on labor supply. We could plot the regression coefficient in a scatter plot showing all (X, Y) pairs of data; the slope coefficient would be the best linear fit of the data for this data cloud. Furthermore, under a randomized number of children, the slope would also tell us the average causal effect of family size on labor supply.
But most likely, family size isn't random, because so many people choose the number of children to have in their family—instead of, say, flipping a coin. So how do we interpret $\hat{\beta}_1$ if family size is not random? Often, people choose their family size according to something akin to an optimal stopping rule. People pick how many kids to have, when to have them, and when to stop having them. In some instances, they may even attempt to pick the gender. All of these choices are based on a variety of unobserved and observed economic factors that may themselves be associated with one's decision to enter the labor market. In other words, using the language we've been using up until now, it's unlikely that E(u | X) = E(u) = 0.
But let’s say that we have reason to think that the number of kids in a
family is conditionally random. To make this tractable for the sake of
pedagogy, let’s say that a particular person’s family size is as good as
randomly chosen once we condition on race and age.25 While unrealistic, I
include it to illustrate an important point regarding multivariate regressions.
If this assumption were to be true, then we could write the following
equation:

Yi = β0 + β1Xi + γ1Ri + γ2Ai + ui

where Y is labor supply, X is number of kids, R is race, A is age, and u is the


population error term.
If we want to estimate the average causal effect of family size on labor
supply, then we need two things. First, we need a sample of data containing
all four of those variables. Without all four of the variables, we cannot
estimate this regression model. Second, we need for the number of kids, X,
to be randomly assigned for a given set of race and age.
Now, how do we interpret β̂1? And how might we visualize this
coefficient given that there are six dimensions to the data? The regression
anatomy theorem both tells us what this coefficient estimate actually means
and also lets us visualize the data in only two dimensions.
To explain the intuition of the regression anatomy theorem, let’s write
down a population model with multiple variables. Assume that your main
multiple regression model of interest has K covariates. We can then write it
as:

yi = β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei
Now assume an auxiliary regression in which the variable x1i is regressed
on all the remaining independent variables:

x1i = γ0 + γ2x2i + · · · + γKxKi + fi

and x̃1i = fi is the residual from that auxiliary regression. Then the
parameter β1 can be rewritten as:

β1 = Cov(yi, x̃1i) / Var(x̃1i)
Notice that again we see that the coefficient estimate is a scaled covariance,
only here, the covariance is with respect to the outcome and residual from
the auxiliary regression, and the scale is the variance of that same residual.
To prove the theorem, note that yi = β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei, and plug yi
and the residual x̃ki from the xki auxiliary regression into the covariance Cov(yi, x̃ki):

Cov(yi, x̃ki) = Cov(β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei, x̃ki)
= Cov(β0 + β1x1i + · · · + βkxki + · · · + βKxKi + ei, fi)
Since by construction E[fi] = 0, it follows that the term β0E[fi] = 0. Since
fi is a linear combination of all the independent variables with the exception
of xki, it must be that

β1E[x1ifi] = · · · = βk−1E[xk−1,ifi] = βk+1E[xk+1,ifi] = · · · = βKE[xKifi] = 0
Consider now the term E[eifi]. This can be written as

E[eifi] = E[eix̃ki] = E[ei(xki − x̂ki)] = E[eixki] − E[eix̂ki]
Since ei is uncorrelated with any independent variable, it is also


uncorrelated with xki. Accordingly, we have E[eixki] = 0. With regard to the
second term of the subtraction, substituting the predicted value from the xki
auxiliary regression, we get

E[eix̂ki] = E[ei(γ0 + γ1x1i + · · · + γk−1xk−1,i + γk+1xk+1,i + · · · + γKxKi)]
Once again, since ei is uncorrelated with any independent variable, the


expected value of the terms is equal to zero. It follows that E[eifi] = 0.
The only remaining term, then, is E[βkxkifi], which equals E[βkxkix̃ki], since
fi = x̃ki. The term xki can be substituted by rewriting the auxiliary regression
model such that

xki = E[xki | X−k] + x̃ki

This gives

E[βkxkix̃ki] = βkE[x̃ki(E[xki | X−k] + x̃ki)] = βkE[x̃ki²]

which follows directly from the orthogonality between E[xki | X−k] and x̃ki.
From previous derivations we finally get

Cov(yi, x̃ki) = βkE[x̃ki²]

which, once divided by Var(x̃ki) = E[x̃ki²], completes the proof.
I find it helpful to visualize things. Let’s look at an example in Stata
using its popular automobile data set. I’ll show you:
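A minimal sketch of that exercise, assuming weight, headroom, and mpg are the additional covariates that appear in Table 8:

    * Short and long regressions on Stata's auto data
    sysuse auto.dta, clear
    regress price length
    regress price length weight headroom mpg
    * Regression anatomy: residualize length on the other covariates
    regress length weight headroom mpg
    predict length_resid, residuals
    * Bivariate regression of price on the residualized length
    regress price length_resid

The coefficient on length_resid in the last regression matches the coefficient on length from the long regression, which is the regression anatomy theorem at work.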
Let’s walk through both the regression output that I’ve reproduced in
Table 8 as well as a nice visualization of the slope parameters in what I’ll
call the short bivariate regression and the longer multivariate regression.
The short regression of price on car length yields a coefficient of 57.20 on
length. For every additional inch, a car is $57 more expensive, which is
shown by the upward-sloping, dashed line in Figure 6. The slope of that line
is 57.20.
It will eventually become second nature for you to talk about including
more variables on the right side of a regression as “controlling for” those
variables. But in this regression anatomy exercise, I hope to give a different
interpretation of what you’re doing when you in fact “control for” variables
in a regression. First, notice how the coefficient on length changed signs
and increased in magnitude once we controlled for the other variables. Now,
the effect on length is −94.5. It appears that the length was confounded by
several other variables, and once we conditioned on them, longer cars
actually were cheaper. You can see a visual representation of this in Figure
6, where the multivariate slope is negative.
Table 8. Regression estimates of automobile price on length and other characteristics.

Figure 6. Regression anatomy display.

So what exactly is going on in this visualization? Well, for one, it has
condensed the number of dimensions (variables) from four to only two. It
did this through the regression anatomy process that we described earlier.
Basically, we ran the auxiliary regression, used the residuals from it, and
then calculated the slope coefficient as Cov(yi, x̃i)/Var(x̃i). This allowed us to show
scatter plots of the auxiliary residuals paired with their outcome
observations and to slice the slope through them (Figure 6). Notice that this
is a useful way to preview the multidimensional correlation between two
variables from a multivariate regression. Notice that the solid black line is
negative and the slope from the bivariate regression is positive. The
regression anatomy theorem shows that these two estimators—one being a
multivariate OLS and the other being a bivariate regression of price on a
residual—are identical.

Variance of the OLS estimators. That more or less summarizes what we


want to discuss regarding the linear regression. Under a zero conditional
mean assumption, we could epistemologically infer that the rule used to
produce the coefficient from a regression in our sample was unbiased.
That’s nice because it tells us that we have good reason to believe that
result. But now we need to build out this epistemological justification so as
to capture the inherent uncertainty in the sampling process itself. This
added layer of uncertainty is often called inference. Let’s turn to it now.
Remember the simulation we ran earlier in which we resampled a
population and estimated regression coefficients a thousand times? We
produced a histogram of those 1,000 estimates in Figure 5. The mean of the
coefficients was around 1.998, which was very close to the true effect of 2
(hard-coded into the data-generating process). But the standard deviation
was around 0.04. This means that, basically, in repeated sampling of some
population, we got different estimates. But the average of those estimates
was close to the true effect, and their spread had a standard deviation of
0.04. This concept of spread in repeated sampling is probably the most
useful thing to keep in mind as we move through this section.
Under the four assumptions we discussed earlier, the OLS estimators are
unbiased. But these assumptions are not sufficient to tell us anything about
the variance in the estimator itself. The assumptions help inform our beliefs
that the estimated coefficients, on average, equal the parameter values
themselves. But to speak intelligently about the variance of the estimator,
we need a measure of dispersion in the sampling distribution of the
estimators. As we’ve been saying, this leads us to the variance and
ultimately to the standard deviation. We could characterize the variance of
the OLS estimators under the four assumptions. But for now, it's easiest to
introduce an assumption that simplifies the calculations. We'll keep the
assumption ordering we've been using and call this the fifth assumption.
The fifth assumption is the homoskedasticity or constant variance
assumption. This assumption stipulates that our population error term, u,
has the same variance given any value of the explanatory variable, x.
Formally, this is:

V(u | x) = σ²
When I was first learning this material, I always had an unusually hard time
wrapping my head around σ². Part of it was because of my humanities
background; I didn’t really have an appreciation for random variables that
were dispersed. I wasn’t used to taking a lot of numbers and trying to
measure distances between them, so things were slow to click. So if you’re
like me, try this. Think of σ² as just a positive number like 2 or 8. That
number is measuring the spreading out of the underlying errors themselves. In
other words, the variance of the errors conditional on the explanatory
variable is simply some finite, positive number. And that number is
measuring the variance of the stuff other than x that influences the value of y
itself. And because we assume the zero conditional mean assumption,
whenever we assume homoskedasticity, we can also write:

E(u² | x) = E(u²) = σ²

Now, under the first, fourth, and fifth assumptions, we can write:

E(y | x) = β0 + β1x
V(y | x) = σ²
So the average, or expected, value of y is allowed to change with x, but if


the errors are homoskedastic, then the variance does not change with x. The
constant variance assumption may not be realistic; it must be determined on
a case-by-case basis.
Theorem: Sampling variance of OLS. Under the first five assumptions, we get:

V(β̂1 | x) = σ² / SSTx

To show this, write, as before,

β̂1 = β1 + Σi wiui

where wi = di/SSTx and di = xi − x̄. We are treating this as nonrandom in the
derivation. Because β1 is a constant, it does not affect V(β̂1). Now, we need to use the
fact that, for uncorrelated random variables, the variance of the sum is the
sum of the variances. The {ui : i = 1, . . . , n} are independent across
i and hence uncorrelated. Remember: if we know x, we know w. So:

V(β̂1 | x) = V(Σi wiui | x) = Σi wi²V(ui | x) = Σi wi²σ² = σ²Σi wi²

where the penultimate equality used the fifth assumption so that the
variance of ui does not depend on xi. Now we have:

Σi wi² = Σi di² / SSTx² = SSTx / SSTx² = 1 / SSTx

We have shown:

V(β̂1) = σ²/SSTx
A couple of points. First, this is the “standard” formula for the variance of
the OLS slope estimator. It is not valid if the fifth assumption, of
homoskedastic errors, doesn’t hold. The homoskedasticity assumption is
needed, in other words, to derive this standard formula. But the
homoskedasticity assumption is not used to show unbiasedness of the OLS
estimators. That requires only the first four assumptions.
Usually, we are interested in β1. We can easily study the two factors that
affect its variance: the numerator and the denominator.

As the error variance increases—that is, as σ² increases—so does the
variance in our estimator. The more “noise” in the relationship between y
and x (i.e., the larger the variability in u), the harder it is to learn something
about β1. In contrast, more variation in {xi} is a good thing. As SSTx rises,
V(β̂1) falls.
Notice that SSTx/n is the sample variance of x. We can think of this as
getting close to the population variance of x, σx², as n gets large. This means:

SSTx ≈ nσx²

which means that as n grows, V(β̂1) shrinks at the rate of 1/n. This is why more
data is a good thing: it shrinks the sampling variance of our estimators.
The standard deviation of β̂1 is the square root of the variance. So:

sd(β̂1) = σ / √SSTx
This turns out to be the measure of variation that appears in confidence


intervals and test statistics.
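You can also see the 1/n shrinkage discussed above in a quick simulation. This sketch, whose setup is my own, compares the estimated standard error at two sample sizes:

    clear all
    set seed 1
    set obs 100000
    generate x = rnormal()
    generate y = 1 + 2*x + rnormal()
    regress y x in 1/1000     // n = 1,000
    regress y x               // n = 100,000; the SE on x is about 10 times smaller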
Next we look at estimating the error variance. In the formula

V(β̂1) = σ²/SSTx

we can compute SSTx from {xi : i = 1, . . . , n}. But we need to estimate σ².
Recall that σ² = E(u²). Therefore, if we could observe a sample on the
errors, {ui : i = 1, . . . , n}, an unbiased estimator of σ² would be the sample
average:

(1/n) Σi ui²
But this isn’t an estimator that we can compute from the data we observe,
because ui are unobserved. How about replacing each ui with its “estimate,”
the OLS residual ?

→hereas ui cannot be computed, can be computed from the data because


it depends on the estimators, and . But, except by sheer coincidence, ui
= for any i.

Note that E(β̂0) = β0 and E(β̂1) = β1, but the estimators almost always differ
from the population values in a sample. So what about this as an estimator
of σ²?

σ̂² = (1/n) Σi ûi² = SSR/n
It is a true estimator and easily computed from the data after OLS. As it
turns out, this estimator is slightly biased: its expected value is a little less
than σ². The estimator does not account for the two restrictions on the
residuals used to obtain β̂0 and β̂1:

Σi ûi = 0
Σi xiûi = 0

There is no such restriction on the unobserved errors. The unbiased
estimator of σ², therefore, uses a degrees-of-freedom adjustment. The
residuals have only n − 2, not n, degrees of freedom. Therefore:

σ̂² = SSR / (n − 2)
We now propose the following theorem. The unbiased estimator of σ²
under the first five assumptions is:

E(σ̂²) = σ²

In most software packages, regression output will include:

σ̂ = √σ̂²
This is an estimator of sd(u), the standard deviation of the population error.


One small glitch is that σ̂ is not unbiased for σ.26 This will not matter for
our purposes: σ̂ is called the standard error of the regression, which means
that it is an estimate of the standard deviation of the error in the regression.
The software package Stata calls it the root mean squared error.
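For instance, after any call to regress, Stata stores this quantity, so a quick sketch using the auto data looks like:

    sysuse auto.dta, clear
    regress price length
    display e(rmse)     // the standard error of the regression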
Given σ̂, we can now estimate sd(β̂1) and sd(β̂0). The estimates of these
are called the standard errors of the β̂j. We will use these a lot. Almost all
regression packages report the standard errors in a column next to the
coefficient estimates. We can just plug in σ̂ for σ:

se(β̂1) = σ̂ / √SSTx
where both the numerator and the denominator are computed from the
data. For reasons we will see, it is useful to report the standard errors
below the corresponding coefficient, usually in parentheses.

Robust standard errors. How realistic is it that the variance in the errors is
the same for all slices of the explanatory variable, x? The short answer here
is that it is probably unrealistic. Heterogeneity is just something I’ve come
to accept as the rule, not the exception, so if anything, we should be opting
in to believing in homoskedasticity, not opting out. You can just take it as a
given that errors are never homoskedastic and move forward to the solution.
This isn’t completely bad news, because the unbiasedness of our
regressions based on repeated sampling never depended on assuming
anything about the variance of the errors. Those four assumptions, and
particularly the zero conditional mean assumption, guaranteed that the
central tendency of the coefficients under repeated sampling would equal
the true parameter, which for this book is a causal parameter. The problem
is with the spread of the coefficients. Without homoskedasticity, OLS no
longer has the minimum mean squared errors, which means that the
estimated standard errors are biased. Using our sampling metaphor, then,
the distribution of the coefficients is probably larger than we thought.
Fortunately, there is a solution. Let's write down the variance equation
under heterogeneous variance terms:

V(β̂1) = Σi (xi − x̄)²σi² / SSTx²

Notice the i subscript on our σi² term; that means the variance is not a constant.
When σi² = σ² for all i, this formula reduces to the usual form, σ²/SSTx. But
when that isn't true, then we have a problem called heteroskedastic errors.
A valid estimator of V(β̂1) for heteroskedasticity of any form (including
homoskedasticity) is

V̂(β̂1) = Σi (xi − x̄)²ûi² / SSTx²

which is easily computed from the data after the OLS regression. We have
Friedhelm Eicker, Peter J. Huber, and Halbert White to thank for this
solution (White [1980]).27 The solution for heteroskedasticity goes by
several names, but the most common is “robust” standard error.
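In Stata, the fix is a one-word option to regress; a minimal sketch on the auto data:

    * conventional, then heteroskedasticity-robust, standard errors
    sysuse auto.dta, clear
    regress price length
    regress price length, robust

The coefficients are identical across the two commands; only the standard errors change.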

Cluster robust standard errors. People will try to scare you by challenging
how you constructed your standard errors. Heteroskedastic errors, though,
aren't the only thing you should be worried about when it comes to
inference. Some phenomena do not affect observations individually; rather,
they affect groups of observations, and they affect the individuals within
each group in a common way. Say you
want to estimate the effect of class size on student achievement, but you
know that there exist unobservable things (like the teacher) that affect all
the students equally. If we can commit to independence of these
unobservables across classes, but individual student unobservables are
correlated within a class, then we have a situation in which we need to
cluster the standard errors. Before we dive into an example, I’d like to start
with a simulation to illustrate the problem.
As a baseline for this simulation, let's begin by simulating nonclustered
data and analyzing least squares estimates of that nonclustered data. This will
help firm up our understanding of the problems that occur with least
squares when data is clustered.28
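The original simulation can be sketched as follows; the program name and design choices here are my own assumptions:

    clear all
    set seed 20200403
    capture program drop ols_draw
    program define ols_draw, rclass
        drop _all
        set obs 1000
        generate x = rnormal()
        generate y = 0*x + rnormal()    // the true slope on x is zero
        regress y x
        return scalar b = _b[x]
        return scalar se = _se[x]
    end
    simulate b=r(b) se=r(se), reps(1000): ols_draw
    generate lo = b - 1.96*se
    generate hi = b + 1.96*se
    generate reject = (lo > 0) | (hi < 0)   // intervals that exclude the truth
    summarize b reject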
As we can see in Figure 7, the least squares estimate is centered on its
true population parameter.
Setting the significance level at 5%, we should incorrectly reject the null
that β1 = 0 about 5% of the time in our simulations. But let's check the
confidence intervals. As can be seen in Figure 8, about 95% of the 95%
confidence intervals contain the true value of β1, which is zero. In words,
this means that we incorrectly reject the null about 5% of the time.
But what happens when we use least squares with clustered data? To see
that, let’s resimulate our data with observations that are no longer
independent draws in a given cluster of observations.
Figure 7. Distribution of the least squares estimator over 1,000 random draws.
Figure 8. Distribution of the 95% confidence intervals with shading showing those that are
incorrectly rejecting the null.
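A sketch of the clustered version reuses the setup above but adds a shared cluster-level component to both x and the error (again, the design choices are assumptions):

    capture program drop ols_cluster_draw
    program define ols_cluster_draw, rclass
        drop _all
        set obs 50                       // 50 clusters
        generate cluster_ID = _n
        generate xc = rnormal()          // cluster-level part of x
        generate uc = rnormal()          // cluster-level part of the error
        expand 20                        // 20 observations per cluster
        generate x = xc + rnormal()
        generate u = uc + rnormal()
        generate y = 0*x + u
        regress y x
        return scalar b = _b[x]
        return scalar se = _se[x]
    end
    simulate b=r(b) se=r(se), reps(1000): ols_cluster_draw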
As can be seen in Figure 9, the least squares estimates now have a wider
spread than the estimates from the nonclustered data. But to see the
inference problem a bit more clearly, let's look at the confidence intervals again.

Figure 9. Distribution of the least squares estimator over 1,000 random draws.
Figure 10. Distribution of 1,000 95% confidence intervals, with darker region representing those
estimates that incorrectly reject the null.

Figure 10 shows the distribution of 95% confidence intervals from the


least squares estimates. As can be seen, a much larger share of estimates
incorrectly rejected the null hypothesis when the data was clustered. The
conventional standard errors are too small relative to the estimator's true
spread under clustered data, causing us to reject the null incorrectly too
often. So what can we do?
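One answer is to adjust the standard errors for the within-cluster correlation. Rerunning the regression from the sketch above with one added option:

    regress y x, cluster(cluster_ID)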
Now in this case, notice that we included the “, cluster(cluster_ID)”
syntax in our regression command. Before we dive in to what this syntax
did, let’s look at how the confidence intervals changed. Figure 11 shows the
distribution of the 95% confidence intervals where, again, the darkest
region represents those estimates that incorrectly rejected the null. Now,
when there are observations whose errors are correlated within a cluster,
estimating the model using least squares with cluster-robust standard errors
leads us back to a situation in which the type I error has decreased
considerably.
Figure 11. Distribution of 1,000 95% confidence intervals from a clustered robust least squares
regression, with dashed region representing those estimates that incorrectly reject the null.

This leads us to a natural question: what did the adjustment of the


estimator’s variance do that caused the type I error to decrease by so much?
→hatever it’s doing, it sure seems to be working! Let’s dive in to this
adjustment with an example. Consider the following model:

and

which equals zero if g = g' and equals σ(ij)g if g = g'.


Let’s stack the data by cluster first.
The OLS estimator is still = E[XX]−1XY. →e just stacked the data, which
doesn’t affect the estimator itself. But it does change the variance.

→ith this in mind, we can now write the variance-covariance matrix for
clustered data as

Adjusting for clustered data will be quite common in your applied work
given the ubiquity of clustered data in the first place. It's absolutely
essential for working in panel contexts, or in repeated cross-sections
like the difference-in-differences design. But it also turns out to be
important for experimental design, because often, the treatment will be at a
higher level of aggregation than the microdata itself. In the real world,
though, you can never assume that errors are independent draws from the
same distribution. You need to know how your variables were constructed
in the first place in order to choose the correct error structure for calculating
your standard errors. If you have aggregate variables, like class size, then
you’ll need to cluster at that level. If some treatment occurred at the state
level, then you’ll need to cluster at that level. There’s a large literature
available that looks at even more complex error structures, such as multi-
way clustering [Cameron et al., 2011].
But even the concept of the sample as the basis of standard errors may be
shifting. It’s becoming increasingly less the case that researchers work with
random samples; they are more likely working with administrative data
containing the population itself, and thus the concept of sampling
uncertainty becomes strained.29 For instance, Manski and Pepper [2018]
wrote that “random sampling assumptions . . . are not natural when
considering states or counties as units of observation.” So although a
metaphor of a superpopulation may be useful for extending these classical
uncertainty concepts, the ubiquity of digitized administrative data sets has
led econometricians and statisticians to think about uncertainty in other
ways.
New work by Abadie et al. [2020] explores how sampling-based
concepts of the standard error may not be the right way to think about
uncertainty in the context of causal inference, or what they call design-
based uncertainty. This work in many ways anticipates the next two
chapters because of its direct reference to the concept of the counterfactual.
Design-based uncertainty is a reflection of not knowing which values would
have occurred had some intervention been different in the counterfactual. And
Abadie et al. [2020] derive standard errors for design-based uncertainty, as
opposed to sampling-based uncertainty. As luck would have it, those
standard errors are usually smaller.
Let’s now move into these fundamental concepts of causality used in
applied work and try to develop the tools to understand how counterfactuals
and causality work together.

Notes
1 The probability of rolling a 5 using one six-sided die is 1/6 = 0.167.

2 The set notation ∪ means “union” and refers to the event that one event, the other, or both occur.
3 Why are they different? Because 0.83 is an approximation of Pr(B | A), which was technically
0.833 . . . with the 3s trailing on.
4 There’s an ironic story in which someone posed the Monty Hall question to the columnist,
Marilyn vos Savant. Vos Savant had an extremely high IQ and so people would send in puzzles to
stump her. Without the Bayesian decomposition, using only logic, she got the answer right. Her
column enraged people, though. Critics wrote in to mansplain how wrong she was, but in fact it was
they who were wrong.
5 For a more complete review of regression, see Wooldridge [2010] and Wooldridge [2015]. I
stand on the shoulders of giants.
6 The law of total probability requires that all marginal probabilities sum to unity.
7 Whenever possible, I try to use the “hat” to represent an estimated statistic. Hence Ŝ² instead of
just S². But it is probably more common to see the sample variance represented as S².
8 This is not necessarily causal language. We are speaking first and generally in terms of two
random variables systematically moving together in some measurable way.
9 Notice that the conditional expectation passed through the linear function leaving a constant,
because of the first property of the expectation operator, and a constant times x. This is because the
conditional expectation of E[X | X] = X. This leaves us with E[u | X] which under zero conditional
mean is equal to 0.
10 See equation 2.23.
11 Notice that we are dividing by n, not n − 1. There is no degrees-of-freedom correction, in other
words, when using samples to calculate means. There is a degrees-of-freedom correction when we
start calculating higher moments.
12 Recall from much earlier that:
13 It isn’t exactly 0 even though u and x are independent. Think of it as u and x are independent in
the population, but not in the sample. This is because sample characteristics tend to be slightly
different from population properties due to sampling error.
14 Using the Stata code from Table 6, you can show these algebraic properties yourself. I
encourage you to do so by creating new variables equaling the product of these terms and collapsing
as we did with the other variables. That sort of exercise may help convince you that the
aforementioned algebraic properties always hold.
15 Recall the earlier discussion about degrees-of-freedom correction.
16 This section is a review of traditional econometrics pedagogy. We cover it for the sake of
completeness. Traditionally, econometricians motivated their discussion of causality through ideas like
unbiasedness and consistency.
17 Told you we would use this result a lot.
18 I find it interesting that we see so many terms when working with regression. They show up
constantly. Keep your eyes peeled.
19 The standard error I found from running this on one sample of data was 0.0403616.
20 I highly encourage the interested reader to study Angrist and Pischke [2009], who have an
excellent discussion of LIE there.
21 Let’s take a concrete example of this proof. Let h(xi) = α + γ xi. Then take the joint expectation
E(h(xi)εi)=E([α+γ xi]εi). Then take conditional expectations E(α | xi)+ E(γ | xi)E(xi | xi)E(ε | xi)} = α
+xiE(εi | xi) = 0 after we pass the conditional expectation through.
22 See Angrist and Pischke [2009] for a proof.
23 A helpful proof of the Frisch-Waugh-Lovell theorem can be found in Lovell [2008].
24 While randomly having kids may sound fun, I encourage you to have kids when you want to
have them. Contact your local high school health teacher to learn more about a number of methods
that can reasonably minimize the number of random children you create.
25 Almost certainly a ridiculous assumption, but stick with me.
26 There does exist an unbiased estimator of σ, but it’s tedious and hardly anyone in economics
seems to use it. See Holtzman [1950].
27 No one even bothers to cite White [1980] anymore, just like how no one cites Leibniz or
Newton when using calculus. Eicker, Huber, and White created a solution so valuable that it got
separated from the original papers when it was absorbed into the statistical toolkit.
28 Hat tip to Ben Chidmi, who helped create this simulation in Stata.
29 Usually we appeal to superpopulations in such situations where the observed population is
simply itself a draw from some “super” population.
Directed Acyclic Graphs

Everyday it rains, so everyday the pain Went ignored and I'm sure ignorance was to blame But life
is a chain, cause and effected.
Jay-Z

The history of graphical causal modeling goes back to the early twentieth
century and Sewall Wright, one of the fathers of modern genetics and son of
the economist Philip Wright. Sewall developed path diagrams for genetics,
and Philip, it is believed, adapted them for econometric identification
[Matsueda, 2012].1
But despite that promising start, the use of graphical modeling for causal
inference has been largely ignored by the economics profession, with a few
exceptions [Heckman and Pinto, 2015; Imbens, 2019]. It was revitalized for
the purpose of causal inference when computer scientist and Turing Award
winner Judea Pearl adapted them for his work on artificial intelligence. He
explained this in his magnum opus, which is a general theory of causal
inference that expounds on the usefulness of his directed graph notation
[Pearl, 2009]. Since graphical models are immensely helpful for designing a
credible identification strategy, I have chosen to include them for your
consideration. Let’s review graphical models, one of Pearl’s contributions to
the theory of causal inference.2

Introduction to DAG Notation


Using directed acyclic graphical (DAG) notation requires some upfront
statements. The first thing to notice is that in DAG notation, causality runs
in one direction. Specifically, it runs forward in time. There are no cycles in
a DAG. To show reverse causality, one would need to create multiple
nodes, most likely with two versions of the same node separated by a time
index. Second, simultaneity, such as in supply and demand models, is not
straightforward with DAGs [Heckman and Pinto, 2015]. To handle either
simultaneity or reverse causality, it is recommended that you take a
completely different approach to the problem than the one presented in this
chapter. Third, DAGs explain causality in terms of counterfactuals. That is,
a causal effect is defined as a comparison between two states of the world—
one state that actually happened when some intervention took on some
value and another state that didn’t happen (the “counterfactual”) under
some other intervention.
Think of a DAG as a graphical representation of a chain of causal
effects. The causal effects are themselves based on some underlying,
unobserved structured process, one an economist might call the equilibrium
values of a system of behavioral equations, which are themselves nothing
more than a model of the world. All of this is captured efficiently using
graph notation, such as nodes and arrows. Nodes represent random
variables, and those random variables are assumed to be created by some
data-generating process.3 Arrows represent a causal effect between two
random variables moving in the intuitive direction of the arrow. The
direction of the arrow captures the direction of causality.
Causal effects can happen in two ways. They can either be direct (e.g., D
→ Y), or they can be mediated by a third variable (e.g., D → X → Y). When
they are mediated by a third variable, we are capturing a sequence of events
originating with D, which may or may not be important to you depending
on the question you’re asking.
A DAG is meant to describe all causal relationships relevant to the effect
of D on Y. What makes the DAG distinctive is both the explicit
commitment to a causal effect pathway and the complete commitment to
the lack of a causal pathway represented by missing arrows. In other words,
a DAG will contain both arrows connecting variables and choices to
exclude arrows. And the lack of an arrow necessarily means that you think
there is no such relationship in the data—this is one of the strongest beliefs
you can hold. A complete DAG will have all direct causal effects among the
variables in the graph as well as all common causes of any pair of variables
in the graph.
At this point, you may be wondering where the DAG comes from. It’s an
excellent question. It may be the question. A DAG is supposed to be a
theoretical representation of the state-of-the-art knowledge about the
phenomena you’re studying. It’s what an expert would say is the thing
itself, and that expertise comes from a variety of sources. Examples include
economic theory, other scientific models, conversations with experts, your
own observations and experiences, literature reviews, as well as your own
intuition and hypotheses.
I have included this material in the book because I have found DAGs to
be useful for understanding the critical role that prior knowledge plays in
identifying causal effects. But there are other reasons too. One, I have found
that DAGs are very helpful for communicating research designs and
estimators if for no other reason than pictures speak a thousand words. This
is, in my experience, especially true for instrumental variables, which have
a very intuitive DAG representation. Two, through concepts such as the
backdoor criterion and collider bias, a well-designed DAG can help you
develop a credible research design for identifying the causal effects of some
intervention. As a bonus, I also think a DAG provides a bridge between
various empirical schools, such as the structural and reduced form groups.
And finally, DAGs drive home the point that assumptions are necessary for
any and all identification of causal effects, which economists have been
hammering at for years [→olpin, 2013].

A simple DAG. Let’s begin with a simple DAG to illustrate a few basic
ideas. I will expand on it to build slightly more complex ones later.

In this DAG, we have three random variables: X, D, and Y. There is a


direct path from D to Y, which represents a causal effect. That path is
represented by D → Y. But there is also a second path from D to Y called
the backdoor path. The backdoor path is D ← X → Y. While the direct path is
a causal effect, the backdoor path is not causal. Rather, it is a process that
creates spurious correlations between D and Y that are driven solely by
fluctuations in the X random variable.
The idea of the backdoor path is one of the most important things we can
learn from the DAG. It is similar to the notion of omitted variable bias in
that it represents a variable that determines the outcome and the treatment
variable. Just as not controlling for a variable like that in a regression
creates omitted variable bias, leaving a backdoor open creates bias. The
backdoor path is D ← X → Y. We therefore call X a confounder because it
jointly determines D and Y, and so confounds our ability to discern the
effect of D on Y in naïve comparisons.
Think of the backdoor path like this: Sometimes when D takes on
different values, Y takes on different values because D causes Y. But
sometimes D and Y take on different values because X takes on different
values, and that bit of the correlation between D and Y is purely spurious.
The existence of two causal pathways is contained within the correlation
between D and Y.
Let’s look at a second DAG, which is subtly different from the first. In
the previous example, X was observed. We know it was observed because
the direct edges from X to D and Y were solid lines. But sometimes there
exists a confounder that is unobserved, and when there is, we represent its
direct edges with dashed lines. Consider the following DAG:

Same as before, U is a noncollider along the backdoor path from D to Y,


but unlike before, U is unobserved to the researcher. It exists, but it may
simply be missing from the data set. In this situation, there are two
pathways from D to Y. There’s the direct pathway, D Y, which is the
causal effect, and there’s the backdoor pathway, D←U Y. And since U is
unobserved, that backdoor pathway is open.
Let’s now move to another example, one that is slightly more realistic. A
classical question in labor economics is whether college education increases
earnings. According to the Becker human capital model [Becker, 1994],
education increases one’s marginal product, and since workers are paid their
marginal product in competitive markets, education also increases their
earnings. But college education is not random; it is optimally chosen given
an individual’s subjective preferences and resource constraints. →e
represent that with the following DAG. As always, let D be the treatment
(e.g., college education) and Y be the outcome of interest (e.g., earnings).
Furthermore, let PE be parental education, I be family income, and B be
unobserved background factors, such as genetics, family environment, and
mental ability.
This DAG is telling a story. And one of the things I like about DAGs is
that they invite everyone to listen to the story together. Here is my
interpretation of the story being told. Each person has some background.
It’s not contained in most data sets, as it measures things like intelligence,
contentiousness, mood stability, motivation, family dynamics, and other
environmental factors—hence, it is unobserved in the picture. Those
environmental factors are likely correlated between parent and child and
therefore subsumed in the variable B.
Background causes a child’s parent to choose her own optimal level of
education, and that choice also causes the child to choose their level of
education through a variety of channels. First, there are the shared
background factors, B. Those background factors cause the child to choose
a level of education, just as her parent had. Second, there’s a direct effect,
perhaps through simple modeling of achievement or setting expectations, a
kind of peer effect. And third, there’s the effect that parental education has
on family earnings, I, which in turn affects how much schooling the child
receives. Family earnings may itself affect the child’s future earnings
through bequests and other transfers, as well as external investments in the
child’s productivity.
This is a simple story to tell, and the DAG tells it well, but I want to alert
your attention to some subtle points contained in this DAG. The DAG is
actually telling two stories. It is telling what is happening, and it is telling
what is not happening. For instance, notice that B has no direct effect on the
child’s earnings except through its effect on schooling. Is this realistic,
though? Economists have long maintained that unobserved ability both
determines how much schooling a child gets and directly affects the child’s
future earnings, insofar as intelligence and motivation can influence careers.
But in this DAG, there is no relationship between background and earnings,
which is itself an assumption. And you are free to call foul on this
assumption if you think that background factors affect both schooling and
the child’s own productivity, which itself should affect wages. So what if
you think that there should be an arrow from B to Y? Then you would draw
one and rewrite all the backdoor paths between D and Y.
Now that we have a DAG, what do we do? I like to list out all direct and
indirect paths (i.e., backdoor paths) between D and Y. Once I have all those,
I have a better sense of where my problems are. So:
1. D → Y (the causal effect of education on earnings)
2. D ← I → Y (backdoor path 1)
3. D ← PE → I → Y (backdoor path 2)
4. D ← B → PE → I → Y (backdoor path 3)

So there are four paths between D and Y: one direct causal effect (which
arguably is the important one if we want to know the return on schooling)
and three backdoor paths. And since none of the variables along the
backdoor paths is a collider, each of the backdoor paths is open. The
problem, though, with open backdoor paths is that they create systematic
and independent correlations between D and Y. Put a different way, the
presence of open backdoor paths introduces bias when comparing educated
and less-educated workers.

Colliding. But what is this collider? It's an unusual term, one you may have
never seen before, so let’s introduce it with another example. I’m going to
show you what a collider is graphically using a simple DAG, because it’s an
easy thing to see and a slightly more complicated phenomenon to explain.
So let’s work with a new DAG. Pay careful attention to the directions of the
arrows, which have changed.

As before, let’s list all paths from D to Y:


1. D → Y (causal effect of D on Y)
2. D → X ← Y (backdoor path 1)

Just like last time, there are two ways to get from D to Y. You can get from
D to Y using the direct (causal) path, D → Y. Or you can use the backdoor
path, D → X ← Y. But something is different about this backdoor path; do
you see it? This time the X has two arrows pointing to it, not away from it.
When two variables cause a third variable along some path, we call that
third variable a “collider.” Put differently, X is a collider along this
backdoor path because the causal effect of D and the causal effect of Y collide at X. But so
what? What makes a collider so special? Colliders are special in part
because when they appear along a backdoor path, that backdoor path is
closed simply because of their presence. Colliders, when they are left alone,
always close a specific backdoor path.

Backdoor criterion. We care about open backdoor paths because they
create systematic, noncausal correlations between the causal variable of
interest and the outcome you are trying to study. In regression terms, open
backdoor paths introduce omitted variable bias, and for all you know, the
bias is so bad that it flips the sign entirely. Our goal, then, is to close these
backdoor paths. And if we can close all of the otherwise open backdoor
paths, then we can isolate the causal effect of D on Y using one of the
research designs and identification strategies discussed in this book. So how
do we close a backdoor path?
There are two ways to close a backdoor path. First, if you have a
confounder that has created an open backdoor path, then you can close that
path by conditioning on the confounder. Conditioning requires holding the
variable fixed using something like subclassification, matching, regression,
or another method. It is equivalent to “controlling for” the variable in a
regression. The second way to close a backdoor path is the appearance of a
collider along that backdoor path. Since colliders always close backdoor
paths, and conditioning on a collider always opens a backdoor path,
choosing to ignore the colliders is part of your overall strategy to estimate
the causal effect itself. By not conditioning on a collider, you will have
closed that backdoor path and that takes you closer to your larger ambition
to isolate some causal effect.
When all backdoor paths have been closed, we say that you have come
up with a research design that satisfies the backdoor criterion. And if you
have satisfied the backdoor criterion, then you have in effect isolated some
causal effect. But let’s formalize this: a set of variables X satisfies the
backdoor criterion in a DAG if and only if X blocks every path between
confounders that contain an arrow from D to Y. Let’s review our original
DAG involving parental education, background and earnings.
The minimally sufficient conditioning strategy necessary to achieve the
backdoor criterion is the control for I, because I appeared as a noncollider
along every backdoor path (see earlier). It might literally be no simpler than
to run the following regression:

By simply conditioning on I, your estimated δ takes on a causal
interpretation.4

But maybe in hearing this story, and studying it for yourself by reviewing
the literature and the economic theory surrounding it, you are skeptical of
this DAG. Maybe this DAG has really bothered you from the moment you
saw me produce it because you are skeptical that B has no relationship to Y
except through D or PE. That skepticism leads you to believe that there
should be a direct connection from B to Y, not merely one mediated through
one's own education.

Note that including this new backdoor path has created a problem
because our conditioning strategy no longer satisfies the backdoor criterion.
Even controlling for I, there still exist spurious correlations between D and
Y due to the D ← B → Y backdoor path. Without more information about the
nature of B → Y and B → D, we cannot say much more about the partial
correlation between D and Y. We just are not legally allowed to interpret δ̂
from our regression as the causal effect of D on Y.
More examples of collider bias. The issue of conditioning on a collider is
important, so how do we know if we have that problem or not? No data set
comes with a flag saying “collider” and “confounder.” Rather, the only way
to know whether you have satisfied the backdoor criterion is with a DAG,
and a DAG requires a model. It requires in-depth knowledge of the data-
generating process for the variables in your DAG, but it also requires ruling
out pathways. And the only way to rule out pathways is through logic and
models. There is no way to avoid it—all empirical work requires theory to
guide it. Otherwise, how do you know if you’ve conditioned on a collider or
a noncollider? Put differently, you cannot identify treatment effects without
making assumptions.
In our earlier DAG with collider bias, we conditioned on some variable X
that was a collider—specifically, it was a descendant of D and Y. But that is
just one example of a collider. Oftentimes, colliders enter into the system in
very subtle ways. Let’s consider the following scenario: Again, let D and Y
be child schooling and child future earnings. But this time we introduce
three new variables—U1, which is father’s unobserved genetic ability; U2,
which is mother’s unobserved genetic ability; and I, which is joint family
income. Assume that I is observed but that Ui is unobserved for both
parents.

Notice in this DAG that there are several backdoor paths from D to Y.
They are as follows:
1. D ← U2 → Y
2. D ← U1 → Y
3. D ← U1 → I ← U2 → Y
4. D ← U2 → I ← U1 → Y
Notice, the first two are open backdoor paths, and as such, they cannot be
closed, because U1 and U2 are not observed. But what if we controlled for I
anyway? Controlling for I only makes matters worse, because it opens the
third and fourth backdoor paths, as I was a collider along both of them. It
does not appear that any conditioning strategy could meet the backdoor
criterion in this DAG. And any strategy controlling for I would actually
make matters worse. Collider bias is a difficult concept to understand at
first, so I’ve included a couple of examples to help you sort through it.

Discrimination and collider bias. Let’s examine a real-world example


around the problem of gender discrimination in labor markets. It is common
to hear that once occupation or other characteristics of a job are conditioned
on, the wage disparity between genders disappears or gets smaller. For
instance, critics once claimed that Google systematically underpaid its
female employees. But Google responded that its data showed that when
you take “location, tenure, job role, level and performance” into
consideration, women’s pay is basically identical to that of men. In other
words, controlling for characteristics of the job, women received the same
pay.
But what if one of the ways gender discrimination creates gender
disparities in earnings is through occupational sorting? If discrimination
happens via the occupational match, then naïve contrasts of wages by
gender controlling for occupation characteristics will likely understate the
presence of discrimination in the marketplace. Let me illustrate this with a
DAG based on a simple occupational sorting model with unobserved
heterogeneity.

Notice that there is in fact no effect of female gender on earnings; women


are assumed to have productivity identical to that of men. Thus, if we could
control for discrimination, we’d get a coefficient of zero as in this example
because women are, initially, just as productive as men.5
But in this example, we aren’t interested in estimating the effect of being
female on earnings; we are interested in estimating the effect of
discrimination itself. Now you can see several noticeable paths between
discrimination and earnings. They are as follows:
1. D → O → Y
2. D → O ← A → Y

The first path is not a backdoor path; rather, it is a path whereby


discrimination is mediated by occupation before discrimination has an
effect on earnings. This would imply that women are discriminated against,
which in turn affects which jobs they hold, and as a result of holding
marginally worse jobs, women are paid less. The second path relates to that
channel but is slightly more complicated. In this path, unobserved ability
affects both which jobs people get and their earnings.
So let’s say we regress Y onto D, our discrimination variable. This yields
the total effect of discrimination as the weighted sum of both the direct
effect of discrimination on earnings and the mediated effect of
discrimination on earnings through occupational sorting. But say that we
want to control for occupation because we want to compare men and
women in similar jobs. →ell, controlling for occupation in the regression
closes down the mediation channel, but it then opens up the second channel.
→hy? Because D τ←A Y has a collider τ. So when we control for
occupation, we open up this second path. It had been closed because
colliders close backdoor paths, but since we conditioned on it, we actually
opened it instead. This is the reason we cannot merely control for
occupation. Such a control ironically introduces new patterns of bias.6
What is needed is to control for occupation and ability, but since ability is
unobserved, we cannot do that, and therefore we do not possess an
identification strategy that satisfies the backdoor criterion. Let’s now look at
code to illustrate this DAG.7
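A minimal sketch of that simulation follows. The coefficient values track the description below; apart from the gender effect on wages, which the text says is hard-coded to −1, the exact magnitudes are my own assumptions:

    clear all
    set seed 3444
    set obs 10000
    generate female = runiform() >= 0.5
    generate ability = rnormal()             // independent of gender
    generate discrimination = female         // discrimination operates on women
    generate occupation = 1 + 2*ability - 2*discrimination + rnormal()
    generate wage = 1 - 1*discrimination + 1*occupation + 2*ability + rnormal()
    regress wage female
    regress wage female occupation
    regress wage female occupation ability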
This simulation hard-codes the data-generating process represented by
the previous DAG. Notice that ability is a random draw from the standard
normal distribution. Therefore it is independent of female preferences. And
then we have our last two generated variables: the heterogeneous
occupations and their corresponding wages. Occupations are increasing in
unobserved ability but decreasing in discrimination. Wages are decreasing
in discrimination but increasing in higher-quality jobs and higher ability.
Thus, we know that discrimination exists in this simulation because we are
hard-coding it that way with negative coefficients in both the occupation
and wage processes.
The regression coefficients from the three regressions at the end of the
code are presented in Table 9. First note that when we simply regress wages
onto gender, we get a large negative effect, which is the combination of the
direct effect of discrimination on earnings and the indirect effect via
occupation. But if we run the regression that Google and others recommend
wherein we control for occupation, the sign on gender changes. It becomes
positive! We know this is wrong because we hard-coded the effect of
gender to be −1! The problem is that occupation is a collider. It is caused by
ability and discrimination. If we control for occupation, we open up a
backdoor path between discrimination and earnings that is spurious and so
strong that it perverts the entire relationship. So only when we control for
occupation and ability can we isolate the direct causal effect of gender on
wages.
Table 9. Regressions illustrating confounding bias with simulated gender disparity.

Sample selection and collider bias. Bad controls are not the only kind of
collider bias to be afraid of, though. Collider bias can also be baked directly
into the sample if the sample itself was a collider. That’s no doubt a strange
concept to imagine, so I have a funny illustration to clarify what I mean.
A 2009 CNN blog post reported that Megan Fox, who starred in the
movie Transformers, was voted the worst and most attractive actress of
2009 in some survey about movie stars [Piazza, 2009]. The implication
could be taken to be that talent and beauty are negatively correlated. But are
they? And why might they be? What if they are independent of each other
in reality but negatively correlated in a sample of movie stars because of
collider bias? Is that even possible?8
To illustrate, we will generate some data based on the following DAG:
Let’s illustrate this with a simple program.
Figure 12 shows the output from this simulation. The bottom left panel
shows the scatter plot between talent and beauty. Notice that the two
variables are independent, random draws from the standard normal
distribution, creating an oblong data cloud. But because “movie star” is in
the top 85th percentile of the distribution of a linear combination of talent
and beauty, the sample consists of people whose combined score is in the
top right portion of the joint distribution. This frontier has a negative slope
and is in the upper right portion of the data cloud, creating a negative
correlation between the observations in the movie-star sample. Likewise,
the collider bias has created a negative correlation between talent and
beauty in the non-movie-star sample as well. Yet we know that there is in
fact no relationship between the two variables. This kind of sample
selection creates spurious correlations. A random sample of the full
population would be sufficient to show that there is no relationship between
the two variables, but splitting the sample into movie stars only introduces
a spurious correlation between the two variables of interest.

Figure 12. Aspiring actors and actresses.


Note: Top left: Non-star sample scatter plot of beauty (vertical axis) and talent (horizontal axis). Top
right: Star sample scatter plot of beauty and talent. Bottom left: Entire (stars and non-stars combined)
sample scatter plot of beauty and talent.

Collider bias and police use of force. We've known about the problems of
nonrandom sample selection for decades [Heckman, 1979]. But DAGs may
still be useful for helping spot what might be otherwise subtle cases of
conditioning on colliders [Elwert and Winship, 2014]. And given the
ubiquitous rise in researcher access to large administrative databases, it’s
also likely that some sort of theoretically guided reasoning will be needed
to help us determine whether the databases we have are themselves rife
with collider bias. A contemporary debate could help illustrate what I mean.
Public concern about police officers systematically discriminating against
minorities has reached a breaking point and led to the emergence of the
Black Lives Matter movement. “Vigilante justice” episodes such as George
Zimmerman’s killing of teenage Trayvon Martin, as well as police killings
of Michael Brown, Eric Garner, and countless others, served as catalysts to
bring awareness to the perception that African Americans face enhanced
risks for shootings. Fryer [2019] attempted to ascertain the degree to which
there was racial bias in the use of force by police. This is perhaps one of the
most important questions in policing as of this book’s publication.
There are several critical empirical challenges in studying racial biases in
police use of force, though. The main problem is that all data on police-
citizen interactions are conditional on an interaction having already
occurred. The data themselves were generated as a function of earlier
police-citizen interactions. In this sense, we can say that the data itself are
endogenous. Fryer [2019] collected several databases that he hoped would
help us better understand these patterns. Two were public-use data sets—the
New York City Stop and Frisk database and the Police-Public Contact
Survey. The former was from the New York Police Department and
contained data on police stops and questioning of pedestrians; if the police
wanted to, they could frisk them for weapons or contraband. The latter was
a survey of civilians describing interactions with the police, including the
use of force.
But two of the data sets were administrative. The first was a compilation
of event summaries from more than a dozen large cities and large counties
across the United States from all incidents in which an officer discharged a
weapon at a civilian. The second was a random sample of police-civilian
interactions from the Houston Police Department. The accumulation of
these databases was by all evidence a gigantic empirical task. For instance,
Fryer [2019] notes that the Houston data was based on arrest narratives that
ranged from two to one hundred pages in length. From these arrest
narratives, a team of researchers collected almost three hundred variables
relevant to the police use of force on the incident. This is the world in
which we now live, though. Administrative databases can be accessed more
easily than ever, and they are helping break open the black box of many
opaque social processes.
A few facts are important to note. First, using the stop-and-frisk data,
Fryer finds that blacks and Hispanics were more than 50 percent more
likely to have an interaction with the police in the raw data. The racial
difference survives conditioning on 125 baseline characteristics, encounter
characteristics, civilian behavior, precinct, and year fixed effects. In his full
model, blacks are 21 percent more likely than whites to be involved in an
interaction with police in which a weapon is drawn (which is statistically
significant). These racial differences show up in the Police-Public Contact
Survey as well, only here the racial differences are considerably larger. So
the first thing to note is that the actual stop itself appears to be larger for
minorities, which I will come back to momentarily.
Things become surprising when Fryer moves to his rich administrative
data sources. He finds that conditional on a police interaction, there are no
racial differences in officer-involved shootings. In fact, controlling for
suspect demographics, officer demographics, encounter characteristics,
suspect weapon, and year fixed effects, blacks are 27 percent less likely to
be shot at by police than are nonblack non-Hispanics. The coefficient is not
significant, and it shows up across alternative specifications and cuts of the
data. Fryer is simply unable with these data to find evidence for racial
discrimination in officer-involved shootings.
One of the main strengths of Fryer's study is the shoe leather he used to
accumulate the needed data sources. Without data, one cannot study the
question of whether police shoot minorities more than they shoot whites.
And the extensive coding of information from the narratives is also a
strength, for it afforded Fryer the ability to control for observable
confounders. But the study is not without issues that could cause a skeptic
to take issue. Perhaps the police departments most willing to cooperate with
a study of this kind are the ones with the least racial bias, for instance. In
other words, maybe these are not the departments with the racial bias to
begin with.9 Or perhaps a more sinister explanation exists, such as records
being unreliable because administrators scrub out the data on racially
motivated shootings before handing them over to Fryer altogether.
But I would like to discuss a more innocent possibility, one that requires
no conspiracy theories and yet is so basic a problem that it is in fact more
worrisome. Perhaps the administrative data source is endogenous because of
conditioning on a collider. If so, then the administrative data itself may have
the racial bias baked into it from the start. Let me explain with a DAG.
Fryer showed that minorities were more likely to be stopped using both
the stop-and-frisk data and the Police-Public Contact Survey. So we know
already that the D → M pathway exists (M being the stop). In fact, it was a very robust
correlation across multiple studies. Minorities are more likely to have an
encounter with the police. Fryer’s study introduces extensive controls about
the nature of the interaction, time of day, and hundreds of factors that I’ve
captured with X. Controlling for X allows Fryer to shut this backdoor path.
But notice M, the stop itself. All the administrative data are conditional
on a stop. Fryer [2019] acknowledges this from the outset: “Unless
otherwise noted, all results are conditional on an interaction. Understanding
potential selection into police data sets due to bias in who police interacts
with is a difficult endeavor” (3). Yet what this DAG shows is that if police
stop people who they believe are suspicious and use force against people
they find suspicious, then conditioning on the stop is equivalent to
conditioning on a collider. It opens up the noncausal path D → M ← U → Y, which introduces spurious patterns into the data that, depending on the signs of the underlying causal associations, may distort any true racial differences in police shootings.
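
To make the mechanism concrete, below is a minimal simulation sketch in Python. The functional forms and every number in it are invented purely for illustration; they are not estimates from Fryer [2019] or anywhere else. Race D is given no effect whatsoever on force Y, yet because the stop M is a collider between D and unobserved suspicion U, restricting the analysis to stopped individuals manufactures a racial gap in force.

import numpy as np

# Minimal sketch of the D -> M <- U -> Y collider problem. All parameters
# are invented; by construction there is NO D -> Y arrow.
rng = np.random.default_rng(0)
n = 1_000_000

d = rng.binomial(1, 0.3, n)                # D: minority indicator
u = rng.normal(0.0, 1.0, n)                # U: unobserved "suspicion"

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

m = rng.binomial(1, sigmoid(u + d - 1.5))  # M: stop depends on D and U
y = rng.binomial(1, sigmoid(u - 1.5))      # Y: force depends only on U

# Unconditional racial gap in force: zero up to simulation noise
print(y[d == 1].mean() - y[d == 0].mean())

# Gap among the stopped, the only people an administrative source observes:
# negative, because stopped whites are selected to be more "suspicious"
s = m == 1
print(y[(d == 1) & s].mean() - y[(d == 0) & s].mean())

Under these invented parameters the unconditional gap is essentially zero while the stop-conditional gap is negative, which is precisely how data that condition on the stop could mask, or even reverse the sign of, racial bias in the use of force.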
Dean Knox, Will Lowe, and Jonathan Mummolo are a talented team of
political scientists who study policing, among other things. They produced
a study that revisited Fryer’s question and, in my opinion, yielded new clues both about the role of racial bias in police use of force and about the challenges of using administrative data to study it. I consider Knox et al. [2020] one
of the more methodologically helpful studies for understanding this
problem and attempting to solve it. The study should be widely read by
every applied researcher whose day job involves working with proprietary
administrative data sets, because this DAG may in fact be a more general
problem. After all, administrative data sources are already select samples,
and depending on the study question, they may constitute a collider
problem of the sort described in this DAG. The authors develop a bias
correction procedure that places bounds on the severity of the selection
problems. When using this bounding approach, they find that even lower-bound estimates of the incidence of police violence against civilians are as much as five times higher than estimates from a traditional approach that ignores the sample selection problem altogether.
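
A stylized continuation of the earlier sketch, again in Python with invented parameters, hints at why the correction moves estimates upward. This is emphatically not Knox et al.’s actual bounding estimator, only an illustration of the direction of the bias: when race truly raises both the stop risk and the force risk, the naive stop-conditional contrast understates the true effect among people who would be stopped regardless of race.

import numpy as np

# Stylized illustration (NOT Knox et al. [2020]'s estimator): invented
# parameters in which race D truly raises both stop and force risk.
rng = np.random.default_rng(1)
n = 1_000_000

d = rng.binomial(1, 0.3, n)                # D: minority indicator
u = rng.normal(0.0, 1.0, n)                # U: unobserved "suspicion"

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Potential stops under each race; a shared uniform draw keeps them
# monotone (anyone stopped as white is also stopped as a minority)
e = rng.uniform(0.0, 1.0, n)
m0 = (e < sigmoid(u - 1.5)).astype(int)    # stop if treated as white
m1 = (e < sigmoid(u - 0.5)).astype(int)    # stop if treated as minority

# Potential force outcomes, with a true discriminatory effect of D
v = rng.uniform(0.0, 1.0, n)
y0 = (v < sigmoid(u - 2.0)).astype(int)
y1 = (v < sigmoid(u - 1.2)).astype(int)

m = np.where(d == 1, m1, m0)               # realized stop
y = np.where(d == 1, y1, y0)               # realized force (within a stop)

# Naive estimate: racial difference in force among the stopped
stopped = m == 1
naive = y[(d == 1) & stopped].mean() - y[(d == 0) & stopped].mean()

# True average effect among the "always stopped" (stopped either way)
always = (m0 == 1) & (m1 == 1)
truth = (y1[always] - y0[always]).mean()

print(naive, truth)

Under these invented parameters the naive contrast comes out smaller than the true effect among the always-stopped, consistent with the paper’s finding that selection-blind estimates understate the incidence of racially biased police violence.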
It is incorrect to say that sample selection problems were unknown
without DAGs. We’ve known about them and have had some limited solutions to them since at least Heckman [1979]. What I have tried to show
here is more general. An atheoretical approach to empiricism will simply fail. Not even “big data” will solve it; causal inference is not solved with more data, as I argue in the next chapter. Causal inference requires knowledge about the behavioral processes that structure equilibria in the world. Without that knowledge, one cannot hope to devise a credible identification strategy. No amount of data is a substitute for deep institutional knowledge
about the phenomenon you’re studying. That, strangely enough, even
includes the behavioral processes that generated the samples you’re using in
the first place. You simply must take seriously the behavioral theory that is
behind the phenomenon you’re studying if you hope to obtain believable
estimates of causal effects. And DAGs are a helpful tool for wrapping your
head around and expressing those problems.

Conclusion. DAGs are powerful tools.10 They are helpful at both clarifying the relationships between variables and guiding you in a
research design that has a shot at identifying a causal effect. The two
concepts we discussed in this chapter—the backdoor criterion and collider
bias—are but two things I wanted to bring to your attention. And since
DAGs are themselves based on counterfactual forms of reasoning, they fit
well with the potential outcomes model that I discuss in the next chapter.

Notes
1 I will discuss the Wrights again in the chapter on instrumental variables. They were an
interesting pair.
2 If you find this material interesting, I highly recommend Morgan and Winship [2014], an all-
around excellent book on causal inference, and especially on graphical models.
3 I leave out some of those details, though, because their presence (usually just error terms
pointing to the variables) clutters the graph unnecessarily.
4 Subsequent chapters discuss other estimators, such as matching.
5 Productivity could diverge, though, if women systematically sort into lower-quality occupations
in which human capital accumulates over time at a lower rate.
6 Angrist and Pischke [2009] talk about this problem in a different way using language called “bad
controls.” Bad controls are not merely outcomes that have been conditioned on. Rather, a bad control is any variable that is itself a collider linking the treatment to the outcome of interest, like O in D → O ← A → Y.
7 Erin Hengel is a professor of economics at the University of Liverpool. She and I were talking
about this on Twitter one day, and she and I wrote down the code describing this problem. Her code
was better, so I asked if I could reproduce it here, and she said yes. Erin’s work partly focuses on
gender discrimination. You can see some of that work on her website at http://www.erinhengel.com.
8 I wish I had thought of this example, but alas the sociologist Gabriel Rossman gets full credit.
9 I am not sympathetic to this claim. The administrative data comes from large Texas cities, a large county in California, the state of Florida, and several other cities and counties where racial bias has been reported.
10 There is far more to DAGs than I have covered here. If you are interested in learning more
about them, then I encourage you to carefully read Pearl [2009], which is his magnum opus and a
major contribution to the theory of causation.
