10 - Matching - Causal Inference For The Brave and True
Contents

- What is Regression Doing After All?
- The Subclassification Estimator
- Matching Estimator
- Matching Bias
- The Curse of Dimensionality
- Key Ideas
- References
- Contribute
Regression then combines the estimates in each of the cells to produce the final ATE. It does this by applying weights to each cell that are proportional to the variance of the treatment in that group.
```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from matplotlib import style
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
import graphviz as gr

%matplotlib inline
style.use("fivethirtyeight")
```
To give an example, let’s suppose I’m trying to estimate the effect of a drug and I have 6 men
and 4 women. My response variable is days hospitalised and I hope my drug can lower that.
On men, the true causal effect is -3, so the drug lowers the stay period by 3 days. On women,
it is -2. To make matters more interesting, men are much more affected by this illness and stay
longer at the hospital. They also get much more of the drug. Only 1 out of the 6 men does not
get the drug. On the other hand, women are more resistant to this illness, so they stay less at
the hospital. 50% of the women get the drug.
```python
drug_example = pd.DataFrame(dict(
    sex= ["M","M","M","M","M","M", "W","W","W","W"],
    drug=[1,1,1,1,1,0, 1,0,1,0],
    days=[5,5,5,5,5,8, 2,4,2,4]
))
```
Note that a simple comparison of treatment and control yields an effect biased toward zero, that is, the drug seems less effective than it truly is. This is expected, since we've omitted the sex confounder. In this case, the estimated ATE is smaller in magnitude than the true one, because men get more of the drug and are more affected by the illness.

```python
drug_example.query("drug==1")["days"].mean() - drug_example.query("drug==0")["days"].mean()
```

-1.1904761904761898
Since the true effect for men is -3 and the true effect for women is -2, the ATE should be

$$ATE = \dfrac{(-3 \cdot 6) + (-2 \cdot 4)}{10} = -2.6$$
This estimate is done by 1) partitioning the data into confounder cells, in this case, men and
women, 2) estimating the effect on each cell and 3) combining the estimate with a weighted
average, where the weight is the sample size of the cell or covariate group. If we had exactly
the same number of men and women in the data, the ATE estimate would be right in the
middle of the ATE of the 2 groups, -2.5. Since there are more men than women in our dataset,
the ATE estimate is a little bit closer to the men’s ATE. This is called a non-parametric
estimate, since it places no assumption on how the data was generated.
If we control for sex using regression, we will add the assumption of linearity. Regression will
also partition the data into men and women and estimate the effect on both of these groups.
So far, so good. However, when it comes to combining the effect on each group, it does not
weigh them by the sample size. Instead, regression uses weights that are proportional to the
variance of the treatment in that group. In our case, the variance of the treatment in men is smaller than in women, since only one man is in the control group. To be exact, the variance of T for men is $1/6 \cdot (1 - 1/6) = 0.139$ and for women it is $2/4 \cdot (1 - 2/4) = 0.25$. So regression will give a higher weight to women in our example, and the ATE will be a bit closer to the women's ATE of -2.
```python
smf.ols('days ~ drug + C(sex)', data=drug_example).fit().summary().tables[1]
```
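As a quick sanity check (a sketch using only the drug_example data above), we can reproduce the regression's drug coefficient by combining the within-group effects with weights proportional to each group's size times its treatment variance, $N_k \cdot Var_k(T)$:

```python
# within-group treatment effects and variance-of-treatment weights
groups = drug_example.groupby("sex")
effects = groups.apply(lambda g: g.query("drug==1")["days"].mean()
                                 - g.query("drug==0")["days"].mean())
weights = groups.apply(lambda g: len(g) * g["drug"].var(ddof=0))  # N_k * Var_k(T)

# variance-weighted average of the group effects
(effects * weights).sum() / weights.sum()  # ~-2.45
```

The result, about -2.45, sits closer to the women's ATE of -2, exactly as the variance-weighting argument predicts.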
The Subclassification Estimator

If there is some causal effect we want to estimate, like the effect of job training on earnings, and the treatment is not randomly assigned, we need to watch out for confounders. It could be that only more motivated people do the training and they would have higher earnings regardless of it. We need to estimate the effect of the training program within small groups of individuals that are roughly the same in motivation level and in any other confounders we may have.
More generally, if there is some causal effect we want to estimate, but it is hard to do so because of confounding on some variables X, what we need to do is make the treatment vs control comparison within small groups where X is the same. If we have conditional independence,

$$(Y_0, Y_1) \perp T \mid X$$

then we can write the ATE as follows:

$$ATE = \int \big(E[Y \mid X, T=1] - E[Y \mid X, T=0]\big) \, dP(x)$$
What this integral does is go through all the space of the distribution of features X, compute the difference in means in each of those tiny spaces, and combine everything into the ATE. Another way to see this is to think about a discrete set of features. In that case, we can say that the features X take on K different cells $\{X_1, X_2, \dots, X_K\}$, and what we are doing is computing the treatment effect in each cell and combining them into the ATE. In this discrete case, converting the integral to a sum, we can derive the subclassification estimator
$$\hat{ATE} = \sum_{k=1}^{K} \big(\bar{Y}_{k1} - \bar{Y}_{k0}\big) \cdot \dfrac{N_k}{N}$$

where the bars represent the mean outcome for the treated, $\bar{Y}_{k1}$, and for the non-treated, $\bar{Y}_{k0}$, in cell k, and $N_k$ is the number of observations in that same cell. As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample sizes of the cells. In our medicine example above, this would be the first estimate, which gave us $-2.6$.
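To make the formula concrete, here is a minimal sketch of this estimator applied to the drug_example data from before, with sex as the single confounder cell (the subclassification_ate helper is just an illustrative name for this sketch):

```python
def subclassification_ate(df, treatment, outcome, cell):
    # 1) partition the data into cells and estimate the effect in each one
    effects = df.groupby(cell).apply(
        lambda g: g.query(f"{treatment}==1")[outcome].mean()
                  - g.query(f"{treatment}==0")[outcome].mean())
    # 2) weight each cell by its share of the sample, N_k / N
    weights = df.groupby(cell).size() / len(df)
    # 3) combine the local effects with a weighted average
    return (effects * weights).sum()

subclassification_ate(drug_example, "drug", "days", "sex")  # -2.6
```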
Matching Estimator
The subclassification estimator isn't used much in practice (we will see shortly that this is because of the curse of dimensionality), but it gives us a nice intuition of what a causal inference estimator should do and how it should control for confounders. This allows us to explore other kinds of estimators, such as the Matching Estimator.
The idea is very similar. Since some sort of confounder X makes it so that treated and
untreated are not initially comparable, I can make them so by matching each treated unit
with a similar untreated unit. It is like I’m finding an untreated twin for every treated unit. By
making such comparisons, treated and untreated become again comparable.
As an example, let's suppose we are trying to estimate the effect of a trainee program on earnings. Here is what the trainees look like:
```python
trainee = pd.read_csv("./data/trainees.csv")
trainee.query("trainees==1")
```
unit trainees age earnings
0 1 1 28 17700
1 2 1 34 10200
2 3 1 29 14400
3 4 1 25 20800
4 5 1 29 6100
5 6 1 23 28600
6 7 1 33 21900
7 8 1 27 28800
8 9 1 31 20300
9 10 1 26 28100
10 11 1 25 9400
11 12 1 27 14300
12 13 1 29 12500
13 14 1 24 19700
14 15 1 25 10100
15 16 1 43 10700
16 17 1 28 11500
17 18 1 27 10700
18 19 1 28 16300
And here are the non-trainees:
```python
trainee.query("trainees==0")
```
unit trainees age earnings
19 20 0 43 20900
20 21 0 50 31000
21 22 0 30 21000
22 23 0 27 9300
23 24 0 54 41100
24 25 0 48 29800
25 26 0 39 42000
26 27 0 28 8800
27 28 0 24 25500
28 29 0 33 15500
29 31 0 26 400
30 32 0 31 26600
31 33 0 26 16500
32 34 0 34 24200
33 35 0 25 23300
34 36 0 24 9700
35 37 0 29 6200
36 38 0 35 30200
37 39 0 32 17800
38 40 0 23 9500
39 41 0 32 25900
If we do a simple comparison of means, we get that the trainees earn less money than those that didn't go through the program.

```python
trainee.query("trainees==1")["earnings"].mean() - trainee.query("trainees==0")["earnings"].mean()
```

-4297.49373433584
However, if we look at the tables above, we notice that trainees are much younger than non-trainees, which indicates that age is probably a confounder. Let's use matching on age to try to correct that. We will take unit 1 from the treated and pair it with unit 27, since both are 28 years old. Unit 2 we will pair with unit 34, unit 3 with unit 37, and unit 4 with unit 35… When it comes to unit 5, we need to find someone aged 29 among the non-treated, but that is unit 37, which is already paired. This is not a problem, since we can use the same unit multiple times. If more than one unit is a match, we can choose randomly between them.
This is what the matched dataset looks like for the first 7 units:

```python
# keep only one untreated unit per age, so each treated unit has a single match
unique_on_age = (trainee
                 .query("trainees==0")
                 .drop_duplicates("age"))

matches = (trainee
           .query("trainees==1")
           .merge(unique_on_age, on="age", how="left", suffixes=("_t_1", "_t_0"))
           .assign(t1_minus_t0 = lambda d: d["earnings_t_1"] - d["earnings_t_0"]))

matches.head(7)
```
unit_t_1 trainees_t_1 age earnings_t_1 unit_t_0 trainees_t_0 earnings_t_0 t1_minus_t0
0 1 1 28 17700 27 0 8800 8900
1 2 1 34 10200 34 0 24200 -14000
2 3 1 29 14400 37 0 6200 8200
3 4 1 25 20800 35 0 23300 -2500
4 5 1 29 6100 37 0 6200 -100
5 6 1 23 28600 40 0 9500 19100
6 7 1 33 21900 29 0 15500 6400
Notice how the last column has the difference in earnings between the treated and its
matched untreated unit. If we take the mean of this last column we get the ATET estimate
while controlling for age. Notice how the estimate is now very positive, compared to the
previous one where we used a simple difference in means.
```python
matches["t1_minus_t0"].mean()
```

2457.8947368421054
But this was a very contrived example, just to introduce matching. In reality, we usually have more than one feature and units don't match perfectly. In this case, we have to define some measure of proximity that tells us how close units are to each other. One common metric for this is the Euclidean norm $||X_i - X_j||$. This norm, however, is not invariant to the scale of the features. This means that features like age, which take values in the tens, will be much less important when computing the norm than features like income, which take values on the order of hundreds or thousands. For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.
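To make the scale problem concrete, here is a tiny illustration with made-up numbers (the ages, incomes and standard deviations below are hypothetical, chosen just to show the effect of scaling):

```python
# two hypothetical units: [age, income]
a = np.array([25.0, 50_000.0])
b = np.array([30.0, 49_000.0])

# unscaled norm: dominated by the income difference; the 5-year age gap barely registers
np.linalg.norm(a - b)  # ~1000.01

# dividing each feature by an (assumed) standard deviation puts them on
# comparable scales, so the age difference now matters
scale = np.array([10.0, 20_000.0])  # hypothetical std devs for age and income
np.linalg.norm(a / scale - b / scale)  # ~0.50
```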
Having defined a distance measure, we can now define the match as the nearest neighbour to the sample we wish to match. In math terms, we can write the matching estimator the following way:

$$\hat{ATE} = \dfrac{1}{N} \sum_{i=1}^{N} (2T_i - 1)\big(Y_i - Y_{jm(i)}\big)$$

where $Y_{jm(i)}$ is the sample from the other treatment group which is most similar to $Y_i$. The $2T_i - 1$ term makes us match in both directions: treated with controls and controls with the treated.
To test this estimator, let’s consider a medicine example. Once again, we want to find the
effect of a medication on days until recovery. Unfortunately, this effect is confounded by
severity, sex and age. We have reasons to believe that patients with more severe conditions
have a higher chance of receiving the medicine.
```python
med = pd.read_csv("./data/medicine_impact_recovery.csv")
med.head()
```
Looking at a simple difference in means, we see that the treated take, on average, 16.9 more days to recover than the untreated. This is probably due to confounding, since we don't expect the medicine to cause harm to the patient.

```python
med.query("medication==1")["recovery"].mean() - med.query("medication==0")["recovery"].mean()
```

16.895799546498726
To correct for this bias, we will control for X using matching. First, we need to remember to
scale our features, otherwise, features like age will have higher importance than features like
severity when we compute the distance between points. To do so, we can standardise the
features.
```python
from sklearn.neighbors import KNeighborsRegressor

# features and outcome; standardise X so all features are on the same scale
X = ["severity", "age", "sex"]
y = "recovery"
med = med.assign(**{f: (med[f] - med[f].mean()) / med[f].std() for f in X})

treated = med.query("medication==1")
untreated = med.query("medication==0")

# one 1-nearest-neighbour model fit on each treatment group
mt0 = KNeighborsRegressor(n_neighbors=1).fit(untreated[X], untreated[y])
mt1 = KNeighborsRegressor(n_neighbors=1).fit(treated[X], treated[y])

predicted = pd.concat([
    # find matches for the treated looking at the untreated knn model
    treated.assign(match=mt0.predict(treated[X])),
    # find matches for the untreated looking at the treated knn model
    untreated.assign(match=mt1.predict(untreated[X]))
])

predicted.head()
```
```python
# matching estimator: average of (2T - 1) * (Y - matched outcome)
np.mean((2 * predicted["medication"] - 1) * (predicted["recovery"] - predicted["match"]))
```

-0.9954
Using this sort of matching, we can see that the effect of the medicine is no longer positive. This means that, controlling for X, the medicine reduces the recovery time by about 1 day, on average. This is already a huge improvement on top of the biased estimate, which predicted a 16.9-day increase in recovery time.
However, we can still do better.
Matching Bias
It turns out the matching estimator as we've designed it above is biased. To see this, let's consider the ATET estimator instead of the ATE, just because it is simpler to write. The intuition applies to the ATE as well.
$$\hat{ATET} = \dfrac{1}{N_1}\sum_{i=1}^{N_1} \big(Y_i - Y_{j(i)}\big)$$

where $N_1$ is the number of treated individuals and $Y_{j(i)}$ is the untreated match of treated unit i. To check for bias, we hope we can apply the Central Limit Theorem, so that the quantity below converges to a normal distribution with mean zero:

$$\sqrt{N_1}\big(\hat{ATET} - ATET\big)$$
However, this doesn't always happen. If we define the mean outcome for the untreated given X as $\mu_0(x) = E[Y \mid X=x, T=0]$, we will have that (by the way, I've omitted the proof because it's a little beyond the point here)

$$E\big[\sqrt{N_1}\,(\hat{ATET} - ATET)\big] = E\big[\sqrt{N_1}\,\big(\mu_0(X_i) - \mu_0(X_{j(i)})\big)\big]$$
$\mu_0(X_i)$ is the outcome Y of treated unit i had it not been treated, so it is the counterfactual $Y_0$ for unit i. $\mu_0(X_{j(i)})$ is the same quantity for the match of unit i, so it is also a $Y_0$, but for unit j now. Only this time, it is a factual outcome, because j is in the non-treated group. Now, because j and i are only similar, but not the same, this difference will likely not be zero. In other words, $X_i \approx X_{j(i)}$ but $X_i \neq X_{j(i)}$, so $Y_{0i} \approx Y_{0j(i)}$ but $Y_{0i} \neq Y_{0j(i)}$.
As we increase the sample size, there will be more units to match, so the difference between unit i and its match j will get smaller. But this difference converges to zero slowly, while $\sqrt{N_1}$ grows quickly, so the scaled discrepancy doesn't necessarily vanish.
Bias arises when the matching discrepancies are huge. Fortunately, we know how to correct it. Each observation contributes $\big(\mu_0(X_i) - \mu_0(X_{j(i)})\big)$ to the bias, so all we need to do is subtract this quantity from each matching comparison in our estimator. To do so, we can replace $\mu_0(X_{j(i)})$ with some sort of estimate of this quantity, $\hat{\mu}_0(X_{j(i)})$, which can be obtained with models like linear regression. This updates the ATET estimator to the following equation:
$$\hat{ATET} = \dfrac{1}{N_1}\sum_{i=1}^{N_1} \Big(\big(Y_i - Y_{j(i)}\big) - \big(\hat{\mu}_0(X_i) - \hat{\mu}_0(X_{j(i)})\big)\Big)$$
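As a rough illustration of what this correction could look like in code, here is a minimal sketch for the treated side, assuming the scaled med data and the treated/untreated splits from before. The linear model standing in for $\hat{\mu}_0$ and the NearestNeighbors matching step are choices made just for this sketch:

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

# estimate mu_0(x) = E[Y|X=x, T=0] with a linear model fit on the untreated only
ols0 = LinearRegression().fit(untreated[X], untreated[y])

# for each treated unit, find its nearest untreated neighbour
nn0 = NearestNeighbors(n_neighbors=1).fit(untreated[X])
match_idx = nn0.kneighbors(treated[X], return_distance=False).ravel()
matched = untreated.iloc[match_idx]

# bias-corrected ATET: subtract mu0_hat(X_i) - mu0_hat(X_j(i)) from each comparison
np.mean((treated[y].values - matched[y].values)
        - (ols0.predict(treated[X]) - ols0.predict(matched[X])))
```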