Here we consider a causal inference problem with a binary treatment and a binary outcome where there is unobserved confounding, but an exogenous (also binary) instrument is available. This problem requires a number of extensions to the basic BART model, all of which can be implemented straightforwardly as Gibbs samplers using stochtree. We'll go through all of the model fitting steps in quite a lot of detail here.
To be concrete, suppose we wish to measure the effect of receiving a flu vaccine on the probability of getting the flu. Individuals who opt to get a flu shot differ in many ways from those that don't, and these lifestyle differences presumably also affect their respective chances of getting the flu. Consequently, comparing the percentage of individuals who get the flu in the vaccinated and unvaccinated groups does not give a clear picture of the vaccine efficacy.
However, a so-called encouragement design can be implemented, where some individuals are selected at random to be given some extra incentive to get a flu shot (free clinics at the workplace or a personalized reminder, for example). Studying the impact of this randomized encouragement allows us to tease apart the impact of the vaccine from the confounding factors, at least to some extent. This exact problem has been considered several times in the literature, starting with McDonald, Hui, and Tierney (1992), with follow-on analyses by Hirano et al. (2000), Richardson, Evans, and Robins (2011), and Imbens and Rubin (2015).
Our analysis here follows the Bayesian nonparametric approach described in the supplement to Hahn, Murray, and Manolopoulou (2016).
First, we load the requisite libraries:
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.stats import norm
+
+from stochtree import (
+ RNG,
+ Dataset,
+ Forest,
+ ForestContainer,
+ ForestSampler,
+ Residual,
+ ForestModelConfig,
+ GlobalModelConfig,
+)
+
Let $V$ denote the treatment variable (as in "vaccine"). Let $Y$ denote the response variable (getting the flu), $Z$ denote the instrument (encouragement or reminder to get a flu shot), and $X$ denote an additional observable covariate (for instance, patient age).
Further, let $S$ denote the so-called principal strata, an exhaustive characterization of how individuals might be affected by the encouragement regarding the flu shot. Some people will get a flu shot no matter what: these are the always takers (a). Some people will not get the flu shot no matter what: these are the never takers (n). For both always takers and never takers, the randomization of the encouragement is irrelevant; our data set contains no always takers who skipped the vaccine and no never takers who got the vaccine, so the treatment effect of the vaccine in these groups is fundamentally non-identifiable.
+By contrast, we also have compliers (c): folks who would not have gotten the shot but for the fact that they were encouraged to do so. These are the people about whom our randomized encouragement provides some information, because they are precisely the ones that have been randomized to treatment.
Lastly, we could have defiers (d): contrarians who were planning on getting the shot, but -- upon being reminded -- decided not to! For our analysis we will do the usual thing of assuming that there are no defiers. And because we are going to simulate our data, we can make sure that this assumption is true.
The causal diagram for this model can be expressed as follows. Here we are considering one confounder and moderator variable ($X$), the patient's age. In our data generating process (which we know because this is a simulation demonstration), higher age makes it more likely that a person is an always taker or complier and less likely that they are a never taker, which in turn has an effect on flu risk. We stipulate here that always takers are at lower risk and never takers at higher risk. Simultaneously, age has an increasing and then decreasing direct effect on flu risk; the very young and very old are at higher risk, while young and middle-aged adults are at lower risk. In this DGP the vaccine has a multiplicative effect, reducing flu risk by a fixed proportion of baseline risk -- accordingly, the treatment effect (as a difference) is nonlinear in age (for each principal stratum).
+The biggest question about this graph concerns the dashed red arrow from the putative instrument $Z$ to the outcome (flu). We say "putative" because if that dashed red arrow is there, then technically $Z$ is not a valid instrument. The assumption/assertion that there is no dashed red arrow is called the "exclusion restriction". In this vignette, we will explore what sorts of inferences are possible if we remain agnostic about the presence or absence of that dashed red arrow.
There are two relevant potential outcomes in an instrumental variables analysis, corresponding to the causal effect of the instrument on the treatment and the causal effect of the treatment on the outcome. In this example, that is the effect of the reminder/encouragement on vaccine status and the effect of the vaccine itself on the flu. The notation is $V(Z)$ and $Y(V(Z),Z)$ respectively, so that we have six distinct random variables: $V(0)$, $V(1)$, $Y(0,0)$, $Y(1,0)$, $Y(0,1)$ and $Y(1,1)$. The problem -- sometimes called the fundamental problem of causal inference -- is that some of these random variables can never be seen simultaneously; they are observationally mutually exclusive. For this reason, it may be helpful to think about causal inference as a missing data problem, as depicted in the following table.
| $i$ | $Z_i$ | $V_i(0)$ | $V_i(1)$ | $Y_i(0,0)$ | $Y_i(1,0)$ | $Y_i(0,1)$ | $Y_i(1,1)$ |
|---|---|---|---|---|---|---|---|
| 1 | 1 | ? | 1 | ? | ? | ? | 0 |
| 2 | 0 | 1 | ? | ? | 1 | ? | ? |
| 3 | 0 | 0 | ? | 1 | ? | ? | ? |
| 4 | 1 | ? | 0 | ? | ? | 0 | ? |
Likewise, with this notation we can formally define the principal strata:
| $V_i(0)$ | $V_i(1)$ | $S_i$ |
|---|---|---|
| 0 | 0 | Never Taker ($n$) |
| 1 | 1 | Always Taker ($a$) |
| 0 | 1 | Complier ($c$) |
| 1 | 0 | Defier ($d$) |
Let $\pi_s(x)$ denote the conditional (on $x$) probability that an individual belongs to principal stratum $s$:
\begin{equation}
\pi_s(x)=\operatorname{Pr}(S=s \mid X=x),
\end{equation}
and let $\gamma_s^{v z}(x)$ denote the potential outcome probability for given values $v$ and $z$:
\begin{equation}
\gamma_s^{v z}(x)=\operatorname{Pr}(Y(v, z)=1 \mid S=s, X=x).
\end{equation}
Various estimands of interest may be expressed in terms of the functions $\gamma_c^{vz}(x)$. In particular, the complier conditional average treatment effect $$\gamma_c^{1,z}(x) - \gamma_c^{0,z}(x)$$ is the ultimate goal (for either $z=0$ or $z=1$). Under an exclusion restriction, we would have $\gamma_s^{vz}(x) = \gamma_s^{v}(x)$ and the reminder status $z$ itself would not matter. In that case, we can estimate $$\gamma_c^{1,z}(x) - \gamma_c^{0,z}(x)$$ and $$\gamma_c^{1,1}(x) - \gamma_c^{0,0}(x).$$ This latter quantity is called the complier intent-to-treat effect, or $ITT_c$, and it can be partially identified even if the exclusion restriction is violated, as follows.
The left-hand sides of the following system of equations are all estimable quantities that can be learned from observable data, while the right-hand side expressions involve the unknown functions of interest, $\gamma_s^{vz}(x)$:
\begin{equation}
\begin{aligned}
p_{1 \mid 00}(x) &= \operatorname{Pr}(Y=1 \mid V=0, Z=0, X=x)=\frac{\pi_c(x)}{\pi_c(x)+\pi_n(x)} \gamma_c^{00}(x)+\frac{\pi_n(x)}{\pi_c(x)+\pi_n(x)} \gamma_n^{00}(x), \\
p_{1 \mid 11}(x) &= \operatorname{Pr}(Y=1 \mid V=1, Z=1, X=x)=\frac{\pi_c(x)}{\pi_c(x)+\pi_a(x)} \gamma_c^{11}(x)+\frac{\pi_a(x)}{\pi_c(x)+\pi_a(x)} \gamma_a^{11}(x), \\
p_{1 \mid 01}(x) &= \operatorname{Pr}(Y=1 \mid V=0, Z=1, X=x)=\frac{\pi_d(x)}{\pi_d(x)+\pi_n(x)} \gamma_d^{01}(x)+\frac{\pi_n(x)}{\pi_d(x)+\pi_n(x)} \gamma_n^{01}(x), \\
p_{1 \mid 10}(x) &= \operatorname{Pr}(Y=1 \mid V=1, Z=0, X=x)=\frac{\pi_d(x)}{\pi_d(x)+\pi_a(x)} \gamma_d^{10}(x)+\frac{\pi_a(x)}{\pi_d(x)+\pi_a(x)} \gamma_a^{10}(x).
\end{aligned}
\end{equation}
Furthermore, we have
\begin{equation}
\begin{aligned}
\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x)+\pi_d(x),\\
\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x).
\end{aligned}
\end{equation}
Under the monotonicity assumption, $\pi_d(x) = 0$, and these expressions simplify somewhat:
\begin{equation}
\begin{aligned}
p_{1 \mid 00}(x)&=\frac{\pi_c(x)}{\pi_c(x)+\pi_n(x)} \gamma_c^{00}(x)+\frac{\pi_n(x)}{\pi_c(x)+\pi_n(x)} \gamma_n^{00}(x), \\
p_{1 \mid 11}(x)&=\frac{\pi_c(x)}{\pi_c(x)+\pi_a(x)} \gamma_c^{11}(x)+\frac{\pi_a(x)}{\pi_c(x)+\pi_a(x)} \gamma_a^{11}(x), \\
p_{1 \mid 01}(x)&=\gamma_n^{01}(x), \\
p_{1 \mid 10}(x)&=\gamma_a^{10}(x),
\end{aligned}
\end{equation}
and
\begin{equation}
\begin{aligned}
\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x),\\
\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x).
\end{aligned}
\end{equation}
The exclusion restriction would dictate that $\gamma_s^{01}(x) = \gamma_s^{00}(x)$ and $\gamma_s^{11}(x) = \gamma_s^{10}(x)$ for all $s$. This has two implications. One, $\gamma_n^{01}(x) = \gamma_n^{00}(x)$ and $\gamma_a^{10}(x) = \gamma_a^{11}(x)$, and because the left-hand terms are identified, this permits $\gamma_c^{11}(x)$ and $\gamma_c^{00}(x)$ to be solved for by substitution. Two, with these two quantities solved for, we also have the two other quantities (the different settings of $z$), since $\gamma_c^{11}(x) = \gamma_c^{10}(x)$ and $\gamma_c^{00}(x) = \gamma_c^{01}(x)$. Consequently, both of our estimands from above can be estimated:
$$\gamma_c^{11}(x) - \gamma_c^{01}(x)$$
and
$$\gamma_c^{10}(x) - \gamma_c^{00}(x),$$
because they are both (supposing the exclusion restriction holds) the same as
$$\gamma_c^{11}(x) - \gamma_c^{00}(x).$$
If the exclusion restriction does not hold, then the three treatment effects above are all (potentially) distinct and not much can be said about the former two. The latter one, the $ITT_c$, can however be partially identified by recognizing that the first two equations (in our four-equation system) provide non-trivial bounds, because while $\gamma_c^{11}(x)$ and $\gamma_c^{00}(x)$ are no longer point identified, as probabilities they must both lie between 0 and 1. Thus,
\begin{equation}
\begin{aligned}
\max\left(0, \frac{\pi_c(x)+\pi_n(x)}{\pi_c(x)}p_{1\mid 00}(x) - \frac{\pi_n(x)}{\pi_c(x)}\right)
&\leq \gamma^{00}_c(x) \leq
\min\left(1, \frac{\pi_c(x)+\pi_n(x)}{\pi_c(x)}p_{1\mid 00}(x)\right), \\
\max\left(0, \frac{\pi_a(x)+\pi_c(x)}{\pi_c(x)}p_{1\mid 11}(x) - \frac{\pi_a(x)}{\pi_c(x)}\right)
&\leq \gamma^{11}_c(x) \leq
\min\left(1, \frac{\pi_a(x)+\pi_c(x)}{\pi_c(x)}p_{1\mid 11}(x)\right).
\end{aligned}
\end{equation}
The point of all this is that the data (plus a no-defiers assumption) let us estimate all of the inputs to these upper and lower bounds on $\gamma^{11}_c(x)$ and $\gamma^{00}_c(x)$, which in turn define our estimand. What remains is to estimate those inputs, as functions of $x$, while enforcing the monotonicity restriction $$\operatorname{Pr}(V=1 \mid Z=0, X=x)=\pi_a(x) \leq \operatorname{Pr}(V=1 \mid Z=1, X=x)=\pi_a(x)+\pi_c(x).$$
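To make these bounds concrete, here is a minimal helper (a sketch; the function name and signature are ours, not part of stochtree) that maps estimates of $\pi_a(x)$, $\pi_c(x)$, $\pi_n(x)$, $p_{1\mid 00}(x)$ and $p_{1\mid 11}(x)$, evaluated on a common grid of $x$ values, to bounds on the $ITT_c$; the same arithmetic appears inside the posterior post-processing loop near the end of this vignette.
import numpy as np

def itt_c_bounds(pi_a, pi_c, pi_n, p1_00, p1_11):
    # Bounds on gamma_c^{00}(x): max(0, ((pi_c+pi_n)/pi_c)*p_{1|00} - pi_n/pi_c) <= gamma_c^{00} <= min(1, ((pi_c+pi_n)/pi_c)*p_{1|00})
    lower00 = np.maximum(0.0, ((pi_c + pi_n) / pi_c) * p1_00 - pi_n / pi_c)
    upper00 = np.minimum(1.0, ((pi_c + pi_n) / pi_c) * p1_00)
    # Bounds on gamma_c^{11}(x): max(0, ((pi_a+pi_c)/pi_c)*p_{1|11} - pi_a/pi_c) <= gamma_c^{11} <= min(1, ((pi_a+pi_c)/pi_c)*p_{1|11})
    lower11 = np.maximum(0.0, ((pi_a + pi_c) / pi_c) * p1_11 - pi_a / pi_c)
    upper11 = np.minimum(1.0, ((pi_a + pi_c) / pi_c) * p1_11)
    # ITT_c = gamma_c^{11} - gamma_c^{00} is bounded by the worst-case differences
    return lower11 - upper00, upper11 - lower00
The returned pair bounds the $ITT_c$ at each grid point; the post-processing code later in the vignette goes further, drawing values within these intervals centered (when possible) at the point implied by the exclusion restriction.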
We can do all of this with calls to stochtree from Python (or R). But first, let's generate some test data.
+Start with some initial setup / housekeeping
+# Size of the training sample
+n = 20000
+
+# To set the seed for reproducibility/illustration purposes, replace "None" with a positive integer
+random_seed = None
+if random_seed is not None:
+ rng = np.random.default_rng(random_seed)
+else:
+ rng = np.random.default_rng()
+
First, we generate the instrument exogenously
+z = rng.binomial(n=1, p=0.5, size=n)
+
Next, we generate the covariate. (For this example, let's think of it as patient age, although we are generating it from a uniform distribution between 0 and 3, so you have to imagine that it has been pre-standardized to this scale. It keeps the DGPs cleaner for illustration purposes.)
+p_X = 1
+X = rng.uniform(low=0., high=3., size=(n,p_X))
+x = X[:,0] # for ease of reference later
+
Next, we generate the principal strata $S$ based on the observed value of $X$. We generate it according to a multinomial logistic regression with two coefficients per stratum, an intercept and a slope. Here, these coefficients are set so that the probability of being a never taker decreases with age.
+alpha_a = 0
+beta_a = 1
+
+alpha_n = 1
+beta_n = -1
+
+alpha_c = 1
+beta_c = 1
+
+# Define function (a logistic model) to generate Pr(S = s | X = x)
+def pi_s(xval, alpha_a, beta_a, alpha_n, beta_n, alpha_c, beta_c):
+ w_a = np.exp(alpha_a + beta_a*xval)
+ w_n = np.exp(alpha_n + beta_n*xval)
+ w_c = np.exp(alpha_c + beta_c*xval)
+ w = np.column_stack((w_a, w_n, w_c))
+ w_rowsum = np.sum(w, axis=1, keepdims=True)
+ return np.divide(w, w_rowsum)
+
+# Sample principal strata based on observed probabilities
+strata_probs = pi_s(X[:,0], alpha_a, beta_a, alpha_n, beta_n, alpha_c, beta_c)
+s = np.empty_like(X[:,0], dtype=str)
+for i in range(s.size):
+ s[i] = rng.choice(a=['a','n','c'], size=1, p=strata_probs[i,:])[0]
+
Next, we generate the treatment variable, here denoted $V$ (for "vaccine"), as a deterministic function of $S$ and $Z$; this is what gives the principal strata their meaning.
+v = 1*(s=='a') + 0*(s=='n') + z*(s=="c") + (1-z)*(s == "d")
+
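As a quick, optional sanity check (not part of the analysis proper), we can verify that the simulated treatment rates by instrument arm line up with the identities derived above under no defiers, namely $\operatorname{Pr}(V=1 \mid Z=0) = E[\pi_a(X)]$ and $\operatorname{Pr}(V=1 \mid Z=1) = E[\pi_a(X)] + E[\pi_c(X)]$.
# Empirical treatment rates by instrument arm
print("Pr(V=1 | Z=0) (empirical):", np.mean(v[z == 0]))
print("Pr(V=1 | Z=1) (empirical):", np.mean(v[z == 1]))

# Implied by the sampled strata probabilities (columns of strata_probs are a, n, c)
print("E[pi_a(X)]:", np.mean(strata_probs[:, 0]))
print("E[pi_a(X)] + E[pi_c(X)]:", np.mean(strata_probs[:, 0] + strata_probs[:, 2]))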
Finally, the outcome structural model is specified, based on which the outcome is sampled. By varying this function in particular ways, we can alter the identification conditions.
+def gamfun(xval, vval, zval, sval):
+ """
+ If this function depends on zval, then exclusion restriction is violated.
+ If this function does not depend on sval, then IV analysis wasn't necessary.
+ If this function does not depend on x, then there are no HTEs.
+ """
+ baseline = norm.cdf(2 - 1*xval - 2.5*((xval-1.5)**2) - 0.5*zval + 1*(sval=="n") - 1*(sval=="a"))
+ return baseline - 0.5*vval*baseline
+
+y = rng.binomial(n=1, p=gamfun(X[:,0],v,z,s), size=n)
+
Lastly, we perform some organization for our supervised learning algorithms later on.
+# Concatenate X, v and z for our supervised learning algorithms
+Xall = np.concatenate((X, np.column_stack((v,z))), axis=1)
+
+# Update the size of "X" to be the size of Xall
+p_X = p_X + 2
+
+# For the monotone probit model it is necessary to sort the observations so that the Z=1 cases are all together
+# at the start of the outcome vector.
+sort_index = np.argsort(z)[::-1]
+X = X[sort_index,:]
+Xall = Xall[sort_index,:]
+z = z[sort_index]
+v = v[sort_index]
+s = s[sort_index]
+y = y[sort_index]
+x = x[sort_index]
+
Now let's see if we can recover these functions from the observed data.
We have to fit three models here: the treatment models $\operatorname{Pr}(V = 1 \mid Z = 1, X=x)$ and $\operatorname{Pr}(V = 1 \mid Z = 0, X = x)$, subject to the monotonicity constraint $\operatorname{Pr}(V = 1 \mid Z = 1, X=x) \geq \operatorname{Pr}(V = 1 \mid Z = 0, X = x)$, and an outcome model $\operatorname{Pr}(Y = 1 \mid V = v, Z = z, X = x)$. All of this will be done with stochtree.
The outcome model is fit with a single (S-learner) BART model. This part of the model could instead be fit as a T-learner or as a BCF model; here we use an S-learner for simplicity. Both the outcome and treatment models are probit models and use the well-known Albert and Chib (1993) data augmentation Gibbs sampler. This section covers the more straightforward outcome model; the next section describes how the monotonicity constraint is handled with a data augmentation Gibbs sampler.
These models could (and probably should) be wrapped as functions. Here they are implemented as scripts, with the full loops shown. The outputs -- at the end of the loops -- are stochtree forest objects from which we can extract posterior samples and generate predictions. In particular, the $ITT_c$ will be constructed using posterior counterfactual predictions derived from these forest objects.
+We begin by setting a bunch of hyperparameters and instantiating the forest objects to be operated upon in the main sampling loop. We also initialize the latent variables.
+# Fit the BART model for Pr(Y = 1 | V = v, Z = z, X = x)
+
+# Set number of iterations
+num_warmstart = 10
+num_mcmc = 1000
+num_samples = num_warmstart + num_mcmc
+
+# Set a bunch of hyperparameters. These are ballpark default values.
+alpha = 0.95
+beta = 2
+min_samples_leaf = 1
+max_depth = 20
+num_trees = 50
+cutpoint_grid_size = 100
+global_variance_init = 1.
+tau_init = 0.5
+leaf_prior_scale = np.array([[tau_init]])
+leaf_regression = False
+feature_types = np.append(np.repeat(0, p_X - 2), [1,1]).astype(int)
+var_weights = np.repeat(1.0/p_X, p_X)
+outcome_model_type = 0
+
+# C++ dataset
+forest_dataset = Dataset()
+forest_dataset.add_covariates(Xall)
+
+# Random number generator (std::mt19937)
+if random_seed is not None:
+ cpp_rng = RNG(random_seed)
+else:
+ cpp_rng = RNG()
+
+# Sampling data structures
+forest_model_config = ForestModelConfig(
+ feature_types = feature_types,
+ num_trees = num_trees,
+ num_features = p_X,
+ num_observations = n,
+ variable_weights = var_weights,
+ leaf_dimension = 1,
+ alpha = alpha,
+ beta = beta,
+ min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth,
+ leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale,
+ cutpoint_grid_size = cutpoint_grid_size
+)
+global_model_config = GlobalModelConfig(global_error_variance=1.0)
+forest_sampler = ForestSampler(
+ forest_dataset, global_model_config, forest_model_config
+)
+
+# Container of forest samples
+forest_samples = ForestContainer(num_trees, 1, True, False)
+
+# "Active" forest state
+active_forest = Forest(num_trees, 1, True, False)
+
+# Initialize the latent outcome zed
+n1 = np.sum(y)
+zed = 0.25*(2.0*y - 1.0)
+
+# C++ outcome variable
+outcome = Residual(zed)
+
+# Initialize the active forest and subtract each root tree's predictions from outcome
+forest_init_val = np.array([0.0])
+forest_sampler.prepare_for_sampler(
+ forest_dataset,
+ outcome,
+ active_forest,
+ outcome_model_type,
+ forest_init_val,
+)
+
Now we enter the main loop, which involves only two steps: sample the forest, given the latent utilities, then sample the latent utilities given the estimated conditional means defined by the forest and its parameters.
+gfr_flag = True
+for i in range(num_samples):
+ # The first num_warmstart iterations use the grow-from-root algorithm of He and Hahn
+ if i >= num_warmstart:
+ gfr_flag = False
+
+ # Sample forest
+ forest_sampler.sample_one_iteration(
+ forest_samples, active_forest, forest_dataset, outcome, cpp_rng,
+ global_model_config, forest_model_config, keep_forest=True, gfr = gfr_flag
+ )
+
+ # Get the current means
+ eta = np.squeeze(forest_samples.predict_raw_single_forest(forest_dataset, i))
+
+ # Sample latent normals, truncated according to the observed outcome y
+ mu0 = eta[y == 0]
+ mu1 = eta[y == 1]
+ u0 = rng.uniform(
+ low=0.0,
+ high=norm.cdf(0 - mu0),
+ size=n-n1,
+ )
+ u1 = rng.uniform(
+ low=norm.cdf(0 - mu1),
+ high=1.0,
+ size=n1,
+ )
+ zed[y == 0] = mu0 + norm.ppf(u0)
+ zed[y == 1] = mu1 + norm.ppf(u1)
+
+ # Update outcome
+ new_outcome = np.squeeze(zed) - eta
+ outcome.update_data(new_outcome)
+
The monotonicity constraint relies on a data augmentation scheme described in Papakostas et al. (2023). The implementation of this sampler is inherently cumbersome: one of the "data" vectors is constructed from some observed data and some latent data, and there are two forest objects, one of which applies to all of the observations and one of which applies only to those observations with $Z = 0$. We go into more detail about this sampler in a dedicated vignette. Here we include the code, but without reproducing the equations derived in Papakostas et al. (2023). What is most important is simply that
\begin{equation}
\begin{aligned}
\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x) = \Phi_f(x)\Phi_h(x),\\
\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x) = \Phi_f(x),
\end{aligned}
\end{equation}
where $\Phi_{\mu}(x)$ denotes the normal cumulative distribution function with mean $\mu(x)$ and variance 1.
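In other words, once we have (probit-scale) draws of $f(x)$ and $h(x)$, the strata probabilities follow immediately: $\pi_a(x) = \Phi_f(x)\Phi_h(x)$, $\pi_c(x) = \Phi_f(x)\bigl(1-\Phi_h(x)\bigr)$ and $\pi_n(x) = 1 - \Phi_f(x)$. Here is a small sketch of that mapping (the helper name is ours); it mirrors the post-processing applied to the forest predictions further below.
from scipy.stats import norm

def strata_probs_from_forests(eta_f, eta_h):
    # eta_f, eta_h: probit-scale predictions of f(x) and h(x) on a common grid
    phi_f = norm.cdf(eta_f)          # Pr(V=1 | Z=1, x) = pi_a + pi_c
    phi_h = norm.cdf(eta_h)
    pi_a = phi_f * phi_h             # Pr(V=1 | Z=0, x)
    pi_c = phi_f - pi_a              # = Phi_f * (1 - Phi_h)
    pi_n = 1.0 - phi_f
    return pi_a, pi_c, pi_n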
+We first create a secondary data matrix for the $Z=0$ group only. We also set all of the hyperparameters and initialize the latent variables.
+# Fit the monotone probit model to the treatment such that Pr(V = 1 | Z = 1, X=x) >= Pr(V = 1 | Z = 0,X = x)
+X_h = X[z==0,:]
+n0 = np.sum(z==0)
+n1 = np.sum(z==1)
+
+num_trees_f = 50
+num_trees_h = 20
+feature_types = np.repeat(0, p_X-2).astype(int)
+var_weights = np.repeat(1.0/(p_X - 2.0), p_X - 2)
+cutpoint_grid_size = 100
+global_variance_init = 1.
+tau_init_f = 1/num_trees_f
+tau_init_h = 1/num_trees_h
+leaf_prior_scale_f = np.array([[tau_init_f]])
+leaf_prior_scale_h = np.array([[tau_init_h]])
+leaf_regression = False # fit a constant leaf mean BART model
+
+# Instantiate the C++ dataset objects
+forest_dataset_f = Dataset()
+forest_dataset_f.add_covariates(X)
+forest_dataset_h = Dataset()
+forest_dataset_h.add_covariates(X_h)
+
+# Tell it we're fitting a normal BART model
+outcome_model_type = 0
+
+# Set up model configuration objects
+forest_model_config_f = ForestModelConfig(
+ feature_types = feature_types,
+ num_trees = num_trees_f,
+ num_features = X.shape[1],
+ num_observations = n,
+ variable_weights = var_weights,
+ leaf_dimension = 1,
+ alpha = alpha,
+ beta = beta,
+ min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth,
+ leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale_f,
+ cutpoint_grid_size = cutpoint_grid_size
+)
+forest_model_config_h = ForestModelConfig(
+ feature_types = feature_types,
+ num_trees = num_trees_h,
+ num_features = X_h.shape[1],
+ num_observations = n0,
+ variable_weights = var_weights,
+ leaf_dimension = 1,
+ alpha = alpha,
+ beta = beta,
+ min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth,
+ leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale_h,
+ cutpoint_grid_size = cutpoint_grid_size
+)
+global_model_config = GlobalModelConfig(global_error_variance=global_variance_init)
+
+# Instantiate the sampling data structures
+forest_sampler_f = ForestSampler(
+ forest_dataset_f, global_model_config, forest_model_config_f
+)
+forest_sampler_h = ForestSampler(
+ forest_dataset_h, global_model_config, forest_model_config_h
+)
+
+# Instantiate containers of forest samples
+forest_samples_f = ForestContainer(num_trees_f, 1, True, False)
+forest_samples_h = ForestContainer(num_trees_h, 1, True, False)
+
+# Instantiate "active" forests
+active_forest_f = Forest(num_trees_f, 1, True, False)
+active_forest_h = Forest(num_trees_h, 1, True, False)
+
+# Set algorithm specifications
+# these are set in the earlier script for the outcome model; number of draws needs to be commensurable
+
+# num_warmstart = 40
+# num_mcmc = 5000
+# num_samples = num_warmstart + num_mcmc
+
+# Initialize the Markov chain
+
+# Initialize (R0, R1), the latent binary variables that enforce the monotonicity constraint
+v1 = v[z==1]
+v0 = v[z==0]
+
+R1 = np.empty(n0, dtype=float)
+R0 = np.empty(n0, dtype=float)
+
+R1[v0==1] = 1
+R0[v0==1] = 1
+
+nv0 = np.sum(v0==0)
+R1[v0 == 0] = 0
+R0[v0 == 0] = rng.choice([0,1], size = nv0)
+
+# The first n1 observations of vaug are actually observed
+# The next n0 of them are the latent variable R1
+vaug = np.append(v1, R1)
+
+# Initialize the Albert and Chib latent Gaussian variables
+z_f = (2.0*vaug - 1.0)
+z_h = (2.0*R0 - 1.0)
+z_f = z_f/np.std(z_f)
+z_h = z_h/np.std(z_h)
+
+# Pass these variables to the BART models as outcome variables
+outcome_f = Residual(z_f)
+outcome_h = Residual(z_h)
+
+# Initialize active forests to constant (0) predictions
+forest_init_val_f = np.array([0.0])
+forest_sampler_f.prepare_for_sampler(
+ forest_dataset_f,
+ outcome_f,
+ active_forest_f,
+ outcome_model_type,
+ forest_init_val_f,
+)
+forest_init_val_h = np.array([0.0])
+forest_sampler_h.prepare_for_sampler(
+ forest_dataset_h,
+ outcome_h,
+ active_forest_h,
+ outcome_model_type,
+ forest_init_val_h,
+)
+
Now we run the main sampling loop, which consists of three key steps: sampling the BART forests, given the latent probit utilities; sampling the latent binary pairs $(R_0, R_1)$, given the forest predictions and the observed treatments (this is the step that enforces monotonicity); and finally sampling the latent utilities, given the updated binary variables and the forest predictions.
+# PART IV: run the Markov chain
+
+# Initialize the Markov chain with num_warmstart grow-from-root iterations
+gfr_flag = True
+for i in range(num_samples):
+ # Switch over to random walk Metropolis-Hastings tree updates after num_warmstart
+ if i >= num_warmstart:
+ gfr_flag = False
+
+ # Step 1: Sample the BART forests
+
+ # Sample forest for the function f based on (y_f, R1)
+ forest_sampler_f.sample_one_iteration(
+ forest_samples_f, active_forest_f, forest_dataset_f, outcome_f, cpp_rng,
+ global_model_config, forest_model_config_f, keep_forest=True, gfr = gfr_flag
+ )
+
+ # Sample forest for the function h based on outcome R0
+ forest_sampler_h.sample_one_iteration(
+ forest_samples_h, active_forest_h, forest_dataset_h, outcome_h, cpp_rng,
+ global_model_config, forest_model_config_h, keep_forest=True, gfr = gfr_flag
+ )
+
+ # Get the current means
+ eta_f = np.squeeze(forest_samples_f.predict_raw_single_forest(forest_dataset_f, i))
+ eta_h = np.squeeze(forest_samples_h.predict_raw_single_forest(forest_dataset_h, i))
+
+ # Step 2: sample the latent binary pair (R0, R1) given eta_h, eta_f, and the observed treatments v0
+
+ # Three cases: (0,0), (0,1), (1,0)
+ w1 = (1 - norm.cdf(eta_h[v0==0]))*(1 - norm.cdf(eta_f[n1 + np.where(v0==0)[0]]))
+ w2 = (1 - norm.cdf(eta_h[v0==0]))*norm.cdf(eta_f[n1 + np.where(v0==0)[0]])
+ w3 = norm.cdf(eta_h[v0==0])*(1 - norm.cdf(eta_f[n1 + np.where(v0==0)[0]]))
+
+ wsum = w1 + w2 + w3
+ w1 = w1/wsum
+ w2 = w2/wsum
+ w3 = w3/wsum
+
+ u = rng.uniform(low=0,high=1,size=np.sum(v0==0))
+ temp = 1*(np.squeeze(u < w1)) + 2*(np.squeeze((u > w1) & (u < (w1 + w2)))) + 3*(np.squeeze(u > (w1 + w2)))
+
+ R1[v0==0] = 1*(temp==2)
+ R0[v0==0] = 1*(temp==3)
+
+ # Redefine vaug with the updated R1 component
+ vaug = np.append(v1, R1)
+
+ # Step 3: sample the latent normals, given (R0, R1) and y_f
+
+ # First z_h (the latent utilities for the h forest)
+ mu1 = eta_h[R0==1]
+ U1 = rng.uniform(
+ low=norm.cdf(0 - mu1),
+ high=1,
+ size=np.sum(R0).astype(int)
+ )
+ z_h[R0==1] = mu1 + norm.ppf(U1)
+
+ mu0 = eta_h[R0==0]
+ U0 = rng.uniform(
+ low=0,
+ high=norm.cdf(0 - mu0),
+ size=(n0 - np.sum(R0)).astype(int)
+ )
+ z_h[R0==0] = mu0 + norm.ppf(U0)
+
+ # Then z_f (the latent utilities for the f forest)
+ mu1 = eta_f[vaug==1]
+ U1 = rng.uniform(
+ low=norm.cdf(0 - mu1),
+ high=1,
+ size=np.sum(vaug).astype(int)
+ )
+ z_f[vaug==1] = mu1 + norm.ppf(U1)
+
+ mu0 = eta_f[vaug==0]
+ U0 = rng.uniform(
+ low=0,
+ high=norm.cdf(0 - mu0),
+ size=(n - np.sum(vaug)).astype(int)
+ )
+ z_f[vaug==0] = mu0 + norm.ppf(U0)
+
+ # Propagate the updated outcomes through the BART models
+ new_outcome_h = np.squeeze(z_h) - eta_h
+ outcome_h.update_data(new_outcome_h)
+
+ new_outcome_f = np.squeeze(z_f) - eta_f
+ outcome_f.update_data(new_outcome_f)
+
Now for the most interesting part, which is taking the stochtree BART model fits and producing the causal estimates of interest.
First we set up our grid for plotting the functions in $X$. This is possible in this example because the moderator, age, is one dimensional; in many applied problems this will not be the case and visualization will be substantially trickier.
+# Extract the credible intervals for the conditional treatment effects as a function of x.
+# We use a grid of values for plotting, with grid points that are typically fewer than the number of observations.
+
+ngrid = 200
+xgrid = np.linspace(start=0.1, stop=2.9, num=ngrid)
+X_11 = np.column_stack((xgrid, np.ones(ngrid), np.ones(ngrid)))
+X_00 = np.column_stack((xgrid, np.zeros(ngrid), np.zeros(ngrid)))
+X_01 = np.column_stack((xgrid, np.zeros(ngrid), np.ones(ngrid)))
+X_10 = np.column_stack((xgrid, np.ones(ngrid), np.zeros(ngrid)))
+
Next, we compute the true function evaluations on this plotting grid, using the functions defined above when we generated our data.
+# Compute the true conditional outcome probabilities for plotting
+pi_strat = pi_s(xgrid, alpha_a, beta_a, alpha_n, beta_n, alpha_c, beta_c)
+w_a = pi_strat[:,0]
+w_n = pi_strat[:,1]
+w_c = pi_strat[:,2]
+
+w = (w_c/(w_a + w_c))
+p11_true = w*gamfun(xgrid,1,1,"c") + (1-w)*gamfun(xgrid,1,1,"a")
+
+w = (w_c/(w_n + w_c))
+p00_true = w*gamfun(xgrid,0,0,"c") + (1-w)*gamfun(xgrid,0,0,"n")
+
+# Compute the true ITT_c for plotting and comparison
+itt_c_true = gamfun(xgrid,1,1,"c") - gamfun(xgrid,0,0,"c")
+
+# Compute the true LATE for plotting and comparison
+LATE_true0 = gamfun(xgrid,1,0,"c") - gamfun(xgrid,0,0,"c")
+LATE_true1 = gamfun(xgrid,1,1,"c") - gamfun(xgrid,0,1,"c")
+
Next we populate the data structures for stochtree to operate on, call the predict functions to extract the predictions, and convert them to the probability scale using the scipy.stats.norm.cdf method.
# Datasets for counterfactual predictions
+forest_dataset_grid = Dataset()
+forest_dataset_grid.add_covariates(np.expand_dims(xgrid, 1))
+forest_dataset_11 = Dataset()
+forest_dataset_11.add_covariates(X_11)
+forest_dataset_00 = Dataset()
+forest_dataset_00.add_covariates(X_00)
+forest_dataset_10 = Dataset()
+forest_dataset_10.add_covariates(X_10)
+forest_dataset_01 = Dataset()
+forest_dataset_01.add_covariates(X_01)
+
+# Forest predictions
+preds_00 = forest_samples.predict(forest_dataset_00)
+preds_11 = forest_samples.predict(forest_dataset_11)
+preds_01 = forest_samples.predict(forest_dataset_01)
+preds_10 = forest_samples.predict(forest_dataset_10)
+
+# Probability transformations
+phat_00 = norm.cdf(preds_00)
+phat_11 = norm.cdf(preds_11)
+phat_01 = norm.cdf(preds_01)
+phat_10 = norm.cdf(preds_10)
+
+preds_ac = forest_samples_f.predict(forest_dataset_grid)
+phat_ac = norm.cdf(preds_ac)
+
+preds_adj = forest_samples_h.predict(forest_dataset_grid)
+phat_a = norm.cdf(preds_ac) * norm.cdf(preds_adj)
+phat_c = phat_ac - phat_a
+phat_n = 1 - phat_ac
+
Now we may plot posterior means of various quantities (as a function of $X$) to visualize how well the models are fitting.
+fig, (ax1, ax2) = plt.subplots(1, 2)
+ax1.scatter(p11_true, np.mean(phat_11, axis=1), color="black")
+ax1.axline((0, 0), slope=1, color="red", linestyle=(0, (3, 3)))
+ax2.scatter(p00_true, np.mean(phat_00, axis=1), color="black")
+ax2.axline((0, 0), slope=1, color="red", linestyle=(0, (3, 3)))
+plt.show()
+
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharex="none", sharey="none")
+ax1.scatter(np.mean(phat_ac, axis=1), w_c + w_a, color="black")
+ax1.axline((0, 0), slope=1, color="red", linestyle=(0, (3, 3)))
+ax1.set_xlim(0.5,1.1)
+ax1.set_ylim(0.5,1.1)
+ax2.scatter(np.mean(phat_a, axis=1), w_a, color="black")
+ax2.axline((0, 0), slope=1, color="red", linestyle=(0, (3, 3)))
+ax2.set_xlim(0.1,0.4)
+ax2.set_ylim(0.1,0.3)
+ax3.scatter(np.mean(phat_c, axis=1), w_c, color="black")
+ax3.axline((0, 0), slope=1, color="red", linestyle=(0, (3, 3)))
+ax3.set_xlim(0.4,0.9)
+ax3.set_ylim(0.4,0.8)
+plt.show()
+
These plots are not as pretty as we might hope, but mostly this is a function of how difficult it is to learn conditional probabilities from binary outcomes. Capturing the broad trend turns out to be adequate for estimating treatment effects. The fit does improve with simpler DGPs and larger training sets, as can be confirmed by experimentation with this script.
Lastly, we can construct the estimate of the $ITT_c$ and compare it to the true value as well as to the $Z=0$ and $Z=1$ complier average treatment effects (also called "local average treatment effects", or LATEs). The key step in this process is to center the posterior (at each iteration of the sampler) within the identified interval, at the value implied by a valid exclusion restriction. For some draws this will not be possible, as that value will lie outside the identification region.
+# Generate draws from the posterior of the treatment effect
+# centered at the point-identified value under the exclusion restriction
+itt_c = np.empty((ngrid, phat_c.shape[1]))
+late = np.empty((ngrid, phat_c.shape[1]))
+ss = 6
+for j in range(phat_c.shape[1]):
+ # Value of gamma11 implied by an exclusion restriction
+ gamest11 = ((phat_a[:,j] + phat_c[:,j])/phat_c[:,j])*phat_11[:,j] - phat_10[:,j]*phat_a[:,j]/phat_c[:,j]
+
+ # Identified region for gamma11
+ lower11 = np.maximum(0., ((phat_a[:,j] + phat_c[:,j])/phat_c[:,j])*phat_11[:,j] - phat_a[:,j]/phat_c[:,j])
+ upper11 = np.minimum(1., ((phat_a[:,j] + phat_c[:,j])/phat_c[:,j])*phat_11[:,j])
+
+ # Center a beta distribution at gamma11, but restricted to (lower11, upper11)
+ # do this by shifting and scaling the mean, drawing from a beta on (0,1), then shifting and scaling to the
+ # correct restricted interval
+ m11 = (gamest11 - lower11)/(upper11 - lower11)
+
+ # Parameters of the beta
+ a1 = ss*m11
+ b1 = ss*(1-m11)
+
+ # When the corresponding mean is out-of-range, sample from a beta with mass piled near the violated boundary
+ a1[m11<0] = 1
+ b1[m11<0] = 5
+
+ a1[m11>1] = 5
+ b1[m11>1] = 1
+
+ # Value of gamma00 implied by an exclusion restriction
+ gamest00 = ((phat_n[:,j] + phat_c[:,j])/phat_c[:,j])*phat_00[:,j] - phat_01[:,j]*phat_n[:,j]/phat_c[:,j]
+
+ # Identified region for gamma00
+ lower00 = np.maximum(0., ((phat_n[:,j] + phat_c[:,j])/phat_c[:,j])*phat_00[:,j] - phat_n[:,j]/phat_c[:,j])
+ upper00 = np.minimum(1., ((phat_n[:,j] + phat_c[:,j])/phat_c[:,j])*phat_00[:,j])
+
+ # Center a beta distribution at gamma00, but restricted to (lower00, upper00)
+ # do this by shifting and scaling the mean, drawing from a beta on (0,1), then shifting and scaling to the
+ # correct restricted interval
+ m00 = (gamest00 - lower00)/(upper00 - lower00)
+
+ a0 = ss*m00
+ b0 = ss*(1-m00)
+ a0[m00<0] = 1
+ b0[m00<0] = 5
+ a0[m00>1] = 5
+ b0[m00>1] = 1
+
+ # ITT and LATE
+ itt_c[:,j] = lower11 + (upper11 - lower11)*rng.beta(a=a1,b=b1,size=ngrid) - (lower00 + (upper00 - lower00)*rng.beta(a=a0,b=b0,size=ngrid))
+ late[:,j] = gamest11 - gamest00
+
+upperq = np.quantile(itt_c, q=0.975, axis=1)
+lowerq = np.quantile(itt_c, q=0.025, axis=1)
+upperq_er = np.quantile(late, q=0.975, axis=1)
+lowerq_er = np.quantile(late, q=0.025, axis=1)
+
And now we can plot all of this, shading posterior quantiles with pyplot's fill function.
plt.plot(xgrid, itt_c_true, color = "black")
+plt.ylim(-0.75, 0.05)
+plt.fill(np.append(xgrid, xgrid[::-1]), np.append(lowerq, upperq[::-1]), color = (0.5,0.5,0,0.25))
+plt.fill(np.append(xgrid, xgrid[::-1]), np.append(lowerq_er, upperq_er[::-1]), color = (0,0,0.5,0.25))
+
+itt_c_est = np.mean(itt_c, axis=1)
+late_est = np.mean(late, axis=1)
+
+plt.plot(xgrid, late_est, color = "darkgrey")
+plt.plot(xgrid, itt_c_est, color = "gold")
+plt.plot(xgrid, LATE_true0, color = "black", linestyle = (0, (2, 2)))
+plt.plot(xgrid, LATE_true1, color = "black", linestyle = (0, (4, 4)))
+plt.plot(xgrid, itt_c_true, color = "black")
+
+plt.show()
+
With a valid exclusion restriction the three black curves would all be the same. With no exclusion restriction, as we have here, the direct effect of $Z$ on $Y$ (the vaccine reminder on flu status) makes it so these three treatment effects are different. Specifically, the $ITT_c$ compares getting the vaccine and getting the reminder to not getting the vaccine and not getting the reminder. When both things have risk reducing impacts, we see a larger risk reduction over all values of $X$. Meanwhile, the two LATE effects compare the isolated impact of the vaccine among people that got the reminder and those that didn't, respectively. Here, not getting the reminder makes the vaccine more effective because the risk reduction is as a fraction of baseline risk, and the reminder reduces baseline risk in our DGP.
We see also that the posterior mean of the $ITT_c$ estimate (gold) is very similar to the posterior mean under the assumption of an exclusion restriction (gray). This is by design: they will only deviate due to Monte Carlo variation or in the rare situations where the exclusion restriction is incompatible with the identification interval.
By changing the sample size and various aspects of the DGP, this script allows us to build some intuition for how those choices affect posterior inferences, particularly how violations of the assumptions affect accuracy and posterior uncertainty.
+Albert, James H, and Siddhartha Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88 (422): 669–79.
+Hahn, P Richard, Jared S Murray, and Ioanna Manolopoulou. 2016. “A Bayesian Partial Identification Approach to Inferring the Prevalence of Accounting Misconduct.” Journal of the American Statistical Association 111 (513): 14–26.
+Hirano, Keisuke, Guido W. Imbens, Donald B. Rubin, and Xiao-Hua Zhou. 2000. “Assessing the Effect of an Influenza Vaccine in an Encouragement Design.” Biostatistics 1 (1): 69–88. https://doi.org/10.1093/biostatistics/1.1.69.
+Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
+McDonald, Clement J, Siu L Hui, and William M Tierney. 1992. “Effects of Computer Reminders for Influenza Vaccination on Morbidity During Influenza Epidemics.” MD Computing: Computers in Medical Practice 9 (5): 304–12.
+Papakostas, Demetrios, P Richard Hahn, Jared Murray, Frank Zhou, and Joseph Gerakos. 2023. “Do Forecasts of Bankruptcy Cause Bankruptcy? A Machine Learning Sensitivity Analysis.” The Annals of Applied Statistics 17 (1): 711–39.
+Richardson, Thomas S., Robin J. Evans, and James M. Robins. 2011. “Transparent Parametrizations of Models for Potential Outcomes.” In Bayesian Statistics 9. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199694587.003.0019.
We study conditional average treatment effect (CATE) estimation for regression discontinuity designs (RDD), in which treatment assignment is based on whether a particular covariate --- referred to as the running variable --- lies above or below a known value, referred to as the cutoff value. Because treatment is deterministically assigned as a known function of the running variable, RDDs are trivially deconfounded: treatment assignment is independent of the outcome variable, given the running variable (because treatment is conditionally constant). However, estimation of treatment effects in RDDs is more complicated than simply controlling for the running variable, because doing so introduces a complete lack of overlap, which is the other key condition needed to justify regression adjustment for causal inference. Nonetheless, the CATE at the cutoff, $X=c$, may still be identified provided the conditional expectation $E[Y \mid X,W]$ is continuous at that point for all $W=w$. We exploit this assumption with the leaf regression BART model implemented in stochtree, which allows us to define an explicit prior on the CATE. We now describe the RDD setup and our model in more detail, and provide code to implement our approach.
+We conceptualize the treatment effect estimation problem via a quartet of random variables $(Y, X, Z, U)$. The variable $Y$ is the outcome variable; the variable $X$ is the running variable; the variable $Z$ is the treatment assignment indicator variable; and the variable $U$ represents additional, possibly unobserved, causal factors. What specifically makes this correspond to an RDD is that we stipulate that $Z = I(X > c)$, for cutoff $c$. We assume $c = 0$ without loss of generality.
+The following figure depicts a causal diagram representing the assumed causal relationships between these variables. Two key features of this diagram are one, that $X$ blocks the impact of $U$ on $Z$: in other words, $X$ satisfies the back-door criterion for learning causal effects of $Z$ on $Y$. And two, $X$ and $U$ are not descendants of $Z$.
Using this causal diagram, we may express $Y$ as some function of its graph parents, the random variables $(X,Z,U)$: $$Y = F(X,Z,U).$$ In principle, we may obtain draws of $Y$ by first drawing $(X,Z,U)$ according to their joint distribution and then applying the function $F$. Similarly, we may relate this formulation to the potential outcomes framework straightforwardly:
\begin{equation}
\begin{split}
Y^1 &= F(X,1,U),\\
Y^0 &= F(X,0,U).
\end{split}
\end{equation}
Here, draws of $(Y^1, Y^0)$ may be obtained (in principle) by drawing $(X,Z,U)$ from their joint distribution and using only the $(X,U)$ elements as arguments in the above two equations, "discarding" the drawn value of $Z$. Note that this construction implies the consistency condition: $Y = Y^1 Z + Y^0 (1 - Z)$. Likewise, this construction implies the no-interference condition because each $Y_i$ is considered to be produced with arguments $(X_i, Z_i, U_i)$ and not those from other units $j$; in particular, in constructing $Y_i$, $F$ does not take $Z_j$ for $j \neq i$ as an argument.
Next, we define the following conditional expectations
\begin{equation}
\begin{split}
\mu_1(x) &= E[ F(x, 1, U) \mid X = x],\\
\mu_0(x) &= E[ F(x, 0, U) \mid X = x],
\end{split}
\end{equation}
with which we can define the treatment effect function
$$\tau(x) = \mu_1(x) - \mu_0(x).$$
Because $X$ satisfies the back-door criterion, $\mu_1$ and $\mu_0$ are estimable from the data, meaning that
\begin{equation}
\begin{split}
\mu_1(x) &= E[ F(x, 1, U) \mid X = x] = E[Y \mid X=x, Z=1],\\
\mu_0(x) &= E[ F(x, 0, U) \mid X = x] = E[Y \mid X=x, Z=0],
\end{split}
\end{equation}
the right-hand sides of which can be estimated from sample data, which we suppose to be independent and identically distributed realizations of $(Y_i, X_i, Z_i)$ for $i = 1, \dots, n$. However, because $Z = I(X > 0)$ we can in fact only learn $\mu_1(x)$ for $x > 0$ and $\mu_0(x)$ for $x < 0$. In potential outcomes terminology, conditioning on $X$ satisfies ignorability,
$$(Y^1, Y^0) \perp \!\!\! \perp Z \mid X,$$
but not strong ignorability, because overlap is violated. Overlap would require that
$$0 < P(Z = 1 \mid X=x) < 1 \quad \forall x,$$
which is clearly violated by the RDD assumption that $Z = I(X > 0)$. Consequently, the overall ATE, $\bar{\tau} = E(\tau(X))$, is unidentified, and we must content ourselves with estimating $\tau(0)$, the conditional average effect at the point $x = 0$, which we estimate as the difference $\mu_1(0) - \mu_0(0)$. This is possible for continuous $X$ so long as one is willing to assume that $\mu_1(x)$ and $\mu_0(x)$ are both suitably smooth functions of $x$: any inferred discontinuity at $x = 0$ must therefore be attributable to the treatment effect.
We are concerned with learning not only $\tau(0)$, the "RDD ATE" (i.e. the CATE at $x = 0$), but also RDD CATEs, $\tau(0, \mathrm{w})$, for some covariate vector $\mathrm{w}$. Incorporating additional covariates in the above framework turns out to be straightforward, simply by defining $W = \varphi(U)$ to be an observable function of the (possibly unobservable) causal factors $U$. We may then define our potential outcome means as
\begin{equation}
\begin{split}
\mu_1(x,\mathrm{w}) &= E[ F(x, 1, U) \mid X = x, W = \mathrm{w}] = E[Y \mid X=x, W=\mathrm{w}, Z=1],\\
\mu_0(x,\mathrm{w}) &= E[ F(x, 0, U) \mid X = x, W = \mathrm{w}] = E[Y \mid X=x, W=\mathrm{w}, Z=0],
\end{split}
\end{equation}
and our treatment effect function as
\begin{equation}
\tau(x,\mathrm{w}) = \mu_1(x,\mathrm{w}) - \mu_0(x,\mathrm{w}).
\end{equation}
We consider our data to be independent and identically distributed realizations $(Y_i, X_i, Z_i, W_i)$ for $i = 1, \dots, n$. Furthermore, we must assume that $\mu_1(x,\mathrm{w})$ and $\mu_0(x,\mathrm{w})$ are suitably smooth functions of $x$, *for every* $\mathrm{w}$; in other words, for each value of $\mathrm{w}$ the usual continuity-based identification assumptions must hold.
With this framework and notation established, CATE estimation in RDDs boils down to estimation of the conditional expectation functions $E[Y \mid X=x, W=\mathrm{w}, Z=z]$, for which we turn to BART models.
+We propose a BART model where the trees are allowed to split on $(x,\mathrm{w})$ but where each leaf node parameter is a vector of regression coefficients tailored to the RDD context (rather than a scalar constant as in default BART). In one sense, such a model can be seen as implying distinct RDD ATE regressions for each subgroup determined by a given tree; however, this intuition is only heuristic, as the entire model is fit jointly as an ensemble of such trees. Instead, we motivate this model as a way to estimate the necessary conditional expectations via a parametrization where the conditional treatment effect function can be explicitly regularized, as follows.
Let $\psi$ denote the following basis vector:
\begin{equation}
\psi(x,z) = \begin{bmatrix} 1 & z x & (1-z) x & z \end{bmatrix}.
\end{equation}
To generalize the original BART model, we define $g_j(x, \mathrm{w}, z)$ as a piecewise linear function as follows. Let $b_j(x, \mathrm{w})$ denote the node in the $j$th tree which contains the point $(x, \mathrm{w})$; then the prediction function for tree $j$ is defined to be
\begin{equation}
g_j(x, \mathrm{w}, z) = \psi(x, z) \Gamma_{b_j(x, \mathrm{w})}
\end{equation}
for a leaf-specific regression vector $\Gamma_{b_j} = (\eta_{b_j}, \lambda_{b_j}, \theta_{b_j}, \Delta_{b_j})^t$. Therefore, letting $n_{b_j}$ denote the number of data points allocated to node $b_j$ in the $j$th tree and $\Psi_{b_j}$ denote the $n_{b_j} \times 4$ matrix with rows equal to $\psi(x_i,z_i)$ for all $(x_i,z_i) \in b_j$, the model for observations assigned to leaf $b_j$ can be expressed in matrix notation as
\begin{equation}
\begin{split}
\mathbf{Y}_{b_j} \mid \Gamma_{b_j}, \sigma^2 &\sim \mathrm{N}(\Psi_{b_j} \Gamma_{b_j},\sigma^2),\\
\Gamma_{b_j} &\sim \mathrm{N}(0, \Sigma_0),
\end{split}
\end{equation}
where we set $\Sigma_0 = \frac{0.033}{J} \mathbf{I}$ as a default (for $x$ vectors standardized to have unit variance in-sample).
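Before moving on, here is a small sketch (our own helper, not part of stochtree) of how the basis $\psi(x,z)$ is assembled as a matrix with one row per observation, matching the column order $\bigl(1,\; zx,\; (1-z)x,\; z\bigr)$ used in the application code below.
import numpy as np

def rdd_leaf_basis(x, z):
    # psi(x, z) = [1, z*x, (1-z)*x, z], stacked row-wise over the sample
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.column_stack([np.ones_like(x), z * x, (1.0 - z) * x, z])

# At the cutoff, psi(0, 1) - psi(0, 0) = (0, 0, 0, 1), so the treatment effect
# prediction is carried entirely by the last leaf coefficient, Delta.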
This choice of basis entails that the RDD CATE at $\mathrm{w}$, $\tau(0, \mathrm{w})$, is a sum of the $\Delta_{b_j(0, \mathrm{w})}$ elements across all trees $j = 1, \dots, J$:
\begin{equation}
\begin{split}
\tau(0, \mathrm{w}) &= E[Y^1 \mid X=0, W = \mathrm{w}] - E[Y^0 \mid X = 0, W = \mathrm{w}]\\
&= E[Y \mid X=0, W = \mathrm{w}, Z = 1] - E[Y \mid X = 0, W = \mathrm{w}, Z = 0]\\
&= \sum_{j = 1}^J g_j(0, \mathrm{w}, 1) - \sum_{j = 1}^J g_j(0, \mathrm{w}, 0)\\
&= \sum_{j = 1}^J \psi(0, 1) \Gamma_{b_j(0, \mathrm{w})} - \sum_{j = 1}^J \psi(0, 0) \Gamma_{b_j(0, \mathrm{w})} \\
&= \sum_{j = 1}^J \Bigl( \psi(0, 1) - \psi(0, 0) \Bigr) \Gamma_{b_j(0, \mathrm{w})} \\
&= \sum_{j = 1}^J \Bigl( (1,0,0,1) - (1,0,0,0) \Bigr) \Gamma_{b_j(0, \mathrm{w})} \\
&= \sum_{j=1}^J \Delta_{b_j(0, \mathrm{w})}.
\end{split}
\end{equation}
As a result, the priors on the $\Delta$ coefficients directly regularize the treatment effect. We set the tree and error variance priors as in the original BART model.
+The following figures provide a graphical depiction of how the BARDDT model fits a response surface and thereby estimates CATEs for distinct values of $\mathrm{w}$. For simplicity only two trees are used in the illustration, while in practice dozens or hundreds of trees may be used (in our simulations and empirical example, we use 150 trees).
+An interesting property of BARDDT can be seen in this small illustration --- by letting the regression trees split on the running variable, there is no need to separately define a 'bandwidth' as is used in the polynomial approach to RDD. Instead, the regression trees automatically determine (in the course of posterior sampling) when to 'prune' away regions away from the cutoff value. There are two notable features of this approach. One, different trees in the ensemble are effectively using different local bandwidths and these fits are then blended together. For example, in the bottom panel of the second figure, we obtain one bandwidth for the region $d+i$, and a different one for regions $a+g$ and $d+g$. Two, for cells in the tree partition that do not span the cutoff, the regression within that partition contains no causal contrasts --- all observations either have $Z = 1$ or $Z = 0$. For those cells, the treatment effect coefficient is ill-posed and in those cases the posterior sampling is effectively a draw from the prior; however, such draws correspond to points where the treatment effect is unidentified and none of these draws contribute to the estimation of $\tau(0, \mathrm{w})$ --- for example, only nodes $a+g$, $d+g$, and $d+i$ provide any contribution. This implies that draws of $\Delta$ corresponding to nodes not predicting at $X=0$ will always be draws from the prior, which has some intuitive appeal.
In this section, we provide code for implementing our model in stochtree on a popular RDD dataset. First, let us load stochtree and all the necessary libraries for our posterior analysis.
import matplotlib.pyplot as plt
+import seaborn as sns
+import numpy as np
+import pandas as pd
+from sklearn.tree import DecisionTreeRegressor, plot_tree
+from stochtree import BARTModel
+
The data comes from Lindo et al (2010), who analyze data on college students enrolled in a large Canadian university in order to evaluate the effectiveness of an academic probation policy. Students who present a grade point average (GPA) lower than a certain threshold at the end of each term are placed on academic probation and must improve their GPA in the subsequent term or else face suspension. We are interested in how being put on probation or not, $Z$, affects students' GPA, $Y$, at the end of the current term. The running variable, $X$, is the negative distance between a student's previous-term GPA and the probation threshold, so that students placed on probation ($Z = 1$) have a positive score and the cutoff is 0. Potential moderators, $W$, are:
gender (male), age at entry (age_at_entry), whether the student was born in North America (bpl_north_america), total credits taken in the first year (totcredits_year1), campus of enrollment (loc_campus1, 2 and 3), and high school grade percentile (hsgrade_pct).
# Load and organize data
+data = pd.read_csv("https://raw.githubusercontent.com/rdpackages-replication/CIT_2024_CUP/refs/heads/main/CIT_2024_CUP_discrete.csv")
+
y = data.loc[:,"nextGPA"].to_numpy()
+x = data.loc[:,"X"].to_numpy()
+
x = x/np.std(x)
+w = data.iloc[:,3:11]
+
ordered_cat = pd.api.types.CategoricalDtype(ordered=True)
+unordered_cat = pd.api.types.CategoricalDtype(ordered=False)
+w.loc[:,"totcredits_year1"] = w.loc[:,"totcredits_year1"].astype(ordered_cat)
+w.loc[:,"male"] = w.loc[:,"male"].astype(unordered_cat)
+w.loc[:,"bpl_north_america"] = w.loc[:,"bpl_north_america"].astype(unordered_cat)
+w.loc[:,"loc_campus1"] = w.loc[:,"loc_campus1"].astype(unordered_cat)
+w.loc[:,"loc_campus2"] = w.loc[:,"loc_campus2"].astype(unordered_cat)
+w.loc[:,"loc_campus3"] = w.loc[:,"loc_campus3"].astype(unordered_cat)
+c = 0
+n = data.shape[0]
+z = np.where(x > c, 1.0, 0.0)
+# Window for prediction sample
+h = 0.1
+test = (x > -h) & (x < h)
+ntest = np.sum(np.where(test, 1, 0))
+
Generically, our estimand is the CATE function at $x = 0$; i.e. $\tau(0, \mathrm{w})$. The key practical question is which values of $\mathrm{w}$ to consider. Some values of $\mathrm{w}$ will not be well-represented near $x=0$ and so no estimation technique will be able to estimate those points effectively. As such, to focus on feasible points --- which will lead to interesting comparisons between methods --- we recommend restricting the evaluation points to the observed $\mathrm{w}_i$ such that $|x_i| \leq \delta$, for some $\delta > 0$. In our example, we use $\delta = 0.1$ for a standardized $x$ variable. Therefore, our estimand of interest is a vector of treatment effects: +$$\tau(0, \mathrm{w}_i) \;\;\; \forall i \;\text{ such that }\; |x_i| \leq \delta$$
+In order to implement our model, we write the Psi vector, as defined before: Psi = np.column_stack([np.ones(n), z * x, (1 - z) * x, z])
. The training matrix for the model is np.column_stack([x, w])
, which we feed into the BARTModel
sampler via the X_train
parameter. The basis vector Psi
is fed into the function via the leaf_basis_train
parameter. The parameter list barddt_mean_params
defines options for the mean forest (a different list can be defined for a variance forest in the case of heteroscedastic BART, which we do not consider here). Importantly, in this list we define parameter sigma2_leaf_init = np.diag(np.repeat(0.1/150, 4))
, which sets $\Sigma_0$ as described above. Now, we can fit the model, which is saved in object barddt_model
.
Once the model is fit, we need 3 elements to obtain the CATE predictions: the basis vectors at the cutoff for $z=1$ and $z=0$, the test matrix $[X \quad W]$ at the cutoff, and the testing sample. We define the prediction basis vectors $\psi_1 = [1 \quad 0 \quad 0 \quad 1]$ and $\psi_0 = [1 \quad 0 \quad 0 \quad 0]$, which correspond to $\psi$ at $(x=0,z=1)$, and $(x=0,z=0)$, respectively. These vectors are written into Python as Psi1 = np.column_stack([np.ones(n), np.repeat(c, n), np.zeros(n), np.ones(n)])
and Psi0 = np.column_stack([np.ones(n), np.zeros(n), np.repeat(c, n), np.zeros(n)])
. Then, we write the test matrix at $(x=0,\mathrm{w})$ as xmat_test = np.column_stack([np.zeros(n), w])[test,:]
. Finally, we must define the testing window. As discussed previously, our window is set such that $|x| \leq 0.1$, which can be set in Python as test = (x > -h) & (x < h)
.
Once all of these elements are set, we can obtain the outcome predictions at the cutoff by running barddt_model.predict(xmat_test, Psi1)
(resp. barddt_model.predict(xmat_test, Psi0)
). Each of these calls returns a list, from which we can extract element y_hat
to obtain the posterior distribution for the outcome. In the code below, the treated and control outcome predictions are saved in the matrix objects pred1
and pred0
, respectively. Now, we can obtain draws from the CATE posterior by simply subtracting these matrices. The function below outlines how to perform each of these steps in Python.
def estimate_barddt(y,x,w,z,test,c,num_gfr=10,num_mcmc=100,seed=None):
+ ## Lists of parameters for the Stochtree BART function
+ barddt_global_params = {
+ "standardize": True,
+ "sample_sigma_global": True,
+ "sigma2_global_init": 0.1
+ }
+ if seed is not None:
+ barddt_global_params["random_seed"] = seed
+ barddt_mean_params = {
+ "num_trees": 50,
+ "min_samples_leaf": 20,
+ "alpha": 0.95,
+ "beta": 2,
+ "max_depth": 20,
+ "sample_sigma2_leaf": False,
+ "sigma2_leaf_init": np.diag(np.repeat(0.1/150, 4))
+ }
+ ## Set basis vector for leaf regressions
+ n = y.shape[0]
+ Psi = np.column_stack([np.ones(n), z * x, (1 - z) * x, z])
+ covariates = np.column_stack([x, w])
+ ## Model fit
+ barddt_model = BARTModel()
+ barddt_model.sample(
+ X_train=covariates,
+ y_train=y,
+ leaf_basis_train=Psi,
+ num_gfr=num_gfr,
+ num_mcmc=num_mcmc,
+ general_params=barddt_global_params,
+ mean_forest_params=barddt_mean_params
+ )
+ ## Define basis vectors and test matrix for outcome predictions at X=c
+ Psi1 = np.column_stack([np.ones(n), np.repeat(c, n), np.zeros(n), np.ones(n)])
+ Psi0 = np.column_stack([np.ones(n), np.zeros(n), np.repeat(c, n), np.zeros(n)])
+ Psi1 = Psi1[test,:]
+ Psi0 = Psi0[test,:]
+ xmat_test = np.column_stack([np.zeros(n), w])[test,:]
+ ## Obtain outcome predictions
+ pred1 = barddt_model.predict(xmat_test, Psi1)
+ pred0 = barddt_model.predict(xmat_test, Psi0)
+ ## Obtain CATE posterior
+ return pred1 - pred0
+
Now, we proceed to fit the BARDDT model.
+num_chains = 4
+num_gfr = 2
+num_mcmc = 100
+cate_result = np.empty((ntest, num_chains*num_mcmc))
+for i in range(num_chains):
+ cate_rdd = estimate_barddt(y,x,w,z,test,c,num_gfr=num_gfr,num_mcmc=num_mcmc,seed=i)
+ cate_result[:,(i*num_mcmc):((i+1)*num_mcmc)] = cate_rdd
+
We now proceed to analyze the CATE posterior. The figure produced below presents a summary of the CATE posterior produced by BARDDT for this application. This picture is produced by fitting a regression tree, using $W$ as the predictors, to the individual posterior mean CATEs:
\begin{equation}
\bar{\tau}_i = \frac{1}{M} \sum_{h = 1}^M \tau^{(h)}(0, \mathrm{w}_i),
\end{equation}
where $h$ indexes each of $M$ total posterior samples. As in our simulation studies, we restrict our posterior analysis to the $\mathrm{w}_i$ values of observations with $|x_i| \leq \delta = 0.1$ (after normalizing $X$ to have standard deviation 1 in-sample). For the Lindo et al (2010) data, this means that BARDDT was trained on $n = 40{,}582$ observations, of which 1,602 satisfy $|x_i| \leq 0.1$; these were used to generate the effect moderation tree.
## Fit a regression tree to the posterior mean CATEs
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt  # if not already imported earlier in the vignette

y_surrogate = np.mean(cate_result, axis=1)  # posterior mean CATE for each test-window observation
X_surrogate = w.iloc[test,:]                # moderators for the test-window observations
cate_surrogate = DecisionTreeRegressor(min_impurity_decrease=0.0001)
cate_surrogate.fit(X=X_surrogate, y=y_surrogate)
plot_tree(cate_surrogate, impurity=False, filled=True, feature_names=w.columns, proportion=False, label='root', node_ids=True)
plt.show()
The resulting effect moderation tree indicates that course load (credits attempted) in the academic term leading to their probation is a strong moderator. Contextually, this result is plausible, both because course load could relate to latent character attributes that influence a student's responsiveness to sanctions and also because it could predict course load in the current term, which would in turn have implications for the GPA (i.e. it is harder to get a high GPA while taking more credit hours). The tree also suggests that effects differ by age and gender of the student. These findings are all prima facie plausible as well.
+To gauge how strong these findings are statistically, we can zoom in on isolated subgroups and compare the posteriors of their subgroup average treatment effects. This approach is valid because in fitting the effect moderation tree to the posterior mean CATEs we in no way altered the posterior itself; the effect moderation tree is a posterior summary tool and not any additional inferential approach; the posterior is obtained once and can be explored freely using a variety of techniques without vitiating its statistical validity. Investigating the most extreme differences is a good place to start: consider the two groups of students at opposite ends of the treatment effect range discovered by the effect moderation tree:
Subgroup CATEs are obtained by aggregating CATEs across the observed $\mathrm{w}_i$ values for individuals in each group; this can be done for each individual posterior sample, yielding a posterior distribution over the subgroup CATE:
\begin{equation}
\bar{\tau}_A^{(h)} = \frac{1}{n_A} \sum_{i \in A} \tau^{(h)}(0, \mathrm{w}_i),
\end{equation}
where $h$ indexes a posterior draw and $n_A$ denotes the number of individuals in group A.
The code below produces a contour plot of a bivariate kernel density estimate of the joint CATE posterior distribution for subgroups A and B. The contour lines are nearly all above the $45^{\circ}$ line, indicating that the preponderance of posterior probability falls in the region where the treatment effect for Group A is greater than that of Group B. In other words, the difference in subgroup treatment effects flagged by the effect moderation tree persists even after accounting for estimation uncertainty in the underlying CATE function.
## Contour plot of the joint CATE posterior for subgroups A and B
import pandas as pd    # if not already imported earlier in the vignette
import seaborn as sns  # if not already imported earlier in the vignette

predicted_nodes = cate_surrogate.apply(X=X_surrogate)  # leaf node id for each test-window observation
posterior_group_a = np.mean(cate_result[predicted_nodes==2,:], axis=0)
posterior_group_b = np.mean(cate_result[predicted_nodes==6,:], axis=0)
posterior_df = pd.DataFrame({'group_a': posterior_group_a, 'group_b': posterior_group_b})
sns.kdeplot(data=posterior_df, x="group_b", y="group_a")
plt.axline((0, 0), slope=1, color="black", linestyle=(0, (3, 3)))
plt.show()
As always, CATEs that vary with observable factors do not necessarily represent a causal moderating relationship. Here, if the treatment effect of academic probation is seen to vary with the number of credits, that does not imply that this association is causal: prescribing students to take a certain number of credits will not necessarily lead to a more effective probation policy; it may simply be that the type of student who naturally enrolls for fewer credit hours is more likely to be responsive to academic probation. An entirely distinct set of causal assumptions is required to interpret the CATE variations themselves as causal. All the same, uncovering these patterns of treatment effect variability is crucial for suggesting causal mechanisms to be investigated in future studies.
Lindo, Jason M., Nicholas J. Sanders, and Philip Oreopoulos. "Ability, Gender, and Performance Standards: Evidence from Academic Probation." American Economic Journal: Applied Economics 2, no. 2 (2010): 95-117.
The causal directed acyclic graph (CDAG) for the instrumental variables flu example.
+The biggest question about this graph concerns the dashed red arrow +from the putative instrument \(Z\) to +the outcome (flu). We say “putative” because if that dashed red arrow is +there, then technically \(Z\) is not a +valid instrument. The assumption/assertion that there is no dashed red +arrow is called the “exclusion restriction”. In this vignette, we will +explore what sorts of inferences are possible if we remain agnostic +about the presence or absence of that dashed red arrow.
There are two relevant potential outcomes in an instrumental variables analysis, corresponding to the causal effect of the instrument on the treatment and the causal effect of the treatment on the outcome. In this example, that is the effect of the reminder/encouragement on vaccine status and the effect of the vaccine itself on the flu. The notation is \(V(Z)\) and \(Y(V(Z),Z)\) respectively, so that we have six distinct random variables: \(V(0)\), \(V(1)\), \(Y(0,0)\), \(Y(1,0)\), \(Y(0,1)\) and \(Y(1,1)\). The problem, sometimes called the fundamental problem of causal inference, is that some of these random variables can never be seen simultaneously; they are observationally mutually exclusive. For this reason, it may be helpful to think about causal inference as a missing data problem, as depicted in the following table.
| \(i\) | \(Z_i\) | \(V_i(0)\) | \(V_i(1)\) | \(Y_i(0,0)\) | \(Y_i(1,0)\) | \(Y_i(0,1)\) | \(Y_i(1,1)\) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | ? | 1 | ? | ? | ? | 0 |
| 2 | 0 | 1 | ? | ? | 1 | ? | ? |
| 3 | 0 | 0 | ? | 1 | ? | ? | ? |
| 4 | 1 | ? | 0 | ? | ? | 0 | ? |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
Likewise, with this notation we can formally define the principal +strata:
| \(V_i(0)\) | \(V_i(1)\) | \(S_i\) |
|---|---|---|
| 0 | 0 | Never Taker (n) |
| 1 | 1 | Always Taker (a) |
| 0 | 1 | Complier (c) |
| 1 | 0 | Defier (d) |
Let \(\pi_s(x)\) denote the +conditional (on \(x\)) probability that +an individual belongs to principal stratum \(s\): \[\begin{equation} +\pi_s(x)=\operatorname{Pr}(S=s \mid X=x), +\end{equation}\] and let \(\gamma_s^{v +z}(x)\) denote the potential outcome probability for given values +\(v\) and \(z\): \[\begin{equation} +\gamma_s^{v z}(x)=\operatorname{Pr}(Y(v, z)=1 \mid S=s, X=x). +\end{equation}\]
Various estimands of interest may be expressed in terms of the functions \(\gamma_c^{vz}(x)\). In particular, the complier conditional average treatment effect \[\gamma_c^{1,z}(x) - \gamma_c^{0,z}(x)\] is the ultimate goal (for either \(z=0\) or \(z=1\)). Under an exclusion restriction, we would have \(\gamma_s^{vz}(x) = \gamma_s^{v}(x)\) and the reminder status \(z\) itself would not matter. In that case, we can estimate \[\gamma_c^{1,z}(x) - \gamma_c^{0,z}(x)\] and \[\gamma_c^{1,1}(x) - \gamma_c^{0,0}(x).\] This latter quantity is called the complier intent-to-treat effect, or \(ITT_c\), and it can be partially identified even if the exclusion restriction is violated, as follows.
+The left-hand side of the following system of equations are all +estimable quantities that can be learned from observable data, while the +right hand side expressions involve the unknown functions of interest, +\(\gamma_s^{vz}(x)\):
+\[\begin{equation} +\begin{aligned} +p_{1 \mid 00}(x) = \operatorname{Pr}(Y=1 \mid V=0, Z=0, +X=x)=\frac{\pi_c(x)}{\pi_c(x)+\pi_n(x)} +\gamma_c^{00}(x)+\frac{\pi_n(x)}{\pi_c(x)+\pi_n(x)} \gamma_n^{00}(x) \\ +p_{1 \mid 11}(x) =\operatorname{Pr}(Y=1 \mid V=1, Z=1, +X=x)=\frac{\pi_c(x)}{\pi_c(x)+\pi_a(x)} +\gamma_c^{11}(x)+\frac{\pi_a(x)}{\pi_c(x)+\pi_a(x)} \gamma_a^{11}(x) \\ +p_{1 \mid 01}(x) =\operatorname{Pr}(Y=1 \mid V=0, Z=1, +X=x)=\frac{\pi_d(x)}{\pi_d(x)+\pi_n(x)} +\gamma_d^{01}(x)+\frac{\pi_n(x)}{\pi_d(x)+\pi_n(x)} \gamma_n^{01}(x) \\ +p_{1 \mid 10}(x) =\operatorname{Pr}(Y=1 \mid V=1, Z=0, +X=x)=\frac{\pi_d(x)}{\pi_d(x)+\pi_a(x)} +\gamma_d^{10}(x)+\frac{\pi_a(x)}{\pi_d(x)+\pi_a(x)} \gamma_a^{10}(x) +\end{aligned} +\end{equation}\]
+Furthermore, we have \[\begin{equation} +\begin{aligned} +\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x)+\pi_d(x)\\ +\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x) +\end{aligned} +\end{equation}\]
Under the monotonicity assumption, \(\pi_d(x) = 0\) and these expressions simplify somewhat: \[\begin{equation}
\begin{aligned}
p_{1 \mid 00}(x)&=\frac{\pi_c(x)}{\pi_c(x)+\pi_n(x)} \gamma_c^{00}(x)+\frac{\pi_n(x)}{\pi_c(x)+\pi_n(x)} \gamma_n^{00}(x) \\
p_{1 \mid 11}(x)&=\frac{\pi_c(x)}{\pi_c(x)+\pi_a(x)} \gamma_c^{11}(x)+\frac{\pi_a(x)}{\pi_c(x)+\pi_a(x)} \gamma_a^{11}(x) \\
p_{1 \mid 01}(x)&=\gamma_n^{01}(x) \\
p_{1 \mid 10}(x)&=\gamma_a^{10}(x)
\end{aligned}
\end{equation}\] and \[\begin{equation}
\begin{aligned}
\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x)\\
\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x)
\end{aligned}
\end{equation}\]
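In particular, under no defiers the strata proportions can be read off directly from the two treatment-assignment probabilities above. A minimal sketch of that arithmetic in R, where p_v1_z0 and p_v1_z1 are hypothetical plug-in values standing in for the two left-hand sides (not estimates from any model):

# Hypothetical plug-in values for the two treatment-assignment probabilities at a given x
p_v1_z0 <- 0.30  # Pr(V = 1 | Z = 0, X = x)
p_v1_z1 <- 0.55  # Pr(V = 1 | Z = 1, X = x)

# Under no defiers, the strata proportions follow immediately
pi_a <- p_v1_z0              # always takers
pi_c <- p_v1_z1 - p_v1_z0    # compliers
pi_n <- 1 - p_v1_z1          # never takers
c(pi_a = pi_a, pi_c = pi_c, pi_n = pi_n)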
The exclusion restriction would dictate that \(\gamma_s^{01}(x) = \gamma_s^{00}(x)\) and \(\gamma_s^{11}(x) = \gamma_s^{10}(x)\) for all \(s\). This has two implications. One, \(\gamma_n^{01}(x) = \gamma_n^{00}(x)\) and \(\gamma_a^{10}(x) = \gamma_a^{11}(x)\), and because the left-hand terms are identified, this permits \(\gamma_c^{11}(x)\) and \(\gamma_c^{00}(x)\) to be solved for by substitution. Two, with these two quantities solved for, we also have the two other quantities (the different settings of \(z\)), since \(\gamma_c^{11}(x) = \gamma_c^{10}(x)\) and \(\gamma_c^{00}(x) = \gamma_c^{01}(x)\). Consequently, both of our estimands from above can be estimated:
+\[\gamma_c^{11}(x) - +\gamma_c^{01}(x)\] and
+\[\gamma_c^{10}(x) - +\gamma_c^{00}(x)\] because they are both (supposing the exclusion +restriction holds) the same as
+\[\gamma_c^{11}(x) - +\gamma_c^{00}(x).\] If the exclusion restriction does +not hold, then the three above treatment effects are all +(potentially) distinct and not much can be said about the former two. +The latter one, the \(ITT_c\), however, +can be partially identified, by recognizing that the first two equations +(in our four equation system) provide non-trivial bounds based on the +fact that while \(\gamma_c^{11}(x)\) +and \(\gamma_c^{00}(x)\) are no longer +identified, as probabilities both must lie between 0 and 1. Thus,
+\[\begin{equation} +\begin{aligned} + \max\left( + 0, \frac{\pi_c(x)+\pi_n(x)}{\pi_c(x)}p_{1\mid 00}(x) - +\frac{\pi_n(x)}{\pi_c(x)} + \right) +&\leq\gamma^{00}_c(x)\leq + \min\left( + 1, \frac{\pi_c(x)+\pi_n(x)}{\pi_c(x)}p_{1\mid 00}(x) + \right)\\\\ +% +\max\left( + 0, \frac{\pi_a(x)+\pi_c(x)}{\pi_c(x)}p_{1\mid 11}(x) - +\frac{\pi_a(x)}{\pi_c(x)} +\right) +&\leq\gamma^{11}_c(x)\leq +\min\left( + 1, \frac{\pi_a(x)+\pi_c(x)}{\pi_c(x)}p_{1\mid 11}(x) +\right) +\end{aligned} +\end{equation}\]
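For reference, carrying out the substitution described above yields the point-identified values implied by the exclusion restriction (these are the quantities computed as gamest11 and gamest00 in the code later in this vignette):

\[\begin{equation}
\begin{aligned}
\gamma_c^{11}(x) &= \frac{\pi_a(x)+\pi_c(x)}{\pi_c(x)}\,p_{1 \mid 11}(x)-\frac{\pi_a(x)}{\pi_c(x)}\,p_{1 \mid 10}(x),\\
\gamma_c^{00}(x) &= \frac{\pi_c(x)+\pi_n(x)}{\pi_c(x)}\,p_{1 \mid 00}(x)-\frac{\pi_n(x)}{\pi_c(x)}\,p_{1 \mid 01}(x).
\end{aligned}
\end{equation}\]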
The point of all this is that the data (plus a no-defiers assumption) lets us estimate all the necessary inputs to these upper and lower bounds on \(\gamma^{11}_c(x)\) and \(\gamma^{00}_c(x)\), which in turn define our estimand. What remains is to estimate those inputs, as functions of \(x\), and to do so while enforcing the monotonicity restriction \[\operatorname{Pr}(V=1 \mid Z=0, X=x)=\pi_a(x) \leq \operatorname{Pr}(V=1 \mid Z=1, X=x)=\pi_a(x)+\pi_c(x).\]
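To make the bound computation concrete, here is a minimal numerical sketch in R; the values of pi_a, pi_c, pi_n, p1_00 and p1_11 are made-up plug-in numbers for a single x, not outputs of any model fit:

# Made-up plug-in values at a single x (for illustration only)
pi_a <- 0.2; pi_c <- 0.5; pi_n <- 0.3
p1_00 <- 0.4  # Pr(Y = 1 | V = 0, Z = 0, X = x)
p1_11 <- 0.2  # Pr(Y = 1 | V = 1, Z = 1, X = x)

# Bounds on gamma_c^{00}(x) and gamma_c^{11}(x), which must lie in [0, 1]
lower00 <- max(0, ((pi_c + pi_n)/pi_c)*p1_00 - pi_n/pi_c)
upper00 <- min(1, ((pi_c + pi_n)/pi_c)*p1_00)
lower11 <- max(0, ((pi_a + pi_c)/pi_c)*p1_11 - pi_a/pi_c)
upper11 <- min(1, ((pi_a + pi_c)/pi_c)*p1_11)

# The ITT_c = gamma_c^{11}(x) - gamma_c^{00}(x) is then partially identified
c(lower = lower11 - upper00, upper = upper11 - lower00)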
+We can do all of this with calls to stochtree from R (or Python). But +first, let’s generate some test data.
+Start with some initial setup / housekeeping
+library(stochtree)
+
+# size of the training sample
+n <- 20000
+
+# To set the seed for reproducibility/illustration purposes, replace "NULL" with a positive integer
+random_seed <- NULL
+First, we generate the instrument exogenously
+z <- rbinom(n, 1, 0.5)
+Next, we generate the covariate. (For this example, let’s think of it +as patient age, although we are generating it from a uniform +distribution between 0 and 3, so you have to imagine that it has been +pre-standardized to this scale. It keeps the DGPs cleaner for +illustration purposes.)
+p_X <- 1
+X <- matrix(runif(n*p_X, 0, 3), ncol = p_X)
+x <- X[,1] # for ease of reference later
+Next, we generate the principal strata \(S\) based on the observed value of \(X\). We generate it according to a logistic +regression with two coefficients per strata, an intercept and a slope. +Here, these coefficients are set so that the probability of being a +never taker decreases with age.
+alpha_a <- 0
+beta_a <- 1
+
+alpha_n <- 1
+beta_n <- -1
+
+alpha_c <- 1
+beta_c <- 1
+
+# define function (a logistic model) to generate Pr(S = s | X = x)
+pi_s <- function(xval){
+
+ w_a <- exp(alpha_a + beta_a*xval)
+ w_n <- exp(alpha_n + beta_n*xval)
+ w_c <- exp(alpha_c + beta_c*xval)
+
+ w <- cbind(w_a, w_n, w_c)
+ colnames(w) <- c("w_a","w_n","w_c")
+ w <- w/rowSums(w)
+
+ return(w)
+
+}
+s <- sapply(1:n, function(j) sample(c("a","n","c"), 1, prob = pi_s(X[j,1])))
+Next, we generate the treatment variable, here denoted \(V\) (for “vaccine”), as a +deterministic function of \(S\) and \(Z\); this is what gives the principal +strata their meaning.
+v <- 1*(s=="a") + 0*(s=="n") + z*(s=="c") + (1-z)*(s == "d")
+Finally, the outcome structural model is specified, based on which +the outcome is sampled. By varying this function in particular ways, we +can alter the identification conditions.
+gamfun <- function(xval,vval, zval,sval){
+
+ # if this function depends on zval, then exclusion restriction is violated
+ # if this function does not depend on sval, then IV analysis wasn't necessary
+ # if this function does not depend on x, then there are no HTEs
+
+ baseline <- pnorm(2 -1*xval - 2.5*(xval-1.5)^2 - 0.5*zval + 1*(sval=="n") - 1*(sval=="a") )
+ prob <- baseline - 0.5*vval*baseline # 0.5*vval*baseline
+
+ return(prob)
+}
+
+# Generate the observed outcome
+y <- rbinom(n, 1, gamfun(X[,1],v,z,s))
+Lastly, we perform some organization for our supervised learning +algorithms later on.
+# Concatenate X, v and z for our supervised learning algorithms
+Xall <- cbind(X,v,z)
+
+# update the size of "X" to be the size of Xall
+p_X <- p_X + 2
+
+# For the monotone probit model it is necessary to sort the observations so that the Z=1 cases are all together
+# at the start of the outcome vector.
+index <- sort(z,decreasing = TRUE, index.return = TRUE)
+
+X <- matrix(X[index$ix,],ncol= 1)
+Xall <- Xall[index$ix,]
+z <- z[index$ix]
+v <- v[index$ix]
+s <- s[index$ix]
+y <- y[index$ix]
+x <- x[index$ix]
+Now let’s see if we can recover these functions from the observed +data.
+We have to fit three models here, the treatment models: \(\operatorname{Pr}(V = 1 | Z = 1, X=x)\) and +\(\operatorname{Pr}(V = 1 | Z = 0,X = +x)\), subject to the monotonicity constraint \(\operatorname{Pr}(V = 1 | Z = 1, X=x) \geq +\operatorname{Pr}(V = 1 | Z = 0,X = x)\), and an outcome model +\(\operatorname{Pr}(Y = 1 | Z = 1, V = 1, X = +x)\). All of this will be done with stochtree.
The outcome model is fit with a single (S-learner) BART model. This part of the model could instead be fit as a T-learner or as a BCF model; here we use an S-learner for simplicity. Both models are probit models, and use the well-known Albert and Chib (1993) data augmentation Gibbs sampler. This section covers the more straightforward outcome model. The next section describes how the monotonicity constraint is handled with a data augmentation Gibbs sampler.
+These models could (and probably should) be wrapped as functions. +Here they are implemented as scripts, with the full loops shown. The +output – at the end of the loops – are stochtree forest objects from +which we can extract posterior samples and generate predictions. In +particular, the \(ITT_c\) will be +constructed using posterior counterfactual predictions derived from +these forest objects.
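Before wading into the full script, here is a minimal standalone sketch of the Albert and Chib (1993) truncated-normal draw used inside the sampling loops below; y_demo and eta_demo are small made-up stand-ins for the observed binary outcome and the current forest predictions, not objects from the model fit.

# Minimal sketch of the Albert and Chib latent-utility draw via inverse-CDF
# sampling from a normal with mean eta and variance 1, truncated by the sign of y
set.seed(1)
y_demo   <- rbinom(10, 1, 0.5)
eta_demo <- rnorm(10, 0, 0.5)
zed_demo <- numeric(10)

# y = 1: truncate to (0, Inf)
u1 <- runif(sum(y_demo == 1), pnorm(0, eta_demo[y_demo == 1], 1), 1)
zed_demo[y_demo == 1] <- qnorm(u1, eta_demo[y_demo == 1], 1)

# y = 0: truncate to (-Inf, 0)
u0 <- runif(sum(y_demo == 0), 0, pnorm(0, eta_demo[y_demo == 0], 1))
zed_demo[y_demo == 0] <- qnorm(u0, eta_demo[y_demo == 0], 1)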
+We begin by setting a bunch of hyperparameters and instantiating the +forest objects to be operated upon in the main sampling loop. We also +initialize the latent variables.
+# Fit the BART model for Pr(Y = 1 | Z = 1, V = 1, X = x)
+
+# Set number of iterations
+num_warmstart <- 10
+num_mcmc <- 1000
+num_samples <- num_warmstart + num_mcmc
+
+# Set a bunch of hyperparameters. These are ballpark default values.
+alpha <- 0.95
+beta <- 2
+min_samples_leaf <- 1
+max_depth <- 20
+num_trees <- 50
+cutpoint_grid_size = 100
+global_variance_init = 1.
+tau_init = 0.5
+leaf_prior_scale = matrix(c(tau_init), ncol = 1)
+a_leaf <- 2.
+b_leaf <- 0.5
+leaf_regression <- F
+feature_types <- as.integer(c(rep(0, p_X-2),1,1)) # 0 = numeric
+var_weights <- rep(1,p_X)/p_X
+outcome_model_type <- 0
+
+# C++ dataset
+forest_dataset <- createForestDataset(Xall)
+
+# Random number generator (std::mt19937)
+if (is.null(random_seed)) {
+ rng <- createCppRNG(-1)
+} else {
+ rng <- createCppRNG(random_seed)
+}
+
+# Sampling data structures
+forest_model_config <- createForestModelConfig(
+ feature_types = feature_types, num_trees = num_trees, num_features = p_X,
+ num_observations = n, variable_weights = var_weights, leaf_dimension = 1,
+ alpha = alpha, beta = beta, min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth, leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale, cutpoint_grid_size = cutpoint_grid_size
+)
+global_model_config <- createGlobalModelConfig(global_error_variance = 1)
+forest_model <- createForestModel(forest_dataset, forest_model_config, global_model_config)
+
+# Container of forest samples
+forest_samples <- createForestSamples(num_trees, 1, T, F)
+
+# "Active" forest state
+active_forest <- createForest(num_trees, 1, T, F)
+
+# Initialize the latent outcome zed
+n1 <- sum(y)
+zed <- 0.25*(2*as.numeric(y) - 1)
+
+# C++ outcome variable
+outcome <- createOutcome(zed)
+
+# Initialize the active forest and subtract each root tree's predictions from outcome
+active_forest$prepare_for_sampler(forest_dataset, outcome, forest_model, outcome_model_type, 0.0)
+active_forest$adjust_residual(forest_dataset, outcome, forest_model, FALSE, FALSE)
+Now we enter the main loop, which involves only two steps: sample the +forest, given the latent utilities, then sample the latent utilities +given the estimated conditional means defined by the forest and its +parameters.
+# Initialize the Markov chain with num_warmstart grow-from-root iterations
+gfr_flag <- T
+for (i in 1:num_samples) {
+
+ # The first num_warmstart iterations use the grow-from-root algorithm of He and Hahn
+ if (i > num_warmstart){
+ gfr_flag <- F
+ }
+
+ # Sample forest
+ forest_model$sample_one_iteration(
+ forest_dataset, outcome, forest_samples, active_forest,
+ rng, forest_model_config, global_model_config,
+ keep_forest = T, gfr = gfr_flag
+ )
+
+ # Get the current means
+ eta <- forest_samples$predict_raw_single_forest(forest_dataset, i-1)
+
+ # Sample latent normals, truncated according to the observed outcome y
+ U1 <- runif(n1,pnorm(0,eta[y==1],1),1)
+ zed[y==1] <- qnorm(U1,eta[y==1],1)
+ U0 <- runif(n - n1,0, pnorm(0,eta[y==0],1))
+ zed[y==0] <- qnorm(U0,eta[y==0],1)
+
+ # Propagate the newly sampled latent outcome to the BART model
+ outcome$update_data(zed)
+ forest_model$propagate_residual_update(outcome)
+}
The monotonicity constraint relies on a data augmentation as described in Papakostas et al. (2023). The implementation of this sampler is inherently cumbersome, as one of the "data" vectors is constructed from some observed data and some latent data, and there are two forest objects, one of which applies to all of the observations and one of which applies only to those observations with \(Z = 0\). We go into more detail about this sampler in a dedicated vignette. Here we include the code, but without reproducing the equations derived in Papakostas et al. (2023). What is most important is simply that
+\[\begin{equation} +\begin{aligned} +\operatorname{Pr}(V=1 \mid Z=0, X=x)&=\pi_a(x) = +\Phi_f(x)\Phi_h(x),\\ +\operatorname{Pr}(V=1 \mid Z=1, X=x)&=\pi_a(x)+\pi_c(x) = \Phi_f(x), +\end{aligned} +\end{equation}\] where \(\Phi_{\mu}(x)\) denotes the normal +cumulative distribution function with mean \(\mu(x)\) and variance 1.
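Note that this factorization enforces the constraint automatically: because \(\Phi_h(x)\) always lies in \((0,1)\), we have \(\Phi_f(x)\Phi_h(x) \leq \Phi_f(x)\) for any real-valued \(f\) and \(h\). A quick numerical sketch (with arbitrary made-up latent functions, not the fitted ones):

# Monotonicity holds by construction: pnorm() returns values in (0, 1),
# so Phi_f(x) * Phi_h(x) <= Phi_f(x) for any latent functions f and h
xgrid_demo <- seq(0, 3, length.out = 5)
f_demo <- 1 - xgrid_demo          # arbitrary made-up latent functions
h_demo <- 0.5*xgrid_demo - 1

pi_a_demo  <- pnorm(f_demo)*pnorm(h_demo)  # Pr(V = 1 | Z = 0, X = x)
pi_ac_demo <- pnorm(f_demo)                # Pr(V = 1 | Z = 1, X = x)
all(pi_a_demo <= pi_ac_demo)               # TRUE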
+We first create a secondary data matrix for the \(Z=0\) group only. We also set all of the +hyperparameters and initialize the latent variables.
+# Fit the monotone probit model to the treatment such that Pr(V = 1 | Z = 1, X=x) >= Pr(V = 1 | Z = 0,X = x)
+
+X_h <- as.matrix(X[z==0,])
+n0 <- sum(z==0)
+n1 <- sum(z==1)
+
+num_trees_f <- 50
+num_trees_h <- 20
+feature_types <- as.integer(rep(0, 1)) # 0 = numeric
+var_weights <- rep(1,1)
+cutpoint_grid_size = 100
+global_variance_init = 1.
+tau_init = 1/num_trees_h
+leaf_prior_scale = matrix(c(tau_init), ncol = 1)
+nu <- 4
+lambda <- 0.5
+a_leaf <- 2.
+b_leaf <- 0.5
+leaf_regression <- F # fit a constant leaf mean BART model
+
+# Instantiate the C++ dataset objects
+forest_dataset_f <- createForestDataset(X)
+forest_dataset_h <- createForestDataset(X_h)
+
+# Tell it we're fitting a normal BART model
+outcome_model_type <- 0
+
+# Set up model configuration objects
+forest_model_config_f <- createForestModelConfig(
+ feature_types = feature_types, num_trees = num_trees_f, num_features = ncol(X),
+ num_observations = nrow(X), variable_weights = var_weights, leaf_dimension = 1,
+ alpha = alpha, beta = beta, min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth, leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale, cutpoint_grid_size = cutpoint_grid_size
+)
+forest_model_config_h <- createForestModelConfig(
+ feature_types = feature_types, num_trees = num_trees_h, num_features = ncol(X_h),
+ num_observations = nrow(X_h), variable_weights = var_weights, leaf_dimension = 1,
+ alpha = alpha, beta = beta, min_samples_leaf = min_samples_leaf,
+ max_depth = max_depth, leaf_model_type = outcome_model_type,
+ leaf_model_scale = leaf_prior_scale, cutpoint_grid_size = cutpoint_grid_size
+)
+global_model_config <- createGlobalModelConfig(global_error_variance = 1)
+
+# Instantiate the sampling data structures
+forest_model_f <- createForestModel(forest_dataset_f, forest_model_config_f, global_model_config)
+forest_model_h <- createForestModel(forest_dataset_h, forest_model_config_h, global_model_config)
+
+# Instantiate containers of forest samples
+forest_samples_f <- createForestSamples(num_trees_f, 1, T)
+forest_samples_h <- createForestSamples(num_trees_h, 1, T)
+
+# Instantiate "active" forests
+active_forest_f <- createForest(num_trees_f, 1, T)
+active_forest_h <- createForest(num_trees_h, 1, T)
+
+# Set algorithm specifications
+# these are set in the earlier script for the outcome model; number of draws needs to be commensurable
+
+#num_warmstart <- 10
+#num_mcmc <- 2000
+#num_samples <- num_warmstart + num_mcmc
+
+# Initialize the Markov chain
+
# Initialize (R0, R1), the latent binary variables that enforce monotonicity
+
+v1 <- v[z==1]
+v0 <- v[z==0]
+
+R1 = rep(NA,n0)
+R0 = rep(NA,n0)
+
+R1[v0==1] <- 1
+R0[v0==1] <- 1
+
+R1[v0 == 0] <- 0
+R0[v0 == 0] <- sample(c(0,1),sum(v0==0),replace=TRUE)
+
+# The first n1 observations of vaug are actually observed
+# The next n0 of them are the latent variable R1
+vaug <- c(v1, R1)
+
+# Initialize the Albert and Chib latent Gaussian variables
+z_f <- (2*as.numeric(vaug) - 1)
+z_h <- (2*as.numeric(R0)-1)
+z_f <- z_f/sd(z_f)
+z_h <- z_h/sd(z_h)
+
+# Pass these variables to the BART models as outcome variables
+outcome_f <- createOutcome(z_f)
+outcome_h <- createOutcome(z_h)
+
+# Initialize active forests to constant (0) predictions
+active_forest_f$prepare_for_sampler(forest_dataset_f, outcome_f, forest_model_f, outcome_model_type, 0.0)
+active_forest_h$prepare_for_sampler(forest_dataset_h, outcome_h, forest_model_h, outcome_model_type, 0.0)
+active_forest_f$adjust_residual(forest_dataset_f, outcome_f, forest_model_f, FALSE, FALSE)
+active_forest_h$adjust_residual(forest_dataset_h, outcome_h, forest_model_h, FALSE, FALSE)
Now we run the main sampling loop, which consists of three key steps: sampling the BART forests given the latent probit utilities; sampling the latent binary outcome pairs given the forest predictions and the latent utilities (this is the step that enforces monotonicity); and finally sampling the latent utilities.
+# PART IV: run the Markov chain
+
+# Initialize the Markov chain with num_warmstart grow-from-root iterations
+gfr_flag <- T
+for (i in 1:num_samples) {
+
+ # Switch over to random walk Metropolis-Hastings tree updates after num_warmstart
+ if (i > num_warmstart) {
+ gfr_flag <- F
+ }
+
+ # Step 1: Sample the BART forests
+
+ # Sample forest for the function f based on (y_f, R1)
+ forest_model_f$sample_one_iteration(
+ forest_dataset_f, outcome_f, forest_samples_f, active_forest_f,
+ rng, forest_model_config_f, global_model_config,
+ keep_forest = T, gfr = gfr_flag
+ )
+
+ # Sample forest for the function h based on outcome R0
+ forest_model_h$sample_one_iteration(
+ forest_dataset_h, outcome_h, forest_samples_h, active_forest_h,
+ rng, forest_model_config_h, global_model_config,
+ keep_forest = T, gfr = gfr_flag
+ )
+
+ # Extract the means for use in sampling the latent variables
+ eta_f <- forest_samples_f$predict_raw_single_forest(forest_dataset_f, i-1)
+ eta_h <- forest_samples_h$predict_raw_single_forest(forest_dataset_h, i-1)
+
+
+ # Step 2: sample the latent binary pair (R0, R1) given eta_h, eta_f, and y_g
+
+ # Three cases: (0,0), (0,1), (1,0)
+ w1 <- (1 - pnorm(eta_h[v0==0]))*(1-pnorm(eta_f[n1 + which(v0==0)]))
+ w2 <- (1 - pnorm(eta_h[v0==0]))*pnorm(eta_f[n1 + which(v0==0)])
+ w3 <- pnorm(eta_h[v0==0])*(1 - pnorm(eta_f[n1 + which(v0==0)]))
+
+ s <- w1 + w2 + w3
+ w1 <- w1/s
+ w2 <- w2/s
+ w3 <- w3/s
+
+ u <- runif(sum(v0==0))
+ temp <- 1*(u < w1) + 2*(u > w1 & u < w1 + w2) + 3*(u > w1 + w2)
+
+ R1[v0==0] <- 1*(temp==2)
+ R0[v0==0] <- 1*(temp==3)
+
+ # Redefine y with the updated R1 component
+ vaug <- c(v1, R1)
+
+ # Step 3: sample the latent normals, given (R0, R1) and y_f
+
+ # First z0
+ U1 <- runif(sum(R0),pnorm(0, eta_h[R0==1],1),1)
+ z_h[R0==1] <- qnorm(U1, eta_h[R0==1],1)
+
+ U0 <- runif(n0 - sum(R0),0, pnorm(0, eta_h[R0==0],1))
+ z_h[R0==0] <- qnorm(U0, eta_h[R0==0],1)
+
+ # Then z1
+ U1 <- runif(sum(vaug),pnorm(0, eta_f[vaug==1],1),1)
+ z_f[vaug==1] <- qnorm(U1, eta_f[vaug==1],1)
+
+ U0 <- runif(n - sum(vaug),0, pnorm(0, eta_f[vaug==0],1))
+ z_f[vaug==0] <- qnorm(U0, eta_f[vaug==0],1)
+
+ # Propagate the updated outcomes through the BART models
+ outcome_h$update_data(z_h)
+ forest_model_h$propagate_residual_update(outcome_h)
+
+ outcome_f$update_data(z_f)
+ forest_model_f$propagate_residual_update(outcome_f)
+
+ # No more steps, just repeat a bunch of times
+}
+Now for the most interesting part, which is taking the stochtree BART +model fits and producing the causal estimates of interest.
First we set up our grid for plotting the functions in \(X\). This is possible in this example because the moderator, age, is one dimensional; in many applied problems this will not be the case and visualization will be substantially trickier.
+# Extract the credible intervals for the conditional treatment effects as a function of x.
+# We use a grid of values for plotting, with grid points that are typically fewer than the number of observations.
+
+ngrid <- 200
+xgrid <- seq(0.1,2.5,length.out = ngrid)
+X_11 <- cbind(xgrid,rep(1,ngrid),rep(1,ngrid))
+
+X_00 <- cbind(xgrid,rep(0,ngrid),rep(0,ngrid))
+X_01 <- cbind(xgrid,rep(0,ngrid),rep(1,ngrid))
+X_10 <- cbind(xgrid,rep(1,ngrid),rep(0,ngrid))
Next, we compute the true function values on this plotting grid, using the functions defined above when we generated our data.
+# Compute the true conditional outcome probabilities for plotting
+pi_strat <- pi_s(xgrid)
+w_a <- pi_strat[,1]
+w_n <- pi_strat[,2]
+w_c <- pi_strat[,3]
+
+w <- (w_c/(w_a + w_c))
+
+p11_true <- w*gamfun(xgrid,1,1,"c") + (1-w)*gamfun(xgrid,1,1,"a")
+
+w <- (w_c/(w_n + w_c))
+
+p00_true <- w*gamfun(xgrid,0,0,"c") + (1-w)*gamfun(xgrid,0,0,"n")
+
+# Compute the true ITT_c for plotting and comparison
+itt_c_true <- gamfun(xgrid,1,1,"c") - gamfun(xgrid,0,0,"c")
+
+# Compute the true LATE for plotting and comparison
+LATE_true0 <- gamfun(xgrid,1,0,"c") - gamfun(xgrid,0,0,"c")
+LATE_true1 <- gamfun(xgrid,1,1,"c") - gamfun(xgrid,0,1,"c")
Next we populate the data structures for stochtree to operate on, call the predict functions to extract the predictions, and convert them to the probability scale using the built-in pnorm function.
# Datasets for counterfactual predictions
+forest_dataset_grid <- createForestDataset(cbind(xgrid))
+forest_dataset_11 <- createForestDataset(X_11)
+forest_dataset_00 <- createForestDataset(X_00)
+forest_dataset_10 <- createForestDataset(X_10)
+forest_dataset_01 <- createForestDataset(X_01)
+
+# Forest predictions
+preds_00 <- forest_samples$predict(forest_dataset_00)
+preds_11 <- forest_samples$predict(forest_dataset_11)
+preds_01 <- forest_samples$predict(forest_dataset_01)
+preds_10 <- forest_samples$predict(forest_dataset_10)
+
+# Probability transformations
+phat_00 <- pnorm(preds_00)
+phat_11 <- pnorm(preds_11)
+phat_01 <- pnorm(preds_01)
+phat_10 <- pnorm(preds_10)
+
+# Cleanup
+rm(preds_00)
+rm(preds_11)
+rm(preds_01)
+rm(preds_10)
+
+
+preds_ac <- forest_samples_f$predict(forest_dataset_grid)
+phat_ac <- pnorm(preds_ac)
+
+preds_adj <- forest_samples_h$predict(forest_dataset_grid)
+phat_a <- pnorm(preds_ac)*pnorm(preds_adj)
+rm(preds_adj)
+rm(preds_ac)
+
+phat_c <- phat_ac - phat_a
+
+phat_n <- 1 - phat_ac
+Now we may plot posterior means of various quantities (as a function +of \(X\)) to visualize how well the +models are fitting.
+# Set up the plotting window
+par(mfrow=c(1,2))
+
+# Plot the fitted outcome probabilities against the truth
+plot(p11_true,rowMeans(phat_11),pch=20,cex=0.5,bty='n')
+abline(0,1,col='red')
+
+plot(p00_true,rowMeans(phat_00),pch=20,cex=0.5,bty='n')
+abline(0,1,col='red')
+par(mfrow=c(1,3))
+plot(rowMeans(phat_ac),w_c +w_a,pch=20)
+abline(0,1,col='red')
+
+plot(rowMeans(phat_a),w_a,pch=20)
+abline(0,1,col='red')
+
+plot(rowMeans(phat_c),w_c,pch=20)
+abline(0,1,col='red')
+These plots are not as pretty as we might hope, but mostly this is a +function of how difficult it is to learn conditional probabilities from +binary outcomes. That we capture the trend broadly turns out to be +adequate for estimating treatment effects. Fit does improve with simpler +DGPs and larger training sets, as can be confirmed by experimentation +with this script.
Lastly, we can construct the estimate of the \(ITT_c\) and compare it to the true value as well as the \(Z=0\) and \(Z=1\) complier average treatment effects (also called "local average treatment effects" or LATE). The key step in this process is to center our posterior, at each iteration of the sampler, at the value implied by a valid exclusion restriction, constrained to lie within the identified interval. For some draws this will not be possible, as that value will fall outside the identification region.
+# Generate draws from the posterior of the treatment effect
+# centered at the point-identified value under the exclusion restriction
+itt_c <- late <- matrix(NA,ngrid, ncol(phat_c))
+ss <- 6
+for (j in 1:ncol(phat_c)){
+
+ # Value of gamma11 implied by an exclusion restriction
+ gamest11 <- ((phat_a[,j] + phat_c[,j])/phat_c[,j])*phat_11[,j] - phat_10[,j]*phat_a[,j]/phat_c[,j]
+
+ # Identified region for gamma11
+ lower11 <- pmax(rep(0,ngrid), ((phat_a[,j] + phat_c[,j])/phat_c[,j])*phat_11[,j] - phat_a[,j]/phat_c[,j])
+ upper11 <- pmin(rep(1,ngrid), ((phat_a[,j] + phat_c[,j])/phat_c[,j])*phat_11[,j])
+
+ # Center a beta distribution at gamma11, but restricted to (lower11, upper11)
+ # do this by shifting and scaling the mean, drawing from a beta on (0,1), then shifting and scaling to the
+ # correct restricted interval
+ m11 <- (gamest11 - lower11)/(upper11-lower11)
+
+ # Parameters to the beta
+ a1 <- ss*m11
+ b1 <- ss*(1-m11)
+
+ # When the corresponding mean is out-of-range, sample from a beta with mass piled near the violated boundary
+ a1[m11<0] <- 1
+ b1[m11<0] <- 5
+
+ a1[m11>1] <- 5
+ b1[m11>1] <- 1
+
+ # Value of gamma00 implied by an exclusion restriction
+ gamest00 <- ((phat_n[,j] + phat_c[,j])/phat_c[,j])*phat_00[,j] - phat_01[,j]*phat_n[,j]/phat_c[,j]
+
+ # Identified region for gamma00
+ lower00 <- pmax(rep(0,ngrid), ((phat_n[,j] + phat_c[,j])/phat_c[,j])*phat_00[,j] - phat_n[,j]/phat_c[,j])
+ upper00 <- pmin(rep(1,ngrid), ((phat_n[,j] + phat_c[,j])/phat_c[,j])*phat_00[,j])
+
+ # Center a beta distribution at gamma00, but restricted to (lower00, upper00)
+ # do this by shifting and scaling the mean, drawing from a beta on (0,1), then shifting and scaling to the
+ # correct restricted interval
+ m00 <- (gamest00 - lower00)/(upper00-lower00)
+
+ a0 <- ss*m00
+ b0 <- ss*(1-m00)
+
+ a0[m00<0] <- 1
+ b0[m00<0] <- 5
+
+ a0[m00>1] <- 5
+ b0[m00>1] <- 1
+
+ # ITT and LATE
+ itt_c[,j] <- lower11 + (upper11 - lower11)*rbeta(ngrid,a1,b1) - (lower00 + (upper00 - lower00)*rbeta(ngrid,a0,b0))
+ late[,j] <- gamest11 - gamest00
+}
+
+upperq <- apply(itt_c,1,quantile,0.975)
+lowerq <- apply(itt_c,1,quantile,0.025)
+And now we can plot all of this, using the “polygon” function to +shade posterior quantiles.
+par(mfrow=c(1,1))
+plot(xgrid,itt_c_true,pch=20,cex=0.5,ylim=c(-0.75,0.05),bty='n',type='n',xlab='x',ylab='Treatment effect')
+
+upperq_er <- apply(late,1,quantile,0.975,na.rm=TRUE)
+
+lowerq_er <- apply(late,1,quantile,0.025,na.rm=TRUE)
+
+polygon(c(xgrid,rev(xgrid)),c(lowerq,rev(upperq)),col=rgb(0.5,0.25,0,0.25),pch=20,border=FALSE)
+polygon(c(xgrid,rev(xgrid)),c(lowerq_er,rev(upperq_er)),col=rgb(0,0,0.5,0.25),pch=20,border=FALSE)
+
+itt_c_est <- rowMeans(itt_c)
+late_est <- rowMeans(late)
+lines(xgrid,late_est,col="slategray",lwd=3)
+
+lines(xgrid,itt_c_est,col='goldenrod1',lwd=1)
+
+lines(xgrid,LATE_true0,col="black",lwd=2,lty=3)
+lines(xgrid,LATE_true1,col="black",lwd=2,lty=2)
+
+lines(xgrid,itt_c_true,col="black",lwd=1)
+With a valid exclusion restriction the three black curves would all +be the same. With no exclusion restriction, as we have here, the direct +effect of \(Z\) on \(Y\) (the vaccine reminder on flu status) +makes it so these three treatment effects are different. Specifically, +the \(ITT_c\) compares getting the +vaccine and getting the reminder to not getting the vaccine +and not getting the reminder. When both things have risk +reducing impacts, we see a larger risk reduction over all values of +\(X\). Meanwhile, the two LATE effects +compare the isolated impact of the vaccine among people that got the +reminder and those that didn’t, respectively. Here, not getting the +reminder makes the vaccine more effective because the risk reduction is +as a fraction of baseline risk, and the reminder reduces baseline risk +in our DGP.
We see also that the posterior mean of the \(ITT_c\) estimate (gold) is very similar to the posterior mean under the assumption of an exclusion restriction (gray). This is by design: they will only deviate due to Monte Carlo variation or the rare situations where the exclusion restriction is incompatible with the identification interval.
By changing the sample size and various aspects of the DGP, this script allows us to build some intuition for how features of the DGP affect posterior inferences, particularly how violations of assumptions affect accuracy and posterior uncertainty.
+We study conditional average treatment effect (CATE) estimation for +regression discontinuity designs (RDD), in which treatment assignment is +based on whether a particular covariate — referred to as the running +variable — lies above or below a known value, referred to as the cutoff +value. Because treatment is deterministically assigned as a known +function of the running variable, RDDs are trivially deconfounded: +treatment assignment is independent of the outcome variable, given the +running variable (because treatment is conditionally constant). However, +estimation of treatment effects in RDDs is more complicated than simply +controlling for the running variable, because doing so introduces a +complete lack of overlap, which is the other key condition needed to +justify regression adjustment for causal inference. Nonetheless, the +CATE at the cutoff, \(X=c\), +may still be identified provided the conditional expectation \(E[Y \mid X,W]\) is continuous at that point +for all \(W=w\). We exploit +this assumption with the leaf regression BART model implemented in +Stochtree, which allows us to define an explicit prior on the CATE. We +now describe the RDD setup and our model in more detail, and provide +code to implement our approach.
+We conceptualize the treatment effect estimation problem via a +quartet of random variables \((Y, X, Z, +U)\). The variable \(Y\) is the +outcome variable; the variable \(X\) is +the running variable; the variable \(Z\) is the treatment assignment indicator +variable; and the variable \(U\) +represents additional, possibly unobserved, causal factors. What +specifically makes this correspond to an RDD is that we stipulate that +\(Z = I(X > c)\), for cutoff \(c\). We assume \(c = 0\) without loss of generality.
+The following figure depicts a causal diagram representing the +assumed causal relationships between these variables. Two key features +of this diagram are one, that \(X\) +blocks the impact of \(U\) on \(Z\): in other words, \(X\) satisfies the back-door criterion for +learning causal effects of \(Z\) on +\(Y\). And two, \(X\) and \(U\) are not descendants of \(Z\).
A causal directed acyclic graph representing the general structure of a regression discontinuity design problem.
+Using this causal diagram, we may express \(Y\) as some function of its graph parents, +the random variables \((X,Z,U)\): \[Y = F(X,Z,U).\] In principle, we may +obtain draws of \(Y\) by first drawing +\((X,Z,U)\) according to their joint +distribution and then applying the function \(F\). Similarly, we may relate this +formulation to the potential outcomes framework straightforwardly: \[\begin{equation} +\begin{split} +Y^1 &= F(X,1,U),\\ +Y^0 &= F(X,0,U). +\end{split} +\end{equation}\] Here, draws of \((Y^1, +Y^0)\) may be obtained (in principle) by drawing \((X,Z,U)\) from their joint distribution and +using only the \((X,U)\) elements as +arguments in the above two equations, ``discarding’’ the drawn value of +\(Z\). Note that this construction +implies the consistency condition: \(Y = Y^1 Z + Y^0 ( 1 - Z)\). Likewise, this +construction implies the no interference condition because each +\(Y_i\) is considered to be produced +with arguments (\(X_i, Z_i, U_i)\) and +not those from other units \(j\); in +particular, in constructing \(Y_i\), +\(F\) does not take \(Z_j\) for \(j +\neq i\) as an argument.
+Next, we define the following conditional expectations \[\begin{equation}
+\begin{split}
+\mu_1(x) &= E[ F(x, 1, U) \mid X = x] ,\\
+\mu_0(x) &= E[ F(x, 0, U) \mid X = x],
+\end{split}
+\end{equation}\] with which we can define the treatment effect
+function \[\tau(x) = \mu_1(x) -
+\mu_0(x).\] Because \(X\)
+satisfies the back-door criterion, \(\mu_1\) and \(\mu_0\) are estimable from the data,
+meaning that \[\begin{equation}
+\begin{split}
+\mu_1(x) &= E[ F(x, 1, U) \mid X = x] = E[Y \mid X=x, Z=1],\\
+\mu_0(x) &= E[ F(x, 0, U) \mid X = x] = E[Y \mid X=x, Z=0],
+\end{split}
+\end{equation}\]
+the right-hand-sides of which can be estimated from sample data, which
+we supposed to be independent and identically distributed realizations
+of \((Y_i, X_i, Z_i)\) for \(i = 1, \dots, n\). However, because \(Z = I(X >0)\) we can in fact only learn
+\(\mu_1(x)\) for \(X > 0\) and \(\mu_0(x)\) for \(X < 0\). In potential outcomes
+terminology, conditioning on \(X\)
+satisfies ignorability, \[(Y^1, Y^0) \perp
+\!\!\! \perp Z \mid X,\] but not strong ignorability,
+because overlap is violated. Overlap would require that \[0 < P(Z = 1 \mid X=x) < 1 \;\;\;\; \forall
+x,\] which is clearly violated by the RDD assumption that \(Z = I(X > 0)\). Consequently, the
+overall ATE, \(\bar{\tau} =
+E(\tau(X)),\) is unidentified, and we must content ourselves with
+estimating \(\tau(0)\), the conditional
+average effect at the point \(x = 0\),
+which we estimate as the difference between \(\mu_1(0) - \mu_0(0)\). This is possible for
+continuous \(X\) so long as one is
+willing to assume that \(\mu_1(x)\) and
+\(\mu_0(x)\) are both suitably smooth
+functions of \(x\): any inferred
+discontinuity at \(x = 0\) must
+therefore be attributable to treatment effect.
We are concerned with learning not only \(\tau(0)\), the "RDD ATE" (i.e. the CATE at \(x = 0\)), but also RDD CATEs, \(\tau(0, \mathrm{w})\), for some covariate vector \(\mathrm{w}\). Incorporating additional covariates in the above framework turns out to be straightforward, simply by defining \(W = \varphi(U)\) to be an observable function of the (possibly unobservable) causal factors \(U\). We may then define our potential outcome means as \[\begin{equation}
\begin{split}
\mu_1(x,\mathrm{w}) &= E[ F(x, 1, U) \mid X = x, W = \mathrm{w}] = E[Y \mid X=x, W=\mathrm{w}, Z=1],\\
\mu_0(x,\mathrm{w}) &= E[ F(x, 0, U) \mid X = x, W = \mathrm{w}] = E[Y \mid X=x, W = \mathrm{w}, Z=0],
\end{split}
\end{equation}\] and our treatment effect function as \[\tau(x,\mathrm{w}) = \mu_1(x,\mathrm{w}) - \mu_0(x,\mathrm{w}).\] We consider our data to be independent and identically distributed realizations \((Y_i, X_i, Z_i, W_i)\) for \(i = 1, \dots, n\). Furthermore, we must assume that \(\mu_1(x,\mathrm{w})\) and \(\mu_0(x,\mathrm{w})\) are suitably smooth functions of \(x\) for each \(\mathrm{w}\); in other words, for each value of \(\mathrm{w}\) the usual continuity-based identification assumptions must hold.
With this framework and notation established, CATE estimation in RDDs boils down to estimation of conditional expectation functions \(E[Y \mid X=x, W=\mathrm{w}, Z=z]\), for which we turn to BART models.
+We propose a BART model where the trees are allowed to split on \((x,\mathrm{w})\) but where each leaf node +parameter is a vector of regression coefficients tailored to the RDD +context (rather than a scalar constant as in default BART). In one +sense, such a model can be seen as implying distinct RDD ATE regressions +for each subgroup determined by a given tree; however, this intuition is +only heuristic, as the entire model is fit jointly as an ensemble of +such trees. Instead, we motivate this model as a way to estimate the +necessary conditional expectations via a parametrization where the +conditional treatment effect function can be explicitly regularized, as +follows.
+Let \(\psi\) denote the following
+basis vector: \[\begin{equation}
+\psi(x,z) = \begin{bmatrix}
+1 & z x & (1-z) x & z
+\end{bmatrix}.
+\end{equation}\] To generalize the original BART model, we define
+\(g_j(x, \mathrm{w}, z)\) as a
+piecewise linear function as follows. Let \(b_j(x, \mathrm{w})\) denote the node in the
+\(j\)th tree which contains the point
+\((x, \mathrm{w})\); then the
+prediction function for tree \(j\) is
+defined to be: \[\begin{equation}
+g_j(x, \mathrm{w}, z) = \psi(x, z) \Gamma_{b_j(x, \mathrm{w})}
+\end{equation}\]
+for a leaf-specific regression vector \(\Gamma_{b_j} = (\eta_{b_j}, \lambda_{b_j},
+\theta_{b_j}, \Delta_{b_j})^t\). Therefore, letting \(n_{b_j}\) denote the number of data points
+allocated to node \(b\) in the \(j\)th tree and \(\Psi_{b_j}\) denote the \(n_{b_j} \times 4\) matrix, with rows equal
+to \(\psi(x,z)\) for all \((x_i,z_i) \in b_j\), the model for
+observations assigned to leaf \(b_j\),
+can be expressed in matrix notation as: \[\begin{equation}
+\begin{split}
+\mathbf{Y}_{b_j} \mid \Gamma_{b_j}, \sigma^2 &\sim
+\mathrm{N}(\Psi_{b_j} \Gamma_{b_j},\sigma^2)\\
+\Gamma_{b_j} &\sim \mathrm{N}(0, \Sigma_0),
+\end{split} \label{eq:leaf.regression}
+\end{equation}\] where we set \(\Sigma_0 = \frac{0.033}{J} \mbox{I}\) as a
+default (for \(x\) vectors standardized
+to have unit variance in-sample).
This choice of basis entails that the RDD CATE at \(\mathrm{w}\), \(\tau(0, \mathrm{w})\), is a sum of the +\(\Delta_{b_j(0, \mathrm{w})}\) +elements across all trees \(j = 1, \dots, +J\): \[\begin{equation} +\begin{split} +\tau(0, \mathrm{w}) &= E[Y^1 \mid X=0, W = \mathrm{w}] - E[Y^0 \mid +X = 0, W = \mathrm{w}]\\ +& = E[Y \mid X=0, W = \mathrm{w}, Z = 1] - E[Y \mid X = 0, W = +\mathrm{w}, Z = 0]\\ +&= \sum_{j = 1}^J g_j(0, \mathrm{w}, 1) - \sum_{j = 1}^J g_j(0, +\mathrm{w}, 0)\\ +&= \sum_{j = 1}^J \psi(0, 1) \Gamma_{b_j(0, \mathrm{w})} - \sum_{j += 1}^J \psi(0, 0) \Gamma_{b_j(0, \mathrm{w})} \\ +& = \sum_{j = 1}^J \Bigl( \psi(0, 1) - \psi(0, 0) +\Bigr) \Gamma_{b_j(0, \mathrm{w})} \\ +& = \sum_{j = 1}^J \Bigl( (1,0,0,1) - +(1,0,0,0) \Bigr) \Gamma_{b_j(0, \mathrm{w})} \\ +&= \sum_{j=1}^J \Delta_{b_j(0, \mathrm{w})}. +\end{split} +\end{equation}\] As a result, the priors on the \(\Delta\) coefficients directly regularize +the treatment effect. We set the tree and error variance priors as in +the original BART model.
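One way to gauge the strength of that regularization: under the default \(\Sigma_0\) above, and since the leaf coefficients are independent across trees a priori under the stated prior, the implied prior variance of the CATE is
\[\begin{equation}
\operatorname{Var}\bigl(\tau(0, \mathrm{w})\bigr) = \sum_{j=1}^J \operatorname{Var}\bigl(\Delta_{b_j(0, \mathrm{w})}\bigr) = J \cdot \frac{0.033}{J} = 0.033,
\end{equation}\]
i.e. a prior standard deviation of roughly 0.18 for the treatment effect on the standardized outcome scale, regardless of the number of trees \(J\).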
+The following figures provide a graphical depiction of how the BARDDT +model fits a response surface and thereby estimates CATEs for distinct +values of \(\mathrm{w}\). For +simplicity only two trees are used in the illustration, while in +practice dozens or hundreds of trees may be used (in our simulations and +empirical example, we use 150 trees).
Two regression trees with splits in x and a single scalar w. Node images depict the g(x,w,z) function (in x) defined by that node's coefficients. The vertical gap between the two line segments in a node that contains x=0 is that node's contribution to the CATE at x=0. Note that only such nodes contribute to CATE prediction at x=0.

The two top figures show the same two regression trees as in the preceding figure, now represented as a partition of the x-w plane. Labels in each partition correspond to the leaf nodes depicted in the previous picture. The bottom figure shows the partition of the x-w plane implied by the sum of the two trees; the red dashed line marks the point W=w* and the combination of nodes that include this point.

Left: The function fit at W = w* for the two trees shown in the previous two figures, shown superimposed. Right: The aggregated fit achieved by summing the contributions of the two regression tree fits shown at left. The magnitude of the discontinuity at x = 0 (located at the dashed gray vertical line) represents the treatment effect at that point. Different values of w will produce distinct fits; for the two trees shown, there can be three distinct fits based on the value of w.
+An interesting property of BARDDT can be seen in this small +illustration — by letting the regression trees split on the running +variable, there is no need to separately define a ‘bandwidth’ as is used +in the polynomial approach to RDD. Instead, the regression trees +automatically determine (in the course of posterior sampling) when to +‘prune’ away regions away from the cutoff value. There are two notable +features of this approach. One, different trees in the ensemble are +effectively using different local bandwidths and these fits are then +blended together. For example, in the bottom panel of the second figure, +we obtain one bandwidth for the region \(d+i\), and a different one for regions +\(a+g\) and \(d+g\). Two, for cells in the tree partition +that do not span the cutoff, the regression within that partition +contains no causal contrasts — all observations either have \(Z = 1\) or \(Z = +0\). For those cells, the treatment effect coefficient is +ill-posed and in those cases the posterior sampling is effectively a +draw from the prior; however, such draws correspond to points where the +treatment effect is unidentified and none of these draws contribute to +the estimation of \(\tau(0, +\mathrm{w})\) — for example, only nodes \(a+g\), \(d+g\), and \(d+i\) provide any contribution. This +implies that draws of \(\Delta\) +corresponding to nodes not predicting at \(X=0\) will always be draws from the prior, +which has some intuitive appeal.
+In this section, we provide code for implementing our model in
+stochtree
on a popular RDD dataset. First, let us load
+stochtree
and all the necessary libraries for our posterior
+analysis.
## Load libraries
+library(stochtree)
+library(rpart)
+library(rpart.plot)
+library(xtable)
+library(foreach)
+library(doParallel)
+## Loading required package: iterators
+## Loading required package: parallel
+The data comes from Lindo, Sanders, and +Oreopoulos (2010), who analyze data on college students enrolled +in a large Canadian university in order to evaluate the effectiveness of +an academic probation policy. Students who present a grade point average +(GPA) lower than a certain threshold at the end of each term are placed +on academic probation and must improve their GPA in the subsequent term +or else face suspension. We are interested in how being put on probation +or not, \(Z\), affects students’ GPA, +\(Y\), at the end of the current term. +The running variable, \(X\), is the +negative distance between a student’s previous-term GPA and the +probation threshold, so that students placed on probation (\(Z = 1\)) have a positive score and the +cutoff is 0. Potential moderators, \(W\), are:
gender (male),
age at entry (age_at_entry),
whether the student was born in North America (bpl_north_america),
total credits taken in the first year (totcredits_year1),
campus location (loc_campus1, 2 and 3), and
high school grade percentile (hsgrade_pct).
).## Load and organize data
+data <- read.csv("https://raw.githubusercontent.com/rdpackages-replication/CIT_2024_CUP/refs/heads/main/CIT_2024_CUP_discrete.csv")
+y <- data$nextGPA
+x <- data$X
+x <- x/sd(x) ## we always standardize X
+w <- data[,4:11]
+### Must define categorical features as ordered/unordered factors
+w$totcredits_year1 <- factor(w$totcredits_year1,ordered=TRUE)
+w$male <- factor(w$male,ordered=FALSE)
+w$bpl_north_america <- factor(w$bpl_north_america,ordered=FALSE)
+w$loc_campus1 <- factor(w$loc_campus1,ordered=FALSE)
+w$loc_campus2 <- factor(w$loc_campus2,ordered=FALSE)
+w$loc_campus3 <- factor(w$loc_campus3,ordered=FALSE)
+c <- 0
+n <- nrow(data)
+z <- as.numeric(x>c)
+h <- 0.1 ## window for prediction sample
+test <- -h < x & x < h
+ntest <- sum(test)
+Generically, our estimand is the CATE function at \(x = 0\); i.e. \(\tau(0, \mathrm{w})\). The key practical +question is which values of \(\mathrm{w}\) to consider. Some values of +\(\mathrm{w}\) will not be +well-represented near \(x=0\) and so no +estimation technique will be able to estimate those points effectively. +As such, to focus on feasible points — which will lead to interesting +comparisons between methods — we recommend restricting the evaluation +points to the observed \(\mathrm{w}_i\) +such that \(|x_i| \leq \delta\), for +some \(\delta > 0\). In our example, +we use \(\delta = 0.1\) for a +standardized \(x\) variable. Therefore, +our estimand of interest is a vector of treatment effects: \[\begin{equation} +\tau(0, \mathrm{w}_i) \;\;\; \forall i \;\mbox{ such that }\; |x_i| \leq +\delta. +\end{equation}\]
In order to implement our model, we construct the basis matrix Psi, as defined before: Psi <- cbind(rep(1,n), z*x, (1-z)*x, z)
. The
+training matrix for the model is as.matrix(cbind(x,w))
,
+which we feed into the stochtree::bart
function via the
+X_train
parameter. The basis vector Psi
is fed
+into the function via the leaf_basis_train
parameter. The
+list object barddt.mean.parmlist
defines options for the
+mean forest (a different list can be defined for a variance forest in
+the case of heteroscedastic BART, which we do not consider here).
+Importantly, in this list we define parameter
+sigma2_leaf_init = diag(rep(0.1/150,4))
, which sets \(\Sigma_0\) as described above. Now, we can
+fit the model, which is saved in object barddt.fit
.
Once the model is fit, we need three elements to obtain the CATE predictions: the basis vectors at the cutoff for \(z=1\) and \(z=0\), the test matrix \([X \quad W]\) at the cutoff, and the testing sample. We define the prediction basis vectors \(\psi_1 = [1 \quad 0 \quad 0 \quad 1]\) and \(\psi_0 = [1 \quad 0 \quad 0 \quad 0]\), which correspond to \(\psi\) at \((x=0, z=1)\) and \((x=0, z=0)\), respectively (a short derivation after this paragraph shows why differencing these two vectors isolates the treatment effect at the cutoff). These vectors are written in R as Psi1 <- cbind(rep(1,n), rep(c,n), rep(0,n), rep(1,n)) and Psi0 <- cbind(rep(1,n), rep(0,n), rep(c,n), rep(0,n)). Then, we write the test matrix at \((x=0, \mathrm{w})\) as xmat_test <- as.matrix(cbind(rep(0,n), w)). Finally, we must define the testing window. As discussed previously, our window is set such that \(|x| \leq 0.1\), which can be written in R as test <- -0.1 < x & x < 0.1.
Once all of these elements are set, we can obtain the outcome predictions at the cutoff by running predict(barddt.fit, xmat_test, Psi1) (resp. predict(barddt.fit, xmat_test, Psi0)). Each of these calls returns a list, from which we extract the element y_hat to obtain the posterior distribution of the outcome. In the code below, the treated and control outcome predictions are saved in the matrix objects pred1 and pred0, respectively. We then obtain draws from the CATE posterior by simply subtracting these matrices. The function below outlines how to perform each of these steps in R.
fit.barddt <- function(y, x, w, z, test, c)
{
  n <- length(y)   ## number of training observations
  ## Lists of parameters for the stochtree BART function
  barddt.global.parmlist <- list(standardize=TRUE, sample_sigma_global=TRUE, sigma2_global_init=0.1)
  barddt.mean.parmlist <- list(num_trees=50, min_samples_leaf=20, alpha=0.95, beta=2,
                               max_depth=20, sample_sigma2_leaf=FALSE,
                               sigma2_leaf_init = diag(rep(0.1/150,4)))
  ## Basis for the leaf regressions: (1, z*x, (1-z)*x, z)
  Psi <- cbind(rep(1,n), z*x, (1-z)*x, z)
  ## Model fit
  barddt.fit <- stochtree::bart(X_train = as.matrix(cbind(x,w)), y_train = y,
                                leaf_basis_train = Psi,
                                mean_forest_params = barddt.mean.parmlist,
                                general_params = barddt.global.parmlist,
                                num_mcmc = 1000, num_gfr = 30)
  ## Basis vectors and test matrix for outcome predictions at X = c
  Psi1 <- cbind(rep(1,n), rep(c,n), rep(0,n), rep(1,n))   ## psi at (x=c, z=1)
  Psi0 <- cbind(rep(1,n), rep(0,n), rep(c,n), rep(0,n))   ## psi at (x=c, z=0)
  Psi1 <- Psi1[test,]
  Psi0 <- Psi0[test,]
  xmat_test <- as.matrix(cbind(rep(0,n), w)[test,])
  ## Outcome predictions at the cutoff
  pred1 <- predict(barddt.fit, xmat_test, Psi1)$y_hat
  pred0 <- predict(barddt.fit, xmat_test, Psi0)$y_hat
  ## CATE posterior: treated minus control predictions
  out <- pred1 - pred0
  return(out)
}
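For reference, here is a minimal sketch of how this helper could be called with the objects constructed in the loading chunk (the multi-chain fit below is the one we actually analyze):
## Illustrative call; assumes y, x, w, z, test and c exist as defined above
cate_draws <- fit.barddt(y, x, w, z, test, c)
dim(cate_draws)                 ## one row per evaluation point, one column per posterior draw
summary(rowMeans(cate_draws))   ## posterior mean CATE at each evaluation point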
Now, we proceed to fit the BARDDT model. The procedure is exactly the same as described in the simulation section.
## We will sample multiple chains (run in parallel below)
num_chains <- 20
num_gfr <- 2
num_burnin <- 0
num_mcmc <- 500
bart_models <- list()
## Basis functions for training and testing; columns are (z*x, (1-z)*x, z, 1)
## (this ordering differs from fit.barddt above, but the prediction bases B1
## and B0 below are defined consistently with it)
B <- cbind(z*x, (1-z)*x, z, rep(1,n))
B1 <- cbind(rep(c,n), rep(0,n), rep(1,n), rep(1,n))   ## basis at (x=c, z=1)
B0 <- cbind(rep(0,n), rep(c,n), rep(0,n), rep(1,n))   ## basis at (x=c, z=0)
B1 <- B1[test,]
B0 <- B0[test,]
B_test <- rbind(B1, B0)
xmat_test <- cbind(x=rep(0,n), w)[test,]
xmat_test <- rbind(xmat_test, xmat_test)
### We stack the bases for Z=1 and Z=0 so that a single BART call returns both sets of Y(z) predictions;
### we then split the posterior matrix by Z and take the difference to get the CATE draws
## Sampling the chains in parallel (requires the foreach and doParallel packages)
library(foreach)
library(doParallel)
ncores <- 5
cl <- makeCluster(ncores)
registerDoParallel(cl)

start_time <- Sys.time()
bart_model_outputs <- foreach (i = 1:num_chains) %dopar% {
  random_seed <- i   ## per-chain seed (can be passed via general_params if your stochtree version supports it)
  ## Lists defining the BARDDT parameters
  barddt.global.parmlist <- list(standardize=TRUE, sample_sigma_global=TRUE, sigma2_global_init=0.1)
  barddt.mean.parmlist <- list(num_trees=50, min_samples_leaf=20, alpha=0.95, beta=2,
                               max_depth=20, sample_sigma2_leaf=FALSE,
                               sigma2_leaf_init = diag(rep(0.1/50,4)))
  bart_model <- stochtree::bart(
    X_train = cbind(x,w), leaf_basis_train = B, y_train = y,
    X_test = xmat_test, leaf_basis_test = B_test,
    num_gfr = num_gfr, num_burnin = num_burnin, num_mcmc = num_mcmc,
    general_params = barddt.global.parmlist, mean_forest_params = barddt.mean.parmlist
  )
  ## CATE draws for this chain: treated minus control predictions at the cutoff
  bart_model$y_hat_test[1:ntest,] - bart_model$y_hat_test[(ntest+1):(2*ntest),]
}
stopCluster(cl)
## Combine the CATE draws from all chains
pred <- do.call("cbind", bart_model_outputs)

end_time <- Sys.time()
print(end_time - start_time)
## Time difference of 9.554316 mins
## Save the results
saveRDS(pred, "bart_rdd_posterior.rds")
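As a quick aggregate check before examining heterogeneity (a sketch using the pred matrix just constructed), averaging the CATE draws over the evaluation sample gives a posterior for the average effect at the cutoff:
## Posterior of the average effect at the cutoff over the evaluation sample
## (each column of pred is one posterior draw of the CATE vector)
ate_draws <- colMeans(pred)
c(mean = mean(ate_draws), quantile(ate_draws, c(0.025, 0.975)))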
We now proceed to analyze the CATE posterior. The figure produced below presents a summary of the CATE posterior produced by BARDDT for this application. This picture is produced by fitting a regression tree, using \(W\) as the predictors, to the individual posterior mean CATEs: \[\begin{equation}
\bar{\tau}_i = \frac{1}{M} \sum_{h = 1}^M \tau^{(h)}(0, \mathrm{w}_i),
\end{equation}\] where \(h\) indexes each of \(M\) total posterior samples. As in our simulation studies, we restrict our posterior analysis to the \(\mathrm{w}_i\) values of observations with \(|x_i| \leq \delta = 0.1\) (after normalizing \(X\) to have standard deviation 1 in-sample). For the Lindo, Sanders, and Oreopoulos (2010) data, this means that BARDDT was trained on \(n = 40,582\) observations, of which 1,602 satisfy \(|x_i| \leq 0.1\); these observations were used to generate the effect moderation tree.
## Fit a regression tree to the posterior mean CATEs
## (requires the rpart and rpart.plot packages)
library(rpart)
library(rpart.plot)
cate <- rpart(y~., data.frame(y=rowMeans(pred), w[test,]), control = rpart.control(cp=0.015))
## Assign separate colors to the lowest- and highest-CATE leaf nodes
plot.cart <- function(rpart.obj)
{
  rpart.frame <- rpart.obj$frame
  left <- which.min(rpart.frame$yval)    ## node with the smallest mean CATE
  right <- which.max(rpart.frame$yval)   ## node with the largest mean CATE
  nodes <- rep(NA, nrow(rpart.frame))
  for (i in 1:length(nodes))
  {
    if (rpart.frame$yval[i]==rpart.frame$yval[right]) nodes[i] <- "gold2"
    else if (rpart.frame$yval[i]==rpart.frame$yval[left]) nodes[i] <- "tomato3"
    else nodes[i] <- "lightblue3"
  }
  return(nodes)
}
## Plot the regression tree
rpart.plot(cate, main="", box.col=plot.cart(cate))
Regression tree fit to posterior point estimates of individual treatment effects: the top number in each box is the average subgroup treatment effect and the lower number shows the percentage of the total sample in that subgroup; the tree flags credits in the first year, gender, and age at entry as important moderators.
The resulting effect moderation tree indicates that course load (credits attempted) in the academic term leading to their probation is a strong moderator. Contextually, this result is plausible, both because course load could relate to latent character attributes that influence a student's responsiveness to sanctions and because it could predict course load in the current term, which would in turn have implications for the GPA (i.e. it is harder to earn a high GPA while taking more credit hours). The tree also suggests that effects differ by campus and by the age and gender of the student. These findings are all prima facie plausible as well.
To gauge how strong these findings are statistically, we can zoom in on individual subgroups and compare the posteriors of their subgroup average treatment effects. This approach is valid because fitting the effect moderation tree to the posterior mean CATEs in no way alters the posterior itself: the effect moderation tree is a posterior summary tool, not an additional inferential procedure, so the posterior is obtained once and can be explored freely with a variety of techniques without compromising its statistical validity. Investigating the most extreme differences is a good place to start, so consider the two groups of students at opposite ends of the treatment effect range discovered by the effect moderation tree (the leftmost and rightmost leaf nodes, which we refer to as Groups A and B below).
Subgroup CATEs are obtained by aggregating CATEs across the observed \(\mathrm{w}_i\) values for the individuals in each group; this can be done draw by draw, yielding a posterior distribution over the subgroup CATE: \[\begin{equation}
\bar{\tau}_A^{(h)} = \frac{1}{n_A} \sum_{i \in A} \tau^{(h)}(0, \mathrm{w}_i),
\end{equation}\] where \(h\) indexes a posterior draw, \(A\) denotes the set of individuals in the group, and \(n_A\) is the number of individuals in group A.
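In code, this is just a row subset followed by column means of the CATE draw matrix. A minimal sketch, where group_A is a hypothetical logical vector (length ntest) flagging the evaluation observations that belong to group A:
## Subgroup CATE posterior: average the CATE draws over the members of the group
## (group_A is a placeholder indicator; cate.kde below shows how the tree's leaf
## memberships can be used to build such indicators)
tauA_draws <- colMeans(pred[group_A, , drop = FALSE])  ## one value per posterior draw
quantile(tauA_draws, c(0.025, 0.5, 0.975))             ## posterior summary of the subgroup CATE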
The code below produces a contour plot of a bivariate kernel density estimate of the joint CATE posterior for subgroups A and B. The contour lines lie almost entirely above the \(45^{\circ}\) line, indicating that the preponderance of posterior probability falls in the region where the treatment effect for Group B is greater than that for Group A; in other words, the difference in subgroup treatment effects flagged by the effect moderation tree persists even after accounting for estimation uncertainty in the underlying CATE function.
## Kernel density estimate of the joint CATE posterior for the lowest- and
## highest-CATE leaf nodes of the effect moderation tree
cate.kde <- function(rpart.obj, pred)
{
  rpart.frame <- rpart.obj$frame
  left <- rpart.obj$where == which.min(rpart.frame$yval)    ## observations in the lowest-CATE leaf (Group A)
  right <- rpart.obj$where == which.max(rpart.frame$yval)   ## observations in the highest-CATE leaf (Group B)
  ## Subgroup CATE posteriors: average the draws over the members of each group
  cate.a <- do.call("cbind", by(pred, left, colMeans))
  cate.b <- do.call("cbind", by(pred, right, colMeans))
  cate.a <- cate.a[,2]
  cate.b <- cate.b[,2]
  ## Bivariate kernel density estimate of the joint posterior
  denshat <- MASS::kde2d(cate.a, cate.b, n=200)
  return(denshat)
}
contour(cate.kde(cate, pred), bty='n', xlab="Group A", ylab="Group B")
abline(a=0, b=1)   ## 45-degree line: equal subgroup effects
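The visual comparison can also be reduced to a single number: the posterior probability that Group B's subgroup effect exceeds Group A's. A short sketch, rebuilding the leaf indicators used inside cate.kde:
## Posterior probability that the Group B subgroup CATE exceeds Group A's
grpA <- cate$where == which.min(cate$frame$yval)  ## members of the lowest-CATE leaf
grpB <- cate$where == which.max(cate$frame$yval)  ## members of the highest-CATE leaf
tauA <- colMeans(pred[grpA, , drop = FALSE])
tauB <- colMeans(pred[grpB, , drop = FALSE])
mean(tauB > tauA)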
Kernel density estimate of the joint CATE posterior for Group A, male students who entered college older than 19 and attempted more than 4.8 credits in the first year (leftmost leaf node, red), and Group B, students who entered college younger than 19 and attempted between 4.3 and 4.8 credits in the first year (rightmost leaf node, gold).
As always, CATEs that vary with observable factors do not necessarily represent a causal moderating relationship. Here, if the treatment effect of academic probation is seen to vary with the number of credits, that does not imply that the association is causal: prescribing students a certain number of credits will not necessarily lead to a more effective probation policy; it may simply be that the type of student who naturally enrolls for fewer credit hours is more likely to be responsive to academic probation. An entirely distinct set of causal assumptions is required to interpret the CATE variations themselves as causal. All the same, uncovering these patterns of treatment effect variability is crucial for suggesting causal mechanisms to be investigated in future studies.