1. Introduction
Consider a directed acyclic graph (DAG), where nodes represent random variables and edges represent a direct causal influence between two variables. We here discuss the problem of quantifying these causal influences. This problem has received considerable attention in a variety of communities; for the sake of exposition, we coarsely categorize methods as either statistical (i.e., those summarized by [
1]) or information theoretic (IT) (i.e., those measured in units of bits or nats) [
2,
3,
4]. When viewed from an applications perspective, these two approaches are quite different. Statistical approaches are common in epidemiology and economics [
5,
6], whereas IT methods appear in the study of complex natural systems, for example climate scientific [
7,
8] or neuroscientific [
9,
10]. The fundamental difference in perspectives that gives rise to this disparity is not well presented in the development of IT methodologies.
To illustrate this difference, consider a simple example with a two-node graph $X \rightarrow Y$, where $X$ represents whether or not an individual has won the lottery and $Y$ represents that individual's average monthly spending (assume for clarity that there are no confounding factors, i.e., observing a lottery winner is equivalent to producing a lottery winner by means of intervention). A statistical measure such as the average causal effect (ACE) [11,12] would seek to answer the question "What is the effect of winning the lottery on spending?" by comparing the average spending of lottery winners ($X = 1$) against the average spending of lottery non-winners ($X = 0$): $\mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)]$. We would of course expect this to be quite large. It is important to note that the ACE is defined irrespective of the marginal distribution of $X$, meaning that the probability with which $x$ occurs has no bearing on the effect of $x$ on $Y$. An IT approach addresses a subtly different question: "What is the effect of the lottery on spending?" In other words, an IT measure considers the effect of the random variable representing whether or not one wins the lottery on spending. Specifically, the effect of $X$ on $Y$ would be given by the mutual information (MI), $I(X;Y)$ (see Section 2, (P2) in [2]). Using a simple IT inequality, we get that the MI is bounded above by the Shannon entropy, $I(X;Y) \leq H(X)$. Given that the odds of winning the lottery are essentially a point mass, which has zero Shannon entropy, we have $I(X;Y) \approx 0$. In words, because so few people win the lottery, an IT measure indicates that the lottery has a negligible effect on spending. In this regard, statistical measures consider the effect of a specific cause, whereas IT measures have historically considered the effect at a systemic level.
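To make the contrast concrete, the following minimal sketch computes both quantities for a stylized version of the lottery example; the win probability and spending figures are hypothetical placeholders, and spending is taken to be a deterministic function of winning purely for simplicity:

```python
import numpy as np

p_win = 1e-7                       # P(X = 1): essentially a point mass at X = 0
spend = {0: 2_000.0, 1: 50_000.0}  # E[Y | do(X = x)] in dollars (hypothetical)

# Statistical view: the average causal effect preserves units (dollars).
ace = spend[1] - spend[0]

# IT view: since Y is deterministic in X here, I(X;Y) = H(X), the binary
# entropy of the win probability, which is nearly zero.
h_x = -(p_win * np.log2(p_win) + (1 - p_win) * np.log2(1 - p_win))

print(f"ACE    = ${ace:,.0f} per month")  # 48,000: a large specific effect
print(f"I(X;Y) = {h_x:.2e} bits")         # ~2.5e-06: a negligible systemic effect
```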
A second difference is that, whereas statistical approaches typically measure causal effects on the value of an outcome, IT approaches measure the causal effect on the distribution of an outcome. Each of these approaches comes with benefits and drawbacks. With statistical approaches, the units are preserved (in the previous example, the units of the ACE are dollars). While IT measures yield the less interpretable unit of bits, they are able to capture more complex causal effects, for instance the effect that a variable has on the variance of another. Acknowledging this difference helps to understand the disparity between the applications of statistical and IT measures. When evaluating the causal link between smoking and cancer, the number of bits of information shared by the smoking and cancer variables may not be as useful as knowing the extent to which quitting smoking decreases the likelihood of cancer. However, when studying the nature of complex natural networks, it may be desirable to use a measure that can capture higher order causal effects.
Of course, not all causal inference approaches fall neatly into this coarse categorization. There has been considerable work in the statistics literature on distributional effects, most commonly using quantile effects [
13,
14]. Quantile effects measure the difference between a particular quantile of two distributions (for example the median), and thus, like the average causal effect, do not capture the effect that an intervention may have on a distribution as a whole. While approaches measuring the $L_1$ distance between distributions (given by the integral/sum of absolute differences) [
15] and the difference in Gini indices of distributions [
16] provide reasonable alternatives to the proposed approach, they offer no insight into the nature of information theoretic measures of causal influence in the broader context of causal inference. It should also be noted that not all statistical measures of causal effect rely on the specification of two values of a cause. Studies using stochastic intervention effects allow for interventions to affect the distribution of a cause [
17].
The problem considered presently is distinct from that addressed by popular time series analyses such as Granger causality [
18], directed information [
19,
20,
21], and transfer entropy [
22]. Rather than evaluating the effects of interventions in a causal model, these methods rely on time-lagged correlations or mutual informations. While scenarios exist where these methods coincide with approaches based on interventions, they are not equivalent in general. In this paper we focus on methods based on interventions and refer the reader to [
23,
24,
25] for further discussion on the relationship between the interventional and non-interventional approaches.
In the present work we seek to endow IT measures with the ability to measure specific causal effects. Furthermore, we show that existing IT measures of causal influences are ill-equipped for distinguishing direct and indirect effects. Following a parallel storyline to that of Pearl [
26], we provide measures of the total, (natural and controlled) direct, and natural indirect effects. We show that these measures do not fundamentally change the underlying IT perspective on causality, but enable obtaining “higher resolution” measures of causal influence. In doing so, we provide increased clarity to the aforementioned differences between IT and statistical causal measures. We showcase how the framework can be used in practical contexts, focusing on the evaluation of the causal effect of the El Niño–Southern Oscillation (ENSO) on land surface temperature anomalies in the North American Pacific Northwest (PNW). Our results confirm the scientific consensus that both ENSO phases affect PNW land surface temperatures asymmetrically. Furthermore, using a conditional version of the proposed measures, we show the presence of a “persistence signal” across two-week average temperature anomalies that is modulated by the El Niño phase. This result both demonstrates the value of the proposed framework and provides direction for future studies focused on climate scientific findings.
The remainder of the paper is structured as follows:
Section 2 introduces notation and provides background on the relevant works on the quantification of causal effects.
Section 3 provides definitions of novel measures of causal influence along with a number of relevant properties and extensions.
Section 4 presents intuitive examples demonstrating the utility of the proposed perspective.
Section 5 presents our case study applying the proposed measures to measure the effect of ENSO on PNW temperature anomalies. Finally,
Section 6 contains concluding remarks.
3. Novel Information Theoretic Causal Measures
The observation that the MI $I(X;Y)$ does not capture how different values of $X$ may contain different amounts of information about $Y$ has been made in a variety of contexts throughout the literature, including experimental design [31,32], neural stimulus response [33,34], information decomposition [35,36], measuring surprise [37], and most recently, distinguishing between information transfer and information copying [38]. Central to each of these works is the development of a notion of MI for a specific value of $X$, i.e., $i(x;Y)$. There is, however, no inherent $i(x;Y)$ implied by the definition of $I(X;Y)$—to see this, we use the notation of [33] and provide two candidate definitions of $i(x;Y)$ based on the two equivalent definitions of $I(X;Y)$:
$$ i_1(x;Y) \triangleq H(Y) - H(Y \mid X = x) \qquad \text{and} \qquad i_2(x;Y) \triangleq D_{KL}\big(P_{Y \mid X=x} \,\big\|\, P_Y\big). $$
It is well understood that, in general, $i_1(x;Y) \neq i_2(x;Y)$. This is clear to see by simply noting that, for any joint distribution $P_{X,Y}$, $i_2(x;Y) \geq 0$ for all $x$, whereas it is possible to have $H(Y \mid X = x) > H(Y)$, i.e., $i_1(x;Y) < 0$. In words, the knowledge of a specific value of $X$ will only provide us with a more accurate distribution of $Y$ ($i_2(x;Y) \geq 0$), though it is possible for this distribution to have a greater entropy than the marginal distribution ($i_1(x;Y) < 0$). We here use $i_2$ as a foundation for establishing value-specific measures of causal influence, and, using the terminology of [38], refer to it as the specific mutual information (SMI). Building upon this language in the present context, we refer to the quantities measured by the proposed methods as specific causal effects. To our knowledge, the use of SMI in the context of quantifying causal influence is novel. As such, we begin with an informal discussion around the use of SMI for the quantification of causal influence in two-node DAGs, followed by a formal definition of various specific causal effects in a mediation model.
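The following sketch contrasts the two candidates on a small joint distribution; the numbers are hypothetical and chosen only so that conditioning on $x = 0$ raises the entropy of $Y$:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def specific_mi(p_xy):
    """For a joint pmf p_xy[x, y], return the arrays (i1, i2) indexed by x."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    i1 = np.array([entropy(p_y) - entropy(row / row.sum()) for row in p_xy])
    i2 = np.array([kl(row / row.sum(), p_y) for row in p_xy])
    return i1, i2, p_x

p_xy = np.array([[0.25, 0.25],    # x = 0: Y | x uniform (high entropy)
                 [0.49, 0.01]])   # x = 1: Y | x nearly deterministic
i1, i2, p_x = specific_mi(p_xy)
print("i1:", i1)                  # i1(0;Y) < 0 is possible
print("i2:", i2)                  # i2(x;Y) >= 0 always
print(p_x @ i1, p_x @ i2)         # both average to the same mutual information
```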
3.1. Specific Mutual Information in Two-Node DAGs
Consider a DAG $X \rightarrow Y$ with joint distribution over nodes $(X, Y)$, and for the sake of exposition, assume there are no confounding variables. In this simple scenario, when considering the effect of $X$ on $Y$, we can freely exchange interventions for observations (assuming we only consider $x$ s.t. $P(x) > 0$), and thus the ACE of $x$ with respect to a baseline $x'$ is given by $\mathbb{E}[Y \mid do(X=x)] - \mathbb{E}[Y \mid do(X=x')]$. Once again, this addresses the question of how much the value of $Y$ is expected to change as a result of switching from $x'$ to $x$. With regard to the CS and IF methods discussed above, both would quantify the effect of $X$ on $Y$ as $I(X;Y)$. Consider the SMI $i_2(x;Y)$ as a measure of the specific causal influence of $x$ upon $Y$ and note the following:
(I) We have the equivalence $I(X;Y) = \mathbb{E}[i_2(X;Y)]$, where the expectation is taken with respect to $X$. As such, we can think of the specific causal effect as a random variable, whose expectation is the mutual information. In doing so, we are able to capture that different values of $X$ may have different magnitudes of causal effect on $Y$, with each of those effects occurring with some probability according to $P_X$. Moreover, this makes clear that the perspective adopted here is consistent with that of other IT measures.
(II) $i_2(x;Y)$ is non-negative for all $x$. Whereas a negative ACE has the clear interpretation of $x$ causing a decrease in the expected value of $Y$, we are measuring influences that $x$ has on the distribution of $Y$. Given that there is no obvious notion of a (potentially negative) difference between distributions, we utilize a definition that results in all causal effects having positive magnitude. This serves as a partial justification for using $i_2$, rather than $i_1$, as a foundation.
(III) The SMI does not depend on the value of $Y$, standing in contrast with the information of a single event $i(x;y) = \log \frac{P(y \mid x)}{P(y)}$ introduced by Fano [39] and its variants, referred to as "local information measures" [40,41]. Interpreting local information measures as measures of causal influence is challenging given that they are negative when $Y$ takes on values that are unexpected given $x$. We adopt the perspective that, while different values of $X$ may have different levels of effect on $Y$, they can only affect the distribution of $Y$, with the specific value $y$ occurring randomly according to an appropriate conditional (or interventional) distribution.
(IV) The SMI does not require specifying a reference value $x'$. Instead, we can view SMI as measuring the causal effect of $x$ as compared with the $X$ that would have occurred naturally. This suggests an intuition for the appearance of IT measures of causal influences in complex natural networks—values of $X$ that are seen as changing the course of nature will be assigned a large causal influence. Given that we can (in this setting) exchange observation for intervention, we can view the SMI as comparing the effect of an intervention $do(X = x)$ with a random (i.e., non-atomic) intervention $do(X = X')$ with $X' \sim P_X$ (see [12,42] for discussions on random interventions).
(V) The SMI addresses a very clear causal question: "How much different would we expect the distribution of Y to be if, instead of forcing X to take the value x, we let X take on a value naturally?" Stated more compactly: "How much would we expect performing the intervention $do(X=x)$ to change the course of nature for Y?"
(VI) We can interpret the SMI as comparing a ground truth distribution of $Y$ conditioned on $x$ ($P_{Y \mid x}$) with a counterfactual distribution wherein nature was allowed to run its course ($P_Y$). This works well with the interpretation of the KL-divergence as a measure of excess bits resulting from encoding $Y$ using the distribution that is not the true distribution from which $Y$ is sampled. The use of the KL-divergence is further justified in this context by the fact that the logarithmic loss is unique in its ability to capture the benefit of conditioning on $X$ in the prediction of $Y$ [43].
(VII) Finally, we note that $i_2(x;Y) = 0$ if and only if $P(y \mid x) = P(y)$ for all $y$ for which $P(y) > 0$. By contrast, it is possible to have $i_1(x;Y) = 0$ and $P_{Y \mid x} \neq P_Y$. The following example illustrates why this is undesirable:
Example 1. Consider a two-node DAG $X \rightarrow Y$ with $X \sim \mathrm{Bern}(1/2)$, $Y \mid X=0 \sim \mathrm{Unif}\{1, 2, 3, 4\}$, and $Y \mid X=1 = 5$ with probability one. It is clear that the distribution of Y is highly dependent upon the value of X. Next note that $H(Y) = 2$ bits, where $P_Y = (1/8, 1/8, 1/8, 1/8, 1/2)$. Thus, $i_1(0;Y) = H(Y) - H(Y \mid X=0) = 0$ and $i_1(1;Y) = 2$. On the other hand, we have $i_2(0;Y) = i_2(1;Y) = 1$ bits. This exemplifies how simply measuring differences in entropy is insufficient for capturing causal influences.
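The entropy and divergence values claimed in the example are easy to verify numerically; a minimal check:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# X ~ Bern(1/2); Y | X=0 ~ Unif{1,...,4}; Y | X=1 = 5 with probability one.
p_y_given_x = {0: np.array([0.25, 0.25, 0.25, 0.25, 0.0]),
               1: np.array([0.0, 0.0, 0.0, 0.0, 1.0])}
p_y = 0.5 * p_y_given_x[0] + 0.5 * p_y_given_x[1]

print(entropy(p_y))             # H(Y) = 2.0 bits
print(entropy(p_y_given_x[0]))  # H(Y | X=0) = 2.0 bits, so i1(0;Y) = 0
print(kl(p_y_given_x[0], p_y))  # i2(0;Y) = 1.0 bit
print(kl(p_y_given_x[1], p_y))  # i2(1;Y) = 1.0 bit
```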
3.2. Specific Causal Effects in the Mediation Model
Following the process of [
26], we here formalize a series of definitions of total/direct/indirect causal influences from an information theoretic perspective. When leaving the comfort of the unconfounded two-node DAG, it is necessary to incorporate the interventions in the definition of the causal measures:
Definition 1. The specific total effect of x on Y is defined as:
$$ \mathrm{STE}(x \rightarrow Y) \triangleq D_{KL}\big(P_{Y \mid do(X=x)} \,\big\|\, P_Y\big). $$
With the exception of the interventional notation, the STE is equivalent to the SMI. Note that for a DAG given by $X \rightarrow Y$, we will have $\mathrm{STE}(x \rightarrow Y) = i_2(x;Y)$ but $\mathrm{STE}(y \rightarrow X) = 0$, where $\mathrm{STE}(y \rightarrow X)$ represents the specific total effect of y on X. Thus, the STE answers the question posed above in point (V): "How much would we expect performing the intervention $do(X=x)$ to change the course of nature for Y?"
Next we define the specific controlled direct effect (SCDE) of x on Y. Given that computing the controlled direct effect must be done by means of intervention on Z, we define the SCDE with respect to a specific value z, as it is unclear what distribution over Z should be used if the definition were to take an expectation over all possible values of z (see Theorem 2).
Definition 2. The specific controlled direct effect of x on Y with mediator z is defined as:
$$ \mathrm{SCDE}(x \rightarrow Y \mid z) \triangleq D_{KL}\big(P_{Y \mid do(X=x), do(Z=z)} \,\big\|\, P_{Y \mid do(Z=z)}\big). $$
The SCDE measures how much we would expect performing the intervention $do(X=x)$ to alter the course of nature for Y given that Z is held fixed at z.
Next, the specific natural direct effect measures the direct effect of x on Y that occurs naturally when the mediator is not controlled:
Definition 3. The specific natural direct effect of x on Y is defined as:
$$ \mathrm{SNDE}(x \rightarrow Y) \triangleq D_{KL}\Bigg(\sum_z P(z \mid do(x))\, P_{Y \mid do(x), do(z)} \,\Bigg\|\, \sum_z P(z \mid do(x)) \sum_{x'} P(x')\, P_{Y \mid do(x'), do(z)}\Bigg). $$
It is helpful to dissect the two distributions of Y considered by the SNDE. Expanding the first argument as $\sum_z P(z \mid do(x)) P_{Y \mid do(x), do(z)}$, both distributions are given by a weighted combination of the distribution of Y conditioned upon different values of Z. In both cases, these values of Z are weighted by the probability with which they would occur under the intervention $do(X=x)$. For the intervened values of X used to evaluate the probability of Y, however, the first distribution uses the "ground truth" value x, whereas the second uses the "naturally occurring" $x'$, weighted according to $P_X$. We can interpret the SNDE as a measure of how much we expect performing the intervention $do(X=x)$ to directly alter the course of nature for Y. Using the same logic, we can define a specific natural indirect effect:
Definition 4. The specific natural indirect effect of x on Y is defined as:
$$ \mathrm{SNIE}(x \rightarrow Y) \triangleq D_{KL}\Bigg(\sum_z P(z \mid do(x))\, P_{Y \mid do(x), do(z)} \,\Bigg\|\, \sum_{x'} P(x') \sum_z P(z \mid do(x'))\, P_{Y \mid do(x), do(z)}\Bigg). $$
Conducting a similar dissection, we see that the roles of $x$ and $x'$ are swapped from the SNDE—the "ground truth" $x$ is used to evaluate the probability of Y, while the naturally occurring $x'$ is used to weight different values $z$. As such, the only difference between the first and second arguments of the SNIE is how the value of the mediating Z is determined, resulting in a measurement of the indirect effect of x on Y. We can interpret the SNIE as a measure of how much we expect performing the intervention $do(X=x)$ to indirectly alter the course of nature for Y.
Unfortunately, the proposed definitions of SNDE and SNIE yield no obvious inequalities with respect to the STE (for example, $\mathrm{STE}(x \rightarrow Y) \leq \mathrm{SNDE}(x \rightarrow Y) + \mathrm{SNIE}(x \rightarrow Y)$ need not hold in general). While this is initially unintuitive, it can be justified by the decision to have all causal influences be assigned a non-negative magnitude. As such, we would expect that contradictory indirect and direct effects could individually have a large magnitude while still resulting in a total effect of zero.
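To make the definitions concrete, the following sketch evaluates the STE, SNDE, and SNIE on a small discrete mediation model in which the interventional distributions are specified directly (no confounding); all distributions are hypothetical stand-ins, not taken from the paper:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p_x = np.array([0.7, 0.3])                 # P(x)
p_z = np.array([[0.9, 0.1],                # P(z | do(x=0))
                [0.2, 0.8]])               # P(z | do(x=1))
p_y = np.array([[[0.8, 0.2], [0.5, 0.5]],  # P(y | do(x=0), do(z))
                [[0.4, 0.6], [0.1, 0.9]]]) # P(y | do(x=1), do(z))

def ste(x):
    # P(y | do(x)) versus the natural P(y).
    p_do = p_z[x] @ p_y[x]
    p_nat = sum(p_x[xp] * (p_z[xp] @ p_y[xp]) for xp in range(2))
    return kl(p_do, p_nat)

def snde(x):
    # Same mediator weights P(z | do(x)); ground-truth x versus natural x'.
    first = p_z[x] @ p_y[x]
    second = p_z[x] @ sum(p_x[xp] * p_y[xp] for xp in range(2))
    return kl(first, second)

def snie(x):
    # Same ground-truth x in P(y | do(x), do(z)); mediator weights from natural x'.
    first = p_z[x] @ p_y[x]
    second = sum(p_x[xp] * (p_z[xp] @ p_y[x]) for xp in range(2))
    return kl(first, second)

for x in (0, 1):
    print(f"x={x}: STE={ste(x):.4f}  SNDE={snde(x):.4f}  SNIE={snie(x):.4f}")
```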
3.3. Equivalence Relations
We now analyze the relationship between the proposed specific measures and IF/CS.
Theorem 1. The expected STE is equivalent to the information flow, i.e., $\mathbb{E}_X[\mathrm{STE}(X \rightarrow Y)] = I(X \rightarrow Y)$, where the expectation is taken with respect to the marginal distribution over X.
A proof is provided in
Appendix B.1. The above theorem shows that the expected STE recovers the standard (unconditional) IF from
X to
Y. Notably, the expected STE is not equivalent to the CS associated with any subset of the arrows in the graph. Next, we show that both IF and CS provide a notion of expected SCDE:
Theorem 2. The conditional IF is given by the expected value of the SCDE taken with respect to the marginal distributions of X and Z:
$$ I(X \rightarrow Y \mid do(Z)) = \sum_{x, z} P(x) P(z)\, \mathrm{SCDE}(x \rightarrow Y \mid z). $$
Furthermore, if the DAG consists of only X, Y, and Z (i.e., $U = \emptyset$), then the CS of $X \rightarrow Y$ is given by the expected value of the SCDE taken with respect to the joint distribution of X and Z:
$$ C_{X \rightarrow Y} = \sum_{x, z} P(x, z)\, \mathrm{SCDE}(x \rightarrow Y \mid z). $$
A proof is provided in
Appendix B.2. This theorem clarifies the point made earlier with regard to the value of a measure of natural direct effect. In particular, when taking an average with respect to possible control values for the mediator
Z, it is not clear what distribution over
Z should be used.
3.4. Conditional Specific Influences
Even though the above causal measures are defined for specific values of X, they provide a notion of average causal influence in that they are implicitly averaging over all possible covariates U. Given that different values of u may significantly affect the nature of the relationship between x and Y, we define conditional versions of the above definitions for a specific value $u_o$. We here consider the general case where only a subset of the covariates $U_o \subseteq U$ are observed:
Definition 5. The conditional STE of x on Y given $u_o$ is defined as:
$$ \mathrm{STE}(x \rightarrow Y \mid u_o) \triangleq D_{KL}\big(P_{Y \mid do(X=x), u_o} \,\big\|\, P_{Y \mid u_o}\big). $$
For the special case where we can observe all relevant covariates, i.e., $U_o = U$, the conditional STE can be simplified as:
$$ \mathrm{STE}(x \rightarrow Y \mid u) = D_{KL}\Big(P_{Y \mid do(X=x), u} \,\Big\|\, \sum_{x'} P(x' \mid u)\, P_{Y \mid do(X=x'), u}\Big). $$
This definition violates the locality postulate
(P2) of Janzing et al. [2] in that the causal effect of $x$ on $Y$ may be dependent upon how $X$ is affected by its own parents. Allowing this is, however, consistent with the perspective that IT measures quantify the deviance from the course of nature in that the value $u$ dictates the current natural state. Nevertheless, the terms $P(x' \mid u)$ and $P(x' \mid u_o)$ can be replaced with the marginal $P(x')$ if one wishes to remain faithful to the locality postulate (though not explored presently, this would provide us with a notion of specific causal strength). The conditional versions of SCDE, SNDE, and SNIE follow very similar logic to that of the STE, and are defined in
Appendix C.
3.5. Identifiability
When $U$ is partially observable or unobservable, the nature of the dependence relationships between $X$, $Z$, and $U$ will dictate the ability to estimate the proposed causal measures from observational data—more specifically, the ability to determine the interventional distributions given only estimated conditional distributions. This is crucially important given that performing interventions in many complex natural systems is infeasible. The following theorem uses the d-separation criterion [44,45] to identify when the conditional specific measures can be estimated in the partially observable setting where only $U_o \subseteq U$ can be observed:
Theorem 3. Consider a dataset containing observations of X, Y, Z, and partially observable covariates $U_o \subseteq U$. Then, the conditional STE, SNDE, and SNIE are non-experimentally identifiable if there exist $W \subseteq U_o$ such that the following two conditions hold: (1) $(Y \perp X \mid Z, W)_{\mathcal{G}_{\underline{X}}}$ and (2) $(Y \perp Z \mid X, W)_{\mathcal{G}_{\underline{Z}}}$, where $\mathcal{G}_{\underline{X}}$ represents the DAG with all outgoing arrows from X removed, and $(A \perp B \mid C)_{\mathcal{G}}$ represents the d-separation of A and B by C in DAG $\mathcal{G}$.
The proof uses a direct application of Pearl's do-calculus (theorem 3.4.1 in [12]), and is provided in Appendix B.3. By letting $U_o = \emptyset$, identifiability conditions for the specific unconditional causal effects are obtained. Similarly, the theorem provides the corollary that the conditional specific causal effects may be estimated from observational data when $U$ is fully observable. It is important to note that the above theorem assumes that each conditional distribution can be sufficiently well estimated. Indeed, the "increased resolution" of the proposed measures comes at a cost in that reliable estimation of the proposed measures poses challenges for values of $X$ that occur infrequently. Consider, for example, estimating the second argument of the KL-divergence defining the SNDE in (8), namely $\sum_z P(z \mid do(x)) \sum_{x'} P(x')\, P_{Y \mid do(x'), do(z)}$. Given that there is a sum over $z$ and $x'$, it is necessary to know this distribution for every pair $(x', z)$. Thus, when $P(x')$ is very small, a significant amount of data will be required to estimate $P_{Y \mid do(x'), do(z)}$ (and therefore the SNDE) reliably.
3.6. Normalized Specific Effects
The opacity of measuring causal influences in bits can be addressed by identifying a normalization procedure.
Definition 6. The normalized conditional STE of x on Y conditioned on $u_o$ is defined as:
$$ \overline{\mathrm{STE}}(x \rightarrow Y \mid u_o) \triangleq \frac{\mathrm{STE}(x \rightarrow Y \mid u_o)}{\mathrm{STE}(x \rightarrow Y \mid u_o) + H\big(P_{Y \mid do(X=x), u_o}\big)}. $$
The normalized versions of the other specific causal measures are provided in Appendix D. For the sake of exposition, suppose $U = \emptyset$ and recall the data compression interpretation of $\mathrm{STE}(x \rightarrow Y) = D_{KL}(P_{Y \mid do(x)} \,\|\, P_Y)$ as the excess number of bits used to encode $Y$ under the assumption $X$ occurs naturally when we have in fact forced $X = x$ by means of an intervention. Noting that $H(P_{Y \mid do(x)})$ represents the number of bits required to encode $Y$ when we have (knowingly) forced $X = x$, the denominator of the normalized STE gives the total number of bits used to encode $Y$ under the incorrect assumption of a naturally occurring $X$. As such, the normalized STE represents the fraction of bits used to encode $Y$ under the assumption that $X$ occurred naturally that are unnecessary when performing the intervention $do(X=x)$.
As a result of the non-negativity of entropy and the KL-divergence, the normalized STE is bounded between zero and one. Interpreting the normalized STE is facilitated by considering the scenarios that yield the extremal values. First, the normalized STE is zero if and only if the STE is zero, which is to say that $P(y \mid do(x)) = P(y)$ for all $y$ for which $P(y) > 0$. More interestingly, the normalized STE is one if and only if the STE is greater than zero and $H(P_{Y \mid do(x)}) = 0$. As such, the normalized STE being equal to one represents $x$ having a maximal causal effect on $Y$ in the sense that performing the intervention $do(X=x)$ determines the value of $Y$ with 100 percent certainty. It should be emphasized that, like the unnormalized measures, this notion of maximal causal effect applies strictly in a distributional sense and says nothing of the direction or magnitude of the causal effect with respect to the units of $Y$. For example, if performing $do(X=x)$ results in $Y = \mathbb{E}[Y]$ with probability one, then the normalized STE is one and we would conclude that $x$ has a maximal effect on $Y$ even though $x$ causes $Y$ to take the value it is expected to take absent an intervention.
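Given two distributions over $Y$, the normalization is a one-line computation; a minimal sketch (the helper names are ours):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def normalized_ste(p_y_do_x, p_y_natural):
    """STE / (STE + H(P_{Y|do(x)})): the fraction of the bits spent encoding Y
    under the incorrect "natural X" assumption that the intervention makes
    unnecessary."""
    d = kl(p_y_do_x, p_y_natural)
    return d / (d + entropy(p_y_do_x))

# Maximal effect: do(x) makes Y deterministic, so the denominator is the STE
# itself and the normalized measure is one.
print(normalized_ste(np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # -> 1.0
```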
4. Examples
We now present three examples of notions of causal influence that are uniquely identified by the specific causal measures.
4.1. Chain Reaction
For the first example consider a simple chain $X \rightarrow Z \rightarrow Y$. This can be thought of as a simplified version of the example proposed by Ay and Polani [3] and modified to include noise by Janzing et al. (example 7 in [2]). We will consider the simplest case of this example where a binary message is being passed from $X$ to $Z$ to $Y$, with the message being flipped by $Z$ and $Y$ with probability $\epsilon$. We will interpret each variable as representing the message it passes on, i.e., $X = 1$ means "$X$ passes the message 1 to $Z$." Formally, let $X, Z, Y \in \{0, 1\}$ with $X \sim \mathrm{Bern}(1/2)$:
$$ Z = X \oplus W_Z, \qquad Y = Z \oplus W_Y, \qquad W_Z, W_Y \overset{iid}{\sim} \mathrm{Bern}(\epsilon), $$
where ⊕ is the XOR operation.
Focusing first on the effect of $x$ on $Y$, we note that because the only path from $X$ to $Y$ is the one through $Z$, the direct effect is zero and the total and indirect effects are equal. Noting that $P_Y = \mathrm{Bern}(1/2)$, $P(Y \neq x \mid do(X=x)) = 2\epsilon(1-\epsilon)$, and $D_{KL}(\mathrm{Bern}(p) \,\|\, \mathrm{Bern}(1/2)) = 1 - h_b(p)$ with $h_b(\cdot)$ the binary entropy function, the total effect is the same for both $x \in \{0, 1\}$ and is given by:
$$ \mathrm{STE}(x \rightarrow Y) = 1 - h_b\big(2\epsilon(1-\epsilon)\big). $$
Thus, as the probability of flipping the message approaches zero, $Y$ will be deterministically linked to $X$, and $X$ resolves the entire one bit of uncertainty associated with $Y$. Now consider the conditional STE of $z$ on $Y$ for a particular $x$. We can compute this by comparing the distributions $P_{Y \mid do(Z=z)}$ and $P_{Y \mid x}$. Given the symmetry of the problem, this will take one of two values depending on whether or not $x$ and $z$ are equal:
$$ \mathrm{STE}(z \rightarrow Y \mid x) = \begin{cases} (1-\epsilon) \log \frac{1-\epsilon}{1-q} + \epsilon \log \frac{\epsilon}{q}, & z = x \\ (1-\epsilon) \log \frac{1-\epsilon}{q} + \epsilon \log \frac{\epsilon}{1-q}, & z \neq x \end{cases} \qquad \text{with } q = 2\epsilon(1-\epsilon). $$
As $\epsilon$ approaches zero, the STE approaches zero when $z = x$ and infinity when $z \neq x$. To understand this result, fix $\epsilon$ to be an arbitrarily small number such that Z will pass on its received message with high probability. Thus, when $z = x$, it is, in a sense, unreasonable to endow Z with responsibility for causing the value taken by Y when it is propagating the message in a nearly deterministic manner. In such a case, it is not so much Z that is causing Y, but rather X that initiated a chain reaction. On the other hand, in the unlikely occurrence that $z \neq x$, we have that Z does have a causal effect on Y. This scenario can be thought of as Z acting of its own volition in selecting a message to pass to Y.
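The limiting behavior is easy to see numerically; a short sketch sweeping $\epsilon$ with the case expressions above:

```python
import numpy as np

def kl_bern(p, q):
    """D(Bern(p) || Bern(q)) in bits, with 0 log 0 = 0."""
    total = 0.0
    if p > 0:
        total += p * np.log2(p / q)
    if p < 1:
        total += (1 - p) * np.log2((1 - p) / (1 - q))
    return total

for eps in (0.1, 0.01, 0.001):
    q = 2 * eps * (1 - eps)     # P(Y != x | x): two independent chances to flip
    same = kl_bern(eps, q)      # STE(z -> Y | x) when z = x
    diff = kl_bern(1 - eps, q)  # STE(z -> Y | x) when z != x
    print(f"eps={eps}: z=x -> {same:.4f} bits,  z!=x -> {diff:.4f} bits")
```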
We acknowledge that the notion of an unbounded causal influence is initially unsettling. When looking closer, however, this property is intuitive. First, we note that for any fixed $\epsilon > 0$, the STE will be finite. It is only for $\epsilon = 0$ that the STE could be infinite, but in that case, the setting that results in infinite influence happens with probability zero. Thus, in general, an infinite influence could only be achieved through intervention. Furthermore, such an intervention would have to assign a value to a cause that occurs with probability zero, and that cause would in turn have to enable an otherwise impossible effect to have non-zero probability.
This conditional formulation violates the locality postulate
(P2) of Janzing et al. [
2] in that the effect of
z depends on the value of its own parent,
x. We do not claim that the perspective taken here is "correct," but merely point out that there exist justifications for considering the value of a cause’s parent in evaluating the causal effect.
4.2. Caused Uncertainty
Consider a 3-node DAG characterized by the connections $X \rightarrow Y \leftarrow Z$ with $X \sim \mathrm{Bern}(1/2)$, $Z \sim \mathrm{Bern}(\alpha)$ for a small $\alpha$, and:
$$ Y = \begin{cases} X, & Z = 0 \\ W, & Z = 1 \end{cases} \qquad W \sim \mathrm{Bern}(1/2). $$
Given that X and Z are both parentless, we can treat interventions on X and Z as observations, and the CS, conditional IF, and conditional mutual information (CMI) are equivalent. In particular, we have that $C_{X \rightarrow Y} = I(X; Y \mid Z)$ and $C_{Z \rightarrow Y} = I(Z; Y \mid X)$. Writing CMI as a difference of conditional entropies provides us with the interpretation of CMI as the reduction in uncertainty of Y resulting from the added conditioning of Z, which will always be non-negative.
Next we consider $\mathrm{STE}(x \rightarrow Y \mid z)$ and $\mathrm{STE}(z \rightarrow Y \mid x)$ for $x, z \in \{0, 1\}$. Given the symmetry of the problem with respect to $X$, we only need to consider two of the four possible values of $(x, z)$, namely $(0, 0)$ and $(0, 1)$. In order to compute the STE for each of $X$ and $Z$ to $Y$ in either case, we need the following distributions:
$$ P_{Y \mid x, z=0} = \delta_x, \qquad P_{Y \mid x, z=1} = \mathrm{Bern}(1/2), \qquad P_{Y \mid z} = \mathrm{Bern}(1/2), \qquad P(Y \neq x \mid x) = \alpha/2. $$
For a given $(x, z)$, the STE is given by $\mathrm{STE}(x \rightarrow Y \mid z) = D_{KL}(P_{Y \mid x, z} \,\|\, P_{Y \mid z})$ and $\mathrm{STE}(z \rightarrow Y \mid x) = D_{KL}(P_{Y \mid x, z} \,\|\, P_{Y \mid x})$:
$$ \mathrm{STE}(x \rightarrow Y \mid z{=}0) = 1, \quad \mathrm{STE}(x \rightarrow Y \mid z{=}1) = 0, \quad \mathrm{STE}(z{=}0 \rightarrow Y \mid x) = \log \tfrac{1}{1 - \alpha/2}, \quad \mathrm{STE}(z{=}1 \rightarrow Y \mid x) = \tfrac{1}{2} \log \tfrac{1/2}{\alpha/2} + \tfrac{1}{2} \log \tfrac{1/2}{1 - \alpha/2}. $$
The results presented above are intuitive: when $z = 0$, then the value taken by Y is largely determined by X, and the knowledge that $z = 0$ tells us very little about the distribution of Y. On the other hand, when $z = 1$, X has no bearing on the value taken by Y. Thus, in this scenario, it is the value taken by Z that has caused the shift in the distribution of Y, even though Z provides no information with regard to the particular value taken by Y. In this sense, we can think of Z as causing uncertainty in Y. This scenario makes particularly clear why it makes sense to condition on the cause but take an expectation with respect to the effect—no outcome $y$ could be attributed to being a result of $z = 1$, despite the clear influence that such an event has on the distribution of Y.
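The following sketch evaluates these quantities for the model above; the value $\alpha = 0.05$ is a hypothetical choice:

```python
import numpy as np

def kl_bern(p, q):
    total = 0.0
    if p > 0:
        total += p * np.log2(p / q)
    if p < 1:
        total += (1 - p) * np.log2((1 - p) / (1 - q))
    return total

alpha = 0.05                 # P(Z = 1): the "noise" state is rare
flip_z0 = 0.0                # P(Y != x | x, z=0): Y copies X exactly
flip_z1 = 0.5                # P(Y != x | x, z=1): Y is uniform
flip_x = (1 - alpha) * flip_z0 + alpha * flip_z1  # P(Y != x | x)

print(kl_bern(flip_z0, 0.5))     # STE(x -> Y | z=0) = 1 bit
print(kl_bern(flip_z1, 0.5))     # STE(x -> Y | z=1) = 0 bits
print(kl_bern(flip_z0, flip_x))  # STE(z=0 -> Y | x): small
print(kl_bern(flip_z1, flip_x))  # STE(z=1 -> Y | x): large
```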
4.3. Shared Responsibility
Consider a scenario where a collection of $n$ iid variables $X_1, \ldots, X_n$ collectively influence a single outcome $Y$, i.e., $X_i \sim \mathrm{Bern}(q)$ for $i \in \{1, \ldots, n\}$. For a given context $x = (x_1, \ldots, x_n)$, let $k$ be the number of $x_i$ that are one, i.e., $k = \sum_{i=1}^n x_i$. Then let $Y$ be distributed as:
$$ P(Y = 1 \mid K = k) = 2^{-k}, $$
where $K = \sum_{i=1}^n X_i$ is a random variable. One interpretation of this example is that each $X_i$ is a potential inhibitor of $Y$. As more inhibitors become activated (i.e., as $k$ grows), the effect of adding another inhibitor diminishes. Since the value taken by $K$ depends on the values taken by each $X_i$, a measure that averages with respect to $P_X$ will not capture this change in causal effect that results for different values of $k$.
As with the previous example, the CS, conditional IF, and CMI are equivalent for this problem setting. While there is no simple computation for these measures as a function of $q$ and $n$, there are a couple of key points. First, the influence of each of the variables on $Y$ is the same, i.e., $C_{X_i \rightarrow Y} = C_{X_j \rightarrow Y}$ for all $i, j$. Second, as $q \rightarrow 0$, the probability of $Y = 1$ goes to one, and as $q \rightarrow 1$, the probability of $Y = 1$ goes to (essentially) zero. In either of the limits, the entropy of $Y$ goes to zero and thus so does the causal influence of each $X_i$ as measured by either CMI, conditional IF, or CS.
Now consider a realization $x = (x_1, \ldots, x_n)$ and the corresponding $k$. While the influence of each $x_i$ on $Y$ will not be the same for a given realization, the symmetry of the problem is such that the computation will be performed in the same manner for each $i$. Letting $\tilde{k}_i = \sum_{j \neq i} x_j$ be the number of ones excluding $x_i$, define the following distributions:
$$ P(Y = 1 \mid x_i, \tilde{k}_i) = 2^{-(\tilde{k}_i + x_i)}, \qquad P(Y = 1 \mid \tilde{k}_i) = q\, 2^{-(\tilde{k}_i + 1)} + (1 - q)\, 2^{-\tilde{k}_i} = 2^{-\tilde{k}_i}\Big(1 - \frac{q}{2}\Big). $$
Then, for a given realization, the STE is a function of $x_i$ and $\tilde{k}_i$:
$$ \mathrm{STE}(x_i \rightarrow Y \mid \tilde{k}_i) = D_{KL}\Big(\mathrm{Bern}\big(2^{-(\tilde{k}_i + x_i)}\big) \,\Big\|\, \mathrm{Bern}\big(2^{-\tilde{k}_i}(1 - q/2)\big)\Big). $$
In interpreting these results, first assume that $q$ is small, meaning that for each of the inhibitors, it is unlikely that it will be activated. As a result of this assumption, we have $\mathrm{STE}(1 \rightarrow Y \mid \tilde{k}_i) > \mathrm{STE}(0 \rightarrow Y \mid \tilde{k}_i)$, i.e., an inhibitor has a greater influence when it is activated. More interestingly, note that $\mathrm{STE}(1 \rightarrow Y \mid \tilde{k}_i)$ is strictly decreasing in $\tilde{k}_i$. This is consistent with the intuition provided above, namely that if a large number of inhibitors are active, then they share responsibility and the influence of any single one is negligible. On the other hand, if only one is activated (i.e., $x_i = 1$ and $\tilde{k}_i = 0$), then in the limit of $q \rightarrow 0$, its influence will approach infinity (and its normalized influence will approach one).
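A numerical sketch of this behavior under the inhibitor model above; the activation probability $q = 0.05$ is a hypothetical choice:

```python
import numpy as np

def kl_bern(p, q):
    total = 0.0
    if p > 0:
        total += p * np.log2(p / q)
    if p < 1:
        total += (1 - p) * np.log2((1 - p) / (1 - q))
    return total

q = 0.05                           # P(X_i = 1): inhibitors are rarely active
for k_tilde in range(5):           # active inhibitors besides X_i
    a = 2.0 ** (-k_tilde)          # P(Y = 1 | x_i = 0, k_tilde)
    marg = (1 - q / 2) * a         # P(Y = 1 | k_tilde), averaging over X_i
    ste_on = kl_bern(a / 2, marg)  # STE for x_i = 1
    ste_off = kl_bern(a, marg)     # STE for x_i = 0
    print(f"k~={k_tilde}: STE(1)={ste_on:.4f}  STE(0)={ste_off:.5f}")
```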
5. Case Study: Effect of El Niño–Southern Oscillation on Pacific Northwest Temperature Anomalies
We now present an application of the proposed framework to measuring the specific causal influences of the El Niño–Southern Oscillation (ENSO) on the temperature anomaly signal in the North American Pacific Northwest (PNW, latitude: 47°N, longitude: 240°E). The dataset we use is publicly available at the National Center for Atmospheric Research website and all code is published on the Code Ocean platform (https://doi.org/10.24433/CO.5484914.v1). For our purposes, ENSO is characterized by the sea surface temperature in the Niño 3.4 region located in the equatorial Pacific (latitude: 5°S–5°N, longitude: 120°W–170°W). The ENSO signal is typically understood as being in one of three phases (or states)—a neutral phase (we will refer to this as $E = 0$) gives rise to a precipitation region centered near longitude 160°E (Figure 2B), the El Niño phase ($E = 1$) gives rise to an eastward shifted precipitation region (∼170°W, Figure 2C), and the La Niña phase ($E = -1$) gives rise to a westward shifted precipitation region (∼150°E, Figure 2A) [46,47]. Niño and Niña phases can occur with varying intensities during the winter months with a typical return period of two to seven years [48]. When a Niño or Niña phase occurs, the shifted precipitation signal produces large scale atmospheric Rossby waves (waves in the upper level atmospheric pressure field) that influence North American land temperatures, predominantly through the well studied Pacific North American teleconnection pattern (PNA) [49,50]. The PNA affects North American land temperatures through the advection of warm marine air during a Niño phase and cool polar air during a Niña phase [51,52]. We here use the proposed framework to quantify the causal effect of this teleconnection, focusing specifically on the temperature in the PNW.
This application is a particularly good fit for the proposed analysis for a number of reasons. First, by utilizing a collection of simulation model runs, an immense amount of data can be obtained. Second, domain expertise can be leveraged to construct causal DAGs prior to performing analysis. For example, it is well known that the ENSO signal influences temperature as opposed to the temperature influencing ENSO. Third, there are well-accepted methods for detrending signals, and these methods can be used to control for possible confounding effects. Fourth, it is to be expected that certain phases of the ENSO signal will, in some sense, give rise to larger causal effects than other phases [
54]. The proposed framework can be used to quantify these differences in a formal sense.
The analyzed dataset is composed of nine simulated model runs from the National Center for Atmospheric Research’s (NCAR) Community Earth System Model, version 2 (CESM2) [
55] scientifically validated historical CMIP6 runs [
56]. Full model details are provided in
Appendix F. Each model run provides an array of daily temperature values spanning the years 1850 to 2015 from which we can compute the Niño 3.4 index (as in [
57]) and directly obtain the PNW two-meter temperature. The Niño 3.4 index is a measure of anomalous equatorial sea surface temperatures in the Niño 3.4 region described above. Each of the model runs provides an independent realization of possible evolutions of temperatures that obey the underlying dynamic and thermodynamic equations as encoded by the model. It is important to clarify that the model is not intended for prediction, but rather gives possible atmospheric states for a given set of initial conditions and constraints determined by the selected time period (i.e., CO₂ forcing, solar/lunar cycles, etc.). Both the ENSO index and PNW two-meter temperature signals have the mean and the leading six harmonics of the annual cycle removed, leaving only the anomalous components of the signal. As this is standard practice in the analysis of climate data (e.g., [
58]), we henceforth strictly consider anomaly signals.
A 20-year model run of the Niño 3.4 index is shown in Figure 3. It is clear that the ENSO signal does not reliably alternate between the El Niño ($E = 1$) and La Niña ($E = -1$) phases with a constant period. As a result of ENSO cold-season phase locking [59], the ENSO signal is strongest in or near to January (marked by vertical grid lines). As such, we limit our focus to the months of January, February, and March, as it is not interesting to measure the effect of the ENSO signal in the months where it is not present. We further simplify the problem by quantizing the ENSO index on an annual timescale, i.e., we assign a single value to $E$ for January-March of a given year based on the ENSO index value on January 1st of that year.
Given that we are estimating the effect of ENSO on temperature, we similarly consider the temperature signal only during the months of January, February, and March. Rather than attempting to assess the effect of ENSO on daily temperature anomalies, we choose to focus on two-week averages, corresponding to the limit of predictability in numerical weather forecasting [60]. As we will discuss in the next section, this choice also facilitates the causal modeling. As a final processing step, we quantize the temperature anomaly averages to $\{-1, 0, +1\}$. While this quantization does come with an inevitable loss of resolution, it yields the easily understood interpretation of the temperature signal as representing either a cold anomaly ($-1$), a neutral state ($0$), or a warm anomaly ($+1$). We compute the quantization thresholds on the entire dataset (i.e., before averaging and before selecting for months) such that one third of days are in each category. The averages are then compared to these thresholds, given by −1.3 and +1.94 degrees Kelvin. The resultant dataset after selecting for the winter months and taking two-week averages consists of 9840 samples.
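For concreteness, a sketch of the averaging and quantization steps on synthetic stand-in data (the real CESM2-derived dataset and processing code are on Code Ocean; the array shapes and random placeholder values here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for detrended daily PNW temperature anomalies: 9 runs x ~164 years.
anoms = rng.normal(0.0, 3.0, size=(9, 60_000))

# Tercile thresholds computed on the full daily record (before averaging).
lo, hi = np.quantile(anoms, [1 / 3, 2 / 3])

# Non-overlapping 14-day averages, then 3-way quantization.
n_win = anoms.shape[1] // 14
two_week = anoms[:, : n_win * 14].reshape(9, n_win, 14).mean(axis=2)
t_quantized = np.digitize(two_week, [lo, hi]) - 1  # -1 cold, 0 neutral, +1 warm
```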
5.1. Causal Model
In order to implement the proposed framework, we first need to formulate a causal DAG representation of the dataset discussed above. As a starting point, consider the DAG on the left side of Figure 4, where we let $E$ represent an annual ENSO phase, $T_1, T_2, \ldots$ represent the quantized two-week temperature anomaly averages for January through March (i.e., $T_1$ averages January 1st through 14th, $T_2$ averages January 15th through 28th, etc.), and $U$ represent the other factors, such as seasonality and CO₂ forcing. This DAG encodes a number of assumptions. First, it encodes the intuition that seasonality may affect ENSO and the temperature, but not the other way around. Similarly, ENSO will affect the temperature in the PNW, but not the other way around. The more interesting implicit assumption is that there is a persistence signal in the temperature represented by the arrow $T_i \rightarrow T_{i+1}$. Importantly, we have assumed that this persistence signal is Markov (when conditioned on $E$ and $U$), i.e., there is no arrow $T_i \rightarrow T_j$ for $j > i+1$. This assumption significantly simplifies estimation of the direct and indirect effects of $E$ on $T_i$, as those require estimating the distribution of $T_i$ for every possible combination of its parents. This serves as a motivation for the decision to consider two-week averages—if we were to simply consider daily temperatures, it is unreasonable to expect that $T_{i+1}$ would be independent of $T_{i-1}$ when conditioned on $E$, $U$, and $T_i$.
We next incorporate two assumptions in order to simplify the causal model. First, we assume that all the effects of $U$ are removed by the detrending and removal of annual cycle performed in the preprocessing steps. It is to be expected that this assumption will hold for the well known shared causes (such as the aforementioned seasonality and CO₂ forcing), but the possibility of other factors that have effects not captured by the leading six harmonics of the annual cycle is important to note. The second assumption we make is that the distribution of the temperature anomaly averages does not change over time, i.e., that $P(T_i \mid T_{i-1}, E)$ and $P(T_i \mid E)$ are not dependent on $i$. After making these assumptions, we obtain the simplified DAG on the right of Figure 4, where we introduce the new variable $S$ to represent the past temperature anomaly average and $T$ to represent the subsequent temperature average, and note that this perfectly matches the mediation model in Figure 1 with $X = E$, $Z = S$, and $Y = T$. We can think of $T$ as representing $T_i$ and $S$ as representing either $T_{i-1}$ or the collection $(T_1, \ldots, T_{i-1})$. To see that these interpretations of $S$ are equivalent, consider the SNDE, given by:
$$ \mathrm{SNDE}(e \rightarrow T) = D_{KL}\Bigg(\sum_s P(s \mid do(e))\, P_{T \mid do(e), do(s)} \,\Bigg\|\, \sum_s P(s \mid do(e)) \sum_{e'} P(e')\, P_{T \mid do(e'), do(s)}\Bigg). \tag{15} $$
Now let $S = (T_1, \ldots, T_{i-1})$ and $s = (t_1, \ldots, t_{i-1})$, and note that:
$$ P_{T \mid do(e), do(s)} = P_{T_i \mid do(e), do(t_1, \ldots, t_{i-1})} = P_{T_i \mid do(e), do(t_{i-1})}. $$
Plugging these into the second argument of the KL-divergence in Equation (15), we get:
$$ \sum_s P(s \mid do(e)) \sum_{e'} P(e')\, P_{T_i \mid do(e'), do(t_{i-1})} = \sum_{t_{i-1}} P(t_{i-1} \mid do(e)) \sum_{e'} P(e')\, P_{T_i \mid do(e'), do(t_{i-1})}. $$
Given that S appears nowhere in the first argument of the KL-divergence, we can see that whether $S = T_{i-1}$ or $S = (T_1, \ldots, T_{i-1})$, the result is the same. The same procedure can be applied to show equivalence for the SNIE. We here choose the interpretation $S = T_{i-1}$. As a result of the assumption that $P(T_i \mid T_{i-1}, E)$ does not depend on $i$, we have that $P_{T \mid do(s), do(e)} = P_{T_i \mid do(t_{i-1}), do(e)}$ for all $i$. It should be noted that for $i = 1$ (i.e., the average for the first two weeks of January), we define $S$ to be the average taken over the last two weeks of December.
5.2. Estimation and Significance Testing
We define the dataset from which we estimate the causal influences as $\mathcal{D} = \{(E_j, S_j, T_j)\}_{j=1}^{N}$. Given that there is a large amount of data and a relatively small alphabet size, we utilize plug-in estimators of the proposed measures, where every distribution in question is estimated using a maximum likelihood estimator. Since $E$ has no parents in the reduced DAG, we can freely exchange interventions $do(E=e)$ for observations $e$ in the estimation of the effect of $e$ on $T$. As such, the estimates of the specific effect of ENSO on temperature are given by:
$$ \widehat{\mathrm{STE}}(e \rightarrow T) = D_{KL}\big(\hat{p}_{\mathcal{D}}(T \mid e) \,\big\|\, \hat{p}_{\mathcal{D}}(T)\big), $$
and analogously for the SNDE and SNIE, where $\hat{p}_{\mathcal{D}}$ gives the maximum likelihood estimate of $p$ on the sample $\mathcal{D}$ (see Appendix E).
Next note that the conditional STE of the past temperature average $S$ on the subsequent temperature $T$ conditioned on an ENSO state $E = e$ is:
$$ \mathrm{STE}(s \rightarrow T \mid e) = D_{KL}\big(P_{T \mid do(s), e} \,\big\|\, P_{T \mid e}\big). $$
Letting $X = S$, $Y = T$, $Z = \emptyset$, and $U_o = \{E\}$, it follows from Theorem 3 that we can estimate the total effect from observational data. Therefore, we use the following plug-in estimator:
$$ \widehat{\mathrm{STE}}(s \rightarrow T \mid e) = D_{KL}\big(\hat{p}_{\mathcal{D}}(T \mid s, e) \,\big\|\, \hat{p}_{\mathcal{D}}(T \mid e)\big). $$
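A minimal sketch of such a plug-in estimator over finite alphabets (the function and variable names are ours, not from the published code):

```python
import numpy as np
from collections import Counter

def plugin_conditional_ste(samples, s, e):
    """Plug-in estimate (in bits) of STE(s -> T | e) from (e, s, t) triples:
    D( P_hat(T | s, e) || P_hat(T | e) ) with maximum likelihood estimates."""
    cnt_se = Counter(t for ej, sj, t in samples if ej == e and sj == s)
    cnt_e = Counter(t for ej, _, t in samples if ej == e)
    n_se, n_e = sum(cnt_se.values()), sum(cnt_e.values())
    div = 0.0
    for t, c in cnt_se.items():
        p_num = c / n_se          # ML estimate of P(t | s, e)
        p_den = cnt_e[t] / n_e    # ML estimate of P(t | e)
        div += p_num * np.log2(p_num / p_den)
    return div

# Toy usage with values in {-1, 0, +1}, as in the quantized dataset:
data = [(1, -1, -1), (1, -1, 0), (1, 0, 0), (1, -1, -1), (0, 0, 1), (1, -1, -1)]
print(plugin_conditional_ste(data, s=-1, e=1))
```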
Given the absence of an intuitive link between bits and temperature, we choose to focus on the normalized versions of the proposed causal measures (see
Appendix E).
By applying these estimators to the complete dataset $\mathcal{D}$, we obtain point estimates of the desired measures. For ease of notation, we omit $\mathcal{D}$ from the estimates from here on. It is important to note that even though not all estimates will utilize all 9840 samples, Figure 5 makes clear that there is a considerable number of samples available for estimating every distribution in question. In particular, the distribution estimated on the smallest number of samples is estimated from 596 samples.
In addition to these point estimates, it is desirable to have a means of measuring the significance of the estimated measures and quantifying the uncertainty in our estimates. To achieve these goals, we perform a nonparametric bootstrap hypothesis test [
61] and construct a nonparametric bootstrap confidence interval [
62]. The goal of the hypothesis test is to estimate the distribution of the estimated measure under a null hypothesis (H0) and assess the likelihood that our estimate came from such a distribution. In this case, H0 corresponds to the absence of a causal link, which would result in the true causal measure being equal to zero. The primary challenge to performing this test is the generation of samples from a distribution representative of H0. We accomplish this using a scheme similar to that presented in Example 2 in [
2] wherein we group the data by one of the three variables (
E,
S, or
T) and shuffle the other two in order to break one of the causal links. For example, when performing the test for the direct effect of
E on
T, we split the data into three sets: $\mathcal{D}_{S=-1}$, $\mathcal{D}_{S=0}$, and $\mathcal{D}_{S=+1}$. Within each of these sets, we shuffle (i.e., permute) all the samples of
E (or
T). Because the shuffling occurs within groupings of
S, any possible link from
E to
S and
S to
T is preserved (and thus so is the indirect effect), but the link between
E and
T is destroyed. Each of these permutations is then treated as a sample under H0 from which we estimate the SNDE. We perform this shuffling and estimation procedure 10,000 times and use the 95th percentile as the cutoff threshold for statistical significance. This threshold is given by the upper whisker on the boxplots labeled H0 in the figures in the next section. When performing this test for the indirect effect, we choose to break the link from
S to
T rather than from
E to
S in order to preserve the assumption that $P(T_i \mid T_{i-1}, E)$ is the same for all $i$.
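A sketch of the shuffling scheme for the direct-effect test; the array-based interface and names are ours:

```python
import numpy as np

def permutation_null(e, s, t, estimator, n_perm=10_000, seed=0):
    """Null distribution for the direct E -> T link: permute E within groups
    sharing the same value of S, preserving E -> S and S -> T."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        e_perm = e.copy()
        for s_val in np.unique(s):
            idx = np.where(s == s_val)[0]
            e_perm[idx] = rng.permutation(e_perm[idx])
        null_stats[b] = estimator(e_perm, s, t)
    return null_stats  # the 95th percentile serves as the significance cutoff
```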
To quantify the uncertainty in our point estimate we construct a nonparametric bootstrap confidence interval by repeatedly drawing a collection of samples from the empirical distribution of our data and estimating the measure on the new collection of samples. Specifically, let $\mathcal{D}^{(b)} = \{(E_{j_i}, S_{j_i}, T_{j_i})\}_{i=1}^{N}$ be the $b$th bootstrap sample, where $j_1, \ldots, j_N$ are drawn independently from the uniform distribution over $\{1, \ldots, N\}$ for $b = 1, \ldots, 10{,}000$. We estimate the causal measure in question on each of the 10,000 bootstrap samples and use the 5th and 95th percentiles as the lower and upper bounds of the confidence interval.
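And a corresponding sketch of the bootstrap confidence interval (again, names are ours):

```python
import numpy as np

def bootstrap_ci(e, s, t, estimator, n_boot=10_000, seed=0):
    """Resample (E, S, T) triples with replacement and take the 5th/95th
    percentiles of the re-estimated measure."""
    rng = np.random.default_rng(seed)
    n = len(e)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # iid uniform draws over {0,...,n-1}
        stats[b] = estimator(e[idx], s[idx], t[idx])
    return np.percentile(stats, [5, 95])
```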
5.3. Results
We estimate the normalized STE, SNDE, and SNIE of ENSO on temperature and the normalized conditional STE of the past temperature average on the next average conditioned on ENSO. In every case, the measure is estimated on the complete dataset (red ×) and compared with the corresponding weighted average (i.e., "non-specific") measure (red dashed lines). For the specific measure, we obtain an estimate for each value of the cause, i.e., $e \in \{-1, 0, 1\}$ or $s \in \{-1, 0, 1\}$. The average measure is then calculated by taking an expectation of the specific measures with respect to $\hat{P}(E)$ or $\hat{P}(S \mid e)$. As an example, the red dashed line in the left panel of Figure 6 represents $\sum_e \hat{P}(e)\, \widehat{\mathrm{STE}}(e \rightarrow T)$, and the three red dashed lines in Figure 7 represent $\sum_s \hat{P}(s \mid e)\, \widehat{\mathrm{STE}}(s \rightarrow T \mid e)$ for $e \in \{-1, 0, 1\}$. Each figure also displays two boxplots for each measure—the first shows the distribution of the measure estimated on the bootstrap samples and the second shows the distribution of the measure estimated under the null hypothesis that the causal link in question does not exist (denoted "H0").
We begin by considering the total effect of ENSO on temperature shown in Figure 6. Given that $E$ is a root node in the DAG representation given on the right of Figure 4, we note that STE and SMI are equivalent, i.e., $\mathrm{STE}(e \rightarrow T) = i_2(e; T)$, and the expectation gives an estimate of the mutual information, i.e., $\sum_e \hat{P}(e)\, \widehat{\mathrm{STE}}(e \rightarrow T) = \hat{I}(E; T)$. This illustrates the value of considering a specific causal measure—as we can see, the estimated effect of the El Niño phase ($e = 1$) is roughly three times the effect as estimated by the average with respect to $E$. Recall the interpretation of the SMI as a measure of how much we would expect performing $do(E=e)$ to change the course of nature for $T$. Under this interpretation, we see that forcing an El Niño year would alter the temperature distribution from what we would expect to occur naturally moreso than forcing a La Niña or neutral year.
Figure 6 shows that both the direct and indirect effects are less than the STE for all values $e$. This is consistent with the intuition that the direct and indirect effects of ENSO on temperature would not cancel each other out. Intuition is also validated by the fact that the SNIE is less than the SNDE for all values. While this need not be the case in general, we make the assumption that $S$ and $T$ are identically distributed given $E$, and thus we would expect the indirect link $E \rightarrow S \rightarrow T$ to be weaker than the direct link $E \rightarrow T$. While the proposed method does not explicitly identify a physical causal mechanism, the indirect link would represent a situation wherein certain temperatures give rise to environmental circumstances that may affect future temperatures, for example snow pack or soil moisture. Given that there is no evidence in the literature of these environmental factors having a large effect on temperatures, it is sensible that the SNIE is very low. As a final point, we note that while all estimates are statistically significant as measured by our proposed tests, only the effect of the El Niño phase ($e = 1$) has a non-zero lower bound on the confidence interval for the SNIE. This serves as further justification for the measurement of specific causal influences—when simply measuring average influences with MI, CS, or IF, statistical significance testing results in an "all or nothing" test, whereas the present framework enables identifying influences that are significant for only some values of a cause.
We conclude this section with the conditional STE of past on current temperature in a specific ENSO phase, as portrayed by Figure 7. We can clearly see that there is a strong persistence in the temperature anomaly signal, i.e., that the past temperature average has a strong effect on the subsequent average, with the largest effect (forcing a cold anomaly during an El Niño year) being roughly five times the smallest. The fact that the largest effect of $S$ on $T$ occurs when performing the intervention $do(S = -1)$ during an El Niño year can likely be explained by the tendency for El Niño years to give rise to high temperatures. Thus, we would expect that forcing a cold spell during an El Niño would alter the course of nature moreso than, say, forcing a heat wave. Furthermore, the second largest effect is seen when $s = 1$ and $e = -1$, i.e., when a heat wave is forced during a La Niña year. This result is reminiscent of the earlier example where there is a large causal influence resulting from a broken chain reaction. In this case, since we would expect an El Niño (resp. La Niña) year to assign a higher probability to a heat wave (resp. cold spell) that would then persist through the effect of $S$ on $T$, intervening on $S$ to force a cold spell (resp. heat wave) will result in a large deviation from the natural behavior and thus a large causal effect. It is important to note that "forcing a cold spell" is ambiguous in that there are many different mechanisms by which one could hypothetically force a temperature. The following section includes a discussion of how these different mechanisms affect the ability to consider the estimated effect as a true causal effect or merely a measure of predictive utility. In either case, the proposed methods provide a clearer picture of how the relationship between subsequent two-week anomaly averages is modulated by ENSO phase than traditional IT methods. This suggests an area for future investigation, as two-week temperature persistence is not well studied outside of the context of persistent high pressure anomalies [63].
5.4. Challenges and Caveats
Any causal interpretation of the results is predicated on the assumption that there are no confounding factors not accounted for in the preprocessing steps. This assumption is less of an issue when measuring the effect of ENSO, where we only need to assume that there is no common cause for $E$ and $S$ or $E$ and $T$ (that there is no backdoor path, to be precise) beyond the seasonality, CO₂ forcing, and any other phenomena captured by the leading six harmonics. When measuring the effect of past temperatures, however, this assumption is more far-reaching. For example, we have neglected to consider the temperatures in neighboring regions. Moreover, the explicit nature of the causal effect of $S$ on $T$ is more elusive than that of $E$ on $T$. While it is reasonable to expect the temperature to have some causal effect in a literal sense (i.e., via the heat equation), it is likely that the estimation procedure is also capturing the effects of temperature related variables. For example, if we additionally included PNW atmospheric pressure waves in the model, we would expect these waves to be a common cause for $S$ and $T$, resulting in a significantly weaker (if not absent) link $S \rightarrow T$. As such, the above estimate of $\mathrm{STE}(s \rightarrow T \mid e)$ ought to be viewed as either a measure of predictive utility of the literal temperature, or the causal effect of a "meta variable" representative of the temperature and related quantities that are intervened upon as a whole. In any case, the present study serves as a starting point for the development of more intricate causal models relating ENSO and temperature. A potential avenue for continued work is to use the proposed framework on causal graphs learned from data rather than those prespecified using domain expertise. Development of methods for learning causal structures from data is a highly active area of ongoing research in climate science [64].
A second set of challenges arises from the need to estimate the measures for every value of the cause. While these challenges are fundamental to the proposed framework, they provide an opportunity for the development of novel estimation and statistical testing techniques. On one hand, the proposed specific causal measures are necessarily more challenging to estimate than their average counterparts. On the other hand, they necessarily provide more resolution and allow for estimating separate confidence intervals for each element in the analysis. If we are trying to estimate $\mathrm{STE}(s \rightarrow T \mid e)$ but only have a small number of points in our dataset where $S = s$ and $E = e$, then we would have very little confidence in our estimate. However, that need not discourage us from having high confidence in an estimate of $\mathrm{STE}(s' \rightarrow T \mid e')$ for some $(s', e')$ for which we have many samples. That having been said, the proposed estimators and significance test used in the present study lack a formal analysis and leave considerable room for improvement.
As a final discussion point, we return to the comparison of information theoretic and statistical notions of causal influence. Despite having carefully formulated the proposed measures as measures of the extent to which an intervention results in a deviation from the course of nature, the results presented in this section raise the question: How useful are bits? As an absolute measure, it is worth noting that a measure in bits will be largely influenced by the number of quantization regions we select. While this can be partially addressed by the proposed normalization, there is no question that the data compression interpretation provided alongside those equations is less intuitive than a measure of, say, the number of degrees warmer we would expect it to be in an El Niño year than in a La Niña year. Moreover, this intuition gap would be even larger for someone outside of the information theory community (e.g., climate scientists). This is not to say that the proposed measures are so opaque that they are unusable. In fact, we believe that they provide more interpretable notions of causal influence than other information theoretic measures that have experienced some popularity in the literature. Instead, this discussion is merely intended to maximize the level of intuition that we can associate with the proposed measures while simultaneously acknowledging the limitations of information theoretic measures in terms of interpretability.