Neural Interaction Detection
ABSTRACT
Interpreting neural networks is a crucial and challenging task in machine learning.
In this paper, we develop a novel framework for detecting statistical interactions
captured by a feedforward multilayer neural network by directly interpreting its
learned weights. Depending on the desired interactions, our method can achieve
significantly better or similar interaction detection performance compared to the
state-of-the-art without searching an exponential solution space of possible in-
teractions. We obtain this accuracy and efficiency by observing that interactions
between input features are created by the non-additive effect of nonlinear acti-
vation functions, and that interacting paths are encoded in weight matrices. We
demonstrate the performance of our method and the importance of discovered
interactions via experimental results on both synthetic datasets and real-world ap-
plication datasets.
1 INTRODUCTION
Despite their strong predictive power, neural networks have traditionally been treated as “black
box” models, preventing their adoption in many application domains. It has been noted that com-
plex machine learning models can learn unintended patterns from data, raising significant risks to
stakeholders (Varshney & Alemzadeh, 2016). Therefore, in applications where machine learning
models are intended for making critical decisions, such as healthcare or finance, it is paramount to
understand how they make predictions (Caruana et al., 2015; Goodman & Flaxman, 2016).
Existing approaches to interpreting neural networks can be summarized into two types. One type is
direct interpretation, which focuses on 1) explaining individual feature importance, for example by
computing input gradients (Simonyan et al., 2013; Ross et al., 2017; Sundararajan et al., 2017) and
decomposing predictions (Bach et al., 2015; Shrikumar et al., 2017), 2) developing attention-based
models, which illustrate where neural networks focus during inference (Itti et al., 1998; Mnih et al.,
2014; Xu et al., 2015), and 3) providing model-specific visualizations, such as feature map and gate
activation visualizations (Yosinski et al., 2015; Karpathy et al., 2015). The other type is indirect
interpretation, for example post-hoc interpretations of feature importance (Ribeiro et al., 2016) and
knowledge distillation to simpler interpretable models (Che et al., 2016).
It has been commonly believed that one major advantage of neural networks is their capability of
modeling complex statistical interactions between features for automatic feature learning. Statistical
interactions capture important information on where features often have joint effects with other
features on predicting an outcome. The discovery of interactions is especially useful for scientific
discoveries and hypothesis validation. For example, physicists may be interested in understanding
what joint factors provide evidence for new elementary particles; doctors may want to know what
interactions are accounted for in risk prediction models, to compare against known interactions from
existing medical literature.
In this paper, we propose an accurate and efficient framework, called Neural Interaction Detection
(NID), which detects statistical interactions of any order or form captured by a feedforward neural
network, by examining its weight matrices. Our approach is efficient because it avoids searching
over an exponential solution space of interaction candidates by making an approximation of hidden
unit importance at the first hidden layer via all weights above and doing a 2D traversal of the input
weight matrix. We provide theoretical justifications on why interactions between features are created
at hidden units and why our hidden unit importance approximation satisfies bounds on hidden unit
gradients. Top-K true interactions are determined from interaction rankings by using a special form
of generalized additive model, which accounts for interactions of variable order (Wood, 2006; Lou
et al., 2013). Experimental results on simulated datasets and real-world datasets demonstrate the
effectiveness of NID compared to the state-of-the-art methods in detecting statistical interactions.
The rest of the paper is organized as follows: we first review related work and define notations in
Section 2. In Section 3, we examine and quantify the interactions encoded in a neural network,
which leads to our framework for interaction detection detailed in Section 4. Finally, we study our
framework empirically and demonstrate its practical utility on real-world datasets in Section 5.
2 RELATED WORK AND NOTATIONS

2.1 INTERACTION DETECTION

Statistical interaction detection has been a well-studied topic in statistics, dating back to the 1920s
when two-way ANOVA was first introduced (Fisher, 1925). Since then, two general approaches
emerged for conducting interaction detection. One approach has been to conduct individual tests
for each combination of features (Lou et al., 2013). The other approach has been to pre-specify all
interaction forms of interest, then use lasso to simultaneously select which are important (Tibshirani,
1996; Bien et al., 2013; Min et al., 2014; Purushotham et al., 2014).
Notable approaches such as ANOVA and Additive Groves (Sorokina et al., 2008) belong to the first
group. Two-way ANOVA has been a standard method of performing pairwise interaction detection
that involves conducting hypothesis tests for each interaction candidate by checking each hypothesis
with F-statistics (Wonnacott & Wonnacott, 1972). Besides two-way ANOVA, there is also three-
way ANOVA that performs the same analyses but with interactions between three variables instead
of two; however, four-way ANOVA and beyond are rarely done because of how computationally
expensive such tests become. Specifically, the number of interactions to test grows exponentially
with interaction order.
Additive Groves is another method that conducts individual tests for interactions and hence must face
the same computational difficulties; however, it is special because the interactions it detects are not
constrained to any functional form e.g. multiplicative interactions. The unconstrained manner by
which interactions are detected is advantageous when the interactions are present in highly nonlinear
data (Sorokina et al., 2007; 2008). Additive Groves accomplishes this by comparing two regression
trees, one that fits all interactions, and the other that has the interaction of interest forcibly removed.
In interaction detection, lasso-based methods are popular in large part due to how quick they are at
selecting interactions. One can construct an additive model with many different interaction terms
and let lasso shrink the coefficients of unimportant terms to zero (Tibshirani, 1996). While lasso
methods are fast, they require specifying all interaction terms of interest. For pairwise interaction
detection, this requires O(p2 ) terms (where p is the number of features), and O(2p ) terms for higher-
order interaction detection. Still, the form of interactions that lasso-based methods capture is limited
by which are pre-specified.
Our approach to interaction detection is unlike others in that it is both fast and capable of detecting
interactions of variable order without limiting their functional forms. The approach is fast because it
does not conduct individual tests for each interaction to accomplish higher-order interaction detec-
tion. This property has the added benefit of avoiding the high false positive rate, or false discovery rate,
that commonly arises from multiple testing (Benjamini & Hochberg, 1995).
2.2 INTERPRETABILITY
The interpretability of neural networks has largely been a mystery since their inception; however,
many approaches have been developed in recent years to interpret neural networks in their traditional
feedforward form and as deep architectures. Feedforward neural networks have undergone multiple
advances in recent years, with theoretical works justifying the benefits of neural network depth (Tel-
garsky, 2016; Liang & Srikant, 2016; Rolnick & Tegmark, 2018) and new research on interpreting
feature importance from input gradients (Hechtlinger, 2016; Ross et al., 2017; Sundararajan et al.,
2017). Deep architectures have seen some of the greatest breakthroughs, with the widespread use
of attention mechanisms in both convolutional and recurrent architectures to show where they fo-
cus on for their inferences (Itti et al., 1998; Mnih et al., 2014; Xu et al., 2015). Methods such as
feature map visualization (Yosinski et al., 2015), de-convolution (Zeiler & Fergus, 2014), saliency
maps (Simonyan et al., 2013), and many others have been especially important to the vision commu-
nity for understanding how convolutional networks represent images. With long short-term memory
networks (LSTMs), a research direction has studied multiplicative interactions in the unique gating
equations of LSTMs to extract relationships between variables across a sequence (Arras et al., 2017;
Murdoch et al., 2018).
Unlike previous works in interpretability, our approach extracts generalized non-additive interactions between variables from the weights of a neural network.

2.3 NOTATIONS
We can construct a directed acyclic graph G = (V, E) based on non-zero weights, where we create
vertices for input features and hidden units in the neural network and edges based on the non-zero
entries in the weight matrices. See Appendix A for a formal definition.
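The graph definition above and the formulas in later sections use standard multilayer perceptron notation. For reference, a sketch of that convention follows; the symbols $\mathbf{h}^{(\ell)}$, $\mathbf{W}^{(\ell)}$, and $\mathbf{w}^y$ match their later uses, while the bias terms are our assumption of the standard form:
$$\mathbf{h}^{(\ell)} = \phi\left(\mathbf{W}^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \quad \ell = 1, \ldots, L, \qquad \mathbf{h}^{(0)} = \mathbf{x}, \qquad y = \mathbf{w}^{y\top} \mathbf{h}^{(L)} + b^{y},$$
where $\phi(\cdot)$ is the activation function (ReLU in this paper), $\mathbf{W}^{(\ell)} \in \mathbb{R}^{p_\ell \times p_{\ell-1}}$, and $p_\ell$ is the number of hidden units at the $\ell$-th layer.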
A statistical interaction describes a situation in which the joint influence of multiple variables on an
output variable is not additive (Dodge, 2006; Sorokina et al., 2008). Let $x_i$, $i \in [p]$, be the features
and $y$ be the response variable; a statistical interaction $I \subseteq [p]$ exists if and only if $\mathbb{E}[y \mid \mathbf{x}]$, which is
a function of x = (x1 , x2 , . . . , xp ), contains a non-additive interaction between variables xI :
1 In this paper, we mainly focus on the multilayer perceptron architecture with ReLU activation functions, while some of our results can be generalized to a broader class of feedforward neural networks.
Definition 1 (Non-additive Interaction). Consider a function f (·) with input variables xi , i ∈ [p],
and an interaction I ⊆ [p]. Then I is a non-additive interaction of function f (·) if and only if there
does not exist a set of functions fi (·), ∀i ∈ I where fi (·) is not a function of xi , such that
$$f(\mathbf{x}) = \sum_{i \in I} f_i\left(\mathbf{x}_{[p] \setminus \{i\}}\right).$$
For example, in x1 x2 +sin (x2 + x3 + x4 ), there is a pairwise interaction {1, 2} and a 3-way higher-
order interaction {2, 3, 4}, where higher-order denotes |I| ≥ 3. Note that from the definition of
statistical interaction, a d-way interaction can only exist if all of its corresponding (d − 1)-way interactions
exist (Sorokina et al., 2008). For example, the interaction {1, 2, 3} can only exist if interactions
{1, 2}, {1, 3}, and {2, 3} also exist. We will often use this property in this paper.
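As a small illustration of this hereditary property (our own, not part of the detection framework), the snippet below enumerates the (d − 1)-way interactions implied by a d-way interaction:

```python
from itertools import combinations

def implied_subsets(interaction):
    """All (d-1)-way interactions implied by a d-way interaction: by the
    hereditary property, each must also exist if the superset exists."""
    d = len(interaction)
    return [set(c) for c in combinations(sorted(interaction), d - 1)]

# The 3-way interaction {2, 3, 4} in x1*x2 + sin(x2 + x3 + x4) implies
# the pairwise interactions {2, 3}, {2, 4}, and {3, 4}.
print(implied_subsets({2, 3, 4}))
```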
3 FEATURE INTERACTIONS IN NEURAL NETWORKS

In feedforward neural networks, statistical interactions between features, or feature interactions for
brevity, are created at hidden units with nonlinear activation functions, and the influences of the
interactions are propagated layer-by-layer to the final output (see Figure 1). In this section, we
propose a framework to identify and quantify interactions at a hidden unit for efficient interaction
detection, then the interactions are combined across hidden units in Section 4.
3.1 FEATURE INTERACTIONS AT COMMON HIDDEN UNITS

In feedforward neural networks, any interacting features must follow strongly weighted connections
to a common hidden unit before the final output. That is, in the corresponding directed graph,
interacting features will share at least one common descendant. The key observation is that non-
overlapping paths in the network are aggregated via weighted summation at the final output without
creating any interactions between features. This statement is formalized in the following proposition,
and a proof is provided in Appendix A. The reverse of this statement, that a common descendant
will create an interaction among input features, holds true in most cases.
Proposition 2 (Interactions at Common Hidden Units). Consider a feedforward neural network
with input feature xi , i ∈ [p], where y = ϕ (x1 , . . . , xp ). For any interaction I ⊂ [p] in ϕ (·), there
exists a vertex vI in the associated directed graph such that I is a subset of the ancestors of vI at
the input layer (i.e., $\ell = 0$).
In general, the weights in a neural network are nonzero, in which case Proposition 2 blindly infers
that all features are interacting. For example, in a neural network with just a single hidden layer, any hidden unit in the network can imply up to $2^{\|\mathbf{W}_{j,:}\|_0}$ potential interactions, where $\|\mathbf{W}_{j,:}\|_0$ is the number of nonzero values in the weight vector $\mathbf{W}_{j,:}$ for the $j$-th hidden unit. Managing the large solution space of interactions based on nonzero weights requires us to characterize the relative importance of interactions, so we must mathematically define the concept of interaction strength. In addition, we limit the search complexity of our task by only quantifying interactions created at the first hidden layer, which is important for fast interaction detection and sufficient for high detection performance in our empirical evaluation (Section 5.2 and Table 2).
Consider a hidden unit in the first layer: $\phi\left(\mathbf{w}^{\top}\mathbf{x} + b\right)$, where $\mathbf{w}$ is the associated weight vector and $\mathbf{x}$ is the input vector. Although the weight $w_i$ of each feature $i$ is available, the correct way of summarizing feature weights to define interaction strength is not trivial. For an interaction $I \subseteq [p]$, we propose to use an average of the relevant feature weights $\mathbf{w}_I$ as a surrogate for the interaction strength: $\mu\left(|\mathbf{w}_I|\right)$, where $\mu(\cdot)$ is an averaging function that represents the interaction strength due to feature weights.
We provide guidance on how µ should be defined by first considering representative averaging func-
tions from the generalized mean family: maximum value, root mean square, arithmetic mean, ge-
ometric mean, harmonic mean, and minimum value (Bullen et al., 1988). These options can be
narrowed down by accounting for intuitive properties of interaction strength: 1) interaction strength
is evaluated as zero whenever an interaction does not exist (one of the features has zero weight); 2)
interaction strength does not decrease with any increase in magnitude of feature weights; 3) interac-
tion strength is less sensitive to changes in large feature weights.
While the first two properties place natural constraints on interaction strength behavior, the third
property is subtle in its intuition. Consider the scaling between the magnitudes of multiple feature
weights, where one weight has much higher magnitude than the others. In the worst case, there is
one large weight in magnitude while the rest are near zero. If the large weight grows in magnitude,
then interaction strength may not change significantly, but if instead the smaller weights grow at the
same rate, then interaction strength should strictly increase. As a result, maximum value, root mean
square, and arithmetic mean should be ruled out because they do not satisfy either property 1 or 3.
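To make the comparison concrete, the sketch below (our own illustration with hypothetical weights) evaluates the six candidate functions on a weight vector with a zero entry and on vectors whose smaller weight grows; only the geometric mean, harmonic mean, and minimum effectively vanish in the first case while strictly increasing in the second.

```python
import numpy as np

def generalized_means(w, eps=1e-12):
    """Candidate averaging functions from the generalized mean family,
    applied to absolute feature weights w (eps only guards log/div by zero)."""
    w = np.abs(np.asarray(w, dtype=float))
    return {
        "maximum":          w.max(),
        "root mean square": np.sqrt((w ** 2).mean()),
        "arithmetic mean":  w.mean(),
        "geometric mean":   np.exp(np.log(w + eps).mean()),
        "harmonic mean":    w.size / (1.0 / (w + eps)).sum(),
        "minimum":          w.min(),
    }

# Property 1: one feature has zero weight, so no interaction exists.
print(generalized_means([4.0, 0.0]))
# Property 3: the small weight grows while the large weight stays fixed;
# interaction strength should strictly increase.
print(generalized_means([4.0, 0.1]))
print(generalized_means([4.0, 1.0]))
```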
3.2 MEASURING THE INFLUENCE OF HIDDEN UNITS

Our definition of interaction strength at individual hidden units is not complete without considering
their outgoing paths, because an outgoing path of zero weight cannot contribute an interaction to
the final output. To propose a way of quantifying the influence of an outgoing path on the final
output, we draw inspiration from Garson’s algorithm (Garson, 1991; Goh, 1995), which instead of
computing the influence of a hidden unit, computes the influence of features on the output. This
is achieved by cumulative matrix multiplications of the absolute values of weight matrices. In the
following, we propose our definition of hidden unit influence, then prove that this definition upper
bounds the gradient magnitude of the hidden unit with its activation function. To represent the
influence of a hidden unit $i$ at the $\ell$-th hidden layer, we define the aggregated weight $z_i^{(\ell)}$,
$$\mathbf{z}^{(\ell)} = \left|\mathbf{w}^{y}\right|^{\top} \left|\mathbf{W}^{(L)}\right| \cdot \left|\mathbf{W}^{(L-1)}\right| \cdots \left|\mathbf{W}^{(\ell+1)}\right|,$$
where $\mathbf{z}^{(\ell)} \in \mathbb{R}^{p_\ell}$. This definition upper bounds the gradient magnitudes of hidden units because
it computes Lipschitz constants for the corresponding units. Gradients have been commonly used
as variable importance measures in neural networks, especially input gradients which compute di-
rections normal to decision boundaries (Ross et al., 2017; Goodfellow et al., 2015; Simonyan et al.,
2013). Thus, an upper bound on the gradient magnitude approximates how important the variable
can be. A full proof is shown in Appendix C.
Lemma 3 (Neural Network Lipschitz Estimation). Let the activation function $\phi(\cdot)$ be a 1-Lipschitz function. Then the output $y$ is $z_i^{(\ell)}$-Lipschitz with respect to $h_i^{(\ell)}$.
We now combine our definitions from Sections 3.1 and 3.2 to obtain the interaction strength $\omega_i(I)$ of a potential interaction $I$ at the $i$-th unit in the first hidden layer $h_i^{(1)}$:
$$\omega_i(I) = z_i^{(1)}\, \mu\left(\left|\mathbf{W}^{(1)}_{i,\, I}\right|\right). \qquad (1)$$
Note that $\omega_i(I)$ is defined on a single hidden unit, and it is agnostic to scaling ambiguity within a ReLU-based neural network. In Section 4, we discuss our scheme of aggregating strengths across hidden units, so we can compare interactions of different orders.
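A minimal sketch of Equation (1) in code, assuming the trained network's weights are available as NumPy arrays (W_list holding $W^{(1)}, \ldots, W^{(L)}$ with shapes $p_\ell \times p_{\ell-1}$, and w_y the final-layer weights); the variable names are our own:

```python
import numpy as np

def aggregated_weight_first_layer(W_list, w_y):
    """z^(1): influence of each first-layer hidden unit, computed by
    cumulative products of absolute weight matrices (Section 3.2)."""
    z = np.abs(np.asarray(w_y)).reshape(1, -1)   # |w^y|^T, shape (1, p_L)
    for W in reversed(W_list[1:]):               # W^(L), ..., W^(2)
        z = z @ np.abs(W)                        # propagate influence downward
    return z.ravel()                             # shape (p_1,)

def interaction_strength(W_list, w_y, I, mu=np.min):
    """Equation (1): omega_i(I) = z_i^(1) * mu(|W^(1)_{i, I}|), returned as a
    vector over all first-layer hidden units i."""
    z1 = aggregated_weight_first_layer(W_list, w_y)
    W1 = np.abs(np.asarray(W_list[0]))           # |W^(1)|, shape (p_1, p)
    return z1 * mu(W1[:, list(I)], axis=1)
```

Section 4 then aggregates these per-unit strengths across hidden units so that interactions of different orders can be compared.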
4 INTERACTION DETECTION
In this section, we propose our feature interaction detection algorithm NID, which can extract inter-
actions of all orders without individually testing for each of them. Our methodology for interaction
detection is comprised of three main steps: 1) train a feedforward network with regularization, 2)
interpret learned weights to obtain a ranking of interaction candidates, and 3) determine a cutoff for
the top-K interactions.
4.1 ARCHITECTURE
Data often contains both statistical interactions and main effects (Winer et al., 1971). Main effects
describe the univariate influences of variables on an outcome variable. We study two architectures:
MLP and MLP-M. MLP is a standard multilayer perceptron, and MLP-M is an MLP with addi-
tional univariate networks summed at the output (Figure 2). The univariate networks are intended
to discourage the modeling of main effects away from the standard MLP, which can create spurious
interactions using the main effects. When training the neural networks, we apply sparsity regular-
ization on the MLP portions of the architectures to 1) suppress unimportant interacting paths and 2)
push the modeling of main effects towards any univariate networks. We note that this approach can
also generalize beyond sparsity regularization (Appendix G).
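A sketch of the MLP-M architecture in PyTorch; the layer sizes and univariate-network width below are illustrative placeholders rather than the tuned values, and the L1 penalty mirrors the sparsity regularization applied to the main MLP weights:

```python
import torch.nn as nn

class MLPM(nn.Module):
    """Main MLP plus one univariate network per input feature, summed at
    the output (MLP-M sketch; sizes are illustrative)."""
    def __init__(self, p, hidden=(140, 100, 60, 20), uni_hidden=10):
        super().__init__()
        layers, d = [], p
        for h in hidden:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, 1))
        self.main = nn.Sequential(*layers)
        # Small per-feature networks that absorb univariate (main) effects.
        self.uni = nn.ModuleList([
            nn.Sequential(nn.Linear(1, uni_hidden), nn.ReLU(), nn.Linear(uni_hidden, 1))
            for _ in range(p)
        ])

    def forward(self, x):
        uni_out = sum(net(x[:, i:i + 1]) for i, net in enumerate(self.uni))
        return self.main(x) + uni_out

def l1_penalty(model, lam):
    """Sparsity regularization applied to the main MLP parameters only."""
    return lam * sum(w.abs().sum() for w in model.main.parameters())
```

During training, the penalty is added to the mean-squared error or cross-entropy objective.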
4.2 RANKING INTERACTION CANDIDATES

The full proof is included in Appendix D. Under the noted assumption, the theorem in part a) shows that a d-way interaction will improve over one of its (d − 1)-way subsets in the rankings as long as there is no sudden drop from the weight of the (d − 1)-way to the d-way interaction at the same hidden units. We note that the improvement extends to b) as well, when d = |S ∩ R| > 1.
Lastly, we note that Algorithm 1 assumes there are at least as many first-layer hidden units as there are true non-redundant interactions. In practice, we use an arbitrarily large number of first-layer hidden units because the true interactions are initially unknown.
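Algorithm 1 is not reproduced here; the following is our own simplified sketch, consistent with the description above, in which each first-layer hidden unit proposes its top-d incoming features as a d-way candidate for every order d, and candidate strengths are summed across units to form the ranking. It is not the paper's exact procedure.

```python
import numpy as np
from collections import defaultdict

def rank_interaction_candidates(W_list, w_y, max_order=None):
    """Simplified candidate-ranking sketch (not the exact Algorithm 1)."""
    # z^(1): influence of first-layer units (Section 3.2).
    z1 = np.abs(np.asarray(w_y)).reshape(1, -1)
    for W in reversed(W_list[1:]):
        z1 = z1 @ np.abs(W)
    z1, W1 = z1.ravel(), np.abs(np.asarray(W_list[0]))
    max_order = max_order or W1.shape[1]
    strengths = defaultdict(float)
    for i in range(W1.shape[0]):
        order = np.argsort(-W1[i])                            # strongest features first
        for d in range(2, max_order + 1):
            cand = tuple(sorted(order[:d].tolist()))
            strengths[cand] += z1[i] * W1[i, order[:d]].min() # omega_i(I) with mu = min
    return sorted(strengths.items(), key=lambda kv: -kv[1])
```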
4.3 CUTOFF ON INTERACTION RANKINGS

Given the interaction ranking, a cutoff for the top-K interactions is found with a generalized additive model that accounts for interactions of variable order (MLP-Cutoff):
$$c_K(\mathbf{x}) = \sum_{i=1}^{p} g_i(x_i) + \sum_{i=1}^{K} g_i'\left(\mathbf{x}_{I_i}\right),$$
where $g_i$ captures the main effect of feature $i$ and $g_i'$ captures the $i$-th ranked interaction $I_i$.
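The cutoff K itself is chosen by adding ranked interactions to this model one at a time and checking held-out error, as reflected by the cutoff markers in Figures 6 and 7. Below is a schematic sketch of that loop, assuming a hypothetical fit_and_validate helper that fits $c_K$ for a given interaction set and returns a validation error; the exact stopping rule may differ from the paper's.

```python
def find_cutoff(ranked_interactions, fit_and_validate, patience=1):
    """Greedily add top-ranked interactions and stop once validation error
    stops improving; `fit_and_validate` is a hypothetical helper."""
    best_err = fit_and_validate([])          # main effects only (K = 0)
    kept, misses = [], 0
    for I in ranked_interactions:
        err = fit_and_validate(kept + [I])
        if err < best_err:
            best_err, kept, misses = err, kept + [I], 0
        else:
            misses += 1
            if misses > patience:
                break
    return kept
```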
5 EXPERIMENTS
In this section, we discuss our experiments on both simulated and real-world datasets to study the
performance of our approach on interaction detection.
5.1 EXPERIMENTAL SETUP

Averaging Function Figure 3 compares the representative averaging functions from Section 3.1 on the synthetic test suite; the best performance was obtained with the averaging function, minimum, which we will use in all of our experiments. A simple analytical study on a bivariate hidden unit also suggests that the minimum is closely correlated with interaction strength (Appendix B).
Neural Network Configuration We trained feedforward networks of MLP and MLP-M architectures to obtain interaction rankings, and we trained MLP-Cutoff to find cutoffs on those rankings. In MLP-M, the main network and the univariate networks were trained jointly. The objective functions were mean-squared error for regression and cross-entropy for classification tasks. On the synthetic test suite, MLP and MLP-M were trained with L1 constants in the range of 5e-6 to 5e-4, based on parameter tuning on a validation set. On real-world datasets, L1 was fixed at 5e-5. MLP-Cutoff used a fixed L2 constant of 1e-4 in all experiments involving cutoff. Early stopping was used to prevent overfitting.

[Figure 3: A comparison of averaging functions by the total number of correct interactions ranked before any false positives, evaluated on the test suite (Table 1) over 10 trials. The x-axis labels are maximum, root mean square, arithmetic mean, geometric mean, harmonic mean, and minimum.]

Datasets We study our interaction detection framework on both simulated and real-world experiments. For simulated experiments, we used a test suite of synthetic functions, as shown in Table 1. The test functions were designed to have a mixture of pairwise and higher-order interactions, with varying order, strength, nonlinearity, and overlap. F1 is a commonly used function in the interaction detection literature (Hooker, 2004; Sorokina et al., 2008; Lou et al., 2013). All features were uniformly distributed between −1 and 1 except in F1, where we used the same variable ranges as reported in the literature (Hooker, 2004). In all synthetic experiments, we used random train/valid/test splits of 1/3 each on 30k data points.
We use four real-world datasets, of which two are regression datasets, and the other two are binary
classification datasets. The datasets are a mixture of common prediction tasks in the cal housing
and bike sharing datasets, a scientific discovery task in the higgs boson dataset, and an example of
very-high order interaction detection in the letter dataset. Specifically, the cal housing dataset is
a regression dataset with 21k data points for predicting California housing prices (Pace & Barry,
1997). The bike sharing dataset contains 17k data points of weather and seasonal information to
predict the hourly count of rental bikes in a bikeshare system (Fanaee-T & Gama, 2014). The
higgs boson dataset has 800k data points for classifying whether a particle environment originates
from the decay of a Higgs Boson (Adam-Bourdarios et al., 2014). Lastly, the letter recognition
dataset contains 20k data points of transformed features for binary classification of letters on a pixel
display (Frey & Slate, 1991). For all real-world data, we used random train/valid/test splits of
80/10/10.
Baselines We compare the performance of NID to that of four baseline interaction detection methods. Two-Way ANOVA (Wonnacott & Wonnacott, 1972) utilizes linear models to conduct significance tests on the existence of interaction terms. Hierarchical lasso (HierLasso) (Bien et al., 2013) applies lasso feature selection to extract pairwise interactions. RuleFit (Friedman & Popescu, 2008) contains a statistic to measure pairwise interaction strength using partial dependence functions. Additive Groves (AG) (Sorokina et al., 2008) is a nonparametric means of testing for interactions by placing structural constraints on an additive model of regression trees. AG is a reference method for interaction detection because it directly detects interactions based on their non-additive definition.
5.2 PAIRWISE INTERACTION DETECTION

Table 2: AUC of pairwise interaction strengths proposed by NID and baselines on a test suite of synthetic functions (Table 1). ANOVA, HierLasso, and RuleFit are deterministic.

AUC scores of interaction strength proposed by baseline methods and NID for both MLP and MLP-M are shown in Table 2. We ran ten trials of AG and NID on each dataset and removed the two trials with the highest and lowest AUC scores.
When comparing the AUCs of NID applied to MLP and MLP-M, we observe that the scores of MLP-
M tend to be comparable or better, except the AUC for F6 . On one hand, MLP-M performed better
on F2 and F4 because these functions contain main effects that MLP would model as spurious inter-
actions with other variables. On the other hand, MLP-M performed worse on F6 because it modeled
spurious main effects in the {8, 9, 10} interaction. Specifically, {8, 9, 10} can be approximated as
independent parabolas for each variable (shown in Appendix I). In our analyses of NID, we mostly
focus on MLP-M because handling main effects is widely considered an important problem in inter-
action detection (Bien et al., 2013; Lim & Hastie, 2015; Kong et al., 2017). Comparing the AUCs of
AG and NID for MLP-M, the scores tend to be close, except for F5, F6, and F8, where NID performs
significantly better than AG. This performance difference may be due to limitations on the model
capacity of AG, which is tree-based. In comparison to ANOVA, HierLasso and RuleFit, NID-MLP-M
generally performs on par or better. This is expected for ANOVA and HierLasso because they are
based on quadratic models, which can have difficulty approximating the interaction nonlinearities
present in the test suite.
In Figure 4, heat maps of synthetic functions show the relative strengths of all possible pairwise
interactions as interpreted from MLP-M, and ground truth is indicated by red cross-marks. The
interaction strengths shown are normally high at the cross-marks. An exception is F6 , where NID
proposes weak or negligible interaction strengths at the cross-marks corresponding to the {8, 9, 10}
interaction, which is consistent with previous remarks about this interaction. Besides F6 , F7 also
shows erroneous interaction strengths; however, comparative detection performance by the baselines
is similarly poor. Interaction strengths are also visualized on real-world datasets via heat maps
(Figure 5). For example, in the cal housing dataset, there is a high-strength interaction between
x1 and x2. These variables correspond to longitude and latitude respectively, and it is clear that the outcome variable, California housing price, should indeed strongly depend on geographical location. We further observe high-strength interactions appearing in the heat maps of the bike sharing, higgs boson, and letter datasets. For example, all feature pairs appear to be interacting in the letter
dataset. The binary classification task from the letter dataset is to distinguish letters A-M from N-Z
using 16 pixel display features. Since the decision boundary between A-M and N-Z is not obvious, it
would make sense that a neural network learns a highly interacting function to make the distinction.
Figure 4: Heat maps of pairwise interaction strengths proposed by our NID framework on MLP-M
for datasets generated by functions F1 -F10 (Table 1). Cross-marks indicate ground truth interactions.
[Figure 5: Heat maps of pairwise interaction strengths proposed by the NID framework for the real-world datasets (cal housing, bike sharing, higgs boson, and letter).]
Figure 6: MLP-Cutoff error with added top-ranked interactions (along x-axis) of F1 -F10 (Table 1),
where the interaction rankings were generated by the NID framework applied to MLP-M. Red cross-
marks indicate ground truth interactions, and Ø denotes MLP-Cutoff without any interactions. Sub-
set interactions become redundant when their true superset interactions are found.
Figure 7: MLP-Cutoff error with added top-ranked interactions (along x-axis) of real-world datasets
(Table 1), where the interaction rankings were generated by the NID framework on MLP-M. Ø
denotes MLP-Cutoff without any interactions.
Figures 6 and 7 show that a relatively small number of interactions of variable order are highly predictive of their corresponding datasets, as true interactions should be.
5.3 HIGHER-ORDER INTERACTION DETECTION

We further study higher-order interaction detection of our NID framework by comparing it to AG
in both interaction ranking quality and runtime. To assess ranking quality, we design a metric, top-
rank recall, which computes a recall of proposed interaction rankings by only considering those
interactions that are correctly ranked before any false positive. The number of top correctly-ranked
interactions is then divided by the true number of interactions. Because subset interactions are
redundant in the presence of corresponding superset interactions, only such superset interactions
can count as true interactions, and our metric ignores any subset interactions in the ranking. We
compute the top-rank recall of NID on MLP and MLP-M, the scores of which are averaged across
all tests in the test suite of synthetic functions (Table 1) with 10 trials per test function. For each
test, we remove two trials with max and min recall. We conduct the same tests using the state-
of-the-art interaction detection method AG, except with only one trial per test because AG is very
computationally expensive to run. In Figure 8a, we show top-rank recall of NID and AG at different
Gaussian noise levels3 , and in Figure 8b, we show runtime comparisons on real-world and synthetic
datasets. As shown, NID can obtain similar top-rank recall as AG while running orders of magnitude
faster.
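The top-rank recall metric can be computed directly from a proposed ranking and the set of true (superset) interactions; a small sketch with our own variable names:

```python
def top_rank_recall(ranking, true_interactions):
    """Count true interactions ranked before the first false positive, then
    divide by the number of true interactions. Subsets of true interactions
    are treated as redundant and skipped rather than counted as errors."""
    true_sets = [frozenset(t) for t in true_interactions]
    hits = set()
    for cand in ranking:
        cand = frozenset(cand)
        if cand in true_sets:
            hits.add(cand)                     # correctly ranked true interaction
        elif any(cand < t for t in true_sets):
            continue                           # redundant subset interaction: ignore
        else:
            break                              # first false positive ends the count
    return len(hits) / len(true_sets)

# Ranking proposes {2,3,4}, then {1,2}, then a false positive {5,6}: recall = 1.0.
print(top_rank_recall([(2, 3, 4), (1, 2), (5, 6)], [(1, 2), (2, 3, 4)]))
```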
5.4 LIMITATIONS
In higher-order interaction detection, our NID framework can have difficulty detecting interactions from functions with interlinked interacting variables. For example, a clique $x_1 x_2 + x_1 x_3 + x_2 x_3$ only contains pairwise interactions.

3 Gaussian noise was applied to both features and the outcome variable after standard-scaling all variables.
Table 3: Test performance improvement when adding top-K discovered interactions to MLP-Cutoff on real-world datasets and select synthetic datasets. Here, the median K̄ excludes subset interactions, and $\overline{|I|}$ denotes the average interaction cardinality. RMSE values are standard scaled.
Figure 8: Comparisons between AG and NID in higher-order interaction detection. (a) Comparison of top-rank recall at different noise levels on the synthetic test suite (Table 1), (b) comparison of runtimes, where NID runtimes with and without the cutoff stage are both measured. NID detects interactions with top-rank recall close to the state-of-the-art AG while running orders of magnitude faster.
When detecting pairwise interactions (Section 5.2), NID often obtains
an AUC of 1. However, in higher-order interaction detection, the interlinked pairwise interactions
are often confused for single higher-order interactions. This issue could mean that our higher-
order interaction detection algorithm fails to separate interlinked pairwise interactions encoded in
a neural network, or the network approximates interlinked low-order interactions as higher-order
interactions. Another limitation of our framework is that it sometimes detects spurious interactions
or misses interactions as a result of correlations between features; however, correlations are known
to cause such problems for any interaction detection method (Sorokina et al., 2008; Lou et al., 2013).
6 CONCLUSION
We presented our NID framework, which detects statistical interactions by interpreting the learned
weights of a feedforward neural network. The framework has the practical utility of accurately
detecting general types of interactions without searching an exponential solution space of interaction
candidates. Our core insight was that interactions between features must be modeled at common
hidden units, and our framework decoded the weights according to this insight.
In future work, we plan to detect feature interactions by accounting for common units in intermediate
hidden layers of feedforward networks. We would also like to use the perspective of interaction
detection to interpret weights in other deep neural architectures.
ACKNOWLEDGMENTS
We are very thankful to Xinran He, Natali Ruchansky, Meisam Razaviyayn, Pavankumar Murali
(IBM), Harini Eavani (Intel), and anonymous reviewers for their thoughtful advice. This work was
supported by National Science Foundation awards IIS-1254206, IIS-1539608 and USC Annenberg
Fellowships for MT and DC.
REFERENCES
Claire Adam-Bourdarios, Glen Cowan, Cecile Germain, Isabelle Guyon, Balazs Kegl, and
David Rousseau. Learning to discover: the higgs boson machine learning challenge. URL
https://higgsml.lal.in2p3.fr/documentation/, 2014.
Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining recur-
rent neural network predictions in sentiment analysis. In Proceedings of the 8th Workshop on
Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 159–168,
2017.
Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller,
and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise
relevance propagation. PloS one, 10(7):e0130140, 2015.
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful
approach to multiple testing. Journal of the royal statistical society. Series B (Methodological),
pp. 289–300, 1995.
Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of
statistics, 41(3):1111, 2013.
PS Bullen, DS Mitrinović, and PM Vasić. Means and their inequalities, mathematics and its appli-
cations, 1988.
Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intel-
ligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 1721–1730. ACM, 2015.
Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models
for icu outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, pp. 371.
American Medical Informatics Association, 2016.
Yadolah Dodge. The Oxford dictionary of statistical terms. Oxford University Press on Demand,
2006.
Yingying Fan, Yinfei Kong, Daoji Li, and Jinchi Lv. Interaction pursuit with feature screening and
selection. arXiv preprint arXiv:1605.08933, 2016.
Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background
knowledge. Progress in Artificial Intelligence, 2(2-3):113–127, 2014.
Ronald Aylmer Fisher. Statistical methods for research workers. Genesis Publishing Pvt Ltd, 1925.
Peter W Frey and David J Slate. Letter recognition using holland-style adaptive classifiers. Machine
learning, 6(2):161–182, 1991.
Jerome H Friedman and Bogdan E Popescu. Predictive learning via rule ensembles. The Annals of
Applied Statistics, pp. 916–954, 2008.
G David Garson. Interpreting neural-network connection weights. AI Expert, 6(4):46–51, 1991.
ATC Goh. Back-propagation neural networks for modeling complex systems. Artificial Intelligence
in Engineering, 9(3):143–151, 1995.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. In International Conference on Learning Representations, 2015.
Bryce Goodman and Seth Flaxman. European union regulations on algorithmic decision-making
and a "right to explanation". arXiv preprint arXiv:1606.08813, 2016.
Yotam Hechtlinger. Interpretation of prediction models using the input gradient. arXiv preprint
arXiv:1611.07634, 2016.
Giles Hooker. Discovering additive structure in black box functions. In Proceedings of the tenth
ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 575–580.
ACM, 2004.
Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–
1259, 1998.
Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks.
arXiv preprint arXiv:1506.02078, 2015.
Yinfei Kong, Daoji Li, Yingying Fan, Jinchi Lv, et al. Interaction pursuit in high-dimensional multi-
response regression via distance correlation. The Annals of Statistics, 45(2):897–922, 2017.
Shiyu Liang and R Srikant. Why deep neural networks for function approximation? 2016.
Michael Lim and Trevor Hastie. Learning interactions via hierarchical group-lasso regularization.
Journal of Computational and Graphical Statistics, 24(3):627–654, 2015.
Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with
pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining, pp. 623–631. ACM, 2013.
Martin Renqiang Min, Xia Ning, Chao Cheng, and Mark Gerstein. Interpretable sparse high-order
boltzmann machines. In Artificial Intelligence and Statistics, pp. 614–622, 2014.
Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In
Advances in neural information processing systems, pp. 2204–2212, 2014.
W James Murdoch, Peter J Liu, and Bin Yu. Beyond word importance: Contextual decomposition
to extract interactions from lstms. International Conference on Learning Representations, 2018.
R Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33
(3):291–297, 1997.
Sanjay Purushotham, Martin Renqiang Min, C-C Jay Kuo, and Rachel Ostroff. Factorized sparse
learning models with interpretable high order feature interactions. In Proceedings of the 20th
ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 552–561.
ACM, 2014.
Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International
Conference on, pp. 995–1000. IEEE, 2010.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the
predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions.
International Conference on Learning Representations, 2018.
Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. Right for the right reasons:
Training differentiable models by constraining their explanations. In Proceedings of the Twenty-
Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 2662–2670, 2017.
Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regu-
larization for deep neural networks. Neurocomputing, 241:81–89, 2017.
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through
propagating activation differences. In International Conference on Machine Learning, pp. 3145–
3153, 2017.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Vi-
sualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Daria Sorokina, Rich Caruana, and Mirek Riedewald. Additive groves of regression trees. Machine
Learning: ECML 2007, pp. 323–334, 2007.
Daria Sorokina, Rich Caruana, Mirek Riedewald, and Daniel Fink. Detecting statistical interactions
with additive groves of trees. In Proceedings of the 25th international conference on Machine
learning, pp. 1000–1007. ACM, 2008.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In
International Conference on Machine Learning, pp. 3319–3328, 2017.
Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pp.
1517–1539, 2016.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society. Series B (Methodological), pp. 267–288, 1996.
Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical sys-
tems, decision sciences, and data products. arXiv preprint arXiv:1610.01256, 2016.
Ben James Winer, Donald R Brown, and Kenneth M Michels. Statistical principles in experimental
design, volume 2. McGraw-Hill New York, 1971.
Thomas H Wonnacott and Ronald J Wonnacott. Introductory statistics, volume 19690. Wiley New
York, 1972.
Simon Wood. Generalized additive models: an introduction with R. CRC press, 2006.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov,
Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation
with visual attention. In ICML, volume 14, pp. 77–81, 2015.
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural
networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
European conference on computer vision, pp. 818–833. Springer, 2014.
A GRAPH DEFINITION AND PROOF OF PROPOSITION 2

Given a trained feedforward neural network as defined in Section 2.3, we can construct a di-
rected acyclic graph G = (V, E) based on non-zero weights as follows. We create a vertex
for each input feature and hidden unit in the neural network: V = {v`,i |∀i, `}, where v`,i is
the vertex corresponding to the i-th hidden unit in the `-th layer. Note that the final output
y is not included. We create edges based on the non-zero entries in the weight matrices, i.e., $E = \{(v_{\ell-1,i}, v_{\ell,j}) \mid W^{(\ell)}_{j,i} \neq 0, \ \forall i, j, \ell\}$. Note that under the graph representation, the value of any
hidden unit is a function of parent hidden units. In the following proposition, we will use vertices
and hidden units interchangeably.
Proposition 2 (Interactions at Common Hidden Units). Consider a feedforward neural network
with input feature xi , i ∈ [p], where y = ϕ (x1 , . . . , xp ). For any interaction I ⊂ [p] in ϕ (·), there
exists a vertex vI in the associated directed graph such that I is a subset of the ancestors of vI at
the input layer (i.e., $\ell = 0$).
Since $\sum_{j \in S_i} w^y_j f_j\left(\mathbf{x}_{I_j}\right)$ is not a function of $x_i$, we have that $\varphi(\cdot)$ is a function without the interaction $I$, which is a contradiction.
The reverse of this statement, that a common descendant will create an interaction among input
features, holds true in most cases. The existence of counterexamples is manifested when early
hidden layers capture an interaction that is negated in later layers. For example, the effects of two
interactions may be directly removed in the next layer, as in the case of the following expression:
$\max\{w_1 x_1 + w_2 x_2, 0\} - \max\{-w_1 x_1 - w_2 x_2, 0\} = w_1 x_1 + w_2 x_2$. Such a counterexample is legitimate; however, due to random fluctuations, it is highly unlikely in practice that the $w_1$'s and the $w_2$'s from the left-hand side are exactly equal.
B ANALYSIS OF A BIVARIATE RELU FUNCTION

We can provide a finer interaction strength analysis on a bivariate ReLU function: $\max\{\alpha_1 x_1 + \alpha_2 x_2, 0\}$, where $x_1, x_2$ are two variables and $\alpha_1, \alpha_2$ are the weights for this simple network. We quantify the strength of the interaction between $x_1$ and $x_2$ with the cross-term coefficient of the best quadratic approximation. That is,
$$\beta_0, \ldots, \beta_5 = \operatorname*{argmin}_{\beta_i,\, i=0,\ldots,5} \int_{-1}^{1}\!\!\int_{-1}^{1} \Big[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 - \max\{\alpha_1 x_1 + \alpha_2 x_2, 0\} \Big]^2 \, dx_1\, dx_2.$$
Note that the choice of the region $(-1, 1) \times (-1, 1)$ is arbitrary: for a larger region $(-c, c) \times (-c, c)$ with $c > 1$, we found that $|\beta_5|$ scales with $c^{-1}$. By the results of Proposition B, the strength of the interaction can be well-modeled by the minimum value between $|\alpha_1|$ and $|\alpha_2|$. Note that the factor before $\min\{|\alpha_1|, |\alpha_2|\}$ in Equation (2) is almost a constant, with less than 20% fluctuation.
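This analysis is straightforward to reproduce numerically; the sketch below performs the least-squares quadratic fit on a grid over $(-1, 1)^2$ and reports $\beta_5$ alongside $\min\{|\alpha_1|, |\alpha_2|\}$ (our own check, with arbitrary example weights):

```python
import numpy as np

def cross_term_coefficient(a1, a2, n=201):
    """Least-squares quadratic fit to max(a1*x1 + a2*x2, 0) on (-1, 1)^2;
    returns beta_5, the coefficient of the x1*x2 cross term."""
    x1, x2 = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
    x1, x2 = x1.ravel(), x2.ravel()
    target = np.maximum(a1 * x1 + a2 * x2, 0.0)
    design = np.stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2], axis=1)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[5]

for a1, a2 in [(1.0, 1.0), (2.0, 0.5), (4.0, 0.5), (4.0, 0.0)]:
    print(a1, a2, round(cross_term_coefficient(a1, a2), 4), min(abs(a1), abs(a2)))
```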
C PROOF OF LEMMA 3

Proof. For non-differentiable $\phi(\cdot)$ such as the ReLU function, we can replace it with a series of differentiable 1-Lipschitz functions that converges to $\phi(\cdot)$ in the limit. Therefore, without loss of generality, we assume that $\phi(\cdot)$ is differentiable with $|\partial_x \phi(x)| \leq 1$. We can take the partial derivative of the final output with respect to $h_i^{(\ell)}$, the $i$-th unit at the $\ell$-th hidden layer:
$$\frac{\partial y}{\partial h_i^{(\ell)}} = \sum_{j_{\ell+1}, \ldots, j_L} \frac{\partial y}{\partial h_{j_L}^{(L)}} \frac{\partial h_{j_L}^{(L)}}{\partial h_{j_{L-1}}^{(L-1)}} \cdots \frac{\partial h_{j_{\ell+1}}^{(\ell+1)}}{\partial h_i^{(\ell)}} = \left[\mathbf{w}^{y\top} \operatorname{diag}(\dot{\phi}^{(L)})\, \mathbf{W}^{(L)} \cdots \operatorname{diag}(\dot{\phi}^{(\ell+1)})\, \mathbf{W}^{(\ell+1)}\right]_i.$$
Taking absolute values and using $|\dot{\phi}| \leq 1$ entrywise then gives $\left|\partial y / \partial h_i^{(\ell)}\right| \leq \left(|\mathbf{w}^y|^{\top} |\mathbf{W}^{(L)}| \cdots |\mathbf{W}^{(\ell+1)}|\right)_i = z_i^{(\ell)}$, which proves the claim.
D PROOF OF THE INTERACTION RANKING THEOREM

Proof. Suppose for the purpose of contradiction that $S \subseteq R$ and $\forall s \in S$, $\omega(s) \geq \omega(I)$. Because $\omega_j(I) > \frac{1}{d} \omega_j(s)$,
$$\omega(I) = \sum_{s \in S \cap R}\ \sum_{j \text{ proposes } s} z_j\, \omega_j(I) > \sum_{s \in S \cap R}\ \sum_{j \text{ proposes } s} \frac{1}{d}\, z_j\, \omega_j(s) = \frac{1}{d} \sum_{s \in S \cap R} \omega(s).$$
Also, because $\forall s \in S$, $\omega(s) \geq \omega(I)$,
$$\frac{1}{d} \sum_{s \in S \cap R} \omega(s) \geq \frac{1}{d} \sum_{s \in S \cap R} \omega(I).$$
Since $S \subseteq R$, $|S \cap R| = d$. Therefore,
$$\frac{1}{d} \sum_{s \in S \cap R} \omega(I) \geq \frac{1}{d}\, \omega(I)\, d \geq \omega(I),$$
which is a contradiction.
E ROC CURVES
[ROC curves for pairwise interaction detection on the synthetic test suite, with mean AUC ± 1 std. dev. per function: F1: 0.995 ± 4.4e−3; F2: 0.853 ± 3.9e−2; F3: 1.000 ± 0.0; F4: 0.996 ± 4.7e−3; F5: 1.000 ± 0.0; F6: 0.702 ± 4.8e−2; F7: 0.819 ± 2.2e−2; F8: 0.989 ± 4.5e−3; F9: 0.833 ± 3.7e−2; F10: 0.992 ± 2.1e−2.]
F LARGE p EXPERIMENT
We evaluate our approach in a large p setting with pairwise interactions using the same synthetic function as in Purushotham et al. (2014). Specifically, we generate a dataset of n samples and p features $\{(\mathbf{X}^{(i)}, y^{(i)})\}$ using that function, where $\mathbf{X}^{(i)} \in \mathbb{R}^p$ is the $i$th instance of the design matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$, $y^{(i)} \in \mathbb{R}$ is the $i$th instance of the response variable $\mathbf{y} \in \mathbb{R}^{n \times 1}$, $\mathbf{W} \in \mathbb{R}^{p \times p}$ contains the weights of pairwise interactions, $\boldsymbol{\beta} \in \mathbb{R}^p$ contains the weights of main effects, $\epsilon^{(i)}$ is noise, and $i = 1, \ldots, n$. $\mathbf{W}$ was generated as a sum of $K$ rank-one matrices, $\mathbf{W} = \sum_{k=1}^{K} \mathbf{a}_k \mathbf{a}_k^{\top}$.

[Figure 10: ROC curve for the large p experiment (AUC = 0.984).]

In this experiment, we set p = 1000, n = 1e4, and K = 5. $\mathbf{X}$ is normally distributed with mean 0 and variance 1, and $\epsilon^{(i)}$ is normally distributed with mean 0 and variance 0.1. Both $\mathbf{a}_k$ and $\boldsymbol{\beta}$ are sparse vectors of 2% nonzero density and are normally distributed with mean 0
and variance 1. We train MLP-M with the same hyperparameters as before (Section 5.1) but with
a larger main network architecture of five hidden layers, with first-to-last layer sizes of 500, 400,
300, 200, and 100. Interactions are then extracted using the NID framework.
From this experiment, we obtain a pairwise interaction strength AUC of 0.984 on 950 ground truth
pairwise interactions, where the AUC is measured in the same way as those in Table 2. The corre-
sponding ROC curve is shown in Figure 10.
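For completeness, the low-rank, sparse interaction weight matrix described above can be constructed as follows (a sketch of one realization of that description; the full response-generating equation is Equation 5 of Purushotham et al. (2014) and is not reproduced here):

```python
import numpy as np

def make_interaction_weights(p=1000, K=5, density=0.02, seed=0):
    """W = sum_k a_k a_k^T, with each a_k a sparse vector of the given
    nonzero density and standard-normal nonzero entries."""
    rng = np.random.default_rng(seed)
    W = np.zeros((p, p))
    for _ in range(K):
        a = np.zeros(p)
        idx = rng.choice(p, size=int(density * p), replace=False)
        a[idx] = rng.normal(0.0, 1.0, size=idx.size)
        W += np.outer(a, a)
    return W
```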
G COMPARISON OF REGULARIZATION METHODS

We compare the average performance of NID for different regularizations on MLP-M networks.
Specifically, we compare interaction detection performance when an MLP-M network has L1, L2, or
group lasso regularization4 . While L1 and L2 are common methods for regularizing neural network
weights, group lasso is used to specifically regularize groups in the input weight matrix because
weight connections into the first hidden layer are especially important in this work. In particular,
we study group lasso by 1) forming groups associated with input features, and 2) forming groups
associated with hidden units in the input weight matrix. In this experimental setting, group lasso
effectively conducts variable selection for associated groups. Scardapane et al. (2017) define group lasso regularization for neural networks in their Equation 5. Denote group lasso with input groups as $R_{GL}^{(i)}$ and group lasso with hidden-unit groups as $R_{GL}^{(h)}$. In order to apply both group- and individual-level sparsity, Scardapane et al. (2017) further define sparse group lasso in their Equation 7. Denote sparse group lasso with input groups as $R_{SGL}^{(i)}$ and sparse group lasso with hidden-unit groups as $R_{SGL}^{(h)}$.
Networks that had group lasso or sparse group lasso applied to the input weight matrix had L1
regularization applied to all other weights. In our experiments, we use large dataset sizes of 1e5
and tune the regularizers by gradually increasing their respective strengths from zero until validation
performance worsens from underfitting. The rest of the neural network hyperparameters are the same
as those discussed in Section 5.1. In the case of the group lasso and sparse group lasso experiments,
L1 norms were tuned the same as in the standard L1 experiments.
In Table 4, we report average pairwise interaction strength AUC over 10 trials of each function in
our synthetic test suite (Table 1) for the different regularizers.
Table 4: Average AUC of pairwise interaction strengths proposed by NID for different regularizers. Evaluation was conducted on the test suite of synthetic functions (Table 1).

            L1              L2              R_GL^(i)        R_GL^(h)        R_SGL^(i)       R_SGL^(h)
average     0.94 ± 2.9e−2   0.94 ± 2.5e−2   0.95 ± 2.4e−2   0.94 ± 2.5e−2   0.93 ± 3.2e−2   0.94 ± 3.0e−2
H COMPARISONS ON BINARY CLASSIFICATION DATA

We perform experiments with our NID approach on synthetic datasets that have binary class labels as
opposed to continuous outcome variables (e.g. Table 1). In our evaluation, we compare our method
against two logistic regression methods for multiplicative interaction detection, Factorization Based
High-Order Interaction Model (FHIM) (Purushotham et al., 2014) and Sparse High-Order Logistic
Regression (Shooter) (Min et al., 2014). In both comparisons, we use dataset sizes of p = 200
features and n = 1e4 samples based on MLP-M’s fit on the data and the performance of the base-
lines. We also make the following modifications to MLP-M hyperparameters based on validation
performance: the main MLP-M network has first-to-last layer sizes of 100, 60, 20 hidden units, the
univariate networks do not have any hidden layers, and the L1 regularization constant is set to 5e−4.
All other hyperparameters are kept the same as in Section 5.1.
FHIM Purushotham et al. (2014) developed a feature interaction matrix factorization method with
L1 regularization, FHIM, that identifies pairwise multiplicative interactions in flexible p settings.
4 We omit dropout in these experiments because it significantly lowered the predictive performance of MLP-M in our regression evaluation.
When used in a logistic regression model, FHIM detects pairwise interactions that are predictive of
binary class labels. For this comparison, we used data generated by Equation 5 in Purushotham et al.
(2014), with K = 2 and sparsity factors being 5% to generate 73 ground truth pairwise interactions.
In Table 5, we report average pairwise interaction detection AUC over 10 trials, with a maximum
and a minimum AUC removed.
Table 5: Pairwise AUC comparison between FHIM and NID with p = 200, n = 1e4.

            FHIM             NID
average     0.925 ± 2.3e−3   0.973 ± 6.1e−3
Shooter Min et al. (2014) developed Shooter, an approach of using a tree-structured feature ex-
pansion to identify pairwise and higher-order multiplicative interactions in an L1-regularized logistic
regression model. This approach is special because it relaxes our hierarchical hereditary assump-
tion, which requires subset interactions to exist when their corresponding higher-order interaction
also exists (Section 3). Specifically, Shooter relaxes this assumption by only requiring at least one
(d − 1)-way interaction to exist when its corresponding d-way interaction exists.
With this relaxed assumption, Shooter can be evaluated in depth per level of interaction order. We
compare NID and Shooter under the relaxed assumption by also evaluating NID per level of in-
teraction order, where Algorithm 1 is specifically being evaluated. We note that our method of
finding a cutoff on interaction rankings (Section 4.3) strictly depends on the hierarchical hereditary
assumption both within the same interaction order and across orders, so instead we set cutoffs by
thresholding the interaction strengths by a low value, 1e−3.
For this comparison, we generate and consider interaction orders up to degree 5 (5-way interac-
tion) using the procedure discussed in Min et al. (2014), where the interactions do not have strict
hierarchical correspondence. We do not extend beyond degree 5 because MLP-M’s validation per-
formance begins to degrade quickly on the generated dataset. The sparsity factor was set to be 5%,
and to simplify the comparison, we did not add noise to the data.
In Table 6, we report precision and recall scores of Shooter and NID, where the scores for NID are
averaged over 10 trials. While Shooter performs near perfectly, NID obtains fair precision scores
but generally low recall. When we observe the interactions identified by NID per level of interaction
order, we find that the interactions across levels are always subsets or supersets of another predicted
interaction. This strict hierarchical correspondence would inevitably cause NID to miss many true
interactions under this experimental setting. The limitation of Shooter is that it must assume the
form of interactions, which in this case is multiplicative.
Table 6: Precision and recall of interaction detection per interaction order, where |I| denotes inter-
action cardinality. The Shooter implementation is deterministic.
I SPURIOUS MAIN EFFECTS IN F6

The {8, 9, 10} interaction in F6 can be approximated by independent parabolas in each variable, as follows. For example, let $x_{10} = 1$ and $z^2 = x_8^2 + x_9^2$; then by Taylor series expansion at $z = 0$,
$$\sqrt{z^2 + 1} \approx 1 + \frac{1}{2} z^2 = 1 + \frac{1}{2} x_8^2 + \frac{1}{2} x_9^2.$$
By symmetry under the assumed conditions,
$$\sqrt{x_8^2 + x_9^2 + x_{10}^2} \approx c + \frac{1}{2} x_8^2 + \frac{1}{2} x_9^2 + \frac{1}{2} x_{10}^2,$$
where $c$ is a constant.
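A quick numerical check of this approximation (our own, with $x_{10}$ fixed at 1 and small $x_8, x_9$):

```python
import numpy as np

x8, x9 = np.meshgrid(np.linspace(-0.5, 0.5, 5), np.linspace(-0.5, 0.5, 5))
exact = np.sqrt(x8**2 + x9**2 + 1.0)            # x10 = 1
approx = 1.0 + 0.5 * x8**2 + 0.5 * x9**2        # Taylor approximation
print(np.max(np.abs(exact - approx)))           # small when |x8|, |x9| << 1
```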
In Figure 11, we visualize the $x_8$, $x_9$, $x_{10}$ univariate networks of an MLP-M (Figure 2) that is trained on F6. The plots confirm our hypothesis that the MLP-M models the {8, 9, 10} interaction as spurious main effects with parabolas scaled by $\frac{1}{2}$.
J INTERACTION VISUALIZATION
Figure 12: A heatmap showing relative housing prices in California based on longitude and latitude.
In Figure 12, we visualize the interaction between longitude and latitude for predicting relative
housing price in California. This visualization is extracted from the longitude-latitude interaction
network within MLP-Cutoff 5 , which was trained on the cal housing dataset (Pace & Barry, 1997).
We can see that this visualization cannot be generated from longitude and latitude information in additive form; rather, it requires joint information from both features. The
highly interacting nature between longitude and latitude confirms the high rank of this interaction in
our NID experiments (see the {1, 2} interaction for cal housing in Figures 5 and 7).
5 MLP-Cutoff is a generalization of GA²Ms, which are known as intelligible models (Lou et al., 2013).