arXiv:2502.07672v1 [math.ST] 11 Feb 2025

CHEAP PERMUTATION TESTING

BY CARLES DOMINGO-ENRICH1,a, RAAZ DWIVEDI2,c, AND LESTER MACKEY1,b


1 Microsoft Research New England, a carlesd@microsoft.com; b lmackey@microsoft.com

2 Cornell Tech, c dwivedi@cornell.edu

Permutation tests are a popular choice for distinguishing distributions
and testing independence, due to their exact, finite-sample control of false
positives and their minimax optimality when paired with U-statistics. However, standard permutation tests are also expensive, requiring a test statistic to
be computed hundreds or thousands of times to detect a separation between
distributions. In this work, we offer a simple approach to accelerate testing:
group your datapoints into bins and permute only those bins. For U and V-
statistics, we prove that these cheap permutation tests have two remarkable
properties. First, by storing appropriate sufficient statistics, a cheap test can
be run in time comparable to evaluating a single test statistic. Second, cheap
permutation power closely approximates standard permutation power. As a
result, cheap tests inherit the exact false positive control and minimax opti-
mality of standard permutation tests while running in a fraction of the time.
We complement these findings with improved power guarantees for stan-
dard permutation testing and experiments demonstrating the benefits of cheap
permutations over standard maximum mean discrepancy (MMD), Hilbert-
Schmidt independence criterion (HSIC), random Fourier feature, Wilcoxon-
Mann-Whitney, cross-MMD, and cross-HSIC tests.

1. Introduction. Permutation tests [15, 14] are commonly used in statistics, machine
learning, and the sciences to test for independence, detect distributional discrepancies, eval-
uate fairness, and assess model quality [17, 38, 12, 29]. By simply permuting the labels
assigned to datapoints, these tests offer both exact, non-asymptotic control over false posi-
tives [22] and minimax optimal error rates when paired with popular U and V test statistics
[25, 4, 41, 42].
However, standard permutation tests are also expensive, requiring the recomputation of a
test statistic hundreds or thousands of times. This work offers a simple remedy: divide your
data into s bins and permute the bin labels. In Sec. 4 we show that this cheap permutation
strategy enjoys the best known error rates for U and V-statistic permutation testing, even when
the bin count s is independent of the sample size n. Moreover, we establish in Sec. 3 that,
after computing the initial test statistic, the same cheap tests can be run in time independent
of n by keeping track of simple sufficient statistics. Together, these findings give rise to new
exact tests that inherit the minimax rate optimality of standard permutation testing while
running in time comparable to the cost of a single test statistic.
The remainder of the paper is organized as follows. After reviewing related work in
Sec. 1.1 and hypothesis testing background in Sec. 2, we formally define our cheap per-
mutation tests in Sec. 3. Sec. 4 analyzes the power of binned permutation testing with U and
V-statistics, providing both improved guarantees for standard permutation tests and compa-
rable explicit finite-sample guarantees for cheap tests. In Sec. 5, we build on these results to

MSC2020 subject classifications: Primary 62G10; secondary 62G09, 62C20.


Keywords and phrases: Permutation testing, nonparametric, minimax optimality, homogeneity, independence,
quadratic test statistics, U-statistics, V-statistics, maximum mean discrepancy, Hilbert-Schmidt independence cri-
terion, random Fourier features, Wilcoxon-Mann-Whitney.


establish the minimax rate optimality of cheap testing for a variety of standard testing prob-
lems. We complement this theory in Sec. 6 with experiments demonstrating the benefits of
cheap testing over standard maximum mean discrepancy (MMD), Hilbert-Schmidt indepen-
dence criterion (HSIC), random Fourier feature, Wilcoxon-Mann-Whitney, cross-MMD, and
cross-HSIC tests. Finally, we conclude with a discussion of several extensions of the cheap
testing framework in Sec. 7.

1.1. Related work. A number of approaches to speeding up permutation testing have
appeared in the literature. Zmigrod, Vieira and Cotterell [54] showed that, for a special
class of G-bounded, additively-decomposable, and integer-valued test statistics, one can
run an independence permutation test using all n! distinct permutations of n points in time
O(Gn log(Gn) log(n)). Unfortunately, this class does not cover the more general U and V-
statistics studied in this work.
In the context of homogeneity testing, Chung et al. [9] replaced random permutations
with transpositions of random pairs of elements, showing that, for certain test statistics, each
transposed statistic could be computed in time independent of the sample size n. However, for
the higher-order U and V-statistics studied in this work, the cost of computing a transposed
statistic grows linearly in n, while our cheap permutation statistics can still be computed
in time independent of n. In addition, Chung et al. [9] did not analyze the power of their
proposed test.
Domingo-Enrich, Dwivedi and Mackey [13] used binned permutation in combination with
sample compression to reduce the runtime of homogeneity testing with a bounded kernel
maximum mean discrepancy test statistic. Here we show that binned permutation provides
benefits for a much broader range of tests, including independence tests and homogeneity
tests based on general quadratic test statistics (see Def. 1).
Anderson and Verbeeck [1] computed the exact mean and variance of the permutation dis-
tribution for a class of multivariate generalizations of the Mann-Whitney test called general-
ized pairwise comparison (GPC) statistics. However, the authors noted, “In order to construct
a hypothesis test for the GPC statistics, any use of the means and variances would require the
assumption of asymptotic normality,” an assumption which is not needed for our proposed
methodology. Relatedly, a wide variety of works accelerate testing by fitting an approxima-
tion to the permutation distribution [see, e.g., 53, 26, 27, 43, 21]. However, these works do not
analyze the power of their approximate tests, and, in each case, the approximation sacrifices
finite-sample validity.
Finally, Ramdas et al. [40] established the validity of a wide variety of generalized per-
mutation tests (involving arbitrary subsets of the permutation distribution and nonuniform
distributions over them). However, their work does not discuss power or computational com-
plexity, two matters of primary importance in the present work.

1.2. Notation. Given a permutation π of the indices [n] ≜ {1, . . . , n} and an index i ∈ [n], we write either π(i) or π_i to denote the index to which i is mapped. For integers p ≥ q ≥ 1, we define (p)_q ≜ p(p − 1) · · · (p − q + 1) and let i^p_q denote the set of all q-tuples that can be drawn without replacement from the set [p]. For a function g integrable under a probability measure P, we use the notation Pg ≜ E_{X∼P}[g(X)]. We use I for the indicator function.
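For readers who prefer executable definitions, the falling factorial (p)_q and the tuple set i^p_q can be sketched in Python (our own illustration; the function names are not from the paper):

```python
from itertools import permutations

def falling_factorial(p, q):
    """(p)_q = p (p-1) ... (p-q+1): the number of q-tuples that can be
    drawn without replacement from [p]."""
    out = 1
    for offset in range(q):
        out *= p - offset
    return out

def tuples_without_replacement(p, q):
    """The set i^p_q of all ordered q-tuples without repeated entries
    drawn from [p] = {1, ..., p}."""
    return list(permutations(range(1, p + 1), q))
```

By construction, |i^p_q| = (p)_q; for instance, (5)_2 = 5 · 4 = 20 ordered pairs.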

2. Background on hypothesis testing. We define a test ∆ as any measurable procedure that takes as input a dataset X ≜ (X_1, . . . , X_n) and outputs a probability in [0, 1]. Given two competing hypotheses (H_0 and H_1) concerning the data generating distribution, we interpret ∆(X) as the probability of rejecting the null hypothesis H_0 in favor of the alternative hypothesis H_1. The quality of a test is measured by its size, its probability of (incorrectly) rejecting

the null when the null is true, and its power, its probability of (correctly) rejecting H0 when
H1 is true. A test often comes paired with a nominal level, a desired upper bound α ∈ (0, 1)
on the Type I error. We say ∆ is a valid level-α test if its size never exceeds α when H0
holds and an exact level-α test if its size exactly equals α whenever H0 holds. All of the tests
studied in this work are exact and take the form

1
 if T (X) > c(X),
(1) ∆(X) = ξ(X) if T (X) = c(X), and

0 if T (X) < c(X),

where T (X) is called the test statistic, c(X) is called the critical value, and ξ(X) is a rejection
probability chosen to ensure exactness. We will consider two broad classes of tests in this
work: tests of homogeneity and tests of independence.

2.1. Homogeneity testing. Homogeneity tests (also known as two-sample tests) test for equality of distribution. More precisely, given a sample Y = (Y_i)_{i=1}^{n_1} drawn i.i.d. from an unknown distribution P and an independent sample Z = (Z_j)_{j=1}^{n_2} drawn i.i.d. from an unknown distribution Q, a homogeneity test judges the hypotheses

H_0 : P = Q versus H_1 : P ≠ Q.
A common approach to testing homogeneity is to threshold a quadratic test statistic (QTS).

DEFINITION 1 (Quadratic test statistic (QTS)). Given samples Y and Z of size n_1 and n_2 respectively, we call T(Y, Z) a quadratic test statistic if T is a quadratic form of the empirical distributions P̂ ≜ (1/n_1) Σ_{y∈Y} δ_y and Q̂ ≜ (1/n_2) Σ_{z∈Z} δ_z. That is, there exist base functions ϕ_{n_1}, ϕ_{n_2}, and ϕ_{n_1,n_2} for which T satisfies

(2) T(Y, Z) = (P̂ × P̂)ϕ_{n_1} + (P̂ × Q̂)ϕ_{n_1,n_2} + (Q̂ × Q̂)ϕ_{n_2}.

An important example of a QTS is the canonical homogeneity U-statistic.

DEFINITION 2 (Homogeneity U-statistic). Given samples Y and Z of size n_1 and n_2 respectively, we call U(Y, Z), or U_{n_1,n_2} for short, a homogeneity U-statistic if

(3) U_{n_1,n_2} = (1/((n_1)_2 (n_2)_2)) Σ_{(i_1,i_2)∈i^{n_1}_2} Σ_{(j_1,j_2)∈i^{n_2}_2} h_ho(Y_{i_1}, Y_{i_2}; Z_{j_1}, Z_{j_2}), and

h_ho(y_1, y_2; z_1, z_2) = g(y_1, y_2) + g(z_1, z_2) − g(y_1, z_2) − g(y_2, z_1)

for some symmetric real-valued base function g.

Notably, any homogeneity U-statistic can be written as a QTS (2) via the mappings ϕ_w(y, z) = (1/(1 − 1/w))(g(y, z) − (g(y, y) + g(z, z))/(2w)) for w ∈ {n_1, n_2} and ϕ_{n_1,n_2}(y, z) = −2g(y, z). Moreover, many commonly used test statistics can be recovered as homogeneity U-statistics with the right choice of g.

EXAMPLE 1 (Kernel two-sample test statistics [20]). When g is chosen to be a positive-definite kernel k [48, Def. 4.15], then U_{n_1,n_2} (3) is a kernel two-sample test statistic and an unbiased estimator of the squared maximum mean discrepancy (MMD) between the distributions P and Q:

(4) MMD²_k(P, Q) ≜ (P × P)k + (Q × Q)k − 2(P × Q)k.
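The unbiasedness structure in Ex. 1 is easy to check numerically. The sketch below (our own illustration; all function names are our own) evaluates U_{n_1,n_2} directly from Def. 2 with a Gaussian kernel g = k and compares it against the familiar unbiased MMD² estimator:

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Positive-definite Gaussian kernel, standing in for the base function g."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * bandwidth ** 2))

def homogeneity_u_statistic(Y, Z, g):
    """U_{n1,n2} of Def. 2: average of h_ho over distinct index pairs."""
    n1, n2 = len(Y), len(Z)
    total = 0.0
    for i1 in range(n1):
        for i2 in range(n1):
            if i1 == i2:
                continue
            for j1 in range(n2):
                for j2 in range(n2):
                    if j1 == j2:
                        continue
                    total += (g(Y[i1], Y[i2]) + g(Z[j1], Z[j2])
                              - g(Y[i1], Z[j2]) - g(Y[i2], Z[j1]))
    return total / (n1 * (n1 - 1) * n2 * (n2 - 1))

def unbiased_mmd2(Y, Z, g):
    """Standard unbiased estimate of MMD_k^2(P, Q) (Ex. 1)."""
    n1, n2 = len(Y), len(Z)
    kyy = sum(g(Y[i], Y[j]) for i in range(n1) for j in range(n1) if i != j)
    kzz = sum(g(Z[i], Z[j]) for i in range(n2) for j in range(n2) if i != j)
    kyz = sum(g(Y[i], Z[j]) for i in range(n1) for j in range(n2))
    return kyy / (n1 * (n1 - 1)) + kzz / (n2 * (n2 - 1)) - 2 * kyz / (n1 * n2)
```

On any small pair of samples the two quantities agree to floating-point precision, confirming that the quadruple sum in (3) collapses to the usual unbiased MMD² estimate.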

EXAMPLE 2 (Energy statistics [3, 50, 44]). When g is the negative Euclidean distance on R^d, U_{n_1,n_2} (3) is called an energy statistic. More generally, if g(y, z) = −ρ(y, z) for a semimetric ρ of negative type [44, Defs. 1-2], then U_{n_1,n_2} is a generalized energy statistic and an unbiased estimate of the energy distance between P and Q:

E_ρ(P, Q) = 2(P × Q)ρ − (P × P)ρ − (Q × Q)ρ.

In both of these examples, the homogeneity U-statistic (3) is degenerate under the null with (n_1 + n_2) U_{n_1,n_2} converging to a complicated, P-dependent, non-Gaussian limit as n_1, n_2 → ∞ [45, Sec. 5.5.2]. As a result, one typically turns to permutation testing, as described in Sec. 3.1, to select the critical values for these statistics [49, 30, 50, 3].

2.2. Independence testing. Independence tests test for dependence between paired variables. More precisely, given a paired sample (Y, Z) ≜ (Y_i, Z_i)_{i=1}^n drawn i.i.d. from an unknown distribution Π with marginals P and Q, independence testing judges the hypotheses

H_0 : Π = P × Q versus H_1 : Π ≠ P × Q.
A common approach to testing independence is to threshold an independence V-statistic.

DEFINITION 3 (Independence V-statistic). Given a paired sample (Y, Z) of size n, we call V(Y, Z), or V_n for short, an independence V-statistic if

(5) V_n = (1/n⁴) Σ_{(i_1,i_2,i_3,i_4)∈[n]⁴} h_in((Y_{i_1}, Z_{i_1}), (Y_{i_2}, Z_{i_2}), (Y_{i_3}, Z_{i_3}), (Y_{i_4}, Z_{i_4})), for

(6) h_in((y_1, z_1), (y_2, z_2), (y_3, z_3), (y_4, z_4))
≜ (g_Y(y_1, y_2) + g_Y(y_3, y_4) − g_Y(y_1, y_3) − g_Y(y_2, y_4))
× (g_Z(z_1, z_2) + g_Z(z_3, z_4) − g_Z(z_1, z_3) − g_Z(z_2, z_4))

for some symmetric real-valued base functions g_Y and g_Z.

Many commonly used test statistics can be recovered as independence V-statistics with the
right choices of gY and gZ .

EXAMPLE 3 (Hilbert-Schmidt independence criterion (HSIC) [18]). When g_Y and g_Z are positive-definite kernels, (1/4)V_n (5) is the Hilbert-Schmidt independence criterion, i.e., the squared maximum mean discrepancy MMD²_k(Π̂, P̂ × Q̂) (4), where k = g_Y g_Z, Π̂ ≜ (1/n) Σ_{i=1}^n δ_{(Y_i,Z_i)}, P̂ ≜ (1/n) Σ_{i=1}^n δ_{Y_i}, and Q̂ ≜ (1/n) Σ_{i=1}^n δ_{Z_i} [42, App. B].
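The identity in Ex. 3 can be verified numerically: (1/4)V_n computed by brute force from Def. 3 coincides with the standard centered-Gram expression (1/n²)tr(KHLH) for the biased HSIC estimate, where H = I − (1/n)11ᵀ. A minimal sketch (our own illustration, assuming precomputed Gram matrices):

```python
import numpy as np

def quarter_v_statistic(Ky, Lz):
    """(1/4) V_n of Def. 3 by brute force over all index quadruples.
    Ky[i, j] = g_Y(Y_i, Y_j) and Lz[i, j] = g_Z(Z_i, Z_j)."""
    n = Ky.shape[0]
    total = 0.0
    for i1 in range(n):
        for i2 in range(n):
            for i3 in range(n):
                for i4 in range(n):
                    hy = Ky[i1, i2] + Ky[i3, i4] - Ky[i1, i3] - Ky[i2, i4]
                    hz = Lz[i1, i2] + Lz[i3, i4] - Lz[i1, i3] - Lz[i2, i4]
                    total += hy * hz
    return total / (4 * n ** 4)

def hsic_biased(Ky, Lz):
    """Biased HSIC estimate (1/n^2) tr(K H L H) with H = I - (1/n) 11^T."""
    n = Ky.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Ky @ H @ Lz @ H) / n ** 2
```

The agreement follows because double-centering each Gram matrix absorbs the four sign patterns appearing in (6).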

EXAMPLE 4 (Distance covariance [51, 44]). When g_Y(y_1, y_2) = ∥y_1 − y_2∥_2 and g_Z(z_1, z_2) = ∥z_1 − z_2∥_2 for y_1 ∈ R^{d_1} and z_1 ∈ R^{d_2}, (1/4)V_n (5) is called the distance covariance and forms the basis for the energy test of independence. More generally, if g_Y and g_Z are semimetrics of negative type [44, Defs. 1-2], then V_n is a generalized distance covariance.

In both of these examples, the independence V-statistic (6) is degenerate under the null
with nVn converging to a complicated, P and Q-dependent, non-Gaussian limit as n → ∞
[45, Sec. 5.5.2]. As a result, one typically turns to permutation testing, for example, as de-
scribed in Sec. 3.3, to select critical values for these statistics [51, 10].

3. Cheap permutation testing. Given a test statistic, permutation testing provides an automated procedure for selecting the critical value and rejection probability of a test (1). In this section, we review standard (Monte Carlo) permutation tests for homogeneity and independence and introduce new cheap permutation tests that reduce the runtime and memory demands. The results of this section are summarized in Tab. 1.

3.1. Permutation tests of homogeneity. Let X ≜ (X_i)_{i=1}^n = (Y_1, . . . , Y_{n_1}, Z_1, . . . , Z_{n_2}) represent a concatenation of the observed samples Y = (Y_i)_{i=1}^{n_1} and Z = (Z_i)_{i=1}^{n_2} with n ≜ n_1 + n_2. In a standard permutation test of homogeneity, one samples B independent permutations (π_b)_{b=1}^B of the indices [n], computes the permuted test statistics (T(X^{π_b}))_{b=0}^B for π_0 the identity permutation and X^π ≜ (X_{π(i)})_{i=1}^n, and finally selects a 1 − α quantile of the empirical distribution (1/(B+1)) Σ_{b=0}^B δ_{T(X^{π_b})} as the critical test value c(X).

For a quadratic test statistic (Def. 1), this permutation test only depends on the data via the 3n² sufficient statistics

S_ho ≜ {ϕ(x_i, x_j) | i, j ∈ [n], ϕ ∈ {ϕ_{n_1}, ϕ_{n_2}, ϕ_{n_1,n_2}}}.

Hence, if we let c_ϕ denote the maximum time required to evaluate a single element of S_ho, then one can precompute and store the elements of S_ho given Θ(c_ϕ n²) time and Θ(n²) memory and, thereafter, compute all permuted test statistics using Θ(Bn²) elementary operations. When the quadratic memory cost is unsupportable, the elements of S_ho can instead be recomputed for each permutation at a total cost of Θ(c_ϕ Bn²) time and Θ(n) memory.

3.2. Cheap permutation tests of homogeneity. To alleviate both the time and memory burden of standard permutation testing, we propose to permute datapoint bins instead of individual points. More precisely, we partition the indices [n] into s consecutive sets I_1, . . . , I_s of equal size, identify the corresponding datapoint bins X_j ≜ (X_i)_{i∈I_j}, and define the bin-permuted samples Y^π ≜ (X_{π(i)})_{i=1}^{s_1} and Z^π ≜ (X_{π(i)})_{i=s_1+1}^{s} for each permutation π on [s] and s_1 ≜ s n_1/n. Importantly, the bin-permuted QTS,

T(Y^π, Z^π) = (1/n_1²) Σ_{i,j=1}^{s_1} Φ^{n_1}_{π(i)π(j)} + (1/(n_1n_2)) Σ_{i=1}^{s_1} Σ_{j=s_1+1}^{s} Φ^{n_1,n_2}_{π(i)π(j)} + (1/n_2²) Σ_{i,j=s_1+1}^{s} Φ^{n_2}_{π(i)π(j)},

only depends on the data via the 3s² sufficient statistics

S_ho,cheap ≜ {Φ^w_{ij} ≜ Σ_{a∈I_i} Σ_{b∈I_j} ϕ_w(X_a, X_b) | i, j ∈ [s], w ∈ {n_1, n_2, (n_1, n_2)}}.

Hence, one can precompute and store the elements of S_ho,cheap using only Θ(s²) memory and Θ(c_ϕ n²) time and subsequently compute all bin-permuted statistics using only Θ(Bs²) elementary operations. Alg. 1 details the steps of this cheap homogeneity test, which recovers the standard permutation test when s = n.
Notably, the overhead of cheap permutation is negligible whenever Bs2 ≪ cϕ n2 . In Secs. 4
and 6, we will see that it suffices to choose s as a slow-growing or even constant function of
n, yielding total runtime comparable to that of evaluating the original test statistic.
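The identity underlying this section, namely that the bin-permuted QTS is a function of the bin-level sums Φ^w_{ij} alone, can be checked directly. In the sketch below (our own illustration, with arbitrary stand-in base functions), the statistic assembled from the Φ sums matches a direct evaluation of the QTS on the bin-permuted samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, s = 8, 8, 4
n = n1 + n2
m = n // s                        # bin size; assumes s divides n
s1 = s * n1 // n                  # number of bins covering the Y-sample
X = rng.normal(size=n)            # concatenated samples (Y first, then Z)

# Arbitrary stand-ins for the base functions phi_{n1}, phi_{n2}, phi_{n1,n2}.
phi = {"n1": lambda a, b: a * b,
       "n2": lambda a, b: (a - b) ** 2,
       "cross": lambda a, b: a + b}

bins = [list(range(i * m, (i + 1) * m)) for i in range(s)]

# Bin-level sufficient statistics Phi^w_{ij}.
Phi = {w: np.array([[sum(f(X[a], X[b]) for a in bins[i] for b in bins[j])
                     for j in range(s)] for i in range(s)])
       for w, f in phi.items()}

def qts_direct(Y, Z):
    """Evaluate the QTS (2) directly on two samples."""
    return (sum(phi["n1"](a, b) for a in Y for b in Y) / n1 ** 2
            + sum(phi["cross"](a, b) for a in Y for b in Z) / (n1 * n2)
            + sum(phi["n2"](a, b) for a in Z for b in Z) / n2 ** 2)

def qts_from_bins(pi):
    """Evaluate the bin-permuted QTS from the sufficient statistics alone."""
    return (sum(Phi["n1"][pi[i], pi[j]] for i in range(s1) for j in range(s1)) / n1 ** 2
            + sum(Phi["cross"][pi[i], pi[j]] for i in range(s1) for j in range(s1, s)) / (n1 * n2)
            + sum(Phi["n2"][pi[i], pi[j]] for i in range(s1, s) for j in range(s1, s)) / n2 ** 2)

pi = rng.permutation(s)
Y_pi = np.concatenate([X[bins[i]] for i in pi[:s1]])
Z_pi = np.concatenate([X[bins[i]] for i in pi[s1:]])
t_bins, t_direct = qts_from_bins(pi), qts_direct(Y_pi, Z_pi)
```

Because each term of the QTS is a sum over pairs of points, regrouping the pairs by bin membership leaves the value unchanged; only the s × s table of bin sums is needed at permutation time.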

3.3. Wild bootstrap tests of independence. To ease notation, we will focus on the case of even n. In a standard wild bootstrap permutation test of independence [10], one begins by sampling B independent wild bootstrap permutations (π_b)_{b=1}^B of the indices [n]. Each wild bootstrap permutation π swaps the indices i and swap(i) ≜ i + n/2 − n·I{i > n/2} with probability 1/2 and does so independently for each i ≤ n/2. Then, one computes the permuted test statistics (T(Y, Z^{π_b}))_{b=0}^B for π_0 the identity permutation and Z^π ≜ (Z_{π(i)})_{i=1}^n and finally selects a 1 − α quantile of the empirical distribution (1/(B+1)) Σ_{b=0}^B δ_{T(Y,Z^{π_b})} as the critical test value c(Y, Z).

Now consider the independence V-statistic V(Y, Z) of Def. 3, and introduce the shorthand

(ḡ_{Y,i}, ḡ_{Z,i}, ḡ_Y, ḡ_Z) ≜ ((1/n) Σ_{j=1}^n g_Y(Y_i, Y_j), (1/n) Σ_{j=1}^n g_Z(Z_i, Z_j), (1/n) Σ_{i=1}^n ḡ_{Y,i}, (1/n) Σ_{i=1}^n ḡ_{Z,i})

and G_{ij,kℓ} ≜ (g_Y(Y_i, Y_j) − ḡ_{Y,i} − ḡ_{Y,j} + ḡ_Y)(g_Z(Z_k, Z_ℓ) − ḡ_{Z,k} − ḡ_{Z,ℓ} + ḡ_Z)
6

TABLE 1
Time and memory costs for cheap and standard permutation tests with n datapoints, B permutations, and s bins. The multipliers c_ϕ and c_g denote the cost of evaluating a single homogeneity QTS base function (see Def. 1) and a single independence V-statistic base function (see Def. 3). All costs are reported up to a universal constant.

Permutation test | Test statistic | T(Y, Z) time | Permutation time | Memory
Homogeneity | QTS (Def. 1) | c_ϕ n² | Bn² or c_ϕ Bn² | n² or n
Cheap homogeneity (Alg. 1) | QTS (Def. 1) | c_ϕ n² | Bs² | s²
Independence | V-statistic (Def. 3) | c_g n² | Bn² or c_g Bn² | n² or n
Cheap independence (Alg. 2) | V-statistic (Def. 3) | c_g n² | Bs² | s²

(For the standard tests, the first permutation time and memory pair corresponds to precomputing the sufficient statistics, and the second to recomputing them for each permutation.)

Algorithm 1: Cheap homogeneity testing

Input: Samples (Y_i)_{i=1}^{n_1}, (Z_i)_{i=1}^{n_2}, bin count s, base functions ϕ_{n_1}, ϕ_{n_2}, ϕ_{n_1,n_2} (Def. 1), level α, permutation count B
1 Define (X_i)_{i=1}^n ≜ (Y_1, . . . , Y_{n_1}, Z_1, . . . , Z_{n_2}) for n ← n_1 + n_2 and s_1 ← s n_1/n
2 Split the indices {1, . . . , n} into s consecutive bins I_1, . . . , I_s of equal size
3 // Compute sufficient statistics using Θ(c_ϕ n²) time and Θ(s²) memory
4 for i, j ∈ [s] and w ∈ {n_1, n_2, (n_1, n_2)} do Φ^w_{ij} ← Σ_{a∈I_i, b∈I_j} ϕ_w(X_a, X_b)
5 // Compute original and permuted test statistics using Θ(Bs²) elementary operations
6 for b = 0, 1, 2, . . . , B do
7   π ← identity permutation if b = 0 else uniform permutation of [s]
8   T_b ← (1/n_1²) Σ_{i,j=1}^{s_1} Φ^{n_1}_{π(i)π(j)} + (1/(n_1n_2)) Σ_{i=1}^{s_1} Σ_{j=s_1+1}^{s} Φ^{n_1,n_2}_{π(i)π(j)} + (1/n_2²) Σ_{i,j=s_1+1}^{s} Φ^{n_2}_{π(i)π(j)}
9 end
10 // Return rejection probability
11 R ← 1 + number of permuted statistics (T_b)_{b=1}^B smaller than T_0 if ties are broken at random
12 return p_α = min(1, max(0, R − (1 − α)(B + 1)))
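A compact Python transcription of Alg. 1 might look as follows (a sketch under the assumptions that s divides n and s_1 = s n_1/n is an integer; the interface and names are our own, not the paper's code). Ties are broken by crediting a uniformly random subset of tied permuted statistics as smaller:

```python
import numpy as np

def cheap_homogeneity_test(Y, Z, s, phi_n1, phi_n2, phi_cross,
                           alpha=0.05, B=99, seed=0):
    """Sketch of Alg. 1. Returns (p_reject, statistics), where p_reject is
    the rejection probability p_alpha and statistics[0] is T_0."""
    rng = np.random.default_rng(seed)
    n1, n2 = len(Y), len(Z)
    n = n1 + n2
    s1 = s * n1 // n
    m = n // s
    X = list(Y) + list(Z)
    bins = [range(i * m, (i + 1) * m) for i in range(s)]

    # Bin-level sufficient statistics Phi^w_{ij}.
    def bin_sums(f):
        return np.array([[sum(f(X[a], X[b]) for a in bins[i] for b in bins[j])
                          for j in range(s)] for i in range(s)])
    P1, P2, P12 = bin_sums(phi_n1), bin_sums(phi_n2), bin_sums(phi_cross)

    stats = []
    for b in range(B + 1):
        pi = np.arange(s) if b == 0 else rng.permutation(s)
        t = (P1[np.ix_(pi[:s1], pi[:s1])].sum() / n1 ** 2
             + P12[np.ix_(pi[:s1], pi[s1:])].sum() / (n1 * n2)
             + P2[np.ix_(pi[s1:], pi[s1:])].sum() / n2 ** 2)
        stats.append(t)

    # R = 1 + #{b >= 1 : T_b < T_0}, breaking ties at random.
    T0, rest = stats[0], np.array(stats[1:])
    ties = int((rest == T0).sum())
    R = 1 + int((rest < T0).sum()) + int(rng.integers(0, ties + 1))
    return min(1.0, max(0.0, R - (1 - alpha) * (B + 1))), stats
```

With continuous data the tie count is almost surely zero, and the returned value is the rejection probability p_α of line 12.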

for all indices i, j, k, ℓ ∈ [n]. With V(Y, Z) as its test statistic, the standard wild bootstrap test only depends on the data via the n²/2 sufficient statistics

S_1 ≜ {Σ_{a,b∈{i,swap(i)}} (G_{ab,ba}, G_{ab,swap(b)swap(a)}) | i ∈ [n/2]} and
S_2 ≜ {Σ_{a∈{i,swap(i)}, b∈{j,swap(j)}} (G_{ab,ba}, G_{ab,swap(b)swap(a)}),
Σ_{a∈{i,swap(i)}, b∈{j,swap(j)}} (G_{ab,bswap(a)}, G_{ab,swap(b)a}) | j < i ∈ [n/2]},

since, for each wild bootstrap permutation π and j ∈ [n], we have π(j) ∈ {j, swap(j)} and

(1/4) V(Y, Z^π) = (1/n²) Σ_{i=1}^{n/2} Σ_{a,b∈{i,swap(i)}} G_{ab,π(a)π(b)} + (2/n²) Σ_{i=1}^{n/2} Σ_{j=1}^{i−1} Σ_{a∈{i,swap(i)}, b∈{j,swap(j)}} G_{ab,π(a)π(b)}.

Hence, if we let c_g denote the maximum time required to evaluate g_Y or g_Z, then one can precompute and store the elements of S_1 ∪ S_2 given Θ(c_g n²) time and Θ(n²) memory and, thereafter, compute all permuted test statistics using Θ(Bn²) elementary operations. When the quadratic memory cost is unsupportable, the elements of S_2 can instead be recomputed for each permutation at a total cost of Θ(c_g Bn²) time and Θ(n) memory.
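To make the swap structure concrete, the sketch below (our own illustration) draws a wild bootstrap permutation and confirms that the within-pair and cross-pair block sums reproduce (1/4)V(Y, Z^π) recomputed from scratch, here expressed through centered Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, 2))
C = rng.normal(size=(n, 2))
K = A @ A.T                       # K[i, j] = g_Y(Y_i, Y_j)
L = C @ C.T                       # L[i, j] = g_Z(Z_i, Z_j)
H = np.eye(n) - np.ones((n, n)) / n
Kc, Lc = H @ K @ H, H @ L @ H     # doubly centered Gram matrices

def swap(i):                      # 0-indexed version of swap(i) = i + n/2 - n*I{i > n/2}
    return (i + n // 2) % n

# Wild bootstrap permutation: swap i and swap(i) independently w.p. 1/2.
pi = np.arange(n)
for i in range(n // 2):
    if rng.random() < 0.5:
        pi[i], pi[swap(i)] = swap(i), i

def G(a, b, c, d):                # G_{ab,cd} of Sec. 3.3 via centered Grams
    return Kc[a, b] * Lc[c, d]

# Block decomposition over the pairs {i, swap(i)}.
blocks = [(i, swap(i)) for i in range(n // 2)]
total = sum(G(a, b, pi[a], pi[b]) for blk in blocks for a in blk for b in blk)
total += 2 * sum(G(a, b, pi[a], pi[b])
                 for ii in range(n // 2) for jj in range(ii)
                 for a in blocks[ii] for b in blocks[jj])
quarter_V_blocks = total / n ** 2

# Direct recomputation on the permuted Z-sample: (1/n^2) tr(K H L^pi H).
Lpi = L[np.ix_(pi, pi)]
quarter_V_direct = np.trace(K @ H @ Lpi @ H) / n ** 2
```

The check works because double-centering commutes with permuting the sample, so the centered Gram of the permuted Z-sample is exactly Lc reindexed by π.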

3.4. Cheap wild bootstrap tests of independence. To alleviate both the time and memory burden of standard wild bootstrap testing, we again propose to permute datapoint bins instead of individual points. To simplify our presentation, we will assume that s is even and divides n evenly. As before, we partition the indices [n] into s consecutive sets I_1, . . . , I_s of equal size, identify the corresponding datapoint bins Z_j ≜ (Z_i)_{i∈I_j}, and define the bin-permuted sample Z^π ≜ (Z_{π(i)})_{i=1}^s for each wild bootstrap permutation π on [s]. If we further define the combination bins I′_i ≜ I_i ∪ I_{i+s/2} for each i ∈ [s/2], then the bin-permuted V-statistic

(1/4) V(Y, Z^π) = (1/n²) Σ_{i=1}^{s/2} D_{iπ(i)} + (2/n²) Σ_{i=1}^{s/2} Σ_{j=1}^{i−1} S_{ijπ(j)π(i)}

only depends on the data via the s²/2 sufficient statistics

S′_1 ≜ {(D_{ii}, D_{i(i+s/2)}) ≜ Σ_{a,b∈I′_i} (G_{ab,ba}, G_{ab,swap(b)swap(a)}) | i ∈ [s/2]} and
S′_2 ≜ {(S_{ijji}, S_{ij(j+s/2)(i+s/2)}) ≜ Σ_{a∈I′_i, b∈I′_j} (G_{ab,ba}, G_{ab,swap(b)swap(a)}),
(S_{ijj(i+s/2)}, S_{ij(j+s/2)i}) ≜ Σ_{a∈I′_i, b∈I′_j} (G_{ab,bswap(a)}, G_{ab,swap(b)a}) | j < i ∈ [s/2]}.

Hence, one can precompute and store the elements of S′_1 ∪ S′_2 using only Θ(s²) memory and Θ(c_g n²) time and subsequently compute all bin-permuted statistics using only Θ(Bs²) elementary operations. Alg. 2 details the steps of this cheap independence test, which recovers the standard wild bootstrap permutation test when s = n.

As in the homogeneity setting, we find that the overhead of cheap permutation is negligible whenever Bs² ≪ c_g n², and we will see in Secs. 4 and 6 that a slow-growing or constant setting of s suffices to maintain power while substantially reducing runtime.
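The same check extends to the bin-level construction: assembling the D- and S-type sums over the combination bins reproduces (1/4)V(Y, Z^π) for a bin-level wild bootstrap permutation. A sketch (our own illustration; Gram matrices stand in for the g_Y and g_Z evaluations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 12, 4                      # s even and dividing n; bin size m = 3
m = n // s
A = rng.normal(size=(n, 2))
C = rng.normal(size=(n, 2))
K, L = A @ A.T, C @ C.T
H = np.eye(n) - np.ones((n, n)) / n
Kc, Lc = H @ K @ H, H @ L @ H     # doubly centered Gram matrices

swap = lambda a: (a + n // 2) % n                 # 0-indexed point-level swap
bins = [list(range(i * m, (i + 1) * m)) for i in range(s)]
comb = [bins[i] + bins[i + s // 2] for i in range(s // 2)]  # combination bins I'_i

# Wild bootstrap permutation of the s bins: swap bins i and i + s/2 w.p. 1/2.
flip = rng.random(s // 2) < 0.5
ident = lambda a: a

total = 0.0
for i in range(s // 2):
    zi = swap if flip[i] else ident
    # D_{i pi(i)}: both Z-indices mapped by the same (possibly swapped) bin.
    total += sum(Kc[a, b] * Lc[zi(a), zi(b)] for a in comb[i] for b in comb[i])
    for j in range(i):
        zj = swap if flip[j] else ident
        # S_{ij pi(j) pi(i)}: Z-indices mapped bin-by-bin.
        total += 2 * sum(Kc[a, b] * Lc[zi(a), zj(b)]
                         for a in comb[i] for b in comb[j])
quarter_V_bins = total / n ** 2

# Direct recomputation: the induced point-level permutation swaps a and swap(a)
# exactly for the points of flipped combination bins.
pi_pt = np.arange(n)
for i in range(s // 2):
    if flip[i]:
        for a in comb[i]:
            pi_pt[a] = swap(a)
Lpi = L[np.ix_(pi_pt, pi_pt)]
quarter_V_direct = np.trace(K @ H @ Lpi @ H) / n ** 2
```

Since the bins are consecutive and of equal size, swapping bins i and i + s/2 moves each of their points by exactly n/2 positions, which is why the point-level swap map reappears at the bin level.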

3.5. Finite-sample exactness. We conclude this section by noting that the cheap permu-
tation tests of Algs. 1 and 2 are exact, that is, their sizes exactly match the nominal level α for
all sample sizes and data distributions. This is a principal advantage of permutation testing
over asymptotic tests that (a) require precise knowledge of a test statistic’s asymptotic null
distribution and (b) can violate the level constraint at any finite sample size. The exactness
follows immediately from the fact that cheap permutation tests are standard permutation tests
over bins of datapoints and a standard exchangeability argument due to Hoeffding [22].

PROPOSITION 1 (Cheap permutation tests are exact). Under their respective null hypotheses, the cheap permutation tests of Algs. 1 and 2 reject the null with probability α.
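Exactness can also be checked empirically: under the null, the rejection probability returned by a cheap permutation test has expectation exactly α. The sketch below (our own minimal test with a difference-of-bin-means statistic, not the paper's code) estimates the size by Monte Carlo:

```python
import numpy as np

def bin_permutation_rejection(X, s1, s, alpha, B, rng):
    """Rejection probability of a cheap permutation test whose statistic is
    the difference of bin-sum totals; X is split into s equal-size bins."""
    bins = np.array_split(X, s)
    sums = np.array([b.sum() for b in bins])
    def stat(pi):
        return sums[pi[:s1]].sum() - sums[pi[s1:]].sum()
    T0 = stat(np.arange(s))
    rest = np.array([stat(rng.permutation(s)) for _ in range(B)])
    ties = int((rest == T0).sum())
    R = 1 + int((rest < T0).sum()) + int(rng.integers(0, ties + 1))
    return min(1.0, max(0.0, R - (1 - alpha) * (B + 1)))

rng = np.random.default_rng(0)
alpha, B, reps = 0.05, 19, 2000
rejections = [bin_permutation_rejection(rng.normal(size=16), 2, 4, alpha, B, rng)
              for _ in range(reps)]
size = float(np.mean(rejections))  # Monte Carlo estimate of the test's size
```

With B = 19 and α = 0.05, we have α(B + 1) = 1, so the test rejects exactly when the original statistic is ranked strictly highest (after random tie-breaking), an event of probability 1/20 under exchangeability of the bins.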

4. Power of cheap testing. We now turn our attention to the power of cheap testing with
homogeneity U-statistics (Def. 2) and independence V-statistics (Def. 3).

4.1. Power guarantees for homogeneity testing. For the homogeneity U-statistic of Def. 2, define the symmetrized version of h_ho,

(7) h̄_ho(y_1, y_2; z_1, z_2) ≜ (1/(2!2!)) Σ_{(i_1,i_2)∈i²_2} Σ_{(j_1,j_2)∈i²_2} h_ho(y_{i_1}, y_{i_2}; z_{j_1}, z_{j_2}),

and the variance components

ψ_{Y,1} ≜ max{Var(E[h̄_ho(Y_1, Y_2; Z_1, Z_2) | Y_1]), Var(E[h̄_ho(Z_3, Y_2; Z_1, Z_2) | Z_3])},
(8) ψ_{Z,1} ≜ max{Var(E[h̄_ho(Y_1, Y_2; Z_1, Z_2) | Z_1]), Var(E[h̄_ho(Y_1, Y_2; Y_3, Z_2) | Y_3])}, and
ψ_{YZ,2} ≜ max{E[g²(Y_1, Y_2)], E[g²(Y_1, Z_1)], E[g²(Z_1, Z_2)]}.
Our first theorem, proved in App. B, bounds the minimum separation between P and Q
required for standard and cheap homogeneity tests to reject with power at least 1 − β . Here,
separation is measured by the mean of the test statistic E[Un1 ,n2 ], which is zero under the
null (P = Q) but may be non-zero under the alternative (P ̸= Q).

Algorithm 2: Cheap independence testing

Input: Paired sample ((Y_i, Z_i))_{i=1}^n, bin count s, base functions g_Y, g_Z (Def. 3), level α, wild bootstrap count B
1 Split the indices {1, . . . , n} into s consecutive bins I_1, . . . , I_s of equal size and define the combination bins I′_i ≜ I_i ∪ I_{i+s/2} for i ∈ [s/2]
2 // Compute sufficient statistics using Θ(c_g n²) time and Θ(s²) memory
3 for a ∈ [n] do (ḡ_{Y,a}, ḡ_{Z,a}) ← (1/n) Σ_{b=1}^n (g_Y(Y_a, Y_b), g_Z(Z_a, Z_b))
4 (ḡ_Y, ḡ_Z) ← (1/n) Σ_{a=1}^n (ḡ_{Y,a}, ḡ_{Z,a})
5 Define the mappings swap(a) ≜ a + n/2 − n·I{a > n/2} and G_{ab,cd} ≜ (g_Y(Y_a, Y_b) − ḡ_{Y,a} − ḡ_{Y,b} + ḡ_Y)(g_Z(Z_c, Z_d) − ḡ_{Z,c} − ḡ_{Z,d} + ḡ_Z)
6 for i ∈ [s/2] do
7   (D_{ii}, D_{i(i+s/2)}) ← Σ_{a,b∈I′_i} (G_{ab,ba}, G_{ab,swap(b)swap(a)})
8   for j ∈ [i − 1] do
9     (S_{ijji}, S_{ij(j+s/2)(i+s/2)}) ← Σ_{a∈I′_i, b∈I′_j} (G_{ab,ba}, G_{ab,swap(b)swap(a)})
10    (S_{ij(j+s/2)i}, S_{ijj(i+s/2)}) ← Σ_{a∈I′_i, b∈I′_j} (G_{ab,swap(b)a}, G_{ab,bswap(a)})
11  end
12 end
13 // Compute original and permuted test statistics using Θ(Bs²) elementary operations
14 for b = 0, 1, 2, . . . , B do
15   π ← identity permutation if b = 0 else wild bootstrap permutation of [s]
16   T_b ← (1/n²) Σ_{i=1}^{s/2} D_{iπ(i)} + (2/n²) Σ_{i=1}^{s/2} Σ_{j=1}^{i−1} S_{ijπ(j)π(i)}
17 end
18 // Return rejection probability
19 R ← 1 + number of permuted statistics (T_b)_{b=1}^B smaller than T_0 if ties are broken at random
20 return p_α = min(1, max(0, R − (1 − α)(B + 1)))

THEOREM 1 (Power of cheap homogeneity: finite variance). Consider a homogeneity U-statistic U_{n_1,n_2} (Def. 2) with finite variance components (8) and 4 ≤ n_1 ≤ n_2. A standard permutation test (Alg. 1 with s = n) based on U_{n_1,n_2} has power at least 1 − β whenever

(9) E[U_{n_1,n_2}] ≥ γ_{n_1,n_2} ≜ √(((3 − β)/β)(4ψ_{Y,1}/n_1 + 4ψ_{Z,1}/n_2)) + √((1 − α⋆)(36n_1² + 36n_2² + 198n_1n_2)ψ_{YZ,2}/(α⋆ β n_1(n_1 − 1)n_2(n_2 − 1)))

for α⋆ ≜ (α/(2e))(β/3)^{1/⌊α(B+1)⌋}. A cheap permutation test (Alg. 1 with 4 ≤ s < n) based on U_{n_1,n_2} has power at least 1 − β whenever

(10) E[U_{n_1,n_2}] ≥ γ_{n_1,n_2,s} ≜ γ_{n_1,n_2} + (γ_{n_1,n_2} + √((201s²/(4n²))ψ_{YZ,2} + (12s/n)max(ψ_{Y,1}, ψ_{Z,1}))) / (√(βα⋆ s³ ρ_{n_1n_2}/(24(1 − α⋆))) − 1)

for ρ_{n_1n_2} ≜ n_1(n_1 − 1)²n_2(n_2 − 1)²/n⁶.

REMARK 1 (Equivalent separation rates and thresholds). We have γ_{n_1,n_2,s} = Θ(γ_{n_1,n_2}) whenever s ≥ ∛(24(1 − α⋆)(1 + Ω(1))/(βα⋆ρ_{n_1n_2})) and γ_{n_1,n_2,s} = γ_{n_1,n_2}(1 + o(1)) whenever s = ω(n_2/n_1).

The separation threshold γ_{n_1,n_2} for standard permutation testing is a refined, quantitative version of the one derived in [25, Thm. 4.1]. Here, we keep track of all numerical factors and, thanks to the quantile comparison method introduced in Sec. 4.3, establish a tighter dependency on the number of permutations B. The separation threshold γ_{n_1,n_2,s} for cheap permutation testing is entirely new and implies two remarkable properties summarized in Rem. 1. First, the cheap and standard thresholds are asymptotically equivalent (in the strong sense that γ_{n_1,n_2,s}/γ_{n_1,n_2} = 1 + o(1)) whenever s = ω(n_2/n_1). For n_1 = Θ(n_2), this occurs whenever s grows unboundedly with n, even if the growth rate is arbitrarily slow. Second, the cheap and standard separation rates are identical (i.e., γ_{n_1,n_2,s} = Θ(γ_{n_1,n_2})) whenever s ≥ ∛(24(1 − α⋆)(1 + Ω(1))/(βα⋆ρ_{n_1n_2})). Hence, when n_1 = Θ(n_2), cheap testing can match the separation rate of standard permutation testing using a constant number of bins independent of the sample size n.

When the base function g is a positive-definite kernel k (Ex. 1), the expected test statistic is nonnegative and equal to the squared maximum mean discrepancy (MMD, (4)). In this case, Thm. 1 also yields a detectable separation threshold of order √(1/n_1 + 1/n_2) for the root mean test statistic, MMD_k(P, Q) = √(E[U_{n_1,n_2}]).

COROLLARY 1 (Power of cheap homogeneity: finite variance, PD). Under the assumptions of Thm. 1 and notation of Ex. 1, suppose that g is a positive-definite kernel k with

ξ_Q, ξ_P < ∞ for ξ_µ ≜ max(E[MMD²_k(δ_{Y_1}, µ)], E[MMD²_k(δ_{Z_1}, µ)]).

A standard permutation test based on U_{n_1,n_2} has power at least 1 − β whenever

(11) MMD_k(P, Q) ≥ ϵ_{n_1,n_2} ≜ √(((3 − β)/β)(4ξ_Q/n_1 + 4ξ_P/n_2)) + ((1 − α⋆)(36n_1² + 36n_2² + 198n_1n_2)ψ_{YZ,2}/(α⋆ β n_1(n_1 − 1)n_2(n_2 − 1)))^{1/4}.

A cheap permutation test based on U_{n_1,n_2} has power at least 1 − β whenever

(12) MMD_k(P, Q) ≥ ϵ_{n_1,n_2,s} ≜ ϵ_{n_1,n_2} + (ϵ_{n_1,n_2} + √((12s/n)max(ξ_Q, ξ_P)) + (201βα⋆ s⁵ ρ_{n_1n_2} ψ_{YZ,2}/(96(1 − α⋆)n_1²n_2²))^{1/4}) / (√(βα⋆ s³ ρ_{n_1n_2}/(24(1 − α⋆))) − 1).
REMARK 2 (Equivalent separation rates and thresholds). We have ϵ_{n_1,n_2,s} = Θ(ϵ_{n_1,n_2}) whenever s ≥ ∛(24(1 − α⋆)(1 + Ω(1))/(βα⋆ρ_{n_1n_2})) and ϵ_{n_1,n_2,s} = ϵ_{n_1,n_2}(1 + o(1)) whenever s = ω(n_2/n_1).

The proof of Cor. 1 can be found in App. C. Finally, we establish improved power guar-
antees for cheap testing with only logarithmic dependence on β and α⋆ under a commonly-
satisfied sub-Gaussianity assumption.

THEOREM 2 (Power of cheap homogeneity: sub-Gaussian, PD). Under the assumptions of Thm. 1 and notation of Ex. 1, suppose that g is a positive-definite kernel k with finite sub-Gaussian parameters

σ_P, σ_Q < ∞ for σ_µ ≜ sup_{p≥1} (E_{X∼µ}|E_{X̃∼µ} MMD_k(δ_X, δ_{X̃})|^p)^{1/p} / √p

and MMD moments

M_{4,P}, M_{4,Q} < ∞ for M_{p,µ} ≜ E_{X∼µ} MMD^p_k(δ_X, µ).

Suppose moreover that s_1 ≜ s n_1/(n_1 + n_2) > 2 log(4/α⋆) + 1 for α⋆ ≜ (α/(2e))(β/3)^{1/⌊α(B+1)⌋}. A cheap permutation test based on U_{n_1,n_2} has power at least 1 − β whenever

(13) MMD_k(P, Q) ≥ (1 − √((2 log(4/α⋆) + 1)/s_1))^{−1} [ (40√(2 log(2/α⋆)/3)(Ã_4 + B̃((β/9)²)) + 4(Ã_2² + B̃((β/9)²)))/√n_1 + 3Λ̃_P(β/3)/√(n_1 − 1) + 3Λ̃_Q(β/3)/√(n_2 − 1) + 40 log(2/α⋆)(Ã_2 + B̃(β/9))/√(3s_1n_1) + 2(√2 Ã_2 + B̃(β/9))/√(s_1n_1) ] for

Ã_p ≜ (M_{p,P} + M_{p,Q})^{1/p}, B̃(δ) ≜ 8√(2e log(1/δ)(σ_P² + σ_Q²)), and Λ̃_µ(δ) ≜ √(2M_{2,µ}) + 8σ_µ√(2e log(1/δ)).
REMARK 3 (Bounded kernels are sub-Gaussian). If k is bounded, then the sub-Gaussian parameter σ_µ ≤ 2√(sup_{x∈X} k(x, x)) for all distributions µ.

For equal sample sizes, Thm. 2, proved in App. D, establishes that cheap testing powerfully detects sub-Gaussian MMD separations of order √(log(1/β)) ⁴√(log(1/α⋆)) / √n_1 using only s = (2 + Θ(1))(2 log(4/α⋆) + 1) bins. This represents an exponential improvement over the order √(1/β) ⁴√(1/(α⋆β)) / √n_1 threshold and s = Ω(∛(1/(βα⋆))) bin size requirement of Cor. 1.

4.2. Power guarantees for independence testing. For the independence V-statistic of Def. 3, we introduce the shorthand X_i ≜ (Y_i, Z_i) and define a symmetrized version of h_in,

h̄_in(x_1, x_2, x_3, x_4) ≜ (1/4!) Σ_{(i_1,i_2,i_3,i_4)∈i⁴_4} h_in(x_{i_1}, x_{i_2}, x_{i_3}, x_{i_4}),

the variance components

ψ′_1 ≜ Var(E[h̄_in(X_1, X_2, X_3, X_4) | X_1]),
(14) ψ̃′_1 ≜ Var(E[h̄_in((Y_1, Z_{π_1}), (Y_2, Z_{π_2}), (Y_3, Z_{π_3}), (Y_4, Z_{π_4})) | (Y_1, Z_{π_1}, π_1)]), and
ψ′_2 ≜ 16 max{E[g²_Y(Y_1, Y_2) g²_Z(Z_i, Z_j)] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}},

where π is an independent wild bootstrap permutation of [n], and the mean component

(15) ξ̃ ≜ max{|E[h_in((Y_{i_1}, Z_{i′_1}), (Y_{i_2}, Z_{i′_2}), (Y_{i_3}, Z_{i′_3}), (Y_{i_4}, Z_{i′_4}))]| : i_k, i′_k ∈ [8], ∀k ∈ [4]}.

Our next theorem, proved in App. E, bounds the minimum separation between Π and P × Q required for standard and cheap independence tests to reject with power at least 1 − β. Here, separation is measured by

U ≜ E[h_in(X_1, X_2, X_3, X_4)],

the expected U-statistic corresponding to the V-statistic V_n. Notably, U is zero under the null (Π = P × Q) but may be non-zero under the alternative (Π ≠ P × Q).

THEOREM 3 (Power of cheap independence: finite variance). Consider an independence V-statistic V_n (Def. 3) with finite variance components (14) and n ≥ 8. A standard wild bootstrap test (Alg. 2 with s = n) based on V_n has power at least 1 − β whenever, for α⋆ ≜ (α/(2e))(β/3)^{1/⌊α(B+1)⌋},

(16) U ≥ γ_n ≜ √((3/4)(3/β − 1)(16ψ′_1/n + 144ψ′_2/n²)) + (12/n)ξ̃ + √((24/(α⋆β) − 2)(32ψ̃′_1/n + 960ψ′_2/n²) + 16ψ′_2/n²).

A cheap wild bootstrap test (Alg. 2 with s < n) based on V_n has power at least 1 − β whenever

(17) U ≥ γ_{n,s} ≜ (1 − (s + 1)/(3s²))^{−1} [γ_n + (4/3)√((12/(α⋆β) − 2)(ψ̃′_1(2304/(ns) + 460800/(ns²)) + ψ′_2 · 4147200/(n²s)))].

REMARK 4 (Equivalent separation rates and thresholds). We have γ_{n,s} = Θ(γ_n) for any choice of s and γ_{n,s} = γ_n(1 + o(1)) whenever s = ω(1).

The separation threshold γ_n can be viewed as a refined, quantitative analogue of the threshold derived in [25, Thm. 5.1] for standard permutation testing with an independence
U-statistic. Here, we additionally keep track of all numerical factors and, thanks to the quan-
tile comparison method introduced in Sec. 4.3, establish a tighter dependency on the number
of permutations B . The separation threshold γn,s for cheap permutation testing is entirely

new and implies two remarkable properties summarized in Rem. 4. First, the cheap and stan-
dard thresholds are again asymptotically equivalent whenever s = ω(1). Second, the cheap
and standard separation rates are identical whatever the choice of s, even if s = 2. Hence,
cheap independence testing can match the separation rate of standard testing using a constant
number of bins independent of the sample size n.
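To make the binning idea concrete, here is a hypothetical NumPy sketch of cheap permutation testing for a two-sample kernel statistic (the paper's Alg. 1 is more general; this toy version assumes equal-sized contiguous bins that divide both samples evenly). The s × s matrix of bin-pair kernel sums acts as the sufficient statistic, so each permuted statistic costs O(s²) instead of O(n²):

```python
import numpy as np

def cheap_permutation_pvalue(X, Y, g, s=32, B=1279, seed=0):
    # Hypothetical sketch of cheap permutation testing: pool the samples,
    # split the pool into s equal contiguous bins, precompute the s x s
    # matrix of bin-pair kernel sums, and permute whole bins.
    rng = np.random.default_rng(seed)
    n1, n2 = len(X), len(Y)
    pool = np.concatenate([X, Y])
    n = n1 + n2
    assert n % s == 0 and n1 % (n // s) == 0, "bins must align with sample sizes"
    starts = np.arange(0, n, n // s)          # bin start indices
    K = g(pool[:, None], pool[None, :])       # n x n kernel matrix, computed once
    # Sufficient statistics: G[a, b] = sum of g over (bin a) x (bin b)
    G = np.add.reduceat(np.add.reduceat(K, starts, axis=0), starts, axis=1)
    s1 = n1 // (n // s)                       # number of bins assigned to sample 1

    def stat(order):
        a, b = order[:s1], order[s1:]
        return (G[np.ix_(a, a)].sum() / n1**2 + G[np.ix_(b, b)].sum() / n2**2
                - 2 * G[np.ix_(a, b)].sum() / (n1 * n2))  # V-statistic MMD^2

    T = stat(np.arange(s))
    perms = [stat(rng.permutation(s)) for _ in range(B)]
    return (1 + sum(Tb >= T for Tb in perms)) / (B + 1)
```

Each of the B permutations touches only the s × s matrix G, matching the O(Bs²)-type permutation overhead discussed for quadratic statistics in Sec. 7.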
When the base functions (gY, gZ) are positive-definite kernels (Ex. 3), the test statistic
(1/4)Vn is equal to the squared sample MMD, MMD²k(Π̂, P̂ × Q̂). In this case, the proof of
Thm. 3 also yields a detectable separation threshold of order 1/√n for the population parameter,
MMDk(Π, P × Q).

COROLLARY 2 (Power of cheap independence: finite variance, PD). Under the assumptions of Thm. 3 and notation of Ex. 3, suppose that n ≥ 14 and that gY, gZ are positive-definite kernels with k = gY gZ, ξ ≜ E[hin(X1, X1, X3, X4)], and

ξ̃wild ≜ max{ E(Y,Z)∼µ MMD²k((δ(Y,Z) + ν)/2, (δY × Q + P × δZ)/2) : µ, ν ∈ {Π, P × Q} }.
A standard wild bootstrap test based on Vn has power at least 1 − β whenever

(18) MMDk(Π, P × Q) ≥ ϵn ≜ √((12(α⋆β)^{−1} − 2)(32√ξ + 256√ξ̃wild)) / ((3 − √((12(α⋆β)^{−1} − 2) · 32/n))√n) + ((12(α⋆β)^{−1} − 2) · 960 · (ψ2′)^{1/4} + 10ξ̃) / ((3 − √((12(α⋆β)^{−1} − 2) · 32/n))√n).

A cheap wild bootstrap test based on Vn has power at least 1 − β whenever

(19) MMDk(Π, P × Q) ≥ ϵn,s ≜ [3 − √((12/(α⋆β) − 2) · 32/n)] / [3 − (s + 1)/s² − √((12/(α⋆β) − 2)(32/n + 2304/(ns) + 460800/(ns²)))] · ϵn + √(ξ̃wild(18432/s + 3686400/s²)) / (√(9α⋆β/(12 − 2α⋆β)) √n) + √(4(12/(α⋆β) − 2)ψ2′ · 4147200/s) / (3n − √((12/(α⋆β) − 2) · 32n))^{1/2}.

R EMARK 5 (Equivalent MMD separation rates and thresholds). We have ϵn,s = Θ(ϵn )
for any choice of s and ϵn,s = ϵn (1 + o(1)) whenever s = ω(1).

The proof of Cor. 2 can be found in App. F. Finally, we establish improved power guaran-
tees for cheap independence testing with only logarithmic dependence on β and α⋆ whenever
k is bounded.

THEOREM 4 (Power of cheap independence: bounded, PSD). Under the assumptions of
Thm. 3 and notation of Ex. 3, suppose that gY, gZ are positive-definite kernels with k = gY gZ
and K ≜ sup_x |k(x, x)| < ∞. If 1/4 + √(log(6/α⋆)/s) + log(6/α⋆)/s < 1 for α⋆ = (α/3)(β/(2e))^{1/⌊α(B+1)⌋},
then a cheap wild bootstrap test based on Vn has power at least 1 − β whenever

MMDk(Π, P × Q) ≥ √K (30 + 22√log(12/β) + (72 + (1 + 12√log(3/α⋆))√s + 2 log(3/α⋆))/(6√s)) / (√n (3/4 − √(log(6/α⋆)/s) − log(6/α⋆)/s)).

Thm. 4, proven in App. G, establishes that cheap independence testing powerfully detects
bounded MMD separations of order √(log(1/β) log(1/α⋆))/√n using only s = Θ(log(1/α⋆)) bins.
This represents an exponential improvement over the order (1/(βα⋆))/√n threshold of Cor. 2.

4.3. Quantile comparison method. At the heart of our power guarantees is a quantile
comparison method (Prop. 2) suitable for bounding the power of Monte Carlo permutation
tests. Related results have appeared in the literature under the name two moments method, as
they control quantiles using the first two moments of the test statistic. This approach to assessing
non-asymptotic power was pioneered by Fromont, Laurent and Reynaud-Bouret [16]
in the context of homogeneity testing and later developed by Kim, Balakrishnan and Wasserman
[25], Schrab et al. [41], Domingo-Enrich, Dwivedi and Mackey [13], among others.
Earlier guarantees due to [25, 41] placed undue constraints on the number of permutations
B. For example, Prop. D.1 of [25] requires B ≥ (8/α²) log(4/β), while Thm. 5 of [41] requires
B ≥ (3/α²)(log(8/β) + α(1 − α)). Here we adapt a tighter argument developed for homogeneity
testing with a bounded positive-definite kernel [13, Lem. 6] and extend it to all forms of
permutation test and, more generally, to any test that thresholds using the quantiles of i.i.d.
perturbations. The result requires only B ≥ 1/α − 1 permutations and is proved in App. H.

PROPOSITION 2 (Quantile comparison method). Consider a test statistic T(X) and an
auxiliary sequence (Tb)_{b=1}^B generated i.i.d. and independently of T(X) given X. Suppose
that a test ∆(X) rejects whenever T(X) exceeds the bα = ⌈(1 − α)(B + 1)⌉ smallest values
in (Tb)_{b=1}^B. Then ∆(X) has power at least 1 − β whenever a population parameter

(20) T ≥ Ψn(α⋆, β/3) + Φn(β/3) for α⋆ ≜ (α/3)(β/(2e))^{1/⌊α(B+1)⌋}

and complementary quantile bounds Φn, ΨX, Ψn satisfying
(21) Pr(T − T(X) ≥ Φn(β/3)) ≤ β/3,
(22) Pr(T1 ≥ ΨX(α⋆) | X) ≤ α⋆ almost surely, and
(23) Pr(ΨX(α⋆) ≥ Ψn(α⋆, β/3)) ≤ β/3.
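A minimal sketch of the Monte Carlo rejection rule above, using the identity bα = B + 1 − ⌊α(B + 1)⌋ (equal to ⌈(1 − α)(B + 1)⌉) to avoid floating-point pitfalls in the ceiling:

```python
import math

def rejects(T, aux, alpha):
    # Reject when T(X) exceeds the b_alpha = ceil((1-alpha)(B+1)) smallest
    # auxiliary values; b_alpha = B + 1 - floor(alpha*(B+1)) is the same
    # integer but is more robust to floating-point rounding.
    B = len(aux)
    b_alpha = B + 1 - math.floor(alpha * (B + 1))
    return b_alpha <= B and T > sorted(aux)[b_alpha - 1]
```

When α(B + 1) is an integer (e.g., α = 0.05 with B = 1279 as in Sec. 6), this coincides with rejecting when the permutation p-value (1 + #{b : Tb ≥ T(X)})/(B + 1) is at most α; with B < 1/α − 1 the rule never rejects, matching the B ≥ 1/α − 1 requirement noted above.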

When the test statistic T (X) has finite variance and the auxiliary variable T1 has almost-
surely finite conditional variance, Cantelli’s inequality yields the following corollary, a re-
finement of the two moments method of [25, Prop. D.1]. See App. I for the proof.

COROLLARY 3 (Refined two moments method). If Var(T(X)), E[Var(T1 | X)] < ∞,
then the following functions satisfy the requirements (21) to (23) of Prop. 2 for T = E[T(X)]:

Φn(β) = √((1/β − 1) Var(T(X))), ΨX(α) = √((1/α − 1) Var(T1 | X)) + E[T1 | X],
(24) Ψn(α, β) = √((1/(αβ) − 1/β) E[Var(T1 | X)]) when E[T1 | X] = 0 almost surely, and
(25) Ψn(α, β) = √((2/(αβ) − 2/β) E[Var(T1 | X)]) + E[T1] + √((2/β − 1) Var(E[T1 | X])) otherwise.

5. Minimax optimality of cheap testing. In this section, we apply the power results of
Sec. 4 to establish the minimax rate optimality of cheap testing in a variety of settings. Below,
for discrete and absolutely continuous signed measures µ, we make use of the standard
L2 norm on densities, ∥µ∥2² ≜ ∫ (dµ/dν(x))² dν(x), where ν is the counting measure if µ is
supported on a discrete set X and ν is the Lebesgue measure when µ has a Lebesgue density.
Note that
∥P − Q∥2² = Σ_{x∈X} (p(x) − q(x))² if P, Q have probability mass functions p, q on a set X, and
∥P − Q∥2² = ∫ (p(x) − q(x))² dx if P, Q have Lebesgue densities p, q.
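For instance, the discrete case of this norm can be computed directly from probability mass functions (a toy helper, not from the paper):

```python
def l2_sq_discrete(p, q):
    # ||P - Q||_2^2 under the counting measure, for pmfs given as dicts
    # mapping points of a common discrete set to probabilities
    support = set(p) | set(q)
    return sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in support)
```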
We also define the Hölder-space ball,

(26) Hτd(L) ≜ { f : [0,1]^d → R : |f^{(⌊τ⌋)}(x) − f^{(⌊τ⌋)}(x′)| ≤ L∥x − x′∥2^{τ−⌊τ⌋} ∀x, x′ ∈ [0,1]^d, |f^{(τ′)}(x)| ≤ L ∀x ∈ [0,1]^d, τ′ ∈ {1, . . . , ⌊τ⌋} }.
5.1. Discrete homogeneity testing. Our first result, proved in App. J.1, shows that cheap
homogeneity testing is minimax rate optimal for detecting discrete L2 separations.

PROPOSITION 3 (Power of cheap testing: discrete L2 homogeneity). Suppose P and Q
are discrete distributions on a finite set X with b(1) ≜ max(∥P∥2², ∥Q∥2²). A cheap permutation
test (Alg. 1) based on a homogeneity U-statistic (Def. 2) with n1 ≤ n2 and base function
g(y, z) = I(y = z) has power at least 1 − β whenever

∥P − Q∥2 ≥ (C b(1)^{1/4} / ((βα⋆ min(β, α⋆))^{1/6} √n1)) · (1 + 1/(s (βα⋆ ρn1n2 / (24(1 − α⋆)))^{1/3} − 1))

for a universal constant C, α⋆ ≜ (α/3)(β/(2e))^{1/⌊α(B+1)⌋}, and ρn1n2 ≜ n1(n1 − 1)² n2(n2 − 1)²/n⁶.

REMARK 6 (Cheap optimality: discrete L2 homogeneity). The cheap test of Prop. 3
achieves the minimax optimal separation rate of order b(1)^{1/4}/√n1 [25, Prop. 4.4] whenever
s ≥ (24(1 − α⋆)(1 + Ω(1))/(βα⋆ ρn1n2))^{1/3}.

5.2. Hölder homogeneity testing. Our second result, proved in App. J.2, shows that
cheap homogeneity testing is minimax rate optimal for detecting Hölder L2 separations. The
test statistic, based on discretization of the unit cube, is the same as the one used in [25, Prop. 4.6].

PROPOSITION 4 (Power of cheap testing: Hölder L2 homogeneity). Suppose P and Q
are distributions on [0, 1]^d with Lebesgue densities belonging to the Hölder ball Hτd(L) (26).
Let {B1, . . . , BK} be a partition of [0, 1]^d into d-dimensional hypercubes with side length
1/⌊n1^{2/(4τ+d)}⌋. A cheap permutation test (Alg. 1) based on a homogeneity U-statistic (Def. 2)
with n1 ≤ n2 and base function g(y, z) = Σ_{k=1}^K I(y, z ∈ Bk) has power at least 1 − β whenever

∥P − Q∥2 ≥ (1/n1)^{2τ/(4τ+d)} (C L^{1/4} / (βα⋆ min(β, α⋆))^{1/6}) · (1 + 1/(s (βα⋆ ρn1n2 / (24(1 − α⋆)))^{1/3} − 1))

for a universal constant C, α⋆ ≜ (α/3)(β/(2e))^{1/⌊α(B+1)⌋}, and ρn1n2 ≜ n1(n1 − 1)² n2(n2 − 1)²/n⁶.

REMARK 7 (Cheap optimality: Hölder L2 homogeneity). The cheap test of Prop. 4
achieves the minimax optimal separation rate of order n1^{−2τ/(4τ+d)} [2, Thm. 3] whenever
s ≥ (24(1 − α⋆)(1 + Ω(1))/(βα⋆ ρn1n2))^{1/3}.

5.3. Discrete independence testing. Our third result, proved in App. J.3, shows that
cheap independence testing is minimax rate optimal for detecting discrete L2 dependence.

PROPOSITION 5 (Power of cheap testing: discrete L2 independence). Suppose that
P and Q are discrete distributions on finite sets Y and Z respectively with b(2) ≜
max(∥Π∥2², ∥P × Q∥2²), and consider the augmented samples

Ỹ ≜ ((Yi, Ai))_{i=1}^n and Z̃ ≜ ((Zi, Bi))_{i=1}^n for (Ai)_{i=1}^n, (Bi)_{i=1}^n i.i.d. ∼ Unif((0, 1)).

A cheap permutation test (Alg. 2) based on an independence V-statistic (Def. 3) with
paired sample (Ỹ, Z̃) and base functions gY((y1, a1), (y2, a2)) = I(y1 = y2, a1 ≠ a2) and
gZ((z1, b1), (z2, b2)) = I(z1 = z2, b1 ≠ b2) has power at least 1 − β whenever

∥Π − P × Q∥2 ≥ C b(2)^{1/4} / (√(βα⋆) √n),

for a universal constant C and α⋆ ≜ (α/3)(β/(2e))^{1/⌊α(B+1)⌋}.
REMARK 8 (Cheap optimality: discrete L2 independence). The cheap test of Prop. 5
achieves the minimax optimal separation rate of order b(2)^{1/4}/√n [25, Prop. 5.4] for any
choice of s.

5.4. Hölder independence testing. Our final result, proved in App. J.4, shows that cheap
independence testing is minimax rate optimal for detecting Hölder L2 dependence. The
test statistic base functions, based on discretization of the unit cube, are the same as those used in [25,
Prop. 5.5].

PROPOSITION 6 (Power of cheap testing: Hölder L2 independence). Suppose that Π is
a distribution on [0, 1]^{d1+d2} and that the Lebesgue densities of Π and P × Q belong to the
Hölder ball Hτd1+d2(L) (26). For a ∈ {1, 2}, let {Ba,1, . . . , Ba,Ka} be a partition of [0, 1]^{da}
into da-dimensional hypercubes with side length 1/⌊n^{2/(4τ+d1+d2)}⌋, and consider the augmented samples

Ỹ ≜ ((Yi, Ai))_{i=1}^n and Z̃ ≜ ((Zi, Bi))_{i=1}^n for (Ai)_{i=1}^n, (Bi)_{i=1}^n i.i.d. ∼ Unif((0, 1)).

A cheap permutation test (Alg. 2) based on an independence V-statistic (Def. 3) with paired
sample (Ỹ, Z̃) and base functions
gY((y1, a1), (y2, a2)) = Σ_{k=1}^{K1} I(y1, y2 ∈ B1,k, a1 ≠ a2) and
gZ((z1, b1), (z2, b2)) = Σ_{k=1}^{K2} I(z1, z2 ∈ B2,k, b1 ≠ b2)
has power at least 1 − β whenever

∥Π − P × Q∥2 ≥ (C L^{1/4} / √(βα⋆)) (1/n)^{2τ/(4τ+d1+d2)},

for a universal constant C and α⋆ ≜ (α/3)(β/(2e))^{1/⌊α(B+1)⌋}.

REMARK 9 (Cheap optimality: Hölder L2 independence). The cheap test of Prop. 6
achieves the minimax optimal separation rate of order n^{−2τ/(4τ+d1+d2)} [25, Prop. 5.6] for
any choice of s.

6. Experiments. We now complement our methodological development and theoretical
analysis with a series of synthetic experiments exploring the practical benefits of cheap
testing. We implemented all tests in Cython using nominal level α = 0.05, B = 1279 permu-
tations, and a common interface and library of functions to enable a consistent comparison
of computational complexity. For each experiment, we display average rejection rates and
95% Wilson confidence intervals over 10000 independent replications. Python and Cython
code replicating all experiments can be found at https://github.com/microsoft/cheap-permutations,
and supplementary experimental details can be found in App. K.

6.1. Homogeneity testing. We benchmark the power and runtime of homogeneity U-statistic (Def. 2) tests with n1 = n2 in three settings.
1. MMD TESTING: P = N(−0.03e1, I) and Q = N(0.03e1, I) on R^d for d = 10, and the
base function g is the Gaussian kernel kλ(x, y) ≜ e^{−∥x−y∥2²/(2λ²)} with bandwidth λ equal
to the median pairwise distance among sample points.
2. RFF TESTING: P, Q, and λ are as above, and the base function is the random Fourier
feature [RFF, 39] kernel g(x, y) = (4/r²) Σ_{k=1}^r cos(ωk⊤x + bk) cos(ωk⊤y + bk) for (ωk, bk)
sampled i.i.d. from N(0, λ^{−2}I) × Unif(0, 2π). In this case, because the kernel g has rank
r < n, each test statistic can be computed in time Θ(dnr), and cheap permutation can be
carried out in Θ(Bsr) additional time with Θ(sr) memory, as described in Alg. K.1.
FIG 1. Power vs. runtime for cheap and standard MMD tests of homogeneity as total sample size n = n1 + n2
and bin count s vary.

3. WMW TESTING: P = N(−0.015, 1), Q = N(0.015, 1), and our test statistic is the popular Wilcoxon-Mann-Whitney (WMW) test statistic of Ex. 5.

EXAMPLE 5 (Wilcoxon-Mann-Whitney test statistic [52, 31]). The Wilcoxon-Mann-Whitney
test statistic (also known as the Wilcoxon rank sum statistic and the Mann-Whitney
U-statistic) is defined as T(Y, Z) = (1/(n1n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} I(Yi ≤ Zj) − 1/2, where Y and Z are
assumed to have no elements in common. This statistic may be recast into the U-statistic form
(3) by selecting

g(y, z) = −(1/2) I(y < z) − (1/4) I(y = z).

Technically, the WMW statistic is an asymmetric homogeneity U-statistic as g(y, z) ≠
g(z, y). Nevertheless, it remains a quadratic test statistic (Def. 1) and hence is amenable to
cheap testing.
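This recasting can be checked numerically. The sketch below assumes the standard MMD-type combination of within- and between-sample averages for the homogeneity U-statistic (the precise form of Def. 2 appears earlier in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=7)
Z = rng.normal(size=9)  # continuous draws, so no ties almost surely

# Direct Wilcoxon-Mann-Whitney statistic
wmw = np.mean(Y[:, None] <= Z[None, :]) - 0.5

# Asymmetric base function from Ex. 5
def g(a, b):
    return -0.5 * (a < b) - 0.25 * (a == b)

# Assumed MMD-type combination of the homogeneity U-statistic
n1, n2 = len(Y), len(Z)
GY = g(Y[:, None], Y[None, :])
GZ = g(Z[:, None], Z[None, :])
within_Y = (GY.sum() - np.trace(GY)) / (n1 * (n1 - 1))
within_Z = (GZ.sum() - np.trace(GZ)) / (n2 * (n2 - 1))
cross = g(Y[:, None], Z[None, :]).mean()
u_stat = within_Y + within_Z - 2 * cross
```

Both computations agree to machine precision, confirming the recasting (ties, which the no-common-elements assumption rules out, would otherwise require the I(y = z) correction).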

6.1.1. Cheap vs. standard MMD testing. Fig. 1 (left) compares cheap MMD testing to
standard MMD permutation testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 32 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 100 times faster than the standard permutation test.
Fig. 1 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s
and fixed n = 16384. We find that runtime remains essentially constant for s ≤ 128 and that
power remains essentially constant for s ≥ 128.
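For reference, the statistic and bandwidth choice used in this experiment can be sketched as follows (a simplified stand-alone implementation, not the paper's Cython code):

```python
import numpy as np

def median_bandwidth(X):
    # Median pairwise Euclidean distance among the pooled sample points
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

def mmd2_u(X, Y, lam):
    # Unbiased quadratic-time MMD^2 with Gaussian kernel k_lam
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / (2 * lam**2))
    KXX, KYY, KXY = k(X, X), k(Y, Y), k(X, Y)
    n1, n2 = len(X), len(Y)
    return ((KXX.sum() - np.trace(KXX)) / (n1 * (n1 - 1))
            + (KYY.sum() - np.trace(KYY)) / (n2 * (n2 - 1))
            - 2 * KXY.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10)) - 0.03 * np.eye(10)[0]   # P = N(-0.03 e1, I)
Y = rng.normal(size=(50, 10)) + 0.03 * np.eye(10)[0]   # Q = N(+0.03 e1, I)
lam = median_bandwidth(np.vstack([X, Y]))
stat = mmd2_u(X, Y, lam)
```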

6.1.2. Cheap MMD vs. asymptotic cross-MMD testing. A recently-proposed alternative


to the MMD permutation test is the asymptotic cross-MMD test [46] which avoids permuta-
tion by thresholding a related, asymptotically-normal test statistic using a Gaussian quantile.
By avoiding permutation, the asymptotic cross-MMD test achieves a substantial speed-up
over a standard MMD permutation test. However, as Shekhar, Kim and Ramdas [46] note,
this comes at the cost of reduced power and the loss of finite-sample validity. In Fig. 2, we
see that cheap MMD achieves nearly the same speed-up benefits as asymptotic cross-MMD
(right) while also retaining the full power and finite-sample validity of standard MMD testing
for any given sample size (left).
FIG 2. Cheap MMD vs. asymptotic cross-MMD tests of homogeneity. (Left) Power as a function of the total
sample size n = n1 + n2. (Right) Time-power trade-off curves as n and bin count s vary.

FIG 3. Power vs. runtime for cheap and standard RFF tests of homogeneity as the bin count s and number of
random Fourier features r vary. Here, the total sample size is n = 16384.

6.1.3. Cheap vs. standard RFF testing. Fig. 3 compares cheap RFF testing to standard
RFF permutation testing as a function of the number of random Fourier features r and bin
count s. We find that s = 256 bins suffice to closely match the power of standard testing for
all power levels. With this setting, cheap testing is also 100 to 1000 times faster than the
standard permutation test, depending on the choice of r .
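The Θ(dnr) cost quoted above reflects the rank-r structure: the biased squared-MMD estimate reduces to a squared distance between mean feature vectors. A sketch, using one standard RFF scaling convention (the paper's constant may differ, but any fixed positive scaling yields the same permutation test):

```python
import numpy as np

def rff_features(X, r, lam, rng):
    # Random Fourier features for the Gaussian kernel with bandwidth lam
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lam, size=(d, r))
    b = rng.uniform(0, 2 * np.pi, size=r)
    return np.sqrt(2.0 / r) * np.cos(X @ W + b)

def rff_mmd2(X, Y, r=128, lam=1.0, seed=0):
    # With rank-r features, the biased MMD^2 estimate is the squared norm of
    # the difference of mean feature vectors: O(d n r) time, O(r) memory.
    rng = np.random.default_rng(seed)
    Phi = rff_features(np.vstack([X, Y]), r, lam, rng)
    mu_X, mu_Y = Phi[: len(X)].mean(0), Phi[len(X):].mean(0)
    return float(((mu_X - mu_Y) ** 2).sum())
```

Re-evaluating the statistic under a permutation then only requires re-averaging rows of the precomputed feature matrix, which is the basis of the Θ(Bsr) cheap-permutation overhead.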

6.1.4. Cheap vs. standard WMW testing. Fig. 4 (left) compares cheap WMW testing to
standard WMW permutation testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 64 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 300 to 900 times faster than the standard permutation
test.
FIG 4. Power vs. runtime for cheap and standard Wilcoxon-Mann-Whitney tests of homogeneity as total sample
size n = n1 + n2 and bin count s vary.

Fig. 4 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s and
fixed n = 16384. We find that runtime remains essentially constant for s ≤ 64 and that power
increases minimally beyond s = 128.

6.2. Independence testing. We additionally benchmark the power and runtime of independence V-statistic (Def. 3) tests in the following setting.
4. HSIC TESTING: Π is a multivariate Gaussian distribution with mean zero and covariance

[ I 0.047I ; 0.047I I ] ∈ R^{20×20}.

The base functions gY and gZ are Gaussian kernels kλY and kλZ with bandwidths λY and
λZ equal to the median pairwise distance among Y and Z points respectively.
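The HSIC statistic with these kernels can be sketched as follows, using the standard biased (V-statistic) estimator tr(KHLH)/n²; the paper's Vn (Def. 3) may differ by a constant factor:

```python
import numpy as np

def hsic_v(Y, Z):
    # Biased (V-statistic) HSIC estimate with Gaussian kernels and
    # median-heuristic bandwidths: HSIC = tr(K H L H) / n^2.
    def gram(X):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        # median of squared distances = (median pairwise distance)^2
        lam2 = np.median(d2[d2 > 0])
        return np.exp(-d2 / (2 * lam2))
    n = len(Y)
    K, L = gram(Y), gram(Z)
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return float(np.trace(K @ H @ L @ H) / n**2)
```

Since the centered Gram matrices are positive semi-definite, the estimate is nonnegative and grows with the dependence between Y and Z.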

6.2.1. Cheap vs. standard HSIC testing. Fig. 5 (left) compares cheap HSIC testing to
standard HSIC wild bootstrap testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 64 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 15 times faster than the standard wild bootstrap test.
Fig. 5 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s and
fixed n = 2048. We find that runtime remains essentially constant for s ≤ 128 and that power
increases minimally beyond s = 256.

6.2.2. Cheap HSIC vs. asymptotic cross-HSIC testing. A recently-proposed alternative


to the HSIC wild bootstrap test is the asymptotic cross-HSIC test [47] which avoids wild
bootstrapping by thresholding a related, asymptotically-normal test statistic using a Gaus-
sian quantile. By avoiding wild bootstrapping, the asymptotic cross-HSIC test achieves a
substantial speed-up over a standard HSIC wild bootstrap test. However, as Shekhar, Kim
and Ramdas [47] note, this comes at the cost of reduced power and the loss of finite-sample
validity. In Fig. 6, we see that cheap HSIC achieves speed-up benefits comparable to those
of asymptotic cross-HSIC (right) while also preserving the higher power and finite-sample
validity of standard HSIC testing for any given sample size (left).
FIG 5. Power vs. runtime for cheap and standard HSIC tests of independence as total sample size n = n1 + n2
and bin count s vary.

FIG 6. Cheap HSIC vs. asymptotic cross-HSIC tests of independence. (Left) Power as a function of the total
sample size n = n1 + n2. (Right) Time-power trade-off curves as n and bin count s vary.

7. Discussion and extensions. We have presented a general framework, cheap permutation testing, for accelerating hypothesis testing while preserving the high power and exact
false positive control of standard permutation tests. The core idea is to group datapoints into
bins and to exploit bin-wise sufficient statistics to avoid the standard sample-size-dependent
overhead of permutation testing. While this work has focused primarily on quadratic test
statistics for two-sample homogeneity and independence testing, the principles can be ex-
tended in a number of directions:
Generalized quadratic test statistics. Many popular test statistics can be recovered as simple
transformations of quadratic statistics. For example, studentized U and V-statistics, which
are known to imbue permutation tests with additional robustness properties [35, 24, 7, 8],
divide a standard U or V-statistic by an estimate of its standard error and commonly take
the form U/√V where both U and V are quadratic test statistics in the sense of Def. 1. All
functions of one or more quadratic test statistics are amenable to the binned sufficient statistic
constructions of Sec. 3 and hence can benefit from the speed-ups of cheap testing.
Multisample / K -sample testing. In multisample homogeneity testing one observes K ≥ 2
independent samples drawn i.i.d. from unknown distributions P1, . . . , PK and seeks to test
whether all K distributions match. Similarly, in multisample independence testing one ob-
serves tuples of K ≥ 2 variables and seeks to test whether the components are jointly inde-
pendent. In each case, one can generalize the cheap permutation tests of Algs. 1 and 2 to
accommodate the multisample U and V-statistics commonly employed [see, e.g., 37, 36, 7].
Higher-order test statistics. Cheap testing is also compatible with polynomial test statistics,
that is, degree-p polynomial functionals of the sample empirical distributions P̂1, . . . , P̂K:

T = Σ_{j1,...,jp ∈ [K]} (P̂j1 × · · · × P̂jp) ϕj1,...,jp.

In this case, O(s^p) sufficient statistics would suffice to run a cheap test with O(Bs^p) permutation overhead, a substantial improvement over the standard Θ(Bn^p) permutation overhead.

REFERENCES
[1] ANDERSON, W. N. and VERBEECK, J. (2023). Exact Permutation and Bootstrap Distribution of Generalized Pairwise Comparisons Statistics. Mathematics 11.
[2] ARIAS-CASTRO, E., PELLETIER, B. and SALIGRAMA, V. (2018). Remember the curse of dimensionality: the case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics 30 448–471.
[3] BARINGHAUS, L. and FRANZ, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis 88 190–206.
[4] BERRETT, T. B., KONTOYIANNIS, I. and SAMWORTH, R. J. (2021). Optimal rates for independence testing via U-statistic permutation tests. The Annals of Statistics 49 2457–2490.
[5] BOUCHERON, S., LUGOSI, G. and MASSART, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.
[6] CANTELLI, F. P. (1929). Sui confini della probabilita. In Atti del Congresso Internazionale dei Matematici: Bologna del 3 al 10 de settembre di 1928 47–60.
[7] CHUNG, E. and ROMANO, J. P. (2013). Exact and asymptotically robust permutation tests. The Annals of Statistics 41 484–507.
[8] CHUNG, E. and ROMANO, J. P. (2016). Asymptotically valid and exact permutation tests based on two-sample U-statistics. Journal of Statistical Planning and Inference 168 97–105.
[9] CHUNG, M. K., XIE, L., HUANG, S.-G., WANG, Y., YAN, J. and SHEN, L. (2019). Rapid acceleration of the permutation test via transpositions. In Connectomics in NeuroImaging: Third International Workshop, CNI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 3 42–53. Springer.
[10] CHWIALKOWSKI, K. P., SEJDINOVIC, D. and GRETTON, A. (2014). A wild bootstrap for degenerate kernel tests. Advances in Neural Information Processing Systems 27.
[11] DADUSH, D., GUZMÁN, C. and OLVER, N. (2018). Fast, Deterministic and Sparse Dimensionality Reduction. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '18 1330–1344. Society for Industrial and Applied Mathematics, USA.
[12] DICICCIO, C., VASUDEVAN, S., BASU, K., KENTHAPADI, K. and AGARWAL, D. (2020). Evaluating fairness using permutation tests. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 1467–1477.
[13] DOMINGO-ENRICH, C., DWIVEDI, R. and MACKEY, L. (2023). Compress Then Test: Powerful Kernel Testing in Near-linear Time. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research 206 1174–1218. PMLR.
[14] DWASS, M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics 181–187.
[15] FISHER, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
[16] FROMONT, M., LAURENT, B. and REYNAUD-BOURET, P. (2013). The two-sample problem for Poisson processes: Adaptive tests with a nonasymptotic wild bootstrap approach. The Annals of Statistics 41 1431–1461.
[17] GOOD, P. (2013). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer Science & Business Media.
[18] GRETTON, A., BOUSQUET, O., SMOLA, A. and SCHÖLKOPF, B. (2005). Measuring Statistical Dependence with Hilbert-Schmidt Norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory 63–77. Springer-Verlag, Berlin, Heidelberg.
[19] GRETTON, A., FUKUMIZU, K., TEO, C., SONG, L., SCHÖLKOPF, B. and SMOLA, A. (2007). A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20.
[20] GRETTON, A., BORGWARDT, K. M., RASCH, M. J., SCHÖLKOPF, B. and SMOLA, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research 13 723–773.
[21] HE, H. Y., BASU, K., ZHAO, Q. and OWEN, A. B. (2019). Permutation p-value approximation via generalized Stolarsky invariance. The Annals of Statistics 47 583–611.
[22] HOEFFDING, W. (1952). The Large-Sample Power of Tests Based on Permutations of Observations. The Annals of Mathematical Statistics 23 169–192.
[23] HOEFFDING, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58 13–30.
[24] JANSSEN, A. (1997). Studentized permutation tests for non-i.i.d. hypotheses and the generalized Behrens-Fisher problem. Statistics & Probability Letters 36 9–21.
[25] KIM, I., BALAKRISHNAN, S. and WASSERMAN, L. (2022). Minimax optimality of permutation tests. The Annals of Statistics 50 225–251.
[26] KNIJNENBURG, T. A., WESSELS, L. F., REINDERS, M. J. and SHMULEVICH, I. (2009). Fewer permutations, more accurate P-values. Bioinformatics 25 i161–i168.
[27] LARSON, J. L. and OWEN, A. B. (2015). Moment based gene set tests. BMC Bioinformatics 16 1–17.
[28] LEE, A. J. (1990). U-Statistics: Theory and Practice. Statistics: A Series of Textbooks and Monographs. Taylor & Francis.
[29] LINDGREN, F., HANSEN, B., KARCHER, W., SJÖSTRÖM, M. and ERIKSSON, L. (1996). Model validation by permutation tests: applications to variable selection. Journal of Chemometrics 10 521–532.
[30] LIU, F., XU, W., LU, J., ZHANG, G., GRETTON, A. and SUTHERLAND, D. J. (2020). Learning Deep Kernels for Non-Parametric Two-Sample Tests. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event. Proceedings of Machine Learning Research 119 6316–6326. PMLR.
[31] MANN, H. B. and WHITNEY, D. R. (1947). On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18 50–60.
[32] MARKOV, A. (1884). On certain applications of algebraic continued fractions. Unpublished Ph.D. thesis, St Petersburg.
[33] MAURER, A. and PONTIL, M. (2021). Concentration inequalities under sub-Gaussian and sub-exponential conditions. In Advances in Neural Information Processing Systems 34 7588–7597. Curran Associates, Inc.
[34] MOHRI, M., ROSTAMIZADEH, A. and TALWALKAR, A. (2012). Foundations of Machine Learning. The MIT Press.
[35] NEUHAUS, G. (1993). Conditional rank tests for the two-sample problem under random censorship. The Annals of Statistics 1760–1779.
[36] PANDA, S., SHEN, C., PERRY, R., ZORN, J., LUTZ, A., PRIEBE, C. E. and VOGELSTEIN, J. T. (2025). Universally consistent K-sample tests via dependence measures. Statistics & Probability Letters 216 110278.
[37] PFISTER, N., BÜHLMANN, P., SCHÖLKOPF, B. and PETERS, J. (2018). Kernel-based tests for joint independence. Journal of the Royal Statistical Society Series B: Statistical Methodology 80 5–31.
[38] PHIPSON, B. and SMYTH, G. K. (2010). Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn. Statistical Applications in Genetics and Molecular Biology 9.
[39] RAHIMI, A. and RECHT, B. (2008). Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. T. Roweis, eds.) 1177–1184. Curran Associates, Inc.
[40] RAMDAS, A., BARBER, R. F., CANDÈS, E. J. and TIBSHIRANI, R. J. (2023). Permutation tests using arbitrary permutation distributions. Sankhya A 1–22.
[41] SCHRAB, A., KIM, I., ALBERT, M., LAURENT, B., GUEDJ, B. and GRETTON, A. (2021). MMD Aggregated Two-Sample Test.
[42] SCHRAB, A., KIM, I., GUEDJ, B. and GRETTON, A. (2022). Efficient Aggregated Kernel Tests using Incomplete U-statistics.
[43] SEGAL, B. D., BRAUN, T., ELLIOTT, M. R. and JIANG, H. (2018). Fast approximation of small p-values in permutation tests by partitioning the permutations. Biometrics 74 196–206.
[44] SEJDINOVIC, D., SRIPERUMBUDUR, B., GRETTON, A. and FUKUMIZU, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 2263–2291.
[45] SERFLING, R. (2009). Approximation Theorems of Mathematical Statistics 162. John Wiley & Sons.
[46] SHEKHAR, S., KIM, I. and RAMDAS, A. (2022). A permutation-free kernel two-sample test. In Advances in Neural Information Processing Systems.
[47] SHEKHAR, S., KIM, I. and RAMDAS, A. (2023). A Permutation-Free Kernel Independence Test. Journal of Machine Learning Research 24 1–68.
[48] STEINWART, I. and CHRISTMANN, A. (2008). Support Vector Machines. Springer Science & Business Media.
[49] SUTHERLAND, D. J., TUNG, H.-Y., STRATHMANN, H., DE, S., RAMDAS, A., SMOLA, A. and GRETTON, A. (2017). Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy. In International Conference on Learning Representations.
[50] SZEKELY, G. and RIZZO, M. (2004). Testing for equal distributions in high dimension. InterStat 5.
[51] SZÉKELY, G. J., RIZZO, M. L. and BAKIROV, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics 35 2769–2794.
[52] WILCOXON, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin 1 80–83.
[53] ZHOU, C., WANG, H. and WANG, Y. (2009). Efficient moments-based permutation tests. Advances in Neural Information Processing Systems 22.
[54] ZMIGROD, R., VIEIRA, T. and COTTERELL, R. (2022). Exact Paired-Permutation Testing for Structured Test Statistics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 4894–4902.
SUPPLEMENTARY MATERIAL
Supplement to “Cheap Permutation Testing”
This supplement contains proofs of all results as well as supplementary experiment details.

A Additional notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
A.1 Additional notation for cheap homogeneity testing (Alg. 1) . . . . . . . . . . 23
A.2 Additional notation for cheap independence testing (Alg. 2) . . . . . . . . . 23
B Proof of Thm. 1: Power of cheap homogeneity: finite variance . . . . . . . . . . . 24
B.1 Proof of Lem. B.1: Variance of homogeneity U-statistics . . . . . . . . . . . 25
B.2 Proof of Lem. B.2: Conditional variance of permuted homogeneity U-statistics 25
C Proof of Cor. 1: Power of cheap homogeneity: finite variance, PD . . . . . . . . . 31
D Proof of Thm. 2: Power of cheap homogeneity: sub-Gaussian, PD . . . . . . . . . 32
D.1 Bounding the fluctuations of the test statistic . . . . . . . . . . . . . . . . . . 32
D.2 Bounding the fluctuations of the threshold . . . . . . . . . . . . . . . . . . . 33
D.3 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
D.4 Proof of Lem. D.2: Sub-Gaussian homogeneity quantile bounds . . . . . . . 35
D.5 Proof of Lem. D.4: High-probability bounds on sums of MMD powers . . . . 36
D.6 Proof of Lem. D.5: Sub-Gaussian quantile bound for permutation threshold . 37
E Proof of Thm. 3: Power of cheap independence: finite variance . . . . . . . . . . . 41
E.1 Proof of Lem. E.1: Variance of independence V-statistics . . . . . . . . . . . 43
E.2 Proof of Lem. E.2: Conditional variance of permuted independence V-statistics 43
E.3 Proof of Lem. E.3: Mean of permuted independence V-statistics . . . . . . . 47
F Proof of Cor. 2: Power of cheap independence: finite variance, PD . . . . . . . . . 50
F.1 Proof of Lem. F.1: E[Vn ] bound . . . . . . . . . . . . . . . . . . . . . . . . 51
F.2 Proof of Lem. F.2: ψ1′ bound . . . . . . . . . . . . . . . . . . . . . . . . . . 52
F.3 Proof of Lem. F.3: ψ̃1′ bound . . . . . . . . . . . . . . . . . . . . . . . . . . 52
G Proof of Thm. 4: Power of cheap independence: bounded, PSD . . . . . . . . . . 53
G.1 Bounding the fluctuations of the test statistic . . . . . . . . . . . . . . . . . . 53
G.2 Alternative representation of the independence V-statistic . . . . . . . . . . . 54
G.3 Bounding the fluctuations of the threshold . . . . . . . . . . . . . . . . . . . 55
G.4 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
G.5 Proof of Lem. G.4: Rademacher representation of Ṽnπ,s . . . . . . . . . . . . 57
G.6 Proof of Lem. G.5: A(i) mean bounds . . . . . . . . . . . . . . . . . . . . . 62
G.7 Proof of Lem. G.6: A(i) complementary quantile bounds . . . . . . . . . . . 63
G.8 Proof of Lem. G.7: Quantile bound for independence permutation threshold . 63
H Proof of Prop. 2: Quantile comparison method . . . . . . . . . . . . . . . . . . . . 65
I Proof of Cor. 3: Refined two moments method . . . . . . . . . . . . . . . . . . . . 66
J Proofs of minimax optimality results . . . . . . . . . . . . . . . . . . . . . . . . . 66
J.1 Proof of Prop. 3: Power of cheap testing: discrete L2 homogeneity . . . . . . 66
J.2 Proof of Prop. 4: Power of cheap testing: Hölder L2 homogeneity . . . . . . 67
J.3 Proof of Prop. 5: Power of cheap testing: discrete L2 independence . . . . . . 67
J.4 Proof of Prop. 6: Power of cheap testing: Hölder L2 independence . . . . . . 68
K Supplementary experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . 68
K.1 Selection of the number of permutations B . . . . . . . . . . . . . . . . . . . 68
K.2 Cheap permutation tests for feature-based kernels . . . . . . . . . . . . . . . 69

APPENDIX A: ADDITIONAL NOTATION


For any dataset X = (X_i)_{i=1}^n, we let δ_X denote the empirical distribution \frac{1}{n}\sum_{i=1}^n δ_{X_i} and X_{-k} = (X_1, ..., X_{k-1}, X_{k+1}, ..., X_n) for each k ∈ [n]. For a given positive-definite kernel k, we let H_k represent the associated reproducing kernel Hilbert space [RKHS, 48, Def. 4.18] and ∥·∥_k and ⟨·,·⟩_k represent the RKHS norm and inner product [48, Thm. 4.21].
A.1. Additional notation for cheap homogeneity testing (Alg. 1). Let X be the concatenation of Y and Z. We define the number of elements per bin m ≜ n/s = (n_1 + n_2)/s, the number of bins for each sample s_1 ≜ n_1/m and s_2 ≜ n_2/m, and the datapoint bins Y^{(i)} = (Y_j)_{j ∈ I_i} and Z^{(i)} = (Z_j)_{j ∈ I_i}.

For the homogeneity U-statistic of Def. 2, we define the bin kernel

(27)  H_{ho,m}(Y^{(i)}, Y^{(j)}; Z^{(i)}, Z^{(j)}) = \frac{1}{m^4} \sum_{y ∈ Y^{(i)}, y' ∈ Y^{(j)}, z ∈ Z^{(i)}, z' ∈ Z^{(j)}} h_{ho}(y, y'; z, z').

Moreover, by the definition of h_{ho}, we can rewrite H_{ho,m} as

(28)  H_{ho,m}(Y^{(i)}, Y^{(j)}; Z^{(i)}, Z^{(j)}) = G_m(Y^{(i)}, Y^{(j)}) + G_m(Z^{(i)}, Z^{(j)}) − G_m(Y^{(i)}, Z^{(j)}) − G_m(Y^{(j)}, Z^{(i)}),
      where  G_m(Y^{(i)}, Y^{(j)}) ≜ \frac{1}{m^2} \sum_{y ∈ Y^{(i)}, y' ∈ Y^{(j)}} g(y, y').

Let (X^{(i)})_{i=1}^{s} denote the concatenation of (Y^{(i)})_{i=1}^{s_1} and (Z^{(i)})_{i=1}^{s_2}. For an independent, uniformly random permutation π of [s], we also define the cheaply-permuted two-sample V and U-statistics as

(29)  V^{π,s}_{n_1,n_2} = \frac{1}{s_1^2 s_2^2} \sum_{i_1,i_2=1}^{s_1} \sum_{j_1,j_2=1}^{s_2} H_{ho,m}(X^{(π_{i_1})}, X^{(π_{i_2})}; X^{(π_{s_1+j_1})}, X^{(π_{s_1+j_2})})  and

(30)  U^{π,s}_{n_1,n_2} = \frac{1}{(s_1)_{(2)} (s_2)_{(2)}} \sum_{(i_1,i_2) ∈ i_2^{s_1}} \sum_{(j_1,j_2) ∈ i_2^{s_2}} H_{ho,m}(X^{(π_{i_1})}, X^{(π_{i_2})}; X^{(π_{s_1+j_1})}, X^{(π_{s_1+j_2})}).

These expressions, along with the substitutions n ← s, n_1 ← s_1, n_2 ← s_2, h_{ho} ← H_{ho,m}, and g ← G_m, help us to frame a cheap homogeneity U-statistic permutation test as a standard homogeneity U-statistic permutation test operating on bins of datapoints.
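To make the computational savings concrete, here is a minimal Python sketch (illustrative only, not the paper's reference implementation; the Gaussian kernel and sample sizes are arbitrary choices) of the binning idea: one pass over the data yields an s × s matrix of bin-level Gram sums, after which every bin-permuted V-statistic costs only O(s²), and the identity permutation recovers the plug-in MMD².

```python
import numpy as np

def gaussian_kernel(A, B, bw=1.0):
    # pairwise Gaussian kernel matrix between 1-d samples A and B
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * bw**2))

rng = np.random.default_rng(0)
n1 = n2 = 12
s = 6                         # total number of bins
m = (n1 + n2) // s            # elements per bin, m = n/s
s1, s2 = n1 // m, n2 // m     # bins per sample
Y = rng.normal(0.0, 1.0, n1)
Z = rng.normal(0.5, 1.0, n2)
X = np.concatenate([Y, Z])    # bins are consecutive blocks of X

# Sufficient statistic: S[a, b] = sum of g(x, x') over x in bin a, x' in bin b
K = gaussian_kernel(X, X)
S = K.reshape(s, m, s, m).sum(axis=(1, 3))

def binned_vstat(pi):
    """Bin-permuted two-sample V-statistic, O(s^2) given S."""
    py, pz = pi[:s1], pi[s1:]
    return (S[np.ix_(py, py)].mean() + S[np.ix_(pz, pz)].mean()
            - 2 * S[np.ix_(py, pz)].mean()) / m**2

# Identity permutation matches the plug-in MMD^2 computed from raw points
direct = K[:n1, :n1].mean() + K[n1:, n1:].mean() - 2 * K[:n1, n1:].mean()
cheap = binned_vstat(np.arange(s))
```

Re-evaluating `binned_vstat` across random permutations of `np.arange(s)` then gives the cheap permutation null distribution without touching the n × n kernel matrix again.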

A.2. Additional notation for cheap independence testing (Alg. 2). We define the number of elements per bin m ≜ n/s and the datapoint bins Y^{(i)} = (Y_j)_{j ∈ I_i}, Z^{(i)} = (Z_j)_{j ∈ I_i}, and X^{(i)} = ((Y_j, Z_j))_{j ∈ I_i}. When convenient, we will also refer to the m elements of these bins as (Y^{(i)}_j)_{j=1}^m, (Z^{(i)}_j)_{j=1}^m, and (X^{(i)}_j)_{j=1}^m.

For the independence V-statistic of Def. 3, we define the bin kernel H_{in,m} as

H_{in,m}(X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}) = \frac{1}{m^4} \sum_{i_1,i_2,i_3,i_4=1}^{m} h_{in}(X^{(1)}_{i_1}, X^{(2)}_{i_2}, X^{(3)}_{i_3}, X^{(4)}_{i_4}),

where, for g((y, z), (y', z')) ≜ g_Y(y, y') g_Z(z, z'), we can write

h_{in}(X_1, X_2, X_3, X_4) = \iint g((y,z),(y',z')) \, d(δ_{(Y_1,Z_1)} + δ_{(Y_4,Z_4)} − δ_{(Y_1,Z_4)} − δ_{(Y_4,Z_1)})(y,z) \, d(δ_{(Y_2,Z_2)} + δ_{(Y_3,Z_3)} − δ_{(Y_2,Z_3)} − δ_{(Y_3,Z_2)})(y',z').

An alternative way to express h_{in}(X_1, X_2, X_3, X_4) is the following:

(31)  h_{in}(X_1, X_2, X_3, X_4) = h_{in,Y}(Y_1, Y_2, Y_3, Y_4) \, h_{in,Z}(Z_1, Z_2, Z_3, Z_4),  for
      h_{in,Y}(Y_1, Y_2, Y_3, Y_4) ≜ g_Y(Y_1, Y_2) + g_Y(Y_3, Y_4) − g_Y(Y_1, Y_3) − g_Y(Y_2, Y_4),
      h_{in,Z}(Z_1, Z_2, Z_3, Z_4) ≜ g_Z(Z_1, Z_2) + g_Z(Z_3, Z_4) − g_Z(Z_1, Z_3) − g_Z(Z_2, Z_4).

Using this, we can rewrite H_{in,m} as

H_{in,m}(X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)})
 = \frac{1}{m^4} \sum_{i_1,i_2,i_3,i_4=1}^{m} \iint g((y,z),(y',z')) \, d(δ_{(Y^{(1)}_{i_1},Z^{(1)}_{i_1})} + δ_{(Y^{(4)}_{i_4},Z^{(4)}_{i_4})} − δ_{(Y^{(1)}_{i_1},Z^{(4)}_{i_4})} − δ_{(Y^{(4)}_{i_4},Z^{(1)}_{i_1})})(y,z) \, d(δ_{(Y^{(2)}_{i_2},Z^{(2)}_{i_2})} + δ_{(Y^{(3)}_{i_3},Z^{(3)}_{i_3})} − δ_{(Y^{(2)}_{i_2},Z^{(3)}_{i_3})} − δ_{(Y^{(3)}_{i_3},Z^{(2)}_{i_2})})(y',z')
 = \iint g((y,z),(y',z')) \, d\Big(\frac{1}{m^2}\sum_{i_1,i_4=1}^{m} δ_{(Y^{(1)}_{i_1},Z^{(1)}_{i_1})} + δ_{(Y^{(4)}_{i_4},Z^{(4)}_{i_4})} − δ_{(Y^{(1)}_{i_1},Z^{(4)}_{i_4})} − δ_{(Y^{(4)}_{i_4},Z^{(1)}_{i_1})}\Big)(y,z) \, d\Big(\frac{1}{m^2}\sum_{i_2,i_3=1}^{m} δ_{(Y^{(2)}_{i_2},Z^{(2)}_{i_2})} + δ_{(Y^{(3)}_{i_3},Z^{(3)}_{i_3})} − δ_{(Y^{(2)}_{i_2},Z^{(3)}_{i_3})} − δ_{(Y^{(3)}_{i_3},Z^{(2)}_{i_2})}\Big)(y',z')
 = \iint g((y,z),(y',z')) \, d(δ_{(Y^{(1)},Z^{(1)})} + δ_{(Y^{(4)},Z^{(4)})} − δ_{Y^{(1)}} × δ_{Z^{(4)}} − δ_{Y^{(4)}} × δ_{Z^{(1)}})(y,z) \, d(δ_{(Y^{(2)},Z^{(2)})} + δ_{(Y^{(3)},Z^{(3)})} − δ_{Y^{(2)}} × δ_{Z^{(3)}} − δ_{Y^{(3)}} × δ_{Z^{(2)}})(y',z').

For a permutation π of [s], we can also define the cheaply-permuted independence V-statistic,

(32)  V^{π,s}_{n} ≜ \frac{1}{s^4} \sum_{(i_1,i_2,i_3,i_4) ∈ [s]^4} H_{in,m}\big((Y^{(i_1)}, Z^{(π_{i_1})}), (Y^{(i_2)}, Z^{(π_{i_2})}), (Y^{(i_3)}, Z^{(π_{i_3})}), (Y^{(i_4)}, Z^{(π_{i_4})})\big).

These expressions, along with the substitutions s for n and H_{in,m} for h_{in}, help us to frame a cheap independence V-statistic permutation test as a standard independence V-statistic permutation test operating on bins of datapoints.
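As a numerical sanity check of the factorization (31) (not part of the paper; the kernels below are arbitrary symmetric choices), the following sketch evaluates h_in both from its double-integral definition, expanded as a signed sum over the 4 × 4 grid of point-mass pairs, and from the product h_{in,Y} · h_{in,Z}, and confirms the two agree.

```python
import numpy as np

rng = np.random.default_rng(1)

def gY(a, b):  # symmetric kernel on the Y space
    return np.exp(-(a - b) ** 2)

def gZ(a, b):  # symmetric kernel on the Z space
    return 1.0 / (1.0 + (a - b) ** 2)

Y1, Y2, Y3, Y4 = rng.normal(size=4)
Z1, Z2, Z3, Z4 = rng.normal(size=4)

# Signed point masses of the two measures in the definition of h_in
first = [(Y1, Z1, 1), (Y4, Z4, 1), (Y1, Z4, -1), (Y4, Z1, -1)]
second = [(Y2, Z2, 1), (Y3, Z3, 1), (Y2, Z3, -1), (Y3, Z2, -1)]

# h_in from the double integral with g((y,z),(y',z')) = gY(y,y') gZ(z,z')
h_integral = sum(sgn * sgn2 * gY(y, yp) * gZ(z, zp)
                 for y, z, sgn in first for yp, zp, sgn2 in second)

# h_in from the factorization (31)
h_Y = gY(Y1, Y2) + gY(Y3, Y4) - gY(Y1, Y3) - gY(Y2, Y4)
h_Z = gZ(Z1, Z2) + gZ(Z3, Z4) - gZ(Z1, Z3) - gZ(Z2, Z4)
h_product = h_Y * h_Z
```

The agreement reflects that each signed measure is the product measure (δ_{Y_1} − δ_{Y_4}) ⊗ (δ_{Z_1} − δ_{Z_4}) (resp. with indices 2 and 3), so the double integral factors into a Y-part and a Z-part.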

APPENDIX B: Proof of Thm. 1: Power of cheap homogeneity: finite variance


We will establish both claims using the refined two moments method (Cor. 3) with the auxiliary sequence (U^{π_b,s}_{n_1,n_2})_{b=1}^{B} of cheaply-permuted two-sample U-statistics (30). The expression (28) for H_{ho,m} ensures that E[U^{π,s}_{n_1,n_2} | X] = 0 surely for all s ≥ 4, so, by Cor. 3, it suffices to bound Var(U_{n_1,n_2}) and E[Var(U^{π,s}_{n_1,n_2} | X)]. Lem. B.1, proved in App. B.1, gives the unconditional variance bound.

LEMMA B.1 (Variance of homogeneity U-statistics). Under the assumptions of Thm. 1,

(33)  Var(U_{n_1,n_2}) ≤ \frac{4ψ_{Y,1}}{n_1} + \frac{4ψ_{Z,1}}{n_2} + \frac{(17 n_1 n_2 + 4 n_2^2 + 4 n_1^2) ψ_{YZ,2}}{n_1(n_1−1) n_2(n_2−1)}.

Lem. B.2, proved in App. B.2, provides the conditional variance bounds.

LEMMA B.2 (Conditional variance of permuted homogeneity U-statistics). Under the assumptions of Thm. 1,

(34)  E[Var(U^{π,n}_{n_1,n_2} | X)] ≤ \frac{2 n_1^2 + 2 n_2^2 + 16 n_1 n_2}{n_1(n_1−1) n_2(n_2−1)} ψ̃_{YZ,2}  and

(35)  E[Var(U^{π,s}_{n_1,n_2} | X)] ≤ \Big( \frac{2 n_2^2 + 2 n_1^2 + 16 n_1 n_2 + 400(n_1^2 + n_2^2)/s}{n_1(n_1−1) n_2(n_2−1)} + \frac{352 n^4/s}{n_1(n_1−1)^2 n_2(n_2−1)^2} \Big) ψ̃_{YZ,2}
      + \frac{8 n^3 m^3 E[U_{n_1,n_2}]^2 + 96 n^3 m^2 \max\{ψ_{Y,1}, ψ_{Z,1}\}}{n_1(n_1−1)^2 n_2(n_2−1)^2}.

Cor. 3 and Lems. B.1 and B.2 together imply power at least 1 − β for standard testing if

E[U_{n_1,n_2}] ≥ \sqrt{\frac{3−β}{β}\Big(\frac{4ψ_{Y,1}}{n_1} + \frac{4ψ_{Z,1}}{n_2} + \frac{(17 n_1 n_2 + 4 n_2^2 + 4 n_1^2) ψ_{YZ,2}}{n_1(n_1−1) n_2(n_2−1)}\Big)} + \sqrt{\frac{1−α^⋆}{α^⋆} \cdot \frac{(6 n_1^2 + 6 n_2^2 + 48 n_1 n_2) ψ_{YZ,2}}{β n_1(n_1−1) n_2(n_2−1)}},

a condition implied by our assumption (9), and power at least 1 − β for cheap testing if

(36)  E[U_{n_1,n_2}] ≥ \sqrt{\frac{3−β}{β}\Big(\frac{4ψ_{Y,1}}{n_1} + \frac{4ψ_{Z,1}}{n_2} + \frac{(17 n_1 n_2 + 4 n_2^2 + 4 n_1^2) ψ_{YZ,2}}{n_1(n_1−1) n_2(n_2−1)}\Big)}
      + \sqrt{\frac{(α^⋆)^{−1}−1}{β}\Big(\frac{6 n_2^2 + 6 n_1^2 + 48 n_1 n_2}{n_1(n_1−1) n_2(n_2−1)} + \frac{1200(n_1−1)(n_2−1)(n_1^2+n_2^2)/s + 1056 n^3 m}{n_1(n_1−1)^2 n_2(n_2−1)^2}\Big) ψ̃_{YZ,2}}
      + \Big(\frac{288((α^⋆)^{−1}−1) n^3 m^2 \max\{ψ_{Y,1}, ψ_{Z,1}\}}{β n_1(n_1−1)^2 n_2(n_2−1)^2}\Big)^{1/2} + \sqrt{\frac{24((α^⋆)^{−1}−1) n^3 m^3 E[U_{n_1,n_2}]^2}{β n_1(n_1−1)^2 n_2(n_2−1)^2}}.
We can rewrite (36) equivalently as

(37)  E[U_{n_1,n_2}] ≥ \Big(1 − \sqrt{\frac{24((α^⋆)^{−1}−1) n^3 m^3}{β n_1(n_1−1)^2 n_2(n_2−1)^2}}\Big)^{−1} \Big[ \sqrt{\frac{3−β}{β}\Big(\frac{4ψ_{Y,1}}{n_1} + \frac{4ψ_{Z,1}}{n_2} + \frac{(17 n_1 n_2 + 4 n_2^2 + 4 n_1^2) ψ_{YZ,2}}{n_1(n_1−1) n_2(n_2−1)}\Big)}
      + \sqrt{\frac{(α^⋆)^{−1}−1}{β}\Big(\frac{6 n_2^2 + 6 n_1^2 + 48 n_1 n_2 + 1200(n_1^2+n_2^2)/s}{n_1(n_1−1) n_2(n_2−1)} + \frac{1056 n^3 m}{n_1(n_1−1)^2 n_2(n_2−1)^2}\Big) ψ̃_{YZ,2}}
      + \Big(\frac{288((α^⋆)^{−1}−1) n^3 m^2 \max\{ψ_{Y,1}, ψ_{Z,1}\}}{β n_1(n_1−1)^2 n_2(n_2−1)^2}\Big)^{1/2} \Big].

To conclude, we note that our assumption (10) is a sufficient condition for (37), as

n^4 = (n_1^2 + n_2^2)^2 + 4 n_1 n_2 (n_1^2 + n_2^2) + 4 n_1^2 n_2^2 ≥ 8 n_1 n_2 (n_1^2 + n_2^2) ≥ 8 (n_1−1)(n_2−1)(n_1^2 + n_2^2).
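The closing algebraic identity and the two inequalities that follow it can be checked mechanically; the sketch below (a verification aid only, not part of the proof) confirms them on a grid of small integer sample sizes.

```python
# Verify n^4 = (n1^2+n2^2)^2 + 4 n1 n2 (n1^2+n2^2) + 4 n1^2 n2^2
#          >= 8 n1 n2 (n1^2+n2^2) >= 8 (n1-1)(n2-1)(n1^2+n2^2)
for n1 in range(1, 30):
    for n2 in range(1, 30):
        n = n1 + n2
        q = n1**2 + n2**2
        assert n**4 == q**2 + 4 * n1 * n2 * q + 4 * n1**2 * n2**2
        assert n**4 >= 8 * n1 * n2 * q          # AM-GM on the two outer terms
        assert n**4 >= 8 * (n1 - 1) * (n2 - 1) * q
ok = True
```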

B.1. Proof of Lem. B.1: Variance of homogeneity U-statistics. By [28, p. 38],

Var(U_{n_1,n_2}) ≤ \sum_{i=0}^{2} \sum_{j=0}^{2} \binom{2}{i}\binom{2}{j}\binom{n_1−2}{2−i}\binom{n_2−2}{2−j}\binom{n_1}{2}^{−1}\binom{n_2}{2}^{−1} \hat{σ}^2_{i,j}

(38)  = 16 \sum_{i=0}^{2} \sum_{j=0}^{2} \frac{(n_1−2)!^2 (n_2−2)!^2}{n_1! \, i! \, (2−i)!^2 \, (n_1−4+i)! \; n_2! \, j! \, (2−j)!^2 \, (n_2−4+j)!} \hat{σ}^2_{i,j},  where

\hat{σ}^2_{i,j} ≜ Var(E[h_{ho}(Y_1, Y_2; Z_1, Z_2) | Y_1, ..., Y_i, Z_1, ..., Z_j])  for  0 ≤ i, j ≤ 2.

Note that for n_1 ≥ 4 (and analogously for n_2 ≥ 4),

\frac{(n_1−2)!^2}{n_1! \, i! \, (2−i)!^2 \, (n_1−4+i)!} =
  \frac{(n_1−2)!^2}{4 n_1!(n_1−4)!} = \frac{(n_1−2)(n_1−3)}{4 n_1(n_1−1)} ≤ \frac{1}{4}  if i = 0,
  \frac{(n_1−2)!^2}{n_1!(n_1−3)!} = \frac{n_1−2}{n_1(n_1−1)} ≤ \frac{1}{n_1}  if i = 1,
  \frac{(n_1−2)!^2}{2 n_1!(n_1−2)!} = \frac{1}{2 n_1(n_1−1)}  if i = 2.

Plugging this into the right-hand side of (38), we obtain that for n_1, n_2 ≥ 4,

Var(U_{n_1,n_2}) ≤ 16 \Big( \frac{\hat{σ}^2_{0,0}}{4^2} + \frac{\hat{σ}^2_{0,1}}{4 n_2} + \frac{\hat{σ}^2_{1,0}}{4 n_1} + \frac{\hat{σ}^2_{1,1}}{n_1 n_2} + \frac{\hat{σ}^2_{0,2}}{8 n_2(n_2−1)} + \frac{\hat{σ}^2_{2,0}}{8 n_1(n_1−1)} + \frac{\hat{σ}^2_{1,2}}{2 n_1 n_2(n_2−1)} + \frac{\hat{σ}^2_{2,1}}{2 n_2 n_1(n_1−1)} + \frac{\hat{σ}^2_{2,2}}{4 n_1(n_1−1) n_2(n_2−1)} \Big)

 ≤ \frac{4\hat{σ}^2_{1,0}}{n_1} + \frac{4\hat{σ}^2_{0,1}}{n_2} + (16 + 9)\frac{\hat{σ}^2_{2,2}}{n_1 n_2} + (2 + 2)\frac{\hat{σ}^2_{2,2}}{n_1(n_1−1)} + (2 + 2)\frac{\hat{σ}^2_{2,2}}{n_2(n_2−1)}

 ≤ \frac{4\hat{σ}^2_{1,0}}{n_1} + \frac{4\hat{σ}^2_{0,1}}{n_2} + \frac{(17 n_1 n_2 + 4 n_2^2 + 4 n_1^2) \hat{σ}^2_{2,2}}{n_1(n_1−1) n_2(n_2−1)}.

In the second inequality, we used that \hat{σ}^2_{0,0} = 0 and that, by the law of total variance, \hat{σ}^2_{i,j} ≤ \hat{σ}^2_{2,2} for all i, j. Equation (33) then follows from the fact that \hat{σ}^2_{1,0} ≤ ψ_{Y,1}, \hat{σ}^2_{0,1} ≤ ψ_{Z,1}, and \hat{σ}^2_{2,2} ≤ ψ_{YZ,2} by their definitions.
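The factorial simplification leading to (38) can be confirmed numerically; the sketch below (a verification aid, not part of the proof) checks, with exact rational arithmetic, that the combinatorial weight (2 choose i)(n_1−2 choose 2−i)/(n_1 choose 2) matches the factorial form 4(n_1−2)!²/(i!(2−i)!² n_1!(n_1−4+i)!) and the three closed-form cases above.

```python
from fractions import Fraction
from math import comb, factorial

def weight(n1, i):
    # (2 C i)(n1-2 C 2-i) / (n1 C 2), one sample's factor in the U-stat bound
    return Fraction(comb(2, i) * comb(n1 - 2, 2 - i), comb(n1, 2))

def factorial_form(n1, i):
    # 4 (n1-2)!^2 / (i! (2-i)!^2 n1! (n1-4+i)!)
    return Fraction(4 * factorial(n1 - 2) ** 2,
                    factorial(i) * factorial(2 - i) ** 2
                    * factorial(n1) * factorial(n1 - 4 + i))

closed = {
    0: lambda n1: Fraction((n1 - 2) * (n1 - 3), n1 * (n1 - 1)),
    1: lambda n1: Fraction(4 * (n1 - 2), n1 * (n1 - 1)),
    2: lambda n1: Fraction(2, n1 * (n1 - 1)),
}
for n1 in range(4, 20):
    for i in range(3):
        assert weight(n1, i) == factorial_form(n1, i) == closed[i](n1)
ok = True
```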

B.2. Proof of Lem. B.2: Conditional variance of permuted homogeneity U-statistics.


We begin by defining the convenient shorthand
i ≜ (i1 , i2 , j1 , j2 , i′1 , i′2 , j1′ , j2′ ) and s(i) ≜ |{i1 , i2 } ∩ {i′1 , i′2 }| + |{j1 , j2 } ∩ {j1′ , j2′ }|.
B.2.1. Proof of (34). Following the proof of [25, Thm. 4.1], we define the index sets

I_{total} ≜ {i ∈ N_+^8 : (i_1, i_2) ∈ i_2^{n_1}, (j_1, j_2) ∈ i_2^{n_2}, (i'_1, i'_2) ∈ i_2^{n_1}, (j'_1, j'_2) ∈ i_2^{n_2}},
I_1 ≜ {i ∈ I_{total} : s(i) ≤ 1},  and  I_1^c = {i ∈ I_{total} : s(i) > 1}.

By the argument of [25, Thm. 4.1], we have

(39)  E[Var(U^{π,n}_{n_1,n_2} | X)] ≤ ψ̃_{YZ,2} \frac{|I_1^c|}{(n_1)_{(2)}^2 (n_2)_{(2)}^2},

where ψ̃_{YZ,2} is the maximum over i ∈ I_1^c of

E\big[ E[h̄_{ho}(X_{π_{i_1}}, X_{π_{i_2}}; X_{π_{n_1+j_1}}, X_{π_{n_1+j_2}}) h̄_{ho}(X_{π_{i'_1}}, X_{π_{i'_2}}; X_{π_{n_1+j'_1}}, X_{π_{n_1+j'_2}}) | X] \big],

and h̄_{ho} is the symmetrized version of h_{ho} defined in (7). Moreover,

|I_{total}| = n_1^2(n_1−1)^2 n_2^2(n_2−1)^2,
|{i ∈ I_{total} : s(i) = 0}| = n_1(n_1−1)(n_1−2)(n_1−3) n_2(n_2−1)(n_2−2)(n_2−3),  and
|{i ∈ I_{total} : s(i) = 1}| = 4 n_1(n_1−1)(n_1−2) n_2(n_2−1)(n_2−2)((n_1−3) + (n_2−3)),

and therefore

(40)  |I_1^c| = |I_{total}| − |{i ∈ I_{total} : s(i) = 0}| − |{i ∈ I_{total} : s(i) = 1}|
      = n_1(n_1−1) n_2(n_2−1)(2 n_1^2 + 2 n_2^2 + 16 n_1 n_2 − 34 n_1 − 34 n_2 + 60).

Since ψ_{YZ,2} upper-bounds ψ̃_{YZ,2} and n ≥ 4, the following estimate, provided by (39) and (40), completes our proof:

E[Var(U^{π,n}_{n_1,n_2} | X)] ≤ \frac{2 ψ̃_{YZ,2} (n_1^2 + n_2^2 + 8 n_1 n_2 − 17(n_1 + n_2) + 30)}{n_1(n_1−1) n_2(n_2−1)} ≤ \frac{ψ̃_{YZ,2} (2 n_1^2 + 2 n_2^2 + 16 n_1 n_2)}{n_1(n_1−1) n_2(n_2−1)}.
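The count (40) can be verified by brute force for small sample sizes; the following sketch (a verification aid, not part of the proof) enumerates all index tuples of distinct ordered pairs and compares the number with overlap s(i) > 1 against the closed form.

```python
def count_overlaps(n1, n2):
    # Enumerate i = (i1,i2,j1,j2,i1',i2',j1',j2') with distinct ordered pairs
    pairs1 = [(a, b) for a in range(n1) for b in range(n1) if a != b]
    pairs2 = [(a, b) for a in range(n2) for b in range(n2) if a != b]
    count = 0
    for (i1, i2) in pairs1:
        for (j1, j2) in pairs2:
            for (k1, k2) in pairs1:
                for (l1, l2) in pairs2:
                    s = len({i1, i2} & {k1, k2}) + len({j1, j2} & {l1, l2})
                    if s > 1:
                        count += 1
    return count

def closed_form(n1, n2):
    # |I_1^c| from equation (40)
    return (n1 * (n1 - 1) * n2 * (n2 - 1)
            * (2 * n1**2 + 2 * n2**2 + 16 * n1 * n2 - 34 * n1 - 34 * n2 + 60))

brute = count_overlaps(4, 4)
```

For n_1 = n_2 = 4, both sides give 15552 out of 20736 tuples in I_total.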

B.2.2. Proof of (35). In this case, we can write

(41)  E[Var(U^{π,s}_{n_1,n_2} | X)] = E[E[(U^{π,s}_{n_1,n_2})^2 | X]] − E[(E[U^{π,s}_{n_1,n_2} | X])^2]
 = \frac{1}{n_1^2(n_1−1)^2 n_2^2(n_2−1)^2} \sum_{i ∈ I_{total}} E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X]
 − E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big],

where π̃_i ≜ m π_{⌊i/m⌋} + i − m⌊i/m⌋. We now express I_{total} as the disjoint union of four sets:

• I_{diff}, containing s_1^2(s_1−1)^2 s_2^2(s_2−1)^2 m^8 = s_1^2(s_1^2−2s_1+1) s_2^2(s_2^2−2s_2+1) m^8 indices for which the bins of each component are pairwise different.
• I_a, containing the indices for which one or more of the pairs among (i_1, i_2), (i'_1, i'_2), (n_1+j_1, n_1+j'_1), (n_1+j_2, n_1+j'_2) share the same bin, but {⌊i_1/m⌋, ⌊i_2/m⌋} ∩ {⌊i'_1/m⌋, ⌊i'_2/m⌋} = ∅ and {⌊(n_1+j_1)/m⌋, ⌊(n_1+j_2)/m⌋} ∩ {⌊(n_1+j'_1)/m⌋, ⌊(n_1+j'_2)/m⌋} = ∅.
• I_b, containing the indices for which exactly one pair among (i_1, i_2), (i'_1, i'_2), (n_1+j_1, n_1+j'_1), (n_1+j_2, n_1+j'_2) shares the same bin, and |{⌊i_1/m⌋, ⌊i_2/m⌋} ∩ {⌊i'_1/m⌋, ⌊i'_2/m⌋}| + |{⌊(n_1+j_1)/m⌋, ⌊(n_1+j_2)/m⌋} ∩ {⌊(n_1+j'_1)/m⌋, ⌊(n_1+j'_2)/m⌋}| = 1.
• I_{rest}, containing the rest of the indices. For indices in I_{rest}, there has to be at least one pair among (i_1, i_2), (i'_1, i'_2), (n_1+j_1, n_1+j'_1), (n_1+j_2, n_1+j'_2) that shares the same bin, and there have to be at least two pairs formed by an index in {i_1, i_2, n_1+j_1, n_1+j_2} and an index in {i'_1, i'_2, n_1+j'_1, n_1+j'_2} such that both indices belong to the same bin. From this argument, we can deduce that

|I_{rest}| ≤ (4 s_1 s_2^4 + 4 s_1^2 s_2^3 + 16 s_1^2 s_2^3 + 16 s_1^3 s_2^2 + 4 s_1^4 s_2 + 4 s_1^3 s_2^2) m^8 = (4 s_1 s_2^4 + 20 s_1^2 s_2^3 + 20 s_1^3 s_2^2 + 4 s_1^4 s_2) m^8.
Contribution of I_{diff} terms. To deal with the contribution from I_{diff} terms, we mirror the argument of [25, Thm. 4.1]. In App. A.1 we have seen that cheap permutation can be treated like standard permutation of dataset bins with the substitutions n ← s, n_1 ← s_1, n_2 ← s_2, h_{ho} ← H_{ho,m}, g ← G_m. To this end, we define the index sets

K_{total} ≜ {i ∈ N_+^8 : (i_1, i_2) ∈ i_2^{s_1}, (j_1, j_2) ∈ i_2^{s_2}, (i'_1, i'_2) ∈ i_2^{s_1}, (j'_1, j'_2) ∈ i_2^{s_2}},
K_1 ≜ {i ∈ K_{total} : s(i) ≤ 1},  K_1^c = {i ∈ K_{total} : s(i) > 1},  and  K_2 ≜ {i ∈ K_{total} : s(i) = 2}.

We compute the size of K_2:

|K_2| = 2 s_1(s_1−1) s_2(s_2−1)(s_2−2)(s_2−3) + 2 s_1(s_1−1)(s_1−2)(s_1−3) s_2(s_2−1) + 4 s_1(s_1−1)(s_1−2) · 4 s_2(s_2−1)(s_2−2)
 = s_1(s_1−1) s_2(s_2−1)(2 s_2^2 + 2 s_1^2 + 16 s_1 s_2 − 42 s_1 − 42 s_2 + 88).

Hence, we obtain that

|K_1^c \ K_2| = |K_1^c| − |K_2| = s_1(s_1−1) s_2(s_2−1)((−34 s_1 − 34 s_2 + 60) − (−42 s_1 − 42 s_2 + 88)) = s_1(s_1−1) s_2(s_2−1)(8 s_1 + 8 s_2 − 28).
Our goal is now to upper-bound the contribution of I_{diff} to (41),

\sum_{i ∈ I_{diff}} E[E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X]]

 = m^8 \sum_{i ∈ K_{total}} E\big[ E[H_{ho,m}(X^{(π_{i_1})}, X^{(π_{i_2})}; X^{(π_{s_1+j_1})}, X^{(π_{s_1+j_2})}) H_{ho,m}(X^{(π_{i'_1})}, X^{(π_{i'_2})}; X^{(π_{s_1+j'_1})}, X^{(π_{s_1+j'_2})}) | X] \big]

(42)  ≤ \sum_{i ∈ K_{total}} E\Big[ E\Big[ \sum_{k_1,k_2,ℓ_1,ℓ_2=1}^{m} h_{ho}(X^{(π_{i_1})}_{k_1}, X^{(π_{i_2})}_{k_2}; X^{(π_{s_1+j_1})}_{ℓ_1}, X^{(π_{s_1+j_2})}_{ℓ_2}) \sum_{k'_1,k'_2,ℓ'_1,ℓ'_2=1}^{m} h_{ho}(X^{(π_{i'_1})}_{k'_1}, X^{(π_{i'_2})}_{k'_2}; X^{(π_{s_1+j'_1})}_{ℓ'_1}, X^{(π_{s_1+j'_2})}_{ℓ'_2}) \Big| X\Big] \Big]

 = \sum_{i ∈ K_{total}} \sum_{K ∈ J_{total}} E\big[ E[h_{ho}(X^{(π_{i_1})}_{k_1}, X^{(π_{i_2})}_{k_2}; X^{(π_{s_1+j_1})}_{ℓ_1}, X^{(π_{s_1+j_2})}_{ℓ_2}) h_{ho}(X^{(π_{i'_1})}_{k'_1}, X^{(π_{i'_2})}_{k'_2}; X^{(π_{s_1+j'_1})}_{ℓ'_1}, X^{(π_{s_1+j'_2})}_{ℓ'_2}) | X] \big],

where we define K ≜ (k_1, k_2, ℓ_1, ℓ_2, k'_1, k'_2, ℓ'_1, ℓ'_2),

J_{total} ≜ {1, ..., m}^8,  J_1 ≜ {K ∈ J_{total} : s(K) ≤ 1},  and  J_1^c = {K ∈ J_{total} : s(K) > 1}.

Each non-zero summand in the final expression of (42) has indices satisfying

1. i ∈ K_2, with, for the indices in i that are equal, the corresponding indices in K also equal (so that, for each i ∈ K_2, there are at most m^6 terms with non-zero contribution), or
2. i ∈ K_1^c \ K_2, and K ∈ J_1^c (as otherwise the terms would have zero contribution).
We now bound the size of J_1^c:

|J_1| = |{K ∈ J_{total} : s(K) = 0}| + |{K ∈ J_{total} : s(K) = 1}|,
|{K ∈ J_{total} : s(K) = 0}| ≥ m(m−1)(m−2)(m−3) · m(m−1)(m−2)(m−3) + 4 m(m−1)(m−2) · m(m−1)(m−2)(m−3)
 = m(m−1)(m−2) · m(m−1)(m−2)(m−3) · ((m−3) + 4),
|{K ∈ J_{total} : s(K) = 1}| ≥ 8 m(m−1)(m−2) · m(m−1)(m−2)(m−3),

so that

|J_1^c| = |J_{total}| − |{K ∈ J_{total} : s(K) = 0}| − |{K ∈ J_{total} : s(K) = 1}|
 ≤ m^8 − m(m−1)(m−2) m(m−1)(m−2)(m−3)((m−3) + 4) − 8 m(m−1)(m−2) m(m−1)(m−2)(m−3)
 = 50 m^6 − 228 m^5 + 419 m^4 − 348 m^3 + 108 m^2 ≤ 50 m^6.

Since every non-zero summand is bounded by ψ̃_{YZ,2}, we can bound (42) by

|K_2| · m^6 ψ̃_{YZ,2} + |K_1^c \ K_2| · |J_1^c| ψ̃_{YZ,2}
 ≤ s_1(s_1−1) s_2(s_2−1)((2 s_2^2 + 2 s_1^2 + 16 s_1 s_2) m^6 + (8 s_1 + 8 s_2) · 50 m^6) ψ̃_{YZ,2}
 = n_1(n_1−m) n_2(n_2−m)(2 n_2^2 + 2 n_1^2 + 16 n_1 n_2 + 400(n_1^2 + n_2^2)/s) ψ̃_{YZ,2},

which, in turn, bounds the contribution of I_{diff} to (41).
Contribution of I_a terms. Without loss of generality, assume that i_1 and i_2 belong to the same bin but that all other indices belong to pairwise different bins. Then,

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X]
 − E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 = E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X]
 − E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big] = 0.

Contribution of I_b terms. Without loss of generality, assume that i_1 and i_2 belong to the same bin and that n_1+j_1 and n_1+j'_2 also belong to the same bin. Since exchanging the indices i'_1 and n_1+j'_1 does not change the distribution, and h_{ho}(x, y; x', y') = −h_{ho}(x', y; x, y'),

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 = E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{n_1+j'_1}}, X_{π̃_{i'_2}}; X_{π̃_{i'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 = −E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big] = 0.

The same argument works to show that

E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] = 0.

Contribution of Irest terms. We consider further subcases.


I_{rest} terms, subcase (i). For indices in I_{rest} such that i_1, i_2, n_1+j_1, n_1+j_2, i'_1, i'_2, n_1+j'_1, n_1+j'_2 are all pairwise different,

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 = E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | π] \big]
 = E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | π] E[h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | π] \big].

The two conditional expectations play the same role, so we focus on the first. We consider the following cases:

• π̃_{i_1} ≤ n_1, π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} > n_1, π̃_{n_1+j_2} > n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | π] = E[h_{ho}(Y_1, Y_2; Z_1, Z_2)] = E[U_{n_1,n_2}].
• π̃_{i_1} ≤ n_1, π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} ≤ n_1, π̃_{n_1+j_2} > n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | π] = E[h_{ho}(Y_1, Y_2; Y_3, Z_1)] = E[h_{ho}(Y_3, Y_2; Y_1, Z_1)] = −E[h_{ho}(Y_1, Y_2; Y_3, Z_1)] = 0.
• π̃_{i_1} ≤ n_1, π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} ≤ n_1, π̃_{n_1+j_2} ≤ n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | π] = E[h_{ho}(Y_1, Y_2; Y_3, Y_4)] = E[h_{ho}(Y_3, Y_2; Y_1, Y_4)] = −E[h_{ho}(Y_1, Y_2; Y_3, Y_4)] = 0.

Thus, for all terms in subcase (i), we have that

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big] ≤ E[U_{n_1,n_2}]^2.

Moreover, the fraction of I_{rest} terms in subcase (i) is m^2(m−1)^2(m−2)^2(m−3)^2/m^8.
I_{rest} terms, subcase (ii). Now we consider indices in I_{rest} such that there is exactly one pair among i_1, i_2, n_1+j_1, n_1+j_2, i'_1, i'_2, n_1+j'_1, n_1+j'_2 that are the same, and the other indices are pairwise different. Without loss of generality, we can assume that i_1 = i'_1 (recall that i_1 ≠ i_2 always). Then,

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 = E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | π] \big]
 = E\big[ E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X_{π̃_{i_1}}, π] E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X_{π̃_{i_1}}, π] \big| π \big] \big].

The two inner conditional expectations play the same role, so we focus on the first. We consider the following cases with (Y, Y', Y'', Z, Z', Z'') ∼ P × P × P × Q × Q × Q given (X_{π̃_{i_1}}, π):

• π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} > n_1, π̃_{n_1+j_2} > n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y; Z, Z') | X_{π̃_{i_1}}, π].
• π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} ≤ n_1, π̃_{n_1+j_2} > n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y; Y', Z') | X_{π̃_{i_1}}, π].
• π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} > n_1, π̃_{n_1+j_2} ≤ n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y; Z', Y') | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y'; Z', Y) | X_{π̃_{i_1}}, π] = −E[h_{ho}(X_{π̃_{i_1}}, Y; Z', Y') | X_{π̃_{i_1}}, π] = 0.
• π̃_{i_2} ≤ n_1, π̃_{n_1+j_1} ≤ n_1, π̃_{n_1+j_2} ≤ n_1 or analogous configurations: Then,
E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y; Y', Y'') | X_{π̃_{i_1}}, π] = E[h_{ho}(X_{π̃_{i_1}}, Y''; Y', Y) | X_{π̃_{i_1}}, π] = −E[h_{ho}(X_{π̃_{i_1}}, Y; Y', Y'') | X_{π̃_{i_1}}, π] = 0.

Thus, we obtain that

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big]
 ≤ \max\Big\{ E\big[ E[h_{ho}(X_{π̃_{i_1}}, Y; Z, Z') | X_{π̃_{i_1}}, π]^2 \big], E\big[ E[h_{ho}(X_{π̃_{i_1}}, Y; Y', Z') | X_{π̃_{i_1}}, π]^2 \big] \Big\}
 = \max\Big\{ \frac{n_1}{n} E\big[ E[h_{ho}(Y', Y; Z, Z') | Y']^2 \big] + \frac{n_2}{n} E\big[ E[h_{ho}(Z'', Y; Z, Z') | Z'']^2 \big],
 \frac{n_1}{n} E\big[ E[h_{ho}(Y'', Y; Y', Z') | Y'']^2 \big] + \frac{n_2}{n} E\big[ E[h_{ho}(Z'', Y; Y', Z) | Z'']^2 \big] \Big\}.

Now, notice that

E\big[ E[h_{ho}(Y', Y; Z, Z') | Y']^2 \big] = Var\big( E[h_{ho}(Y', Y; Z, Z') | Y'] \big) + E[h_{ho}(Y', Y; Z, Z')]^2 = Var\big( E[h_{ho}(Y', Y; Z, Z') | Y'] \big) + E[U_{n_1,n_2}]^2,
E\big[ E[h_{ho}(Z'', Y; Z, Z') | Z'']^2 \big] = Var\big( E[h_{ho}(Z'', Y; Z, Z') | Z''] \big) + E[h_{ho}(Z'', Y; Z, Z')]^2 = Var\big( E[h_{ho}(Z'', Y; Z, Z') | Z''] \big),
E\big[ E[h_{ho}(Y'', Y; Y', Z') | Y'']^2 \big] = Var\big( E[h_{ho}(Y'', Y; Y', Z') | Y''] \big),  and
E\big[ E[h_{ho}(Z'', Y; Y', Z') | Z'']^2 \big] = Var\big( E[h_{ho}(Y', Y; Z'', Z') | Z''] \big) + E[U_{n_1,n_2}]^2.

Thus, for I_{rest} terms in subcase (ii),

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big] ≤ \max\{ψ_{Y,1}, ψ_{Z,1}\} + E[U_{n_1,n_2}]^2.

Moreover, the fraction of I_{rest} terms in this subcase is upper-bounded by 12 m^2(m−1)^2(m−2)^2(m−3)/m^8.
I_{rest} terms, subcase (iii). For the remaining terms in I_{rest} we can write

E\big[ E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X] \big] ≤ ψ̃_{YZ,2}.

Since there need to be at least two pairs of indices that share the same value, we have that the fraction of elements of I_{rest} that fall under subcase (iii) is upper-bounded by

\frac{1}{m^8}\Big( \binom{4}{3} m^6 + \binom{4}{3} m^6 + \binom{4}{2}^2 m^6 \Big) = \frac{44}{m^2}.
Total contribution of I_{rest} terms. We end up with a bound of the form

\sum_{i ∈ I_{rest}} E[E[h_{ho}(X_{π̃_{i_1}}, X_{π̃_{i_2}}; X_{π̃_{n_1+j_1}}, X_{π̃_{n_1+j_2}}) h_{ho}(X_{π̃_{i'_1}}, X_{π̃_{i'_2}}; X_{π̃_{n_1+j'_1}}, X_{π̃_{n_1+j'_2}}) | X]]

 ≤ |I_{rest}| \Big( \frac{(m−1)^2(m−2)^2(m−3)^2}{m^6} E[U_{n_1,n_2}]^2 + \frac{12(m−1)^2(m−2)^2(m−3)}{m^6} \big( \max\{ψ_{Y,1}, ψ_{Z,1}\} + E[U_{n_1,n_2}]^2 \big) + \frac{44}{m^2} ψ̃_{YZ,2} \Big)

 ≤ (4 s_1 s_2^4 + 20 s_1^2 s_2^3 + 20 s_1^3 s_2^2 + 4 s_1^4 s_2) m^8 \Big( E[U_{n_1,n_2}]^2 + \frac{12}{m} \max\{ψ_{Y,1}, ψ_{Z,1}\} + \frac{44}{m^2} ψ̃_{YZ,2} \Big).
Combining our I_{diff}, I_a, I_b, and I_{rest} estimates, we obtain the advertised variance bound:

E[Var(U^{π,s}_{n_1,n_2} | X)] \, n_1^2(n_1−1)^2 n_2^2(n_2−1)^2
 ≤ n_1(n_1−m) n_2(n_2−m)(2 n_2^2 + 2 n_1^2 + 16 n_1 n_2 + 400(n_1^2 + n_2^2)/s) ψ̃_{YZ,2}
 + (4 n_1 n_2^4 + 20 n_1^2 n_2^3 + 20 n_1^3 n_2^2 + 4 n_1^4 n_2) m^3 \Big( E[U_{n_1,n_2}]^2 + \frac{12}{m} \max\{ψ_{Y,1}, ψ_{Z,1}\} + \frac{44}{m^2} ψ̃_{YZ,2} \Big),

so that, dividing both sides by n_1 n_2,

E[Var(U^{π,s}_{n_1,n_2} | X)] \, n_1(n_1−1)^2 n_2(n_2−1)^2
 ≤ (n_1−m)(n_2−m)(2 n_2^2 + 2 n_1^2 + 16 n_1 n_2 + 400(n_1^2 + n_2^2)/s) ψ̃_{YZ,2}
 + (4 n_2^3 + 20 n_1 n_2^2 + 20 n_1^2 n_2 + 4 n_1^3) m^3 \Big( E[U_{n_1,n_2}]^2 + \frac{12}{m} \max\{ψ_{Y,1}, ψ_{Z,1}\} + \frac{44}{m^2} ψ̃_{YZ,2} \Big)
 ≤ (n_1−m)(n_2−m)(2 n_2^2 + 2 n_1^2 + 16 n_1 n_2 + 400(n_1^2 + n_2^2)/s) ψ̃_{YZ,2}
 + 8 n^3 m^3 \Big( E[U_{n_1,n_2}]^2 + \frac{12}{m} \max\{ψ_{Y,1}, ψ_{Z,1}\} + \frac{44}{m^2} ψ̃_{YZ,2} \Big).

APPENDIX C: Proof of Cor. 1: Power of cheap homogeneity: finite variance, PD


We begin with a lemma that relates the parameters of Cor. 1 to those of Thm. 1.

LEMMA C.1 (Positive-definite bounds on ψ_{Y,1}, ψ_{Z,1}). Under the premises of Cor. 1, ψ_{Y,1} ≤ MMD_k^2(P, Q) ξ_Q and ψ_{Z,1} ≤ MMD_k^2(P, Q) ξ_P.

PROOF. We prove one case as the others are identical. By Cauchy-Schwarz,

|E[h_{ho}(Y_1, Y_2; Z_1, Z_2) | Z_1]| = |E[k(Y_1, Y_2) + k(Z_1, Z_2) − k(Y_1, Z_2) − k(Y_2, Z_1) | Z_1]|
 = \Big| \iint k(x, x') \, d(P − Q)(x) \, d(P − δ_{Z_1})(x') \Big|
 = \Big| \Big\langle \int k(x, ·) \, d(P − Q)(x), \int k(x, ·) \, d(P − δ_{Z_1})(x) \Big\rangle_k \Big|
 ≤ \Big\| \int k(x, ·) \, d(P − Q)(x) \Big\|_k \Big\| \int k(x, ·) \, d(P − δ_{Z_1})(x) \Big\|_k
 = MMD_k(P, Q) \, MMD_k(δ_{Z_1}, P).

Since E[E[h_{ho}(Y_1, Y_2; Z_1, Z_2) | Z_1]] = 0, squaring and taking expectations yields the claim.

We first show that (11) implies the sufficient power condition (9) of Thm. 1. Since

(43)  \sqrt{\frac{3−β}{β}\Big(\frac{4ψ_{Y,1}}{n_1} + \frac{4ψ_{Z,1}}{n_2}\Big)} ≤ \sqrt{\frac{3−β}{β}\Big(\frac{4 \, MMD_k^2(P,Q) \, ξ_Q}{n_1} + \frac{4 \, MMD_k^2(P,Q) \, ξ_P}{n_2}\Big)}

by Lem. C.1 and E[U_{n_1,n_2}] = MMD_k^2(P, Q) by [20, Lemma 6], the following inequality is a sufficient condition for (9) to hold:

a x^2 − b x − c ≥ 0  for  x = MMD_k(P, Q),
a = 1,  b = \sqrt{\frac{3−β}{β}\Big(\frac{4ξ_Q}{n_1} + \frac{4ξ_P}{n_2}\Big)},  and  c = \sqrt{\frac{1−α^⋆}{α^⋆} \cdot \frac{(36 n_1^2 + 36 n_2^2 + 198 n_1 n_2) ψ_{YZ,2}}{β n_1(n_1−1) n_2(n_2−1)}}.

By the quadratic formula, this holds whenever x ≥ b + c, yielding the first claim.

We next show that (12) implies the sufficient power condition (10) of Thm. 1. Let f = \sqrt{\frac{β α^⋆ s^3}{24(1−α^⋆)} ρ_{n_1,n_2}^2}. Using the equality E[U_{n_1,n_2}] = MMD_k^2(P, Q) and the inequalities

\sqrt{\frac{201 s^2}{4 n^2} ψ_{YZ,2}} + \sqrt{\frac{12 s}{n} \max(ψ_{Y,1}, ψ_{Z,1})} ≤ \sqrt{\frac{201 s^2}{4 n^2} ψ_{YZ,2}} + \sqrt{\frac{12 s}{n} \max(ξ_Q, ξ_P)} \, MMD_k(P, Q)

and (43), justified by Lem. C.1 and the triangle inequality, we find that (10) holds whenever

a x^2 − b x − c ≥ 0  for  x = MMD_k(P, Q),  a = 1 − 1/f,
b = \sqrt{\frac{3−β}{β}\Big(\frac{4ξ_Q}{n_1} + \frac{4ξ_P}{n_2}\Big)} + \frac{1}{f}\sqrt{\frac{12 s}{n} \max(ξ_Q, ξ_P)},  and
c = \sqrt{\frac{1−α^⋆}{α^⋆} \cdot \frac{(36 n_1^2 + 36 n_2^2 + 198 n_1 n_2) ψ_{YZ,2}}{β n_1(n_1−1) n_2(n_2−1)}} + \frac{1}{f}\sqrt{\frac{201 s^2}{4 n^2} ψ_{YZ,2}}.

Since a ≤ 1, this holds when x ≥ \frac{b}{a} + \frac{c}{a}, yielding the second claim.

APPENDIX D: Proof of Thm. 2: Power of cheap homogeneity: sub-Gaussian, PD


We will establish the claim using the quantile comparison method (Prop. 2) by identifying complementary quantile bounds (CQBs) Φ_{n_1,n_2}, Ψ_{X,s}, and Ψ_{n_1,n_2,s} satisfying the tail bounds (21), (22), and (23) for the test statistic U_{n_1,n_2}, auxiliary sequence (U^{π_b,s}_{n_1,n_2})_{b=1}^{B}, and population parameter T = E[U_{n_1,n_2}] = MMD_k^2(P, Q). Throughout, we will call a random variable sub-Gaussian if it has finite sub-Gaussian norm

∥X∥_{ψ_2} ≜ \sup_{p ≥ 1} \frac{(E[|X|^p])^{1/p}}{\sqrt{p}}.

D.1. Bounding the fluctuations of the test statistic. We begin by identifying a CQB Φ_{n_1,n_2} satisfying the test statistic tail bound (21). We first introduce another way to express the U-statistic U_{n_1,n_2} and its corresponding V-statistic V_{n_1,n_2}. If we let δ_i = 1 for i ∈ [n_1], δ_i = −1 for i ∈ n_1 + [n_2], and K_{ij} = g(X_i, X_j), we can write

V_{n_1,n_2} ≜ MMD_k^2(\hat{P}, \hat{Q}) = \sum_{i,j=1}^{n} \Big( \frac{I[δ_i=1, δ_j=1]}{n_1^2} + \frac{I[δ_i=−1, δ_j=−1]}{n_2^2} − \frac{I[δ_i δ_j=−1]}{n_1 n_2} \Big) K_{ij},

U_{n_1,n_2} = \sum_{i,j=1}^{n} \Big( \frac{I[δ_i=1, δ_j=1, i≠j]}{n_1(n_1−1)} + \frac{I[δ_i=−1, δ_j=−1, i≠j]}{n_2(n_2−1)} − \frac{I[δ_i δ_j=−1]}{n_1 n_2} \Big) K_{ij}.

Hence,

V_{n_1,n_2} − U_{n_1,n_2} = \sum_{i,j=1}^{n} \Big( \frac{I[δ_i=1, δ_j=1]}{n_1^2} + \frac{I[δ_i=−1, δ_j=−1]}{n_2^2} − \frac{I[δ_i=1, δ_j=1, i≠j]}{n_1(n_1−1)} − \frac{I[δ_i=−1, δ_j=−1, i≠j]}{n_2(n_2−1)} \Big) K_{ij}

(44)  = \sum_{i=1}^{n} \Big( \frac{I[δ_i=1]}{n_1^2} + \frac{I[δ_i=−1]}{n_2^2} \Big) K_{ii} − \sum_{i,j=1, i≠j}^{n} \Big( \frac{I[δ_i=1, δ_j=1]}{n_1^2(n_1−1)} + \frac{I[δ_i=−1, δ_j=−1]}{n_2^2(n_2−1)} \Big) K_{ij},

where we used that \frac{1}{n_1^2} − \frac{1}{n_1(n_1−1)} = −\frac{1}{n_1^2(n_1−1)}.
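The decomposition (44) can be checked directly; the sketch below (a verification aid with an arbitrary random symmetric matrix standing in for the Gram matrix K) computes U and V from their indicator-weighted definitions and confirms that their difference matches the diagonal and off-diagonal correction terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 5, 7
n = n1 + n2
delta = np.array([1] * n1 + [-1] * n2)
K = rng.normal(size=(n, n))
K = (K + K.T) / 2  # symmetric stand-in for K_ij = g(X_i, X_j)

V = U = diff = 0.0
for i in range(n):
    for j in range(n):
        same1 = delta[i] == 1 and delta[j] == 1
        same2 = delta[i] == -1 and delta[j] == -1
        cross = delta[i] * delta[j] == -1
        V += (same1 / n1**2 + same2 / n2**2 - cross / (n1 * n2)) * K[i, j]
        if i != j:
            U += (same1 / (n1 * (n1 - 1)) + same2 / (n2 * (n2 - 1))
                  - cross / (n1 * n2)) * K[i, j]
            # off-diagonal part of the right-hand side of (44)
            diff -= (same1 / (n1**2 * (n1 - 1))
                     + same2 / (n2**2 * (n2 - 1))) * K[i, j]
        else:
            # diagonal part of the right-hand side of (44)
            diff += (same1 / n1**2 + same2 / n2**2) * K[i, i]
```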
Our proof makes use of two lemmas. The first, essentially due to Maurer and Pontil [33], shows that functions of independent variables concentrate around their mean when coordinatewise function differences are sub-Gaussian.

LEMMA D.1 (Sub-Gaussian differences inequality [33, Sec. 5]). Suppose an integrable function f of a sequence of independent datapoints X = (X_i)_{i=1}^n satisfies

|f(X) − E[f(X) | X_{−k}]| ≤ f_k(X_k)  almost surely

for each k ∈ [n] and measurable functions (f_k)_{k=1}^n. If each f_k(X_k) is sub-Gaussian, then

f(X) − E[f(X)] < \sqrt{64 e \log(1/δ) \sum_{k ∈ [n]} ∥f_k(X_k)∥_{ψ_2}^2}  with probability at least 1 − δ.

PROOF. The proof of Thm. 3 in Sec. 5 of [33] shows that

(45)  Pr(f(X) − E[f(X)] ≥ ε) ≤ \exp\Big(16 e β^2 \sum_{k ∈ [n]} ∥f_k(X_k)∥_{ψ_2}^2 − βε\Big)  for all ε, β > 0.

Minimizing over β > 0 yields the result.¹

¹The statement of [33, Thm. 3] contains the apparent typo 32 in place of 64, presumably due to incorrect minimization of (45).
The second lemma, proved in App. D.4, uses Lem. D.1 to derive suitable CQBs for U_{n_1,n_2} and V_{n_1,n_2}.

LEMMA D.2 (Sub-Gaussian homogeneity quantile bounds). Under the assumptions of Thm. 2, for V_{n_1,n_2} ≜ MMD_k^2(\hat{P}, \hat{Q}) and each β ∈ (0, 1),

Pr\big( |MMD_k^2(P, Q) − V_{n_1,n_2}| ≥ Λ_{V_{n_1,n_2}}^2(β) + 2 \, MMD_k(P, Q) \, Λ_{V_{n_1,n_2}}(β) \big) ≤ β  and
Pr\big( |MMD_k^2(P, Q) − U_{n_1,n_2}| ≥ Φ_{n_1,n_2}(β) \big) ≤ β,  where

Λ_{V_{n_1,n_2}}(β) ≜ \sqrt{\frac{M_{2,P}}{n_1}} + \sqrt{\frac{M_{2,Q}}{n_2}} + \sqrt{64 e \log\big(\tfrac{1}{β}\big) \Big( \frac{σ_P^2}{n_1} + \frac{σ_Q^2}{n_2} \Big)},

Λ_{U_{n_1,n_2}}(β) ≜ \sqrt{\frac{2 M_{2,P}}{n_1−1}} + \sqrt{\frac{2 M_{2,Q}}{n_2−1}} + \sqrt{128 e \log\big(\tfrac{2}{β}\big) \Big( \frac{(n_1+1) σ_P^2}{n_1^2} + \frac{(n_2+1) σ_Q^2}{n_2^2} \Big)},  and

Φ_{n_1,n_2}(β) ≜ Λ_{U_{n_1,n_2}}^2(β) + 2 \, MMD_k(P, Q) \, Λ_{U_{n_1,n_2}}(β).

D.2. Bounding the fluctuations of the threshold. We next identify CQBs Ψ_{X,s} and Ψ_{n_1,n_2,s} that satisfy the properties

(46)  Pr(U^{π,s}_{n_1,n_2} ≥ Ψ_{X,s}(α) | X) ≤ α,  ∀α ∈ (0, 1),  and
(47)  Pr(Ψ_{X,s}(α) ≥ Ψ_{n_1,n_2,s}(α, β)) ≤ β,  ∀α, β ∈ (0, 1)

for the cheaply permuted U-statistic U^{π,s}_{n_1,n_2} (30). Our derivation makes use of three lemmas. The first is a tail bound for random quadratic forms known as the Hanson-Wright inequality.

LEMMA D.3 (Hanson-Wright inequality [11, Thm. 3.1]). Let $X \in \mathbb{R}^n$ be a random vector with independent components satisfying $\mathbb{E}[e^{\lambda X_i}] \le e^{\lambda^2\nu^2/2}$ for some $\nu \ge 0$, all $\lambda \in \mathbb{R}$, and all $i \in [n]$. Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix such that $a_{ii} = 0$ for $i \in [n]$. Then, there exists a universal constant $c > 3/20$ such that for any $t > 0$,
$$(48)\quad \Pr(X^\top A X > t) \le \exp\Big(-c\min\Big\{\tfrac{t^2}{\nu^4\|A\|_F^2},\ \tfrac{t}{\nu^2\|A\|_{\mathrm{op}}}\Big\}\Big).$$

Equivalently, for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have
$$(49)\quad X^\top A X \le \max\Big\{\nu^2\|A\|_F\sqrt{\tfrac{\log(1/\delta)}{c}},\ \tfrac{\log(1/\delta)\,\nu^2\|A\|_{\mathrm{op}}}{c}\Big\}.$$
Moreover, when $\delta \in (0, 0.86)$, we have
$$(50)\quad X^\top A X \le \tfrac{\log(1/\delta)\,\nu^2\|A\|_F}{c} \le \tfrac{20\log(1/\delta)\,\nu^2\|A\|_F}{3}.$$

PROOF. Equation (48) is the content of [11, Thm. 3.1], and equivalence with (49) follows from selecting $t = \max\{a\sqrt{\log(1/\delta)},\, b\log(1/\delta)\}$ in (48) (with $a = \nu^2\|A\|_F/\sqrt{c}$ and $b = \nu^2\|A\|_{\mathrm{op}}/c$) or $\delta = \exp\big(-c\min\big\{\tfrac{t^2}{\nu^4\|A\|_F^2}, \tfrac{t}{\nu^2\|A\|_{\mathrm{op}}}\big\}\big)$ in (49). Now, suppose that $\delta \in (0, 0.86)$. Since $0.86 \le \exp(-3/20) \le \exp(-c)$, we have $\sqrt{\tfrac{\log(1/\delta)}{c}} \le \tfrac{\log(1/\delta)}{c}$. Since $\|A\|_{\mathrm{op}} \le \|A\|_F$, the claim (50) follows.
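The bound (50) is easy to probe empirically. The following sketch (my own illustration, not from the paper; the random matrix and trial counts are arbitrary) draws Rademacher coordinates, for which $\nu = 1$, and checks that the quadratic form exceeds the (50) threshold with frequency at most $\delta$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, delta = 30, 2000, 0.1
A = rng.normal(size=(n, n))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)  # symmetric with zero diagonal, as Lem. D.3 requires
fro = np.linalg.norm(A, "fro")
bound = (20.0 / 3.0) * np.log(1.0 / delta) * fro  # bound (50) with nu = 1
X = rng.choice([-1.0, 1.0], size=(trials, n))     # Rademacher: 1-sub-Gaussian
quad = np.einsum("ti,ij,tj->t", X, A, X)          # X^T A X for each trial
violation_rate = np.mean(quad > bound)
print(violation_rate <= delta)
```

Since the quadratic form has standard deviation $\sqrt{2}\,\|A\|_F$ while the threshold is roughly $15\|A\|_F$, violations are extremely rare, comfortably within the $\delta$ budget.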

Our second lemma, proved in App. D.5, provides high-probability bounds on the sums of
MMD powers.

LEMMA D.4 (High-probability bounds on sums of MMD powers). Under the assumptions of Thm. 2, if $p \in \{1,2,4\}$, then
$$Z_p(X) \triangleq \Big(\textstyle\sum_{i=1}^{s_1}\mathrm{MMD}_k^p(\delta_{Y_i}, P) + \sum_{i=1}^{s_2}\mathrm{MMD}_k^p(\delta_{Z_i}, Q)\Big)^{1/p} < A_p(s_1, s_2, m) + B(s_1, s_2, m, \delta)$$
with probability at least $1-\delta$, for
$$B(s_1, s_2, m, \delta) \triangleq \sqrt{64e\log(1/\delta)\big(\tfrac{n_1\sigma_P^2}{m^2} + \tfrac{n_2\sigma_Q^2}{m^2}\big)},$$
$$A_1(s_1, s_2, m) \triangleq s_1\sqrt{\tfrac{\mathbb{E}_{Y\sim P}[\mathrm{MMD}_k^2(\delta_Y, P)]}{m}} + s_2\sqrt{\tfrac{\mathbb{E}_{Z\sim Q}[\mathrm{MMD}_k^2(\delta_Z, Q)]}{m}},$$
$$A_2(s_1, s_2, m) \triangleq \sqrt{\tfrac{s_1\,\mathbb{E}_{Y\sim P}[\mathrm{MMD}_k^2(\delta_Y, P)]}{m} + \tfrac{s_2\,\mathbb{E}_{Z\sim Q}[\mathrm{MMD}_k^2(\delta_Z, Q)]}{m}}, \quad\text{and}$$
$$A_4(s_1, s_2, m) \triangleq \Big(\big(\tfrac{3}{m^2} + \tfrac{1}{m^3}\big)\big(s_1\,\mathbb{E}_{Y\sim P}[\mathrm{MMD}_k^4(\delta_Y, P)] + s_2\,\mathbb{E}_{Z\sim Q}[\mathrm{MMD}_k^4(\delta_Z, Q)]\big)\Big)^{1/4}.$$

Our third lemma, Lem. D.5, proved in App. D.6, uses Lems. D.3 and D.4 to verify the
CQB properties (46) and (47) for a particular pairing of (ΨX,s , Ψn1 ,n2 ,s ).

LEMMA D.5 (Sub-Gaussian quantile bound for permutation threshold). Instantiate the assumptions and notation of Lem. D.4. For all $\alpha, \beta \in (0,1)$,
$$\Psi_{n_1,n_2,s}(\alpha,\beta) \triangleq \tfrac{20\log(2/\alpha)}{3 s_1^{3/2}}\Big(\sqrt{12}\,\mathrm{MMD}_k(P,Q)\big(A_2(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big) + \sqrt{24}\big(A_4(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)^2\Big)$$
$$\qquad + \tfrac{2}{s_1^2}\Big(\big(A_2(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)^2 + \mathrm{MMD}_k(P,Q)\big(A_1(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)\Big) + \tfrac{\mathrm{MMD}_k^2(P,Q)(2\log(4/\alpha)+1)}{s_1}$$
satisfies $\Pr(\Psi_{X,s}(\alpha) \ge \Psi_{n_1,n_2,s}(\alpha,\beta)) \le \beta$ for $\Psi_{X,s}(\alpha)$ defined in equation (55) and satisfying $\Pr(U^{\pi,s}_{n_1,n_2} \ge \Psi_{X,s}(\alpha) \mid X) \le \alpha$. Here, $m \triangleq \tfrac{n_1+n_2}{s}$, $s_1 \triangleq \tfrac{s n_1}{n_1+n_2}$, and $s_2 \triangleq s - s_1$.

In the sequel, we will make use of a simplified version of Ψn1 ,n2 ,s that, as an immediate
corollary to Lem. D.5, also satisfies the CQB property (47).

COROLLARY D.1 (Simplified sub-Gaussian quantile bound for permutation threshold). Instantiate the assumptions and notation of Thm. 2. For all $\alpha, \beta \in (0,1)$,
$$\Psi_{n_1,n_2,s}(\alpha,\beta) \triangleq \tfrac{20\log(2/\alpha)}{3 s_1^{3/2}}\Big(\tfrac{\sqrt{12}\,\mathrm{MMD}_k(P,Q)(\tilde A_2 + \tilde B(\frac{\beta}{3}))}{\sqrt{m}} + \tfrac{\sqrt{24}\,(\tilde A_4 + \tilde B(\frac{\beta}{3})^2)}{m}\Big) + \tfrac{2}{s_1}\Big(\tfrac{2(\tilde A_2^2 + \tilde B(\frac{\beta}{3})^2)}{m} + \tfrac{\mathrm{MMD}_k(P,Q)(\sqrt{2}\tilde A_2 + \tilde B(\frac{\beta}{3}))}{\sqrt{m s_1}}\Big) + \tfrac{\mathrm{MMD}_k^2(P,Q)(2\log(4/\alpha)+1)}{s_1}$$
satisfies $\Pr(\Psi_{X,s}(\alpha) \ge \Psi_{n_1,n_2,s}(\alpha,\beta)) \le \beta$ for $m$, $s_1$, and $\Psi_{X,s}$ as in Lem. D.5.

D.3. Putting everything together. Fix any $\beta \in (0,1)$, and suppose that
$$(51)\quad \mathbb{E}[U_{n_1,n_2}] \ge \Psi_{n_1,n_2,s}(\alpha_\star, \beta/3) + \Phi_{n_1,n_2}(\beta/3)$$
for $\Psi_{n_1,n_2,s}$ and $\Phi_{n_1,n_2}$ defined in Cor. D.1 and Lem. D.2. By Cor. D.1, Lem. D.2, and Prop. 2, a cheap permutation test based on $U_{n_1,n_2}$ has power at least $1-\beta$. We now show that the assumed condition (13) is sufficient to imply (51). Since $\mathbb{E}[U_{n_1,n_2}] = \mathrm{MMD}_k^2(P,Q)$, the condition (51) is equivalent to the inequality $ax^2 - bx - c \ge 0$ with $x = \mathrm{MMD}_k(P,Q)$,
$$a = 1 - \tfrac{2\log(4/\alpha_\star)+1}{s_1},$$
$$b = \tfrac{40\log(2/\alpha_\star)(\tilde A_2 + \tilde B(\frac{\beta}{3}))}{\sqrt{3}\,s_1\sqrt{n_1}} + \tfrac{2(\sqrt{2}\tilde A_2 + \tilde B(\frac{\beta}{3}))}{s_1\sqrt{n_1}} + \tfrac{2\tilde\Lambda_P(\beta)}{\sqrt{n_1-1}} + \tfrac{2\tilde\Lambda_Q(\beta)}{\sqrt{n_2-1}}, \quad\text{and}$$
$$c = \tfrac{40\sqrt{2}\log(2/\alpha_\star)/3\,(\tilde A_4 + \tilde B(\frac{\beta}{3})^2) + 4(\tilde A_2^2 + \tilde B(\frac{\beta}{3})^2)}{n_1} + \Big(\tfrac{\tilde\Lambda_P(\beta)}{\sqrt{n_1-1}} + \tfrac{\tilde\Lambda_Q(\beta)}{\sqrt{n_2-1}}\Big)^2.$$

By the triangle inequality and the fact that $\sqrt{a} \ge a$ for $a \in (0,1]$, it therefore suffices to have
$$x \ge \tfrac{1}{a}\big(b + \sqrt{c}\big) \ge \tfrac{b}{a} + \sqrt{\tfrac{c}{a}} \ge \tfrac{b + \sqrt{b^2 + 4ac}}{2a}.$$

The advertised result now follows from the observation that
$$\sqrt{c} \le \sqrt{\tfrac{40\sqrt{2}\log(2/\alpha_\star)/3\,(\tilde A_4 + \tilde B(\frac{\beta}{3})^2) + 4(\tilde A_2^2 + \tilde B(\frac{\beta}{3})^2)}{n_1}} + \tfrac{\tilde\Lambda_P(\beta)}{\sqrt{n_1-1}} + \tfrac{\tilde\Lambda_Q(\beta)}{\sqrt{n_2-1}}.$$

D.4. Proof of Lem. D.2: Sub-Gaussian homogeneity quantile bounds. Fix any $\beta \in (0,1)$. Let $X$ be the concatenation of $Y$ and $Z$, and let $\tilde Y$ and $\tilde Z$ be independent draws from $P$ and $Q$ respectively. By the triangle inequality, we have the following almost sure inequalities for each $k \in [n]$:
$$|\mathrm{MMD}_k(P,Q) - \mathrm{MMD}_k(\hat P, \hat Q)| \le \mathrm{MMD}_k(P, \hat P) + \mathrm{MMD}_k(Q, \hat Q) \triangleq \Delta(X),$$
$$|\Delta(X) - \mathbb{E}[\Delta(X) \mid X_{-k}]| \le \begin{cases} \tfrac{1}{n_1}\mathbb{E}[\mathrm{MMD}_k(\delta_{Y_k}, \delta_{\tilde Y}) \mid Y_k] & \text{if } k \le n_1 \\ \tfrac{1}{n_2}\mathbb{E}[\mathrm{MMD}_k(\delta_{Z_{k-n_1}}, \delta_{\tilde Z}) \mid Z_{k-n_1}] & \text{otherwise.} \end{cases}$$
Furthermore, by Jensen's inequality, we have
$$\mathbb{E}[\Delta(X)] \le \sqrt{\mathbb{E}[\mathrm{MMD}_k(P,\hat P)^2]} + \sqrt{\mathbb{E}[\mathrm{MMD}_k(Q,\hat Q)^2]} = \sqrt{\mathbb{E}[((\hat P - P)\times(\hat P - P))k]} + \sqrt{\mathbb{E}[((\hat Q - Q)\times(\hat Q - Q))k]}$$
$$= \sqrt{\mathbb{E}\big[\tfrac{1}{n_1^2}\textstyle\sum_{i=1}^{n_1}((\delta_{Y_i} - P)\times(\delta_{Y_i} - P))k\big]} + \sqrt{\mathbb{E}\big[\tfrac{1}{n_2^2}\textstyle\sum_{i=1}^{n_2}((\delta_{Z_i} - Q)\times(\delta_{Z_i} - Q))k\big]} = \sqrt{\tfrac{M_{2,P}}{n_1}} + \sqrt{\tfrac{M_{2,Q}}{n_2}}.$$
Hence, by Lem. D.1, with probability at least $1-\beta$,
$$|\mathrm{MMD}_k(P,Q) - \mathrm{MMD}_k(\hat P, \hat Q)| < \sqrt{\tfrac{M_{2,P}}{n_1}} + \sqrt{\tfrac{M_{2,Q}}{n_2}} + \sqrt{64e\log(\tfrac{1}{\beta})\big(\tfrac{\sigma_P^2}{n_1} + \tfrac{\sigma_Q^2}{n_2}\big)} = \Lambda_{V_{n_1,n_2}}(\beta).$$
Since $|a^2 - b^2| = |a-b||b+a| \le |a-b|(|b-a| + 2|a|)$ for all real $a, b$, we further have
$$(52)\quad |\mathrm{MMD}_k^2(P,Q) - \mathrm{MMD}_k^2(\hat P, \hat Q)| < \Lambda_{V_{n_1,n_2}}^2(\beta) + 2\,\mathrm{MMD}_k(P,Q)\,\Lambda_{V_{n_1,n_2}}(\beta)$$
with probability at least $1-\beta$, as advertised.


To establish the $U_{n_1,n_2}$ bound, we begin by writing
$$(53)\quad \mathrm{MMD}_k^2(P,Q) - U_{n_1,n_2} = \mathrm{MMD}_k^2(P,Q) - V_{n_1,n_2} + V_{n_1,n_2} - U_{n_1,n_2}.$$
To bound the term $V_{n_1,n_2} - U_{n_1,n_2}$, observe that, by (44),
$$V_{n_1,n_2} - U_{n_1,n_2} = \tfrac{1}{n_1^2}\textstyle\sum_{i=1}^{n_1}\mathrm{MMD}_k^2(\delta_{Y_i}, \delta_{Y_{-i}}) + \tfrac{1}{n_2^2}\textstyle\sum_{i=1}^{n_2}\mathrm{MMD}_k^2(\delta_{Z_i}, \delta_{Z_{-i}}) \ge 0.$$
Hence, we can write
$$\tilde\Delta(X) = \sqrt{V_{n_1,n_2} - U_{n_1,n_2}} = \big\|\big(\tfrac{1}{n_1}(\mathrm{MMD}_k(\delta_{Y_i}, \delta_{Y_{-i}}))_{i=1}^{n_1},\ \tfrac{1}{n_2}(\mathrm{MMD}_k(\delta_{Z_i}, \delta_{Z_{-i}}))_{i=1}^{n_2}\big)\big\|_2.$$
By repeated use of the triangle inequality, we have, for each $k \in [n_1]$,
$$\tilde\Delta(X) - \mathbb{E}[\tilde\Delta(X) \mid Y_{-k}, Z] \le \tfrac{1}{n_1}\mathbb{E}\big[\big\|\big(\tfrac{1}{n_1}(\mathrm{MMD}_k(\delta_{Y_k}, \delta_{\tilde Y}))_{i=1,i\ne k}^{n_1},\ \mathrm{MMD}_k(\delta_{Y_k}, \delta_{\tilde Y})\big)\big\|_2 \,\big|\, Y_k\big] = \tfrac{1}{n_1}\sqrt{1 + \tfrac{n_1-1}{n_1^2}}\,\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Y_k}, \delta_{\tilde Y}) \mid Y_k\big] \le \tfrac{1}{n_1}\sqrt{1 + \tfrac{1}{n_1}}\,\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Y_k}, \delta_{\tilde Y}) \mid Y_k\big]$$
almost surely. Identical reasoning yields the following almost sure bound for each $k \in [n_2]$:
$$\tilde\Delta(X) - \mathbb{E}[\tilde\Delta(X) \mid Y, Z_{-k}] \le \tfrac{1}{n_2}\sqrt{1 + \tfrac{1}{n_2}}\,\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Z_k}, \delta_{\tilde Z}) \mid Z_k\big].$$
Next, by the Cauchy-Schwarz inequality, we find that
$$\mathbb{E}[\tilde\Delta(X)]^2 \le \mathbb{E}[\tilde\Delta(X)^2] = \mathbb{E}\Big[\tfrac{1}{n_1^2}\textstyle\sum_{i=1}^{n_1}\mathrm{MMD}_k^2(\delta_{Y_i}, \delta_{Y_{-i}}) + \tfrac{1}{n_2^2}\textstyle\sum_{i=1}^{n_2}\mathrm{MMD}_k^2(\delta_{Z_i}, \delta_{Z_{-i}})\Big],$$
where
$$\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Y_i}, \delta_{Y_{-i}})] = \mathbb{E}\Big[\iint k(x,y)\,\mathrm{d}(\delta_{Y_i} - P + P - \delta_{Y_{-i}})(x)\,\mathrm{d}(\delta_{Y_i} - P + P - \delta_{Y_{-i}})(y)\Big] = \mathbb{E}\Big[\big(1 + \tfrac{1}{n_1-1}\big)\iint k(x,y)\,\mathrm{d}(\delta_{Y_i} - P)(x)\,\mathrm{d}(\delta_{Y_i} - P)(y)\Big] = \big(1 + \tfrac{1}{n_1-1}\big)M_{2,P}.$$
Analogously, $\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Z_i}, \delta_{Z_{-i}})] = (1 + \tfrac{1}{n_2-1})M_{2,Q}$. Thus, we conclude that
$$\mathbb{E}[\tilde\Delta(X)] \le \sqrt{\tfrac{M_{2,P}}{n_1-1} + \tfrac{M_{2,Q}}{n_2-1}}.$$
Hence, applying Lem. D.1, we obtain that, with probability at least $1-\beta/2$,
$$\sqrt{V_{n_1,n_2} - U_{n_1,n_2}} < \sqrt{\tfrac{M_{2,P}}{n_1-1}} + \sqrt{\tfrac{M_{2,Q}}{n_2-1}} + \sqrt{64e\log(\tfrac{2}{\beta})\big(\tfrac{(n_1+1)\sigma_P^2}{n_1^2} + \tfrac{(n_2+1)\sigma_Q^2}{n_2^2}\big)}.$$
This estimate together with the union bound and the relations (52) and (53) yields the claim.

D.5. Proof of Lem. D.4: High-probability bounds on sums of MMD powers. Define
$$\phi(y) \triangleq \mathbb{E}_{Y'\sim P}[\mathrm{MMD}_k(\delta_y, \delta_{Y'})] \quad\text{and}\quad \phi'(z) \triangleq \mathbb{E}_{Z'\sim Q}[\mathrm{MMD}_k(\delta_z, \delta_{Z'})],$$
fix any $p \in \{1,2,4\}$, and note that we may write
$$Z_p(X) = \big\|\big((\mathrm{MMD}_k(\delta_{Y_i}, P))_{i=1}^{s_1},\ (\mathrm{MMD}_k(\delta_{Z_i}, Q))_{i=1}^{s_2}\big)\big\|_p, \quad\text{and}$$
$$|Z_p(X) - \mathbb{E}[Z_p(X) \mid X_{-k}]| \le \begin{cases} \tfrac{1}{m}\phi(Y_k) & \text{if } k \le n_1 \\ \tfrac{1}{m}\phi'(Z_{k-n_1}) & \text{otherwise} \end{cases}$$
almost surely by the triangle inequality. Applying Lem. D.1, we therefore have
$$Z_p(X) - \mathbb{E}[Z_p(X)] < B(s_1, s_2, m, \delta) \quad\text{with probability at least } 1-\delta.$$
Hence, it remains to upper-bound $\mathbb{E}[Z_p(X)]$. Applying Hölder's inequality, we find that
$$\mathbb{E}[Z_1(X)] = \textstyle\sum_{i=1}^{s_1}\mathbb{E}[\mathrm{MMD}_k(\delta_{Y_i}, P)] + \sum_{i=1}^{s_2}\mathbb{E}[\mathrm{MMD}_k(\delta_{Z_i}, Q)] \le \sum_{i=1}^{s_1}\sqrt{\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Y_i}, P)]} + \sum_{i=1}^{s_2}\sqrt{\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Z_i}, Q)]} = A_1(s_1, s_2, m),$$
$$\mathbb{E}[Z_2(X)]^2 \le \mathbb{E}[Z_2(X)^2] = \textstyle\sum_{i=1}^{s_1}\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Y_i}, P)] + \sum_{i=1}^{s_2}\mathbb{E}[\mathrm{MMD}_k^2(\delta_{Z_i}, Q)] = A_2(s_1, s_2, m)^2, \quad\text{and}$$
$$\mathbb{E}[Z_4(X)]^4 \le \mathbb{E}[Z_4(X)^4] = \textstyle\sum_{i=1}^{s_1}\mathbb{E}[\mathrm{MMD}_k^4(\delta_{Y_i}, P)] + \sum_{i=1}^{s_2}\mathbb{E}[\mathrm{MMD}_k^4(\delta_{Z_i}, Q)].$$
Now consider the shorthand $(\delta_{Y_i} - P)(\delta_{Y_i} - P)k \triangleq \iint k(x,y)\,\mathrm{d}(\delta_{Y_i} - P)(x)\,\mathrm{d}(\delta_{Y_i} - P)(y)$. To conclude our fourth moment bound, we notice that
$$\mathbb{E}[\mathrm{MMD}_k^4(\delta_{Y_i}, P)] = \mathbb{E}\big[\big((\delta_{Y_i} - P)(\delta_{Y_i} - P)k\big)^2\big] = \mathbb{E}\Big[\tfrac{1}{m^4}\textstyle\sum_{k,l,k',l'=1}^{m}(\delta_{Y_{mi+k}} - P)(\delta_{Y_{mi+l}} - P)k\,(\delta_{Y_{mi+k'}} - P)(\delta_{Y_{mi+l'}} - P)k\Big]$$
$$= \mathbb{E}\Big[\tfrac{1}{m^4}\Big(\textstyle\sum_{k=1}^{m}(\delta_{Y_{mi+k}} - P)(\delta_{Y_{mi+k}} - P)k\Big)\Big(\textstyle\sum_{l=1}^{m}(\delta_{Y_{mi+l}} - P)(\delta_{Y_{mi+l}} - P)k\Big) + \tfrac{2}{m^4}\textstyle\sum_{k,l=1}^{m}\big((\delta_{Y_{mi+k}} - P)(\delta_{Y_{mi+l}} - P)k\big)^2 + \tfrac{1}{m^4}\textstyle\sum_{k=1}^{m}\big((\delta_{Y_{mi+k}} - P)(\delta_{Y_{mi+k}} - P)k\big)^2\Big]$$
$$\le \mathbb{E}_{Y\sim P}\big[\big(\tfrac{1}{m^2} + \tfrac{2}{m^2} + \tfrac{1}{m^3}\big)\mathrm{MMD}_k^4(\delta_Y, P)\big] = \mathbb{E}_{Y\sim P}\big[\big(\tfrac{3}{m^2} + \tfrac{1}{m^3}\big)\mathrm{MMD}_k^4(\delta_Y, P)\big],$$
and the analogous result holds for $\mathbb{E}[\mathrm{MMD}_k^4(\delta_{Z_i}, Q)]$. This completes the proof.

D.6. Proof of Lem. D.5: Sub-Gaussian quantile bound for permutation threshold. Fix any $\alpha, \beta \in (0,1)$. We begin by identifying $\Psi_{X,s}(\alpha)$ satisfying
$$\Pr\big(V^{\pi,s}_{n_1,n_2} \ge \Psi_{X,s}(\alpha) \mid X\big) \le \alpha,$$
where $V^{\pi,s}_{n_1,n_2}$ is the cheaply permuted V-statistic (29). This in turn will imply (46), as $V^{\pi,s}_{n_1,n_2} \ge U^{\pi,s}_{n_1,n_2}$ by the same argument (44) used to show that $V_{n_1,n_2} \ge U_{n_1,n_2}$.
We follow a structure similar to that in the proof of [25, Thm. 6.1]. Our first step is to identify a second random variable that, conditioned on $X$, has the same distribution as $V^{\pi,s}_{n_1,n_2}$. Let $\ell = (\ell_1, \dots, \ell_{s_1})$ be an $s_1$-tuple drawn without replacement from $[s_2]$, generated independently of $(X, \pi)$, and let $\zeta = (\zeta_k)_{k=1}^{s_1}$ be independent random variables that take values $-1$ and $1$ with equal probability, independently of $(X, \pi, \ell)$. For $k, k' \in [s_1]$, define the random measures
$$\mu_k = \begin{cases} P & \text{if } \pi_k \le s_1 \\ Q & \text{otherwise} \end{cases} \quad\text{and}\quad \mu'_{k'} = \begin{cases} P & \text{if } \pi_{s_1+\ell_{k'}} \le s_1 \\ Q & \text{otherwise.} \end{cases}$$
By construction, $(\mu_{k_1}, \mu_{k_2}, \mu'_{k_1}, \mu'_{k_2})$ are the population distributions of the respective sample points $(X_{\pi_{k_1}}, X_{\pi_{k_2}}, X_{\pi_{s_1+\ell_{k_1}}}, X_{\pi_{s_1+\ell_{k_2}}})$. Next, let us define
$$\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} \triangleq \tfrac{1}{s_1^2}\sum_{k_1,k_2=1}^{s_1}\zeta_{k_1}\zeta_{k_2}\,H_{\mathrm{ho},m}(X_{\pi_{k_1}}, X_{\pi_{k_2}}; X_{\pi_{s_1+\ell_{k_1}}}, X_{\pi_{s_1+\ell_{k_2}}}),$$
where $H_{\mathrm{ho},m}$ is defined in (27). By [25, Sec. 6.1], conditional on $X$, the distribution of $\mathbb{E}[\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} \mid X, \pi, \zeta]$ matches that of $V^{\pi,s}_{n_1,n_2}$.
Our goal is now to bound the tail of $\mathbb{E}[\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} \mid X, \pi, \zeta]$. To this end, we define
$$a_{k_1,k_2}(\pi,\ell) = \tfrac{1}{s_1^2}\big(H_{\mathrm{ho},m}(X_{\pi_{k_1}}, X_{\pi_{k_2}}; X_{\pi_{s_1+\ell_{k_1}}}, X_{\pi_{s_1+\ell_{k_2}}}) - \langle \mu_{k_1}k - \mu'_{k_1}k,\ \mu_{k_2}k - \mu'_{k_2}k\rangle_k\big)$$
for $1 \le k_1, k_2 \le s_1$, and $A_{\pi,\ell} = (a_{k_1,k_2}(\pi,\ell))_{k_1,k_2=1}^{s_1}$. We remark that
$$\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} = \zeta^\top A_{\pi,\ell}\,\zeta + c_{\zeta,\pi,\ell}, \quad\text{where}\quad c_{\zeta,\pi,\ell} \triangleq \tfrac{1}{s_1^2}\sum_{k_1,k_2=1}^{s_1}\zeta_{k_1}\zeta_{k_2}\langle \mu_{k_1}k - \mu'_{k_1}k,\ \mu_{k_2}k - \mu'_{k_2}k\rangle_k,$$
which means that $\mathbb{E}[\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} \mid X, \pi, \zeta] = \zeta^\top\mathbb{E}[A_{\pi,\ell} \mid X, \pi]\,\zeta + \mathbb{E}[c_{\zeta,\pi,\ell} \mid \zeta, \pi]$.
Observe that $\langle \mu_{k_1}k - \mu'_{k_1}k,\ \mu_{k_2}k - \mu'_{k_2}k\rangle_k$ can only take on the values $0$ and $\pm\mathrm{MMD}_k^2(P,Q)$. To make this more precise, we can partition $[s_1]$ into the following subsets: $A = \{k \in [s_1] \mid \mu_k = P, \mu'_k = P\}$, $B = \{k \in [s_1] \mid \mu_k = P, \mu'_k = Q\}$, $C = \{k \in [s_1] \mid \mu_k = Q, \mu'_k = P\}$, and $D = \{k \in [s_1] \mid \mu_k = Q, \mu'_k = Q\}$. We will also introduce the partition $G \sqcup F$,² where $G = \{k \in [s_1] \mid \mu_k = P\}$ and $F = \{k \in [s_1] \mid \mu_k = Q\}$. Observe that $G = A \sqcup B$ and $F = C \sqcup D$, and that the membership of $k$ in $G$ or $F$ is determined by $\pi$, while its membership in $A$, $B$, $C$, or $D$ depends on $\ell$ as well.
Introducing the more compact indicator function notation $\mathbb{1}_A \triangleq \mathbb{I}(A)$, we have
$$\langle \mu_{k_1}k - \mu'_{k_1}k,\ \mu_{k_2}k - \mu'_{k_2}k\rangle_k = \mathrm{MMD}_k^2(P,Q)\big(\mathbb{1}_{(k_1,k_2)\in B\times B\,\cup\,C\times C} - \mathbb{1}_{(k_1,k_2)\in B\times C\,\cup\,C\times B}\big),$$
which means that
$$\mathbb{E}[c_{\zeta,\pi,\ell} \mid \zeta, \pi] = \tfrac{\mathrm{MMD}_k^2(P,Q)}{s_1^2}\sum_{k_1,k_2=1}^{s_1}\zeta_{k_1}\zeta_{k_2}\,\mathbb{E}\big[\mathbb{1}_{(k_1,k_2)\in B\times B\,\cup\,C\times C} - \mathbb{1}_{(k_1,k_2)\in B\times C\,\cup\,C\times B} \mid \pi\big].$$
If we let $b_{k_1,k_2}(\pi) \triangleq \mathbb{E}[\mathbb{1}_{(k_1,k_2)\in B\times B\,\cup\,C\times C} - \mathbb{1}_{(k_1,k_2)\in B\times C\,\cup\,C\times B} \mid \pi]$, we have that $b_{k_1,k_2}(\pi)$ takes the following values depending on the pair $(k_1,k_2)$:
$$b_{k_1,k_2}(\pi) = \begin{cases} \mathbb{E}[\mathbb{1}_{(k_1,k_2)\in B\times B} \mid \pi,\ k_1,k_2\in G,\ k_1\ne k_2] \triangleq b_1 & \text{if } k_1,k_2\in G,\ k_1\ne k_2 \\ \mathbb{E}[\mathbb{1}_{k_1\in B} \mid \pi,\ k_1\in G] \triangleq b_2 & \text{if } k_1,k_2\in G,\ k_1 = k_2 \\ \mathbb{E}[\mathbb{1}_{(k_1,k_2)\in C\times C} \mid \pi,\ k_1,k_2\in F,\ k_1\ne k_2] \triangleq b_3 & \text{if } k_1,k_2\in F,\ k_1\ne k_2 \\ \mathbb{E}[\mathbb{1}_{k_1\in C} \mid \pi,\ k_1\in F] \triangleq b_4 & \text{if } k_1,k_2\in F,\ k_1 = k_2 \\ -\mathbb{E}[\mathbb{1}_{(k_1,k_2)\in B\times C} \mid \pi,\ k_1\in G,\ k_2\in F] \triangleq -b_5 & \text{if } k_1\in G,\ k_2\in F \\ -\mathbb{E}[\mathbb{1}_{(k_1,k_2)\in C\times B} \mid \pi,\ k_1\in F,\ k_2\in G] \triangleq -b_6 & \text{if } k_1\in F,\ k_2\in G. \end{cases}$$
Given $\pi$, let $q_G$ be the number of indices $k$ with $1 \le k \le s_2$ and $\pi_{s_1+k} \le s_1$, and, analogously, let $q_F$ be the number of indices $k$ with $1 \le k \le s_2$ and $\pi_{s_1+k} > s_1$, so that $q_G + q_F = s_2$. Observe that
$$b_1 = \mathbb{E}[\mathbb{1}_{(k_1,k_2)\in B\times B} \mid \pi,\ k_1,k_2\in G,\ k_1\ne k_2] = \tfrac{q_G(q_G-1)}{s_2(s_2-1)} = \tfrac{q_G^2}{s_2^2} - \tfrac{q_G(s_2-q_G)}{s_2^2(s_2-1)} = \tfrac{q_G^2}{s_2^2} - \tfrac{q_G q_F}{s_2^2(s_2-1)}.$$
This is because the index $k_1$ belongs to $B$ with probability $\tfrac{q_G}{s_2}$, and then, conditioned on this fact, the index $k_2$ belongs to $B$ with probability $\tfrac{q_G-1}{s_2-1}$. Similarly,
$$b_5 = \mathbb{E}[\mathbb{1}_{(k_1,k_2)\in B\times C} \mid \pi,\ k_1\in G,\ k_2\in F] = \tfrac{q_G q_F}{s_2(s_2-1)} = \tfrac{q_G q_F}{s_2^2} + \tfrac{q_G q_F}{s_2^2(s_2-1)}.$$
Through analogous arguments we find that
$$b_3 = \tfrac{q_F^2}{s_2^2} - \tfrac{q_F q_G}{s_2^2(s_2-1)} \quad\text{and}\quad b_6 = \tfrac{q_G q_F}{s_2^2} + \tfrac{q_G q_F}{s_2^2(s_2-1)}.$$

Therefore,
$$\sum_{k_1,k_2=1}^{s_1}\zeta_{k_1}\zeta_{k_2}b_{k_1,k_2}(\pi) = \big(\textstyle\sum_{k\in G}\zeta_k\big)^2 b_1 + \big(\textstyle\sum_{k\in G}1\big)(b_2 - b_1) + \big(\textstyle\sum_{k\in F}\zeta_k\big)^2 b_3 + \big(\textstyle\sum_{k\in F}1\big)(b_4 - b_3) - \big(\textstyle\sum_{k\in G}\zeta_k\big)\big(\textstyle\sum_{k\in F}\zeta_k\big)b_5 - \big(\textstyle\sum_{k\in G}\zeta_k\big)\big(\textstyle\sum_{k\in F}\zeta_k\big)b_6$$
$$= \big(\textstyle\sum_{k\in G}\zeta_k\big)^2\big(\tfrac{q_G^2}{s_2^2} - \tfrac{q_G q_F}{s_2^2(s_2-1)}\big) + \big(\textstyle\sum_{k\in G}1\big)(b_2 - b_1) + \big(\textstyle\sum_{k\in F}\zeta_k\big)^2\big(\tfrac{q_F^2}{s_2^2} - \tfrac{q_F q_G}{s_2^2(s_2-1)}\big) + \big(\textstyle\sum_{k\in F}1\big)(b_4 - b_3) - \big(\textstyle\sum_{k\in G}\zeta_k\big)\big(\textstyle\sum_{k\in F}\zeta_k\big)\big(\tfrac{q_G q_F}{s_2^2} + \tfrac{q_G q_F}{s_2^2(s_2-1)}\big) - \big(\textstyle\sum_{k\in G}\zeta_k\big)\big(\textstyle\sum_{k\in F}\zeta_k\big)\big(\tfrac{q_G q_F}{s_2^2} + \tfrac{q_G q_F}{s_2^2(s_2-1)}\big)$$
$$= \Big(\big(\textstyle\sum_{k\in G}\zeta_k\big)\tfrac{q_G}{s_2} - \big(\textstyle\sum_{k\in F}\zeta_k\big)\tfrac{q_F}{s_2}\Big)^2 - \tfrac{q_G q_F}{s_2^2(s_2-1)}\Big(\textstyle\sum_{k\in G}\zeta_k + \textstyle\sum_{k\in F}\zeta_k\Big)^2 + |G|(b_2 - b_1) + |F|(b_4 - b_3)$$
$$\le \Big(\big(\textstyle\sum_{k\in G}\zeta_k\big)\mathbb{E}[\mathbb{1}_{k\in B} \mid \pi, k\in G] - \big(\textstyle\sum_{k\in F}\zeta_k\big)\mathbb{E}[\mathbb{1}_{k\in C} \mid \pi, k\in F]\Big)^2 + s_1.$$

² $\sqcup$ denotes a disjoint union.
   
Let $c_1 = \mathbb{E}[\mathbb{1}_{k\in B} \mid \pi, k\in G]$ and $c_2 = \mathbb{E}[\mathbb{1}_{k\in C} \mid \pi, k\in F]$. By Hoeffding's inequality [23, Thm. 2] and the union bound,
$$\Pr\Big(\big|c_1\textstyle\sum_{k\in G}\zeta_k - c_2\textstyle\sum_{k\in F}\zeta_k\big| \ge t \,\Big|\, \pi\Big) \le 2\exp\Big(-\tfrac{t^2}{2(c_1^2|G| + c_2^2|F|)}\Big) \le 2\exp\Big(-\tfrac{t^2}{2s_1}\Big).$$
Therefore, with probability at least $1-\delta''$,
$$\mathbb{E}[c_{\zeta,\pi,\ell} \mid \zeta, \pi] = \tfrac{\mathrm{MMD}_k^2(P,Q)}{s_1^2}\sum_{k_1,k_2=1}^{s_1}\zeta_{k_1}\zeta_{k_2}b_{k_1,k_2}(\pi) \le \tfrac{\mathrm{MMD}_k^2(P,Q)}{s_1^2}\Big(\big(c_1\textstyle\sum_{k\in G}\zeta_k - c_2\textstyle\sum_{k\in F}\zeta_k\big)^2 + s_1\Big) < \tfrac{\mathrm{MMD}_k^2(P,Q)}{s_1^2}\big(2\log(2/\delta'')s_1 + s_1\big) = \tfrac{\mathrm{MMD}_k^2(P,Q)(2\log(2/\delta'')+1)}{s_1}.$$
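The Hoeffding step above is easy to check by simulation. In this sketch (my own illustration, not from the paper; the set sizes and constants $c_1$, $c_2$ are arbitrary), the empirical tail of the weighted Rademacher sum stays below the stated bound:

```python
import numpy as np

# Empirical check of the two-sided Hoeffding bound:
# Pr(|c1 * sum_G zeta + c2 * sum_F zeta| >= t)
#   <= 2 exp(-t^2 / (2 (c1^2 |G| + c2^2 |F|))).
rng = np.random.default_rng(2)
sizeG, sizeF, c1, c2, trials = 60, 40, 0.7, 0.4, 20000
zG = rng.choice([-1, 1], size=(trials, sizeG))
zF = rng.choice([-1, 1], size=(trials, sizeF))
S = c1 * zG.sum(axis=1) + c2 * zF.sum(axis=1)
t = 10.0
hoeffding = 2 * np.exp(-t**2 / (2 * (c1**2 * sizeG + c2**2 * sizeF)))
empirical = np.mean(np.abs(S) >= t)
print(empirical <= hoeffding)
```

With these values the Hoeffding bound is roughly 0.49 while the true tail probability is an order of magnitude smaller, so the comparison holds with a wide margin.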
It remains to bound the tail of
$$\zeta^\top\mathbb{E}[A_{\pi,\ell} \mid X, \pi]\zeta = \zeta^\top\mathbb{E}[A^{\mathrm{od}}_{\pi,\ell} \mid X, \pi]\zeta + \mathbb{E}[\mathrm{Tr}(A_{\pi,\ell}) \mid X, \pi],$$
where $A^{\mathrm{od}}_{\pi,\ell}$ has the off-diagonal components of $A_{\pi,\ell}$ and zeros on the diagonal. We bound $\zeta^\top\mathbb{E}[A^{\mathrm{od}}_{\pi,\ell} \mid X, \pi]\zeta$ first. Since $\sup_{\lambda\in\mathbb{R}}\mathbb{E}[e^{\lambda\zeta_1 - \lambda^2/2}] \le 1$ [5, Lem. 2.2], Lem. D.3 implies that
$$(54)\quad \zeta^\top\mathbb{E}[A^{\mathrm{od}}_{\pi,\ell} \mid X, \pi]\zeta \le \tfrac{20\log(1/\delta')\,\|\mathbb{E}[A^{\mathrm{od}}_{\pi,\ell} \mid X, \pi]\|_F}{3} \le \tfrac{20\log(1/\delta')\,\sup_{\pi,\ell}\|A^{\mathrm{od}}_{\pi,\ell}\|_F}{3}$$
with probability at least $1-\delta'$ conditioned on $(X, \pi)$, provided that $\delta' \in (0, 0.86)$. Setting $\delta'' = \tfrac{\alpha}{2}$ and $\delta' = \tfrac{\alpha}{2}$, we find that, with probability at least $1-\alpha$ conditioned on $X$,
$$\mathbb{E}[\tilde V^{\pi,\ell,\zeta,s}_{n_1,n_2} \mid X, \pi, \zeta] = \zeta^\top\mathbb{E}[A^{\mathrm{od}}_{\pi,\ell} \mid X, \pi]\zeta + \mathbb{E}[\mathrm{Tr}(A_{\pi,\ell}) \mid X, \pi] + \mathbb{E}[c_{\zeta,\pi,\ell} \mid \pi, \zeta]$$
$$(55)\quad < \Psi_{X,s}(\alpha) \triangleq \tfrac{20\log(\frac{2}{\alpha})\sup_{\pi,\ell}\|A^{\mathrm{od}}_{\pi,\ell}\|_F}{3} + \sup_{\pi,\ell}\mathrm{Tr}(A_{\pi,\ell}) + \tfrac{\mathrm{MMD}_k^2(P,Q)(2\log(4/\alpha)+1)}{s_1}.$$
Now it only remains to find $\Psi_{n_1,n_2,s}$ satisfying $\Pr(\Psi_{X,s}(\alpha) \ge \Psi_{n_1,n_2,s}(\alpha,\beta)) \le \beta$. To achieve this we will develop quantile bounds for $\sup_{\pi,\ell}\|A^{\mathrm{od}}_{\pi,\ell}\|_F$ and $\sup_{\pi,\ell}\mathrm{Tr}(A_{\pi,\ell})$. We begin with $\sup_{\pi,\ell}\|A^{\mathrm{od}}_{\pi,\ell}\|_F$. Since
$$H_{\mathrm{ho},m}(X_{\pi_{k_1}}, X_{\pi_{k_2}}; X_{\pi_{s_1+\ell_{k_1}}}, X_{\pi_{s_1+\ell_{k_2}}}) = \langle \delta_{X_{\pi_{k_1}}}k - \delta_{X_{\pi_{s_1+\ell_{k_1}}}}k,\ \delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k\rangle_k,$$
we have, by the triangle inequality and Cauchy-Schwarz,
$$s_1^2|a_{k_1,k_2}(\pi,\ell)| = \big|\langle \delta_{X_{\pi_{k_1}}}k - \delta_{X_{\pi_{s_1+\ell_{k_1}}}}k,\ \delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k\rangle_k - \langle \mu_{k_1}k - \mu'_{k_1}k,\ \mu_{k_2}k - \mu'_{k_2}k\rangle_k\big|$$
$$\le \big|\langle \delta_{X_{\pi_{k_1}}}k - \delta_{X_{\pi_{s_1+\ell_{k_1}}}}k - (\mu_{k_1}k - \mu'_{k_1}k),\ \delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k\rangle_k\big| + \big|\langle \mu_{k_1}k - \mu'_{k_1}k,\ \delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k - (\mu_{k_2}k - \mu'_{k_2}k)\rangle_k\big|$$
$$\le \|\delta_{X_{\pi_{k_1}}}k - \delta_{X_{\pi_{s_1+\ell_{k_1}}}}k - (\mu_{k_1}k - \mu'_{k_1}k)\|_k\,\|\delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k\|_k + \|\mu_{k_1}k - \mu'_{k_1}k\|_k\,\|\delta_{X_{\pi_{k_2}}}k - \delta_{X_{\pi_{s_1+\ell_{k_2}}}}k - (\mu_{k_2}k - \mu'_{k_2}k)\|_k$$
$$\le \big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2}) + \mathrm{MMD}_k(\mu_{k_2}, \mu'_{k_2})\big) + \mathrm{MMD}_k(\mu_{k_1}, \mu'_{k_1})\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big).$$
Hence, by Jensen's inequality and the arithmetic-geometric mean inequality,
$$\sum_{k_1,k_2=1}^{s_1}s_1^4|a_{k_1,k_2}(\pi,\ell)|^2 \le \sum_{k_1,k_2=1}^{s_1}\Big(\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2}) + \mathrm{MMD}_k(\mu_{k_2}, \mu'_{k_2})\big) + \mathrm{MMD}_k(\mu_{k_1}, \mu'_{k_1})\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big)\Big)^2$$
$$\le \sum_{k_1,k_2=1}^{s_1}3\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)^2\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big)^2 + 3\,\mathrm{MMD}_k^2(\mu_{k_2}, \mu'_{k_2})\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)^2 + 3\,\mathrm{MMD}_k^2(\mu_{k_1}, \mu'_{k_1})\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big)^2$$
$$\le \sum_{k_1,k_2=1}^{s_1}\tfrac{3}{2}\Big(\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)^4 + \big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big)^4\Big) + 6\,\mathrm{MMD}_k^2(\mu_{k_2}, \mu'_{k_2})\big(\mathrm{MMD}_k^2(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big) + 6\,\mathrm{MMD}_k^2(\mu_{k_1}, \mu'_{k_1})\big(\mathrm{MMD}_k^2(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\big)$$
$$\le \sum_{k_1,k_2=1}^{s_1}12\Big(\mathrm{MMD}_k^4(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k^4(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1}) + \mathrm{MMD}_k^4(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k^4(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\Big) + 6\,\mathrm{MMD}_k^2(P,Q)\Big(\mathrm{MMD}_k^2(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{k_2}}}, \mu_{k_2}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{s_1+\ell_{k_2}}}}, \mu'_{k_2})\Big)$$
$$\le \sum_{i=1}^{s_1}\Big(24 s_1\,\mathrm{MMD}_k^4(\delta_{Y_i}, P) + 12 s_1\,\mathrm{MMD}_k^2(P,Q)\,\mathrm{MMD}_k^2(\delta_{Y_i}, P)\Big) + \sum_{i=1}^{s_2}\Big(24 s_1\,\mathrm{MMD}_k^4(\delta_{Z_i}, Q) + 12 s_1\,\mathrm{MMD}_k^2(P,Q)\,\mathrm{MMD}_k^2(\delta_{Z_i}, Q)\Big).$$
Let $\mathcal{E}$ be the event that the inequality of Lem. D.4 with $\delta = \tfrac{\beta}{3}$ holds simultaneously for all $p \in \{1,2,4\}$. By the union bound, this holds with probability at least $1-\beta$. Moreover, on $\mathcal{E}$,
$$\sum_{k_1,k_2=1}^{s_1}s_1^4|a_{k_1,k_2}(\pi,\ell)|^2 < 24 s_1\big(A_4(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)^4 + 12 s_1\,\mathrm{MMD}_k^2(P,Q)\big(A_2(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)^2.$$
Therefore, on $\mathcal{E}$,
$$(56)\quad \sup_{\pi,\ell}\|A^{\mathrm{od}}_{\pi,\ell}\|_F < \sqrt{\tfrac{24 s_1(A_4(s_1,s_2,m) + B(s_1,s_2,m,\frac{\beta}{3}))^4 + 12 s_1\,\mathrm{MMD}_k^2(P,Q)(A_2(s_1,s_2,m) + B(s_1,s_2,m,\frac{\beta}{3}))^2}{s_1^4}} \le \tfrac{\sqrt{24}\,(A_4(s_1,s_2,m) + B(s_1,s_2,m,\frac{\beta}{3}))^2}{s_1^{3/2}} + \tfrac{\sqrt{12}\,\mathrm{MMD}_k(P,Q)(A_2(s_1,s_2,m) + B(s_1,s_2,m,\frac{\beta}{3}))}{s_1^{3/2}}.$$
Next, we bound $\sup_{\pi,\ell}\mathrm{Tr}(A_{\pi,\ell})$ using Jensen's inequality:
$$\mathrm{Tr}(A_{\pi,\ell}) = \sum_{k_1=1}^{s_1}a_{k_1,k_1}(\pi,\ell) \le \tfrac{1}{s_1^2}\sum_{k_1=1}^{s_1}\Big(\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)^2 + 2\,\mathrm{MMD}_k(\mu_{k_1}, \mu'_{k_1})\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)\Big)$$
$$\le \tfrac{1}{s_1^2}\sum_{k_1=1}^{s_1}\Big(2\big(\mathrm{MMD}_k^2(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k^2(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big) + 2\,\mathrm{MMD}_k(P,Q)\big(\mathrm{MMD}_k(\delta_{X_{\pi_{k_1}}}, \mu_{k_1}) + \mathrm{MMD}_k(\delta_{X_{\pi_{s_1+\ell_{k_1}}}}, \mu'_{k_1})\big)\Big)$$
$$\le \tfrac{2}{s_1^2}\Big(\textstyle\sum_{i=1}^{s_1}\mathrm{MMD}_k^2(\delta_{Y_i}, P) + \sum_{i=1}^{s_2}\mathrm{MMD}_k^2(\delta_{Z_i}, Q) + \mathrm{MMD}_k(P,Q)\big(\sum_{i=1}^{s_1}\mathrm{MMD}_k(\delta_{Y_i}, P) + \sum_{i=1}^{s_2}\mathrm{MMD}_k(\delta_{Z_i}, Q)\big)\Big).$$
On $\mathcal{E}$ we therefore have
$$(57)\quad \sup_{\pi,\ell}\mathrm{Tr}(A_{\pi,\ell}) < \tfrac{2}{s_1^2}\Big(\big(A_2(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)^2 + \mathrm{MMD}_k(P,Q)\big(A_1(s_1,s_2,m) + B(s_1,s_2,m,\beta/3)\big)\Big).$$
From (54), (56), and (57), we therefore obtain that, with probability at least $1-\beta$,
$$\Psi_{X,s}(\alpha) < \Psi_{n_1,n_2,s}(\alpha,\beta).$$

APPENDIX E: Proof of Thm. 3: Power of cheap independence: finite variance

We will establish both claims using the refined two moments method (Cor. 3) with the auxiliary sequence $(V_n^{\pi_b,s})_{b=1}^B$ of cheaply-permuted independence V-statistics (32). By Cor. 3 with option (25), it suffices to bound $\mathrm{Var}(V_n)$, $\mathbb{E}[\mathrm{Var}(V_n^{\pi,s} \mid X)]$, $\mathrm{Var}(\mathbb{E}[V_n^{\pi,s} \mid X])$, and $\mathbb{E}[V_n^{\pi,s}]$. These quantities are bounded in Lems. E.1, E.2, and E.3, with proofs in Apps. E.1, E.2, and E.3.

LEMMA E.1 (Variance of independence V-statistics). Under the assumptions of Thm. 3,
$$\mathrm{Var}(V_n) \le \tfrac{16\psi_1'}{n} + \tfrac{144\psi_2'}{n^2}.$$

LEMMA E.2 (Conditional variance of permuted independence V-statistics). Under the assumptions of Thm. 3,
$$(58)\quad \max\big(\mathrm{Var}(\mathbb{E}[V_n^{\pi,n} \mid X]),\ \mathbb{E}[\mathrm{Var}(V_n^{\pi,n} \mid X)]\big) \le \mathrm{Var}(V_n^{\pi,n}) \le \tfrac{32\tilde\psi_1'}{n} + \tfrac{960\psi_2'}{n^2} \quad\text{and}$$
$$(59)\quad \max\big(\mathrm{Var}(\mathbb{E}[V_n^{\pi,s} \mid X]),\ \mathbb{E}[\mathrm{Var}(V_n^{\pi,s} \mid X)]\big) \le \mathrm{Var}(V_n^{\pi,s}) \le \tilde\psi_1'\big(\tfrac{32}{n} + \tfrac{2304}{ns} + \tfrac{460800}{ns^2}\big) + \psi_2'\big(\tfrac{960}{n^2} + \tfrac{4147200}{n^2 s}\big).$$

LEMMA E.3 (Mean of permuted independence V-statistics). Under the assumptions of Thm. 3,
$$(60)\quad \mathbb{E}[V_n^{\pi,n}] \le \tfrac{1}{4}U + \tfrac{10n^2 - 44n + 48}{n^3}\tilde\xi \quad\text{and}$$
$$(61)\quad \mathbb{E}[V_n^{\pi,s}] \le \tfrac{1}{4}\big(1 + \tfrac{s^2 + s - 4}{s^3}\big)U + \tfrac{10n^2 - 44n + 48}{n^3}\tilde\xi.$$

Now fix any $\alpha, \beta \in (0,1)$. By Lem. E.2,
$$\sqrt{\big(\tfrac{2}{\alpha\beta} - \tfrac{2}{\beta}\big)\mathbb{E}[\mathrm{Var}(V_n^{\pi,s} \mid X)]} + \sqrt{\big(\tfrac{2}{\beta} - 1\big)\mathrm{Var}(\mathbb{E}[V_n^{\pi,s} \mid X])} \le \Big(\sqrt{\tfrac{2}{\alpha\beta} - \tfrac{2}{\beta}} + \sqrt{\tfrac{2}{\beta} - 1}\Big)\sqrt{\mathrm{Var}(V_n^{\pi,s})} \le \sqrt{\tfrac{4}{\alpha\beta} - 2}\,\sqrt{\mathrm{Var}(V_n^{\pi,s})},$$
with the same result holding when $V_n^{\pi,n}$ is substituted for $V_n^{\pi,s}$. Hence, by Cor. 3 and Lems. E.1 to E.3, the standard and cheap tests enjoy power at least $1-\beta$ whenever
$$(62)\quad \mathbb{E}[V_n] - \tfrac{1}{4}U \ge \Big(\big(\tfrac{3}{\beta} - 1\big)\big(\tfrac{16\psi_1'}{n} + \tfrac{144\psi_2'}{n^2}\big)\Big)^{1/2} + \Big(\big(\tfrac{4}{\alpha_\star\beta} - 2\big)\big(\tfrac{32\tilde\psi_1'}{n} + \tfrac{960\psi_2'}{n^2}\big)\Big)^{1/2} + \tfrac{10}{n}\tilde\xi \quad\text{and}$$
$$(63)\quad \mathbb{E}[V_n] - \tfrac{1}{4}\big(1 + \tfrac{s^2+s-4}{s^3}\big)U \ge \Big(\big(\tfrac{3}{\beta} - 1\big)\big(\tfrac{16\psi_1'}{n} + \tfrac{144\psi_2'}{n^2}\big)\Big)^{1/2} + \Big(\big(\tfrac{4}{\alpha_\star\beta} - 2\big)\big(\tilde\psi_1'\big(\tfrac{32}{n} + \tfrac{2304}{ns} + \tfrac{460800}{ns^2}\big) + \psi_2'\big(\tfrac{960}{n^2} + \tfrac{4147200}{n^2 s}\big)\big)\Big)^{1/2} + \tfrac{10}{n}\tilde\xi,$$
respectively. To recover the looser sufficient condition (16) for standard testing, we note that, by Lem. E.4, $\mathbb{E}[V_n] \ge U - \tfrac{24\sqrt{\psi_2'} + 6\tilde\xi}{n}$. To recover the looser sufficient condition (17) for cheap testing, we again invoke $\mathbb{E}[V_n] \ge U - \tfrac{24\sqrt{\psi_2'} + 6\tilde\xi}{n}$ and additionally note that
$$\Big(\big(\tfrac{3}{\beta} - 1\big)\big(\tfrac{16\psi_1'}{n} + \tfrac{144\psi_2'}{n^2}\big)\Big)^{1/2} + \Big(\big(\tfrac{4}{\alpha_\star\beta} - 2\big)\big(\tilde\psi_1'\big(\tfrac{32}{n} + \tfrac{2304}{ns} + \tfrac{460800}{ns^2}\big) + \psi_2'\big(\tfrac{960}{n^2} + \tfrac{4147200}{n^2 s}\big)\big)\Big)^{1/2} + \tfrac{10}{n}\tilde\xi + \tfrac{24\sqrt{\psi_2'} + 6\tilde\xi}{n}$$
$$\le \tfrac{4}{3}\gamma_n + \Big(\big(\tfrac{4}{\alpha_\star\beta} - 2\big)\big(\tilde\psi_1'\big(\tfrac{2304}{ns} + \tfrac{460800}{ns^2}\big) + \psi_2'\,\tfrac{4147200}{n^2 s}\big)\Big)^{1/2}.$$

LEMMA E.4 (Independence V vs. U-statistics). Let $U_n$ be the U-statistic associated with $h_{\mathrm{in}}$, which takes the form $U_n = \tfrac{1}{(n)_4}\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})$ and satisfies $\mathbb{E}[U_n] = \mathbb{E}[h_{\mathrm{in}}(X_1, X_2, X_3, X_4)] = U$. Under the assumptions of Thm. 3,
$$|\mathbb{E}[V_n - U_n]| \le \tfrac{24\sqrt{\psi_2'} + 6\tilde\xi}{n}.$$

PROOF. By definition,
$$V_n - U_n = \tfrac{1}{n^4}\sum_{(i_1,i_2,i_3,i_4)\in[n]^4}h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}) - \tfrac{1}{(n)_4}\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})$$
$$= \Big(\tfrac{1}{n^4} - \tfrac{1}{(n)_4}\Big)\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}) + \tfrac{1}{n^4}\sum_{(i_1,i_2,i_3,i_4)\in[n]^4\setminus i^n_4}h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}).$$
Thus, we have that
$$|\mathbb{E}[V_n - U_n]| \le \Big|\tfrac{1}{n^4} - \tfrac{1}{(n)_4}\Big|\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}\big|\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\big| + \tfrac{1}{n^4}\sum_{(i_1,i_2,i_3,i_4)\in[n]^4\setminus i^n_4}\big|\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\big|.$$
We bound the first term on the right-hand side using Jensen's inequality and the $\psi_2'$ definition:
$$\Big|\tfrac{1}{n^4} - \tfrac{1}{(n)_4}\Big|\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}\big|\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\big| \le \Big(\tfrac{1}{(n)_4} - \tfrac{1}{n^4}\Big)\sum_{(i_1,i_2,i_3,i_4)\in i^n_4}\mathbb{E}\big[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})^2\big]^{1/2} \le \Big(\tfrac{1}{(n)_4} - \tfrac{1}{n^4}\Big)(n)_4\,(16\psi_2')^{1/2}.$$
We bound the second term using the definition of $\tilde\xi$:
$$\tfrac{1}{n^4}\sum_{(i_1,i_2,i_3,i_4)\in[n]^4\setminus i^n_4}\big|\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\big| \le \Big(\tfrac{1}{(n)_4} - \tfrac{1}{n^4}\Big)(n)_4\,\tilde\xi.$$
The advertised conclusion follows as $1 - \tfrac{(n)_4}{n^4} = 1 - \tfrac{(n-1)(n-2)(n-3)}{n^3} = \tfrac{6n^2 - 11n + 6}{n^3} \le \tfrac{6}{n}$.

E.1. Proof of Lem. E.1: Variance of independence V-statistics. Define the index sets
$$K_{\mathrm{total}} \triangleq [n]^8,$$
$$K_1 \triangleq \{(i_1, i_2, i_3, i_4, i_1', i_2', i_3', i_4') \in K_{\mathrm{total}} : |\{i_1, i_2, i_3, i_4\} \cap \{i_1', i_2', i_3', i_4'\}| = 0\},$$
$$K_2 \triangleq \{(i_1, i_2, i_3, i_4, i_1', i_2', i_3', i_4') \in K_{\mathrm{total}} : |\{i_1, i_2, i_3, i_4\} \cap \{i_1', i_2', i_3', i_4'\}| = 1\}, \quad\text{and}$$
$$(K_1\cup K_2)^c \triangleq \{(i_1, i_2, i_3, i_4, i_1', i_2', i_3', i_4') \in K_{\mathrm{total}} : |\{i_1, i_2, i_3, i_4\} \cap \{i_1', i_2', i_3', i_4'\}| > 1\}.$$
Then,
$$\mathrm{Var}(V_n) = \mathbb{E}[V_n^2] - \mathbb{E}[V_n]^2 = \tfrac{1}{n^8}\sum_{(i_1,\ldots,i_4,i_1',\ldots,i_4')\in K_{\mathrm{total}}}\Big(\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})] - \mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\,\mathbb{E}[h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})]\Big) = (I) + (II) + (III),$$
where $(I)$ is the summation over $(i_1,\ldots,i_4')\in K_1$, $(II)$ is the summation over $(i_1,\ldots,i_4')\in K_2$, and $(III)$ is the summation over $(i_1,\ldots,i_4')\in(K_1\cup K_2)^c$. Note that
$$|K_2| \le n^4\cdot 4\cdot 4\cdot n^3 = 16n^7 \quad\text{and}\quad |(K_1\cup K_2)^c| \le n^4\cdot 4\cdot 3\cdot 4\cdot 3\cdot n^2 = 144n^6.$$
We also remark that, for $(i_1,\ldots,i_4')\in K_1$, $\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})] = \mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})]\,\mathbb{E}[h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})]$, which implies that $(I) = 0$.
By the definition (14) of $\psi_1'$, we have that $(II) \le \tfrac{1}{n^8}|K_2|\psi_1' \le \tfrac{16\psi_1'}{n}$. By the nonnegativity of $h_{\mathrm{in}}$, Cauchy-Schwarz, Jensen's inequality, and the definition (14) of $\psi_2'$, we have that each summand of $(III)$ is upper-bounded by
$$\mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})] \le \mathbb{E}[h_{\mathrm{in}}(X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4})^2]^{1/2}\,\mathbb{E}[h_{\mathrm{in}}(X_{i_1'}, X_{i_2'}, X_{i_3'}, X_{i_4'})^2]^{1/2} \le \psi_2'.$$
Hence, $(III) \le \tfrac{1}{n^8}|(K_1\cup K_2)^c|\psi_2' \le \tfrac{144\psi_2'}{n^2}$, which concludes the proof.
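The cardinality bounds on $K_2$ and $(K_1\cup K_2)^c$ above are crude overcounts, which a brute-force enumeration confirms for a small $n$ (this sketch is my own check, not from the paper; $n = 4$ is an arbitrary choice):

```python
from itertools import product

# Count tuples in [n]^8 by the overlap |{i1..i4} ∩ {i1'..i4'}| and compare
# with the bounds |K2| <= 16 n^7 and |(K1 ∪ K2)^c| <= 144 n^6.
n = 4
k2 = kbig = 0
for tup in product(range(n), repeat=8):
    overlap = len(set(tup[:4]) & set(tup[4:]))
    if overlap == 1:
        k2 += 1
    elif overlap > 1:
        kbig += 1
print(k2 <= 16 * n**7, kbig <= 144 * n**6)
```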

E.2. Proof of Lem. E.2: Conditional variance of permuted independence V-statistics. Throughout we use the shorthand $i \triangleq (i_1, i_2, i_3, i_4, i_1', i_2', i_3', i_4')$.

E.2.1. Proof of (58). Our approach is similar to that employed in [25, Thm. 5.1] to study independence U-statistics. We begin by defining the convenient shorthand
$$s(i) \triangleq |\{i_1, i_2, i_3, i_4, n+1-i_1, n+1-i_2, n+1-i_3, n+1-i_4\} \cap \{i_1', i_2', i_3', i_4'\}|,$$
$$K_{\mathrm{total}} \triangleq [n]^8, \quad K_1 \triangleq \{i \in K_{\mathrm{total}} : s(i) = 0\}, \quad K_2 \triangleq \{i \in K_{\mathrm{total}} : s(i) = 1\}, \quad\text{and}\quad (K_1\cup K_2)^c \triangleq \{i \in K_{\mathrm{total}} : s(i) > 1\},$$
and writing
$$h_{\mathrm{in}}(x_1, x_2, x_3, x_4) = h_{\mathrm{in},Y}(y_1, y_2, y_3, y_4)\,h_{\mathrm{in},Z}(z_1, z_2, z_3, z_4), \quad\text{for}$$
$$h_{\mathrm{in},Y}(y_1, y_2, y_3, y_4) \triangleq g_Y(y_1, y_2) + g_Y(y_3, y_4) - g_Y(y_1, y_3) - g_Y(y_2, y_4) \quad\text{and}$$
$$h_{\mathrm{in},Z}(z_1, z_2, z_3, z_4) \triangleq g_Z(z_1, z_2) + g_Z(z_3, z_4) - g_Z(z_1, z_3) - g_Z(z_2, z_4).$$
By Cauchy-Schwarz,
$$\mathbb{E}[\mathrm{Var}(V_n^{\pi,n} \mid X)] = \mathbb{E}\big[\mathbb{E}[(V_n^{\pi,n})^2 \mid X] - (\mathbb{E}[V_n^{\pi,n} \mid X])^2\big] \le \mathbb{E}[(V_n^{\pi,n})^2] - (\mathbb{E}[V_n^{\pi,n}])^2$$
$$(64)\quad = \tfrac{1}{n^8}\sum_{i\in K_{\mathrm{total}}}\Big(\mathbb{E}\big[\mathrm{H}_Y\mathrm{H}_Y'\,h_{\mathrm{in},Z}(Z_{\pi_{i_1}}, Z_{\pi_{i_2}}, Z_{\pi_{i_3}}, Z_{\pi_{i_4}})\,h_{\mathrm{in},Z}(Z_{\pi_{i_1'}}, Z_{\pi_{i_2'}}, Z_{\pi_{i_3'}}, Z_{\pi_{i_4'}})\big] - \mathbb{E}\big[\mathrm{H}_Y\,h_{\mathrm{in},Z}(Z_{\pi_{i_1}}, Z_{\pi_{i_2}}, Z_{\pi_{i_3}}, Z_{\pi_{i_4}})\big]\,\mathbb{E}\big[\mathrm{H}_Y'\,h_{\mathrm{in},Z}(Z_{\pi_{i_1'}}, Z_{\pi_{i_2'}}, Z_{\pi_{i_3'}}, Z_{\pi_{i_4'}})\big]\Big)$$
$$= (I) + (II) + (III),$$
where $\mathrm{H}_Y \triangleq h_{\mathrm{in},Y}(Y_{i_1}, Y_{i_2}, Y_{i_3}, Y_{i_4})$, $\mathrm{H}_Y' \triangleq h_{\mathrm{in},Y}(Y_{i_1'}, Y_{i_2'}, Y_{i_3'}, Y_{i_4'})$, and $(I)$, $(II)$, $(III)$ are the summations over $i\in K_1$, $i\in K_2$, and $i\in(K_1\cup K_2)^c$ respectively.
Note that for $i\in K_1$,
$$\mathbb{E}\big[\mathrm{H}_Y\mathrm{H}_Y'\,h_{\mathrm{in},Z}(Z_{\pi_{i_1}}, Z_{\pi_{i_2}}, Z_{\pi_{i_3}}, Z_{\pi_{i_4}})\,h_{\mathrm{in},Z}(Z_{\pi_{i_1'}}, Z_{\pi_{i_2'}}, Z_{\pi_{i_3'}}, Z_{\pi_{i_4'}})\big] = \mathbb{E}\big[\mathrm{H}_Y\,h_{\mathrm{in},Z}(Z_{\pi_{i_1}}, Z_{\pi_{i_2}}, Z_{\pi_{i_3}}, Z_{\pi_{i_4}})\big]\,\mathbb{E}\big[\mathrm{H}_Y'\,h_{\mathrm{in},Z}(Z_{\pi_{i_1'}}, Z_{\pi_{i_2'}}, Z_{\pi_{i_3'}}, Z_{\pi_{i_4'}})\big],$$
so that $(I) = 0$. Also, note that
$$|K_2| \le n^4\cdot 2\cdot 4\cdot 4\cdot n^3 = 32n^7 \quad\text{and}\quad |(K_1\cup K_2)^c| \le n^4\cdot 2\cdot 4\cdot 3\cdot 2\cdot 4\cdot 3\cdot n^2 = 576n^6.$$
For $i, j \in [4]$, let us define $\bar K_{2,i,j}$ as the set of indices $i\in K_2$ with $|i| = 7$ and either $i_i = i_j'$ or $i_i = n+1-i_j'$. We further define $\bar K_2$ as the union of these 16 sets. Note that $K_2 \setminus \bar K_2$ can be characterized as the set of $i\in K_2$ such that, for exactly three indices $i, j, k$ of the eight indices, the pairs $\{i, n+1-i\}$, $\{j, n+1-j\}$, and $\{k, n+1-k\}$ are equal. This characterization allows us to write
$$|K_2 \setminus \bar K_2| = 4\cdot 4\cdot 6\cdot 2^2\cdot n^6 = 384n^6.$$
Note that if $i\in\bar K_2$ with $i_1 = i_1'$, then the corresponding summand from (64) equals
$$\mathbb{E}\Big[\big(h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4}) - \mathbb{E}[h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4})]\big)\times\big(h_{\mathrm{in},Y}(Y_1, Y_5, Y_6, Y_7)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_5}, Z_{\pi_6}, Z_{\pi_7}) - \mathbb{E}[h_{\mathrm{in},Y}(Y_1, Y_5, Y_6, Y_7)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_5}, Z_{\pi_6}, Z_{\pi_7})]\big)\Big]$$
$$= \mathrm{Var}\big(\mathbb{E}\big[h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4}) \mid Y_1, \pi_1, Z_{\pi_1}\big]\big) = \tilde\psi_1',$$
where the last equality holds since the distribution of $h_{\mathrm{in}}((Y_1, Z_{\pi_1}), (Y_2, Z_{\pi_2}), (Y_3, Z_{\pi_3}), (Y_4, Z_{\pi_4}))$ conditional on $(Y_1, Z_{\pi_1}, \pi_1)$ is invariant to the order of the four arguments, and hence $\mathbb{E}[h_{\mathrm{in}}((Y_1, Z_{\pi_1}), (Y_2, Z_{\pi_2}), (Y_3, Z_{\pi_3}), (Y_4, Z_{\pi_4})) \mid (Y_1, Z_{\pi_1}, \pi_1)]$ does not depend on the argument position in which $(Y_1, Z_{\pi_1})$ appears.
Similarly, if $i\in\bar K_2$ with $i_1 = n+1-i_1'$, then the corresponding summand of (64) equals
$$\mathbb{E}\Big[\big(h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4}) - \mathbb{E}[h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4})]\big)\times\big(h_{\mathrm{in},Y}(Y_n, Y_5, Y_6, Y_7)h_{\mathrm{in},Z}(Z_{\pi_n}, Z_{\pi_5}, Z_{\pi_6}, Z_{\pi_7}) - \mathbb{E}[h_{\mathrm{in},Y}(Y_n, Y_5, Y_6, Y_7)h_{\mathrm{in},Z}(Z_{\pi_n}, Z_{\pi_5}, Z_{\pi_6}, Z_{\pi_7})]\big)\Big]$$
$$\le \tfrac{1}{2}\mathrm{Var}\big(\mathbb{E}\big[h_{\mathrm{in},Y}(Y_1, Y_2, Y_3, Y_4)h_{\mathrm{in},Z}(Z_{\pi_1}, Z_{\pi_2}, Z_{\pi_3}, Z_{\pi_4}) \mid Y_1, \pi_1, Z_{\pi_1}\big]\big) + \tfrac{1}{2}\mathrm{Var}\big(\mathbb{E}\big[h_{\mathrm{in},Y}(Y_n, Y_5, Y_6, Y_7)h_{\mathrm{in},Z}(Z_{\pi_n}, Z_{\pi_5}, Z_{\pi_6}, Z_{\pi_7}) \mid Y_n, \pi_n, Z_{\pi_n}\big]\big) = \tilde\psi_1'$$
by Cauchy-Schwarz and the arithmetic-geometric mean inequality.


Moreover, for any i ∈ (K1 ∪ K2 )c ∪ (K2 \ K̄2 ), the corresponding summand of (64) equals

E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )
 
− E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )
× hin,Y (Yi′1 , Yi′2 , Yi′3 , Yi′4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4
 
− E hin,Y (Yi′1 , Yi′2 , Yi′3 , Yi′4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4

 2 1/2
≤ E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )
2 1/2
≤ ψ2′

× E hin,Y (Yi′1 , Yi′2 , Yi′3 , Yi′4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4

by Cauchy-Schwarz. Thus, (II) is upper-bounded by ψ2′ |K2n\8K̄2 | + ψ̃1′ |K̄2|


n8 , and we find that
c
960ψ2′ 32ψ̃1′
Var(Vnπ,n ) ≤ ψ2′ |(K1 ∪K2 )n|+|K
8
2 \K̄2 |
+ ψ̃1′ |K̄2|
n8 ≤ n2 + n .
Finally, by Jensen’s inequality,
Var(E[Vnπ,n |X]) = E[(E[Vnπ,n |X])2 ] − (E[E[Vnπ,n |X]])2
≤ E[E[(Vnπ,n )2 |X]] − (E[E[Vnπ,n |X]])2 = Var(Vnπ,n ).

E.2.2. Proof of (59). Our argument is analogous to that of App. E.2.1 with $s$ substituted for $n$ and $H_{\mathrm{in},m}$ substituted for $h_{\mathrm{in}}$. We begin by defining the convenient shorthand
$$s(i) \triangleq |\{i_1, i_2, i_3, i_4, s+1-i_1, s+1-i_2, s+1-i_3, s+1-i_4\} \cap \{i_1', i_2', i_3', i_4'\}|,$$
$$K_{\mathrm{total}} \triangleq [s]^8, \quad K_1 \triangleq \{i\in K_{\mathrm{total}} : s(i) = 0\}, \quad K_2 \triangleq \{i\in K_{\mathrm{total}} : s(i) = 1\}, \quad\text{and}\quad (K_1\cup K_2)^c \triangleq \{i\in K_{\mathrm{total}} : s(i) > 1\},$$
and define $\bar K_2$ as the set of indices $i\in K_2$ with $|i| = 7$ and either $i_i = i_j'$ or $i_i = s+1-i_j'$ for some $i, j\in[4]$. As in App. E.2.1, we have
$$|K_2| \le 32s^7, \quad |(K_1\cup K_2)^c| \le 576s^6, \quad |K_2\setminus\bar K_2| = 384s^6.$$
Next, we define $\tilde K_2$ as the set of $i\in K_2$ such that, for exactly three indices $i, j, k$ of the eight indices, the pairs $\{i, s+1-i\}$, $\{j, s+1-j\}$, and $\{k, s+1-k\}$ are equal, and the rest of such pairs are all different. We then have
$$|\tilde K_2| = 4\cdot 4\cdot 6\cdot 2^2\cdot s(s-2)(s-4)(s-6)(s-8)(s-10) = 384\,s(s-2)(s-4)(s-6)(s-8)(s-10) \quad\text{and}$$
$$|K_2\setminus(\bar K_2\cup\tilde K_2)| = |K_2\setminus\bar K_2| - |\tilde K_2| \le 384\big(s^6 - s(s-2)(s-4)(s-6)(s-8)(s-10)\big) \le 384\cdot 5\cdot 6\,s^5 = 11520s^5.$$
We next define $K_3$ as the set of $i\in(K_1\cup K_2)^c$ such that, for exactly four indices $i, j, k, r$ of the eight indices, we have $\{i, s+1-i\} = \{j, s+1-j\}$ and $\{k, s+1-k\} = \{r, s+1-r\}$, and the rest of such pairs are all distinct. Note that
$$|K_3| = 4\cdot 3\cdot 2\cdot 4\cdot 3\cdot 2\,s(s-2)(s-4)(s-6)(s-8)(s-10) = 576\,s(s-2)(s-4)(s-6)(s-8)(s-10) \quad\text{and}$$
$$|(K_1\cup K_2\cup K_3)^c| = |(K_1\cup K_2)^c| - |K_3| \le 576\big(s^6 - s(s-2)(s-4)(s-6)(s-8)(s-10)\big) \le 576\cdot 5\cdot 6\,s^5 = 17280s^5.$$



Finally, for $K \triangleq (k_1, k_2, k_3, k_4, k_1', k_2', k_3', k_4')$ and $s(K) \triangleq |\{k_1, k_2, k_3, k_4\} \cap \{k_1', k_2', k_3', k_4'\}|$, we define the index sets
$$L_{\mathrm{total}} = [m]^8, \quad L_1 = \{K\in L_{\mathrm{total}} : s(K) = 0\}, \quad L_2 = \{K\in L_{\mathrm{total}} : s(K) = 1\}, \quad\text{and}\quad (L_1\cup L_2)^c = \{K\in L_{\mathrm{total}} : s(K) > 1\},$$
and note that
$$|L_2| \le m^4\cdot 4\cdot 4\cdot m^3 = 16m^7 \quad\text{and}\quad |(L_1\cup L_2)^c| \le m^4\cdot 4\cdot 3\cdot 4\cdot 3\cdot m^2 = 144m^6.$$
With this notation in hand, we use Cauchy-Schwarz to write
$$\mathbb{E}[\mathrm{Var}(V_n^{\pi,s} \mid X)] = \mathbb{E}\big[\mathbb{E}[(V_n^{\pi,s})^2 \mid X] - (\mathbb{E}[V_n^{\pi,s} \mid X])^2\big] \le \mathbb{E}[(V_n^{\pi,s})^2] - (\mathbb{E}[V_n^{\pi,s}])^2$$
$$= \tfrac{1}{n^8}\sum_{i\in K_{\mathrm{total}}}\sum_{K\in L_{\mathrm{total}}}\Big(\mathbb{E}\big[h_{\mathrm{in},Y}(Y^{(i_1)}_{k_1}, Y^{(i_2)}_{k_2}, Y^{(i_3)}_{k_3}, Y^{(i_4)}_{k_4})\,h_{\mathrm{in},Y}(Y^{(i_1')}_{k_1'}, Y^{(i_2')}_{k_2'}, Y^{(i_3')}_{k_3'}, Y^{(i_4')}_{k_4'})\times h_{\mathrm{in},Z}(Z^{(\pi_{i_1})}_{k_1}, Z^{(\pi_{i_2})}_{k_2}, Z^{(\pi_{i_3})}_{k_3}, Z^{(\pi_{i_4})}_{k_4})\,h_{\mathrm{in},Z}(Z^{(\pi_{i_1'})}_{k_1'}, Z^{(\pi_{i_2'})}_{k_2'}, Z^{(\pi_{i_3'})}_{k_3'}, Z^{(\pi_{i_4'})}_{k_4'})\big]$$
$$\qquad - \mathbb{E}\big[h_{\mathrm{in},Y}(Y^{(i_1)}_{k_1}, Y^{(i_2)}_{k_2}, Y^{(i_3)}_{k_3}, Y^{(i_4)}_{k_4})\,h_{\mathrm{in},Z}(Z^{(\pi_{i_1})}_{k_1}, Z^{(\pi_{i_2})}_{k_2}, Z^{(\pi_{i_3})}_{k_3}, Z^{(\pi_{i_4})}_{k_4})\big]\,\mathbb{E}\big[h_{\mathrm{in},Y}(Y^{(i_1')}_{k_1'}, Y^{(i_2')}_{k_2'}, Y^{(i_3')}_{k_3'}, Y^{(i_4')}_{k_4'})\,h_{\mathrm{in},Z}(Z^{(\pi_{i_1'})}_{k_1'}, Z^{(\pi_{i_2'})}_{k_2'}, Z^{(\pi_{i_3'})}_{k_3'}, Z^{(\pi_{i_4'})}_{k_4'})\big]\Big)$$
$$= (I') + (II') + (III'),$$
where $(I')$ is the summation over $i\in K_1$, $(II')$ is the summation over $i\in K_2$, and $(III')$ is the summation over $i\in(K_1\cup K_2)^c$. Note that $(I') = 0$ by the same argument that we used to show that $(I) = 0$ in (64).
To bound $(II')$, we will further divide the corresponding summands into subcases:
• $i\in\bar K_2$: Suppose without loss of generality that $i_1 = i_1'$. The only non-zero summands in the sum over $K$ have $k_1 = k_1'$. There are $m^7$ such summands, and each contributes at most $\tilde\psi_1'$ to the sum.
• $i\in\tilde K_2$: Consider the three indices of $i$ that satisfy the defining $\tilde K_2$ property. In the summation over $K\in L_{\mathrm{total}}$, a summand contributes at most $\tilde\psi_1'$ when exactly two of the three $i$ indices share the same $k$-index, and there are at most $3m^7$ such summands. The only other non-zero summands contribute at most $\psi_2'$ when all three $i$ indices share the same $k$-index, and there are at most $m^6$ such summands.
• $i\in K_2\setminus(\bar K_2\cup\tilde K_2)$: Considering the summation over $K\in L_{\mathrm{total}}$, there will be a contribution bounded by $\tilde\psi_1'$ when $K\in L_2$, and there are at most $16m^7$ such summands. The only other non-zero contributions are bounded by $\psi_2'$ and occur when $K\in(L_1\cup L_2)^c$, so the number of such summands is bounded by $144m^6$.
Thus, we have that $(II')$ is upper-bounded by
$$\tilde\psi_1'\tfrac{|\bar K_2|m^7}{n^8} + \tilde\psi_1'\tfrac{|\tilde K_2|\cdot 3m^7}{n^8} + \tilde\psi_1'\tfrac{|K_2\setminus(\bar K_2\cup\tilde K_2)|\cdot 16m^7}{n^8} + \psi_2'\tfrac{|\tilde K_2|\cdot m^6}{n^8} + \psi_2'\tfrac{|K_2\setminus(\bar K_2\cup\tilde K_2)|\cdot|(L_1\cup L_2)^c|}{n^8} \le \tilde\psi_1'\tfrac{32s^7m^7 + 384s^6\cdot 3m^7 + 11520s^5\cdot 16m^7}{n^8} + \psi_2'\tfrac{384s^6m^6 + 11520s^5\cdot 144m^6}{n^8}.$$
Similarly, to bound (III ′ ), we divide the corresponding summands into subcases:
• i ∈ K3 : Consider the two pairs of indices of i satisfying the defining property of K3 . When
each pair respectively shares the same k-index, the summand contributes at most ψ2′ , and
there are at most m⁶ such summands. The only other non-zero summands have exactly
one of the two pairs sharing a k-index; these summands contribute at most ψ̃1′ , and there
are at most 2m⁷ such summands.
CHEAP PERMUTATION TESTING 47

• i ∈ (K1 ∪ K2 ∪ K3 )c : In the summation over K ∈ Ltotal , a summand will contribute at most
ψ̃1′ when K ∈ L2 , and there are at most 16m⁷ such summands. The only other non-zero
summands are bounded by ψ2′ and occur when K ∈ (L1 ∪ L2 )c , so there are at most 144m⁶
such summands.
Thus, (III′) is upper-bounded by

ψ̃1′ ( |K3| · 2m⁷/n⁸ + |(K1 ∪ K2 ∪ K3)c| · |L2| /n⁸ ) + ψ2′ ( |K3| · m⁶/n⁸ + |(K1 ∪ K2 ∪ K3)c| · |(L1 ∪ L2)c| /n⁸ )

≤ ψ̃1′ ( 576s⁶ · 2m⁷/n⁸ + 17280s⁵ · 16m⁷/n⁸ ) + ψ2′ ( 576s⁶ · m⁶/n⁸ + 17280s⁵ · 144m⁶/n⁸ ).

We conclude that

Var(Vnπ,s) ≤ ψ̃1′ ( 32/n + 2304/(ns) + 460800/(ns²) ) + ψ2′ ( 960/n² + 4147200/(n²s) ).
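The aggregation above is pure bookkeeping in the summand counts. The following Python snippet (an illustrative verification sketch, not part of the proof) substitutes the binning relation n = sm into the (II′) and (III′) bounds and confirms that the coefficients 32/n, 2304/(ns), 460800/(ns²), 960/n², and 4147200/(n²s) are reproduced exactly:

```python
from fractions import Fraction

# combine the (II') and (III') bounds, assuming n = s*m (s bins of size m)
def combined(s, m):
    n = s * m
    psi1_terms = (32 * s**7 * m**7 + 384 * 3 * s**6 * m**7 + 11520 * 16 * s**5 * m**7
                  + 576 * 2 * s**6 * m**7 + 17280 * 16 * s**5 * m**7)
    psi2_terms = (384 * s**6 * m**6 + 11520 * 144 * s**5 * m**6
                  + 576 * s**6 * m**6 + 17280 * 144 * s**5 * m**6)
    return Fraction(psi1_terms, n**8), Fraction(psi2_terms, n**8)

for s, m in [(4, 3), (10, 7), (6, 1)]:
    n = s * m
    t1, t2 = combined(s, m)
    assert t1 == Fraction(32, n) + Fraction(2304, n * s) + Fraction(460800, n * s**2)
    assert t2 == Fraction(960, n**2) + Fraction(4147200, n**2 * s)
```

Because the identity is algebraic in s and m, any positive integer pair would serve equally well here.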
Finally, by Jensen’s inequality,

Var(E[Vnπ,s | X]) = E[(E[Vnπ,s | X])²] − (E[E[Vnπ,s | X]])²
≤ E[E[(Vnπ,s)² | X]] − (E[E[Vnπ,s | X]])² = Var(Vnπ,s).

E.3. Proof of Lem. E.3: Mean of permuted independence V-statistics.

E.3.1. Proof of (60). Define the sets of indices

      Jtotal ≜ [n]⁴,
(65)  J1 ≜ {(i1 , i2 , i3 , i4 ) ∈ Jtotal : |{i1 , i2 , i3 , i4 , swap(i1 ), swap(i2 ), swap(i3 ), swap(i4 )}| = 8},
      J2 ≜ {(i1 , i2 , i3 , i4 ) ∈ Jtotal : (i1 = i4 ) ∨ (i2 = i3 )},

and note that |J1 | = n(n − 2)(n − 4)(n − 6), |J2 | = 2n³, and

|(J1 ∪ J2 )c | = n⁴ − (n⁴ − 12n³ + 44n² − 48n) − 2n³ = 10n³ − 44n² + 48n.
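The cardinality |J1| can be verified by brute force for small even n. Below is a small sanity-check script, assuming the halved-index pairing swap(i) = i + n/2 − n·I{i > n/2} that is recalled later in this appendix:

```python
from itertools import product

def swap(i, n):
    # swap(i) = i + n/2 - n * 1{i > n/2}: pairs index i with its partner in the other half
    return i + n // 2 if i <= n // 2 else i - n // 2

def count_J1(n):
    # count 4-tuples whose entries and swapped entries are 8 distinct indices
    count = 0
    for t in product(range(1, n + 1), repeat=4):
        if len({*t, *(swap(i, n) for i in t)}) == 8:
            count += 1
    return count

for n in (8, 10):
    assert count_J1(n) == n * (n - 2) * (n - 4) * (n - 6)
```

Each ij must come from a distinct pair {j, swap(j)}, which is exactly why the count factors as n(n − 2)(n − 4)(n − 6).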
Recalling the definitions (31) of hin,Y and hin,Z , we have

E[Vnπ,n ] = (1/n⁴) Σ_{(i1 ,...,i4 )∈[n]⁴} E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) ]
         = (1/n⁴) Σ_{(i1 ,...,i4 )∈[n]⁴} E[ E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π ] ]
         ≤ (n(n − 2)(n − 4)(n − 6)/n⁴) E[Uπ ] + ((10n³ − 44n² + 48n)/n⁴) ξ̃ ≤ E[Uπ ] + ((10n² − 44n + 48)/n³) ξ̃   for

Uπ ≜ E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) | π ].

The penultimate inequality follows from the following facts:

• For (i1 , . . . , i4 ) ∈ J1 , E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π ] = Uπ .
• For (i1 , . . . , i4 ) ∈ J2 , E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π ] = 0.
• For (i1 , . . . , i4 ) ∈ (J1 ∪ J2 )c , | E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π ] | ≤ ξ̃,
  where ξ̃ is defined in (15).

Finally, equation (68) of the following lemma shows that E[Uπ ] = (1/4) E[U ], completing the proof.

L EMMA E.5 (Range and expectation of Uπ ). Instantiate the assumptions of Thm. 3, fix
any i1 , i2 , i3 , i4 ∈ [n] satisfying

|{i1 , i2 , i3 , i4 , swap(i1 ), swap(i2 ), swap(i3 ), swap(i4 )}| = 8,

and let π be any permutation on [n] with

(πij , πswap(ij ) ) ∼ Unif({(ij , swap(ij )), (swap(ij ), ij )}) for each j ∈ [4].

Then we always have

(66)  Uπ ≜ E[ hin,Y (Yi1 , Yi2 , Yi3 , Yi4 ) hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π ] ∈ E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] · {0, 1/4, 1/2, 1}

with

(67)  E[Uπ ] ≤ (1/2) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].

If (πi1 , πi2 , πi3 , πi4 ) are independent, then

(68)  E[Uπ ] = (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].

If πi1 = i1 ⇔ πi4 = i4 with independent (πi1 , πi2 , πi3 ) or πi2 = i2 ⇔ πi3 = i3 with independent (πi1 , πi2 , πi4 ), then

(69)  E[Uπ ] = (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].

If πi1 = i1 ⇔ πi2 = i2 with independent (πi1 , πi3 , πi4 ); πi1 = i1 ⇔ πi3 = i3 with independent
(πi1 , πi2 , πi4 ); πi2 = i2 ⇔ πi4 = i4 with independent (πi1 , πi2 , πi3 ); or πi3 = i3 ⇔ πi4 = i4
with independent (πi1 , πi2 , πi3 ), then

E[Uπ ] = (3/8) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].


P ROOF. We will use the shorthand

Pi,j ≜ Π if i = j and Pi,j ≜ P × Q otherwise,

and the notation (P(i) , Q(i) ) to denote copies of (P, Q) while keeping track of indices to make
the derivation easier to follow. We begin by proving (66). Uπ may be rewritten as

Uπ = ∬ gY (y, y′) gZ (z, z′) d(Pi1 ,πi1 + Pi4 ,πi4 − P × Q(πi4 ) − P × Q(πi1 ) )(y, z)
                            d(Pi2 ,πi2 + Pi3 ,πi3 − P × Q(πi3 ) − P × Q(πi2 ) )(y′, z′).

Thus, if we define the Unif({0, 1}) random variables ϵik = I(πik = ik ), we can write

Uπ = ∬ gY (y, y′) gZ (z, z′) d(ϵi1 (Π − P × Q) + ϵi4 (Π − P × Q))(y, z)
                            d(ϵi2 (Π − P × Q) + ϵi3 (Π − P × Q))(y′, z′)
   = (ϵi1 + ϵi4 )(ϵi2 + ϵi3 ) ∬ gY (y, y′) gZ (z, z′) d(Π − P × Q)(y, z) d(Π − P × Q)(y′, z′)
   = (1/4)(ϵi1 + ϵi4 )(ϵi2 + ϵi3 ) E[hin (X1 , X2 , X3 , X4 )].

Setting ϵi1 , ϵi2 , ϵi3 , ϵi4 to {0, 1} implies that all the possible values Uπ can take are
{0, 1/4, 1/2, 1} · E[hin (X1 , X2 , X3 , X4 )].
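The representation Uπ = (1/4)(ϵi1 + ϵi4)(ϵi2 + ϵi3) E[hin(X1, X2, X3, X4)] reduces claims (66)–(68) to finite enumerations over the four bits. Here is a quick Python check (illustrative only) of the value set, of the independent-bit mean in (68), and of the fully coupled configuration attaining the bound in (67):

```python
from itertools import product
from fractions import Fraction

def u(e1, e2, e3, e4):
    # U_pi / E[h_in] = (1/4)(eps_{i1} + eps_{i4})(eps_{i2} + eps_{i3})
    return Fraction(1, 4) * (e1 + e4) * (e2 + e3)

# claim (66): the only attainable values are {0, 1/4, 1/2, 1} times E[h_in]
vals = {u(*e) for e in product((0, 1), repeat=4)}
assert vals == {0, Fraction(1, 4), Fraction(1, 2), 1}

# claim (68): with four independent Unif({0,1}) bits, E[U_pi]/E[h_in] = 1/4
mean_indep = sum(u(*e) for e in product((0, 1), repeat=4)) / 16
assert mean_indep == Fraction(1, 4)

# claim (67): the fully coupled case eps_1 = eps_2 = eps_3 = eps_4 attains the bound 1/2
mean_coupled = (u(0, 0, 0, 0) + u(1, 1, 1, 1)) / 2
assert mean_coupled == Fraction(1, 2)
```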

To show (67), we will prove that for any joint distribution over the Unif({0, 1}) variables
ϵi1 , ϵi2 , ϵi3 , ϵi4 , we have E[(ϵi1 + ϵi4 )(ϵi2 + ϵi3 )] ≤ 2. Indeed, by Cauchy-Schwarz,

E[(ϵi1 + ϵi4 )(ϵi2 + ϵi3 )] ≤ E[(ϵi1 + ϵi4 )²]^{1/2} E[(ϵi2 + ϵi3 )²]^{1/2},

and E[(ϵi1 + ϵi4 )²] is maximized when ϵi1 = ϵi4 almost surely, with maximum value 0 × (1/2) +
4 × (1/2) = 2. Thus, E[(ϵi1 + ϵi4 )(ϵi2 + ϵi3 )] has maximum value 2.

To show (68), we compute the probability that Uπ takes each of its possible values when
(π1 , . . . , π4 ) are independent:

E[Uπ ] = (1/2⁴) [ 0 × 7 + (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 4 + (1/2) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 4 + E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 1 ]
       = (1/16) ( E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] + 2 E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] + E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] )
       = (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].

Similarly, when πi1 = i1 ⇔ πi4 = i4 and (πi1 , πi2 , πi3 ) are independent,

E[Uπ ] = (1/2³) [ 0 × 4 + 0 × 1 + (1/2) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2 + E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 1 ]
       = (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].

Alternatively, when πi1 = i1 ⇔ πi2 = i2 and (πi1 , πi3 , πi4 ) are independent,

E[Uπ ] = (1/2³) [ 0 × 3 + (1/4) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2 + (1/2) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2 + E[hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 1 ]
       = (3/8) E[hin (Xi1 , Xi2 , Xi3 , Xi4 )].
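The couplings appearing in Lem. E.5 are exactly those induced by a wild bootstrap permutation, which independently transposes each pair (i, swap(i)) with probability 1/2. A minimal Python sketch of this construction follows (the helper names are illustrative; Alg. 2 in the main text is the authoritative definition):

```python
import random

def swap(i, n):
    # swap(i) = i + n/2 - n * 1{i > n/2}: partner index in the opposite half
    return i + n // 2 if i <= n // 2 else i - n // 2

def wild_bootstrap_permutation(s, rng=random):
    # independently transpose each pair (i, swap_s(i)), i = 1, ..., s/2, with prob. 1/2
    pi = list(range(s + 1))          # pi[0] is unused; indices are 1-based as in the text
    for i in range(1, s // 2 + 1):
        if rng.random() < 0.5:
            j = swap(i, s)
            pi[i], pi[j] = pi[j], pi[i]
    return pi[1:]                    # pi[i-1] is the image of index i
```

By construction, πi = i holds exactly when πswap(i) = swap(i), which is the pairwise coupling exploited throughout the proof above.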
  

E.3.2. Proof of (61). In the cheap permutation case, the permutation values πi , πj are
not independent when i and j belong to the same bin, which requires a finer analysis. In the
following, for 1 ≤ i ≤ s, we let swaps (i) ≜ i + s/2 − s·I(i > s/2) and recall that swap(i) ≜
i + n/2 − n·I(i > n/2). We have that


E[Vnπ,s ] = (1/n⁴) Σ_{(i1 ,...,i4 )∈[s]⁴} Σ_{(j1 ,...,j4 )∈ ∏_{k=1}^4 Iik } E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ]
         = (I) + (II) + (III) + (IV ) + (V ),


where

• (I) includes the terms for which the bin indices i1 , . . . , i4 , swaps (i1 ), . . . , swaps (i4 )
are distinct. For such terms, E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ] =
(1/4) E[hin (X1 , X2 , X3 , X4 )] = (1/4) U by equation (68) of Lem. E.5. The number of terms in
(I) is s(s − 2)(s − 4)(s − 6)m⁴.
• (II) includes the terms for which there are exactly two pairs of repeated bin indices
among i1 , . . . , i4 , swaps (i1 ), . . . , swaps (i4 ), and the two pairs are one of the follow-
ing: {(i1 , i4 ), (swaps (i1 ), swaps (i4 ))}, {(i1 , swaps (i4 )), (swaps (i1 ), i4 )} (so that πi1 =
i1 ⇔ πi4 = i4 ), or {(i2 , i3 ), (swaps (i2 ), swaps (i3 ))}, {(i2 , swaps (i3 )), (swaps (i2 ), i3 )}
(so that πi2 = i2 ⇔ πi3 = i3 ). For all such terms, equation (69) of Lem. E.5 implies
E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ] = (1/4) U when j1 ≠ j4 (respectively,
j2 ≠ j3 ), and E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ] = 0 when j1 = j4
(respectively, j2 = j3 ), since hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) = 0 in that case. The number of terms in
(II) is 4s(s − 2)(s − 4)m⁴.
• (III) includes the terms for which all indices j1 , . . . , j4 , swap(j1 ), . . . , swap(j4 ) are
different and for which there are exactly two pairs of repeated bin indices among
i1 , . . . , i4 , swaps (i1 ), . . . , swaps (i4 ), and these two pairs are one of the following:
{(i1 , i2 ), (swaps (i1 ), swaps (i2 ))}, {(i1 , swaps (i2 )), (swaps (i1 ), i2 )} (so that πi1 =
i1 ⇔ πi2 = i2 ); {(i1 , i3 ), (swaps (i1 ), swaps (i3 ))}, {(i1 , swaps (i3 )), (swaps (i1 ), i3 )}
(both of which imply that πi1 = i1 ⇔ πi3 = i3 ); {(i2 , i4 ), (swaps (i2 ), swaps (i4 ))},
{(i2 , swaps (i4 )), (swaps (i2 ), i4 )} (both of which imply that πi2 = i2 ⇔ πi4 = i4 ); or
{(i3 , i4 ), (swaps (i3 ), swaps (i4 ))}, {(i3 , swaps (i4 )), (swaps (i3 ), i4 )} (both of which
imply that πi3 = i3 ⇔ πi4 = i4 ). By equation (69) of Lem. E.5, for all such terms we
can write E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ] = (3/8) U . The number
of terms in (III) is 8s(s − 2)(s − 4)m³(m − 1) ≤ 8s(s − 2)(s − 4)m⁴.
• (IV ) includes the terms for which all indices j1 , . . . , j4 , swap(j1 ), . . . , swap(j4 ) are dif-
ferent and which do not belong to either (I), (II), or (III). For such terms, we have that
E[ hin,Y (Yj1 , Yj2 , Yj3 , Yj4 ) hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ] ≤ (1/2) U by equation (67).
• (V ) includes all terms for which (j1 , . . . , j4 ) belongs to (J1 ∪ J2 )c , where J1 and J2 are
defined in (65). The total contribution of such terms can be upper-bounded by ((4n² − 11n + 6)/n³) ξ̃
by the same argument as in the standard permutation case.
We conclude that

E[Vnπ,s ] ≤ (s(s − 2)(s − 4)(s − 6)/s⁴) × (1/4) U      [from (I)]
          + (4s(s − 2)(s − 4)/s⁴) × (1/4) U            [from (II)]
          + (8s(s − 2)(s − 4)/s⁴) × (3/8) U            [from (III)]
          + ((s⁴ − s(s − 2)(s − 4)(s − 6) − 12s(s − 2)(s − 4))/s⁴) × (1/2) U   [from (IV)]
          + ((10n² − 44n + 48)/n³) ξ̃                   [from (V)]

        = (1/4)(1 + (s² + s − 4)/s³) U + ((10n² − 44n + 48)/n³) ξ̃.

APPENDIX F: Proof of Cor. 2: Power of cheap independence: finite variance, PD


To establish the standard testing claim, we will show that (18) implies the sufficient power
condition (62) in the proof of Thm. 3. Our proof makes use of three lemmas, proved in
Apps. F.1, F.2, and F.3 respectively, that relate parameters of (62) to MMDk (Π, P × Q).

L EMMA F.1 (E[Vn ] bound). Under the assumptions of Cor. 2,


E[Vn ] ≥ 4MMD2k (Π, P × Q) = U.

L EMMA F.2 (ψ1′ bound). Under the assumptions of Cor. 2,

ψ1′ ≤ 4 MMD²k (Π, P × Q) ξ

for ψ1′ as defined in equation (14) and ξ as defined in Cor. 2.

L EMMA F.3 (ψ̃1′ bound). Under the assumptions of Cor. 2,

ψ̃1′ ≤ 4 ξ̃wild MMD²k (Π, P × Q) + MMD⁴k (Π, P × Q)

for ψ̃1′ as defined in equation (14) and ξ̃wild as defined in Cor. 2.

Since E[Vn ] ≥ 4 MMD²k (Π, P × Q) = U by Lem. F.1, equation (62) holds whenever

3 MMD²k (Π, P × Q) ≥ (√(3/β) − 1) √( 16ψ1′ /n + 144ψ2′ /n² ) + √( 12/(α⋆ β) − 2 ) √( 32ψ̃1′ /n + 960ψ2′ /n² ) + (10/n) ξ̃.

Next, by Lems. F.2 and F.3, we can upper-bound the right-hand side of this expression by

(√(3/β) − 1) √( 64 MMD²k (Π, P × Q) ξ/n + 144ψ2′ /n² )
  + √( 12/(α⋆ β) − 2 ) √( (128 ξ̃wild MMD²k (Π, P × Q) + 32 MMD⁴k (Π, P × Q))/n + 960ψ2′ /n² ) + (10/n) ξ̃

≤ √( 12/(α⋆ β) − 2 ) √( (256 ξ̃wild + 32ξ) MMD²k (Π, P × Q)/n ) + √( 12/(α⋆ β) − 2 ) √( 32 MMD⁴k (Π, P × Q)/n )
  + √( 12/(α⋆ β) − 2 ) √( 960ψ2′ /n² ) + (10/n) ξ̃,

where we used that 3/β − 1 ≤ (1/4)(12/(α⋆ β) − 2). Hence, for (62) to hold, it suffices to have

ax² − bx − c ≥ 0,

with x = MMDk (Π, P × Q),   a = 3 − √( (12/(α⋆ β) − 2) · 32/n ),

b = √( (12/(α⋆ β) − 2)(256 ξ̃wild + 32ξ)/n ),   c = √( (12/(α⋆ β) − 2) · 960ψ2′ /n² ) + (10/n) ξ̃.

Equation (62) therefore holds when x ≥ b/a + √(c/a) ≥ (b + √(b² + 4ac))/(2a), concluding the proof of (18).
To establish the cheap testing claim, we will show that (19) implies the sufficient power
condition (63) in the proof of Thm. 3. Since E[Vn ] ≥ U by Lem. F.1, (63) holds when

(3 − (s + 1)/s²) MMD²k (Π, P × Q) ≥ (√(3/β) − 1) √( 16ψ1′ /n + 144ψ2′ /n² )
  + √( 12/(α⋆ β) − 2 ) √( ψ̃1′ (32/n + 2304/(ns) + 460800/(ns²)) + ψ2′ (960/n² + 4147200/(n²s)) ) + (10/n) ξ̃.

Next, by Lems. F.2 and F.3, we can upper-bound the right-hand side of this expression by

(√(3/β) − 1) √( 64 MMD²k (Π, P × Q) ξ/n + 144ψ2′ /n² ) + (10/n) ξ̃
  + √( 12/(α⋆ β) − 2 ) ( (4 ξ̃wild MMD²k (Π, P × Q) + MMD⁴k (Π, P × Q)) (32/n + 2304/(ns) + 460800/(ns²)) )^{1/2}
  + √( 12/(α⋆ β) − 2 ) ( ψ2′ (960/n² + 4147200/(n²s)) )^{1/2}

≤ (3 − (s + 1)/s² − a) MMD²k (Π, P × Q) + b MMDk (Π, P × Q) + c   for

a ≜ 3 − (s + 1)/s² − √( (12/(α⋆ β) − 2)(32/n + 2304/(ns) + 460800/(ns²)) ),
b ≜ √( (12/(α⋆ β) − 2)( ξ̃wild (256/n + 18432/(ns) + 3686400/(ns²)) + 32ξ/n ) ),   and
c ≜ √( (12/(α⋆ β) − 2) ψ2′ (960/n² + 4147200/(n²s)) ) + (10/n) ξ̃.
The sufficiency of (19) now follows from the sufficiency of the quadratic inequality ax2 −
bx − c ≥ 0 with x = MMDk (Π, P × Q), exactly as in the standard case, and the triangle
inequality.

F.1. Proof of Lem. F.1: E[Vn ] bound. Let (Π̂′, P̂′, Q̂′) represent an independent copy
of (Π̂, P̂, Q̂). Then, two applications of Cauchy-Schwarz and [42, App. B] imply that

MMD²k (Π, P × Q) = E[ ∬ k(x, x′) d(Π̂ − P̂ × Q̂)(x) d(Π̂′ − P̂′ × Q̂′)(x′) ]
≤ E[ ∥∫ k(x, ·) d(Π̂ − P̂ × Q̂)(x)∥k ∥∫ k(x, ·) d(Π̂′ − P̂′ × Q̂′)(x)∥k ]
≤ E[ ∥∫ k(x, ·) d(Π̂ − P̂ × Q̂)(x)∥²k ] = (1/4) E[Vn ].

Moreover, we have

U = E[hin (X1 , X2 , X3 , X4 )]
  = E[ ( gY (Y1 , Y2 ) + gY (Y3 , Y4 ) − gY (Y1 , Y3 ) − gY (Y2 , Y4 ) )
     × ( gZ (Z1 , Z2 ) + gZ (Z3 , Z4 ) − gZ (Z1 , Z3 ) − gZ (Z2 , Z4 ) ) ]
  = 4(Π × Π)(gY gZ ) + 4((P × Q) × (P × Q))(gY gZ )
  − 4(Π × (P × Q))(gY gZ ) − 4((P × Q) × Π)(gY gZ ) = 4 MMD²k (Π, P × Q).
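The identity U = 4MMD²k(Π, P × Q) can be confirmed exactly on a small finite example by enumerating the fourfold expectation. In the sketch below, the joint distribution Π on {0,1}², the kernels gY and gZ, and the product kernel k = gY ⊗ gZ are arbitrary illustrative choices, not quantities from the paper:

```python
from fractions import Fraction
from itertools import product

# illustrative joint distribution Pi on {0,1}^2 with marginals P (for Y) and Q (for Z)
Pi = {(0, 0): Fraction(2, 5), (0, 1): Fraction(1, 5),
      (1, 0): Fraction(1, 10), (1, 1): Fraction(3, 10)}
P = {y: sum(p for (yy, _), p in Pi.items() if yy == y) for y in (0, 1)}
Q = {z: sum(p for (_, zz), p in Pi.items() if zz == z) for z in (0, 1)}
gY = lambda a, b: 1 + 2 * a * b          # simple PSD kernels on {0,1}
gZ = lambda a, b: 1 + 3 * a * b
k = lambda x, xp: gY(x[0], xp[0]) * gZ(x[1], xp[1])

PQ = {(y, z): P[y] * Q[z] for y, z in product((0, 1), repeat=2)}
pts = list(Pi)
dd = lambda mu, nu: sum(mu[x] * nu[xp] * k(x, xp) for x in pts for xp in pts)
mmd2 = dd(Pi, Pi) - 2 * dd(Pi, PQ) + dd(PQ, PQ)   # MMD_k^2(Pi, P x Q)

# E[h_in(X1, X2, X3, X4)] with X1, ..., X4 drawn i.i.d. from Pi
U = sum(Pi[x1] * Pi[x2] * Pi[x3] * Pi[x4]
        * (gY(x1[0], x2[0]) + gY(x3[0], x4[0]) - gY(x1[0], x3[0]) - gY(x2[0], x4[0]))
        * (gZ(x1[1], x2[1]) + gZ(x3[1], x4[1]) - gZ(x1[1], x3[1]) - gZ(x2[1], x4[1]))
        for x1 in pts for x2 in pts for x3 in pts for x4 in pts)
assert U == 4 * mmd2
```

Exact rational arithmetic via `Fraction` makes the check an identity rather than a floating-point approximation.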

F.2. Proof of Lem. F.2: ψ1′ bound. By Jensen’s inequality and Cauchy-Schwarz,

ψ1′ = Var(E[hin (X1 , X2 , X3 , X4 ) | X1 ])
    ≤ (1/4!) Σ_{(i1 ,i2 ,i3 ,i4 ) orderings of (1,2,3,4)} Var(E[hin (Xi1 , Xi2 , Xi3 , Xi4 ) | X1 ])
    = Var(E[hin (X1 , X2 , X3 , X4 ) | X1 ]) ≤ E[(E[hin (X1 , X2 , X3 , X4 ) | X1 ])²]
    = E[ ( 2 ∬ k(x, x′) d(δX1 + Π − δY1 × Q − P × δZ1 )(x) d(Π − P × Q)(x′) )² ]
    ≤ 4 E[ ∥∫ k(x, ·) d(δX1 + Π − δY1 × Q − P × δZ1 )(x)∥²k ] ∥∫ k(x, ·) d(Π − P × Q)(x)∥²k .

The advertised result now follows from the observation that

∥∫ k(x, ·) d(Π − P × Q)(x)∥k = MMDk (Π, P × Q)

and that

E[ ∥∫ k(x, ·) d(δX1 + Π − δY1 × Q − P × δZ1 )(x)∥²k ]
= E[ ∬ k(x, x′) d(δX1 + Π − δY1 × Q − P × δZ1 )(x) d(δX1 + Π − δY1 × Q − P × δZ1 )(x′) ]
= E[E[hin (X1 , X2 , X3 , X4 ) | X1 ]] = ξ.

F.3. Proof of Lem. F.3: ψ̃1′ bound. By the definition (14) of ψ̃1′ , we can write

ψ̃1′ = E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )
        × hin,Y (Y1 , Y5 , Y6 , Y7 ) hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) ]
    − E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) ]
        × E[ hin,Y (Y1 , Y5 , Y6 , Y7 ) hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) ].
The result now follows immediately from the following lemma.

L EMMA F.4 (Permuted hin moments). Instantiate the assumptions of Cor. 2, and suppose
π is an independent wild bootstrap permutation of [n]. Then

(70)  E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) ] = MMD²k (Π, P × Q).

If, in addition, n ≥ 14, then

E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Y (Y1 , Y5 , Y6 , Y7 )
 × hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) ] ≤ 4 ξ̃wild MMD²k (Π, P × Q).

P ROOF. Since n ≥ 8, (π1 , π2 , π3 , π4 ) are independent. Hence, equation (70) follows from
equation (68) of Lem. E.5 and Lem. F.1.

Now suppose that n ≥ 14, so that |[7] ∪ {swap(i) : i ∈ [7]}| = 14, and define the
Unif({0, 1}) variables ϵi ≜ I(i = πi ) for i ∈ [7] and µi ≜ ϵi Π + (1 − ϵi )P × Q. We have

E[ hin,Y (Y1 , Y2 , Y3 , Y4 ) hin,Y (Y1 , Y5 , Y6 , Y7 )
 × hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) | π, Y1 , Zπ1 ]

= ⨌ k((y, z), (y′, z′)) k((y′′, z′′), (y′′′, z′′′))
     d(δ(Y1 ,Zπ1 ) + µ4 − δY1 × Q − P × δZπ1 )(y, z)
     d(µ2 + µ3 − P × Q − P × Q)(y′, z′)
     d(δ(Y1 ,Zπ1 ) + µ7 − δY1 × Q − P × δZπ1 )(y′′, z′′)
     d(µ5 + µ6 − P × Q − P × Q)(y′′′, z′′′)

≤ ∥ ∬ k((y, z), ·) k((y′′, z′′), ·) d(δ(Y1 ,Zπ1 ) + µ4 − δY1 × Q − P × δZπ1 )(y, z)
                                   d(δ(Y1 ,Zπ1 ) + µ7 − δY1 × Q − P × δZπ1 )(y′′, z′′) ∥k⊗k
× ∥ ∬ k((y′, z′), ·) k((y′′′, z′′′), ·) d(µ2 + µ3 − P × Q − P × Q)(y′, z′)
                                       d(µ5 + µ6 − P × Q − P × Q)(y′′′, z′′′) ∥k⊗k

= 4 MMDk ( (δ(Y1 ,Zπ1 ) + µ4 )/2 , (δY1 × Q + P × δZπ1 )/2 ) MMDk ( (δ(Y1 ,Zπ1 ) + µ7 )/2 , (δY1 × Q + P × δZπ1 )/2 )
  × (ϵ2 + ϵ3 )(ϵ5 + ϵ6 ) MMD²k (Π, P × Q)

by Cauchy-Schwarz. Here, ∥ · ∥k⊗k denotes the RKHS norm of the tensor product RKHS
Hk ⊗ Hk . The result now follows by taking expectations and applying Cauchy-Schwarz.

APPENDIX G: Proof of Thm. 4: Power of cheap independence: bounded, PSD

We will establish the claim using the quantile comparison method (Prop. 2) by identifying
complementary quantile bounds (CQBs) Φn , ΨX , Ψn satisfying the tail bounds (21), (22),
and (23) for the test statistic Vn , auxiliary sequence (Vnπb ,s )_{b=1}^B , and population parameter
T = 4 MMD²k (Π, P × Q).

G.1. Bounding the fluctuations of the test statistic. We begin by identifying a CQB
Φn satisfying the test statistic tail bound (21). Our proof makes use of two lemmas. The first
shows that functions of independent variables concentrate around their mean when coordi-
natewise function differences are bounded.

L EMMA G.1 (Bounded differences inequality [34, Thm. D.8]). Suppose that, for all x1 ∈
X1 , x2 ∈ X2 , . . . , xn ∈ Xn and i ∈ [n], a function f : X1 × X2 × · · · × Xn → R satisfies

sup_{x′i ∈Xi } |f (x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − f (x1 , . . . , xi−1 , x′i , xi+1 , . . . , xn )| ≤ ci .

If X1 , X2 , . . . , Xn are independent with each Xi ∈ Xi then, with probability at least 1 − δ,

f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )] < √( (1/2) log(1/δ) Σ_{i=1}^n c²i ).

The second lemma uses Lem. G.1 to derive a suitable CQB for Vn .

L EMMA G.2 (Bounded independence quantile bound). Under the assumptions of Thm. 4,

Pr( |4 MMD²k (Π, P × Q) − Vn | ≥ Φn (β) ) ≤ β   for all β ∈ (0, 1),

where Φn (β) ≜ Λ(β)² + 4 MMDk (Π, P × Q) Λ(β) and Λ(β) ≜ ( 6√K + 6√(2K log(1/β)) ) / √n.

P ROOF. Let X̃ = (Ỹ , Z̃) be an independent draw from Π. By [42, App. B],

Vn = (1/n⁴) Σ_{(i1 ,i2 ,i3 ,i4 )∈[n]⁴} hin (Xi1 , Xi2 , Xi3 , Xi4 ) = 4 MMD²k (Π̂, P̂ × Q̂).

By the triangle inequality, we have the following sure inequalities for each k ∈ [n]:

|MMDk (Π, P × Q) − MMDk (Π̂, P̂ × Q̂)| ≤ MMDk (Π, Π̂) + MMDk (P × Q, P̂ × Q̂) ≜ ∆(X) + ∆′(X),

|∆(X) − ∆((X−k , X̃))| ≤ (1/n) MMDk (δXk , δX̃ ) ≤ 2√K /n,   and

|∆′(X) − ∆′((X−k , X̃))| ≤ (1/n) MMDgY (δYk , δỸ ) √( (Q̂ × Q̂)gZ ) + (1/n) MMDgZ (δZk , δZ̃ ) √( supy gY (y, y) ) ≤ 4√K /n.

Furthermore, by Jensen’s inequality, we have

(71)  E[∆(X)] ≤ √( E[MMDk (Π, Π̂)²] ) = √( E[ (1/n²) Σ_{i=1}^n ((δXi − Π) × (δXi − Π))k ] )
              ≤ √( E[ (1/n²) Σ_{i=1}^n k(Xi , Xi ) ] ) ≤ √( K/n ).

The triangle inequality, Cauchy-Schwarz, and analogous reasoning also yield

(72)  E[∆′(X)] ≤ E[MMDk (P × Q, P × Q̂)] + E[MMDk (P × Q̂, P̂ × Q̂)]
              = E[ √( (P × P)gY ) MMDgZ (Q, Q̂) ] + E[ √( (Q̂ × Q̂)gZ ) MMDgY (P, P̂) ]
              ≤ √( E[(P × P)gY ] E[MMD²gZ (Q, Q̂)] ) + √( E[ (Q̂ × Q̂)gZ MMD²gY (P, P̂) ] ) ≤ 2√( K/n ).

Hence, by Lem. G.1, with probability at least 1 − β,

|2 MMDk (Π, P × Q) − 2 MMDk (Π̂, P̂ × Q̂)| < Λ(β).

Since |a² − b²| = |a − b||b + a| ≤ |a − b|(|b − a| + 2|a|) for all real a, b, the result follows.

G.2. Alternative representation of the independence V-statistic. We next define an
alternative, equivalent way to represent the independence V-statistic (Def. 3) and its permuted
versions, due to [19]. We will leverage this view of the V-statistic when deriving CQBs for
permuted V-statistics in App. G.3. For i ∈ [n] and j ∈ [n], we define

K̄ij ≜ gY (Yi , Yj ) − (1/n) Σ_{j′=1}^n gY (Yi , Yj′ )   and   L̄ij ≜ gZ (Zi , Zj ) − (1/n) Σ_{j′=1}^n gZ (Zi , Zj′ ).

For i, j ∈ [s], we also define

(73)  (Rijji , Rij(swaps (j))(swaps (i)) ) = Σ_{a∈Ii ,b∈Ij } (K̄ab L̄ba , K̄ab L̄swap(b)swap(a) ),
      (Rij(swaps (j))i , Rijj(swaps (i)) ) = Σ_{a∈Ii ,b∈Ij } (K̄ab L̄swap(b)a , K̄ab L̄b swap(a) ),

where the bins I1 , . . . , Is and the mapping swap are defined in Alg. 2. Note the similarity
between Rijkl and the quantities Sijkl from Alg. 2. Now define the V-statistic

(74)  Ṽn ≜ (1/n²) Σ_{i,j=1}^n K̄ij L̄ji = (1/n²) Σ_{i,j=1}^s Rijji

and its cheaply permuted version for any wild bootstrap permutation π on [s],

(75)  Ṽnπ,s ≜ (1/n²) Σ_{i,j=1}^s Rij πj πi .

Our next result shows the close relationship between V and Ṽ .

L EMMA G.3 (V-statistic relationship). In the notation of (74), (75), and Def. 3,

Ṽn = (1/4) Vn   and   Ṽnπ,s = (1/4) Vnπ,s

for any wild bootstrap permutation π on [s].

P ROOF. It suffices to show the result for Ṽnπ,s . We expand Vnπ,s :

Vnπ,s = (1/n⁴) Σ_{(i1 ,i2 ,i3 ,i4 )∈[n]⁴} ( gY (Yi1 , Yi2 ) + gY (Yi3 , Yi4 ) − gY (Yi1 , Yi3 ) − gY (Yi2 , Yi4 ) )
                                      × ( gZ (Zπi1 , Zπi2 ) + gZ (Zπi3 , Zπi4 ) − gZ (Zπi1 , Zπi3 ) − gZ (Zπi2 , Zπi4 ) )

     = (4/n²) Σ_{(i1 ,i2 )∈[n]²} [ gY (Yi1 , Yi2 ) gZ (Zπi1 , Zπi2 ) − gY (Yi1 , Yi2 ) ( (1/n) Σ_{i3 ∈[n]} gZ (Zπi2 , Zπi3 ) )
                                 − ( (1/n) Σ_{i4 ∈[n]} gY (Yi1 , Yi4 ) ) gZ (Zπi1 , Zπi2 )
                                 + ( (1/n) Σ_{i3 ∈[n]} gZ (Zπi2 , Zπi3 ) ) ( (1/n) Σ_{i4 ∈[n]} gY (Yi1 , Yi4 ) ) ]

     = (4/n²) Σ_{(i1 ,i2 )∈[n]²} ( gY (Yi1 , Yi2 ) − (1/n) Σ_{i4 ∈[n]} gY (Yi1 , Yi4 ) )
                               × ( gZ (Zπi2 , Zπi1 ) − (1/n) Σ_{i3 ∈[n]} gZ (Zπi2 , Zπi3 ) )

     = (4/n²) Σ_{(i1 ,i2 )∈[n]²} K̄i1 i2 L̄πi2 πi1 = 4 Ṽnπ,s .
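Lem. G.3 (for the identity permutation) can be verified numerically: the centred-Gram form Ṽn of (74) agrees with one quarter of the brute-force V-statistic Vn. Below is a short check, with Gaussian kernels chosen as an arbitrary concrete example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
Y = rng.normal(size=n)
Z = rng.normal(size=n)
gY = np.exp(-(Y[:, None] - Y[None, :]) ** 2)  # Gaussian kernels as an illustrative choice
gZ = np.exp(-(Z[:, None] - Z[None, :]) ** 2)

# row-centred Gram matrices K-bar and L-bar, and the alternative statistic (74)
Kbar = gY - gY.mean(axis=1, keepdims=True)
Lbar = gZ - gZ.mean(axis=1, keepdims=True)
V_tilde = (Kbar * Lbar.T).sum() / n**2

# brute-force V-statistic Vn (identity permutation)
V = 0.0
for i1 in range(n):
    for i2 in range(n):
        for i3 in range(n):
            for i4 in range(n):
                hY = gY[i1, i2] + gY[i3, i4] - gY[i1, i3] - gY[i2, i4]
                hZ = gZ[i1, i2] + gZ[i3, i4] - gZ[i1, i3] - gZ[i2, i4]
                V += hY * hZ
V /= n**4

assert abs(V_tilde - V / 4) < 1e-10
```

The centred form requires only O(n²) work versus the naive O(n⁴) enumeration, which is the point of the alternative representation.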

G.3. Bounding the fluctuations of the threshold. We next identify CQBs ΨX,s and
Ψn,s that satisfy the properties

Pr(Vnπ,s ≥ ΨX,s (α) | X) ≤ α,   ∀α ∈ (0, 1),   and
Pr(ΨX,s (α) ≥ Ψn,s (α, β)) ≤ β,   ∀α, β ∈ (0, 1)

for the cheaply permuted V-statistic Vnπ,s (32). Since the alternative V-statistic representation
Ṽnπ,s (75) satisfies Ṽnπ,s = (1/4) Vnπ,s by Lem. G.3, we will find Ψ̃n,s and Ψ̃X,s satisfying

(76)  Pr(Ṽnπ,s ≥ Ψ̃X,s (α) | X) ≤ α,   ∀α ∈ (0, 1),   and
(77)  Pr(Ψ̃X,s (α) ≥ Ψ̃n,s (α, β)) ≤ β,   ∀α, β ∈ (0, 1),

and set ΨX,s (α) = 4 Ψ̃X,s (α) and Ψn,s (α, β) = 4 Ψ̃n,s (α, β). Our derivation makes use of four
lemmas. The first, proved in App. G.5, rewrites Ṽnπ,s in terms of independent Rademacher
random variables.

L EMMA G.4 (Rademacher representation of Ṽnπ,s ). Instantiate the assumptions of


Thm. 4, and let ∥·∥F represent the Frobenius norm, 1 ∈ Rs be a vector of ones, and
swaps (i) ≜ i + 2s − sI i > 2s

56

for each i ∈ [s]. If π is a wild bootstrap permutation on [s] drawn independently of X, then
the alternative V-statistic representation (75) satisfies
Ṽnπ,s = ϵ⊤ Aϵ + b⊤ ϵ + c,
where ϵ ∼ Unif({±1}s/2 ) is independent of X, and A ∈ Rs/2×s/2 , b ∈ Rs/2 , and c ∈ R satisfy
MMD2k (Π,P×Q) 2 (A(1) )2 (A(2) )2 MMD2k (Π,P×Q)((A(1) )2 +(A(2) )2 )
A− s2 11⊤ F ≤ 4s4 + 4s3 ,
MMD2k (Π,P×Q) ⊤ ≤ A(1) A(2)
 MMDk (Π,P×Q)(A(1) +A(2) )
Tr A − s 2 11 2s2 + 4s3/2 ,
MMD2k (Π,P×Q) 1 (A(3) )2 2A(3) MMDk (Π,P×Q) 
c− 4 ≤ 4 s + √
s
, and
MMD2k (Π,P×Q) 2 3(A(2) )2 (A(3) )2 +(A(1) )2 (A(4) )2
b− s 1 2 ≤ 4s3
3MMD2k (Π,P×Q)((A(1) )2 +(A(2) )2 +(A(3) )2 +(A(4) )2 )
+ 4s2

for the random variables


qP qP
s s
A(1) ≜ i=1 ∥(µ i − Π + P × Q )k∥ 2,
k A (2) ≜ 2
i=1 ∥(µ̃i − Π + P × Q)k∥k ,
√ √
A(3) ≜ s∥(µ − Π + P × Q)k∥k , and A(4) ≜ s∥(µ̃ − Π + P × Q)k∥k
defined in terms of the random signed measures
µi ≜ δX(i) − δY × δZ(i) − δY(i) × δZ(swaps (i)) + δY × δZ(swaps (i)) , µ̃i ≜ δX(i) − δY(i) × δZ(swaps (i)) ,
µ ≜ δX − 2δY × δZ + 1s sj=1 δY(j) × δZ(swaps (j)) , and µ̃ ≜ δX − 1s sj=1 δY(j) × δZ(swaps (j)) .
P P

Our second lemma, proved in App. G.6, bounds the expectation of each quantity A(i)
introduced in Lem. G.4.

L EMMA G.5 (A(i) mean bounds). Under the assumptions and notation of Lem. G.4 with
m ≜ ns ,
√ √
E A(1) ≤ D1 ≜ Ks √5m + √2n , E A(2) ≤ D2 ≜ Ks √3m ,
    

√ √ √ √
E A(3) ≤ D3 ≜ Ks 5+ √ 8, and E A(4) ≤ D4 ≜ Ks 1+ √ 8.
   
n n

Our third lemma, proved in App. G.7, uses Lem. G.5 and the bounded differences inequal-
ity (Lem. G.1) to provide CQBs for each quantity A(i) introduced in Lem. G.4.

L EMMA G.6 (A(i) complementary quantile bounds). Instantiate the assumptions and
notation of Lem. G.5, and fix any i ∈ [4]. With probability at least 1 − δ ,
q √ √ √ √ √
A(i) ≤ Di + Ci log(1/δ)
2 for (C1 , C2 , C3 , C4 ) = ( 10√mKs + 2 √Ks
n
, 6√Ks 14√ Ks 6 √Ks
m
, n
, n
).

Our fourth lemma, proved in App. G.8, uses the estimates of Lem. G.6 to verify the CQB
properties (76) and (77) for a particular pairing of (Ψ̃X,s , Ψ̃n,s ).
CHEAP PERMUTATION TESTING 57

L EMMA G.7 (Quantile bound for independence permutation threshold). Instantiate the
assumptions and notation of Lem. G.6. Then, for all α, β ∈ (0, 1),
1
4 Ψn,s (α, β) ≜ Ψ̃n,s (α, β)
q
K
p 3 5 log(3/α) 3

≜ n (196 + 148 log(4/β) + 48 log(4/β)) 4 + 3 + log(3/α)
2
√ q
1+12 log(3/α)
+ MMDk (Π, P × Q) √K
p  1 3

n
30 + 22 log(4/β) 2 + √
6 s
+ 2 log(3/α)
q
+ MMD2k (Π, P × Q) 14 + log(6/α) + log(6/α)

s s

satisfies Pr(Ψ̃X,s (α) ≥ Ψ̃n,s (α, β)) ≤ β for Ψ̃X,s (α) defined in equation (87) and satisfying
Pr(Ṽnπ,s ≥ Ψ̃X,s (α) | X) ≤ α.

G.4. Putting everything together. Fix any β ∈ (0, 1), and suppose that
(78) T = 4MMD2k (Π, P × Q) ≥ Ψn,s (α⋆ , β/3) + Φn (β/3),
for Ψn,s and Φn defined in Lems. G.2 and G.7. By Lems. G.2, G.3, and G.7 and Prop. 2, a
cheap wild bootstrap test based on Vn has power at least 1−β . We now show that the assumed
condition of Thm. 4 is sufficient to imply (78). Indeed, we see that (78) holds whenever
ax2 − bx − c ≥ 0 for x = MMDk (Π, P × Q),
q
log(6/α⋆ ) ⋆)
3
a= 4 − s − log(6/α s ,
√ q
1+12 log(3/α⋆ )
b = √K 3
p  
30 + 22 log(12/β) 3 + + ⋆
2 log(3/α )

n 6 s
√  1 1+12 log(3/α⋆ ) q 3 √ √
≥ √K ⋆ ) + 6 K(1+ √2 log(3/β))
p 
n
30 + 22 log(12/β) 2 + √
6 s
+ 2 log(3/α n
q q
c= K 7 5 ⋆ 2
p 2

n (14 + 7 log(12/β)) 8 + 3 log(3/α )

q
≥K 2 7 + 5 log(3/α ) + 3
p 
(14 + 7 log(12/β)) ⋆
n 8 3 2 log(3/α )
q
K 3 5 log(3/α⋆ )
+ 32 log(3/α⋆ )
p 
≥ n (196 + 148 log(12/β) + 48 log(12/β)) 4 + 3

+ 3K
p
n (1 + 2 log(3/β))2
√ √
2
Since a ≤ 1, it suffices to have x ≥ a1 (b + c) ≥ ab + ac ≥ b+ b2a+4ac . The advertised result
p
q q q
now follows as 3 + 12 78 ≤ 27 and 32 + 12 53 ≤ 2.

G.5. Proof of Lem. G.4: Rademacher representation of Ṽnπ,s . Recall the definition
(73) of Rijkl , and, for each i ∈ [s], define the Rademacher vector ϵ̃i = 2I(i = πi ) − 1. Observe
that by construction, ϵ̃i = ϵ̃swaps (i) for all i ∈ [s/2]. Then
1+ϵ̃j 1+ϵ̃i 1−ϵ̃j
Rijπj πi = ( 1+ϵ̃
2 )( 2 )Rijji + ( 2 )( 2 )Rij(swaps (j))i
i

1+ϵ̃j 1−ϵ̃i 1−ϵ̃j


+ ( 1−ϵ̃
2 )( 2 )Rijj(swaps (i)) + ( 2 )( 2 )Rij(swaps (j))(swaps (i))
i

ϵ̃i ϵ̃j 
= 4 Rijji − Rij(swaps (j))i − Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
ϵ̃i

+ 4 Rijji + Rij(swaps (j))i − Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
58
ϵ̃j 
+ 4 Rijji − Rij(swaps (j))i + Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
1

+ 4 Rijji + Rij(swaps (j))i + Rijj(swaps (i)) + Rij(swaps (j))(swaps (i)) .
Hence,
ϵ̃i ϵ̃j
Ṽnπ,s = 1 Ps 
n2 i,j=1 4 Rijji − Rij(swaps (j))i − Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
ϵ̃i

+ 4 Rijji + Rij(swaps (j))i − Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
ϵ̃j 
+ 4 Rijji − Rij(swaps (j))i + Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
1

+ 4 Rijji + Rij(swaps (j))i + Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
≜ ϵ̃⊤ Aϵ̃ + b⊤ ϵ̃ + c,

where à ∈ Rs×s , b̃ ∈ Rs , and c̃ ∈ R have the components


1

ãij = 4n2 Rijji − Rij(swaps (j))i − Rijj(swaps (i)) + Rij(swaps (j))(swaps (i)) ,
b̃i = 4n1 2 nj=1 Rijji + Rij(swaps (j))i − Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
P

+ Rjiij + Rji(swaps (i))j − Rjii(swaps (j)) − Rji(swaps (i))(swaps (j)) ,
c̃ = 4n1 2 si,j=1 Rijji + Rij(swaps (j))i + Rijj(swaps (i)) + Rij(swaps (j))(swaps (i)) .
P 

We now rewrite the terms Rijji , Rijqi , Rijjr , and Rijqr in terms of kernel expectations:
1 1 Pm −1
Pn 
m2 Rijji = m2 k,l=1 gY (Ym(i−1)+k , Ym(j−1)+l ) − n j ′ =1 gY (Ym(i−1)+k , Yj ′ )

× gZ (Zm(j−1)+k , Zm(i−1)+l ) − n−1 ni′ =1 gZ (Zm(j−1)+k , Zi′ )


P 
(79)
= ((δX(j) − δY × δZ(j) ) × (δX(i) − δY(i) × δZ ))k,
Pm
1 1
gY (Ym(i−1)+k , Ym(j−1)+l ) − n−1 nj′ =1 gY (Ym(i−1)+k , Yj ′ )
P 
m2 Rijqi = m2 k,l=1

× gZ (Zm(q−1)+k , Zm(i−1)+l ) − n−1 ni′ =1 gZ (Zm(q−1)+k , Zi′ )


P 
(80)
= ((δY(j) × δZ(q) − δY × δZ(q) ) × (δX(i) − δY(i) × δZ ))k,
Pm
1 1
gY (Ym(i−1)+k , Ym(j−1)+l ) − n−1 nj′ =1 gY (Ym(i−1)+k , Yj ′ )
P 
m2 Rijjr = m2 k,l=1

× gZ (Zm(j−1)+k , Zm(r−1)+l ) − n−1 ni′ =1 gZ (Zm(j−1)+k , Zi′ )


P 
(81)
= ((δX(j) − δY × δZ(j) ) × (δY(i) × δZ(r) − δY(i) × δZ ))k, and

1 1 Pm −1
Pn 
m2 Rijqr = m2 k,l=1 gY (Ym(i−1)+k , Ym(j−1)+l ) − n ′
j =1 gY (Y m(i−1)+k , Yj ′)

−1
Pn 
(82) × gZ (Zm(q−1)+k , Zm(r−1)+l ) − n i′ =1 gZ (Zm(q−1)+k , Zi′ )

= ((δY(j) × δZ(q) − δY × δZ(q) ) × (δY(i) × δY(r) − δY(i) × δZ ))k.


Plugging q = swaps (j) and r = swaps (i) in (79), (80), (81), (82), we obtain
Rij(swaps (j))i R swaps (i)) R
ãij = 4s12 Rmijji = 4s12 (µj × µ̃i )k.

2 − − ijj(m + ij(swapsm
(j))(swaps (i))
m2 2 2
CHEAP PERMUTATION TESTING 59

We bound the following quantity using the Cauchy-Schwarz inequality:


1 2
ãij − 4s2 MMDk (Π, P × Q)
1
= 4s2 ⟨µj k, µ̃i k⟩k − ⟨(Π − P × Q)k, (Π − P × Q)k⟩k
1
= 4s2 ⟨µj k − (Π − P × Q)k, µ̃i k⟩k
− ⟨(Π − P × Q)k − µ̃i k, (Π − P × Q)k⟩k
(83) 1
= 4s2 ⟨µj k − (Π − P × Q)k, µ̃i k − (Π − P × Q)k⟩k
+ ⟨µj k − (Π − P × Q)k, (Π − P × Q)k⟩
− ⟨(Π − P × Q)k − µ̃i k, (Π − P × Q)k⟩k
1
≤ 4s2 ∥(µj − Π + P × Q)k∥k ∥(µ̃i − Π + P × Q)k∥k

+ MMDk (Π, P × Q) ∥(µj − Π + P × Q)k∥k + ∥(µ̃i − Π + P × Q)k∥k .
Plugging q = swaps (j) and r = swaps (i) into (79), (80), (81), (82), we find that
Rij(swaps (j))i R swaps (i)) R
b̃i = 4s12 sj=1 Rmijji
P
2 + m2 − ijj(m 2 − ij(swapsm
(j))(swaps (i))
2

Rji(swaps (i))j R swaps (j)) R


+ Rmjiij

2 + m2 − jii(m 2 − ji(swapsm(i))(swaps (j))
2

Ps
= 4s12 j=1 (δX(j) − δY × δZ(j) + δY(j) × δZ(swaps (j)) − δY × δZ(swaps (j)) )k,

(δX(i) − δY(i) × δZ − δY(i) × δZ(swaps (i)) + δY(i) × δZ )k k

+ (δX(i) − δY × δZ(i) + δY(i) × δZ(swaps (i)) − δY × δZ(swaps (i)) )k,


Ps 
j=1 (δX(j) − δY(j) × δZ − δY(j) × δZ(swaps (j)) + δY(j) × δZ )k k
1
= 4s (⟨µk, µ̃i k⟩k + ⟨µi k, µ̃k⟩k ).
We also bound the following using the Cauchy-Schwarz inequality:
1 2
b̃i − 2s MMDk (Π, P × Q)
1
= 4s ⟨µk, µ̃i k⟩k + ⟨µi k, µ̃k⟩k − 2⟨(Π − P × Q)k, (Π − P × Q)k⟩k
1
≤ 4s ∥(µ − Π + P × Q)k∥k ∥(µ̃i − Π + P × Q)k∥k
(84)
+ ∥(µi − Π + P × Q)k∥k ∥(µ̃ − Π + P × Q)k∥k
+ MMDk (Π, P × Q) ∥(µ − Π + P × Q)k∥k + ∥(µ̃i − Π + P × Q)k∥k

+ ∥(µi − Π + P × Q)k∥k + ∥(µ̃ − Π + P × Q)k∥k .
Plugging q = swaps (j) and r = swaps (i) into (79), (80), (81), (82), we find that
Rij(swaps (j))i R swaps (i)) R
c̃ = 4s12 si,j=1 Rmijji
P 
2 + m2 + ijj(m2 + ij(swapsm
(j))(swaps (i))
2

 2
= 14 δX − 2δY × δZ + si=1 δY(i) × δZ(swaps (i)) k k = 14 ∥µk∥2k
P
60

We also bound the scalar c̃ − 14 MMD2k (Π, P × Q) using Cauchy-Schwarz:


c̃ − 41 MMD2k (Π, P × Q) = 1 2
4 ∥µk∥k − 14 ⟨(Π − P × Q)k, (Π − P × Q)k⟩k
1
= 4 ⟨(µ − Π + P × Q)k, µk⟩k − ⟨(Π − P × Q − µ)k, (Π − P × Q)k⟩k
1
= 4 ∥(µ − Π + P × Q)k∥2k − 2⟨(Π − P × Q − µ)k, (Π − P × Q)k⟩k
1
∥(µ − Π + P × Q)k∥2k + 2∥(µ − Π + P × Q)k∥k MMDk (Π, P × Q)

≤ 4
(A(3) )2 (3)
(Π,P×Q) 
= 1
4 s + 2A MMD√k
s
.

At this point, we further rewrite the expression Ṽnπ,s = ϵ̃⊤ Ãϵ̃ + b̃⊤ ϵ̃ + c̃ in terms of ϵ̃1:s/2 .
Since ϵ̃s/2:s = ϵ̃1:s/2 , we have that
Ṽnπ,s = ϵ̃⊤ ⊤
1:s/2 Ã1:s/2,1:s/2 ϵ̃1:s/2 + ϵ̃s/2:s Ãs/2:s,1:s/2 ϵ̃1:s/2

+ ϵ̃⊤ ⊤
1:s/2 Ã1:s/2,s/2:s ϵ̃s/2:s + ϵ̃s/2:s Ãs/2:s,s/2:s ϵ̃s/2:s

+ b̃⊤ ⊤
1:s/2 ϵ̃1:s/2 + b̃s/2:s ϵ̃s/2:s + c̃

= ϵ̃⊤

1:s/2 Ã1:s/2,1:s/2 + Ãs/2:s,1:s/2 + Ã1:s/2,s/2:s + Ãs/2:s,s/2:s ϵ̃1:s/2
⊤
+ b̃1:s/2 + b̃s/2:s ϵ̃1:s/2 + c̃.
Thus, if we define
(85) ϵ = ϵ̃1:s/2 ∼ Unif({±1}s/2 ),
A = Ã1:s/2,1:s/2 + Ãs/2:s,1:s/2 + Ã1:s/2,s/2:s + Ãs/2:s,s/2:s ∈ Rs/2×s/2 ,
b = b̃1:s/2 + b̃s/2:s ∈ Rs/2 , and c = c̃ ∈ R,
we have that
Ṽnπ,s = ϵ⊤ Aϵ + b⊤ ϵ + c.
If we let aij and bi be the components of A and b, we have that
aij = ãij + ã(swaps (i))j + ãi(swaps (j)) + ã(swaps (i))(swaps (j)) and bi = b̃i + b̃swaps (i) .
Next, define
Ā = A − 1 2 ⊤
s2 MMDk (Π, P × Q)11 ,
MMD2k (Π,P×Q)
(86) b̄ = b − s 11⊤ , and
c̄ = c − 41 MMD2k (Π, P × Q).
If we let āij and b̄i be the components of Ā and b̄, by equations (83) and (84), we have that
1 2 1 2
āij ≤ ãij − 4s2 MMDk (Π, P × Q) + ã(swaps (i))j − 4s2 MMDk (Π, P × Q)
1 2
+ ãi(swaps (j)) − + ã(swaps (i))(swaps (j)) − 4s12 MMD2k (Π, P × Q)
4s2 MMDk (Π, P × Q)
1

≤ 4s2 ∥(µj − Π + P × Q)k∥k + ∥(µswaps (j) − Π + P × Q)k∥k

× ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
+ MMDk (Π, P × Q) ∥(µj − Π + P × Q)k∥k + ∥(µ̃i − Π + P × Q)k∥k

+ ∥(µswaps (j) − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
CHEAP PERMUTATION TESTING 61

and
1 2 1 2
b̄i ≤ b̃i − 2s MMDk (Π, P × Q) + b̃swaps (i) − 2s MMDk (Π, P × Q)
1

≤ 4s∥(µ − Π + P × Q)k∥k ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k

+ ∥(µi − Π + P × Q)k∥k + ∥(µswaps (i) − Π + P × Q)k∥k ∥(µ̃ − Π + P × Q)k∥k
+ MMDk (Π, P × Q) 2∥(µ − Π + P × Q)k∥k + 2∥(µ̃ − Π + P × Q)k∥k
+ ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k

+ ∥(µi − Π + P × Q)k∥k + ∥(µswaps (i) − Π + P × Q)k∥k .

We now turn to bounding $\|\bar A\|_F^2$ in terms of $A^{(1)}$ and $A^{(2)}$ using Jensen's inequality:
\begin{align*}
16 s^4 \sum_{i,j=1}^{s/2} \bar a_{ij}^2
&\le \sum_{i,j=1}^{s/2} \Big[\big(\|(\mu_j - \Pi + P\times Q)k\|_k + \|(\mu_{\mathrm{swap}_s(j)} - \Pi + P\times Q)k\|_k\big)\big(\|(\tilde\mu_i - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big) \\
&\qquad + \mathrm{MMD}_k(\Pi,P\times Q)\big(\|(\mu_j - \Pi + P\times Q)k\|_k + \|(\tilde\mu_i - \Pi + P\times Q)k\|_k \\
&\qquad\quad + \|(\mu_{\mathrm{swap}_s(j)} - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big)\Big]^2 \\
&\le \sum_{i,j=1}^{s/2} \Big[2\big(2\|(\mu_j - \Pi + P\times Q)k\|_k^2 + 2\|(\mu_{\mathrm{swap}_s(j)} - \Pi + P\times Q)k\|_k^2\big)\big(2\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2 + 2\|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2\big) \\
&\qquad + 2\,\mathrm{MMD}_k^2(\Pi,P\times Q)\big(4\|(\mu_j - \Pi + P\times Q)k\|_k^2 + 4\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2 \\
&\qquad\quad + 4\|(\mu_{\mathrm{swap}_s(j)} - \Pi + P\times Q)k\|_k^2 + 4\|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2\big)\Big] \\
&\le 4\Big(\sum_{j=1}^{s}\|(\mu_j - \Pi + P\times Q)k\|_k^2\Big)\Big(\sum_{i=1}^{s}\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2\Big) \\
&\qquad + 4 s\,\mathrm{MMD}_k^2(\Pi,P\times Q)\Big(\sum_{j=1}^{s}\|(\mu_j - \Pi + P\times Q)k\|_k^2 + \sum_{i=1}^{s}\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2\Big) \\
&= 4 (A^{(1)})^2 (A^{(2)})^2 + 4 s\,\mathrm{MMD}_k^2(\Pi,P\times Q)\big((A^{(1)})^2 + (A^{(2)})^2\big).
\end{align*}
Next, we bound $\mathrm{Tr}(\bar A)$ in terms of $A^{(1)}$ and $A^{(2)}$ using Cauchy–Schwarz:
\begin{align*}
4 s^2 \sum_{i=1}^{s/2} |\bar a_{ii}|
&\le \sum_{i=1}^{s/2}\Big[\big(\|(\mu_i - \Pi + P\times Q)k\|_k + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big)\big(\|(\tilde\mu_i - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big) \\
&\qquad + \mathrm{MMD}_k(\Pi,P\times Q)\big(\|(\mu_i - \Pi + P\times Q)k\|_k + \|(\tilde\mu_i - \Pi + P\times Q)k\|_k \\
&\qquad\quad + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big)\Big] \\
&\le 2\Big(\sum_{i=1}^{s}\|(\mu_i - \Pi + P\times Q)k\|_k^2\Big)^{1/2}\Big(\sum_{i=1}^{s}\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2\Big)^{1/2} \\
&\qquad + \sqrt{s}\,\mathrm{MMD}_k(\Pi,P\times Q)\Big(\sqrt{\textstyle\sum_{i=1}^{s}\|(\mu_i - \Pi + P\times Q)k\|_k^2} + \sqrt{\textstyle\sum_{i=1}^{s}\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2}\Big) \\
&= 2 A^{(1)} A^{(2)} + \sqrt{s}\,\mathrm{MMD}_k(\Pi,P\times Q)\big(A^{(1)} + A^{(2)}\big).
\end{align*}

Finally, we bound $\|\bar b\|_2^2$ in terms of $A^{(1)}$, $A^{(2)}$, $A^{(3)}$, and $A^{(4)}$:
\begin{align*}
16 s^2 \sum_{i=1}^{s/2} \bar b_i^2
&\le \sum_{i=1}^{s/2}\Big[\|(\mu - \Pi + P\times Q)k\|_k\big(\|(\tilde\mu_i - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big) \\
&\qquad + \big(\|(\mu_i - \Pi + P\times Q)k\|_k + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big)\|(\tilde\mu - \Pi + P\times Q)k\|_k \\
&\qquad + \mathrm{MMD}_k(\Pi,P\times Q)\big(2\|(\mu - \Pi + P\times Q)k\|_k + 2\|(\tilde\mu - \Pi + P\times Q)k\|_k \\
&\qquad\quad + \|(\tilde\mu_i - \Pi + P\times Q)k\|_k + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k \\
&\qquad\quad + \|(\mu_i - \Pi + P\times Q)k\|_k + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k\big)\Big]^2 \\
&\le 12 \sum_{i=1}^{s/2}\Big[\|(\mu - \Pi + P\times Q)k\|_k^2\big(\|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2 + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2\big) \\
&\qquad + \big(\|(\mu_i - \Pi + P\times Q)k\|_k^2 + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2\big)\|(\tilde\mu - \Pi + P\times Q)k\|_k^2 \\
&\qquad + \mathrm{MMD}_k^2(\Pi,P\times Q)\big(2\|(\mu - \Pi + P\times Q)k\|_k^2 + 2\|(\tilde\mu - \Pi + P\times Q)k\|_k^2 \\
&\qquad\quad + \|(\tilde\mu_i - \Pi + P\times Q)k\|_k^2 + \|(\tilde\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2 \\
&\qquad\quad + \|(\mu_i - \Pi + P\times Q)k\|_k^2 + \|(\mu_{\mathrm{swap}_s(i)} - \Pi + P\times Q)k\|_k^2\big)\Big] \\
&= \tfrac{12 (A^{(2)})^2 (A^{(3)})^2}{s} + \tfrac{12 (A^{(1)})^2 (A^{(4)})^2}{s} + 12\,\mathrm{MMD}_k^2(\Pi,P\times Q)\big((A^{(1)})^2 + (A^{(2)})^2 + (A^{(3)})^2 + (A^{(4)})^2\big).
\end{align*}
G.6. Proof of Lem. G.5: A(i) mean bounds.

G.6.1. Bounding E[A(1) ] and E[A(2) ]. By the triangle inequality,


\begin{align*}
\mathbb{E}\big[A^{(1)}\big]
&= \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\|(\mu_i - \Pi + P\times Q)k\|_k^2}\Big] \\
&\le \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{X^{(i)}}, \Pi)^2}\Big] + \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{Y^{(i)}}\times\delta_{Z^{(\mathrm{swap}_s(i))}}, P\times Q)^2}\Big] \\
&\quad + \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(i)}}, P\times Q)^2}\Big] + \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(\mathrm{swap}_s(i))}}, P\times Q)^2}\Big].
\end{align*}
Cauchy–Schwarz and the expected squared MMD bound of (71) now yield
\[
\mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{X^{(i)}}, \Pi)^2}\Big]
\le \sqrt{\mathbb{E}\Big[\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{X^{(i)}}, \Pi)^2\Big]}
\le \sqrt{\tfrac{sK}{m}}.
\]
Similarly, the triangle inequality, Cauchy–Schwarz, and the expected squared MMD bounds of (72) yield
\begin{align*}
\mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(i)}}, P\times Q)^2}\Big]
&\le \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(i)}}, \delta_{\mathbf Y}\times Q)^2}\Big] + \mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times Q, P\times Q)^2}\Big] \\
&\le \sqrt{\mathbb{E}\Big[\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(i)}}, \delta_{\mathbf Y}\times Q)^2\Big]} + \sqrt{\mathbb{E}\Big[\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times Q, P\times Q)^2\Big]} \\
&\le \sqrt{\tfrac{sK}{m}} + \sqrt{\tfrac{sK}{n}}.
\end{align*}
Parallel reasoning yields the bounds
\[
\mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{Z^{(\mathrm{swap}_s(i))}}, P\times Q)^2}\Big] \le \sqrt{\tfrac{sK}{m}} + \sqrt{\tfrac{sK}{n}}
\quad\text{and}\quad
\mathbb{E}\Big[\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_k(\delta_{Y^{(i)}}\times\delta_{Z^{(\mathrm{swap}_s(i))}}, P\times Q)^2}\Big] \le 2\sqrt{\tfrac{sK}{m}}.
\]
Hence $\mathbb{E}[A^{(1)}] \le D_1$ as advertised, and $\mathbb{E}[A^{(2)}] \le D_2$ follows from parallel reasoning.



G.6.2. Bounding $\mathbb{E}[A^{(3)}]$ and $\mathbb{E}[A^{(4)}]$. Since $\mathbb{E}[\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}}] = P\times Q$, independence and our prior MMD moment bounds imply that
\begin{align*}
\mathbb{E}\Big[\mathrm{MMD}_k\Big(\tfrac{2}{s}\textstyle\sum_{j=1}^{s/2}\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}},\, P\times Q\Big)^2\Big]
&= \tfrac{4}{s^2}\textstyle\sum_{j=1}^{s/2}\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}}, P\times Q)^2\big] \\
&\le \tfrac{8}{s^2}\textstyle\sum_{j=1}^{s/2}\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}}, \delta_{Y^{(j)}}\times Q)^2\big]
+ \tfrac{8}{s^2}\textstyle\sum_{j=1}^{s/2}\mathbb{E}\big[\mathrm{MMD}_k(\delta_{Y^{(j)}}\times Q, P\times Q)^2\big]
\le \tfrac{8K}{n}.
\end{align*}
Hence, by the triangle inequality, Cauchy–Schwarz, and our prior MMD moment bounds,
\begin{align*}
\mathbb{E}\big[A^{(3)}\big]
&= \sqrt{s}\,\mathbb{E}\big[\|(\mu - \Pi + P\times Q)k\|_k\big] \\
&\le \sqrt{s}\,\mathbb{E}\Big[\mathrm{MMD}_k(\delta_{\mathbf X}, \Pi) + 2\,\mathrm{MMD}_k(\delta_{\mathbf Y}\times\delta_{\mathbf Z}, P\times Q)
+ \tfrac12\,\mathrm{MMD}_k\Big(\tfrac{2}{s}\textstyle\sum_{j=1}^{s/2}\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}},\, P\times Q\Big)
+ \tfrac12\,\mathrm{MMD}_k\Big(\tfrac{2}{s}\textstyle\sum_{j=s/2+1}^{s}\delta_{Y^{(j)}}\times\delta_{Z^{(\mathrm{swap}_s(j))}},\, P\times Q\Big)\Big] \\
&\le \sqrt{\tfrac{sK}{n}} + 4\sqrt{\tfrac{sK}{n}} + \sqrt{\tfrac{8sK}{n}}.
\end{align*}
Analogous reasoning gives $\mathbb{E}[A^{(4)}] \le D_4$, completing the proof.

G.7. Proof of Lem. G.6: $A^{(i)}$ complementary quantile bounds. Fix any $k \in [n]$, let $\tilde X = (\tilde Y, \tilde Z)$ be an independent draw from $\Pi$, and let $\tilde{\mathbf X} = (X_1, \dots, X_{k-1}, \tilde X, X_{k+1}, \dots, X_n) = (\tilde{\mathbf Y}, \tilde{\mathbf Z})$. Viewing $A^{(1)}$ as a function of $\mathbf X$ and using the triangle inequality, we can write
\begin{align*}
|A^{(1)}(\mathbf X) - A^{(1)}(\tilde{\mathbf X})|
&\le \tfrac{1}{m}\mathrm{MMD}_k(\delta_{X_k}, \delta_{\tilde X}) + \tfrac{2}{m}\mathrm{MMD}_{g_Z}(\delta_{Z_k}, \delta_{\tilde Z})\sqrt{(\delta_{\tilde{\mathbf Y}}\times\delta_{\tilde{\mathbf Y}})g_Y} \\
&\quad + \tfrac{1}{m}\mathrm{MMD}_{g_Y}(\delta_{Y_k}, \delta_{\tilde Y})\sqrt{\sup_z g_Z(z,z)} + \tfrac{1}{m}\mathrm{MMD}_{g_Z}(\delta_{Z_k}, \delta_{\tilde Z})\sqrt{\sup_y g_Y(y,y)} \\
&\quad + \tfrac{1}{n}\mathrm{MMD}_{g_Y}(\delta_{Y_k}, \delta_{\tilde Y})\sqrt{\textstyle\sum_{i=1}^{s}\mathrm{MMD}_{g_Z}(\delta_{Z^{(i)}}, \delta_{Z^{(\mathrm{swap}_s(i))}})^2}
\;\le\; \tfrac{10\sqrt K}{m} + \tfrac{2\sqrt{sK}}{n}.
\end{align*}
Viewing $A^{(3)}$ as a function of $\mathbf X$ and using the triangle inequality, we can also write
\begin{align*}
\tfrac{1}{\sqrt s}\,|A^{(3)}(\mathbf X) - A^{(3)}(\tilde{\mathbf X})|
&\le \tfrac{1}{n}\mathrm{MMD}_k(\delta_{X_k}, \delta_{\tilde X}) + \tfrac{2}{n}\mathrm{MMD}_{g_Z}(\delta_{Z_k}, \delta_{\tilde Z})\sqrt{(\delta_{\tilde{\mathbf Y}}\times\delta_{\tilde{\mathbf Y}})g_Y}
+ \tfrac{2}{n}\mathrm{MMD}_{g_Y}(\delta_{Y_k}, \delta_{\tilde Y})\sqrt{(\delta_{\mathbf Z}\times\delta_{\mathbf Z})g_Z} \\
&\quad + \tfrac{1}{sm}\mathrm{MMD}_{g_Y}(\delta_{Y_k}, \delta_{\tilde Y})\sqrt{\sup_z g_Z(z,z)} + \tfrac{1}{sm}\mathrm{MMD}_{g_Z}(\delta_{Z_k}, \delta_{\tilde Z})\sqrt{\sup_y g_Y(y,y)}
\;\le\; \tfrac{14\sqrt K}{n}.
\end{align*}
Parallel reasoning additionally yields the bounds
\[
|A^{(2)}(\mathbf X) - A^{(2)}(\tilde{\mathbf X})| \le \tfrac{6\sqrt K}{m}
\quad\text{and}\quad
|A^{(4)}(\mathbf X) - A^{(4)}(\tilde{\mathbf X})| \le \tfrac{6\sqrt{Ks}}{n}.
\]
The advertised result now follows from the bounded differences inequality (Lem. G.1) and the mean estimates of Lem. G.5.

G.8. Proof of Lem. G.7: Quantile bound for independence permutation threshold.

G.8.1. Bounding $\tilde V^{\pi,s}$ in terms of $\tilde\Psi_{\mathbf X,s}(\alpha)$. Fix any $\alpha, \beta \in (0,1)$. By Lem. G.4,
\begin{align*}
\tilde V_n^{\pi,s}
&= \epsilon^\top A \epsilon + b^\top \epsilon + c \\
&= \epsilon^\top \bar A \epsilon + \tfrac{\mathrm{MMD}_k^2(\Pi,P\times Q)}{s^2}\,\epsilon^\top \mathbf{1}\mathbf{1}^\top \epsilon + \bar b^\top \epsilon + \tfrac{\mathrm{MMD}_k^2(\Pi,P\times Q)}{s}\,\mathbf{1}^\top \epsilon + \bar c + \tfrac{\mathrm{MMD}_k^2(\Pi,P\times Q)}{4} \\
&\le |\epsilon^\top \bar A \epsilon - \mathrm{Tr}(\bar A)| + \mathrm{Tr}(\bar A) + \bar b^\top \epsilon + \bar c + \mathrm{MMD}_k^2(\Pi,P\times Q)\Big(\tfrac14 + \tfrac{\epsilon^\top \mathbf{1}\mathbf{1}^\top \epsilon}{s^2} + \tfrac{\mathbf{1}^\top \epsilon}{s}\Big),
\end{align*}
where $\epsilon \sim \mathrm{Unif}(\{\pm 1\}^{s/2})$ is independent of $\mathbf X$; $A$, $b$, and $c$ were defined in (85); and $\bar A$, $\bar b$, and $\bar c$ were defined in (86).

Now fix any $\delta' \in (0, 0.86)$ and $\delta'', \delta''' > 0$. By equation (50) of the Hanson–Wright inequality (Lem. D.3), with probability at least $1 - \delta'$,
\[
|\epsilon^\top \bar A \epsilon - \mathrm{Tr}(\bar A)| = |\epsilon^\top \bar A \epsilon - \mathbb{E}[\epsilon^\top \bar A \epsilon]| \le \tfrac{20\,\|\bar A\|_F \log(1/\delta')}{3}.
\]
Meanwhile, Hoeffding's inequality [23, Thm. 2] implies that
\[
\bar b^\top \epsilon \le \|\bar b\|_2 \sqrt{2\log(1/\delta'')} \ \text{ with probability at least } 1 - \delta''
\quad\text{and}\quad
|\mathbf{1}^\top \epsilon| \le \sqrt{s \log(2/\delta''')} \ \text{ with probability at least } 1 - \delta'''.
\]
Hence, with probability at least $1 - \delta'''$,
\[
\tfrac14 + \tfrac{\epsilon^\top \mathbf{1}\mathbf{1}^\top \epsilon}{s^2} + \tfrac{\mathbf{1}^\top \epsilon}{s}
\le \tfrac14 + \tfrac{s\log(2/\delta''')}{s^2} + \tfrac{\sqrt{s\log(2/\delta''')}}{s}
= \tfrac14 + \tfrac{\log(2/\delta''')}{s} + \sqrt{\tfrac{\log(2/\delta''')}{s}}.
\]
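The Rademacher tail bound used above is easy to probe empirically. The following Monte Carlo sketch (illustrative parameter choices, not from the paper) checks that the sum of s/2 independent signs exceeds the Hoeffding threshold with frequency at most delta:

```python
import numpy as np

rng = np.random.default_rng(1)
s, delta, trials = 64, 0.1, 20000

# Draw eps ~ Unif({+-1}^(s/2)) and estimate Pr(|1^T eps| > sqrt(s log(2/delta)))
eps = rng.choice([-1.0, 1.0], size=(trials, s // 2))
sums = np.abs(eps.sum(axis=1))
threshold = np.sqrt(s * np.log(2 / delta))
violation_rate = np.mean(sums > threshold)

# Hoeffding guarantees the violation rate is at most delta
assert violation_rate <= delta
```

Because the threshold here sits at roughly 2.4 standard deviations of the sum, the empirical violation rate is far below the Hoeffding guarantee, as expected for a sub-Gaussian tail bound.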
Moreover, Lem. G.4 guarantees that
\begin{align*}
|\bar c| &\le \tfrac14\Big(\tfrac{(A^{(3)})^2}{s} + \tfrac{2A^{(3)}\,\mathrm{MMD}_k(\Pi,P\times Q)}{\sqrt s}\Big), \\
\|\bar A\|_F^2 &\le \tfrac{(A^{(1)})^2 (A^{(2)})^2}{4 s^4} + \tfrac{\mathrm{MMD}_k^2(\Pi,P\times Q)\big((A^{(1)})^2 + (A^{(2)})^2\big)}{4 s^3}, \\
\mathrm{Tr}(\bar A) &\le \tfrac{A^{(1)} A^{(2)}}{2 s^2} + \tfrac{\mathrm{MMD}_k(\Pi,P\times Q)\big(A^{(1)} + A^{(2)}\big)}{4 s^{3/2}}, \quad\text{and} \\
\|\bar b\|_2^2 &\le \tfrac{3\big((A^{(2)})^2 (A^{(3)})^2 + (A^{(1)})^2 (A^{(4)})^2\big)}{4 s^3} + \tfrac{3\,\mathrm{MMD}_k^2(\Pi,P\times Q)\big((A^{(1)})^2 + (A^{(2)})^2 + (A^{(3)})^2 + (A^{(4)})^2\big)}{4 s^2}.
\end{align*}
Hence, if we choose $\delta' = \delta'' = \delta''' = \alpha/3$, then, by the union bound and the triangle inequality, we have, with probability at least $1 - \alpha$,
\begin{align}
\tilde V_n^{\pi,s} \le \tilde\Psi_{\mathbf X,s}(\alpha)
&\triangleq \tfrac{A^{(1)} A^{(2)}}{s^2}\Big(\tfrac12 + \tfrac{10\log(3/\alpha)}{3}\Big) + \sqrt{\tfrac{3\log(3/\alpha)}{2}}\;\tfrac{A^{(2)} A^{(3)} + A^{(1)} A^{(4)}}{s^{3/2}} + \tfrac{(A^{(3)})^2}{4s} \nonumber\\
&\quad + \mathrm{MMD}_k(\Pi,P\times Q)\Big(\tfrac{A^{(1)} + A^{(2)}}{s^{3/2}}\Big(\tfrac14 + \tfrac{10\log(3/\alpha)}{3}\Big) + \sqrt{\tfrac{3\log(3/\alpha)}{2}}\;\tfrac{A^{(1)} + A^{(2)} + A^{(3)} + A^{(4)}}{s} + \tfrac{A^{(3)}}{2\sqrt s}\Big) \tag{87}\\
&\quad + \mathrm{MMD}_k^2(\Pi,P\times Q)\Big(\tfrac14 + \sqrt{\tfrac{\log(6/\alpha)}{s}} + \tfrac{\log(6/\alpha)}{s}\Big). \nonumber
\end{align}

G.8.2. Bounding $\tilde\Psi_{\mathbf X,s}(\alpha)$ in terms of $\tilde\Psi_{n,s}(\alpha,\beta)$. Recall the definitions of $(D_i)_{i=1}^4$ and $(C_i)_{i=1}^4$ in Lems. G.5 and G.6 respectively. We have
\[
\tfrac{C_1}{s} = \tfrac{10\sqrt K}{\sqrt{sm}} + \tfrac{2\sqrt K}{\sqrt{sn}} \le \tfrac{12\sqrt K}{\sqrt n},
\qquad
\tfrac{C_2}{s} = \tfrac{6\sqrt K}{\sqrt{sm}} = \tfrac{6\sqrt K}{\sqrt n},
\qquad
\tfrac{C_3}{\sqrt s} = \tfrac{14\sqrt K}{\sqrt n},
\quad\text{and}\quad
\tfrac{C_4}{\sqrt s} = \tfrac{6\sqrt K}{\sqrt n},
\]
\[
\tfrac{D_1}{s} = \tfrac{5\sqrt K}{\sqrt{sm}} + \tfrac{2\sqrt K}{\sqrt{sn}} \le \tfrac{7\sqrt K}{\sqrt n},
\qquad
\tfrac{D_2}{s} = \tfrac{3\sqrt K}{\sqrt n},
\qquad
\tfrac{D_3}{\sqrt s} = \tfrac{(5+\sqrt 8)\sqrt K}{\sqrt n} \le \tfrac{8\sqrt K}{\sqrt n},
\quad\text{and}\quad
\tfrac{D_4}{\sqrt s} = \tfrac{(1+\sqrt 8)\sqrt K}{\sqrt n} \le \tfrac{4\sqrt K}{\sqrt n}.
\]

By Lem. G.6, with probability at least $1 - \beta$, the following $A^{(i)}$-based bounds all hold simultaneously. First,
\begin{align*}
\tfrac{A^{(1)} A^{(2)}}{s^2}
&\le \tfrac{1}{s^2}\Big(D_1 + C_1\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)\Big(D_2 + C_2\sqrt{\tfrac{\log(4/\beta)}{2}}\Big) \\
&\le \Big(\tfrac{7\sqrt K}{\sqrt n} + \tfrac{12\sqrt K}{\sqrt{2n}}\sqrt{\log(4/\beta)}\Big)\Big(\tfrac{3\sqrt K}{\sqrt n} + \tfrac{6\sqrt K}{\sqrt{2n}}\sqrt{\log(4/\beta)}\Big) \\
&\le \tfrac{K}{n}\Big(21 + 56\sqrt{\log(4/\beta)} + 36\log(4/\beta)\Big).
\end{align*}

Second,
\begin{align*}
\tfrac{A^{(2)} A^{(3)} + A^{(1)} A^{(4)}}{s^{3/2}}
&\le \tfrac{\big(D_2 + C_2\sqrt{\log(4/\beta)/2}\big)\big(D_3 + C_3\sqrt{\log(4/\beta)/2}\big) + \big(D_1 + C_1\sqrt{\log(4/\beta)/2}\big)\big(D_4 + C_4\sqrt{\log(4/\beta)/2}\big)}{s^{3/2}} \\
&\le \Big(\tfrac{3\sqrt K}{\sqrt n} + \tfrac{6\sqrt K}{\sqrt n}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)\Big(\tfrac{14\sqrt K}{\sqrt n} + \tfrac{8\sqrt K}{\sqrt n}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)
+ \Big(\tfrac{7\sqrt K}{\sqrt n} + \tfrac{12\sqrt K}{\sqrt n}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)\Big(\tfrac{6\sqrt K}{\sqrt n} + \tfrac{4\sqrt K}{\sqrt n}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big) \\
&\le \tfrac{K}{n}\Big(84 + 148\sqrt{\log(4/\beta)} + 48\log(4/\beta)\Big).
\end{align*}
Third,
\[
\tfrac{A^{(3)}}{\sqrt s} \le \tfrac{D_3}{\sqrt s} + \tfrac{C_3}{\sqrt s}\sqrt{\tfrac{\log(4/\beta)}{2}} \le \tfrac{\sqrt K}{\sqrt n}\Big(14 + 8\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)
\]
and hence
\[
\tfrac{(A^{(3)})^2}{s} \le \tfrac{K}{n}\Big(196 + 112\sqrt{\log(4/\beta)} + 64\log(4/\beta)\Big).
\]
Fourth,
\[
\tfrac{A^{(1)} + A^{(2)}}{s^{3/2}}
\le \tfrac{1}{\sqrt s}\Big(\tfrac{D_1}{s} + \tfrac{C_1}{s}\sqrt{\tfrac{\log(4/\beta)}{2}} + \tfrac{D_2}{s} + \tfrac{C_2}{s}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)
\le \tfrac{\sqrt K}{\sqrt{ns}}\Big(10 + 13\sqrt{\log(4/\beta)}\Big).
\]
Fifth,
\begin{align*}
\tfrac{A^{(1)} + A^{(2)} + A^{(3)} + A^{(4)}}{s}
&\le \tfrac{D_1}{s} + \tfrac{C_1}{s}\sqrt{\tfrac{\log(4/\beta)}{2}} + \tfrac{D_2}{s} + \tfrac{C_2}{s}\sqrt{\tfrac{\log(4/\beta)}{2}}
+ \tfrac{1}{\sqrt s}\Big(\tfrac{D_3}{\sqrt s} + \tfrac{C_3}{\sqrt s}\sqrt{\tfrac{\log(4/\beta)}{2}} + \tfrac{D_4}{\sqrt s} + \tfrac{C_4}{\sqrt s}\sqrt{\tfrac{\log(4/\beta)}{2}}\Big) \\
&\le \tfrac{\sqrt K}{\sqrt n}\Big(30 + 22\sqrt{\log(4/\beta)}\Big).
\end{align*}

Hence, with probability at least $1-\beta$, the quantity $\tilde\Psi_{\mathbf X,s}(\alpha)$ (87) satisfies
\begin{align*}
\tilde\Psi_{\mathbf X,s}(\alpha)
&\le \tfrac{K}{n}\Big(21 + 56\sqrt{\log(4/\beta)} + 36\log(4/\beta)\Big)\Big(\tfrac12 + \tfrac{10\log(3/\alpha)}{3}\Big) \\
&\quad + \tfrac{K}{n}\Big(84 + 148\sqrt{\log(4/\beta)} + 48\log(4/\beta)\Big)\sqrt{\tfrac{3\log(3/\alpha)}{2}} \\
&\quad + \tfrac{K}{n}\Big(196 + 112\sqrt{\log(4/\beta)} + 64\log(4/\beta)\Big) \\
&\quad + \mathrm{MMD}_k(\Pi, P\times Q)\bigg[\tfrac{\sqrt K}{\sqrt{ns}}\Big(10 + 13\sqrt{\log(4/\beta)}\Big)\Big(\tfrac14 + \tfrac{10\log(3/\alpha)}{3}\Big) \\
&\qquad + \tfrac{\sqrt K}{\sqrt n}\Big(30 + 22\sqrt{\log(4/\beta)}\Big)\sqrt{\tfrac{3\log(3/\alpha)}{2}} + \tfrac{\sqrt K}{2\sqrt n}\Big(14 + 8\sqrt{\tfrac{\log(4/\beta)}{2}}\Big)\bigg] \\
&\quad + \mathrm{MMD}_k^2(\Pi, P\times Q)\Big(\tfrac14 + \sqrt{\tfrac{\log(6/\alpha)}{s}} + \tfrac{\log(6/\alpha)}{s}\Big)
\;\le\; \tilde\Psi_{n,s}(\alpha,\beta).
\end{align*}

APPENDIX H: Proof of Prop. 2: Quantile comparison method


Let $(T_{(b)})_{b=1}^B$ represent the auxiliary values $(T_b)_{b=1}^B$ ordered increasingly. We begin by relating the random critical value $T_{(b_\alpha)}$ to the conditional quantiles of $T_1$ given $\mathbf X$.

LEMMA H.1 (High-probability bound on critical value [13, Lem. 6, Cor. 1]). Under the assumptions of Prop. 2, let $F_{\mathbf X}$ be the cumulative distribution function of $T_1$ given $\mathbf X$, and let
\[
q_{1-\alpha(\delta)}(\mathbf X) \triangleq \inf\big\{x \in \mathbb{R} : 1 - \alpha(\delta) \le F_{\mathbf X}(x)\big\}
\quad\text{for}\quad
\alpha(\delta) \triangleq \Big(\tfrac{\delta}{\binom{B}{\lfloor\alpha(B+1)\rfloor}}\Big)^{1/\lfloor\alpha(B+1)\rfloor} \ge \tfrac{\alpha}{2e}\,\delta^{1/\lfloor\alpha(B+1)\rfloor}
\]
be the $1-\alpha(\delta)$ quantile of $T_1$ given $\mathbf X$. Then, with probability at least $1-\delta$,
\[
T_{(b_\alpha)} \le q_{1-\alpha(\delta)}(\mathbf X).
\]

PROOF. Lem. 6 and Cor. 1 of [13] establish this result for the special case of permuted MMD coreset test statistics, but the proof for the general case is identical.

Since $\Psi_{\mathbf X}(\alpha) \ge q_{1-\alpha(\beta/3)}(\mathbf X)$ almost surely by assumption (22), Lem. H.1 implies that
\[
\Pr(A_1^c) \le \beta/3 \quad\text{for}\quad A_1 \triangleq \{T_{(b_\alpha)} \le \Psi_{\mathbf X}(\alpha^\star)\}.
\]
Moreover, our assumption (23) on $\Psi_n$ ensures that
\[
\Pr(A_2^c) \le \beta/3 \quad\text{for}\quad A_2 \triangleq \{\Psi_{\mathbf X}(\alpha^\star) \le \Psi_n(\alpha^\star, \beta/3)\}.
\]
Since $\Delta(\mathbf X)$ rejects when $T(\mathbf X) > T_{(b_\alpha)}$, the acceptance probability is upper-bounded by
\begin{align*}
\Pr(T(\mathbf X) \le T_{(b_\alpha)})
&= \Pr(T(\mathbf X) \le T_{(b_\alpha)},\, A_1 \cap A_2) + \Pr(T(\mathbf X) \le T_{(b_\alpha)},\, A_1^c \cup A_2^c) \\
&\le \Pr(T(\mathbf X) \le \Psi_n(\alpha^\star, \beta/3)) + \Pr(A_1^c \cup A_2^c) \\
&\le \Pr\big(T(\mathbf X) \le T - \Phi_n(\tfrac{\beta}{3})\big) + \Pr(A_1^c) + \Pr(A_2^c) \le \beta,
\end{align*}
where the final two inequalities used the union bound and our assumptions (20) and (21).
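The critical value $T_{(b_\alpha)}$ in the argument above is an order statistic of replicates that are exchangeable with $T(\mathbf X)$ under the null. A generic Monte Carlo sketch of the resulting level control (hypothetical Gaussian nulls and a standard order-statistic threshold, rather than the exact $b_\alpha$ of Prop. 2):

```python
import numpy as np

rng = np.random.default_rng(2)
B, alpha, trials = 199, 0.05, 5000

# Under the null, (T(X), T_1, ..., T_B) are exchangeable, so rejecting when
# T(X) exceeds the ceil((1 - alpha)(B + 1))-th smallest auxiliary value
# yields a test with level at most alpha.
rejections = 0
for _ in range(trials):
    stats = rng.standard_normal(B + 1)      # exchangeable null statistics
    t0, aux = stats[0], np.sort(stats[1:])  # observed value and ordered auxiliaries
    b_alpha = int(np.ceil((1 - alpha) * (B + 1)))
    if t0 > aux[b_alpha - 1]:
        rejections += 1

assert rejections / trials <= alpha + 0.02  # Monte Carlo slack
```

With $B = 199$ and $\alpha = 0.05$ the rejection rate concentrates near $0.05$, matching the exact, finite-sample false positive control that motivates permutation testing in the first place.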

APPENDIX I: Proof of Cor. 3: Refined two moments method

Our choices of $\Phi_n$ and $\Psi_{\mathbf X}$ satisfy the requirements (21) and (22) by Cantelli's inequality [6, (18)]. Moreover, Markov's inequality [32] and Cantelli's inequality imply
\begin{align*}
&\Pr\Big(\sqrt{(\alpha^{-1}-1)\,\mathrm{Var}(T_1 \mid \mathbf X)} \ge \sqrt{(\alpha^{-1}-1)\,\beta^{-1}\,\mathbb{E}[\mathrm{Var}(T_1 \mid \mathbf X)]}\Big) \\
&\qquad = \Pr\big((\alpha^{-1}-1)\,\mathrm{Var}(T_1 \mid \mathbf X) \ge (\alpha^{-1}-1)\,\beta^{-1}\,\mathbb{E}[\mathrm{Var}(T_1 \mid \mathbf X)]\big) \le \beta
\quad\text{and} \\
&\Pr\Big(\mathbb{E}[T_1 \mid \mathbf X] - \mathbb{E}[T_1] \ge \sqrt{(2\beta^{-1}-1)\,\mathrm{Var}(\mathbb{E}[T_1 \mid \mathbf X])}\Big) \le \beta/2
\end{align*}
for all $\beta \in (0,1]$. Hence, our $\Psi_n$ choice (24) satisfies (23) when $\mathbb{E}[T_1 \mid \mathbf X] = 0$ almost surely and, by the union bound, our $\Psi_n$ choice (25) satisfies (23) more generally.
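Cantelli's one-sided inequality, on which the second display relies, can be spot-checked by simulation (hypothetical exponential variates with mean and variance 1; this is an illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, trials = 0.2, 200000

# Cantelli: Pr(Z - E[Z] >= sqrt((2/beta - 1) Var(Z))) <= beta / 2
z = rng.exponential(scale=1.0, size=trials)  # E[Z] = 1, Var(Z) = 1
threshold = np.sqrt(2 / beta - 1)
rate = np.mean(z - 1.0 >= threshold)

assert rate <= beta / 2
```

For $\beta = 0.2$ the threshold equals $3$, and the exponential tail probability $e^{-4} \approx 0.018$ sits well below the Cantelli guarantee $\beta/2 = 0.1$, illustrating that the inequality is conservative but valid for any distribution with finite variance.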

APPENDIX J: PROOFS OF MINIMAX OPTIMALITY RESULTS


J.1. Proof of Prop. 3: Power of cheap testing: discrete $L^2$ homogeneity. Introduce the shorthand $t \triangleq s\sqrt[3]{\rho_{n_1 n_2}}$, $u \triangleq \frac{\beta\alpha^\star}{24(1-\alpha^\star)}$, and $v \triangleq t\sqrt[3]{u} - 1$. Thm. 1 and the facts that $\psi_{Y,1}, \psi_{Z,1} \le 4\sqrt{b_{(1)}}\,\|P-Q\|_2^2$ and $\psi_{YZ,2} \le b_{(1)}$ by [25, App. G] imply that the cheap discrete homogeneity permutation test has power at least $1-\beta$ whenever
\begin{align*}
\|P-Q\|_2^2 = \mathbb{E}[U_{n_1,n_2}]
&\ge \bigg(\sqrt{\tfrac{16(3-\beta)\sqrt{b_{(1)}}\,\|P-Q\|_2^2}{\beta}\Big(\tfrac{1}{n_1}+\tfrac{1}{n_2}\Big)} + \sqrt{\tfrac{1-\alpha^\star}{\alpha^\star}\cdot\tfrac{(36 n_1^2 + 36 n_2^2 + 198 n_1 n_2)\, b_{(1)}}{\beta\, n_1(n_1-1)\, n_2(n_2-1)}}\bigg)\Big(1+\tfrac1v\Big) \\
&\quad + \tfrac1v\sqrt{\tfrac{201\, s^2 b_{(1)}}{4 n^2} + \tfrac{48\, s\sqrt{b_{(1)}}\,\|P-Q\|_2^2}{n}}.
\end{align*}
By the triangle inequality, it suffices to have $a x^2 - b x - c \ge 0$ for $x = \|P-Q\|_2$, $a = 1$,
\[
b = \sqrt{\tfrac{16(3-\beta)\sqrt{b_{(1)}}}{\beta}\Big(\tfrac{1}{n_1}+\tfrac{1}{n_2}\Big)}\times\Big(1+\tfrac1v\Big) + \tfrac1v\sqrt{\tfrac{48\sqrt{b_{(1)}}\; t}{n\sqrt[3]{\rho_{n_1 n_2}}}}, \quad\text{and}
\]
\[
c = \sqrt{\tfrac{1-\alpha^\star}{\alpha^\star}\cdot\tfrac{(36 n_1^2 + 36 n_2^2 + 198 n_1 n_2)\, b_{(1)}}{\beta\, n_1(n_1-1)\, n_2(n_2-1)}}\times\Big(1+\tfrac1v\Big) + \tfrac1v\cdot\tfrac{\sqrt{201\, b_{(1)}}\; t}{2 n\sqrt[3]{\rho_{n_1 n_2}}}.
\]
Hence, by the quadratic formula, it suffices to have $x \ge b + \sqrt c \ge (b + \sqrt{b^2 + 4ac})/(2a)$.
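With $a = 1$, the sufficiency of $x \ge b + \sqrt{c}$ amounts to the algebraic identity $(b+\sqrt c)^2 - b(b+\sqrt c) - c = b\sqrt c \ge 0$, which the following sketch confirms on a small grid of hypothetical nonnegative coefficients:

```python
import math

# If x >= b + sqrt(c) and a = 1, then a x^2 - b x - c >= 0.
# The polynomial is increasing past its larger root, so it suffices
# to check x = b + sqrt(c) itself.
for b in [0.1, 1.0, 7.3]:
    for c in [0.0, 0.5, 12.0]:
        x = b + math.sqrt(c)
        assert x * x - b * x - c >= -1e-12  # equals b * sqrt(c) up to rounding
```

The tolerance only absorbs floating-point rounding; the exact value of the left-hand side at $x = b + \sqrt c$ is $b\sqrt c$.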

Finally, since $1/(n\sqrt[3]{\rho_{n_1 n_2}}) = O(1/n_1)$, there exist universal constants $C, C'$ such that
\[
b + \sqrt c
\;\le\; \tfrac{C' b_{(1)}^{1/4}\, t}{\sqrt{\min(\beta,\alpha^\star)\, n_1}}\Big(1 + \tfrac{1}{\min(v,\sqrt v)}\Big)
\;=\; \tfrac{C' b_{(1)}^{1/4}\,(1+v)^{1/3}}{\sqrt{\min(\beta,\alpha^\star)\, n_1}}\Big(1 + \tfrac{u^{1/6}}{\min(v,\sqrt v)}\Big)
\;\le\; \tfrac{C\, b_{(1)}^{1/4}}{\sqrt[6]{\beta\alpha^\star}\sqrt{\min(\beta,\alpha^\star)\, n_1}}\Big(1 + \tfrac1v\Big).
\]

J.2. Proof of Prop. 4: Power of cheap testing: Hölder $L^2$ homogeneity. Let $\tilde P$ be a discrete distribution over $\{B_1, \dots, B_K\}$ with probability $\int_{B_k} \mathrm{d}P(x)$ of selecting bin $B_k$ for each $k \in [K]$. Define $\tilde Q$ analogously based on $Q$, and let $\kappa_{(1)} = \lfloor n_1^{2/(4\tau+d)}\rfloor$. By [2, Lem. 3],
\[
\|\tilde P - \tilde Q\|_2^2 \ge C_1 \kappa_{(1)}^{-d}\,\|P-Q\|_2^2
\]
for a universal constant $C_1$. Furthermore, by the argument in [25, App. I],
\[
b_{(1)} \triangleq \max\{\|P\|_2^2, \|Q\|_2^2\} \le L\kappa_{(1)}^{-d}.
\]
These bounds, together with Prop. 3, yield the sufficient power condition
\[
\frac{C' L^{1/4}\,\kappa_{(1)}^{-d/4}}{\sqrt[6]{\beta\alpha^\star}\sqrt{\min(\beta,\alpha^\star)\, n_1}}\Bigg(1 + \frac{1}{s\sqrt[3]{\tfrac{\beta\alpha^\star \rho_{n_1 n_2}}{24(1-\alpha^\star)}} - 1}\Bigg)
\le \sqrt{C_1}\,\kappa_{(1)}^{-d/2}\,\|P-Q\|_2 \le \|\tilde P - \tilde Q\|_2
\]
for a universal constant $C'$. The result now follows from the upper estimate $n_1^{2/(4\tau+d)} \ge \kappa_{(1)}$.

J.3. Proof of Prop. 5: Power of cheap testing: discrete L2 independence. Note that
i.i.d.
the augmentation variables (Ai )ni=1 , (Bi )ni=1 ∼ Unif((0, 1)) are, with probability 1, all dis-
tinct. Therefore, gY ((Yi , Ai ), (Yi , Ai )) = gZ ((Zi , Bi ), (Zi , Bi )) = 0 for all i ∈ [n]. The argu-
ments of [25, Prop. 5.3] imply the variance component (14) bounds
ψ1′ ≤ C1 b(2) ∥Π − P × Q∥22 , ψ2′ ≤ C2 b(2) , and ψ̃1′ ≤ C3 b(2) ∥Π − P × Q∥22
p p

for universal constants C1 , C2 , and C3 . Additionally, we can bound the mean component ξ˜
(15) by bounding the following expression for arbitrary i1 , . . . , i4 , i′1 , . . . , i′4 ∈ [8].

E hin ((Yi1 , Ai1 ), (Zi′1 , Bi′1 )), ((Yi2 , Ai2 ), (Zi′2 , Bi′2 )),

((Yi3 , Ai3 ), (Zi′3 , Bi′3 )), ((Yi4 , Ai4 ), (Zi′4 , Bi′4 )

= E gY ((Yi1 , Ai1 ), (Yi2 , Ai2 )) + gY ((Yi3 , Ai3 ), (Yi4 , Ai4 ))

− gY ((Yi1 , Ai1 ), (Yi3 , Ai3 )) − gY ((Yi2 , Ai2 ), (Yi4 , Ai4 ))
× gZ ((Zi′1 , Bi′1 ), (Zi′2 , Bi′2 )) + gZ ((Zi′3 , Bi′3 ), (Zi′4 , Bi′4 ))

− gZ ((Zi′1 , Bi′1 ), (Zi′3 , Bi′3 )) − gZ ((Zi′2 , Bi′2 ), (Zi′4 , Bi′4 ))
(i)
= E I(Yi1 = Yi2 , i1 ̸= i2 ) + I(Yi3 = Yi4 , i3 ̸= i4 )

− I(Yi1 = Yi3 , i1 ̸= i3 ) − I(Yi2 = Yi4 , i2 ̸= i4 )
× I Zi′1 = Zi′2 , i′1 ̸= i′2 + I Zi′3 = Zi′4 , i′3 ̸= i′4
 

− I Zi′1 = Zi′3 , i′1 ̸= i′3 − I Zi′2 = Zi′2 , i′3 ̸= i′4


 

(ii) 
≤ 16 max E[I(Y1 = Y2 )I(Zi = Zj )] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}
68

(iii) 
= 16 max E[I(Y1 = Y2 , A1 ̸= A2 )I(Zi = Zj , Bi ̸= Bj )] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}
= 16 max E[gY ((Y1 , A1 ), (Y2 , A2 ))2 gZ ((Zi , Bi ), (Zj , Bj ))2 ] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}


= ψ2′ .
Above, equality (i) holds because, with probability 1, Ai = Aj ⇔ i = j and Bi = Bj ⇔ i = j .
Inequality (ii) holds because, after expanding the product, the only terms that remain are of
the form I(Yi = Yj )I(Zk = Zl ) with i ̸= j and k ̸= l. Equality (iii) holds because A1 ̸= A2
and Bi ̸= Bj almost surely. We conclude that
ξ˜ ≤ ψ ′ ≤ C2 b(2) .
2

Since U = 4∥Π − P × Q∥22 , Thm. 3, our mean and variance component bounds, and the
triangle inequality imply that the cheap discrete independence test has power at least 1 − β
whenever ax2 − bx − c ≥ 0 for x = ∥Π − P × Q∥2 , a = 1 − s+1 3s2 ,
r √ q p q  32 q 12
3 3 16C1 b(2) 12
 2304 460800 
4 b = ( β −1) n + C3 b(2) ⋆
α β −2 n + α⋆ β − 2 ns + ns2 , and
q  q 12  960  24 16 q 12
3 3 144
p  4147200 
4 c = C 2 b(2) ( β −1) n 2 + ⋆
α β −2 n2 + n + n + α⋆ β − 2 n2 s .
√ √
2
Since a ≤ 1, it therefore suffices to have x ≥ a1 (b + c) ≥ ab + ac ≥ b+ b2a+4ac by the
p
√ Cb
1/4

quadratic formula. The claim now follows as a1 (b + c) ≤ √βα(2)⋆ n for a universal constant C .

J.4. Proof of Prop. 6: Power of cheap testing: Hölder RL2 independence. Let P̃ be a
discrete distribution over {B1,1 , . . . , B1,K1 } with probability B1,k dP(y) of selecting bin B1,k
for each k ∈ [K1 ]. Define Q̃ analogously as a discrete distribution over {B2,1 , . . . , B2,K2 }
based on Q and Π̃ analogously as a discrete distribution over {(B1,k1 , B2,k2 ) : k1 ∈ [K1 ], k2 ∈
[K2 ]} based on Π. Let κ(2) = ⌊n2/(4τ +d1 +d2 ) ⌋. By [2, Lem. 3],
−(d1 +d2 )
∥Π̃ − P̃ × Q̃∥22 ≥ C1 κ(2) ∥Π − P × Q∥22 ,
for a universal constant C1 . Furthermore, by the argument in [25, App. M],
−(d1 +d2 )
b(2) ≜ max{∥Π∥22 , ∥P × Q∥22 } ≤ Lκ(2) ,
These bounds, together with Prop. 5, yield the sufficient power condition
−(d +d )/4
C ′ L1/4 κ(2) 1 2 √ −(d +d )/2
√ ⋆
βα n
≤ C1 κ(2) 1 2 ∥Π − P × Q∥2 ≤ ∥Π̃ − P̃ × Q̃∥2
for a universal constant C ′ . The result now follows from the estimate n2/(4τ +d1 +d2 ) ≥ κ(2) .

APPENDIX K: SUPPLEMENTARY EXPERIMENT DETAILS


All homogeneity experiments were run with Python 3.8.6 on an Intel Xeon Platinum 8268
CPU. All independence experiments were run with Python 3.13.0 on an AMD EPYC 7V13
CPU.

K.1. Selection of the number of permutations B . To select our common permutation


count of B = 1279 for the homogeneity testing experiments, we evaluated the power of the
standard MMD test with total sample size n = 16384 and varying permutation count of the
form B = 20 × 2a − 1 for a ≥ 0 in the MMD SETTING of Sec. 6.1. We followed the same
procedure to select the permutation count for independence testing experiments, only using
the HSIC SETTING of Sec. 6.2 with n = 2048 in place of the MMD setting. In both cases,
we found that power saturated after B = 1279 as shown in Fig. K.1.
CHEAP PERMUTATION TESTING 69

Standard Standard
1279 2559
0.74 639
1279 2559
0.74 639
319
159 159 319

79
0.72 0.72 79

39
Power

Power
0.70 0.70 39

0.68
0.68

19 0.66 19
0.66

5 10 20 50 100 0.1 0.2 0.5 1 2 5


Total computation time (s) Total computation time (s)

F IG K.1. Power and runtime of standard MMD permutation tests of homogeneity (left) and standard HSIC wild
bootstrap tests of independence (right) as number of permutations B varies.

K.2. Cheap permutation tests for feature-based kernels. When a homogeneity U-


statistic (Def. 2) has a rank-r base function of the form
g(y, z) = rj=1 fj (y)fj (z),
P
(88)
each test statistic can be computed in time Θ(cf nr) where cf is the maximum time required
to evaluate a single feature function fj , and cheap permutation can be carried out in Θ(Bsr)
additional time with Θ(sr) memory, as described in Alg. K.1.

Algorithm K.1: Cheap homogeneity testing with feature kernels


Input: Samples (Yi )n1 n2 r
i=1 , (Zi )i=1 , bin count s, feature maps f = (fj )j=1 (88), level α,
permutation count B
1 Define (Xi )n i=1 ≜ (Y1 , . . . , Yn1 , Z1 , . . . , Zn2 ) for n ← n1 + n2 , m ← n/s, and s1 ← s n1 /n
2 // Compute sufficient statistics using Θ(cf nr) time and Θ(sr) memory
for i = 1, 2, . . . , s do Fi ← m
P
3 k=1 f (X(i−1)m+k ) ;
4 // Compute original and permuted test statistics using Θ(Bsr) elementary operations
5 for b = 0, 1, 2, . . . , B do
6 π ← identity permutation if b = 0 else random uniform permutation of [s]
Tb ← n12 ∥ si=1 (2 I(πb (i) ≤ s1 ) − 1)Fi ∥22
P
7
8 end
9 // Return rejection probability
10 R ← 1+ number of permuted statistics (Tb )B b=1 smaller than T0 if ties are broken at random
11 return pα = min(1, max(0, R−(1−α)(B+1)))

You might also like