2502.07672v1
2502.07672v1
2502.07672v1
ever, standard permutation tests are also expensive, requiring a test statistic to
be computed hundreds or thousands of times to detect a separation between
distributions. In this work, we offer a simple approach to accelerate testing:
group your datapoints into bins and permute only those bins. For U and V-
statistics, we prove that these cheap permutation tests have two remarkable
properties. First, by storing appropriate sufficient statistics, a cheap test can
be run in time comparable to evaluating a single test statistic. Second, cheap
permutation power closely approximates standard permutation power. As a
result, cheap tests inherit the exact false positive control and minimax opti-
mality of standard permutation tests while running in a fraction of the time.
We complement these findings with improved power guarantees for stan-
dard permutation testing and experiments demonstrating the benefits of cheap
permutations over standard maximum mean discrepancy (MMD), Hilbert-
Schmidt independence criterion (HSIC), random Fourier feature, Wilcoxon-
Mann-Whitney, cross-MMD, and cross-HSIC tests.
1. Introduction. Permutation tests [15, 14] are commonly used in statistics, machine
learning, and the sciences to test for independence, detect distributional discrepancies, eval-
uate fairness, and assess model quality [17, 38, 12, 29]. By simply permuting the labels
assigned to datapoints, these tests offer both exact, non-asymptotic control over false posi-
tives [22] and minimax optimal error rates when paired with popular U and V test statistics
[25, 4, 41, 42].
However, standard permutation tests are also expensive, requiring the recomputation of a
test statistic hundreds or thousands of times. This work offers a simple remedy: divide your
data into s bins and permute the bin labels. In Sec. 4 we show that this cheap permutation
strategy enjoys the best known error rates for U and V-statistic permutation testing, even when
the bin count s is independent of the sample size n. Moreover, we establish in Sec. 3 that,
after computing the initial test statistic, the same cheap tests can be run in time independent
of n by keeping track of simple sufficient statistics. Together, these findings give rise to new
exact tests that inherit the minimax rate optimality of standard permutation testing while
running in time comparable to the cost of a single test statistic.
The remainder of the paper is organized as follows. After reviewing related work in
Sec. 1.1 and hypothesis testing background in Sec. 2, we formally define our cheap per-
mutation tests in Sec. 3. Sec. 4 analyzes the power of binned permutation testing with U and
V-statistics, providing both improved guarantees for standard permutation tests and compa-
rable explicit finite-sample guarantees for cheap tests. In Sec. 5, we build on these results to
1
2
establish the minimax rate optimality of cheap testing for a variety of standard testing prob-
lems. We complement this theory in Sec. 6 with experiments demonstrating the benefits of
cheap testing over standard maximum mean discrepancy (MMD), Hilbert-Schmidt indepen-
dence criterion (HSIC), random Fourier feature, Wilcoxon-Mann-Whitney, cross-MMD, and
cross-HSIC tests. Finally, we conclude with a discussion of several extensions of the cheap
testing framework in Sec. 7.
1.2. Notation. Given a permutation π of the indices [n] ≜ {1, . . . , n} and an index i ∈
[n], we write either π(i) or πi to denote the index to which i is mapped. For integers p ≥ q ≥
1, we define (p)q ≜ p(p − 1) · · · (p − q + 1) and let ipq denote the set of all q -tuples that can be
drawn without replacement from the set [p]. For a function g integrable under a probability
measure P, we use the notation Pg ≜ EX∼P [g(X)]. We use I for the indicator function.
the null when the null is true, and its power, its probability of (correctly) rejecting H0 when
H1 is true. A test often comes paired with a nominal level, a desired upper bound α ∈ (0, 1)
on the Type I error. We say ∆ is a valid level-α test if its size never exceeds α when H0
holds and an exact level-α test if its size exactly equals α whenever H0 holds. All of the tests
studied in this work are exact and take the form
1
if T (X) > c(X),
(1) ∆(X) = ξ(X) if T (X) = c(X), and
0 if T (X) < c(X),
where T (X) is called the test statistic, c(X) is called the critical value, and ξ(X) is a rejection
probability chosen to ensure exactness. We will consider two broad classes of tests in this
work: tests of homogeneity and tests of independence.
2.1. Homogeneity testing. Homogeneity tests (also known as two-sample tests) test for
equality of distribution. More precisely, given a sample Y = (Yi )ni=1
1
drawn i.i.d. from an un-
n2
known distribution P and an independent sample Z = (Zj )j=1 drawn i.i.d. from an unknown
distribution Q, a homogeneity test judges the hypotheses
H0 : P = Q versus H1 : P ̸= Q.
A common approach to testing homogeneity is to threshold a quadratic test statistic (QTS).
D EFINITION 1 (Quadratic test statistic (QTS)). Given samples Y and Z of size n1 and
n2 respectively, we call T (Y, Z) a quadratic test statistic, if T is a quadratic form of the em-
b≜ 1 P 1 P
pirical distributions P y∈Y δy and Q ≜ n2 z∈Z δz . That is, there exist base functions
b
n1
ϕn1 , ϕn2 , and ϕn1 ,n2 for which T satisfies
(2) b×P
T (Y, Z) = (P b×Q
b)ϕn1 + (P b ×Q
b )ϕn1 ,n2 + (Q b )ϕn2 .
Notably, any homogeneity U-statistic can be written as a QTS (2) via the mappings
1 g(y,y)+g(z,z)
ϕw (y, z) = 1−1/w (g(y, z) − w+1 ) for w ∈ {n1 , n2 } and ϕn1 ,n2 (y, z) = −2g(y, z).
Moreover, many commonly used test statistics can be recovered as homogeneity U-statistics
with the right choice of g .
E XAMPLE 2 (Energy statistics [3, 50, 44]). When g is the negative Euclidean distance
on Rd , Un1 ,n2 (3) is called an energy statistic. More generally, if g(y, z) = −ρ(y, z) for a
semimetric ρ of negative type [44, Defs. 1-2], then Un1 ,n2 is a generalized energy statistic
and an unbiased estimate of the energy distance between P and Q:
Eρ (P, Q) = 2(P × Q)ρ − (P × P)ρ − (Q × Q)ρ.
In both of these examples, the homogeneity U-statistic (3) is degenerate under the null with
(n1 + n2 ) Un1 ,n2 converging to a complicated, P-dependent, non-Gaussian limit as n1 , n2 →
∞ [45, Sec. 5.5.2]. As a result, one typically turns to permutation testing, as described in
Sec. 3.1, to select the critical values for these statistics [49, 30, 50, 3].
2.2. Independence testing. Independence tests test for dependence between paired vari-
ables. More precisely, given a paired sample (Y, Z) ≜ (Yi , Zi )ni=1 drawn i.i.d. from an un-
known distribution Π with marginals P and Q, independence testing judges the hypotheses
H0 : Π = P × Q versus H1 : Π ̸= P × Q.
A common approach to testing independence is to threshold an independence V-statistic.
Many commonly used test statistics can be recovered as independence V-statistics with the
right choices of gY and gZ .
In both of these examples, the independence V-statistic (6) is degenerate under the null
with nVn converging to a complicated, P and Q-dependent, non-Gaussian limit as n → ∞
[45, Sec. 5.5.2]. As a result, one typically turns to permutation testing, for example, as de-
scribed in Sec. 3.3, to select critical values for these statistics [51, 10].
3.1. Permutation tests of homogeneity. Let X ≜ (Xi )ni=1 = (Y1 , . . . , Yn1 , Z1 , . . . , Zn2 )
represent a concatenation of the observed samples Y = (Yi )ni=1 1
and Z = (Zi )ni=1
2
with
n ≜ n1 + n2 . In a standard permutation test of homogeneity, one samples B independent
permutations (πb )Bb=1 of the indices [n], computes the permuted test statistics (T (X ))b=0
πb B
π n
for π0 the identity permutation and X ≜ (Xπ(i) )i=1 , and finally selects a 1 − α quantile of
1 PB
the empirical distribution B+1 b=0 δT (Xπb ) as the critical test value c(X).
For a quadratic test statistic (Def. 1), this permutation test only depends on the data via the
3n2 sufficient statistics
Sho ≜ ϕ(xi , xj ) | i, j ∈ [n], ϕ ∈ {ϕn1 , ϕn2 , ϕn1 ,n2 } .
Hence, if we let cϕ denote the maximum time required to evaluate a single element of Sho ,
then one can precompute and store the elements of Sho given Θ(cϕ n2 ) time and Θ(n2 ) mem-
ory and, thereafter, compute all permuted test statistics using Θ(Bn2 ) elementary operations.
When the quadratic memory cost is unsupportable, the elements of Sho can instead be re-
computed for each permutation at a total cost of Θ(cϕ Bn2 ) time and Θ(n) memory.
3.2. Cheap permutation tests of homogeneity. To alleviate both the time and memory
burden of standard permutation testing, we propose to permute datapoint bins instead indi-
vidual points. More precisely, we partition the indices [n] into s consecutive sets I1 , . . . , Is
of equal size, identify the corresponding datapoint bins Xj ≜ (Xi )i∈Ij , and define the bin-
permuted samples Yπ ≜ (Xπ(i) )si=1 1
and Zπ ≜ (Xπ(i) )ss1 +1 for each permutation π on [s] and
s1 ≜ s nn1 . Importantly, the bin-permuted QTS,
n1 ,n2
T (Yπ , Zπ ) = n12 si,j=1 Φnij1 + n11n2 si=1
P 1 Ps
+ n12 si,j=s1 +1 Φnij2 ,
P1 P
1 j=s1 +1 Φij 2
Hence, one can precompute and store the elements of Sho,cheap using only Θ(s2 ) memory
and Θ(cϕ n2 ) time and subsequently compute all bin-permuted statistics using only Θ(Bs2 )
elementary operations. Alg. 1 details the steps of this cheap homogeneity test which recovers
the standard permutation test when s = n.
Notably, the overhead of cheap permutation is negligible whenever Bs2 ≪ cϕ n2 . In Secs. 4
and 6, we will see that it suffices to choose s as a slow-growing or even constant function of
n, yielding total runtime comparable to that of evaluating the original test statistic.
3.3. Wild bootstrap tests of independence. To ease notation, we will focus on the case
of even n. In a standard wild bootstrap permutation test of independence [10], one begins
by sampling B independent wild bootstrap permutations (πb )B b=1 of the indices [n]. Each
wild bootstrap permutation π swaps the indices i and swap(i) ≜ i + n2 − nI i > n2 with
probability 12 and does so independently for each i ≤ n/2. Then, one computes the permuted
test statistics (T (Y, Zπb ))B π n
b=0 for π0 the identity permutation and Z ≜ (Zπ(i) )i=1 and finally
1 P B
selects a 1 − α quantile of the empirical distribution B+1 b=0 δT (Y,Zπb ) as the critical test
value c(Y, Z).
Now consider the independence V-statistic V (Y, Z) of Def. 3, and introduce the shorthand
(ḡY,i , ḡZ,i , ḡY , ḡZ ) ≜ ( n1 nj=1 gY (Yi , Yj ), n1 nj=1 gZ (Zi , Zj ), n1 ni=1 ḡY,i , n1 ni=1 ḡZ,i )
P P P P
and Gij,kℓ ≜ (gY (Yi , Yj ) − ḡY,i − ḡY,j + ḡY )(gZ (Zk , Zℓ ) − ḡZ,k − ḡZ,ℓ + ḡZ )
6
TABLE 1
Time and memory costs for cheap and standard permutation tests with n datapoints, B permutations, and s bins.
The multipliers cϕ and cg denote the cost of evaluating a single homogeneity QTS base function (see Def. 1) and
a single independence V-statistic base function (see Def. 3). All costs are reported up to a universal constant.
Permutation test Test statistic T (Y, Z) T (Y, Z) time Permutation time Memory
Bn2 n2
Homogeneity QTS (Def. 1) cϕ n2
cϕ Bn2 n
Cheap homogeneity (Alg. 1) QTS (Def. 1) cϕ n2 Bs2 s2
Bn2 n2
Independence V-statistic (Def. 3) cg n2
cg Bn2 n
Cheap independence (Alg. 2) V-statistic (Def. 3) cg n2 Bs2 s2
for all indices i, j, k, ℓ ∈ [n]. With V (Y, Z) as its test statistic, the standard wild bootstrap test
only depends on the data via the n2 /2 sufficient statistics
P
S1 ≜ a,b∈{i,swap(i)} (Gab,ba , Gab,swap(b)swap(a) ) | i ∈ [n/2] and
P
S2 ≜ a∈{i,swap(i)},b∈{j,swap(j)} (Gab,ba , Gab,swap(b)swap(a) ),
P
a∈{i,swap(i)},b∈{j,swap(j)} (Gab,bswap(a) , Gab,swap(b)a ) | j < i ∈ [n/2] .
since, for each wild bootstrap permutation π and j ∈ [n], we have π(j) ∈ {j, swap(j)} and
Pn/2 P
4V (Y, Zπ ) = n12 i=1 a,b∈{i,swap(i)} Gab,π(a)π(b)
Pn/2 P
+ n22 i=1 i−1
P
j=1 a∈{i,swap(i)},b∈{j,swap(j)} Gab,π(a)π(b) .
Hence, if we let cg denote the maximum time required to evaluate gY or gZ , then one can
precompute and store the elements of S1 ∪ S2 given Θ(cg n2 ) time and Θ(n2 ) memory and,
thereafter, compute all permuted test statistics using Θ(Bn2 ) elementary operations. When
the quadratic memory cost is unsupportable, the elements of S2 can instead be recomputed
for each permutation at a total cost of Θ(cg Bn2 ) time and Θ(n) memory.
CHEAP PERMUTATION TESTING 7
3.4. Cheap wild bootstrap tests of independence. To alleviate both the time and memory
burden of standard wild bootstrap testing, we again propose to permute datapoint bins instead
of individual points. To simplify our presentation, we will assume that s is even and divides
n evenly. As before, we partition the indices [n] into s consecutive sets I1 , . . . , Is of equal
size, identify the corresponding datapoint bins Zj ≜ (Zi )i∈Ij , and define the bin-permuted
sample Zπ ≜ (Zπ(i) )si=1 for each wild bootstrap permutation π on [s]. If we further define the
combination bins Ii′ ≜ Ii ∪ Ii+s/2 for each i ∈ [s/2], then the bin-permuted V-statistic
Ps/2 Ps/2 P
4V (Y, Zπ ) = n12 i=1 Diπ(i) + n22 i=1 i−1 j=1 Sijπ(j)π(i)
P
(Sijj(i+s/2) , Sij(j+s/2)i ) ≜ a∈Ii′ , b∈Ij′ (Gab,bswap(a) , Gab,swap(b)a ) | j < i ∈ [s/2] .
Hence, one can precompute and store the elements of S1′ ∪ S2′ using only Θ(s2 ) memory
and Θ(cg n2 ) time and subsequently compute all bin-permuted statistics using only Θ(Bs2 )
elementary operations. Alg. 2 details the steps of this cheap independence test which recovers
the standard wild bootstrap permutation test when s = n.
As in the homogeneity setting, we find that the overhead of cheap permutation is negligible
whenever Bs2 ≪ cϕ n2 , and we will see in Secs. 4 and 6 that a slow-growing or constant
setting of s suffices to maintain power while substantially reducing runtime.
3.5. Finite-sample exactness. We conclude this section by noting that the cheap permu-
tation tests of Algs. 1 and 2 are exact, that is, their sizes exactly match the nominal level α for
all sample sizes and data distributions. This is a principal advantage of permutation testing
over asymptotic tests that (a) require precise knowledge of a test statistic’s asymptotic null
distribution and (b) can violate the level constraint at any finite sample size. The exactness
follows immediately from the fact that cheap permutation tests are standard permutation tests
over bins of datapoints and a standard exchangeability argument due to Hoeffding [22].
P ROPOSITION 1 (Cheap permutation tests are exact). Under their respective null hy-
potheses, the cheap permutation tests of Algs. 1 and 2 reject the null with probability α.
4. Power of cheap testing. We now turn our attention to the power of cheap testing with
homogeneity U-statistics (Def. 2) and independence V-statistics (Def. 3).
4.1. Power guarantees for homogeneity testing. For the homogeneity U-statistic of
Def. 2, define the symmetrized version of hho ,
1 P P
(7) hho (y1 , y2 ; z1 , z2 ) ≜ 2!2! (i1 ,i2 )∈i22 (j1 ,j2 )∈i22 hho (yi1 , yi2 ; zj1 , zj2 ),
and the variance components
ψY,1 ≜ max{Var(E[hho (Y1 , Y2 ; Z1 , Z2 )|Y1 ]), Var(E[hho (Z3 , Y2 ; Z1 , Z2 )|Z3 ])},
(8) ψZ,1 ≜ max{Var(E[hho (Y1 , Y2 ; Z1 , Z2 )|Z1 ]), Var(E[hho (Y1 , Y2 ; Y3 , Z2 )|Y3 ])}, and
ψY Z,2 ≜ max{E[g 2 (Y1 , Y2 )], E[g 2 (Y1 , Z1 )], E[g 2 (Z1 , Z2 )]}.
Our first theorem, proved in App. B, bounds the minimum separation between P and Q
required for standard and cheap homogeneity tests to reject with power at least 1 − β . Here,
separation is measured by the mean of the test statistic E[Un1 ,n2 ], which is zero under the
null (P = Q) but may be non-zero under the alternative (P ̸= Q).
8
α β 1/⌊α(B+1)⌋
for α⋆ ≜ 2e
3 . A cheap permutation test (Alg. 1 with 4 ≤ s < n) based on Un1 ,n2
has power at least 1 − β whenever
q
201s2
γn1 ,n2 + 4n2
ψY Z,2 + 12s max(ψY,1 ,ψZ,1 )
(10) E[Un1 ,n2 ] ≥ γn1 ,n2 ,s ≜ γn1 ,n2 + q
βα⋆
n
s 24(1−α⋆ ) ρn1 n2 −1
3
R EMARK 1q (Equivalent separation rates and thresholds). We have γn1 ,n2 ,s = Θ(γn1 ,n2 )
⋆ )(1+Ω(1))
whenever s ≥ 3 24(1−α
βα⋆ ρn n and γn1 ,n2 ,s = γn1 ,n2 (1 + o(1)) whenever s = ω( nn12 ).
1 2
The separation threshold γn1 ,n2 for standard permutation testing is a refined, quantitative
version of the one derived in [25, Thm. 4.1]. Here, we keep track of all numerical factors
and, thanks to the quantile comparison method introduced in Sec. 4.3, establish a tighter
dependency on the number of permutations B . The separation threshold γn1 ,n2 ,s for cheap
permutation testing is entirely new and implies two remarkable properties summarized in
Rem. 1. First, the cheap and standard thresholds are asymptotically equivalent (in the strong
CHEAP PERMUTATION TESTING 9
sense that γn1 ,n2 ,s /γn1 ,n2 = 1 + o(1)) whenever s = ω( nn21 ). For n1 = Θ(n2 ), this occurs
whenever s grows unboundedly with n, even if the growth rate is arbitrarily slow. Second,
q and standard separation rates are identical (i.e., γn1 ,n2 ,s = Θ(γn1 ,n2 )) whenever
the cheap
⋆
s ≥ 3 24(1−α )(1+Ω(1))
βα⋆ ρn1 n2 . Hence, when n1 = Θ(n2 ), cheap testing can match the separation
rate of standard permutation testing using a constant number of bins independent of the
sample size n.
When the base function g is a positive-definite kernel k (Ex. 1), the expected test statistic
is nonnegative and equal to the squared maximum mean discrepancy p (MMD, (4)). In this
case, Thm. 1 also yields a detectable separation
p threshold of order 1/n1 + 1/n2 for the
root mean test statistic, MMDk (P, Q) = E[Un1 ,n2 ].
C OROLLARY 1 (Power of cheap homogeneity: finite variance, PD). Under the assump-
tions of Thm. 1 and notation of Ex. 1, suppose that g is a positive-definite kernel k with
ξQ , ξP < ∞ for ξµ ≜ max(E[MMD2k (δY1 , µ)], E[MMD2k (δZ1 , µ)]).
A standard permutation test based on Un1 ,n2 has power at least 1 − β whenever
q q 2 2
4ξQ 4 1−α⋆ (36n1 +36n2 +198n1 n2 )ψY Z,2
(11) MMDk (P, Q) ≥ ϵn1 ,n2 ≜ 3−β β n1 + 4ξP
n2 + α⋆ βn1 (n1 −1)n2 (n2 −1) .
A cheap permutation test based on Un1 ,n2 has power at least 1 − β whenever
√ 12s r
201βα⋆ s5 ρn n
ϵn1 ,n2 + n
max(ξQ ,ξP )+ 4 96(1−α⋆ )n12 2 ψY Z,2
(12) MMDk (P, Q) ≥ ϵn1 ,n2 ,s ≜ ϵn1 ,n2 + q
βα⋆
.
s3 24(1−α ⋆ ) ρn1 n2 −1
R EMARK 2q (Equivalent separation rates and thresholds). We have ϵn1 ,n2 ,s = Θ(ϵn1 ,n2 )
⋆
3 24(1−α )(1+Ω(1))
whenever s ≥ βα⋆ ρn n and ϵn1 ,n2 ,s = ϵn1 ,n2 (1 + o(1)) whenever s = ω( nn21 ).
1 2
The proof of Cor. 1 can be found in App. C. Finally, we establish improved power guar-
antees for cheap testing with only logarithmic dependence on β and α⋆ under a commonly-
satisfied sub-Gaussianity assumption.
√ √
40 log( α2⋆ )(Ã2 +B̃( β9 )) 2( 2Ã2 +B̃( β9 ))
(13) + √
3s1 n1
+ √
s1 n 1 for
q q
Ãp ≜ p2 e log( 1δ )(σP2 +σQ
2 ), Λ̃ (δ) ≜ 2e log( 1δ ).
p p
Mp,P +Mp,Q , B̃(δ) ≜ 8 µ 2M2,µ +8σµ
10
4.2. Power guarantees for independence testing. For the independence V-statistic of
Def. 3, we introduce the shorthand Xi ≜ (Yi , Zi ) and define a symmetrized version of hin ,
hin (x1 , x2 , x3 , x4 ) ≜ 4!1 (i1 ,i2 ,i3 ,i4 )∈i44 hin (xi1 , xi2 , xi3 , xi4 ),
P
where π is an independent wild bootstrap permutation of [n], and the mean component
(15) ξ˜ ≜ max{|E[hin ((Yi1 , Zi′1 ), (Yi2 , Zi′2 ), (Yi3 , Zi′3 ), (Yi4 , Zi′4 ))]| : ik , i′k ∈ [8], ∀k ∈ [4]}.
Our next theorem, proved in App. E, bounds the minimum separation between Π and
P × Q required for standard and cheap independence tests to reject with power at least 1 − β .
Here, separation is measured by
U ≜ E[hin (X1 , X2 , X3 , X4 )],
the expected U-statistic corresponding to the V-statistic Vn . Notably, U is zero under the null
(Π = P × Q) but may be non-zero under the alternative (Π ̸= P × Q).
A cheap wild bootstrap test (Alg. 2 with s < n) based on Vn has power at least 1−β whenever
q
s+1 −1 4 12
′ 2304 460800
(17) U ≥ γn,s ≜ 1 − 3s2 γn + 3 α⋆ β − 2 ψ̃1 ns + ns2 + ψ2′ 4147200
n2 s .
R EMARK 4 (Equivalent separation rates and thresholds). We have γn,s = Θ(γn ) for any
choice of s and γn,s = γn (1 + o(1)) whenever s = ω(1).
new and implies two remarkable properties summarized in Rem. 4. First, the cheap and stan-
dard thresholds are again asymptotically equivalent whenever s = ω(1). Second, the cheap
and standard separation rates are identical whatever the choice of s, even if s = 2. Hence,
cheap independence testing can match the separation rate of standard testing using a constant
number of bins independent of the sample size n.
When the base functions (gY , gZ ) are positive-definite kernels (Ex. 3), the test statistic
1 2 b b
4 n is equal to the squared sample MMD, MMDk (Π, P × Q). In this case, the proof of
V b
1
Thm. 3 also yields a detectable separation threshold of order √n for the population parameter,
MMDk (Π, P × Q).
C OROLLARY 2 (Power of cheap independence: finite variance, PD). Under the assump-
tions of Thm. 3 and notation of Ex. 3, suppose that n ≥ 14 and that gY , gZ are positive-
definite kernels with k = gY gZ , ξ ≜ E[hin (X1 , X1 , X3 , X4 )], and
δ +ν δY ×Q+P×δZ
ξ˜wild ≜ max E(Y,Z)∼µ MMD2k (Y,Z)
2 , 2 : µ, ν ∈ {Π, P × Q} .
A standard wild bootstrap test based on Vn has power at least 1 − β whenever
√ √
4
√
(12(α⋆ β)−1 −2)(32 ξ+256 ξ̃wild ) (12(α⋆ β)−1 −2)960 ψ2′ + 10ξ̃
(18) MMDk (Π, P × Q) ≥ ϵn ≜ √ −1 32 √
+ √ −1 32 1/2 √
,
(3− ⋆
(12(α β) −2) n
) n ⋆(3− (12(α β) −2) n
) n
R EMARK 5 (Equivalent MMD separation rates and thresholds). We have ϵn,s = Θ(ϵn )
for any choice of s and ϵn,s = ϵn (1 + o(1)) whenever s = ω(1).
The proof of Cor. 2 can be found in App. F. Finally, we establish improved power guaran-
tees for cheap independence testing with only logarithmic dependence on β and α⋆ whenever
k is bounded.
4.3. Quantile comparison method. At the heart of our power guarantees is a quantile
comparison method (Prop. 2) suitable for bounding the power of Monte Carlo permutation
tests. Related results have appeared in the literature under the name two moments method, as
12
they control quantiles using the first two moments of the test statistic. This approach to as-
sessing non-asymptotic power was pioneered by Fromont, Laurent and Reynaud-Bouret [16]
in the context of homogeneity testing and later developed by Kim, Balakrishnan and Wasser-
man [25], Schrab et al. [41], Domingo-Enrich, Dwivedi and Mackey [13], among others.
Earlier guarantees due to [25, 41] placed undue constraints on the number of permutations
B . For example, Prop. D.1 of [25] requires B ≥ α82 log( β4 ), while Thm. 5 of [41] requires
B ≥ α32 (log( β8 ) + α(1 − α)). Here we adapt a tighter argument developed for homogeneity
testing with a bounded positive-definite kernel [13, Lem. 6] and extend it to all forms of
permutation test and, more generally, to any test that thresholds using the quantiles of i.i.d.
perturbations. The result requires only B ≥ α1 − 1 permutations and is proved in App. H.
When the test statistic T (X) has finite variance and the auxiliary variable T1 has almost-
surely finite conditional variance, Cantelli’s inequality yields the following corollary, a re-
finement of the two moments method of [25, Prop. D.1]. See App. I for the proof.
C OROLLARY 3 (Refined two moments method). If Var(T (X)), E[Var(T1 | X)] < ∞,
then the following functions satisfy the requirements (21) to (23) of Prop. 2 for T = E[T (X)]:
q q
Φn (β) = ( β1 − 1)Var(T (X)), ΨX (α) = ( α1 − 1)Var(T1 | X) + E[T1 | X],
q
1
(24) Ψn (α, β) = ( αβ − β1 )E[Var(T1 | X)] when E[T1 | X] = 0 almost surely, and
q q
2
(25) Ψn (α, β) = ( αβ − β2 )E[Var(T1 | X)] + E[T1 ] + ( β2 − 1)Var(E[T1 | X]) otherwise.
5. Minimax optimality of cheap testing. In this section, we apply the power results of
Sec. 4 to establish the minimax rate optimality of cheap testing in a variety of settings. Be-
low, for discrete and absolutelyR continuous signed measures µ, we make use of the standard
L2 norm on densities, ∥µ∥22 ≜ ( dµ 2
dν (x)) dν(x), where ν is the counting measure if µ is sup-
ported on a discrete set X and ν is the Lebesgue measure when µ has a Lebesgue density.
Note that
(P
(p(x) − q(x))2 if P, Q have probability mass functions p, q on a set X
∥P − Q∥22 = R x∈X
(p(x) − q(x))2 dx if P, Q have Lebesgue densities p, q .
We also define the Hölder-space ball,
τ −⌊τ ⌋
Hτd (L) ≜ f : [0, 1]d 7→ R |f (⌊τ ⌋) (x) − f (⌊τ ⌋) (x′ )| ≤ L∥x − x′ ∥2 ∀x, x′ ∈ [0, 1]d ,
,
′
(26) |f (τ ) (x)| ≤ L, ∀x ∈ [0, 1]d , τ ′ ∈ {1, . . . , ⌊τ ⌋} .
CHEAP PERMUTATION TESTING 13
5.1. Discrete homogeneity testing. Our first result, proved in App. J.1, shows that cheap
homogeneity testing is minimax rate optimal for detecting discrete L2 separations.
5.2. Hölder homogeneity testing. Our second result, proved in App. J.2, shows that
cheap homogeneity testing is minimax rate optimal for detecting Hölder L2 separations. The
test statistic, based on discretization of the unit cube, is the same used in [25, Prop. 4.6].
5.3. Discrete independence testing. Our third result, proved in App. J.3, shows that
cheap independence testing is minimax rate optimal for detecting discrete L2 dependence.
5.4. Hölder independence testing. Our final result, proved in App. J.4, shows that cheap
independence testing is minimax rate optimal for detecting Hölder L2 dependence. The
test statistic base functions, based discretization of the unit cube, are the same used in [25,
Prop. 5.5].
Cheap s = 128 s = 32
Cheap s = 32 0.72
0.8
Cheap s = 8
n = 16384 n = 16384 s = 16
0.70
Power
0.66
0.62
n = 4096 n = 4096
0.2 s=8
n = 2048 n = 2048 0.60
Cheap n = 16384
F IG 1. Power vs. runtime for cheap and standard MMD tests of homogeneity as total sample size n = n1 + n2
and bin count s vary.
3. WMW TESTING: P = N (−0.015, 1), Q = N (0.015, 1), and our test statistic is the popu-
lar Wilcoxon-Mann-Whitney (WMW) test statistic of Ex. 5.
sumed to have no elements in common. This statistic may be recast into the U-statistic form
(3) by selecting
g(y, z) = − 21 I(y < z) − 14 I(y = z).
6.1.1. Cheap vs. standard MMD testing. Fig. 1 (left) compares cheap MMD testing to
standard MMD permutation testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 32 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 100 times faster than the standard permutation test.
Fig. 1 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s
and fixed n = 16384. We find that runtime remains essentially constant for s ≤ 128 and that
power remains essentially constant for s ≥ 128.
Power
Power
n = 16384
n = 12288
0.4 0.4 n = 8192
n = 8192
n = 6144
n = 8192 n = 6144
n = 6144
n = 4096
0.2 0.2 n = 4096
n = 4096
2048 4096 6144 8192 12288 16384 24576 32768 0.01 0.1 1 10 100 1000
Sample size n Total computation time (s)
F IG 2. Cheap MMD vs. asymptotic cross-MMD tests of homogeneity. (Left) Power as a function of the total
sample size n = n1 + n2 . (Right) Time-power trade-off curves as n and bin count s vary.
0.75
r = 2048
r = 512
0.70
r = 128
0.65
Power
0.1 1 10 100
Total computation time (s)
F IG 3. Power vs. runtime for cheap and standard RFF tests of homogeneity as the bin count s and number of
random Fourier features r vary. Here, the total sample size is n = 16384.
6.1.3. Cheap vs. standard RFF testing. Fig. 3 compares cheap RFF testing to standard
RFF permutation testing as a function of the number of random Fourier features r and bin
count s. We find that s = 256 bins suffice to closely match the power of standard testing for
all power levels. With this setting, cheap testing is also 100 to 1000 times faster than the
standard permutation test, depending on the choice of r .
6.1.4. Cheap vs. standard WMW testing. Fig. 4 (left) compares cheap WMW testing to
standard WMW permutation testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 64 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 300 to 900 times faster than the standard permutation
test.
CHEAP PERMUTATION TESTING 17
0.56 s = 16
0.4
Power
Power
n = 8192
0.54
n = 6144 0.52
0.3
s=8
0.50
n = 4096
0.2 0.48
n = 2048
F IG 4. Power vs. runtime for cheap and standard Wilcoxon-Mann-Whitney tests of homogeneity as total sample
size n = n1 + n2 and bin count s vary.
Fig. 4 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s and
fixed n = 16384. We find that runtime remains essentially constant for s ≤ 64 and that power
increases minimally beyond s = 128.
6.2. Independence testing. We additionally benchmark the power and runtime of inde-
pendence V-statistic (Def. 3) tests in the following setting.
4. HSIC TESTING: Π a multivariate Gaussian distribution with mean zero and covariance
I 0.047I
∈ R20×20 .
0.047I I
The base functions gY and gZ are Gaussian kernels kλY and kλZ with bandwidths λY and
λZ equal to the median pairwise distance among Y and Z points respectively.
6.2.1. Cheap vs. standard HSIC testing. Fig. 5 (left) compares cheap HSIC testing to
standard HSIC wild bootstrap testing as a function of the total sample size n = n1 + n2 and
bin count s. We find that s = 128 bins suffice to closely match the power of standard testing
for higher power levels (1 − β ≥ 0.3) and that s = 64 bins suffice for smaller power levels. In
both cases, cheap testing is consistently 15 times faster than the standard wild bootstrap test.
Fig. 5 (right) traces out the power-vs.-runtime trajectory of cheap testing for varying s and
fixed n = 2048. We find that runtime remains essentially constant for s ≤ 128 and that power
increases minimally beyond s = 256.
1.0 n = 4096
0.75
Standard HSIC s = 256 s = 512 s = 1024 Standard
n = 3072
Cheap s = 256
s = 32
Cheap s = 128
Cheap s = 64 0.70
0.8
Cheap s = 16 s = 16
n = 2048
0.65
0.6
Power
Power
n = 1536
0.60
0.4
n = 1024
0.55
0.2
n = 512 0.50 s=8
Cheap n = 2048
F IG 5. Power vs. runtime for cheap and standard HSIC tests of independence as total sample size n = n1 + n2
and bin count s vary.
1.0 n = 4096
1.0 Standard HSIC Standard HSIC n = 4096
n = 3072
Cheap s = 128 Cheap s = 128
n = 2048
0.6 0.6
Power
Power
n = 1536
n = 1536
0.4 0.4
n = 1024
n = 1024
0.2 0.2
n = 512
n = 512
512 1024 1536 2048 3072 4096 0.002 0.005 0.01 0.02 0.05 0.1 0.2 0.5 1 2 5 10
F IG 6. Cheap HSIC vs. asymptotic cross-HSIC tests of independence. (Left) Power as a function of the total
sample size n = n1 + n2 . (Right) Time-power trade-off curves as n and bin count s vary.
divide a standard
√ U or V-statistic by an estimate of its standard error and commonly take
the form U/ V where both U and V are quadratic test statistics in the sense of Def. 1. All
functions of one or more quadratic test statistics are amenable to the binned sufficient statistic
constructions of Sec. 3 and hence can benefit from the speed-ups of cheap testing.
Multisample / K -sample testing. In multisample homogeneity testing one observes K ≥ 2
independent samples drawn i.i.d. from unknown distribution P1 , . . . , PK and seeks to test
whether all K distributions match. Similarly, in multisample independence testing one ob-
serves tuples of K ≥ 2 variables and seeks to test whether the components are jointly inde-
pendent. In each case, one can generalize the cheap permutation tests of Algs. 1 and 2 to
accommodate the multisample U and V-statistics commonly employed [see, e.g., 37, 36, 7].
Higher-order test statistics. Cheap testing is also compatible with polynomial test statistics,
that is, degree p polynomial functionals of the sample empirical distributions P b1 , . . . , P
bK :
P b j1 × · · · × P
T = j1 ,...,jp ∈[K] (P bjp )ϕj1 ,...,jp .
In this case, O(sp ) sufficient statistics would suffice to run a cheap test with O(Bsp ) permu-
tation overhead, a substantial improvement over the standard Θ(Bnp ) permutation overhead.
REFERENCES
[1] A NDERSON , W. N. and V ERBEECK , J. (2023). Exact Permutation and Bootstrap Distribution of General-
ized Pairwise Comparisons Statistics. Mathematics 11.
[2] A RIAS -C ASTRO , E., P ELLETIER , B. and S ALIGRAMA , V. (2018). Remember the curse of dimensionality:
the case of goodness-of-fit testing in arbitrary dimension. Journal of Nonparametric Statistics 30 448–
471.
[3] BARINGHAUS , L. and F RANZ , C. (2004). On a new multivariate two-sample test. Journal of Multivariate
Analysis 88 190-206.
[4] B ERRETT, T. B., KONTOYIANNIS , I. and S AMWORTH , R. J. (2021). Optimal rates for independence testing
via U-statistic permutation tests. The Annals of Statistics 49 2457–2490.
[5] B OUCHERON , S., L UGOSI , G. and M ASSART, P. (2013). Concentration Inequalities: A Nonasymptotic
Theory of Independence. OUP Oxford.
[6] C ANTELLI , F. P. (1929). Sui confini della probabilita. In Atti del Congresso Internazionale dei Matematici:
Bologna del 3 al 10 de settembre di 1928 47–60.
[7] C HUNG , E. and ROMANO , J. P. (2013). Exact and asymptotically robust permutation tests. The Annals of
Statistics 41 484 – 507.
[8] C HUNG , E. and ROMANO , J. P. (2016). Asymptotically valid and exact permutation tests based on two-
sample U-statistics. Journal of Statistical Planning and Inference 168 97–105.
[9] C HUNG , M. K., X IE , L., H UANG , S.-G., WANG , Y., YAN , J. and S HEN , L. (2019). Rapid accelera-
tion of the permutation test via transpositions. In Connectomics in NeuroImaging: Third International
Workshop, CNI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019,
Proceedings 3 42–53. Springer.
[10] C HWIALKOWSKI , K. P., S EJDINOVIC , D. and G RETTON , A. (2014). A wild bootstrap for degenerate kernel
tests. Advances in neural information processing systems 27.
[11] DADUSH , D., G UZMÁN , C. and O LVER , N. (2018). Fast, Deterministic and Sparse Dimensionality Re-
duction. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms.
SODA ’18 1330–1344. Society for Industrial and Applied Mathematics, USA.
[12] D I C ICCIO , C., VASUDEVAN , S., BASU , K., K ENTHAPADI , K. and AGARWAL , D. (2020). Evaluating
fairness using permutation tests. In Proceedings of the 26th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining 1467–1477.
[13] D OMINGO -E NRICH , C., DWIVEDI , R. and M ACKEY, L. (2023). Compress Then Test: Powerful Kernel
Testing in Near-linear Time. In Proceedings of The 26th International Conference on Artificial Intelli-
gence and Statistics. Proceedings of Machine Learning Research 206 1174–1218. PMLR.
[14] DWASS , M. (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathemat-
ical Statistics 181–187.
[15] F ISHER , R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
20
[16] F ROMONT, M., L AURENT, B. and R EYNAUD -B OURET, P. (2013). The two-sample problem for Poisson
processes: Adaptive tests with a nonasymptotic wild bootstrap approach. The Annals of Statistics 41
1431 – 1461.
[17] G OOD , P. (2013). Permutation tests: a practical guide to resampling methods for testing hypotheses.
Springer Science & Business Media.
[18] G RETTON , A., B OUSQUET, O., S MOLA , A. and S CHÖLKOPF, B. (2005). Measuring Statistical Depen-
dence with Hilbert-Schmidt Norms. In Proceedings of the 16th International Conference on Algorith-
mic Learning Theory 63–77. Springer-Verlag, Berlin, Heidelberg.
[19] G RETTON , A., F UKUMIZU , K., T EO , C., S ONG , L., S CHÖLKOPF, B. and S MOLA , A. (2007). A Kernel
Statistical Test of Independence. In Advances in Neural Information Processing Systems 20.
[20] G RETTON , A., B ORGWARDT, K. M., R ASCH , M. J., S CHÖLKOPF, B. and S MOLA , A. (2012). A Kernel
Two-Sample Test. Journal of Machine Learning Research 13 723-773.
[21] H E , H. Y., BASU , K., Z HAO , Q. and OWEN , A. B. (2019). Permutation p-value approximation via gener-
alized Stolarsky invariance. The Annals of Statistics 47 583–611.
[22] H OEFFDING , W. (1952). The Large-Sample Power of Tests Based on Permutations of Observations. The
Annals of Mathematical Statistics 23 169 – 192.
[23] H OEFFDING , W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the
American Statistical Association 58 13–30.
[24] JANSSEN , A. (1997). Studentized permutation tests for non-iid hypotheses and the generalized Behrens-
Fisher problem. Statistics & probability letters 36 9–21.
[25] K IM , I., BALAKRISHNAN , S. and WASSERMAN , L. (2022). Minimax optimality of permutation tests. The
Annals of Statistics 50 225–251.
[26] K NIJNENBURG , T. A., W ESSELS , L. F., R EINDERS , M. J. and S HMULEVICH , I. (2009). Fewer permuta-
tions, more accurate P-values. Bioinformatics 25 i161–i168.
[27] L ARSON , J. L. and OWEN , A. B. (2015). Moment based gene set tests. BMC bioinformatics 16 1–17.
[28] L EE , A. J. (1990). U-Statistics: Theory and Practice. Statistics: A Series of Textbooks and Monographs.
Taylor & Francis.
[29] L INDGREN , F., H ANSEN , B., K ARCHER , W., S JÖSTRÖM , M. and E RIKSSON , L. (1996). Model validation
by permutation tests: applications to variable selection. Journal of Chemometrics 10 521–532.
[30] L IU , F., X U , W., L U , J., Z HANG , G., G RETTON , A. and S UTHERLAND , D. J. (2020). Learning Deep
Kernels for Non-Parametric Two-Sample Tests. In Proceedings of the 37th International Conference
on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Proceedings of Machine Learning
Research 119 6316-6326. PMLR.
[31] M ANN , H. B. and W HITNEY, D. R. (1947). On a Test of Whether one of Two Random Variables is Stochas-
tically Larger than the Other. The Annals of Mathematical Statistics 18 50 – 60.
[32] M ARKOV, A. (1884). On certain applications of algebraic continued fractions. Unpublished Ph. D. thesis,
St Petersburg.
[33] M AURER , A. and P ONTIL , M. (2021). Concentration inequalities under sub-Gaussian and sub-exponential
conditions. In Advances in Neural Information Processing Systems 34 7588–7597. Curran Associates,
Inc.
[34] M OHRI , M., ROSTAMIZADEH , A. and TALWALKAR , A. (2012). Foundations of Machine Learning. The
MIT Press.
[35] N EUHAUS , G. (1993). Conditional rank tests for the two-sample problem under random censorship. The
Annals of Statistics 1760–1779.
[36] PANDA , S., S HEN , C., P ERRY, R., Z ORN , J., L UTZ , A., P RIEBE , C. E. and VOGELSTEIN , J. T. (2025).
Universally consistent K-sample tests via dependence measures. Statistics & Probability Letters 216
110278.
[37] P FISTER , N., B ÜHLMANN , P., S CHÖLKOPF, B. and P ETERS , J. (2018). Kernel-based tests for joint inde-
pendence. Journal of the Royal Statistical Society Series B: Statistical Methodology 80 5–31.
[38] P HIPSON , B. and S MYTH , G. K. (2010). Permutation P-values Should Never Be Zero: Calculating Exact P-
values When Permutations Are Randomly Drawn. Statistical Applications in Genetics and Molecular
Biology 9.
[39] R AHIMI , A. and R ECHT, B. (2008). Random Features for Large-Scale Kernel Machines. In Advances in
Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. T. Roweis, eds.)
1177–1184. Curran Associates, Inc.
[40] R AMDAS , A., BARBER , R. F., C ANDÈS , E. J. and T IBSHIRANI , R. J. (2023). Permutation tests using
arbitrary permutation distributions. Sankhya A 1–22.
[41] S CHRAB , A., K IM , I., A LBERT, M., L AURENT, B., G UEDJ , B. and G RETTON , A. (2021). MMD Aggre-
gated Two-Sample Test.
CHEAP PERMUTATION TESTING 21
[42] S CHRAB , A., K IM , I., G UEDJ , B. and G RETTON , A. (2022). Efficient Aggregated Kernel Tests using
Incomplete U -statistics.
[43] S EGAL , B. D., B RAUN , T., E LLIOTT, M. R. and J IANG , H. (2018). Fast approximation of small p-values
in permutation tests by partitioning the permutations. Biometrics 74 196–206.
[44] S EJDINOVIC , D., S RIPERUMBUDUR , B., G RETTON , A. and F UKUMIZU , K. (2013). Equivalence of
distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 2263–2291.
[45] S ERFLING , R. (2009). Approximation Theorems of Mathematical Statistics 162. John Wiley & Sons.
[46] S HEKHAR , S., K IM , I. and R AMDAS , A. (2022). A permutation-free kernel two-sample test. In Advances
in Neural Information Processing Systems.
[47] S HEKHAR , S., K IM , I. and R AMDAS , A. (2023). A Permutation-Free Kernel Independence Test. Journal
of Machine Learning Research 24 1–68.
[48] S TEINWART, I. and C HRISTMANN , A. (2008). Support vector machines. Springer Science & Business
Media.
[49] S UTHERLAND , D. J., T UNG , H.-Y., S TRATHMANN , H., D E , S., R AMDAS , A., S MOLA , A. and G RET-
TON , A. (2017). Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy.
In International Conference on Learning Representations.
[50] S ZEKELY, G. and R IZZO , M. (2004). Testing for equal distributions in high dimension. InterStat 5.
[51] S ZÉKELY, G. J., R IZZO , M. L. and BAKIROV, N. K. (2007). Measuring and testing dependence by corre-
lation of distances. The Annals of Statistics 35 2769 – 2794.
[52] W ILCOXON , F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin 1 80–83.
[53] Z HOU , C., WANG , H. and WANG , Y. (2009). Efficient moments-based permutation tests. Advances in
neural information processing systems 22.
[54] Z MIGROD , R., V IEIRA , T. and C OTTERELL , R. (2022). Exact Paired-Permutation Testing for Structured
Test Statistics. In Proceedings of the 2022 Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Technologies 4894–4902.
22
SUPPLEMENTARY MATERIAL
Supplement to “Cheap Permutation Testing”
This supplement contains proofs of all results as well as supplementary experiment details.
A Additional notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
A.1 Additional notation for cheap homogeneity testing (Alg. 1) . . . . . . . . . . 23
A.2 Additional notation for cheap independence testing (Alg. 2) . . . . . . . . . 23
B Proof of Thm. 1: Power of cheap homogeneity: finite variance . . . . . . . . . . . 24
B.1 Proof of Lem. B.1: Variance of homogeneity U-statistics . . . . . . . . . . . 25
B.2 Proof of Lem. B.2: Conditional variance of permuted homogeneity U-statistics 25
C Proof of Cor. 1: Power of cheap homogeneity: finite variance, PD . . . . . . . . . 31
D Proof of Thm. 2: Power of cheap homogeneity: sub-Gaussian, PD . . . . . . . . . 32
D.1 Bounding the fluctuations of the test statistic . . . . . . . . . . . . . . . . . . 32
D.2 Bounding the fluctuations of the threshold . . . . . . . . . . . . . . . . . . . 33
D.3 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
D.4 Proof of Lem. D.2: Sub-Gaussian homogeneity quantile bounds . . . . . . . 35
D.5 Proof of Lem. D.4: High-probability bounds on sums of MMD powers . . . . 36
D.6 Proof of Lem. D.5: Sub-Gaussian quantile bound for permutation threshold . 37
E Proof of Thm. 3: Power of cheap independence: finite variance . . . . . . . . . . . 41
E.1 Proof of Lem. E.1: Variance of independence V-statistics . . . . . . . . . . . 43
E.2 Proof of Lem. E.2: Conditional variance of permuted independence V-statistics 43
E.3 Proof of Lem. E.3: Mean of permuted independence V-statistics . . . . . . . 47
F Proof of Cor. 2: Power of cheap independence: finite variance, PD . . . . . . . . . 50
F.1 Proof of Lem. F.1: E[Vn ] bound . . . . . . . . . . . . . . . . . . . . . . . . 51
F.2 Proof of Lem. F.2: ψ1′ bound . . . . . . . . . . . . . . . . . . . . . . . . . . 52
F.3 Proof of Lem. F.3: ψ̃1′ bound . . . . . . . . . . . . . . . . . . . . . . . . . . 52
G Proof of Thm. 4: Power of cheap independence: bounded, PSD . . . . . . . . . . 53
G.1 Bounding the fluctuations of the test statistic . . . . . . . . . . . . . . . . . . 53
G.2 Alternative representation of the independence V-statistic . . . . . . . . . . . 54
G.3 Bounding the fluctuations of the threshold . . . . . . . . . . . . . . . . . . . 55
G.4 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
G.5 Proof of Lem. G.4: Rademacher representation of Ṽnπ,s . . . . . . . . . . . . 57
G.6 Proof of Lem. G.5: A(i) mean bounds . . . . . . . . . . . . . . . . . . . . . 62
G.7 Proof of Lem. G.6: A(i) complementary quantile bounds . . . . . . . . . . . 63
G.8 Proof of Lem. G.7: Quantile bound for independence permutation threshold . 63
H Proof of Prop. 2: Quantile comparison method . . . . . . . . . . . . . . . . . . . . 65
I Proof of Cor. 3: Refined two moments method . . . . . . . . . . . . . . . . . . . . 66
J Proofs of minimax optimality results . . . . . . . . . . . . . . . . . . . . . . . . . 66
J.1 Proof of Prop. 3: Power of cheap testing: discrete L2 homogeneity . . . . . . 66
J.2 Proof of Prop. 4: Power of cheap testing: Hölder L2 homogeneity . . . . . . 67
J.3 Proof of Prop. 5: Power of cheap testing: discrete L2 independence . . . . . . 67
J.4 Proof of Prop. 6: Power of cheap testing: Hölder L2 independence . . . . . . 68
K Supplementary experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . 68
K.1 Selection of the number of permutations B . . . . . . . . . . . . . . . . . . . 68
K.2 Cheap permutation tests for feature-based kernels . . . . . . . . . . . . . . . 69
A.1. Additional notation for cheap homogeneity testing (Alg. 1). Let X be the con-
catenation of Y and Z. We define the number of elements per bin m ≜ ns = n1 +n s , the num-
2
n1 n2 (i)
ber of bins for each sample s1 ≜ m and s2 ≜ m , and the datapoint bins Y = (Yj )j∈Ii and
Z(i) = (Zj )j∈Ii .
For for the homogeneity U-statistic of Def. 2, we define the bin kernel
(27) Hho,m (Y(i) , Y(j) ; Z(i) , Z(j) ) = m14 y∈Y(i) ,y′ ∈Y(j) ,z∈Z(i) ,z ′ ∈Z(j) hho (y, y ′ ; z, z ′ ).
P
A.2. Additional notation for cheap independence testing (Alg. 2). We define the num-
ber of elements per bin m ≜ ns and the datapoint bins Y(i) = (Yj )j∈Ii , Z(i) = (Zj )j∈Ii , and
X(i) = ((Yj , Zj ))j∈Ii . When convenient, we will also refer to the m elements of these bins
(i) (i) m (i) m
as (Yj )m j=1 , (Zj )j=1 , and (Xj )j=1 .
For the independence V-statistic of Def. 3, we define the bin kernel Hin,m as
(1) (2) (3) (4)
Hin,m (X(1) , X(2) , X(3) , X(4) ) = m14 m
P
i1 ,i2 ,i3 ,i4 =1 hin (Xi1 , Xi2 , Xi3 , Xi4 )
where, for
g((y, z), (y ′ , z ′ )) ≜ gY (y, y ′ )gZ (x, z ′ ),
we can write
g((y, z), (y ′ , z ′ )) d(δ(Y1 ,Z1 ) + δ(Y4 ,Z4 ) − δ(Y1 ,Z4 ) − δ(Y4 ,Z1 ) )(y, z)
RR
hin (X1 , X2 , X3 , X4 ) =
d(δ(Y2 ,Z2 ) + δ(Y3 ,Z3 ) − δ(Y2 ,Z3 ) − δ(Y3 ,Z2 ) )(y ′ , z ′ ).
An alternative way to express hin (X1 , X2 , X3 , X4 ) is the following:
hin (X1 , X2 , X3 , X4 ) = hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Z1 , Z2 , Z3 , Z4 ), for
(31) hin,Y (Y1 , Y2 , Y3 , Y4 ) ≜ gY (y1 , y2 ) + gY (y3 , y4 ) − gY (y1 , y3 ) − gY (y2 , y4 ),
hin,Z (Z1 , Z2 , Z3 , Z4 ) ≜ gZ (z1 , z2 ) + gZ (z3 , z4 ) − gZ (z1 , z3 ) − gZ (z2 , z4 ).
24
d(δ(Y (2) ,Z (2) ) + δ(Y (3) ,Z (3) ) − δ(Y (2) ,Z (3) ) − δ(Y (3) ,Z (2) ) )(y ′ , z ′ )
i2 i2 i3 i3 i2 i3 i3 i2
m
= g((y, z), (y ′ , z ′ )) d( m12 i1 ,i4 =1 δ(Y (1) ,Z (1) ) + δ(Y (4) ,Z (4) ) − δY (1) ,Z (4) − δY (4) ,Z (1) )(y, z)
RR P
i1 i1 i4 i4 i1 i4 i4 i1
m
d( m12 i2 ,i3 =1 δ(Y (2) ,Z (2) ) + δ(Y (3) ,Z (3) ) − δ(Y (2) ,Z (3) ) − δ(Y (3) ,Z (2) ) )(y ′ , z ′ )
P
i2 i2 i3 i3 i2 i3 i3 i2
′ ′
RR
= g((y, z), (y , z )) d(δ(Y(1) ,Z(1) ) + δ(Y(4) ,Z(4) ) − δY(1) × δZ(4) − δY(4) × δZ(1) )(y, z)
d(δ(Y(2) ,Z(2) ) + δ(Y(3) ,Z(3) ) − δY(2) × δZ(3) − δY(3) × δZ(2) )(y ′ , z ′ )
For a permutation π of [s], we can also define the cheaply-permuted independence V-statistic,
Vnπ,s ≜ s14 (i1 ,i2 ,i3 ,i4 )∈[s]4 Hin,m (Y(i1 ) , Z(πi1 ) ), (Y(i2 ) , Z(πi2 ) ),
P
(32)
These expressions along with the substitutions s for n and Hin,m for hin help us to frame a
cheap independence V-statistic permutation test as a standard independence V-statistic per-
mutation test operating on bins of datapoints.
pression (28) for Hho,m ensures that E[Unπ,s 1 ,n2 | X] = 0 surely for all s ≥ 4, so, by Cor. 3, it
suffices to bound Var(Un1 ,n2 ) and E[Var(Unπ,s 1 ,n2 | X)]. Lem. B.1, proved in App. B.1, gives
Lem. B.2, proved in App. B.2, provides the conditional variance bounds.
Cor. 3 and Lems. B.1 and B.2 together imply power at least 1 − β for standard testing if
q q
4ψY,1 4ψZ,1 (17n1 n2 +4n22 +4n21 )ψY Z,2 1−α⋆ (6n21 +6n22 +48n1 n2 )ψY Z,2
E[Un1 ,n2 ] ≥ 3−β
β n1 + n2 + n1 (n1 −1)n2 (n2 −1) + α⋆ βn1 (n1 −1)n2 (n2 −1) ,
CHEAP PERMUTATION TESTING 25
a condition implied by our assumption (9), and power at least 1 − β for cheap testing if
q
4ψY,1 4ψZ,1 (17n1 n2 +4n22 +4n21 )ψY Z,2
E[Un1 ,n2 ] ≥ 3−β β n1 + n2 + n1 (n1 −1)n2 (n2 −1)
⋆ −1 2 2 2 2 3
+ (α )β −1 n6n 2 +6n1 +48n1 n2
+ 1200(n1 −1)(n 2 −1)(n1 +n2 )/s+1056n m
(36) 1 (n1 −1)n2 (n2 −1) n1 (n1 −1)2 n2 (n2 −1)2 ψ̃Y Z,2
⋆ −1
q
) −1)n3 m2 max{ψY,1 ,ψZ,1 } 1/2 24((α⋆ )−1 −1)n3 m3 E[Un1 ,n2 ]2
+ 288((α βn 1 (n1 −1) 2 n (n −1)2
2 2
+ βn1 (n1 −1)2 n2 (n2 −1)2 .
We can rewrite (36) equivalently as
q
⋆ )−1 −1)n3 m3 −1 q 3−β (17n1 n2 +4n22 +4n21 )ψY Z,2
E[Un1 ,n2 ] ≥ 1 − βn24((α
1 (n1 −1) n2 (n2 −1)
2 2 β
4ψY,1
n1 + 4ψZ,1
n2 + n1 (n1 −1)n2 (n2 −1)
288((α⋆ )−1 −1)n3 m2 max{ψY,1 ,ψZ,1 } 1/2
+ βn1 (n1 −1)2 n2 (n2 −1)2 .
To conclude, we note that our assumption (10) is a sufficient condition for (37) as
n4 = (n21 + n22 )2 + 4n1 n2 (n21 + n22 ) + 4n21 n22 ≥ 8n1 n2 (n21 + n22 ) ≥ 8(n1 − 1)(n2 − 1)(n21 + n22 ).
2 ≜ Var(E[h (Y , Y ; Z , Z )|Y
σ̂i,j i, j ∈ [2].
ho 1 2 1 2 i+1 , . . . , Y2 , Zj+1 , . . . , Z2 ]) for
Note that for n1 ≥ 4 (and analogously for n2 ≥ 4),
(n1 −2)!2 (n1 −2)(n1 −3) 1
4n1 !(n1 −4)! = 4n1 (n1 −1) ≤ if i = 0
4
(n1 −2)!2 (n1 −2)!2 n1 −2 1
n1 !i!(2−i)!2 (n1 −4+i)! = n1 !(n1 −3)! = n1 (n1 −1) ≤ n1 if i = 1 .
2
(n1 −2)! =
1
if i = 2
2n1 !(n1 −2)! 2n1 (n1 −1)
Plugging this into the right-hand side of (38), we obtain that for n1 , n2 ≥ 4,
2 2 2 2 2 2
σ̂0,0 σ̂0,1 σ̂1,0 σ̂1,1 σ̂0,2 σ̂2,0
Var(Un1 ,n2 ) ≤ 16 42 + 4n2 + 4n1 + n1 n2 + 8n2 (n2 −1) + 8n1 (n1 −1)
2 2 2
σ̂1,2 σ̂2,1 σ̂2,2
+ 2n1 n2 (n2 −1) + 2n2 n1 (n1 −1) + 4n1 (n1 −1)n2 (n2 −1)
2 2 2 σ̂2 2
4σ̂1,0 4σ̂0,1 4 σ̂2,2
σ̂2,2
≤ n1 + n2 + 16 + 9 n1 n2 + 2 + 2 n1 (n2,2
1 −1)
+ 2 + 2 n2 (n2 −1)
2 2
4σ̂1,0 4σ̂0,1 (17n1 n2 +4n22 +4n21 )σ̂2,2
2
≤ n1 + n2 + n1 (n1 −1)n2 (n2 −1) .
In the second inequality, we used that σ̂0,0 2 = 0 and that, by the law of total variance, σ̂ 2 ≤
i,j
2 2 ≤ψ
σ̂2,2 for all i, j ∈ [2]. Equation (33) then follows from the fact that σ̂1,0 = , σ̂ 2 ≤ψ
Y,1 0,1 Z,1 ,
2
and σ̂2,2 ≤ ψY Z,2 by their definitions.
B.2.1. Proof of (34). Following the proof of [25, Thm. 4.1], we define the index sets
Itotal ≜ {i ∈ N8+ : (i1 , i2 ) ∈ in2 1 , (j1 , j2 ) ∈ in2 2 , (i′1 , i′2 ) ∈ in2 1 , (j1′ , j2′ ) ∈ in2 2 },
I1 ≜ {i ∈ Itotal : s(i) ≤ 1}, and Ic1 = {i ∈ Itotal : s(i) > 1}.
By the argument of [25, Thm. 4.1], we have
c
|I1 |
(39) E[Var(Unπ,n
1 ,n2 |X)] ≤ ψ̃Y Z,2 (n1 )2 (n2 )2 ,
(2) (2)
Since ψY Z,2 upper-bounds ψ̃Y Z,2 and n ≥ 4, the following estimate, provided by (39)
and (40), completes our proof:
2ψ̃Y Z,2 (n21 +n22 +8n1 n2 −17(n1 +n2 )+30) ψ̃Y Z,2 (2n21 +2n22 +16n1 n2 )
E[Var(Unπ,n
1 ,n2 |X)] ≤ n1 (n1 −1)n2 (n2 −1) ≤ n1 (n1 −1)n2 (n2 −1) .
where π̃i ≜ mπ⌊i/m⌋ + i − m⌊i/m⌋. We now express Itotal as the disjoint union of four sets:
• Idiff , containing s21 (s1 − 1)2 s22 (s2 − 1)2 m8 = s21 (s21 − 2s1 + 1)s22 (s22 − 2s2 + 1)m8 indices
for which the bins of each component are pairwise different.
• Ia , containing the indices for which one or more of the pairs among (i1 , i2 ), (i′1 , i′2 ),
(n1 + j1 , n1 + j1′ ), (n1 + j2 , n1 + j2′ ) share the same bin, but {⌊i1 /m⌋, ⌊i2 /m⌋} ∩
{⌊i′1 /m⌋, ⌊i′2 /m⌋} = ∅, {⌊(n1 + j1 )/m⌋, ⌊(n1 + j2 )/m⌋} ∩ {⌊(n1 + j1′ )/m⌋, ⌊(n1 +
j2′ )/m⌋} = ∅.
• Ib , containing the indices for which exactly one pair among (i1 , i2 ), (i′1 , i′2 ), (n1 + j1 , n1 +
j1′ ), (n1 + j2 , n1 + j2′ ) shares the same bin, and |{⌊i1 /m⌋, ⌊i2 /m⌋} ∩ {⌊i′1 /m⌋, ⌊i′2 /m⌋}| +
|{⌊(n1 + j1 )/m⌋, ⌊(n1 + j2 )/m⌋} ∩ {⌊(n1 + j1′ )/m⌋, ⌊(n1 + j2′ )/m⌋}| = 1.
• Irest , containing the rest of indices. For indices in Irest , there has to be at least one pair
among (i1 , i2 ), (i′1 , i′2 ), (n1 + j1 , n1 + j1′ ), (n1 + j2 , n1 + j2′ ) that shares the same bin, and
there have to be at least two pairs formed by an index in {i1 , i2 , n1 + j1 , n1 + j2 } and an
CHEAP PERMUTATION TESTING 27
index in {i′1 , i′2 , n1 + j1′ , n1 + j2′ } such that both indices belong to the same bin. From this
argument, we can deduce that
|Irest | ≤ (4s1 s42 + 4s21 s32 + 16s21 s32 + 16s31 s22 + 4s41 s2 + 4s31 s22 )m8
= (4s1 s42 + 20s21 s32 + 20s31 s22 + 4s41 s2 )m8 .
Contribution of Idiff terms. To deal with the contribution from Idiff terms, we mirror the
argument of [25, Thm. 4.1]. In App. A.1 we have seen that cheap permutation can be treated
like standard permutation of dataset bins with the substitutions n ← s, n1 ← s1 , n2 ← s2 ,
hho ← Hho,m , g ← Gm . To this end, we define the index sets
Ktotal ≜ {i ∈ N8+ : (i1 , i2 ) ∈ is21 , (j1 , j2 ) ∈ is22 , (i′1 , i′2 ) ∈ is21 , (j1′ , j2′ ) ∈ is22 },
K1 ≜ {i ∈ Ktotal : s(i) ≤ 1}, Kc1 = {i ∈ Ktotal : s(i) > 1}, and K2 ≜ {i ∈ Ktotal : s(i) = 2}.
We compute the size of K2 :
|K2 | = 2s1 (s1 − 1)s2 (s2 − 1)(s2 − 2)(s2 − 3) + 2s1 (s1 − 1)(s1 − 2)(s1 − 3)s2 (s2 − 1)
+ 4s1 (s1 − 1)(s1 − 2) · 4s2 (s2 − 1)(s2 − 2)
= s1 (s1 − 1)s2 (s2 − 1)(2s22 + 2s21 + 16s1 s2 − 42s1 − 42s2 + 88).
Hence, we obtain that
|Kc1 \ K2 | = |Kc1 | − |K2 | = s1 (s1 − 1)s2 (s2 − 1)((−34s1 − 34s2 + 60) − (−42s1 − 42s2 + 88))
= s1 (s1 − 1)s2 (s2 − 1)(8s1 + 8s2 − 28).
Our goal is now to upper-bound the contribution of Idiff to (41),
P
i∈Idiff E[E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )
8
(πi1 ) (π ) (π ) (π )
, Xn i2 ; Xn s1 +j1 , Xn s1 +j2 )
P
=m i∈Ktotal E E Hho,m (Xn
(πi′ ) (πi′ ) (πs1 +j ′ ) (πs1 +j ′ )
× Hho,m (Xn 1
, Xn 2
)X
; Xn 1
, Xn 2
(42) P Pm (πi1 ) (π ) (π ) (π )
≤ i∈Ktotal E E k1 ,k2 ,ℓ1 ,ℓ2 =1 hho (Xk1 , Xk2 i2 ; Xℓ1 s1 +j1 , Xℓ2 s1 +j2 )
Pm (πi′ ) (πi′ ) (πs +j ′ ) (πs +j ′ )
× k1′ ,k2′ ,ℓ′1 ,ℓ′2 =1 hho (Xk1′
1
, Xk2′ 2 ; Xℓ′1 1 1 , Xℓ′2 1 2 ) X
(π ) (π ) (π ) (π )
E E hho (Xk1 i1 , Xk2 i2 ; Xℓ1 s1 +j1 , Xℓ2 s1 +j2 )
P P
= i∈Ktotal K∈Jtotal
(πi′ ) (πi′ ) (πs1 +j ′ ) (πs1 +j ′ )
× hho (Xk1′ 1
, Xk2′ 2
; Xℓ′1 1
, Xℓ′2 2
)X ,
|{K ∈ Jtotal | s(K) = 0}| ≥ m(m − 1)(m − 2)(m − 3)m(m − 1)(m − 2)(m − 3)
+ 4m(m − 1)(m − 2)m(m − 1)(m − 2)(m − 3)
= m(m − 1)(m − 2)m(m − 1)(m − 2)(m − 3) (m − 3) + 4
|{K ∈ Jtotal | s(K) = 1}| ≥ 8m(m − 1)(m − 2)m(m − 1)(m − 2)(m − 3)
=⇒ |Jc1 | = |Jtotal | − |{K ∈ Jtotal | s(K) = 0}| − |{K ∈ Jtotal | s(K) = 1}|
≤ m8 − m(m − 1)(m − 2)m(m − 1)(m − 2)(m − 3) (m − 3) + 4
Since every non-zero summand is bounded by ψ̃Y Z,2 , we can bound (42) by
|K2 | · m6 ψ̃Y Z,2 + |Kc1 \ K2 | · |Jc1 |ψ̃Y Z,2
≤ s1 (s1 − 1)s2 (s2 − 1)((2s22 + 2s21 + 16s1 s2 )m6 + (8s1 + 8s2 ) · 50m6 )ψ̃Y Z,2
= n1 (n1 − m)n2 (n2 − m)(2n22 + 2n21 + 16n1 n2 + 400(n21 + n22 )/s)ψ̃Y Z,2 ,
which, in turn, bounds the contribution of Idiff to (41).
Contribution of Ia terms. Without loss of generality, assume that i1 and i2 belong to the
same bin but that all other indices belong to pairwise different bins. Then,
E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
1 2 1 2
− E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|X]E[hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
1 2 1 2
= E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|X]E[hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
1 2 1 2
− E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|X]E[hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X] = 0.
1 2 1 2
Contribution of Ib terms. Without loss of generality, assume that i1 and i2 belong to the same
bin and that n1 + j1 and n1 + j2′ also belong to the same bin. Since exchanging the indices
i′1 and n1 + j1′ does not change the distribution, and hho (x, y; x′ , y ′ ) = −hho (x′ , y; x, y ′ ),
E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
1 2 1 2
= E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃n1 +j′ , Xπ̃i′ ; Xπ̃i′ , Xπ̃n1 +j′ )|X]
1 2 1 2
= −E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃n1 +j′ , Xπ̃i′ ; Xπ̃i′ , Xπ̃n1 +j′ )|X] = 0.
1 2 1 2
Irest terms, subcase (i). For indices in Irest such that i1 , i2 , n1 + j1 , n1 + j2 , i′1 , i′2 , n1 + j1′ ,
n1 + j2′ are all pairwise different,
E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
1 2 1 2
= E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|π]
1 2 1 2
= E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|π]E[hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|π] .
1 2 1 2
Here, E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|π] and E[hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|π]
1 2 1 2
play the same role, so we will focus on the latter. We consider the following cases:
• π̃i1 ≤ n1 , π̃i2 ≤ n1 , π̃n1 +j1 > n1 , π̃n1 +j2 > n1 or analogous configurations: Then,
E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|π] = E[hho (Y1 , Y2 ; Z1 , Z2 )] = E[Un1 ,n2 ],
• π̃i1 ≤ n1 , π̃i2 ≤ n1 , π̃n1 +j1 ≤ n1 , π̃n1 +j2 > n1 or analogous configurations: Then,
E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|π] = E[hho (Y1 , Y2 ; Y3 , Z1 )]
= E[hho (Y3 , Y2 ; Y1 , Z1 )] = −E[hho (Y1 , Y2 ; Y3 , Z1 )] = 0.
• π̃i1 ≤ n1 , π̃i2 ≤ n1 , π̃n1 +j1 ≤ n1 , π̃n1 +j2 ≤ n1 or analogous configurations: Then,
E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|π] = E[hho (Y1 , Y2 ; Y3 , Y4 )]
= E[hho (Y3 , Y2 ; Y1 , Y4 )] = −E[hho (Y1 , Y2 ; Y3 , Y4 )] = 0.
Thus, for all terms in subcase (i), we have that
E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X] ≤ E[Un1 ,n2 ]2 .
1 2 1 2
Moreover, the fraction of Irest terms in subcase (i) is m2 (m − 1)2 (m − 2)2 (m − 3)2 /m8 .
Irest terms, subcase (ii). Now we consider indices in Irest such that there is exactly one pair
among i1 , i2 , n1 + j1 , n1 + j2 , i′1 , i′2 , n1 + j1′ , n1 + j2′ that are the same, and the other
indices are pairwise different. Without loss of generality, we can assume that i1 = i′1 (recall
that i1 ̸= i2 always). Then,
E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i1 , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|X]
2 1 2
= E E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i1 , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|π]
2 1 2
= E E[E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|Xπ̃i1 , π]
× E[hho (Xπ̃i1 , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|Xπ̃i1 , π]|π] .
2 1 2
Here, E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|Xπ̃i1 , π] and E[hho (Xπ̃i1 , Xπ̃i′ ; Xπ̃n1 +j′ , Xπ̃n1 +j′ )|Xπ̃i1 , π]
2 1 2
also play the same role, so we will focus on the latter. We consider the following cases with
(Y, Y ′ , Y ′′ , Z, Z ′ , Z ′′ ) ∼ P × P × P × Q × Q × Q given (Xπ̃i1 , π):
• π̃i2 ≤ n1 , π̃n1 +j1 > n1 , π̃n1 +j2 > n1 or analogous configurations: Then,
E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|Xπ̃i1 , π] = E[hho (Xπ̃i1 , Y ; Z, Z ′ )|Xπ̃i1 , π].
• π̃i2 ≤ n1 , π̃n1 +j1 ≤ n1 , π̃n1 +j2 > n1 or analogous configurations: Then,
E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )|Xπ̃i1 , π] = E[hho (Xπ̃i1 , Y ; Y ′ , Z ′ )|Xπ̃i1 , π].
30
n1 ′′ ′ ′ ′′ 2 ′′ ′ ′ 2 .
n2
n E E[hho (Y , Y ; Y , Z )|Y ] + n E E[hho (Z , Y ; Y , Z)|Z ]
Now, notice that
E E[hho (Y ′ , Y ; Z, Z ′ )|Y ′ ]2 = Var E[hho (Y ′ , Y ; Z, Z ′ )|Y ′ ] + E[hho (Y ′ , Y ; Z, Z ′ )]2
Since there need to be at least two pairs of indices that share the same value, we have that the
fraction of elements of Irest that fall under subcase (iii) is upper-bounded by
4 4 4
1
6 6 4 6 44
m8 3 m + 3 m + 2 · 2 m = m2 .
Total contribution of Irest terms. We end up with a bound of the form
P
i∈Idiff E[E[hho (Xπ̃i1 , Xπ̃i2 ; Xπ̃n1 +j1 , Xπ̃n1 +j2 )hho (Xπ̃i′ , Xπ̃i′ ; Xπ̃n1 +j ′ , Xπ̃n1 +j ′ )|X]]
1 2 1 2
Combining our Idiff , Ia , Ib , and Irest estimates, we obtain the advertised variance bound,
E[Var(Unπ,s 2 2 2
1 ,n2 |X)] n1 (n1 − 1) n2 (n2 − 1)
2
≤ n1 (n1 − m)n2 (n2 − m)(2n22 + 2n21 + 16n1 n2 + 400(n21 + n22 )/s)ψ̃Y Z,2
12 44
+ (4n1 n42 + 20n21 n32 + 20n31 n22 + 4n41 n2 )m3 E[Un1 ,n2 ]2 +
m max{ψY,1 , ψZ,1 } + m2 ψ̃Y Z,2
L EMMA C.1 (Positive-definite bounds on ψY,1 , ψZ,1 ). Under the premises of Cor. 1,
ψY,1 ≤ MMD2k (P, Q)ξQ and ψZ,1 ≤ MMD2k (P, Q)ξP .
We first show that (11) implies the sufficient power condition (9) of Thm. 1. Since
q q
3−β 4MMD2k (P,Q)ξQ 2
3−β 4ψY,1 4ψZ,1
+ 4MMDnk2(P,Q)ξP
(43) β n1 + n2 ≤ β n1
by Lem. C.1 and E[Un1 ,n2 ] = MMD2k (P, Q) by [20, Lemma 6], the following inequality is a
sufficient condition for (9) to hold:
ax2 − bx − c ≥ 0 for x = MMDk (P, Q),
q q
4ξQ 1−α⋆ (36n21 +36n22 +198n1 n2 )ψY Z,2
a = 1, b = 3−β β n1 + 4ξP
n2 , and c = α⋆ βn1 (n1 −1)n2 (n2 −1) .
√
By the quadratic formula, this holds whenever x ≥ b + c, yielding the first claim.
q We next show that (12) implies the sufficient power condition (10) of Thm. 1. Let f =
⋆
βα 2
s3 24(1−α ⋆ ) ρn1 n2 . Using the equality E[Un1 ,n2 ] = MMDk (P, Q) and the inequalities
q q q
201s2 12s 201s2 12s
4n2 ψY Z,2 + n max(ψ Y,1 , ψZ,1 ) ≤ 4n2 ψY Z,2 + n max(ξQ , ξP )MMDk (P, Q)
32
and (43) justified by Lem. C.1 and the triangle inequality, we find that (10) holds whenever
ax2 − bx − c ≥ 0 for x = MMDk (P, Q), a = 1 − 1/f,
q q
4ξQ
b = 3−β β n1 + 4ξP
n2 + 1
f
12s
n max(ξQ , ξP ), and
q q
⋆ (36n2 +36n2 +198n1 n2 )ψY Z,2 201s2
c = 1−α α ⋆
1 2
βn1 (n1 −1)n2 (n2 −1) + 1
f 4n2 ψY Z,2 .
√
Since a ≤ 1, this holds when x ≥ ab + ac ≥ ab + ac , yielding the second claim.
p
D.1. Bounding the fluctuations of the test statistic. We begin by identifying a CQB
Φn1 ,n2 satisfying the test statistic tail bound (21). We first introduce another way to express
the U-statistic Un1 ,n2 and its corresponding V-statistic Vn1 ,n2 . If we let δi = 1 for i ∈ [n1 ] and
δi = −1 for i ∈ n1 + [n2 ] and Kij = g(Xi , Xj ), we can write
I[δi =1,δj =1]
Vn1 ,n2 ≜ MMD2k (P b ) = Pn + I[δi =−1,δ j =−1]
− I[δinδ1j n=−1]
b, Q Kij ,
i,j=1 n21 n22 2
L EMMA D.1 (Sub-Gaussian differences inequality [33, Sec. 5]). Suppose an integrable
function f of a sequence of independent datapoints X = (Xi )ni=1 satisfies
|f (X) − E[f (X) | X−k ]| ≤ fk (Xk ) almost surely
for each k ∈ [n] and measurable functions (fk )nk=1 . If each fk (Xk ) is sub-Gaussian, then
q
f (X) − E[f (X)] < 64e log(1/δ) k∈[n] ∥fk (Xk )∥2ψ2 with probability at least 1 − δ.
P
1
The statement of [33, Thm. 3] contains the apparent typo 32 in place of 64, presumably due to incorrect
minimization of (45).
CHEAP PERMUTATION TESTING 33
The second lemma, proved in App. D.4, uses Lem. D.1 to derive suitable CQBs for Un1 ,n2
and Vn1 ,n2 .
Φn1 ,n2 (β) ≜ Λ2Un (β) + 2MMDk (P, Q)ΛUn1 ,n2 (β).
1 ,n2
D.2. Bounding the fluctuations of the threshold. We next identify CQBs ΨX,s and
Ψn1 ,n2 ,s that satisfy the properties
(46) Pr(Unπ,s
1 ,n2 ≥ ΨX,s (α) | X) ≤ α, ∀α ∈ (0, 1), and
(47) Pr(ΨX,s (α) ≥ Ψn1 ,n2 ,s (α, β)) ≤ β, ∀α, β ∈ (0, 1)
for the cheaply permuted U-statistic Unπ,s
(30). Our derivation makes use of three lemmas.
1 ,n2
The first is tail bound for random quadratic forms known as the Hanson-Wright inequality.
L EMMA D.3 (Hanson-Wright inequality [11, Thm. 3.1]). Let X ∈ Rn be a random vec-
2 2
tor with independent components satisfying E[eλXi ] ≤ eλ ν /2 for some ν ≥ 0, all λ ∈ R,
and all i ∈ [n]. Let A ∈ Rn×n be a symmetric matrix such that aii = 0 for i ∈ [n]. Then, there
exists a universal constant c > 3/20 such that for any t > 0,
2
(48) Pr(X ⊤ AX > t) ≤ exp(−c min{ ν 4 ∥A∥
t t
ν ∥A∥op }).
2 , 2
F
P ROOF. Equation (48) is the content of [11, Thm. 3.1], and equivalence with (49) follows
ct2 ct
p
from selecting t = max a log(1/δ), b log(1/δ) in (48) or δ = exp(− min{ ν 4 ∥A∥ ν ∥A∥op })
2 , 2
F
in (49). Now, suppose that δ ∈ (0, 0.86). Since 0.86 ≤ exp(−3/20) ≤ exp(−c), we have
q
log(1/δ) log(1/δ)
c ≤ c . Since ∥A∥op ≤ ∥A∥F , the claim (50) follows.
Our second lemma, proved in App. D.5, provides high-probability bounds on the sums of
MMD powers.
L EMMA D.4 (High-probability bounds on sums of MMD powers). Under the assump-
tions of Thm. 2, if p ∈ {1, 2, 4}, then
Ps1 p Ps2 p 1/p
Zp (X) ≜ i=1 MMDk (δYi , P) + i=1 MMDk (δZi , Q)
< Ap (s1 , s2 , m) + B(s1 , s2 , m, δ),
34
Our third lemma, Lem. D.5, proved in App. D.6, uses Lems. D.3 and D.4 to verify the
CQB properties (46) and (47) for a particular pairing of (ΨX,s , Ψn1 ,n2 ,s ).
L EMMA D.5 (Sub-Gaussian quantile bound for permutation threshold). Instantiate the
assumptions and notation of Lem. D.4. For all α, β ∈ (0, 1),
√
Ψn1 ,n2 ,s (α, β) ≜ 20 log(2/α)
3/2 12MMDk (P, Q)(A2 (s1 , s2 , m) + B(s1 , s2 , m, β/3))
3s1
√
+ 24(A4 (s1 , s2 , m) + B(s1 , s2 , m, β/3))2
2
+ s21 (A2 (s1 , s2 , m) + B(s1 , s2 , m, β/3))2
+ MMDk (P, Q)(A1 (s1 , s2 , m) + B(s1 , s2 , m, β/3))
MMD2k (P,Q)(2 log(4/α)+1)
+ s1 .
satisfies Pr(ΨX,s (α) ≥ Ψn1 ,n2 ,s (α, β)) ≤ β for ΨX,s (α) defined in equation (55) and satisfy-
ing Pr(Unπ,s
1 ,n2 ≥ ΨX,s (α) | X) ≤ α. Here, m ≜
n1 +n2 sn1
s , s1 ≜ n1 +n2 , and s2 ≜ s − s1 .
In the sequel, we will make use of a simplified version of Ψn1 ,n2 ,s that, as an immediate
corollary to Lem. D.5, also satisfies the CQB property (47).
D.3. Putting everything together. Fix any β ∈ (0, 1), and suppose that
(51) E[Un1 ,n2 ] ≥ Ψn1 ,n2 ,s (α⋆ , β/3) + Φn1 ,n2 (β/3)
for Ψn1 ,n2 ,s and Φn1 ,n2 defined in Cor. D.1 and Lem. D.2. By Cor. D.1, Lem. D.2, and Prop. 2,
a cheap permutation test based on Un1 ,n2 has power at least 1 − β . We now show that the
assumed condition (13) is sufficient to imply (51). Since E[Un1 ,n2 ] = MMD2k (P, Q), the con-
dition (51) is equivalent to the inequality ax2 − bx − c ≥ 0 with x = MMDk (P, Q),
2 log(4/α⋆ )+1
a=1− s1 ,
CHEAP PERMUTATION TESTING 35
√ √
40 log(2/α⋆ )(Ã2 +B̃( β3 )) 2( 2Ã2 +B̃( β3 )) 2Λ̃P (β) 2Λ̃Q (β)
b= √
3s1 n1
+ √
s1 n1 +√ n1 −1
+√ n2 −1
, and
√ 2
40 2 log(2/α⋆ )/3(Ã4 +B̃( β3 )2 )+4(Ã22 +B̃( β3 )2 ) Λ̃ (β) Λ̃ (β)
c= n1 + √P
n1 −1
+ √Q
n2 −1
.
√
By the triangle inequality and the fact a ≥ a, it therefore suffices to have
√ √
2
x ≥ a1 (b + c) ≥ ab + ac ≥ b+ b2a+4ac .
p
D.4. Proof of Lem. D.2: Sub-Gaussian homogeneity quantile bounds. Fix any β ∈
(0, 1). Let X be the concatenation of Y and Z, and let Ỹ and Z̃ be independent draws from P
and Q respectively. By the triangle inequality, we have the following almost sure inequalities
for each k ∈ [n]:
|MMDk (P, Q) − MMDk (P b )| ≤ MMDk (P, P
b, Q b) + MMDk (Q, Q b ) ≜ ∆(X)
(
1
E[MMDk (δYk , δỸ ) | Yk ] if k ≤ n1
|∆(X) − E[∆(X) | X−k ]| ≤ n11
n2 E[MMDk (δZk−n1 , δZ̃ ) | Zk−n1 ] otherwise.
Since |a2 − b2 | = |a − b||b + a| ≤ |a − b|(|b − a| + 2|a|) for all real a, b, we further have
(52) |MMD2k (P, Q) − MMD2k (P b )| < Λ2
b, Q
Vn (β) + 2MMDk (P, Q)ΛVn1 ,n2 (β)
1 ,n2
almost surely. Identical reasoning yields the following almost sure bound for each k ∈ [n2 ]:
q
˜ X) − E[∆( ˜ X)|Y, Z−k ] ≤ 1 1 + 1 E MMDk (δZk , δ )|Zk .
∆( n2 n2 Z̃
where
E MMD2k (δYi , δY−i ) = E
RR
k(x, y) d(δYi − P + P − δY−i )(x) d(δYi − P + P − δY−i )(y)
= E (1 + n11−1 ) k(x, y) d(δYi − P)(x) d(δYi − P)(y) = 1 + n11−1 M2,P .
RR
q
E[∆(˜ X)] ≤ M2,P + M2,Q .
n1 −1 n2 −1
Hence, applying Lem. D.1, we obtain that with probability at least 1 − β/2,
q q
M2,P M2,Q (n +1)σ 2 (n2 +1)σQ2
Vn1 ,n2 − Un1 ,n2 < n1 −1 + n2 −1 + 64e log( β2 )( 1 n2 P +
p
n2 ).
1 2
This estimate together with the union bound and the relations (52) and (53) yield the claim.
D.5. Proof of Lem. D.4: High-probability bounds on sums of MMD powers. Define
ϕ(y) ≜ EY ′ ∼P [MMDk (δy , δY ′ )] and ϕ′ (z) ≜ EZ ′ ∼Q [MMDk (δz , δZ ′ )],
fix any p ∈ {1, 2, 4}, and note that we may write
(MMDk (δYi , P))si=1 , (MMDk (δZi , Q))si=1
Zp (X) = 1 2
p
, and
(
1
ϕ(Yk ) if k ≤ n1
|Zp (X) − E[Zp (X) | X−k ]| ≤ m1 ′
m ϕ (Zk−n1 ) otherwise
almost surely by the triangle inequality. Applying Lem. D.1 we therefore have
Zp (X) − E[Zp (X)] < B(s1 , s2 , m, δ) with probability at least 1 − δ.
Hence, it remains to upper-bound E[Zp (X)]. Applying Hölder’s inequality, we find that
E[Z1 (X)] = si=1 E[MMDk (δYi , P)] + si=1
P1 P2
E[MMDk (δZi , Q)]
q P2 q
= si=1 E[MMD2k (δYi , P)] + si=1 E[MMD2k (δZi , Q)] = A1 (s1 , s2 , m),
P1
E[Z2 (X)]2 ≤ E[Z2 (X)2 ] = si=1 E[MMD2k (δYi , P)] + si=1 E[MMD2k (δZi , Q)]
P1 P2
and the analogous result for E[MMD4k (δZi , P)]. This completes the proof.
D.6. Proof of Lem. D.5: Sub-Gaussian quantile bound for permutation threshold.
Fix any α, β ∈ (0, 1). We begin by identifying ΨX,s (α) satisfying
Pr(Vnπ,s
1 ,n2 ≥ ΨX,s (α) | X) ≤ α
ℓ = (ℓ1 , . . . , ℓs1 ) be an s1 tuple drawn without replacement from [s2 ] generated independently
of (X, π) and ζ = (ζk )sk=1 1
be independent random variables that take values −1 and 1 with
equal probability, independently of (X, π, ℓ). For k1 , k2 ∈ [s1 ], define the random measures
( (
P if πk ≤ s1 ′ P if πs1 +ℓk′ ≤ s1
µk = and µk′ =
Q otherwise Q otherwise.
By construction, (µk1 , µk2 , µ′k1 , µ′k2 ) are population distributions of the respective sample
points (Xπk1 , Xπk2 , Xπs1 +ℓk , Xπs1 +ℓk ). Next, let us define
1 2
Venπ,ℓ,ζ,s 1 Ps1
1 ,n2 ≜ s21 k1 ,k2 =1 ζk1 ζk2 Hho,m (Xπk1 , Xπk2 ; Xπs1 +ℓk1 , Xπs1 +ℓk2 ),
where Hho,m is defined in (27). By [25, Sec. 6.1], conditional on X, the distribution of
E[Venπ,ℓ,ζ,s
1 ,n2 | X, π, ζ] matches that of Vnπ,s
1 ,n2 .
for 1 ≤ k1 , k2 ≤ s1 and Aπ,ℓ = (ak1 ,k2 (π, ℓ))sk11 ,k2 =1 . We remark that
Venπ,ℓ,ζ,s
1 ,n2 = ζ ⊤ Aπ,ℓ ζ + cζ,π,ℓ ,
where cζ,π,ℓ ≜ s12 sk11 ,k2 =1 ζk1 ζk2 ⟨µk1 k − µ′k1 k, µk2 k − µ′k2 k⟩k ,
P
1
To make this more precise, we can partition [s1 ] into the following subsets: A = {k ∈ [s1 ] |
µk = P, µ′k = P}, B = {k ∈ [s1 ] | µk = P, µ′k = Q}, C = {k ∈ [s1 ] | µk = Q, µ′k = P},
38
Therefore,
Ps1
k1 ,k2 =1 ζk1 ζk2 bk1 ,k2 (π)
P 2 P P 2 P
= k∈G ζk b1 + k∈G 1 b2 − b1 + k∈F ζk b3 + k1 ∈F 1 b4 − b3
P P P P
− k∈G ζk k∈F ζk b5 − k∈G ζk k∈F ζk b6
P 2 qG2 qG qF P
= k∈G ζk s22 − s22 (s2 −1) + k∈G 1 b2 − b1
P 2 q 2 qF qG P
+ k∈F ζk s22 − s22 (s2 −1) + k1 ∈F 1 b4 − b3
F
P P qG qF qG qF P P qG qF qG qF
− k∈G ζk k∈F ζk 2
s2 + 2
s2 (s2 −1) − k∈G ζ k k∈F ζ k s22 + 2
s (s2 −1)
2
qG qF 2 2
− s2q(sG2q−1)
P P P P
k∈G ζk s2 −
F
= k∈F ζk s2 k∈G ζk + k∈F ζk
2
+ |G| b2 − b1 + |F| b4 − b3
P P 2
≤ k∈G ζk E 1k∈B | π, k ∈ G − k∈F ζk E 1k∈C | π, k ∈ F + s1 .
2
⊔ denotes a disjoint union
CHEAP PERMUTATION TESTING 39
Let c1 = E 1k∈B | π, k ∈ G , c2 = E 1k∈C | π, k ∈ F . By Hoeffding’s inequality [23,
Thm. 2] and the union bound
t2 t2
P P
Pr |c1 k∈G ζk + c2 k∈F ζk | ≥ t | π ≤ 2 exp − 2(c2 |G|+c 2
|F |) ≤ 2 exp − 2s1 . 1 2
where Aod
π,ℓ has the off-diagonal components of Aπ,ℓ and zeros in the diagonal. We bound
2
ζ E[Aπ,ℓ | X, π]ζ first. Since supλ∈R E[eλζ1 −λ /2 ] ≤ 1 [5, Lem. 2.2], Lem. D.3 implies that
⊤ od
20 log(1/δ ′ )∥E[Aod
π,ℓ |X,π]∥F 20 log(1/δ ′ ) supπ,ℓ ∥Aod
π,ℓ ∥F
(54) ζ ⊤ E[Aod
π,ℓ | X, π]ζ ≤ 3 ≤ 3
with probability at least 1 − δ ′ conditioned on (X, π), provided that δ ′ ∈ (0, 0.86). Setting
δ ′′ = α2 and δ ′ = α2 , we find that with probability at least 1 − α conditioned on X,
E[Venπ,ℓ,ζ,s
1 ,n2 | X, π, ζ] = ζ ⊤ E[Aod
π,ℓ | X, π]ζ + E[Tr(Aπ,ℓ ) | X, π] + E[cζ,π,ℓ | π, ζ]
20 log( α2 ) supπ,ℓ ∥Aod
π,ℓ ∥F MMD2k (P,Q)(2 log(4/α)+1)
(55) < ΨX,s (α) ≜ 3 +supπ,ℓ Tr(Aπ,ℓ ) + s1 .
Now it only remains to find Ψn1 ,n2 ,s satisfying Pr(ΨX,s (α) ≥ Ψn1 ,n2 ,s (α, β)) ≤ β . To
achieve this we will develop quantile bounds for supπ,ℓ ∥Aod
π,ℓ ∥F and supπ,ℓ Tr(Aπ,ℓ ). We
od
begin with supπ,ℓ ∥Aπ,ℓ ∥F . Since
Hho,m (Xπk1 , Xπk2 ; Xπs1 +ℓk , Xπs1 +ℓk ) = ⟨δXπk k − δXπs k, δXπk k − δXπs k⟩k
1 2 1 1 +ℓk1 2 1 +ℓk2
= |⟨δXπk k − δXπs k, δXπk k − δXπs k⟩k − ⟨µk1 k − µ′k1 k, µk2 k − µ′k2 k⟩k |
1 1 +ℓk1 2 1 +ℓk2
+ MMDk (µk1 , µ′k1 )(MMDk (δXπk , µ′k2 )+MMDk (δXπs +ℓ , µ′k2 )).
2 1 k2
40
2
+MMDk (µk1 , µ′k1 )(MMDk (δXπk , µ′k2 )+MMDk (δXπs +ℓ , µ′k2 ))
2 1 k 2
Ps1
≤ k1 ,k2 =1 3(MMDk (δXπk1 , µk1 )+MMDk (δXπs1+ℓk , µ′k1 ))2
1
+3MMD2k (µk2 , µ′k2 )(MMDk (δXπk , µk1 )+MMDk (δXπs +ℓ , µ′k1 ))2
1 1 k 1
+3MMD2k (µk1 , µ′k1 )(MMDk (δXπk , µ′k2 )+MMDk (δXπs +ℓ , µ′k2 ))2
2 1 k2
Ps1
≤ 3
k1 ,k2 =1 2 (MMDk (δXπk , µk1 )+MMDk (δXπs +ℓ , µ′k1 ))4
1 1 k1
Let E be the event that the inequality of Lem. D.4 with δ = β3 holds simultaneously for all
p ∈ {1, 2, 4}. By the union bound, this holds with probability at least 1 − β . Moreover, on E ,
Ps1 4 2 4
k1 ,k2 =1 s1 |ak1 ,k2 (π, ℓ)| < 24s1 (A4 (s1 , s2 , m) + B(s1 , s2 , m, β/3))
2 Ps1 2 Ps2 2
≤ s21 i=1 MMDk (δYi , P) + i=1 MMDk (δZi , Q)
Ps1 Ps2
+ MMDk (P, Q) i=1 MMDk (δYi , P) + i=1 MMDk (δZi , Q)
On E we therefore have
2
supπ,ℓ Tr(Aπ,ℓ ) < s21 (A2 (s1 , s2 , m) + B(s1 , s2 , m, β/3))2
(57)
+ MMDk (P, Q)(A1 (s1 , s2 , m) + B(s1 , s2 , m, β/3)) .
From (54), (56), and (57), we therefore obtain that with probability at least 1 − β ,
ΨX,s (α) < Ψn1 ,n2 ,s (α, β).
with the same result holding when Vnπ,n is substituted for Vnπ,s . Hence, by Cor. 3
and Lems. E.1 to E.3 the standard and cheap tests enjoy power at least 1 − β whenever
q q
1 3 16ψ1′ 144ψ2′ 12
32ψ̃1′ 960ψ2′ ˜
(62) E[Vn ]− 4 U ≥ ( β − 1) n + n2 + α⋆ β − 2 n + n2 + 10
n ξ and
q ′
144ψ2′
1 s2 +s−4
U ≥ ( β3 − 1) 16ψ
E[Vn ]− 4 1 + s3 n + n2
1
q
12
− 2 ψ̃ ′ 32 + 2304 + 460800 + ψ ′ 960 4147200
+ ξ˜
10
(63) + ⋆ α β 1 n 2 ns 2 + ns 2 2 n n s n
P ROOF. By definition,
1 P
Vn − Un = n4 (i1 ,i2 ,i3 ,i4 )∈[n]4 hin (Xi1 , Xi2 , Xi3 , Xi4 )
1 P
− (n)4 (i1 ,i2 ,i3 ,i4 )∈in
4
hin (Xi1 , Xi2 , Xi3 , Xi4 )
1 1
P
= n4 − (n) 4 (i1 ,i2 ,i3 ,i4 )∈in
4
hin (Xi1 , Xi2 , Xi3 , Xi4 )
1 P
+ n4 (i1 ,i2 ,i3 ,i4 )∈[n]4 \in
4
hin (Xi1 , Xi2 , Xi3 , Xi4 ).
Thus, we have that
1 1 P
|E[Vn − Un ]| ≤ n4 − (n)4 (i1 ,i2 ,i3 ,i4 )∈in
4
E hin (Xi1 , Xi2 , Xi3 , Xi4 )
1 P
+ n4 (i1 ,i2 ,i3 ,i4 )∈[n]4 \in
4
E hin (Xi1 , Xi2 , Xi3 , Xi4 ) .
We bound the first term on the right-hand side using Jensen’s inequality and the ψ2′ definition:
1 1 P
n4 − (n)4 (i1 ,i2 ,i3 ,i4 )∈in 4
E hin (Xi1 , Xi2 , Xi3 , Xi4 )
E.1. Proof of Lem. E.1: Variance of independence V-statistics. Define the index sets
Ktotal ≜ [n]8 ,
K1 ≜ {(i1 , i2 , i3 , i4 , i′1 , i′2 , i′3 , i′4 ) ∈ Ktotal : |{i1 , i2 , i3 , i4 } ∩ {i′1 , i′2 , i′3 , i′4 }| = 0},
K2 ≜ {(i1 , i2 , i3 , i4 , i′1 , i′2 , i′3 , i′4 ) ∈ Ktotal : |{i1 , i2 , i3 , i4 } ∩ {i′1 , i′2 , i′3 , i′4 }| = 1}, and
(K1 ∪ K2 )c ≜ {(i1 , i2 , i3 , i4 , i′1 , i′2 , i′3 , i′4 ) ∈ Ktotal : |{i1 , i2 , i3 , i4 } ∩ {i′1 , i′2 , i′3 , i′4 }| > 1}.
Then,
Var(Vn ) = E[Vn2 ] − E[Vn ]2
= n18 (i1 ,...,i4 ,i′1 ,...,i′4 )∈Ktotal E[hin (Xi1 , Xi2 , Xi3 , Xi4 )hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )]
P
− E[hin (Xi1 , Xi2 , Xi3 , Xi4 )]E[hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )]
= (I) + (II) + (III),
where (I) is the summation over (i1 , . . . , i′4 ) ∈ K1 , (II) is the summation over (i1 , . . . , i′4 ) ∈
K2 , and (III) is the summation over (i1 , . . . , i′4 ) ∈ (K1 ∪ K2 )c . Note that
|K2 | ≤ n4 · 4 · 4 · n3 = 16n7 and |(K1 ∪ K2 )c | ≤ n4 · 4 · 3 · 4 · 3 · n2 = 144n6 .
We also remark that for (i1 , . . . , i′4 ) ∈ K1 , E[hin (Xi1 , Xi2 , Xi3 , Xi4 )hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )] =
E[hin (Xi1 , Xi2 , Xi3 , Xi4 )]E[hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )], which implies that (I) = 0.
′
By the definition (14) of ψ1′ , we have that (II) ≤ n18 |K2 |ψ1′ ≤ 16ψ n . By the nonnegativity
1
of hin , Cauchy-Schwarz, Jensen’s inequality, and the definition (14) of ψ2′ , we have that each
summand of (III) is upper-bounded by
E[hin (Xi1 , Xi2 , Xi3 , Xi4 )hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )]
1/2 1/2
≤ E[hin (Xi1 , Xi2 , Xi3 , Xi4 )2 ] E[hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )2 ]
1/2 1/2
≤ E[hin (Xi1 , Xi2 , Xi3 , Xi4 )2 ] E[hin (Xi′1 , Xi′2 , Xi′3 , Xi′4 )2 ] ≤ ψ2′ .
144ψ2′
Hence, (III) ≤ 1
n8 |(K1 ∪ K2 )c |ψ2′ ≤ n2 , which concludes the proof.
E.2.1. Proof of (58). Our approach is similar to that employed in [25, Thm. 5.1] to study
independence U-statistics. We begin by defining the convenient shorthand
s(i) ≜ |{i1 , i2 , i3 , i4 , n+1−i1 , n+1−i2 , n+1−i3 , n+1−i4 } ∩ {i′1 , i′2 , i′3 , i′4 }|,
Ktotal ≜ [n]8 , K1 ≜ {i ∈ Ktotal : s(i) = 0}, K2 ≜ {i ∈ Ktotal : s(i) = 1}, and
(K1 ∪ K2 )c ≜ {i ∈ Ktotal : s(i) > 1},
and writing
hin (x1 , x2 , x3 , x4 ) = hin,Y (y1 , y2 , y3 , y4 )hin,Z (z1 , z2 , z3 , z4 ), for
hin,Y (y1 , y2 , y3 , y4 ) ≜ gY (y1 , y2 ) + gY (y3 , y4 ) − gY (y1 , y3 ) − gY (y2 , y4 ) and
hin,Z (z1 , z2 , z3 , z4 ) ≜ gZ (z1 , z2 ) + gZ (z3 , z4 ) − gZ (z1 , z3 ) − gZ (z2 , z4 ).
44
By Cauchy-Schwarz,
E[Var(Vnπ,n |X)] = E[E[(Vnπ,n )2 |X] − (E[Vnπ,n |X])2 ] ≤ E[(Vnπ,n )2 ] − (E[Vnπ,n ])2
= n18 i∈Ktotal E HY H′Y hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
P
1 2 3 4
(64)
− E HY hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) E H′Y hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4
= E HY H′Y hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4
′
= E HY hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) E HY hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ ) ,
1 2 3 4
where the last equality holds since the distribution of hin ((Y1 , Zπ1 ), (Y2 , Zπ2 ), (Y3 , Zπ3 ), (Y4 , Zπ4 ))
conditional on (Y1 , Zπ1 , π1 ) is invariant to the order of the four arguments of hin and hence
E[hin ((Y1 , Zπ1 ), (Y2 , Zπ2 ), (Y3 , Zπ3 ), (Y4 , Zπ4 )) | (Y1 , Zπ1 , π1 )]
= E[hin ((Y1 , Zπ1 ), (Y2 , Zπ2 ), (Y3 , Zπ3 ), (Y4 , Zπ4 )) | (Y1 , Zπ1 , π1 )].
Similarly, if i ∈ K̄2 with i1 = n + 1 − i′1 , then the corresponding summand of (64) equals
E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )
− E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )
× hin,Y (Yn , Y5 , Y6 , Y7 )hin,Z (Zπn , Zπ5 , Zπ6 , Zπ7 )
− E hin,Y (Yn , Y5 , Y6 , Y7 )hin,Z (Zπn , Zπ5 , Zπ6 , Zπ7 )
≤ 21 E[Var(hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) | Y1 , π1 , Zπ1 )]
+ 21 E[Var(hin,Y (Yn , Y5 , Y6 , Y7 )hin,Z (Zπn , Zπ5 , Zπ6 , Zπ7 ) | Yn , πn , Zπn )] = ψ̃1′ ,
CHEAP PERMUTATION TESTING 45
2 1/2
≤ E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )
2 1/2
≤ ψ2′
× E hin,Y (Yi′1 , Yi′2 , Yi′3 , Yi′4 )hin,Z (Zπi′ , Zπi′ , Zπi′ , Zπi′ )
1 2 3 4
E.2.2. Proof of (59). Our argument is analogous to that of App. E.2.1 with s substituted
for n and Hin,m substituted for hin . We begin by defining the convenient shorthand
s(i) ≜ |{i1 , i2 , i3 , i4 , s+1−i1 , s+1−i2 , s+1−i3 , s+1−i4 } ∩ {i′1 , i′2 , i′3 , i′4 }|,
Ktotal ≜ [s]8 , K1 ≜ {i ∈ Ktotal : s(i) = 0}, K2 ≜ {i ∈ Ktotal : s(i) = 1}, and
(K1 ∪ K2 )c ≜ {i ∈ Ktotal : s(i) > 1},
and define K̄2 as the set of indices i ∈ K2 with |i| = 7 and either ii = i′j or ii = n+1−i′j for
some i, j ∈ [4]. As in App. E.2.1 case we have
|K2 | ≤ 32s7 , |(K1 ∪ K2 )c | ≤ 576s6 , |K2 \ K̄2 | = 384s6 .
Next, we define K̃2 as the set of i ∈ K2 such that for exactly three indices i, j, k of the eight
indices, the pairs {i, s+1−i}, {j, s+1−j}, and {k, s+1−k} are equal, and the rest of such
pairs are all different. We then have
|K̃2 | = 4 · 4 · 6 · 22 · s(s − 2)(s − 4)(s − 6)(s − 8)(s − 10)
= 384s(s − 2)(s − 4)(s − 6)(s − 8)(s − 10) and
|K2 \ (K̄2 ∪ K̃2 )| = |K2 \ K̄2 | − |K̃2 |
≤ 384(s6 − s(s − 2)(s − 4)(s − 6)(s − 8)(s − 10)) ≤ 384 · 5 · 6s5 = 11520s5 .
We next define K3 as the set of i ∈ (K1 ∪ K2 )c such that for exactly four indices i, j, k, r of
the eight indices, we have that {i, s+1−i} = {j, s+1−j} and {k, s+1−k} = {r, s+1−r},
and the rest of such pairs are all distinct. Note that
|K3 | = 4 · 3 · 2 · 4 · 3 · 2s(s − 2)(s − 4)(s − 6)(s − 8)(s − 10)
= 576s(s − 2)(s − 4)(s − 6)(s − 8)(s − 10) and
|(K1 ∪ K2 ∪ K3 )c | = |(K 1 ∪ K2 )c | − |K 3|
Finally, for K ≜ (k1 , k2 , k3 , k4 , k1′ , k2′ , k3′ , k4′ ) and s(K) ≜ |{k1 , k2 , k3 , k4 }∩{k1′ , k2′ , k3′ , k4′ }|,
we define the index sets
Ltotal = [m]8 , L1 = {K ∈ Ltotal : s(K) = 0}, L2 = {K ∈ Ltotal : s(K) = 1}, and
(L1 ∪ L2 )c = {K ∈ Ltotal : s(K) > 1}
and note that
|L2 | ≤ m4 · 4 · 4 · m3 = 16m7 and |(L1 ∪ L2 )c | ≤ m4 · 4 · 3 · 4 · 3 · m2 = 144m6 .
With this notation in hand, we use Cauchy-Schwarz to write
E[Var(Vnπ,s |X)] = E[E[(Vnπ,s )2 |X] − (E[Vnπ,s |X])2 ] ≤ E[(Vnπ,s )2 ] − (E[Vnπ,s ])2
1 P P (i ) (i ) (i ) (i ) (i′ ) (i′ ) (i′ ) (i′ )
= n8 i∈Ktotal K∈Ltotal E hin,Y (Yk1 1 , Yk2 2 , Yk3 3 , Yk4 4 )hin,Y (Yk1′ 1 , Yk2′ 2 , Yk3′ 3 , Yk4′ 4 )
= (I ′ ) + (II ′ ) + (III ′ ),
where (I ′ ) is the summation over i ∈ K1 , (II ′ ) is the summation over i ∈ K2 , and (III ′ ) is
the summation over i ∈ (K1 ∪ K2 )c . Note that (I ′ ) = 0 by the same argument that we used to
show that (I) = 0 in (64).
To bound (II ′ ), we will further divide the corresponding summands into subcases:
• i ∈ K̄2 : Suppose without loss of generality that i1 = i′1 . The only non-zero summands in
the sum over K have k1 = k1′ . There are m7 such summands, and each contributes at most
ψ̃1′ to the sum.
• i ∈ K̃2 : Consider the three indices of i that satisfy the defining K̃2 property. In the sum-
mation over K ∈ Ltotal , a summand contributes at most ψ̃1′ when exactly two of the three i
indices share the same k -index, and there are at most 3m7 such summands. The only other
non-zero summands contribute at most ψ2′ when all three i indices share the same k -index,
and there are are most m6 such summands.
• i ∈ K2 \ (K̄2 ∪ K̃2 ): Considering the summation over K ∈ Ltotal , there will be a contribution
bounded by ψ̃1′ when K ∈ L2 , and there are at most 16m7 such summands, The only other
non-zero contributions are bounded by ψ2′ and occur when K ∈ (L1 ∪ L2 )c , so the number
of such summands is bounded by 144m6 .
Thus, we have that (II ′ ) is upper-bounded by
7 7 7 6 c
ψ̃1′ |K̄2n|m
8 + ψ̃1′ |K̃2n|·3m
8 + ψ̃1′ |K2 \(K̄2 ∪nK̃8 2 )|·16m + ψ2′ |K̃2n|·m
8 + ψ2′ |K2 \(K̄2 ∪K̃n2 )|·|(L
8
1 ∪L2 ) |
7 7 6 7 5 7 6 6 5 6
≤ ψ̃1′ 32sn8m + 384sn8·3m + 11520sn8·16m + ψ2′ 384sn8 m + 11520sn8·144m .
Similarly, to bound (III ′ ), we divide the corresponding summands into subcases:
• i ∈ K3 : Consider the two pairs of indices of i satisfying the defining property of K3 . When
each pair respectively shares the same k -index, the summand contributes at most ψ2′ , and
there are at most m6 such summands. The only other non-zero summands have exactly
one of the two pairs sharing a k -index; these summands contribute at most ψ̃1′ , and there
are at most 2m7 such summands.
CHEAP PERMUTATION TESTING 47
576s6 ·2m7 5 7 6 6 5 6
≤ ψ̃1′ n8 + 17280sn8·16m + ψ2′ 576sn8·m + 17280sn8·144m
We conclude that
Var(Vnπ,s ) ≤ ψ̃1′ 32 2304 460800
+ ψ2′ 960 4147200
n + ns + ns2 n2 + n2 s .
Finally, by Jensen’s inequality,
Var(E[Vnπ,s |X]) = E[(E[Vnπ,s |X])2 ] − (E[E[Vnπ,s |X]])2
≤ E[E[(Vnπ,s )2 |X]] − (E[E[Vnπ,s |X]])2 = Var(Vnπ,s ).
= n14 (i1 ,...,i4 )∈[n]4 E E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 ) | π
P
n(n−2)(n−4)(n−6) 3 2
+48n ˜ 2
≤ n4 E[Uπ ] + 10n −44n
n4 ξ ≤ E[Uπ ] + 10n −44n+48
n3 ξ˜ for
Uπ ≜ E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) | π .
The penultimate inequality follows from the following facts:
• For (i1 , . . . , i4 ) ∈ J1 , E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )|π = Uπ .
• For (i1 , . . . , i4 ) ∈ J2 , E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )|π = 0.
• For (i1 , . . . , i4 ) ∈ (J1 ∪J2 )c , E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )|π ≤ ξ˜,
L EMMA E.5 (Range and expectation of Uπ ). Instantiate the assumptions of Thm. 3, fix
any i1 , i2 , i3 , i4 ∈ [n] satisfying
|{i1 , i2 , i3 , i4 , swap(i1 ), swap(i2 ), swap(i3 ), swap(i4 )}| = 8,
and let π be any permutation on [n] with
(πij , πswap(ij ) ) ∼ Unif({(ij , swap(ij )), (swap(ij ), ij )}) for each j ∈ [4].
Then we always have
Uπ ≜ E hin,Y (Yi1 , Yi2 , Yi3 , Yi4 )hin,Z (Zπi1 , Zπi2 , Zπi3 , Zπi4 )|π
(66)
∈ E hin (Xi1 , Xi2 , Xi3 , Xi4 )]{0, 41 , 21 , 1}
with
E[Uπ ] ≤ 12 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
(67)
If (πi1 , πi2 , πi3 , πi4 ) are independent, then
E[Uπ ] = 41 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
(68)
If πi1 = i1 ⇔ πi4 = i4 with independent (πi1 , πi2 , πi3 ) or πi2 = i2 ⇔ πi3 = i3 with indepen-
dent (πi1 , πi2 , πi4 ), then
E[Uπ ] = 41 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
(69)
If πi1 = i1 ⇔ πi2 = i2 with independent (πi1 , πi3 , πi4 ); πi1 = i1 ⇔ πi3 = i3 with independent
(πi1 , πi2 , πi4 ); πi2 = i2 ⇔ πi4 = i4 with independent (πi1 , πi2 , πi3 ); or πi3 = i3 ⇔ πi4 = i4
with independent (πi1 , πi2 , πi3 ), then
E[Uπ ] = 38 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
and the notation (P(i) , Q(i) ) to denote copies of (P, Q) while keeping track of indices to make
the derivation easier to follow. We begin by proving (66). Uπ may be rewritten as
Uπ = gY (y, y ′ )gZ (z, z ′ ) d(Pi1 ,πi1 + Pi4 ,πi4 − P × Q(πi4 ) − P × Q(πi1 ) )(y, z)
RR
To show (67), we will prove that for any joint distribution over the Unif({0, 1}) variables
ϵi1 , ϵi2 , ϵi3 , ϵi4 , we have E[ ϵi1 + ϵi4 ϵi2 + ϵi3 ] ≤ 2. Indeed, by Cauchy-Schwarz,
2 1/2 2 1/2
E ϵi1 + ϵi4 ϵi2 + ϵi3 ≤ E ϵi1 + ϵi4 E ϵi2 + ϵi3 ,
2 1
and E ϵi1 + ϵi4 is maximized when ϵi1 = ϵi4 almost surely, with maximum value 0 × 2 +
4 × 21 = 2. Thus, E ϵi1 + ϵi4 ϵi2 + ϵi3 has maximum value 2.
To show (68), we compute the probability that Uπ takes each of its possible values when
(π1 , . . . , π4 ) are independent:
E[Uπ ] = 214 0 × 40 + 0 × (2 + 41 ) + 41 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 4
+ 12 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 43 + E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 44
1
= 16 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] + 2E hin (Xi1 , Xi2 , Xi3 , Xi4 )] + E hin (Xi1 , Xi2 , Xi3 , Xi4 )]
= 41 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
Similarly, when πi1 = i1 ⇔ πi4 = i4 and (πi1 , πi2 , πi3 ) are independent,
E[Uπ ] = 213 0 × 4 + 0 × 1 + 21 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2 + E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 1
Alternatively, when πi1 = i1 ⇔ πi2 = i2 and (πi1 , πi3 , πi4 ) are independent,
E[Uπ ] = 213 0 × 3 + 14 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2 + 21 E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 2
+ E hin (Xi1 , Xi2 , Xi3 , Xi4 )] × 1 = 83 E hin (Xi1 , Xi2 , Xi3 , Xi4 )].
E.3.2. Proof of (61). In the cheap permutation case, the permutation values πi , πj are
not independent when i and j belong to the same bin, which requires a finer analysis. In the
following, for 1 ≤ i ≤ s, we let swaps (i) ≜ i + 2s − sI i > 2s and recall that swap(i) ≜
• (III) includes the terms for which all indices j1 , . . . , j4 , swap(j1 ), . . . , swap(j4 ) are
different and for which there are exactly two pairs of repeated bin indices among
i1 , . . . , i4 , swaps (i1 ), . . . , swaps (i4 ), and these two pairs are one of the following:
{(i1 , i2 ), (swaps (i1 ), swaps (i2 ))}, {(i1 , swaps (i2 )), (swaps (i1 ), i2 )} (so that πi1 =
i1 ⇔ πi2 = i2 ); {(i1 , i3 ), (swaps (i1 ), swaps (i3 ))}, {(i1 , swaps (i3 )), (swaps (i1 ), i3 )}
(both of which imply that πi1 = i1 ⇔ πi2 = i2 ); {(i2 , i4 ), (swaps (i2 ), swaps (i4 ))},
{(i2 , swaps (i4 )), (swaps (i2 ), i4 )} (both of which imply that πi2 = i2 ⇔ πi4 = i4 ); or
{(i3 , i4 ), (swaps (i3 ), swaps (i4 ))}, {(i3 , swaps (i4 )), (swaps (i3 ), i4 )} (both of which
imply that πi3 = i3 ⇔ πi4 = i4 ). By equation (69) of Lem. E.5, for all such terms we
can write E hin,Y (Yj1 , Yj2 , Yj3 , Yj4 )E hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) = 83 U . The number
of terms in (III) is 8s(s − 2)(s − 4)n3 (n − 1) ≤ 8s(s − 2)(s − 4)n4 .
• (IV ) includes the terms for which all indices j1 , . . . , j4 , swap(j1 ), . . . , swap(j4 ) are dif-
ferent and which do not belong to either (I), (II) or (III)
. For such terms, we have that
E hin,Y (Yj1 , Yj2 , Yj3 , Yj4 )E hin,Z (Zπj1 , Zπj2 , Zπj3 , Zπj4 ) ≤ 21 U , by equation (67).
• (V ) includes all terms for which (j1 , . . . , j4 ) belongs to (J1 ∪ J2 )c , where J1 and J2 are
2
defined in (65). The total contribution of such terms can be upper-bounded by 4n −11n+6 n3 ξ˜
by the same argument as in the standard permutation case.
We conclude that
s(s−2)(s−4)(s−6)
E[Vnπ,s ] ≤ s4 × 14 U + 4s(s−2)(s−4)
s4 × 14 U + 8s(s−2)(s−4)
s4 × 83 U
| {z } | {z } | {z }
(I) (II) (III)
s4 −s(s−2)(s−4)(s−6)−12s(s−2)(s−4) 2
+ s4 × 12 U + 10n −44n+48
n3 ξ˜
| {z } | {z }
(IV ) (V )
Since E[Vn ] ≥ 4MMD2k (Π, P × Q) = U by Lem. F.1, equation (62) holds whenever
q q
2 3 16ψ1′ 144ψ2′ 12
32ψ̃′ 960ψ2′ ˜
3MMDk (Π, P × Q) ≥ ( β − 1) n + n2 + α⋆ β − 2 n + n2
1
+ 10
n ξ.
CHEAP PERMUTATION TESTING 51
Next, by Lems. F.2 and F.3, we can upper-bound the right-hand side of this expression by
q 2 ′
( β3 − 1) 64MMDkn(Π,P×Q)ξ + 144ψ n2
2
ax2 − bx − c ≥ 0,
q
12
32
with x = MMDk (Π, P × Q), a=3− ⋆
α β − 2 n,
q (256ξ̃wild +32ξ) q 960ψ2′
12 12 10 ˜
b= ⋆
α β − 2 n , c = α⋆ β − 2 n2 + n ξ.
√
2
Equation (62) therefore holds when x ≥ ab + ac ≥ b+ b2a+4ac , concluding the proof of (18).
p
To establish the cheap testing claim, we will show that (19) implies the sufficient power
condition (63) in the proof of Thm. 3. Since E[Vn ] ≥ U by Lem. F.1, (63) holds when
q ′
144ψ2′
3− s+1 2
( β3 − 1) 16ψ
s2 MMD k (Π, P × Q )≥ n + n2
1
q
12 ˜
′ 32 2304 460800
+ ψ2′ 960 4147200
10
+ α⋆ β − 2 ψ̃1 n + ns + ns2 n2 + n2 s + n ξ.
Next, by Lems. F.2 and F.3, we can upper-bound the right-hand side of this expression by
q 2 ′
( 3 − 1) 64MMDk (Π,P×Q)ξ + 144ψ
β n 2
2
n+ 10 ξ˜ n
1/2
+ 12
α⋆ β −2 (4ξ˜wild MMD2k (Π, P × Q) + MMD4k (Π, P × Q))( 32
n +
2304
ns + 460800
ns2 )
4147200 1/2
+ ψ2′ 960
n2 + n2 s
s+1
≤ (3 − s2 − a)MMD2k (Π, P × Q) + bMMDk (Π, P × Q) + c for
q
a ≜ 3 − s+1 12
32 2304 460800
s2 − α⋆ β − 2 n + ns + ns2 ,
q
12
˜ 256 18432 3686400 32ξ
b≜ α⋆ β − 2 ξwild ( n + ns + ns2 ) + n , and
q
12 ˜
′ 960 4147200 10
c≜ α⋆ β − 2 ψ2 n2 + n2 s + n ξ.
The sufficiency of (19) now follows from the sufficiency of the quadratic inequality ax2 −
bx − c ≥ 0 with x = MMDk (Π, P × Q), exactly as in the standard case, and the triangle
inequality.
Moreover, we have
U = E[hin (Xi1 , Xi2 , Xi3 , Xi4 )]
= E gY (Y1 , Y2 ) + gY (Y3 , Y4 ) − gY (Y1 , Y3 ) − gY (Y2 , Y4 )
× gZ (Z1 , Z2 ) + gZ (Z3 , Z4 ) − gZ (Z1 , Z3 ) − gZ (Z2 , Z4 )
= 4(Π × Π)(gY gZ ) + 4((P × Q) × (P × Q))(gY gZ )
− 4(Π × (P × Q))(gY gZ ) − 4((P × Q) × Π)(gY gZ ) = 4MMD2k (Π, P × Q).
F.2. Proof of Lem. F.2: ψ1′ bound. By Jensen’s inequality and Cauchy Schwarz,
ψ1′ = Var(E[hin (X1 , X2 , X3 , X4 ) | X1 ])
≤ 4!1 (i1 ,i2 ,i3 ,i4 )∈i44 Var(E[hin (Xi1 , Xi2 , Xi3 , Xi4 ) | X1 ])
P
= E[E[hin (X1 , X2 , X3 , X4 ) | X1 ]] = ξ.
F.3. Proof of Lem. F.3: ψ̃1′ bound. By the definition (14) of ψ̃1′ , we can write
ψ̃1′ = E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )
× hin,Y (Y1 , Y5 , Y6 , Y7 )hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 )
− E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )
× E hin,Y (Y1 , Y5 , Y6 , Y7 )hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) .
The result now follows immediately from the following lemma.
L EMMA F.4 (Permuted hin moments). Instantiate the assumptions of Cor. 2, and suppose
π is an independent wild bootstrap permutation of [n]. Then
E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 ) = MMD2k (Π, P × Q).
(70)
If, in addition, n ≥ 14, then
E hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Y (Y1 , Y5 , Y6 , Y7 )
× hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) ≤ 4ξ˜wild MMD2k (Π, P × Q).
P ROOF. Since n ≥ 8, (π1 , π2 , π3 , π4 ) are independent. Hence, equation (70) follows from
equation (68) of Lem. E.5 and Lem. F.1.
CHEAP PERMUTATION TESTING 53
Now suppose that n ≥ 14, so that |[7] ∪ {swap(i) : i ∈ [7]}| = 14, and define the
Unif({0, 1}) variables ϵi ≜ I(i = πi ) for i ∈ [7] and µi ≜ ϵi Π + (1 − ϵi )P × Q. We have
E[hin,Y (Y1 , Y2 , Y3 , Y4 )hin,Y (Y1 , Y5 , Y6 , Y7 )
× hin,Z (Zπ1 , Zπ2 , Zπ3 , Zπ4 )hin,Z (Zπ1 , Zπ5 , Zπ6 , Zπ7 ) | π, Y1 , Zπ1 ]
k((y, z), (y ′ , z ′ ))k((y ′′ , z ′′ ), (y ′′′ , z ′′′ ))
RRRR
=
d(δ(Y1 ,Zπ1 ) + µ4 − δY1 × Q − P × δZπ1 )(y, z)
d(µ2 + µ3 − P × Q − P × Q)(y ′ , z ′ )
d(δ(Y1 ,Zπ1 ) + µ7 − δY1 × Q − P × δZπ1 )(y ′′ , z ′′ )
d(µ5 + µ6 − P × Q − P × Q)(y ′′′ , z ′′′ )
k((y, z), ·)k((y ′′ , z ′′ ), ·) d(δ(Y1 ,Zπ1 ) + µ4 − δY1 × Q − P × δZπ1 )(y, z)
RR
≤
d(δ(Y1 ,Zπ1 ) + µ7 − δY1 × Q − P × δZπ1 )(y ′′ , z ′′ ) k⊗k
G.1. Bounding the fluctuations of the test statistic. We begin by identifying a CQB
Φn satisfying the test statistic tail bound (21). Our proof makes use of two lemmas. The first
shows that functions of independent variables concentrate around their mean when coordi-
natewise function differences are bounded.
L EMMA G.1 (Bounded differences inequality [34, Thm. D.8]). Suppose that, for all x1 ∈
X1 , x2 ∈ X2 , . . . , xn ∈ Xn and i ∈ [n], a function f : X1 × X2 × · · · × Xn → R satisfies
supx′i ∈Xi |f (x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − f (x1 , . . . , xi−1 , x′i , xi+1 , . . . , xn )| ≤ ci .
If X1 , X2 , . . . , Xn are independent with each Xi ∈ Xi then, with probability at least 1 − δ ,
q
f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )] < 12 log(1/δ) ni=1 c2i .
P
The second lemma uses Lem. G.1 to derive a suitable CQB for Vn .
L EMMA G.2 (Bounded independence quantile bound). Under the assumptions of Thm. 4,
Pr |4MMD2k (Π, P × Q) − Vn | ≥ Φn (β) ≤ β for all β ∈ (0, 1)
54
√ √
6 K+6 2K log(1/β)
where Φn (β) ≜ Λ2 (β) + 4MMDk (Π, P × Q)Λ(β) and Λ(β) ≜ √
n
.
P ROOF. Let X̃ = (Ỹ , Z̃) be an independent draw from Π. By [42, App. B],
Vn = n14 (i1 ,i2 ,i3 ,i4 )∈[n]4 hin (Xi1 , Xi2 , Xi3 , Xi4 ) = 4MMD2k (Π,
P b Pb×Q
b ).
By the triangle inequality, we have the following sure inequalities for each k ∈ [n]:
|MMDk (Π, P × Q) − MMDk (Π,
b Pb×Q
b )| ≤ MMDk (Π, Π)
b + MMDk (P × Q, P
b×Q
b)
≜ ∆(X) + ∆′ (X),
√
|∆(X) − ∆((X−k , X̃))| ≤ n1 MMDk (δXk , δX̃ ) ≤ 2 K
n , and
|∆′ (X) − ∆′ ((X−k , X̃))|
q q √
1 b )gZ + 1 MMDgZ (δZ , δ ) supy gY (y, y) ≤ 4 K
≤ n MMDgY (δYk , δỸ )
b ×Q
(Q n k Z̃ n .
Since |a2 − b2 | = |a − b||b + a| ≤ |a − b|(|b − a| + 2|a|) for all real a, b the result follows.
and its cheaply permuted version for any wild bootstrap permutation π on [s],
Ṽnπ,s ≜ s12 si,j=1 Rijπj πi .
P
(75)
L EMMA G.3 (V-statistic relationship). In the notation of (74), (75), and Def. 3
Ṽn = 14 Vn and Ṽnπ,s = 14 Vnπ,s
for any wild bootstrap permutation π on [s].
1 P
− n i4 ∈[n] gY (Yi1 , Yi4 ) gZ (Zπi1 , Zπi2 )
1 1
P P
+ n i3 ∈[n] gZ (Zπi2 , Zπi3 ) n i4 ∈[n] gY (Yi1 , Yi4 )
4 1
P P
= n2 (i1 ,i2 )∈[n]2 gY (Yi1 , Yi2 ) − n i4 ∈[n] gY (Yi1 , Yi4 )
1 P
× gZ (Zπi2 , Zπi1 ) − n i3 ∈[n] gZ (Zπi2 , Zπi3 )
4
K̄i1 ,i2 L̄πi2 ,πi1 = 4Ṽnπ,s .
P
= n2 (i1 ,i2 )∈[n]2
G.3. Bounding the fluctuations of the threshold. We next identify CQBs ΨX,s and
Ψn,s that satisfy the properties
Pr(Vnπ,s ≥ ΨX,s (α) | X) ≤ α, ∀α ∈ (0, 1), and
Pr(ΨX,s (α) ≥ Ψn,s (α, β)) ≤ β, ∀α, β ∈ (0, 1)
for the cheaply permuted V-statistic Vnπ,s (32). Since the alternative V-statistic representation
Ṽnπ,s (75), satisfies Ṽnπ,s = 14 Vnπ,s by Lem. G.3, we will find Ψ̃n,s and Ψ̃X,s satisfying
(76) Pr(Ṽnπ,s ≥ Ψ̃X,s (α) | X) ≤ α, ∀α ∈ (0, 1), and
(77) Pr(Ψ̃X,s (α) ≥ Ψ̃n,s (α, β)) ≤ β, ∀α, β ∈ (0, 1),
and set ΨX,s (α) = 4Ψ̃X,s (α) and Ψn,s (α, β) = 4Ψ̃n,s (α, β). Our derivation makes use of four
lemmas. The first, proved in App. G.5, rewrites Ṽnπ,s in terms of independent Rademacher
random variables. Below,
for each i ∈ [s]. If π is a wild bootstrap permutation on [s] drawn independently of X, then
the alternative V-statistic representation (75) satisfies
Ṽnπ,s = ϵ⊤ Aϵ + b⊤ ϵ + c,
where ϵ ∼ Unif({±1}s/2 ) is independent of X, and A ∈ Rs/2×s/2 , b ∈ Rs/2 , and c ∈ R satisfy
MMD2k (Π,P×Q) 2 (A(1) )2 (A(2) )2 MMD2k (Π,P×Q)((A(1) )2 +(A(2) )2 )
A− s2 11⊤ F ≤ 4s4 + 4s3 ,
MMD2k (Π,P×Q) ⊤ ≤ A(1) A(2)
MMDk (Π,P×Q)(A(1) +A(2) )
Tr A − s 2 11 2s2 + 4s3/2 ,
MMD2k (Π,P×Q) 1 (A(3) )2 2A(3) MMDk (Π,P×Q)
c− 4 ≤ 4 s + √
s
, and
MMD2k (Π,P×Q) 2 3(A(2) )2 (A(3) )2 +(A(1) )2 (A(4) )2
b− s 1 2 ≤ 4s3
3MMD2k (Π,P×Q)((A(1) )2 +(A(2) )2 +(A(3) )2 +(A(4) )2 )
+ 4s2
Our second lemma, proved in App. G.6, bounds the expectation of each quantity A(i)
introduced in Lem. G.4.
L EMMA G.5 (A(i) mean bounds). Under the assumptions and notation of Lem. G.4 with
m ≜ ns ,
√ √
E A(1) ≤ D1 ≜ Ks √5m + √2n , E A(2) ≤ D2 ≜ Ks √3m ,
√ √ √ √
E A(3) ≤ D3 ≜ Ks 5+ √ 8, and E A(4) ≤ D4 ≜ Ks 1+ √ 8.
n n
Our third lemma, proved in App. G.7, uses Lem. G.5 and the bounded differences inequal-
ity (Lem. G.1) to provide CQBs for each quantity A(i) introduced in Lem. G.4.
L EMMA G.6 (A(i) complementary quantile bounds). Instantiate the assumptions and
notation of Lem. G.5, and fix any i ∈ [4]. With probability at least 1 − δ ,
q √ √ √ √ √
A(i) ≤ Di + Ci log(1/δ)
2 for (C1 , C2 , C3 , C4 ) = ( 10√mKs + 2 √Ks
n
, 6√Ks 14√ Ks 6 √Ks
m
, n
, n
).
Our fourth lemma, proved in App. G.8, uses the estimates of Lem. G.6 to verify the CQB
properties (76) and (77) for a particular pairing of (Ψ̃X,s , Ψ̃n,s ).
CHEAP PERMUTATION TESTING 57
L EMMA G.7 (Quantile bound for independence permutation threshold). Instantiate the
assumptions and notation of Lem. G.6. Then, for all α, β ∈ (0, 1),
1
4 Ψn,s (α, β) ≜ Ψ̃n,s (α, β)
q
K
p 3 5 log(3/α) 3
≜ n (196 + 148 log(4/β) + 48 log(4/β)) 4 + 3 + log(3/α)
2
√ q
1+12 log(3/α)
+ MMDk (Π, P × Q) √K
p 1 3
n
30 + 22 log(4/β) 2 + √
6 s
+ 2 log(3/α)
q
+ MMD2k (Π, P × Q) 14 + log(6/α) + log(6/α)
s s
satisfies Pr(Ψ̃X,s (α) ≥ Ψ̃n,s (α, β)) ≤ β for Ψ̃X,s (α) defined in equation (87) and satisfying
Pr(Ṽnπ,s ≥ Ψ̃X,s (α) | X) ≤ α.
G.4. Putting everything together. Fix any β ∈ (0, 1), and suppose that
(78) T = 4MMD2k (Π, P × Q) ≥ Ψn,s (α⋆ , β/3) + Φn (β/3),
for Ψn,s and Φn defined in Lems. G.2 and G.7. By Lems. G.2, G.3, and G.7 and Prop. 2, a
cheap wild bootstrap test based on Vn has power at least 1−β . We now show that the assumed
condition of Thm. 4 is sufficient to imply (78). Indeed, we see that (78) holds whenever
ax2 − bx − c ≥ 0 for x = MMDk (Π, P × Q),
q
log(6/α⋆ ) ⋆)
3
a= 4 − s − log(6/α s ,
√ q
1+12 log(3/α⋆ )
b = √K 3
p
30 + 22 log(12/β) 3 + + ⋆
2 log(3/α )
√
n 6 s
√ 1 1+12 log(3/α⋆ ) q 3 √ √
≥ √K ⋆ ) + 6 K(1+ √2 log(3/β))
p
n
30 + 22 log(12/β) 2 + √
6 s
+ 2 log(3/α n
q q
c= K 7 5 ⋆ 2
p 2
n (14 + 7 log(12/β)) 8 + 3 log(3/α )
⋆
q
≥K 2 7 + 5 log(3/α ) + 3
p
(14 + 7 log(12/β)) ⋆
n 8 3 2 log(3/α )
q
K 3 5 log(3/α⋆ )
+ 32 log(3/α⋆ )
p
≥ n (196 + 148 log(12/β) + 48 log(12/β)) 4 + 3
+ 3K
p
n (1 + 2 log(3/β))2
√ √
2
Since a ≤ 1, it suffices to have x ≥ a1 (b + c) ≥ ab + ac ≥ b+ b2a+4ac . The advertised result
p
q q q
now follows as 3 + 12 78 ≤ 27 and 32 + 12 53 ≤ 2.
G.5. Proof of Lem. G.4: Rademacher representation of Ṽnπ,s . Recall the definition
(73) of Rijkl , and, for each i ∈ [s], define the Rademacher vector ϵ̃i = 2I(i = πi ) − 1. Observe
that by construction, ϵ̃i = ϵ̃swaps (i) for all i ∈ [s/2]. Then
1+ϵ̃j 1+ϵ̃i 1−ϵ̃j
Rijπj πi = ( 1+ϵ̃
2 )( 2 )Rijji + ( 2 )( 2 )Rij(swaps (j))i
i
ϵ̃i ϵ̃j
= 4 Rijji − Rij(swaps (j))i − Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
ϵ̃i
+ 4 Rijji + Rij(swaps (j))i − Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
58
ϵ̃j
+ 4 Rijji − Rij(swaps (j))i + Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
1
+ 4 Rijji + Rij(swaps (j))i + Rijj(swaps (i)) + Rij(swaps (j))(swaps (i)) .
Hence,
ϵ̃i ϵ̃j
Ṽnπ,s = 1 Ps
n2 i,j=1 4 Rijji − Rij(swaps (j))i − Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
ϵ̃i
+ 4 Rijji + Rij(swaps (j))i − Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
ϵ̃j
+ 4 Rijji − Rij(swaps (j))i + Rijj(swaps (i)) − Rij(swaps (j))(swaps (i))
1
+ 4 Rijji + Rij(swaps (j))i + Rijj(swaps (i)) + Rij(swaps (j))(swaps (i))
≜ ϵ̃⊤ Aϵ̃ + b⊤ ϵ̃ + c,
We now rewrite the terms Rijji , Rijqi , Rijjr , and Rijqr in terms of kernel expectations:
1 1 Pm −1
Pn
m2 Rijji = m2 k,l=1 gY (Ym(i−1)+k , Ym(j−1)+l ) − n j ′ =1 gY (Ym(i−1)+k , Yj ′ )
1 1 Pm −1
Pn
m2 Rijqr = m2 k,l=1 gY (Ym(i−1)+k , Ym(j−1)+l ) − n ′
j =1 gY (Y m(i−1)+k , Yj ′)
−1
Pn
(82) × gZ (Zm(q−1)+k , Zm(r−1)+l ) − n i′ =1 gZ (Zm(q−1)+k , Zi′ )
Ps
= 4s12 j=1 (δX(j) − δY × δZ(j) + δY(j) × δZ(swaps (j)) − δY × δZ(swaps (j)) )k,
2
= 14 δX − 2δY × δZ + si=1 δY(i) × δZ(swaps (i)) k k = 14 ∥µk∥2k
P
60
At this point, we further rewrite the expression Ṽnπ,s = ϵ̃⊤ Ãϵ̃ + b̃⊤ ϵ̃ + c̃ in terms of ϵ̃1:s/2 .
Since ϵ̃s/2:s = ϵ̃1:s/2 , we have that
Ṽnπ,s = ϵ̃⊤ ⊤
1:s/2 Ã1:s/2,1:s/2 ϵ̃1:s/2 + ϵ̃s/2:s Ãs/2:s,1:s/2 ϵ̃1:s/2
+ ϵ̃⊤ ⊤
1:s/2 Ã1:s/2,s/2:s ϵ̃s/2:s + ϵ̃s/2:s Ãs/2:s,s/2:s ϵ̃s/2:s
+ b̃⊤ ⊤
1:s/2 ϵ̃1:s/2 + b̃s/2:s ϵ̃s/2:s + c̃
= ϵ̃⊤
1:s/2 Ã1:s/2,1:s/2 + Ãs/2:s,1:s/2 + Ã1:s/2,s/2:s + Ãs/2:s,s/2:s ϵ̃1:s/2
⊤
+ b̃1:s/2 + b̃s/2:s ϵ̃1:s/2 + c̃.
Thus, if we define
(85) ϵ = ϵ̃1:s/2 ∼ Unif({±1}s/2 ),
A = Ã1:s/2,1:s/2 + Ãs/2:s,1:s/2 + Ã1:s/2,s/2:s + Ãs/2:s,s/2:s ∈ Rs/2×s/2 ,
b = b̃1:s/2 + b̃s/2:s ∈ Rs/2 , and c = c̃ ∈ R,
we have that
Ṽnπ,s = ϵ⊤ Aϵ + b⊤ ϵ + c.
If we let aij and bi be the components of A and b, we have that
aij = ãij + ã(swaps (i))j + ãi(swaps (j)) + ã(swaps (i))(swaps (j)) and bi = b̃i + b̃swaps (i) .
Next, define
Ā = A − 1 2 ⊤
s2 MMDk (Π, P × Q)11 ,
MMD2k (Π,P×Q)
(86) b̄ = b − s 11⊤ , and
c̄ = c − 41 MMD2k (Π, P × Q).
If we let āij and b̄i be the components of Ā and b̄, by equations (83) and (84), we have that
1 2 1 2
āij ≤ ãij − 4s2 MMDk (Π, P × Q) + ã(swaps (i))j − 4s2 MMDk (Π, P × Q)
1 2
+ ãi(swaps (j)) − + ã(swaps (i))(swaps (j)) − 4s12 MMD2k (Π, P × Q)
4s2 MMDk (Π, P × Q)
1
≤ 4s2 ∥(µj − Π + P × Q)k∥k + ∥(µswaps (j) − Π + P × Q)k∥k
× ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
+ MMDk (Π, P × Q) ∥(µj − Π + P × Q)k∥k + ∥(µ̃i − Π + P × Q)k∥k
+ ∥(µswaps (j) − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
CHEAP PERMUTATION TESTING 61
and
1 2 1 2
b̄i ≤ b̃i − 2s MMDk (Π, P × Q) + b̃swaps (i) − 2s MMDk (Π, P × Q)
1
≤ 4s∥(µ − Π + P × Q)k∥k ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
+ ∥(µi − Π + P × Q)k∥k + ∥(µswaps (i) − Π + P × Q)k∥k ∥(µ̃ − Π + P × Q)k∥k
+ MMDk (Π, P × Q) 2∥(µ − Π + P × Q)k∥k + 2∥(µ̃ − Π + P × Q)k∥k
+ ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
+ ∥(µi − Π + P × Q)k∥k + ∥(µswaps (i) − Π + P × Q)k∥k .
We now turn to bounding ∥Ā∥2F in terms of A(1) and A(2) using Jensen’s inequality:
Ps/2
16s4 i,j=1 ā2ij
Ps/2
≤ i,j=1 ∥(µj − Π + P × Q)k∥k + ∥(µswaps (j) − Π + P × Q)k∥k
× ∥(µ̃i − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
+ MMDk (Π, P × Q) ∥(µj − Π + P × Q)k∥k + ∥(µ̃i − Π + P × Q)k∥k
2
+ ∥(µswaps (j) − Π + P × Q)k∥k + ∥(µ̃swaps (i) − Π + P × Q)k∥k
Ps/2
2 2∥(µj − Π + P × Q)k∥2k + 2∥(µswaps (j) − Π + P × Q)k∥2k
≤ i,j=1
G.6.2. Bounding E[A(3) ] and E[A(4) ]. Since δY(j) × δZ(swaps (j)) = P × Q, independence
and our prior MMD moment bounds imply that
Ps/2 Ps/2
E[MMDk ( 2s j=1 δY(j)×δZ(swaps (j)) , P × Q)2 ] = s42 j=1 E[MMDk (δY(j)×δZ(swaps (j)) , P × Q)2 ]
Ps/2 Ps/2
≤ s82 j=1 E[MMDk (δY(j)× Q, P × Q)2 ] + s82 j=1 E[MMDk (δY(j)× Q, P × Q)2 ] ≤ 8K n .
Hence, by the triangle inequality, Cauchy-Schwarz, and our prior MMD moment bounds
√ √
E[A(3) ] = sE ∥(µ − Π + P × Q)k∥k ≤ sE[MMDk (δX , Π) + 2MMDk (δY × δZ , P × Q)
Ps/2
+ 21 MMDk ( 2s j=1 δY(j)×δZ(swaps (j)) , P × Q) + 21 MMDk ( 2s sj= s +1 δY(j)×δZ(swaps (j)) , P × Q)]
P
2
q q q
≤ sK n +4
sK
n +
8sK
n .
G.7. Proof of Lem. G.6: A(i) complementary quantile bounds. Fix any k ∈ [n], let
X̃ = (Ỹ , Z̃) be an independent draw from Π, and let X̃ = (X1 , . . . , Xk−1 , X̃, Xk , . . . , Xn ) =
(Ỹ, Z̃). Viewing A(1) as a function of X and using the triangle inequality, we can write
1 2
|A(1) (X) − A(1) (X̃)| ≤ m
p
MMDk (δXk , δX̃ ) + m MMDgZ (δZk , δZ̃ ) (δỸ × δỸ )gY
q
1 1
p
+m MMDgY (δYk , δỸ ) supz gZ (z, z) + m MMDgZ (δZk , δZ̃ ) supy gY (y, y)
qP √ √
s
+ n1 MMDgY (δYk , δỸ ) 2 10 K
i=1 MMDgZ (δZ(i) , δZ(swaps (i)) ) ≤ m +
2 sK
n .
Viewing A(3) as a function of X and using the triangle inequality, we can also write
√1 |A(3) (X) − A(3) (X̃)| ≤ 1 MMDk (δX , δ ) + 2 MMDgZ (δZ , δ ) (δ × δ )gY
p
s n k X̃ n k Z̃ Ỹ Ỹ
The advertised result now follows from the bounded differences inequality (Lem. G.1) and
the mean estimates of Lem. G.5.
G.8. Proof of Lem. G.7: Quantile bound for independence permutation threshold.
G.8.1. Bounding Ṽ π,s in terms of Ψ̃X,s (α). Fix any α, β ∈ (0, 1). By Lem. G.4,
Ṽnπ,s = ϵ⊤ Aϵ + b⊤ ϵ + c
MMD2k (Π,P×Q) ⊤ 2
P×Q) ⊤ 2
P×Q)
= ϵ⊤ Āϵ + s2 ϵ 11⊤ ϵ + b̄⊤ ϵ + MMDk (Π,
s 1 ϵ + c̄ + MMDk (Π,
4
ϵ⊤ 11⊤ ϵ 1⊤ ϵ
≤ |ϵ⊤ Āϵ − Tr(Ā)| + Tr(Ā) + b̄⊤ ϵ + c̄ + MMD2k (Π, P × Q) 1
4 + s2 + s .
where ϵ ∼ Unif({±1}s/2 ) is independent of X, A, b, and c were defined in (85), and Ā, b̄,
and c̄ were defined in (86).
64
Now fix any δ ′ ∈ (0, 0.86) and δ ′′ , δ ′′′ > 0. By equation (50) of the Hanson-Wright in-
equality (Lem. D.3), with probability at least 1 − δ ′ ,
20∥Ā∥F log(1/δ ′ )
|ϵ⊤ Āϵ − Tr(Ā)| = |ϵ⊤ Āϵ − E[ϵ⊤ Āϵ]| ≤ 3 .
Meanwhile, Hoeffding’s inequality [23, Thm. 2] implies that
b̄⊤ ϵ ≤ b̄ 2 2 log(1/δ ′′ ) with probability at least 1 − δ ′′ and
p
G.8.2. Bounding Ψ̃X,s (α) in terms of Ψ̃n,s (α, β). Recall the definitions of (Di )4i=1 and
(Ci )4i=1 in Lems. G.5 and G.6 respectively. We have
√ √ √ √ √ √ √
C1 √ K
10 2√ K 12√ K C2 6√ K 6√ K C3 14√ K C4 6√ K
s = sm
+ sn
≤ n
, s = sm
= n
, √
s
= n
, and √
s
= n
,
√ √ √ √
D1 5√ K 2√ K 7√ K D2 3√ K
s = ns
+ sm
≤ n
, s = n
,
√ √ √ √ √ √
D3 (5+ 8) K D4 (1+ 8) K
√
s
= √
n
≤ 8√nK , and √
s
= √
n
≤ 4√ K
n
.
By Lem. G.6, with probability at least 1 − β , the following A(i) -based bounds all hold simul-
taneously. First,
q q
log(4/β)
A(1) A(2) 1
D2 + C2 log(4/β)
s2 ≤ s2 D 1 + C1 2 2
√ √ p √ √ p
≤ 7√nK + 12√2nK log(4/β) × 3√nK + 6√2n K
log(4/β)
≤K
p
n 21 + 56 log(4/β) + 36 log(4/β) .
CHEAP PERMUTATION TESTING 65
Second,
q q q q
log(4/β) log(4/β) log(4/β) log(4/β)
A(2) A(3) +A(1) A(4) D2 +C2 D3 +C3 + D1 +C1 D4 +C4
s3/2 ≤ 2 2
s3/2
2 2
√ √ q √ √ q
3√ K 6√ K log(4/β) 14√ K 8√ K log(4/β)
≤ n
+ n 2 × n
+ n 2
√ √ q √ √ q
+ 12√nK log(4/β)
7√ K
× 6√nK + 4√ K log(4/β)
+ n 2 n 2
K
p
≤ n 84 + 148 log(4/β) + 48 log(4/β) .
Third,
q q √
log(4/β)
14 + 8 log(4/β)
(3)
A D3 C3 √K
√
s
≤ √
s
+ √
s
≤ 2 2 n
and hence
(A(3) )2 K
p
s ≤ n 196 + 112 log(4/β) + 64 log(4/β) .
Fourth,
q q √
A(1) +A(2) log(4/β) log(4/β)
√1 D1 C1 D2 C2 K
p
s3/2 ≤ s s
+ s 2 + s + s 2 ≤ √ns 10 + 13 log(4/β) .
Fifth,
A(1) +A(2) +A(3) +A(4)
s
q q q q
D1 C1 log(4/β) log(4/β) log(4/β) log(4/β)
≤ s + s 2 + Ds2 + Cs2 2 + √1 √
s
D3
s
+ Cs
√
s 2 + D4
√
s
+ C4
√
s 2
√
√K 30 + 22 log(4/β) .
p
= n
Hence, with probability at least 1 − β , the quantity Ψ̃X,s (α) (87) satisfies
10 log(3/α)
Ψ̃X,s (α) ≤ K 1
p
n 21 + 56 log(4/β) + 36 log(4/β) × 2 + 3
q
+K 3
p
n 84 + 148 log(4/β) + 48 log(4/β) × 2 log(3/α)
+K
p
n 196 + 112 log(4/β) + 64 log(4/β)
√
K
10 + 13 log(4/β) × 14 + 10 log(3/α)
p
+ MMDk (Π, P × Q) √ns 3
√ q 3 √ q
log(4/β)
+ √K √K 14 + 8
p
n
30 + 22 log(4/β) 2 log(3/α) + n 2
q
+ MMD2k (Π, P × Q) 14 + log(6/α) + log(6/α)
s s ≤ Ψ̃n,s (α, β).
L EMMA H.1 (High probability bound on critical value [13, Lem. 6, Cor. 1]). Under the
assumptions of Prop. 2, let FX be the cumulative distribution function of T1 given X and let
1 1
δ α ⌊α(B+1)⌋
q1−α(δ) (X) ≜ inf x ∈ R : 1 − α(δ) ≤ FX (x) for α(δ) ≜ ( B ) ⌊α(B+1)⌋ ≥ 2e δ
(⌊α(B+1)⌋)
be the 1 − α(δ) quantile of T1 given X. Then, with probability at least 1 − δ ,
T(bα ) ≤ q1−α(δ) (X).
66
P ROOF. Lem. 6 and Cor. 1 of [13] establish this result for the special case of permuted
MMD coreset test statistics, but the proof for the general case is identical.
Since ΨX (α) ≥ q1−α(β/3) (X) almost surely by assumption (22), Lem. H.1 implies that
ψY,1 , ψZ,1 ≤ 4 b(1) ∥P − Q∥22 and ψY Z,2 ≤ b(1) by [25, App. G] imply that the cheap dis-
p
crete homogeneity permutation test has power at least 1 − β whenever
r √
2 16(3−β) b(1) ∥P−Q∥22 1 1
q 1−α⋆ (36n21 +36n22 +198n1 n2 )b(1)
∥P − Q∥2 = E[Un1 ,n2 ] ≥ β n1 + n2 + α⋆ βn1 (n1 −1)n2 (n2 −1)
q
201s2
√
b(1) + 48s b(1) ∥P−Q∥22
×(1+ v1 ) + 4n2 n
v .
By the triangle inequality, it suffices to have ax2 − bx − c ≥ 0 for x = ∥P − Q∥2 , a = 1,
r √ r √
√
16(3−β) b(1) 1 1
1 48 b(1) t
b= n1 + n2 ×(1+ v ) + n 3 ρn n v , and
√
β 1 2
q √
2 2
1−α⋆ (36n1 +36n2 +198n1 n2 )b(1) 201b(1) t
c= α⋆ βn1 (n1 −1)n2 (n2 −1) ×(1 + v1 ) + 2n √
3 ρn n v.
1 2
CHEAP PERMUTATION TESTING 67
√ √
Hence, by the quadratic formula, it suffices to have x ≥ b + c ≥ (b + b2 + 4ac)/(2a).
√
Finally, since 1/(n 3 ρn1 n2 ) = O(1/n1 ), there exist universal constants C, C ′ such that
√ 1/4 √ 1/4
C ′ b(1) t√ C ′ b(1) (1+v)1/3
b+ c≤ √ ⋆
(1 + min(v, v)
)= √ ⋆
(1 + u1/6
√ )
min(v, v)
min(β,α ) n1 min(β,α ) n1
1/4
Cb
≤ √
6 ⋆
√ (1)
(1 + v1 ).
βα min(β,α⋆ ) n1
These bounds, together with Prop. 3, yield the sufficient power condition
√
−d/4
C ′ L1/4 κ(1) 1 −d/2
√
6 ⋆
√ ⋆
1 + q 3 βα⋆ ≤ C1 κ(1) ∥P − Q∥2 ≤ ∥P̃ − Q̃∥2
βα min(β,α ) n1 s 24(1−α⋆ )
ρn1 n2 −1
2/(4τ +d)
for a universal constant C ′ . The result now follows from the upper estimate n1 ≥ κ(1) .
J.3. Proof of Prop. 5: Power of cheap testing: discrete L2 independence. Note that
i.i.d.
the augmentation variables (Ai )ni=1 , (Bi )ni=1 ∼ Unif((0, 1)) are, with probability 1, all dis-
tinct. Therefore, gY ((Yi , Ai ), (Yi , Ai )) = gZ ((Zi , Bi ), (Zi , Bi )) = 0 for all i ∈ [n]. The argu-
ments of [25, Prop. 5.3] imply the variance component (14) bounds
ψ1′ ≤ C1 b(2) ∥Π − P × Q∥22 , ψ2′ ≤ C2 b(2) , and ψ̃1′ ≤ C3 b(2) ∥Π − P × Q∥22
p p
for universal constants C1 , C2 , and C3 . Additionally, we can bound the mean component ξ˜
(15) by bounding the following expression for arbitrary i1 , . . . , i4 , i′1 , . . . , i′4 ∈ [8].
E hin ((Yi1 , Ai1 ), (Zi′1 , Bi′1 )), ((Yi2 , Ai2 ), (Zi′2 , Bi′2 )),
((Yi3 , Ai3 ), (Zi′3 , Bi′3 )), ((Yi4 , Ai4 ), (Zi′4 , Bi′4 )
= E gY ((Yi1 , Ai1 ), (Yi2 , Ai2 )) + gY ((Yi3 , Ai3 ), (Yi4 , Ai4 ))
− gY ((Yi1 , Ai1 ), (Yi3 , Ai3 )) − gY ((Yi2 , Ai2 ), (Yi4 , Ai4 ))
× gZ ((Zi′1 , Bi′1 ), (Zi′2 , Bi′2 )) + gZ ((Zi′3 , Bi′3 ), (Zi′4 , Bi′4 ))
− gZ ((Zi′1 , Bi′1 ), (Zi′3 , Bi′3 )) − gZ ((Zi′2 , Bi′2 ), (Zi′4 , Bi′4 ))
(i)
= E I(Yi1 = Yi2 , i1 ̸= i2 ) + I(Yi3 = Yi4 , i3 ̸= i4 )
− I(Yi1 = Yi3 , i1 ̸= i3 ) − I(Yi2 = Yi4 , i2 ̸= i4 )
× I Zi′1 = Zi′2 , i′1 ̸= i′2 + I Zi′3 = Zi′4 , i′3 ̸= i′4
(ii)
≤ 16 max E[I(Y1 = Y2 )I(Zi = Zj )] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}
68
(iii)
= 16 max E[I(Y1 = Y2 , A1 ̸= A2 )I(Zi = Zj , Bi ̸= Bj )] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}
= 16 max E[gY ((Y1 , A1 ), (Y2 , A2 ))2 gZ ((Zi , Bi ), (Zj , Bj ))2 ] : (i, j) ∈ {(1, 2), (1, 3), (3, 4)}
= ψ2′ .
Above, equality (i) holds because, with probability 1, Ai = Aj ⇔ i = j and Bi = Bj ⇔ i = j .
Inequality (ii) holds because, after expanding the product, the only terms that remain are of
the form I(Yi = Yj )I(Zk = Zl ) with i ̸= j and k ̸= l. Equality (iii) holds because A1 ̸= A2
and Bi ̸= Bj almost surely. We conclude that
ξ˜ ≤ ψ ′ ≤ C2 b(2) .
2
Since U = 4∥Π − P × Q∥22 , Thm. 3, our mean and variance component bounds, and the
triangle inequality imply that the cheap discrete independence test has power at least 1 − β
whenever ax2 − bx − c ≥ 0 for x = ∥Π − P × Q∥2 , a = 1 − s+1 3s2 ,
r √ q p q 32 q 12
3 3 16C1 b(2) 12
2304 460800
4 b = ( β −1) n + C3 b(2) ⋆
α β −2 n + α⋆ β − 2 ns + ns2 , and
q q 12 960 24 16 q 12
3 3 144
p 4147200
4 c = C 2 b(2) ( β −1) n 2 + ⋆
α β −2 n2 + n + n + α⋆ β − 2 n2 s .
√ √
2
Since a ≤ 1, it therefore suffices to have x ≥ a1 (b + c) ≥ ab + ac ≥ b+ b2a+4ac by the
p
√ Cb
1/4
quadratic formula. The claim now follows as a1 (b + c) ≤ √βα(2)⋆ n for a universal constant C .
J.4. Proof of Prop. 6: Power of cheap testing: Hölder RL2 independence. Let P̃ be a
discrete distribution over {B1,1 , . . . , B1,K1 } with probability B1,k dP(y) of selecting bin B1,k
for each k ∈ [K1 ]. Define Q̃ analogously as a discrete distribution over {B2,1 , . . . , B2,K2 }
based on Q and Π̃ analogously as a discrete distribution over {(B1,k1 , B2,k2 ) : k1 ∈ [K1 ], k2 ∈
[K2 ]} based on Π. Let κ(2) = ⌊n2/(4τ +d1 +d2 ) ⌋. By [2, Lem. 3],
−(d1 +d2 )
∥Π̃ − P̃ × Q̃∥22 ≥ C1 κ(2) ∥Π − P × Q∥22 ,
for a universal constant C1 . Furthermore, by the argument in [25, App. M],
−(d1 +d2 )
b(2) ≜ max{∥Π∥22 , ∥P × Q∥22 } ≤ Lκ(2) ,
These bounds, together with Prop. 5, yield the sufficient power condition
−(d +d )/4
C ′ L1/4 κ(2) 1 2 √ −(d +d )/2
√ ⋆
βα n
≤ C1 κ(2) 1 2 ∥Π − P × Q∥2 ≤ ∥Π̃ − P̃ × Q̃∥2
for a universal constant C ′ . The result now follows from the estimate n2/(4τ +d1 +d2 ) ≥ κ(2) .
Standard Standard
1279 2559
0.74 639
1279 2559
0.74 639
319
159 159 319
79
0.72 0.72 79
39
Power
Power
0.70 0.70 39
0.68
0.68
19 0.66 19
0.66
F IG K.1. Power and runtime of standard MMD permutation tests of homogeneity (left) and standard HSIC wild
bootstrap tests of independence (right) as number of permutations B varies.