Cook_umn_0130E_23206

Download as pdf or txt
Download as pdf or txt
You are on page 1of 105

Two Applications of PDEs in Data Science

A Dissertation
SUBMITTED TO THE FACULTY OF THE
UNIVERSITY OF MINNESOTA
BY

Brendan Cook

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE


DEGREE OF
DOCTOR OF PHILOSOPHY

Jeff Calder

May 2022
Copyright © 2022 Brendan Cook. All rights reserved.
Table of Contents
List of Figures iii

List of Tables iv

1 Overview 1

I Convergence Rates for the Continuum Limit of Non-


dominated Sorting 1

2 Introduction 3

3 Main results 6
3.1 Definition of Viscosity Solution . . . . . . . . . . . . . . . . . . . 10
3.2 Outline of Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . 10
3.3 Outline of Paper . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Maximum Principle and Lipschitz estimates 15

5 Rates of convergence for the longest chain problem 22

6 Lemmas For Proving Convergence Rates 28

7 Proofs of Convergence Rates 38


7.1 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.2 Proof of Theorem 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . 43

8 Proof of Theorem 3.5 46

II Poisson Learning: Graph Based Semi-Supervised


Learning
At Very Low Label Rates 56

9 Introduction 57

10 Poisson learning 58
10.1 Random walk interpretation . . . . . . . . . . . . . . . . . . . . . 60
10.1.1 Proofs for Section 10.1 . . . . . . . . . . . . . . . . . . . . 63

i
10.2 Variational interpretation . . . . . . . . . . . . . . . . . . . . . . 66
10.2.1 Proofs for Section 10.2 . . . . . . . . . . . . . . . . . . . . 67
10.3 Laplace learning at low label rates . . . . . . . . . . . . . . . . . 76
10.4 The Poisson MBO algorithm . . . . . . . . . . . . . . . . . . . . 77
10.4.1 Proofs for Section 10.4 . . . . . . . . . . . . . . . . . . . . 80

11 Poisson learning algorithms 83

12 Experimental Results 84

13 Continuum limits 88

14 Conclusion 92

References 99

ii
List of Figures
1 Illustration of the sets An , Bn and A, B used in the proof outline,
and the viscosity touching property that An ⊂ A and B ⊂ Bn . . 11
2 Demonstration of spikes in Laplace learning. The graph consists
of n = 104 independent uniform random variables on [0, 1]2 and
two points are given labels of 0 and 1. Most values of the solution
u of Laplace learning are very close to 0.5. . . . . . . . . . . . . . 61
3 Accuracy of Poisson Learning for (a) different numbers of neigh-
bors k used to construct the graph and (b) unbalanced training
data. In (a) we used 5 labels per class and in (b) we used 1 label
per class for the odd numbered classes, and m = 1, 2, 3, 4, 5 labels
per class for the even numbered classes. Both figures show the
difference in accuracy compared to k = 10 and balanced training
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

iii
List of Tables
1 MNIST: Average accuracy scores over 100 trials with standard
deviation in brackets. . . . . . . . . . . . . . . . . . . . . . . . . . 86
2 FashionMNIST: Average accuracy scores over 100 trials with
standard deviation in brackets. . . . . . . . . . . . . . . . . . . . 87
3 Cifar-10: Average accuracy scores over 100 trials with standard
deviation in brackets. . . . . . . . . . . . . . . . . . . . . . . . . . 88

iv
1 Overview
In this thesis we present two papers applying partial differential equations (PDEs)
in data science. In Part I we present the paper ”Rates of Convergence for the
Continuum Limit of Nondominated Sorting” [22] published in the SIAM Journal
of Mathematical Analysis, which builds on prior work by Calder et al. [16]
which found that the nondominated sorting process used in multi-objective
optimization has a continuum limit PDE. In this work, we prove a convergence
rate which holds with high probability, bounding the speed of convergence of
nondominated sorting to its continuum limit. This result can be applied to show
a convergence rate for the PDE-based ranking algorithm proposed by Calder et
al. in [18], which efficiently approximates nondominated sorting by solving the
continuum limit PDE numerically.
In Part II we present the paper ”Poisson Learning: Graph Based Semi-
Supervised Learning At Very Low Label Rates” [15] which introduces a new
algorithm for semi-supervised learning on graphs. Semi-supervised learning is
a subfield of machine learning that combines the paradigms of supervised and
unsupervised learning, by making use of both labeled data and unlabeled data
in a prediction task. A graph is a data structure that describes a network of
objects and the relationships between them. For example, in a social network
such as Facebook, each object represents a user, and each user is connected
to a set of other users through friendships. Other examples of graph data
include transportation networks, citation databases, websites, and many others.
When the data have a graph structure and labeled data is limited, graph-based
semi-supervised learning can leverage the geometric and topological relationships
between the unlabeled and labeled data to enhance the prediction. In graph-
based semi-supervised learning, data are given in the form of a weighted graph
with a small subset of labeled nodes, and learning algorithms make use of the
graph structure to propagate information from the labeled nodes to the unlabeled
nodes to make an accurate prediction. This paper introduces the Poisson learning
algorithm, which is a new approach to graph-based semi-supervised learning
that excels with very little labeled data.

1
Part I

Convergence Rates for the


Continuum Limit of
Nondominated Sorting

2
2 Introduction
The sorting of multivariate data is an important problem in many fields of
applied science [17]. Nondominated sorting is a discrete process that is widely
applied in multiobjective optimization and can be interpreted as arranging a
finite set of points in Euclidean space into layers according to the coordinatewise
partial order. Let ≤ denote the coordinatewise partial order on Rd given by

x ≤ y ⇐⇒ xi ≤ yi for all i = 1, . . . , d.

Given a set of distinct points X = {X1 . . . , Xn } ⊂ Rd , let F1 denote the subset


of points that are coordinatewise minimal. The set F1 is called the first Pareto
front, and the elements of F1 are called Pareto-optimal or nondominated. In
general, the k-th Pareto front is defined by
[
Fk = Minimal elements of X \ Fj ,
j<k

and nondominated sorting is the process of sorting a given set of points by


Pareto-optimality. A multiobjective optimization problem involves identifying
from a given set of feasible solutions those that minimize a collection of objective
functions. In the context of multiobjective optimization, the d coordinates of a
point to be sorted are the values of the d objective functions on a given feasible
solution, and nondominated sorting provides an effective ranking of all feasible
solutions. Nondominated sorting and multiobjective optimization are widely
used in science and engineering disciplines [24, 27], particularly to control theory
and path planning [47, 53], gene selection [29, 37], clustering [34], anomaly
detection [40, 39], and image processing [54, 21, 38].

Set Rd+ = x ∈ Rd : xi > 0 for i = 1, . . . , d and define the Pareto-depth
function Un = j=1 1Pj where Pj = x ∈ Rd+ : x ≥ y for some y ∈ Fj . It was
Pn 

shown in [17] that if the Xi are i.i.d. random variables on Rd+ with density ρ,
cd
then n−1/d Un → d u almost surely in L∞ (Rd ) as n → ∞ where u is the unique
nondecreasing viscosity solution of the problem
(
ux1 ux2 . . . uxd = ρ in Rd+
(2.1)
u=0 on ∂Rd+ .

and cd is a constant. This result shows that nondominated sorting of large

3
datasets can be approximated by solving a partial differential equation numeri-
cally. This idea was developed further by Calder et al. in [18] which proposed a
fast approximate algorithm for nondominated sorting called PDE-based ranking
based on estimating ρ from the data and solving the PDE numerically. It was
shown in [18] that PDE-based ranking is considerably faster than nondominated
sorting in low dimensions while maintaining high sorting accuracy.
In this paper, we establish rates of convergence for the continuum limit of
nondominated sorting. This is an important result in applications of PDE-based
ranking [1, 40] where it is important to consider how the error scales with the
size n of the dataset. The problem has several features that complicate the proof.
The Hamiltonian H(p) = p1 . . . pd is not coercive, which is the standard property
required to prove Lipschitz regularity of viscosity solutions [4]. If one takes a dth
root of the PDE to replace the Hamiltonian with H(p) = (p1 . . . pd )1/d , we obtain
a concave H at the cost of losing local Lipschitz regularity. In particular, solutions
of (2.1) are neither semiconcave nor semiconvex in general. Furthermore, u is
not Lipschitz due to the lack of boundary smoothness and coercivity. Our proof
approximates the solution to (2.1) by the solution to the auxiliary problem
(
ux1 ux2 . . . uxd = ρ in ΩR
(2.2)
u=0 on ∂R Ω,

where
n o
ΩR = x ∈ [0, 1]d : (x1 . . . xd )1/d > R

and
n o
∂R Ω = x ∈ [0, 1]d : (x1 . . . xd )1/d = R ,

effectively rounding off the corner singularity. We prove a one-sided conver-


gence rate for the auxiliary problem restricted to the box [0, 1]d by using an
inf-convolution to approximate u by semiconcave functions that solve (2.2) ap-
proximately. We apply the convergence rates for the longest chain problem
proved in [7] to obtain rates that hold with high probability on a collection
of simplices, which are essentially cell-problems from homogenization theory.
The remainder of the argument builds off of the proof in [11] but keeping track
quantitatively of all sources of error.
We also prove new semiconvexity results on the corner domain Rd+ , which

4
bound the blowup rate of the semiconvexity constant of u at the boundary. The
semiconvex regularity of u on the auxiliary domain enables us to avoid use of
a sup-convolution approximation for this direction, bolstering the convergence
rate. The proof uses a closed-form asymptotic expansion to obtain a smooth
approximate solution to (2.2) near the boundary, and computes semiconvexity
estimates for the approximation analytically. We believe this argument is new,
as the typical arguments found in the literature for proving semiconvexity near
the boundary proceed by means of vanishing viscosity [4]. We also extend the
semiconvexity estimates to the full domain with a doubling variables argument
which is new and simpler compared to the standard tripling variables approach
[4].
Our convergence rate proof is at a high level similar to the proofs of con-
vergence rates for stochastic homogenization of Hamilton-Jacobi equations in
[3], which uses Azuma’s inequality to control fluctuations and a doubling vari-
ables argument to prove convergence rates. Apart from the viscosity solution
theory, the main machinery we use is the convergence rate for the longest chain
problem proved by Bollobás and Brightwell in [7], whose proof is also based
on Azuma’s inequality. As our PDE is first-order, our approach uses the inf
convolution instead of a doubling variables argument which leads to an equivalent
but somewhat simplified argument.
As described in [11], this continuum limit result can be viewed in the context
of stochastic homogenization of Hamilton-Jacobi equations. One may interpret
Un as the discontinuous viscosity solution of
 n
X
in Rd+

U

n,x1 Un,x2 . . . Un,xd = δ Xj
j=1 (2.3)

∂Rd+ .

Un = 0 on

The sense in which Un solves the PDE (2.3) is not obvious. By mollifying Un ,
one obtains a sequence Unε of approximate solutions to (2.3). It can be shown
that Unε converges pointwise to CUn as ε → 0 where the constant C depends on
the choice of mollification kernel.
Our proof techniques may also be applicable to several other related problems
in the literature. The convex peeling problem studied in [20] bears many
similarities to our problem, and similar ideas may give convergence rates for
the convex peeling problem, provided the solutions of the continuum PDE are
sufficiently smooth. The papers [58, 13] introduce numerical methods for the

5
PDE (2.1) and prove convergence rates. Our semiconvex regularity results could
be used to improve the convergence rates of the above papers to O(h) in one
direction. We also suspect the methods used in our paper could be adapted to
the directed last passage percolation problem studied in [10].
We also briefly note that nondominated sorting is equivalent to the problem
of finding the length of a longest chain (i.e. a totally ordered subset) in a partially
ordered set, which is a well-studied problem in the combinatorics and probability
literature [59, 33, 8, 25]. In particular, Un (x) is equal to the length of a longest
chain in X consisting of points less than x in the partial order.

3 Main results
We begin by introducing definitions and notation that will be used throughout
the paper. In our results and proofs, we let C denote a constant that does not
depend on any other quantity, and Ck denotes a constant dependent on the
variable k. Be advised that the precise value of constants may change from line
to line. To simplify the proofs, we model the data using a Poisson point process.
Given a nonnegative function ρ ∈ L1 (Rd ), we let Xρ denote a Poisson point
process with intensity function ρ. Hence, Xρ is a random, at most countable
subset of Rd with the property that for every Borel measurable set A ⊂ Rd , the
R
cardinality N (A) of A ∩ Xρ is a Poisson random variable with mean A ρ dx.
Given two measurable disjoint sets A, B ⊂ Rd , the random variables N (A) and
N (B) are independent. Further properties of Poisson processes can be found in
[45]. In this paper we consider a Poisson point process Xnρ where n ∈ N and
ρ ∈ C(Rd ) satisfies

0 < ρmin ≤ ρ ≤ ρmax . (3.1)

We denote by Cρ a constant depending on ρmin and ρmax , and possibly also


on [ρ]C 0,1 (Rd ) and D2 ρ L∞ (Rd )
in those results that assume ρ ∈ C 0,1 (Rd ) and
ρ ∈ C 2 (Rd ) respectively. Given R ≥ 0, we define
n o
ΩR = x ∈ [0, 1]d : (x1 . . . xd )1/d > R

and n o
∂R Ω = x ∈ [0, 1]d : (x1 . . . xd )1/d = R .

6
Let u denote the viscosity solution of (2.2). Given a finite set A ⊂ Rd , let ℓ(A)
denote the length of the longest chain in the set A. Given a domain Ω ⊂ Rd ,
the Pareto-depth function Un in Ω is defined by

Un (x) = ℓ([0, x] ∩ Xnρ ∩ Ω)

where [0, x] := [0, x1 ] × . . . × [0, xd ]. The scaled Pareto-depth function is defined


by

d −1/d
un (x) = n Un (x) (3.2)
cd

where cd is the constant defined by

cd = lim n−1/d ℓ([0, 1]d ∩ Xn ) a.s. (3.3)


n→∞

d −1/d
For a subset S ⊂ Rd , we write un (S) to denote cd n ℓ(S ∩ Xnρ ). This
particular scaling is chosen to eliminate the constant on the right-hand side of
(3.6).

Remark 3.1. There are several results regarding the constant cd that have been
established in the literature. Hammersley showed that limn→∞ n−1/2 ℓ(Xn ∩
[0, 1]2 ) = c a.s. and conjectured that c = 2 in [33]. In subsequent works, Logan
and Shepp [50] and Vershik and Kerov [61] showed that c ≥ 2 and c ≤ 2. The
exact values of cd for d > 2 remain unknown, although Bollobás and Winkler
showed in [8] that
d2
1 1
 ≤ cd < e for all d ≥ 1.
d! Γ
d
d

Now we state our main convergence rate results. Let un denote the Pareto-
depth function in ΩR and let u denote the viscosity solution of (2.2).

Theorem 3.2. Given k ≥ 1 and ρ ∈ C 0,1 (Rd ) satisfying (3.1), the following
statements hold.
2
(a) Given R ∈ (0, 1], and n1/d ≥ Cd,k,ρ R−(2d −d−1)
we have

1/2 !
log2 n

−2d2 +d+1
P sup (un − u) > Cd,ρ,k R 4 n−1/4d ≤ Cd,ρ,k R−Cd n−k .
ΩR log log n

(b) Assume ρ ∈ C 2 (Rd+ ). Then there exists Cd > 0 such that for all R ∈ (0, Cd )

7
2
and n1/d ≥ Cd,k,ρ R−2d +4d−4
log(n)C we have

2/3 !
log2 n

−2d2 +d
−1/3d
P sup (u − un ) > Cd,ρ,k R 3 n ≤ Cd,ρ,k R−Cd n−k .
ΩR log log n

Theorem 3.2 depends on the parameters R and k. Although R is a constant


in this result, we have stated the explicit dependence on R as it is required to
extend the rates from ΩR to Ω0 . Observe that the convergence rates become
trivial as R → 0+ , as the proof makes use of estimates for the Lipschitz constant
and semiconvexity constant of u on ΩR that blowup as R tends to 0. Also
observe that the convergence rate in (b) is sharper than in (a), thanks to our
use of the semiconvexity estimates established in Theorem 3.5. Let v denote the
solution of (
vx1 vx2 . . . vxd = ρ in Ω0
(3.4)
v = 0 on ∂0 Ω

In the next result we state our convergence rates on Ω0 = [0, 1]d which are proved
by using u as an approximation to v and setting R equal to the optimal value
that balances the approximation error term with the convergence rate. Let

d −1/d
vn (x) = n ℓ(Xnρ ∩ [0, x]) (3.5)
cd

denote the scaled Pareto-depth function in [0, 1]d .

Theorem 3.3. Given k ≥ 1 and ρ ∈ C 0,1 (Rd ) satisfying (3.1), the following
statements hold.

(a) For all n > Cd,k,ρ we have


 
−1/(2d3 +d2 +5d+1)
P sup (vn − v) > n ≤ Cd,k,ρ n−k .
Ω0

(b) Assume ρ ∈ C 2 (Rd ). Then for all n > Cd,k,ρ we have


 
3 2
P sup (v − vn ) > n−1/(2d −d +3d+1) ≤ Cd,k,ρ n−k .
Ω0

Observe that the rate in (b) is sharper thanks to the sharper one-sided rate
in Theorem 3.3. We do not know for certain whether the rates in Theorem 3.2
and 3.3 are optimal, although it seems likely that they are not.

8
These results also extend to the situation when Xnρ = {Y1 , . . . , Yn } where
Y1 , . . . , Yn are i.i.d. random variables with continuous density ρ. The analogues
of Theorems 3.2 and 3.3 in this context follow from Lemma 7.3.

Corollary 3.4. Let Y1 , . . . , Yn be i.i.d. random variables with density ρ. Then


Theorems 3.2 and 3.3 hold when Xnρ = {Y1 , . . . , Yn }.

A key step in our proof of the sharper one-sided rate is a quantitative estimate
on the semiconvexity constant of u. As the Hamiltonian H(p) = (p1 . . . pd )1/d
is concave, the results on semiconvex viscosity solutions in [4] would lead us to
suspect that u is semiconvex. However, from an examination of the function
w(x) = d(x1 . . . xd )1/d that solves (2.1) with ρ = 1, it is evident that solutions of
(2.1) on Rd+ need not be semiconvex nor semiconcave due to the gradient singu-
larity on the coordinate axes. This motivates us to determine the rate at which
the semiconvexity constant of u on ΩR blows up as R → 0+ . For proving these
results it is convenient to raise the PDE to the 1/d power and pose the Dirichlet

problem on the more general domains ΩR,M = x ∈ [0, M ]d : (x1 . . . xd )1/d > R

with boundary conditions on ∂R,M Ω = x ∈ [0, M ]d : (x1 . . . xd )1/d = R . Let
R > 0, M ≥ 1, and let u denote the solution of
(
(ux1 ux2 . . . uxd )1/d = ρ1/d in ΩR,M
(3.6)
u=0 on ∂R,M Ω.

Our result on semiconvexity bounds the rate at which the semiconvexity constant
of u on ΩR,M blows up as R tends to 0. This result enables us to establish the
sharpened one-sided convergence rates in case (b) of Theorem 3.2 and Theorem
3.3.

Theorem 3.5. Let u denote the solution to (3.6). Then there exists a constant
Cρ > 0 such that for all R ≤ Cρ , x ∈ ΩR,M , and h ∈ Rd such that x ± h ∈ ΩR,M
we have

u(x + h) − 2u(x) + u(x − h) ≥ −Cd,M,ρ R−2d+1

where
 
−(d−1)/d
Cd,M,ρ = Cd (1 + M ρ1/d
max ) ρ min ∥Dρ∥ ∞
L (∂R,M Ω) M 2d−1
+ ρ 1/d
max M 2d−2
.

9
3.1 Definition of Viscosity Solution
Here we briefly state for reference the definition of viscosity solution for the
first-order equation
H(Du, u, x) = 0 in Γ, (3.7)

where H is continuous and Γ ⊂ Rd .

Definition 3.6 (Viscosity solution). We say that u ∈ USC(Γ) is a viscosity


subsolution of (3.7) if for every x ∈ Γ and every φ ∈ C ∞ (Rd ) such that u − φ
has a local maximum at x with respect to Γ we have

H(Dφ(x), φ(x), x) ≤ 0.

We will often say that u ∈ USC(Γ) is a viscosity solution of H ≤ 0 in Γ when


u is a viscosity subsolution of (3.7). Similarly, we say that u ∈ LSC(Γ) is a
viscosity supersolution of (3.7) if for every x ∈ Γ and every φ ∈ C ∞ (Rd ) such
that u − φ has a local minimum at x with respect to Γ we have

H(Dφ(x), φ(x), x) ≥ 0.

We also say that u ∈ LSC(Γ) is a viscosity solution of H ≥ 0 in Γ when u is a


viscosity supersolution of (3.7). Finally, we say that u is a viscosity solution of
(3.7) if u is both a viscosity subsolution and a viscosity supersolution.

3.2 Outline of Proof of Theorem 3.2


Here, we present a high-level outline of the proof of Theorem 3.2. The proof
follows a stochastic homogenization argument, similar to [3], but with different
ingredients. We first study the asymptotics of the longest chain in orthogonal
simplices of the form

Sy,p := x ∈ (−∞, y]d : 1 + (x − y) · p−1 ≥ 0



(3.8)

and
Sp := x ∈ (−∞, 0]d : 1 + x · p−1 ≥ 0

(3.9)

where p ∈ (0, ∞)d and p−1 = (p−1 −1


1 , . . . , pd ). The set Sp is an orthogonal simplex
with side length pi in the ith coordinate direction. The measure of Sp is given

10
(a) An ⊂ A (b) B ⊂ Bn

Figure 1: Illustration of the sets An , Bn and A, B used in the proof outline, and
the viscosity touching property that An ⊂ A and B ⊂ Bn .

by
p1 · · · pd
|Sp | = . (3.10)
dd
The sets A and B in Figure 1 show examples of orthogonal simplices. The longest
chain in an orthogonal simplex, un (Sp ), can be thought of as a cell problem
from homogenization, in the sense that it is a simpler local problem, whose
solution allows us to prove our main results. The value of p will turn out to be
proportional to the gradient Du of the continuum limit u, as in homogenization,
and the cell problem exactly describes the local behaviour of un for large n.
For simplicity, we will take the intensity ρ to be constant on Rd throughout
the rest of this section, and we denote the constant value by ρ > 0. The extension
to nonconstant intensities follows by approximating ρ from above and below by
constant intensities on the simplices Sp , which are vanishingly small as n → ∞.
It was shown in [11] that

lim un (Sp ) = dρ1/d |Sp |1/d , (3.11)


n→∞

with probability one. This is proved by reducing to the unit cell problem un (S1 )
using dilation invariance of un and the sets Sp . In particular, if Φ : Rn → Rn is
any dilation (i.e., Φx = (a1 x1 , . . . , ad xd ) for ai > 0), then we have ΦSp = SΦ−1 p
and so
ℓ(Xnρ ∩ Sp ) = ℓ(ΦXnρ ∩ ΦSp ) ∼ ℓ(Xn|Φ|−1 ρ ∩ ΦSp ).

11
We then choose Φ so that ΦSp = S1 , that is ai = pi , to obtain

un (Sp ) ∼ (p1 · · · pd )−1/d un|Φ|−1 (S1 )

This shows that the scaling limit (3.11) for a general simplex Sp follows directly
follow from one for the unit simplex S1 .
The first ingredient in our proof is a convergence rate, with high probability,
for the cell problem (3.11). In particular, in Theorem 5.6 we improve (3.11) by
showing that
 
un (Sp ) = dρ1/d |Sp |1/d + O n−1/2d |Sp |1/2d (3.12)

with high probability, up to logarithmic factors. The proof is based on the


concentration of measure results in [7] for the length of a longest chain in boxes,
which uses Azuma’s inequality. We adapt these results to the simplices Sp .
To illustrate how the cell problem (3.12) is used to prove our main results,
let x0 ∈ ΩR and define

An = {x ∈ [0, x0 ] ∩ ΩR : un (x0 ) − un (x) ≤ ε} .

Basically by definition we have un (An ) ≈ ε (see Lemma 6.2 for a precise statement
of this). Now, if un is well approximated by a smooth function u, then we can
Taylor expand u to show that An ≈ x0 + Sp where p−1 = ε−1 Du(x0 ). In this
case we use (3.10) to obtain

p1 · · · pd εd
|Sp | = d
= d ,
d d ux1 (x0 ) · · · uxd (x0 )

and hence

ερ1/d
ε ≈ un (An ) ≈ un (Sp ) ≈ dρ1/d |Sp |1/d = . (3.13)
(ux1 (x0 ) · · · uxd (x0 ))1/d

Rearranging we obtain the Hamilton-Jacobi equation (2.1).


The proof of our main result involves keeping track of the error estimate from
the cell problem convergence rate (3.12) in the argument above, as well as using
the viscosity solution framework to push the Taylor expansion arguments above
onto smooth test functions. For this, we use the fact that the set An satisfies a

12
viscosity property. That is, if un − φ attains its maximum at x0 , then

un (x) − φ(x) ≤ un (x0 ) − φ(x0 ),

and so
φ(x0 ) − φ(x) ≤ un (x0 ) − un (x).

It follows that An ⊂ A, where A is the corresponding set defined for the test
function φ, given by

A = {x ∈ [0, x0 ] ∩ ΩR : φ(x0 ) − φ(x) ≤ ε} .

The inclusion An ⊂ A is depicted in Figure 1 (A). Then (3.13) is modified


by inserting the inequality un (An ) ≤ un (A), and then approximating A by a
simplex, which is possible when the test function φ is sufficiently smooth. This
gives a rate in only one direction, since we get a subsolution condition, and so
we also need to consider touching from below; that is, examining the minimum
value of u − φ. In this case the inequalities are reversed and we have A ⊂ An .
This inclusion is depicted in Figure 1 (B), where we write B and Bn in place of A
and An (different names are used in the proofs of our main results for technical
reasons).
The convergence rates in our main results are then proved using a maximum
principle argument, which examines the maximum of un − u (and subsequently
u − un ) and uses the viscosity properties and cell problem convergence rates
described above. In the case where u is a non-smooth viscosity solution, one
typically replaces u by smoother approximate sub-and super-solutions obtained
by inf- and sup-convolutions, to allow for Taylor expansions (equivalently we
may use a doubling variables argument). Another main contribution of our
paper is a new semiconvexity estimate for the solution u of (2.2) on the rounded
off domain ΩR (see Theorem 3.5). We sharply characterize the blow-up of the
gradient and semiconvexity constant of u as R → 0. This allows us to avoid
the sup-convolution and use φ = u directly in the maximum principle argument
when bounding u − un . This leads to the better O(n−1/3d ) convergence rate in
Theorem 3.2 (b). In the other direction, when bounding un − u, we would need
semiconcavity of u, which is not true in general, so we use the inf-convolution to
produce a semiconcave approximation, leading to the worse O(n−1/4d ) rate in
Theorem 3.2 (a). As R → 0 and we approach the corner singularity problem (2.1),
we lose control of the semiconvexity estimates, and the solution of (2.1) is neither

13
semiconvex nor semiconcave in general. We thus obtain the rates in Theorem 3.3
by approximation to the rounded off case (2.2), leading to substantially worse
rates of convergence in the presence of the corner singularity in (2.1).
While our proof techniques are at a high level similar to [3], the details
are substantially different and cannot be compared directly. We can, however,
compare the final convergence rates we obtain. In [3] the authors consider
stochastic homogenization of Hamilton-Jacobi equations of the form
 x 
uεt + H Duε , , ω = 0 in Rd × (0, ∞),
ε

and obtain quantitative homogenization rates of


   
−O ε1/8−δ ≤ uε − u ≤ O ε1/5−δ , (3.14)

for any δ > 0, in the setting where H is level-set convex and coercive in the
gradient. Our Hamiltonian H(p) = p1 · · · pd is level-set concave (and in fact we
can write it as H(p) = (p1 · · · pd )1/d to obtain a concave Hamiltonian), but it is
not coercive. Recalling that nondominated sorting can be viewed as a stochastic
Hamilton-Jacobi equation (2.3) with rapidly oscillating terms on the order of
ε = n−1/d we see that our rates in Theorem 3.2 yield

−C1 ε1/3 ≤ un − u ≤ C2 ε1/4 ,

up to logarthmic factors, which are substantially sharper than (3.14).

3.3 Outline of Paper


Here we outline the remainder of the paper. In Section 4 we establish a maximum
principle and Lipschitz estimates for (2.2) that are used throughout the paper.
In Section 5 we extend the work of Bollobás and Winkler in [7] and establish
rates of convergence for the longest chain problem in simplices. In Section 6 we
establish our principle lemma for proving Theorem 3.2, which shows for a strict
supersolution φ of (2.2) that the maximum of un − φ occurs near the boundary
with high probability. In Section 7 we present the proofs of Theorems 3.2 and
3.3, and in Section 8 we present the proof of Theorem 3.5.

14
4 Maximum Principle and Lipschitz estimates
In this section we establish fundamental results regarding the PDE (2.1) that
are used throughout the paper. First we show that if u satisfies ux1 . . . uxd = ρ
on a domain Ω, then a closely related PDE is also automatically satisfied at
certain boundary points. Given M > 0, let Ω ⊂ [0, M ]d and define

∂ ∗ Ω = y ∈ Ω : yi = M for some i and ∃ε > 0 such that B(y, ε) ∩ [0, M )d ⊂ Ω .




(4.1)

Lemma 4.1. Given Ω ⊂ [0, M ]d , let ∂ ∗ Ω be given by (4.1) and let ρ ∈ C(Ω)
satisfy (3.1). Then the following statements hold.

(a) Suppose that u satisfies ux1 ux2 . . . uxd ≤ ρ in Ω. Then u satisfies

d
Y
(uxi )+ ≤ ρ in Ω ∪ ∂ ∗ Ω.
i=1

(b) Suppose that u satisfies ux1 ux2 . . . uxd ≥ ρ in Ω and u is nondecreasing in


each coordinate. Then u satisfies

d
Y
(uxi )+ ≥ ρ in Ω ∪ ∂ ∗ Ω.
i=1

Proof. To prove (b), let ψ ∈ C ∞ (Rd ) such that u − ψ has a local minimum at
x0 ∈ Ω, and show that ψxi (x0 ) ≥ 0. For y in a neighborhood of x0 we have

u(x0 ) − u(y) ≤ ψ(x0 ) − ψ(y).

Since u is nondecreasing in each coordinate, when h > 0 is sufficiently small we


have

u(x0 ) − u(x0 − hei ) ψ(x0 ) − ψ(x0 − hei )


0≤ ≤ .
h h

Hence, ψxi (x0 ) ≥ 0. Now let x0 ∈ Ω ∪ ∂ ∗ Ω and let φ ∈ C ∞ (Rd ) such that u − φ
has a local minimum at x0 . Without loss of generality, we may assume that
u − φ attains a strict global minimum at x0 . If x0 ∈ Ω, then φxi (x0 ) ≥ 0, and

15
we have
d
Y d
Y
(φxi (x0 ))+ = φxi (x0 ) ≥ ρ(x0 ).
i=1 i=1

Pd
If x0 ∈ ∂ ∗ Ω, let φε (x) = φ(x)−ε 1
i=1 M −xi , and we claim that u−φε attains its
minimum over Ω in Ω ∩ [0, M ) . To prove this, let yk ∈ [0, M )d be a minimizing
d

sequence. Replacing yk with a convergent subsequence, we may assume that


yk → y ∈ [0, M ]d . It is clear from the definition of φε that we must have
y ∈ [0, M )d . There exist sequences εk → 0 and xk → x0 such that εk > 0 and
u − φεk has a local minimum at xk ∈ [0, M )d . Since x0 ∈ ∂ ∗ Ω, there exists
N > 0 such that xk ∈ Ω for k > N . Hence, for all k > N we have

d  
Y εk
φxj (xk ) − ≥ ρ(xk ).
j=1
(M − xk,j )2

Since u−φεk has a local minimum at x0 and u is nondecreasing in each coordinate,


εk
we have (φεk )xj = φxj (xk ) − (M −xk,j )2 ≥ 0 for j = 1, . . . , d. Hence for k > N
we have
d
Y d
Y
 
φxj (xk ) +
= φxj (xk ) ≥ ρ(xk ).
j=1 j=1

Qd 
Letting k → ∞, we have j=1 φxj (x0 ) + ≥ ρ(x0 ). To prove (a), let x0 ∈
Ω ∪ ∂ ∗ Ω, and let φ ∈ C ∞ (Rd ) such that u − φ has a local maximum at x0 . If
φxi (x0 ) ≤ 0 for some 1 ≤ i ≤ d, then we have

d
Y 
0= φxj (x0 ) +
≤ ρ(x0 ).
j=1

Assume that φxi (x0 ) > 0 for each i. If x0 ∈ Ω, then we have

d
Y d
Y 
φxj (x0 ) = φxj (x0 ) + ≤ ρ(x0 ).
j=1 j=1

If x0 ∈ ∂ ∗ Ω. Without loss of generality, we may assume that u − φ attains


Pd 1
a strict global maximum at x0 . Let φε (x) = φ(x) + ε i=1 M −x i
. As in (a),
u − φε attains its maximum over Ω in Ω ∩ [0, M )d . Hence, there exist sequences
εk → 0 and xk → x0 , xk ∈ [0, M ]d such that u − φεk has a local maximum at

16
xk ∈ [0, M )d . Then when k is large we have xk ∈ Ω, hence

d d  
Y Y εk
φxj (xk )+ ≤ φxj (xk ) + ≤ ρ(xk ).
j=1 j=1
(M − xk,j )2

Since φ is smooth, we have φxj (xk ) > 0 for k sufficiently large. Letting k → ∞,
we have
d
Y 
φxj (x0 )+ ≤ ρ(x0 ).
j=1

Next we establish that subsolutions and supersolutions of (2.1) may be


perturbed to strict subsolutions and supersolutions. Let L and H be given by
L(p) = (p1 . . . pd )1/d , H(p) = p1 . . . pd .

Proposition 4.2. Given V ⊆ Rd , let ρ ∈ C(V ) satisfy (3.1). Then the following
statements hold.

(a1) Let u satisfy L(Du) ≥ ρ on V . Then for all λ > 0 we have

L(D((1 + λ)u)) ≥ ρ + ρmin λ on V.

(b1) Let u satisfy L(Du) ≤ ρ on V . Then for all λ ∈ (0, 1] we have

L(D((1 − λ)u)) ≤ ρ − ρmin λ on V.

(a2) Let u satisfy H(Du) ≥ ρ on V . Then for all λ > 0 we have

H(D((1 + λ)u)) ≥ ρ + dρmin λ on V.

(b2) Let u satisfy H(Du) ≤ ρ on V . Then for all λ ∈ (0, 1] we have

H(D((1 − λ)u)) ≤ ρ − ρmin λ on V.

Proof. To prove (a1), let x ∈ V . Then there exists φ ∈ C ∞ (Rd ) such that u − φ
has a local minimum at x. Consequently, (1+λ)u−(1+λ)φ has a local minimum
at x, so
L((1 + λ)Dφ(x)) = (1 + λ)f (x) ≥ f (x) + λ(inf f )
V

17
and the statement follows. The proofs of the other statements are very similar
and omitted here, making use of the inequalities (1 + λ)d ≥ (1 + dλ) in (a2) and
(1 − λ)d ≤ (1 − λ) in (b2).

Now we establish a comparison principle for the PDE (2.1).

Theorem 4.3. Given Ω ⊂ [0, M ]d , let Γ ⊂ Ω be a closed set such that Ω ⊆


Γ ∪ Ω ∪ ∂ ∗ Ω, where ∂ ∗ Ω is given by (4.1). Suppose that ρ ∈ C(Ω) satisfies (3.1),
and u ∈ C(Ω) and v ∈ C(Ω) satisfy
(
ux1 ux2 . . . uxd ≤ ρ in Ω
(4.2)
u = g1 on Γ,

and (
vx1 vx2 . . . vxd ≥ ρ in Ω
(4.3)
v = g2 on Γ,
respectively. Assume that v is nondecreasing in each coordinate and g1 ≤ g2 on
Γ. Then u ≤ v on Ω.

Proof. Given λ ∈ (0, 1), set vλ = (1 + λ)v, and suppose for contradiction that
supΩ (u − vλ ) > 0. Let

α 2
Φ(x, y) = u(x) − vλ (y) − |x − y| .
2

Then Φ ∈ C(Ω × Ω) and Ω is bounded. Hence Φ attains its maximum at some


(xα , yα ) ∈ Ω × Ω. Then we have

Φ(xα , yα ) ≥ sup(u − vλ ) > 0.


As u and −vλ are bounded above on Ω we have

2 C
|xα − yα | ≤ . (4.4)
α

As (xα , yα ) ∈ Ω × Ω, there exists a sequence αn → ∞ such that {xαn } and


{yαn } are convergent sequences. Letting xn = xαn and yn = yαn , we have
(xn , yn ) → (x0 , y0 ). By (4.4) we have x0 = y0 . By continuity of Φ we have

lim Φ(xn , yn ) = u(x0 ) − vλ (x0 ).


n→∞

18
We cannot have x0 ∈ Γ, since u(x0 ) − vλ (x0 ) > 0 and u ≤ vλ on Γ. Hence,
x0 ∈ Ω ∪ ∂ ∗ Ω. As u − vλ ≤ 0 on Γ and (u − vλ )(x0 ) > 0, by continuity of
u − vλ there exists N > 0 such that (xn , yn ) ∈ (Ω ∪ ∂ ∗ Ω) × (Ω ∪ ∂ ∗ Ω) for n > N .
2 2
Let φ(x) = α2n |x − yn | and ψ(x) = − α2n |xn − y| . Then u − φ has a local
maximum at xn and vλ − ψ has a local minimum at yn . Setting H(p) = p1 . . . pd ,
Proposition 4.2 gives that H(Dvλ ) ≥ ρ + δ on Ω, where δ = λρmin > 0. By

Lemma 4.1 we have H(Dv λ ) ≥ ρ + δ and H(Du) ≤ ρ on Ω ∪ ∂ Ω. Thus, we have
e e

H(Dφ(x n )) = H(αn (xn − yn )) ≤ ρ(xn )


e e

and

H(Dψ(y n )) = H(αn (xn − yn )) ≥ ρ(yn ) + δ.


e e

Hence, ρ(xn ) − ρ(yn ) ≥ δ > 0, and this gives a contradiction as n → ∞. We


conclude that u ≤ vλ = (1 + λ)v on Ω. Letting λ → 0+ completes the proof.

Now we establish estimates on [u]C 0,1 (ΩR,M ) with respect to R and M . To


this end, we state the following theorem, proven in [17, Theorem 2]. Let
g : Rd → [0, ∞) be bounded and Borel measurable, and let

U (x) = sup J(γ), (4.5)


γ∈A
γ(1)≤x

where Z 1
1/d
J(γ) = g(γ(t))1/d [γ1′ (t) . . . γd′ (t)] dt,
0

and

A = γ ∈ C 1 ([0, 1]; Rd ) : γj′ (t) ≥ 0 for j = 1, . . . d .




Then the value function U satisfies

Ux Ux . . . Ux = 1 g

1 2 d
in Rd+
dd (4.6)
U =0 on ∂Rd+ .

When g = ρ1ΩR,M and u is given by (3.6), Theorem 4.3 implies that d1 U = u in


ΩR,M .

19
Theorem 4.4. Given ρ ∈ C 0,1 (Rd ) satisfying (3.1), let u denote the solution to
(3.6). Then we have

[u]C 0,1 (ΩR,M ) ≤ Cd M d−1 R−(d−1) ρ1/d .


C 0,1 (ΩR,M )

Proof. Let U be given by (4.5) where g = ρ1ΩR,M . In light of the preceding


discussion, it is enough to show that

[U ]C 0,1 (ΩR,M ) ≤ Cd M d−1 R−(d−1) ρ1/d .


C 0,1 (ΩR,M )

Let x ∈ ΩR,M and set f = ρ1/d . It suffices to show that

U (x + hei ) − U (x) ≤ Cd hM d−1 R−(d−1) ∥f ∥C 0,1 (ΩR,M ) (4.7)

when 1 ≤ i ≤ d and h > 0 is sufficiently small. Given ε > 0, let γ ∈ A such


that γ(1) ≤ x + hei and U (x + hei ) ≤ J(γ) + ε. Without loss of generality, we

may assume that γ(0) ∈ ∂R,M Ω,
 γ(1) = x + hei , and  γi (t) > 0 for t ∈ [0, 1]
and 1 ≤ i ≤ d. Let Φ(z) = z1 , . . . xix+h
i
zi , . . . zd and set γ = Φ(γ). By
xi
construction, γ satisfies γ(1) = x, γ i (t) = xi +h γi (t), and γ j (t) = γj (t) for j ̸= i.
As J(γ) ≤ u(x), we have

U (x + hei ) − U (x) ≤ J(γ) − J(γ) + ε.

hzi
A simple calculation shows that |z − Φ(z)| ≤ h+xi ≤ Ch for z ∈ [0, 2x]d . Hence,
we have
hγi (t) hxi
|γ(t) − γ(t)| ≤ ≤ . (4.8)
h + xi h + xi
The above gives us

hγi (t) 1
|f (γ(t)) − f (γ(t))| ≤ [f ]C 0,1 (ΩR,M ) ≤ hxi [f ]C 0,1 (ΩR,M ) ≤ h[f ]C 0,1 (ΩR,M ) .
h + xi xi
(4.9)

20
We have
Z 1  
1/d 1/d
J(γ) − J(γ) = f (γ(t)) [γ1′ (t) · · · γd′ (t)]
− f (γ(t)) [γ ′1 (t) · · · γ ′d (t)] dt
0
1  ′ 1/d !
γ 1 (t) . . . γ ′d (t)
Z
1/d
= f (γ(t)) − f (γ(t)) ′ [γ1′ (t) . . . γd′ (t)] dt
0 γ1 (t) · · · γd′ (t)
1/d !
Z 1 
xi 1/d
= f (γ(t)) − f (γ(t)) [γ1′ (t) · · · γd′ (t)] dt.
0 xi + h

Furthermore,
 1/d ! 1/d !

xi xi
f (γ(t)) − f (γ(t)) = (f (γ(t)) − f (γ(t))) + 1 − f (γ(t))
xi + h xi + h
 
xi
≤ (f (γ(t)) − f (γ(t))) + 1 − f (γ(t))
xi + h
h
≤ (f (γ(t)) − f (γ(t))) + f (γ(t))
xi
h
≤ h ∥f ∥C 0,1 (ΩR,M ) + ∥f ∥C 0,1 (ΩR,M )
xi
= h(1 + x−1
i ) ∥f ∥C 0,1 (ΩR,M )

We conclude that
Z 1
1/d
J(γ) − J(γ) ≤ h(1 + x−1
i ) ∥f ∥C 0,1 (ΩR,M ) [γ1′ (t) . . . γd′ (t)]
0
d
Y
≤ h(1 + x−1
i ) ∥f ∥C 0,1 (ΩR,M ) (γj (1) − γj (0))1/d
j=1

(x1 . . . xd )1/d
≤ 2h ∥f ∥C 0,1 (ΩR,M ) max .
x∈ΩR,M xi

(x1 ...xd )1/d


Observe that the maximum value of xi over ΩR,M is attained when
d −(d−1)
xj = M for j ̸= i and xi = R M , we have

(x1 . . . xd )1/d
max = R−(d−1) M d−1 .
x∈ΩR,M xi

We conclude that for every ε > 0 and 1 ≤ i ≤ d we have

U (x + hei ) − U (x) ≤ ε + ChM d−1 R−(d−1) ∥f ∥C 0,1 (ΩR,M ) .

21
and consequently that

[U ]C 0,1 (ΩR,M ) ≤ CM d−1 R−(d−1) ∥f ∥C 0,1 (ΩR,M ) .

5 Rates of convergence for the longest chain


problem
As discussed in Section 2, nondominated sorting is equivalent to the problem
of finding the length of a longest chain in a Poisson point process with respect
to the coordinatewise partial order. Given n ∈ N and ρ ∈ C(R) satisfying (3.1),
let Xnρ denote a Poisson point process on Rd with intensity nρ. Given a finite
set A ⊂ Rd , let ℓ(A) denote the length of the longest chain in the set A. Then
the Pareto-depth function Un in Rd is given by Un (x) = ℓ([0, x] ∩ Xnρ ) where
[0, x] = [0, x1 ] × . . . × [0, xd ]. The scaled Pareto-depth function is defined by
d −1/d
un (x) = cd n Un (x) where cd is given by (3.3). When S ⊂ Rd is bounded
d −1/d
and Borel measurable, we write un (S) to denote cd n ℓ(S ∩ Xnρ ) and |S| to
denote its Lebesgue measure. When ρ is constant and S is a simplex of the form

x ∈ (−∞, 0]d : 1 + x · q ≥ 0 with q ∈ Rd+ , one can show that

1/d
lim un (S) = dρ1/d |S| a.s.
n→∞

In this section, we establish explicit rates of convergence for the length of the
longest chain in rectangles and simplices. We begin by stating a simple property
of Poisson processes, whose proof is found in [45].

Lemma 5.1. Let Xρ be a Poisson process on Rd with intensity function ρ, where


ρ ∈ L1loc (Rd ) is nonnegative. Then given g1 , g2 ∈ L1loc (Rd ) with 0 ≤ g1 ≤ ρ ≤ g2 ,
there exist Poisson point processes Xg1 and Xg2 such that Xg1 ⊆ Xρ ⊆ Xg2 .

The following result is proved in [7] by Bollobás and Brightwell.

Theorem 5.2. Let Xn be a Poisson point process on [0, 1]d with intensity n.
Then there exists a constant Cd such that for all n > Cd we have
 
d d 1/2d log n
P Un ([0, 1] ) − EUn ([0, 1] ) > Cd tn ≤ 4t2 exp(−t2 )
log log n

22
n1/2d
for all t satisfying 2 < t < log log n . Furthermore,

log3/2 n
cd n1/d ≥ EUn ([0, 1]d ) ≥ cd n1/d − Cd n1/2d
log log n

where cd is given by

cd = lim n−1/d ℓ(Xn ∩ [0, 1]d ) a.s.


n→∞

Next, we extend Theorem 5.2 to a Poisson process with intensity nρ where


ρ > 0 is a constant.

Theorem 5.3. Let Xnρ be a Poisson point process on [0, 1]d where ρ > 0 is a
(ρn)1/2d
constant. Then for all n > Cd ρ−1 and all t satisfying 2 < t < log log ρn we have
 
log ρn
P un ([0, 1]d ) − Eun ([0, 1]d ) > Cd n−1/2d ρ1/2d t ≤ 4t2 exp(−t2 ).
log log ρn

Furthermore,

log3/2 ρn
dρ1/d ≥ Eun ([0, 1]d ) ≥ dρ1/d − Cd ρ1/2d n−1/2d .
log log ρn

Proof. Replace n by ρn in Theorem 5.2. Also note that

lim n−1/d ℓ(Xρn ∩ [0, 1]d ) = cd ρ1/d a.s.


n→∞

Next we establish rates of convergence for the longest chain problem in a


rectangular box.

Theorem 5.4. Let Xnρ denote a Poisson point process on Rd with intensity nρ
where ρ ∈ C(Rd ) satisfies (3.1). Given x, y ∈ Rd with xi < yi for i = 1, . . . , d,

let R = [x, y] := w ∈ Rd : xi ≤ wi ≤ yi for i = 1, . . . , d .
−1
(a) For all n > Cd (supR ρ)−1 |R| and t satisfying

1/2d 1
Cd < t < Cd n1/2d (sup ρ)1/2d |R|
R log log n(supR ρ) |R|

23
we have
!
1/d 1/d −1/2d 1/2d 1/2d log3/2 (n |R| (supR ρ))
P un (R) − d(sup ρ) |R| > Cd tn (sup ρ) |R|
R R log log(n |R| (supR ρ))

≤ 4t2 exp(−t2 ).

−1
(b) For all n > Cd (inf R ρ)−1 |R| and t satisfying

1/2d 1
log1/2 (n(inf ρ) |R|) < t < (inf ρ)1/2d n1/2d |R|
R R log log n(inf R ρ) |R|

we have
 
1/d 1/2d log(n |R| (inf R ρ))
P un (R) − d(inf ρ)1/d |R| < −Cd tn−1/2d (inf ρ)1/2d |R|
R R log log(n |R| (inf R ρ))
≤ 4t2 exp(−t2 ).

Proof. We shall prove only (a), as the proof of (b) is similar. Without loss of
generality we may take R to be the rectangle [0, y] with y ∈ Rd+ . By Lemma 5.1
there exists a Poisson process X n ⊃ Xnρ on Rd with intensity function nρ where
ρ = (supR ρ) 1R + ρ1Rd \R . Given A ⊂ Rd , let un (A) = n−1/d ℓ(A ∩ X n ) and
 
set Φ(x) = xy11 , . . . xydd . Then Yn := Φ(X n ) is a Poisson process with intensity
n |R| ρ and un (R) = n−1/d ℓ([0, 1]d ∩ Yn ). Let E be the event that

1/2d log n(supR ρ) |R|


|un (R) − Eun (R)| ≤ Cd (sup ρ)1/2d t |R| n−1/2d
R log log n(supR ρ) |R|

and

1/d 1/d log3/2 (n(supR ρ) |R|)


0 ≥ Eun (R) − d(sup ρ)1/d |R| ≥ −Cd (sup ρ)1/2d t |R| n−1/2d .
R R log log(n(supR ρ) |R|)

|R|1/2d (supR ρ)1/2d n1/2d −1


where 2 < t < log log(supR ρ)|R|n , n > (supR ρ)−1 |R| , and the constant
Cd is as in Theorem 5.3. By Theorem 5.3 we have P (E) ≥ 1 − 4t2 exp(−t2 ).

24
Assume that E holds for fixed choices of t and n. As un (R) ≤ un (R), we have

1/d 1/d
un (R) − d(sup ρ)1/d |R| ≤ un (R) − Cd (sup ρ)1/d |R|
R R
1/d
≤ |un (R) − Eun (R)| + (Eun (R) − Cd (sup ρ)1/d |R| )
R

1/2d log3/2 n(supR ρ) |R|


≤ Cd t(sup ρ)1/2d n−1/2d |R| .
R log log n(supR ρ) |R|

Now we extend the preceding result to establish rates of convergence for the
longest chain in an orthogonal simplex of the form Sy,q as in (3.8). The lower
one-sided rate is easily attained taking the rectangle R ⊂ S with largest volume
and applying Theorem 5.4. To prove the upper one-sided rate, we embed S into
a finite union of rectangles and apply the union bound. The following result
verifies the existence of a suitable collection of rectangles.

Lemma 5.5. Given y ∈ Rd and q ∈ Rd+ , let Sy,q be as in (3.8). Given ε > 0,
there exists a finite collection R of rectangles covering Sy,q satisfying

1/d 1/d
Cd ε |Sy,q | ≤ |R| ≤ |Sy,q | + Cd ε |Sy,q | for all R ∈ R, (5.1)

and

dist(z, Sy,q ) ≤ Cd ε |Sy,q | for all R ∈ R and z ∈ R, (5.2)

and

|R| ≤ Cd ε−(d−1) . (5.3)

Proof. Without loss of generality, we may take y = 0 and prove the statements
for the simplex Sq := S0,q . Letting 1 denote the ones vector, we first prove

the statements for the simplex S := x ∈ [0, ∞)d : x · 1 ≤ 1 , and then obtain

the general result via reflection and scaling. Let P = x ∈ [0, ∞)d : x · 1 = 1 .

Fix x0 ∈ P , and let R = [0, x + ε1] : x ∈ P ∩ (x0 + εZd ) . It is clear from the
definition of R that (5.2) and (5.3) hold. By the Arithmetic-Geometric Mean
Inequality, for all x ∈ P we have

d
Y 1
ε≤ (xi + ε)1/d ≤ +ε
i=1
d

25
S
and it follows that (5.1) holds. To show that S ⊂ R∈R R, let y ∈ P . Then
∗ d
there exists y ∈ (x0 + εZ ) such that |yi − ≤ ε for 1 ≤ i ≤ d. Hence, yi∗ |

S
y ∈ [0, y ], and it follows that S ⊂ R∈R R. This concludes the proof for
S, and we now leverage this result to prove the statement for the simplex
−xd
Sq = x ∈ (−∞, 0]d : 1 + x · q ≥ 0 . Let Φ(x) = ( −x

q1 , . . . qd ), so Φ(S) = Sq .
1

Applying the proven result for S, there exists a collection of rectangles R1


1/d
covering S and satisfying |R1 | ≤ Cd ε−(d−1) , |R| ≤ 1
d + ε for all R ∈ R1 , and
dist(z, S1 ) ≤ Cd ε for z ∈ R and R ∈ R. Let R = {Φ(R) : R ∈ R1 }, and we
verify that R satisfies the required properties. We have |R| = |R1 |, so (5.3)
holds. To see that (5.1) holds, observe that

1/d 1/d 1/d


|Φ(R)| = d |R| |Sq |

and

1/d 1/d 1/d 1/d 1/d


dε |Sq | ≤ d |R| |Sq | ≤ |Sq | + dε |Sq |

To show (5.2), let z ∈ R ∈ R1 . Then there exists y ∈ S such that |z − y| ≤ Cd ε.


1/d
Hence, we have |Φ(z) − Φ(y)| ≤ Cd |Sq | ε, and (5.2) follows.

Now we prove our main result of this section.

Theorem 5.6. Let ρ ∈ C 0,1 (Rd ) satisfy (3.1), and given y ∈ Rd and q ∈ Rd+
1/d
let Sy,q be given by (3.8). Assume that d |Sy,q | ≤ 1. Then for all k ≥ 1 and
−1 2d
n > Cd,ρ,k |Sy,q | log(n) we have
!
1/d −1/2d 1/2d log2 n
P un (Sy,q ) − d(sup ρ) 1/d
|Sy,q | > Cd,k,ρ n |Sy,q | ≤ Cd,k n−k
Sy,q log log n

and

log2 n
 
1/d −1/2d 1/2d
P un (Sy,q ) − d( inf ρ) 1/d
|Sy,q | < −Cd,k,ρ n |Sy,q | ≤ Cd,k n−k .
Sy,q log log n

Proof. We present the proof of the first statement only, as the proof of the
second is similar and simpler. Without loss of generality, we may take y = 0
and prove the result for the simplex Sq = S0,q . We first prove the result for the

simplex S := x ∈ (−∞, 0]d : 1 + x · 1 ≥ 0 , and then obtain the general result
via a scaling argument. Given ε > 0, we may apply Lemma 5.5 to conclude there
exists a collection R of rectangular boxes covering S such that |R| ≤ Cd ε−(d−1) ,

26
1/d
Cd ε ≤ |R| ≤ d1 + Cd ε for each R ∈ R, and dist(y, S) ≤ ε for y ∈ R ∈ R.
Set R = R∈R R. As ρ ∈ C 0,1 (Rd ), we have supR ρ ≤ supS ρ + [ρ]C 0,1 (Rd ) ε for
S

y ∈ R. Let K = supS ρ + [ρ]C 0,1 (Rd ) ε . By Lemma 5.1 there exists a Poisson
process X n ⊃ Xnρ on Rd with intensity function nρ where ρ = K 1R + ρ1Rd \R .
d −1/d
Given A ⊂ Rd , let un (A) = cd n ℓ(A ∩ X n ). As X n ⊃ Xnρ and S ⊂ R, we
have un (S) ≤ un (S) ≤ maxR∈R un (R). Let ER be the event that

1/d
un (R) − dK 1/d |R| ≤ Cd νK 1/2d (5.4)

3/2
log (nK) −1
where ν = tn−1/2d log log(nK) . For any ε < Cd , n > Cd |R| K −1 , and 2 < t <
(nK)1/2d
log log nK , we have P(ER ) ≥ 1 − 4t2 exp(−t2 ) by Theorem 5.4. Letting E be the
event that (5.4) holds for all R ∈ R, we have P(E) ≥ 1 − Cd ε−d+1 t2 exp(−t2 )
p
by the union bound. Given k ≥ 1, set t = Ck log(n) for a constant C
chosen large enough so n1/2 n−Ck ≤ n−k , and let ε = tn−1/2d . Observe that
the hypotheses of Theorem 5.4 are satisfied when nK > Cd,k log(n)2d , so we
have P(E) ≥ 1 − Cd,k n1/2 n−Ck ≥ 1 − Cd,k n−k . If E holds, then using S ⊂ R,
1/d 1
|R| ≤ d + Cd ε, and (5.4), we have

un (S) ≤ un (S) ≤ max un (R) (5.5)


R∈R

≤ K 1/d + Cd ν(K 1/2d + K 1/d ). (5.6)



Now we obtain the stated result for the simplex Sq = x ∈ (−∞, 0]d : 1 + q · x ≥ 0 .
Let Φ(x) = (q1 x1 , . . . qd xd ), and observe that Φ(Sq ) = S. Then Yn := Φ(Xnρ )
−1
ρ where ρe = |det Φ|
is a Poisson point process of intensity ne nρ = dd nρ |Sq |.
As Φ preserves the length of chains, we have ℓ(Xnρ ∩ S q ) = ℓ(Yn ∩ S). Let ε =
√ 
n−1/2d Ck log n, K

e = supS ρe + [e ρ]C 0,1 (Rd ) ε , K = supSq ρ + [ρ]C 0,1 (Rd ) ε ,
ε log3/2 nK
and ν = e . Let E be the event that
e
log log nK

e 1/d + Cd ν(K
un (Sq ) ≤ K e 1/2d + K
e 1/d ). (5.7)

27
e = dd |Sq | K and d |Sq |1/d ≤ 1, we have
If E holds, then using K

e 1/d + Cd ν(K
un (Sq ) ≤ K e 1/2d + K
e 1/d )
1/d 1/2d
= d |Sq | K 1/d + Cd ν |Sq | (1 + K 1/2d )
1/d 1/2d
≤ d |Sq | (sup ρ)1/d + Cd,ρ ν |Sq |
S

1/d 1/2d log2 n


≤ d |Sq | (sup ρ)1/d + Cd,k,ρ n−1/2d |Sq | .
S log log n

Using our result for S and ℓ(Xnρ ∩ Sq ) = ℓ(Yn ∩ S), we have P(E) ≥ 1 − Cd,k n−k
e −1 log(n)2d .
for n > Cd,k K

6 Lemmas For Proving Convergence Rates


In this section we establish our primary lemma for proving Theorems 3.2 and 3.3.
In particular, we prove a maximum principle type result that shows that if φ is a
semiconcave (see Definition 8.1), strictly increasing supersolution of (2.2), then
the maximum of un − φ occurs in a neighborhood of ∂R Ω with high probability.
An analogous result holds for φ − un when φ is a semiconvex, strictly increasing
subsolution of (2.2). In proving Lemma 6.1 we shall employ Lemmas 6.2 and
6.3, whose proofs are presented after Lemma 6.1.

Lemma 6.1. Let ρ ∈ C 0,1 (Rd ) satisfy (3.1). Given R ∈ (0, 1), let φ ∈ C(ΩR )
be a non-negative function such that 0 < γ ≤ φxi ≤ γ holds in the viscosity sense
on ΩR for some constants γ ≥ 1 and γ ≤ 1. Given α ≥ 1, k ≥ 1, δ ∈ (0, 1),
2
log (n)
and ε ∈ (0, 1), let Rn = γ 1/2 n−1/2d ε−1/2 log log(n) . Then there exist constants
Cd ≤ 1 and Cd,k,ρ ≥ 1 such that when ε ≤ Cd min α−1 γ 2 γ −1 log(n)−2 , Rd−1 δ
and 1 ≥ λ ≥ Cd,k,ρ (Rn + γ −2 αε) the following statements hold.

(a) Suppose φ is semiconcave on ΩR with semiconcavity constant α ≥ 1 and


satisfies H(Dφ) ≥ ρ on ΩR+δ . Assume further that

sup (un − (1 + λ)φ) ≥ 2ε.


ΩR

Then there exists a constant Cd,ρ,k ≥ 1 such that for all n > 1 with

28
n1/d > Cd,ρ,k ε−1 γ log(n)2 we have
!
P sup (un − (1 + λ)φ) = sup (un − (1 + λ)φ) ≤ Cd,k ε−6d n−k .
ΩR ΩR+δ

(b) Suppose φ is semiconvex on ΩR with semiconvexity constant α ≥ 1 and


satisfies H(Dφ) ≤ ρ on ΩR+δ . Then there exists a constant Cd,ρ,k ≥ 1
such that for all n > 1 with n1/d > Cd,ρ,k α−1 γ 2 ε−2 we have
!
P sup ((1 − λ)φ − un ) = sup ((1 − λ)φ − un ) ≤ Cd,k ε−6d n−k .
ΩR ΩR+δ

Proof. First we introduce the notation used throughout the proof. Let ΩεR = ΩR ∩
d−1/2 ε3 Zd and we define a collection of simplices S = Sx,s : x ∈ ΩεR+δ , s ∈ Γε


where Sx,s is given by (3.8) and

Γε = s ∈ ε3 Zd : (4γ)−1 ε ≤ si ≤ 4γ −1 ε .


Lemma 6.3 implies that S ⊆ ΩR for all S ∈ S when ε ≤ Cd δRd−1 . Let ES be


the event that
1/2d
1/d Cd,k,ρ |S| log2 n
un (S) − d(sup ρ1/d ) |S| ≤ (6.1)
S n1/2d log log n

−1
and let E be the event that ES holds for all S ∈ S. For any n > Cd,k,ρ |S| log(n)2d ,
we have P(ES ) ≥ 1−Cd,k n−k by Theorem 5.6. By choice of Γε , we have (4γ)−1 ε ≤
1/d
d |S| ≤ 4γ −1 ε for S ∈ S. As we assume that n1/d > Cd,ρ,k ε−1 γ log(n)2 ,
−1
we have n > Cd,k,ρ |S| log(n)2d for all S ∈ S. As |Γε | ≤ Cd ε−3d and
|ΩεR | ≤ Cd ε−3d , we have |S| ≤ Cd ε−6d . By the union bound, we have P (E) ≥
1 − Cd,k ε−6d n−k . For the remainder of the proof we assume that E holds. Then
for each S ∈ S we have
1/2d
1/d Cd,k,ρ |S| log2 (n)
un (S) ≤ d |S| (sup ρ1/d ) +
S n1/2d log log(n)
1/d
≤ d |S| (sup ρ1/d )(1 + Cd,k,ρ Rn )
S

29
2
log (n)
where Rn = γ 1/2 ε−1/2 n−1/2d log log(n) . Let w = (1 + λ)φ, and we show that

sup (un − w) ̸= sup (un − w) .


ΩR ΩR+δ

Assume for contradiction that

sup (un − w) = sup (un − w) .


ΩR ΩR+δ

Since φ is semiconcave with semiconcavity constant α, at almost every x ∈ ΩR


we have that φ is twice differentiable at x with D2 φ(x) ≤ αI. Hence, there is
xn ∈ ΩR+δ such that D2 φ(xn ) ≤ αI and un (xn ) − w(xn ) + ε3 ≥ supΩR (un − w).
Therefore, we have

un (y) − un (xn ) ≤ w(y) − w(xn ) + ε3 for all y ∈ ΩR . (6.2)

As we assume φ ≥ 0 and supΩR (un − w) ≥ 2ε, we have un (xn ) ≥ ε. We define


the sets

An = {x ∈ [0, xn ] ∩ ΩR : un (xn ) − un (x) ≤ ε}


A = x ∈ [0, xn ] ∩ ΩR : w(xn ) − w(x) ≤ ε + ε3 .


By Lemma 6.2 we have un (An ) ≥ ε. By (6.2) we have An ⊂ A. As wxi =


(1 + λ)φxi ≥ φxi , we have A ⊂ B(xn , 2γ −1 ε). By Taylor expansion, we have

2
w(x) ≤ w(xn ) + Dw(xn ) · (x − xn ) + (1 + λ)α |x − xn |
2
≤ w(xn ) + Dw(xn ) · (x − xn ) + 2α |x − xn | .

Hence, when x ∈ A we have

2
−ε − ε3 ≤ w(x) − w(xn ) ≤ Dw(xn ) · (x − xn ) + 2α |x − xn |
≤ Dw(xn ) · (x − xn ) + Cαγ −2 ε2 .

Cαε2 γ −2 +ε
Letting pi = vxi (xn ) , we have A ⊆ Sxn ,p . We show there exist y ∈ ΩεR and
q ∈ Γε such that Sxn ,p ⊆ Sy,q . Let y ∈ ΩεR such that xn ≤ y and |y − xn | ≤ ε3 .
Letting 1 denote the all ones vector, we have Sxn ,p ⊆ Sy,p+ε3 1 . We may choose
q so p + 2ε3 1 ≥ q ≥ p + ε3 1 provided that γ −1 ε ≤ pi ≤ 4γ −1 ε − 2ε3 for each i,
which holds when ε ≤ Cα−1 γ 2 and α ≥ 1. Then Sy,p+ε3 1 ⊂ Sy,q . Using that

30
1/d
(q1 . . . qd )1/d = d |Sy,q | , we have

ε ≤ un (An )
≤ un (Sy,q )
≤ (1 + Cd,k,ρ Rn )(sup ρ)1/d (q1 . . . qd )1/d
Sy,q

d
( )!1/d
1/d
Y Cαε2 γ −2 + ε
≤ (1 + Cd,k,ρ Rn )(sup ρ) + 2ε3
Sy,q i=1
wxi (xn )
d
( )!1/d
Y Cαε2 γ −2 + ε + 2γε3
≤ (1 + Cd,k,ρ Rn )(sup ρ)1/d .
Sy,q i=1
wxi (xn )

As our assumptions on ε imply ε ≤ Cγ −1 , we have γε3 ≤ Cαε2 γ −2 . Hence we


have

1/d
≤ (1 + Cd,k,ρ Rn )(sup ρ)1/d 1 + Cγ −2 αε .

(wx1 (xn ) · . . . wxd (xn ))
Sy,q

As we assume ρ ∈ C 0,1 (Rd ) and ρ ≥ ρmin > 0, we have ρ1/d ∈ C 0,1 (ΩR ), hence
supSy,q ρ1/d ≤ ρ(xn )1/d +Cd,ρ εγ −1 . By our assumption n1/d > Cd,k,ρ ε−1 γ log(n)2 ,
we have Rn ≤ Cd,k,ρ . As ε ≤ Cα−1 γ 2 γ −1 , we have γ −2 αε ≤ C. Hence we have

1/d
(wx1 (xn ) · . . . wxd (xn )) ≤ ρ(xn )1/d + Cd,k,ρ (Rn + γ −2 αε).

Applying Proposition 4.2 with λ ≥ Cd,k,ρ (Rn +γ −2 αε), we obtain a contradiction.


We conclude that we must have

sup (un − w) ̸= sup (un − w) .


ΩR ΩR+δ

We now prove (b). Let ΩεR , Γε , and S be as in (a), and let ES be the event that

1/2d
1/d Cd,k,ρ |S| log2 n
un (S) − d(inf ρ1/d ) |S| ≥− 1/2d
(6.3)
S n log log n
−1
and let E be the event that ES holds for all S ∈ S. For any n > Cd,k,ρ |S| log(n)2d ,
we have P(ES ) ≥ 1 − Cd,k n−k by Theorem 5.6. By choice of Γε , we have
1/d
γ −1 ε ≤ d |S| ≤ 4γ −1 ε for S ∈ S. As we assume that n1/d > Cd,ρ,k ε−2 α−1 γ 2

31
and ε ≤ γ 2 γ −1 log(n)−2 , for all S ∈ S we have

−1/d
n1/d > Cd,ρ,k ε−2 α−1 γ 2 ≥ ε−1 γ log(n)2 ≥ Cd,k,ρ |S| log(n)2 .

By the union bound, we have P (E) ≥ 1 − Cd,k ε−6d n−k . For the remainder of
the proof we assume that E holds. Then for each S ∈ S we have

1/2d
1/d Cd,k,ρ |S| log2 (n) 1/d
un (S) ≥ d |S| (inf ρ1/d ) − ≥ d |S| (inf ρ1/d )(1 − Cd,k,ρ Rn )
S n1/2d log log(n) S

2
log (n)
where Rn = γ 1/2 ε−1/2 n−1/2d log log(n) . Let w = (1 − λ)φ, and we show that

sup (w − un ) ̸= sup (w − un ) .
ΩR ΩR+δ

Assume for contradiction that

sup (w − un ) = sup (w − un ) .
ΩR ΩR+δ

Since φ is semiconvex with semiconvexity constant α, at almost every x ∈ ΩR


we have that φ is twice differentiable at x with D2 φ(x) ≥ −αI. Hence, there
d −1/d
is xn ∈ ΩR+δ such that D2 φ(xn ) ≥ −αI and w(xn ) − un (xn ) + cd n ≥
supΩR (w − un ). Therefore, we have

d −1/d
un (xn ) − un (x) ≤ w(xn ) − w(x) + n for all y ∈ ΩR . (6.4)
cd

We define the sets


 
d
Bn = x ∈ [0, xn ] ∩ ΩR : un (xn ) − un (x) ≤ ε − n−1/d
cd
 
2d −1/d
B = x ∈ [0, xn ] ∩ ΩR : w(xn ) − w(x) ≤ ε − n .
cd

By (6.4) we have B ⊂ Bn . By Lemma 6.2 we have un (Bn ) ≤ ε. Letting


ε−Kαε2 γ −2
pi = vxi (xn ) , we have Sxn ,p ⊂ B(xn , γ −1 ε). For x ∈ Sxn ,p we have 1 + (x −
−1
xn ) · p ≥ 0, hence

Dw(xn ) · (xn − x) ≤ ε − Kαε2 γ −2 . (6.5)

32
Using (6.5), |x − xn | ≤ γ −1 ε, and n−1/d ≤ αγ −2 ε2 , we have

2
w(xn ) − w(x) ≤ Dw(xn ) · (xn − x) + α |x − xn |
≤ ε − Kαγ −2 ε2 + αγ −2 ε2
≤ ε − (K − 1)n−1/d .

2d
Choosing K so (K − 1) = cd , we have x ∈ B. Hence, Sxn ,p ⊂ B. We now
show there exist y ∈ ΩεR and q ∈ Γε such that Sy,q ⊆ Sxn ,p . Let y ∈ ΩεR such
that xn ≥ y and |y − xn | ≤ ε3 . Letting 1 denote the ones vector, we have
Sy,p−ε3 1 ⊆ Sxn ,p . We may choose q so p − ε3 1 ≥ q ≥ p − 2ε3 1 provided that
γ −1 ε + 2ε3 ≤ pi ≤ 4γ −1 ε for each i, which holds when ε ≤ 81 α−1 γ 2 γ −1 . Then
1/d
Sy,q ⊂ Sxn ,p . Using that (q1 . . . qd )1/d = d |Sy,q | , we have

ε ≥ un (Bn )
≥ un (Sy,q )
≥ (1 − Cd,k,ρ Rn )( inf ρ)1/d (q1 . . . qd )1/d
Sy,q

d
( )!1/d
ε − Cd αε2 γ −2
1/d
Y
3
≥ (1 − Cd,k,ρ Rn )( inf ρ) − 2ε
Sy,q
i=1
wxi (xn )
d
( )!1/d
1/d
Y ε − Cd αε2 γ −2 − 2γε3
≥ (1 − Cd Rn )( inf ρ) .
Sy,q
i=1
wxi (xn )

As our assumptions on ε imply ε ≤ γ −1 , we have γε3 ≤ αε2 γ −2 . Hence we have

1/d
≥ (1 − Cd,k,ρ Rn )( inf ρ)1/d 1 − Cd γ −2 αε .

(wx1 (xn ) · . . . wxd (xn ))
Sy,q

As we assume ρ ∈ C 0,1 (Rd ) and ρ ≥ ρmin > 0, we have (inf Sy,q ρ1/d ) ≥ ρ(xn )1/d −
Cd,ρ εγ −1 . Hence we have

1/d
≥ ρ(xn )1/d (1 − Cd,k,ρ Rn )(1 − Cd,ρ εγ −1 ) 1 − Cd γ −2 αε

(wx1 (xn ) · . . . wxd (xn ))
≥ ρ(xn )1/d − Cd,k,ρ (Rn + γ −2 αε).

Applying Proposition 4.2 with λ = Cd,k,ρ (Rn + γ −2 αε) we obtain a contradiction.


We conclude that we must have

sup (w − un ) ̸= sup (w − un ) .
ΩR ΩR+δ

33
Lemma 6.2. Given x0 ∈ ΩR and ε > 0, let

An = {x ∈ [0, x0 ] ∩ ΩR : un (x0 ) − un (x) ≤ ε} .

Then un (An ) ≤ ε + cdd n−1/d . If un (xn ) ≥ ε, then additionally we have un (An ) ≥


ε.
d −1/d
Proof. First we show that un (An ) ≤ ε + cd n . Let C1 be a longest chain in
An ∩ Xnρ , and let x be the coordinatewise minimal element of C1 . Let C2 be a
longest chain in [0, x] ∩ ΩR ∩ Xnρ . Concatenating these chains, we have

un (x0 ) ≥ un (C1 ) + un (C2 ) − un (C1 ∩ C2 )


d −1/d
= un (An ) + un (x) − n .
cd

Hence,

d −1/d d
un (An ) ≤ un (x0 ) − un (x) + n ≤ ε + n−1/d .
cd cd

To prove that un (An ) ≥ ε, we must first establish a useful property of longest


k
chains. Given S ⊂ Rd and a longest chain {yi }i=1 in S, we claim that

ℓ([0, yj ] ∩ S ∩ Xnρ ) = j. (6.6)

j
It is clear that ℓ([0, yj ] ∩ S ∩ Xnρ ) ≥ j, as {yi }i=1 is a chain of length j in
[0, yj ] ∩ S ∩ Xnρ . If ℓ([0, yj ] ∩ S ∩ Xnρ ) ≥ j + 1, then there exists a chain
j+1 k
{zi }i=1 in [0, yj ] ∩ S ∩ Xnρ . Concatenating this chain with {yi }i=j+1 yields
k
a chain of length k + 1, contradicting maximality of {yi }i=1 . Now we prove
k
the main result. Let {xi }i=1 be a longest chain in [0, x0 ] ∩ ΩR ∩ Xnρ , and let
j k
j = min {i ∈ {1, . . . , k} : xi ∈ An }. Letting C1 = {xi }i=1 and C2 = {xi }i=j , we
have

un (x0 ) = un (C1 ) + un (C2 ) − un (C1 ∩ C2 ) (6.7)


d
≤ un (xj ) + un (An ) − n−1/d . (6.8)
cd
d −1/d
By (6.6) we have un (xi ) = cd n i for 1 ≤ i ≤ k. If j > 1, then using

34
un (x0 ) − un (xj−1 ) > ε and un (x0 ) ≥ ε we have

d −1/d
un (x0 ) − un (xj ) ≥ ε − n . (6.9)
cd

If j = 1, then (6.9) is an immediate consequence of un (x0 ) ≥ ε. Combining (6.9)


with (6.7), we see that

d −1/d
un (An ) ≥ un (x0 ) − un (x) + n ≥ ε.
cd

Lemma 6.3. Let 0 < δ ≤ R. Then given y ∈ ΩR+δ , there is a constant Cd > 0
such that

dist(y, ∂R Ω) ≥ Cd δRd−1 .

Proof. Let x ∈ ∂R Ω such that |x − y| = dist(y, ∂R Ω). Then

{(1 − t)x + ty : t ∈ (0, 1)} ⊂ ΩR \ ΩR+δ .

Letting f (x) = (x1 . . . xd )1/d , we have

(x1 . . . xd )1/d
fxi (x) = .
dxi

Using that xi ≥ Rd for x ∈ ΩR and that f (x) ≤ R + δ for x ∈ ΩR \ ΩR+δ , we


have

1 −d 2
∥Df ∥L∞ (ΩR \ΩR+δ ) ≤ R (R + δ) ≤ R−d+1 .
d d

Hence, we have

2 −d+1
δ ≤ f (y) − f (x) ≤ R |y − x| .
d

Next, we establish estimates on un , vn , u, and v that hold with high proba-


bility in a thin tube around ∂R Ω and ∂0 Ω. To do so, we cover the neighborhood
with rectangular boxes and apply Theorem 5.4. In the following Lemma we
establish the existence of a suitable collection of rectangles.

35
Lemma 6.4. The following statements hold.

(a) Given ε > 0, there exists a collection R of rectangles covering Ω0 \ Ωε such


1/d 2
that |E| = Cε for each E ∈ R and |R| ≤ Cd ε−d .

(b) Given R ∈ (0, 12 ] and 0 < ε ≤ 12 R, there exists a collection R of rectangles


such that for each x, y ∈ ΩR \ ΩR+ε with x < y and [x, y] ⊂ ΩR \ ΩR+ε ,
there exists E ∈ R such that [x, y] ⊆ E. Furthermore, we have Cd Rd−1 ε ≤
1/d
|E| ≤ Cd R−1 ε for each E ∈ R and |R| ≤ Cd R−2d(d−1) ε−2d .

Proof. We give the proof of (b) only, as the proof of (a) is similar but simpler.
Given h ∈ (0, 1), let
n o
B = x ∈ [0, 2]d : R − ε ≤ (x1 . . . xd )1/d ≤ R + 2ε

and set Bh = B ∩ hZd . We define

R = {[x1 , x2 ] : (x1 , x2 ) ∈ Bh × Bh and x1 < x2 and [x1 , x2 ] ⊆ B}

and we show that R has the desired properties when h is appropriately chosen.
Let z, y ∈ ΩR \ΩR+ε with z < y and [z, y] ⊆ ΩR \ΩR+ε . Let w(x) = (x1 . . . xd )1/d .
(x1 ...xd )1/d
Then wxi = dxi and if x ∈ B and ε ≤ 12 R we have

(x1 . . . xd )1/d
Cd R ≤ ≤ Cd R−(d−1) . (6.10)
xi

Hence we have |z − y| ≤ Cd R−1 ε. Letting 1 denote the all ones vector, by (6.10)
we have w(y + h1) ≤ (R + ε) + Cd R−(d−1) h and w(z − h1) ≥ R − ε − Cd R−(d−1) h.
Then there exists a constant Cd such that [z − 2h1, y + 2h1] ⊂ B when h =
Cd Rd−1 ε. Letting h = Cd Rd−1 ε, there exist y + ∈ Bh and y − ∈ Bh such that
1/d
y + 2h1 ≥ yi+ ≥ y + h1 and z − 2h1 ≤ yi− ≤ z − h1. Letting A = |[y − , y + ]| ,
we have A ≥ Cd Rd−1 ε and

d
1X +
A≤ (y − yi− ) ≤ Cd (|z − y| + h) ≤ Cd R−1 ε.
d i=1 i

Furthermore, we have

|R| ≤ Cd h−2d ≤ Cd R−2d(d−1) ε−2d .

36
Lemma 6.5. Let ρ ∈ C(Rd ) satisfy (3.1), and let un and vn be given by (3.2).
2
log (n)
and (3.5). Given k ≥ 1, ε ∈ (0, 1), and n > 0, let Rn = n−1/2d ε−1/2 log log(n) .
Then the following statements hold.
2
(a) For all n > Cd,k ρ−1
max R
−d −d
ε log(n)4d we have
!
2
P sup vn > Cd,k,ρ ε ≤ Cd,k ε−d n−k .
Ω0 \Ωε

(b) Given R ∈ (0, 1], ε ∈ (0, R], and n > Cd,k ρ−1
max R
−2d −d
ε log(n)4d we have
!
2
P sup un > Cd,k,ρ ε ≤ Cd,k ε−3d n−k .
ΩR \ΩR+ε

Proof. We will prove (b) only, as the proof of (a) is similar. Applying Lemma
6.4 (b) with α = εR, there exists a collection R of rectangles such that
1/d
supΩR \ΩR+ε un ≤ maxR∈R un (R), Cd Rd ε ≤ |E| ≤ Cd ε for E ∈ R, and
2
|R| ≤ Cd R−2d(d−1) α−2d ≤ Cd ε−3d . Given R ∈ R, let ER be the event that

log2 n
un (R) ≤ Cd (sup ρ)1/d ε + Cd k 1/2 n−1/2d ε1/2 .
R log log n
2
If ER holds, then our assumption n > Cd,k ρ−1
max R
−d −d
ε log(n)4d implies that
−1
un (R) ≤ Cd,k,ρ ε. Furthermore, we have n > Cd (supE ρ)−1 |E| for each E ∈ R
(n|E|)1/2d
p p
and k log(n) ≤ log log n|E| . Applying Theorem 5.4 with t = 2k log(n), we
have P(ER ) ≥ 1 − Cd kn−k for n > Cd . Letting E be the event that ER holds
2
for all R ∈ R, we have P(E) ≥ 1 − Cd,k ε−3d n−k by the union bound. As
supΩR \ΩR+ε un ≤ maxR∈R un (R), the result follows.

Lemma 6.6. Given R > 0 and ρ ∈ C(Rd+ ) satisfying (3.1), let u and v denote
the solutions of (2.2) and (3.4) respectively. Then the following statements hold.

(a) For all α > 0 we have

sup u ≤ dρ1/d
max α.
ΩR \ΩR+α

(b) We have

sup(v − 1ΩR u) ≤ dρ1/d


max R.
Ω0

37
1/d 1/d
Proof. (a) Let w(x) = dρmax (x1 . . . xd )1/d − dρmax R. Then w = 0 on ∂R Ω and

(wx1 . . . wxd )1/d = ρ1/d


max ≥ (ux1 . . . uxd )
1/d
on ΩR .

Observe that ΩR = ∂R Ω ∪ ΩR ∪ ∂ ∗ ΩR , so Ω = ΩR and Γ = ∂R Ω satisfy the


hypotheses of Theorem 4.3. As u satisfies (2.2), u = 0 on ∂R Ω. By Theorem
4.3, we have u ≤ w on ΩR . Furthermore, when x ∈ ΩR \ ΩR+α , we have
1/d
u(x) ≤ w(x) ≤ dρmax α.
1/d
(b) Let w(x) = dρmax (x1 . . . xd )1/d . Then w = 0 on ∂0 Ω and (wx1 . . . wxd )1/d =
1/d 1/d
ρmax . By Theorem 4.3 we have v ≤ w in Ω0 , hence v ≤ dρmax R in Ω0 \ ΩR .
1/d
Since u = 0 on ∂R Ω, we have v ≤ u + dρmax R on ∂R Ω. Furthermore, we have
(ux1 . . . uxd )1/d = (vx1 . . . vxd )1/d in ΩR . Hence, we may apply Theorem 4.3 to
1/d
conclude that v ≤ u + dρmax R within ΩR , and the result follows.

7 Proofs of Convergence Rates


7.1 Proof of Theorem 3.2
In this section we supply the proof of Theorem 3.2. Roughly speaking, the proof
approximates u with a semiconcave function φ, uses Theorem 6.1 to show that
the maximum of un − φ and φ − un is likely to be attained in a neighborhood of
the boundary, and then applies the boundary estimates established in Lemma
6.5. To produce a suitable approximation φ, we use inf and sup convolutions,
whose properties are summarized in the following lemma. While the proofs of
similar results can be found in standard references on viscosity solutions such as
[4, 43, 23], the estimates are not stated in the sharp form required here. The
proofs of the following statements can instead be found in [9].

Lemma 7.1. Given an open and bound set Ω ⊂ Rd , u ∈ C 0,1 (Ω), and α > 0,
consider the inf-convolution defined by
 
1 2
uα (x) = inf u(y) + |x − y| .
y∈Ω 2α

Then the following properties hold:

(a) uα is semiconcave with semiconcavity constant α−1 .

38
(b) There exists a constant C > 0 such that

∥u − uα ∥L∞ (Ω) ≤ Cα[u]2C 0,1 (Ω) .

(c) There exists a constant C > 0 such that

[uα ]C 0,1 (Ω) ≤ C[u]C 0,1 (Ω) .

0,1
(d) Assume f ∈ C 0,1 (Ω) and H ∈ Cloc (Rd ). If

H(Du) ≥ f in Ω

then

H(Duα ) ≥ f − Cα[u]C 0,1 (Ω) [f ]C 0,1 (Ω) in Mα (u)

where
(   )
1 2
Mα (u) = x ∈ Ω : arg min u(y) + |x − y| ∩ Ω ̸= ∅ .
y∈Ω 2α

n o
1 2
(e) Let yα ∈ arg miny∈Ω u(y) + 2α |x − y| . Then there exists a constant
C > 0 such that

|x − yα | ≤ Cα[u]C 0,1 (Ω) .

We are now in a position to tackle the proof of Theorem 3.2. We prove


the sharper rate in (b) by leveraging the semiconvexity estimates established in
Section 8.
1 2
Proof of Theorem 3.2. (a) Given α > 0, let uα (x) = inf y∈ΩR {u(y)+ 2α |x − y| }.
By Theorem 4.4 we have

[u]C 0,1 (ΩR ) ≤ Cd,ρ R−d+1 . (7.1)

By Lemma 7.1, uα is semiconcave on ΩR with semiconcavity constant α−1 and

39
we have

∥u − uα ∥L∞ (ΩR ) ≤ α[u]2C 0,1 (ΩR ) ≤ CαR−2(d−1) (7.2)

and
d
Y
(uα )xi ≥ ρ − Cd,ρ α[u]C 0,1 (ΩR ) ≥ ρ − Cd,ρ R−d+1 α in Mα (u). (7.3)
i=1

Let x ∈ ΩR such that dist(x, ∂R Ω) > Cα[u]C 0,1 (ΩR ) where C is the con-
n Lemma 6.1 (e), oand we show that x ∈ Mα (u). Given yα ∈
stant given in
1 2
arg miny∈ΩR u(y) + 2α |x − y| , we have

1 2
u(yα ) + |x − yα | ≤ u(x),

hence yα ≤ x. By Lemma 7.1 (e) we have |x − yα | ≤ Cα[u]C 0,1 (ΩR ) < dist(x, ∂R Ω),
hence x ∈ Mα (u). By Lemma 6.3, there exist constants Cd > 0 and Cd,ρ > 0
such that ΩR+δ ⊂ Mα (u) when δ = Cd,ρ αR−2d+2 . By Lemma 7.1 and Theorem
4.4 we have

[uα ]C 0,1 (ΩR ) ≤ [u]C 0,1 (ΩR ) ≤ Cd,ρ R−d+1 .

By (7.3), we have

2
Cd,ρ R(d−1) ≤ (uα )xi ≤ Cd,ρ R−(d−1)

in the viscosity sense. Given λ > 0, let A1 denote the event that

sup (un − (1 + λ)uα ) ̸= sup (un − (1 + λ)uα ) . (7.4)


ΩR ΩR+δ

2
Let γ = Cd,ρ R(d−1) , γ = Cd,ρ R−(d−1) , δ = Cd,ρ αR−2d+2 , and ε ∈ (0, 1). Set
log2 n
ν= log log n , Rn = n−1/2d γ 1/2 ε−1/2 ν, and λ = Cd,k,ρ (Rn + γ −2 α−1 ε + R−d+1 α).
Assume for now that supΩR (un − (1 + λ)uα ) ≥ 2ε. Then for all n > 1 with
n1/d > Cd,ρ,k R−d+1 ε−1 log(n)2 and ε satisfying

ε ≤ Cd min(αγ 2 γ −1 log(n)−2 , Rd−1 δ) (7.5)

we have P(A1 ) ≥ 1 − Cd,k ε−6d n−k by Lemma 6.1. Let A2 be the event that

40
2
supΩR \ΩR+δ un ≤ Cd,k,ρ δ. By Lemma 6.5 we have P(A2 ) ≥ 1 − Cd,k δ −3d n−k
for some constant Cd . As ε ≤ δ, we have P(A1 ∩ A2 ) ≥ 1 − Cd,k ε−Cd n−k . If
A1 ∩ A2 holds, then using (7.4) and (7.2) we have

sup (un − u) ≤ sup (un − (1 + λ)uα ) + ∥u∥L∞ (ΩR ) λ + ∥u − uα ∥L∞ (ΩR )


ΩR ΩR \ΩR+δ

≤ sup un + Cd,ρ λ + Cd αR−2d+2


ΩR \ΩR+δ

≤ Cd,k,ρ δ + Cd,ρ λ + Cd,ρ R−2d+2 α

2
Using γ −2 = Cd,ρ R−2d +4d−2
, we have
 2

sup (un − u) ≤ Cd,k,ρ Rn + R−2d+2 α + R−2d +4d−2 α−1 ε . (7.6)
ΩR

In the case where supΩR (un − (1 + λ)uα ) < 2ε, we cannot apply Lemma 6.1, but
obtain (7.6) by observing that

sup (un − u) ≤ sup (un − (1 + λ)uα ) + ∥u∥L∞ (ΩR ) λ + ∥u − uα ∥L∞ (ΩR )


ΩR ΩR

≤ 2ε + Cd,ρ λ + Cd,ρ (R−2(d−1) α)


 2

≤ Cd,k,ρ Rn + R−2(d−1) α + R−2d +4d−2 α−1 ε .

Let E denote the right-hand side of (7.6), and we select the parameters α and
ε to minimize E subject to the constraints from Lemmas 6.5 and 6.1. Let
2d2 −3d+1 (−2d2 +6d−4) √ −2d2 +9d−7
ε = νn−1/2d R 2 and α = R 2 ε=R 4 n−1/4d ν 1/2 . Then
we have
(−2d2 +d+1)
E ≤ Cd,k,ρ R 4 ν 1/2 n−1/4d .

This is subject to the constraints

α ≤ R (from Lemma 6.5) (7.7)


Cd,ρ,k log(n)2 R−2 δ −1 ≤ n1/d (from Lemma 6.1) (7.8)
2 −d+1 −1 1/d
Cd,ρ,k log(n) R ε ≤n (from Lemma 6.1) (7.9)
2
−3d+1
ε ≤ Cd R2d α log(n)−2 (from Lemma 6.1) (7.10)
d−1
ε ≤ Cd R δ (from Lemma 6.1) . (7.11)

2d2 −3d+1 2
As εn1/d = n1/2d R 2 ν, (7.9) is satisfied when n1/d ≥ Cd,ρ,k R−(2d −d−1)
log(n)C .
2 6d2 −3d−3
−3d+1
We also have (7.8), since ε < δ. As R2d α = R 4 n−1/4d ν 1/2 and

41
2d2 −3d+1 2
ε=R 2 n−1/2d ν, (7.10) is satisfied when n1/d ≥ Cd,k,ρ R−2d +9d+1 log(n)C .
2
It also follows that (7.11) holds, as Rd−1 δ = R−d+1 α ≥ R2d −3d+1 α. Since α =
−2d2 +9d−7 2
R 4 ν 1/2 n−1/4d , (7.7) is satisfied when n1/d ≥ Cd,k,ρ R−(2d −9d+11)
log(n)C .
It is straightforward to check that the most restrictive of these conditions is
2
n1/d ≥ Cd,k,ρ R−(2d −d−1)
log(n)C , hence all constraints are satisfied when this
holds. We conclude that
1/2 !
log2 n

(2d2 +d+1)
− −1/4d
P sup (un − u) > Cd,ρ,k R 4 n ≤ Cd,ρ,k RCd n−k .
ΩR log log n

(b) By Theorem 3.5, there exist constants Cρ and Cd,ρ such that u is semiconvex
on ΩR with semiconvexity constant Cd,ρ R−2d+1 when R < Cρ . Given ε > 0 and
λ > 0, let E be the event that

sup ((1 − λ)u − un ) ̸= sup ((1 − λ)u − un ) .


ΩR ΩR+δ

2
Let γ = Cd,ρ R(d−1) , γ = Cd,ρ R−(d−1) , α = Cd,ρ R−2d+1 and ε ∈ (0, 1). Set
log2 n
ν= log log n , Rn = n−1/2d γ 1/2 ε−1/2 ν, and λ = Cd,k,ρ (Rn + γ −2 R−2d+1 ε). Then
2
1/d −2d+2 −2
for all n > Cd,ρ,k R2d ε and ε satisfying (7.5), we have P(E) ≥
−6d −k
1 − Cd,k ε n by Lemma 6.1. We assume for the remainder of the proof that
E holds. By Lemma 6.6 we have

sup ((1 − λ)u − un ) = sup ((1 − λ)u − un ) ≤ Cd,ρ,k δ.


ΩR ΩR \ΩR+δ

2
Letting δ = λ and using γ −2 R−2d+1 = Cd,ρ R−2d +2d−1
, we have

sup (u − un ) = sup ((1 − λ)u − un + λu)


ΩR ΩR

≤ sup ((1 − λ)u − un ) + Cd,ρ λ


ΩR \ΩR+δ
2
≤ Cd,ρ,k (Rn + R−2d +2d−1
ε).

4d2 −5d+3
2
ν 2/3
Letting E = (Rn +R−2d +2d−1
ε), we select ε to minimize E. Let ε = R 3
n1/3d
,
2
so Rn = R−2d +2d−1
ε. Then

2 −2d2 +d
E = R−2d +2d−1
ε=R 3 n−1/3d ν 2/3 .

42
From Lemma 6.1 we have the following constraints

2 8d2 −10d+6
−2d+1
Cd,ρ,k R2d ≤ ε2 n1/d = n1/3d R 3 ν 4/3 (7.12)
2d2 −3d+1 −2 2d2 −5d+1 −2
ε ≤ Cd αR log(n) =R log(n) (7.13)
d−1
ε ≤ Cd R δ. (7.14)

2
As δ = λ ≥ R−2d +2d−1
ε, it is clear that (7.14) is satisfied. Observe that (7.12) is
2
satisfied when n 1/d
≥ R−2d +4d−4
log(n)C . Furthermore, (7.13) is satisfied when
−2d2 +10d+2
1/d C
n ≥ Cd,k,ρ R 3 log(n) for some constant C. As the most restrictive
1/d −2d2 +4d−4
of these conditions is n ≥R log(n)C , when this is satisfied we can
conclude that
2/3 !
log2 n

−2d2 +d
P sup (u − un ) > Cd,ρ,k R 3 n 1/3d
≤ Cd,ρ,k RCd n−k .
ΩR log log n

7.2 Proof of Theorem 3.3


We now establish our convergence rate result on Ω0 = (0, 1]d by using the
auxiliary problem (2.2) as an approximation to (3.4). The proof is conceptually
straightforward, using un as an approximation to vn and u as an approximation
of v. As with Theorem 3.2 we obtain a sharper result in (b) thanks to Theorem
3.5.

Proof of Theorem 3.3. (a) Given R > 0, let E1 be the event that

1/2
log2 n

(2d2 +d+1)
sup (un − u) ≤ Cd,ρ,k R− 4 n−1/4d . (7.15)
ΩR log log n

2
By Theorem 3.2 (a) we have P(E1 ) ≥ 1−Cd,k,ρ RCd n−k for all n1/d ≥ Cd,k,ρ R−(2d −d−1)
log(n)C .
Given any x ∈ Ω0 and longest chain C in [0, x] ∩ Xnρ , let C1 = {y ∈ C : y ∈ ΩR }
and C2 = {y ∈ C : y ∈
/ ΩR }. Then

vn (x) − un (x) ≤ vn (C) − un (C1 ) = vn (C2 ) ≤ sup vn . (7.16)


Ω0 \ΩR

Let E2 denote the event that supΩ0 \ΩR vn ≤ Cd,ρ R holds. By Lemma 6.5 we
have P(E2 ) ≥ 1 − Cd,k,ρ R−Cd n−k , for all n with n1/d ≥ Cd,k,ρ R−d−1 log(n)4 .
Then P(E1 ∩ E2 ) ≥ 1 − Cd,k,ρ RCd n−k , and for the remainder of the proof, we
 1/2
log2 n
assume that E1 ∩ E2 holds. Let ν = log log n . Using (7.16) ,(7.15), and

43
u ≤ v on ΩR , we have for all x ∈ ΩR that

vn (x) − v(x) ≤ (vn (x) − un (x)) + (un (x) − u(x)) + (u(x) − v(x))
(2d2 +d+1)
≤ Cd,k,ρ (R + R− 4 n−1/4d ν).

Hence we have
(2d2 +d+1)
sup(vn − v) ≤ Cd,k,ρ (R + R− 4 n−1/4d ν).
Ω0

(2d2 +d+1)
Letting R = Kn−β , we select the maximum value of β such that R ≥ R− 4 n−1/4d ν
2
and n1/d ≥ Cd,k,ρ R−(2d −d−1)
log(n)C hold when n > Cd,k,ρ . These are satisfied
when dβ(2d2 + d + 5) < 1 and dβ(2d2 − d − 1) < 1, respectively. Letting
1
β= 2d3 +d2 +5d+1 , we have

sup(vn − v) ≤ Cd,k,ρ KR.


Ω0

Choosing K so Cd,k,ρ K = 1, we conclude that for all n > Cd,ρ,k we have


 
−1/(2d3 +d2 +5d+1)
P sup (vn − v) > n ≤ Cd,ρ,k n−k+Cd .
Ω0

As this holds for all k ≥ 1, the result follows.


(b) Given R > 0, let E be the event that

2/3
log2 n

−2d2 +d
−1/3d
sup (u − un ) ≤ Cd,ρ,k R 3 n . (7.17)
ΩR log log n

By Theorem 3.2 (a) we have P(E) ≥ 1 − Cd,k,ρ RCd n−k for all n with

2
n1/d ≥ Cd,k,ρ R−2d +4d−4
log(n)C . (7.18)

For the remainder of the proof, we assume that E holds. By Lemma 6.6, we
 2/3
log2 n
have supΩ0 \ΩR (v − 1ΩR u) ≤ Cd,ρ R. Let ν = log log n . Using (7.17) and
un ≤ vn , we have for x ∈ ΩR that

v(x) − vn (x) ≤ (v(x) − u(x)) + (u(x) − un (x)) + (un (x) − vn (x))


−2d2 +d
≤ Cd,k,ρ (R + R 3 n−1/3d ν).

44
If x ∈ Ω0 \ ΩR , then we have v(x) − vn (x) ≤ supΩ0 \ΩR v ≤ Cd,ρ R. Hence, we
have

−2d2 +d
sup(v − vn ) ≤ Cd,k,ρ (R + R 3 n−1/3d ν).
Ω0

−2d2 +d−1
Letting R = Kn−β , we select β to satisfy R ≥ R 3 n−1/3d ν and (7.18) when
n ≥ Cd,ρ,k . These are satisfied when dβ(2d2 −d+3) < 1 and dβ(2d2 −4d+4) < 1,
respectively. Letting β = 2d3 −d21+3d+1 , we have

sup(vn − v) ≤ Cd,k,ρ Kn−β .


ΩR

Choosing K so Cd,k,ρ K = 1, we conclude that

3
−d2 +3d+1)
sup (vn − v) ≤ n−1/(2d
Ω0

Remark 7.2. Observe that 2d3 + d2 + 5d + 1 ≥ 2d3 − d2 + 3d + 1 for d ≥ 2.


This shows that making use of the semiconvexity estimates in Theorem 3.5 has
genuinely improved the convergence rates.

The following lemma allows us to extend our results to data modeled by a


sequence of i.i.d. random variables instead of a Poisson point process.

Lemma 7.3. Let {Yk }k=1 be i.i.d. random variables on Rd with continuous
density ρ and set Y n = {Y1 , . . . , Yn }. Let Xnρ be a Poisson point process with

intensity nρ. Let F : F → R where F = S ⊂ Rd : S is finite and suppose
that P (F (Xnρ ) > c) ≤ K. Then
 √
P F (Y n ) > c ≤ Ke n.

Proof. Let N ∼ Poisson(n). Then Y N is a Poisson process with intensity nρ, as


proven in [45]. Hence we have


 X  k k e−k
P F (Y N ) > c = P F (Y k ) > c ≤ K.
k!
k=0

n! √ 
By Stirling’s Formula, nn e−n ≤ e n. We conclude that P F (Y n ) > c ≤

Ke n.

45
8 Proof of Theorem 3.5
First we define the notion of semiconvexity.

Definition 8.1. A function u is said to be semiconvex with semiconvexity


constant C on a domain Ω if for all x ∈ Ω and h ∈ Rd such that x ± h ∈ Ω we
have

2
u(x + h) − 2u(x) + u(x − h) ≥ −C |h| .

A function u is said to be semiconcave with semiconcavity constant C if −u is


semiconvex with semiconvexity constant C.

We begin by showing that estimates on the semiconvexity constant of u


near the boundary automatically extend to the whole domain provided ρ1/d is
semiconvex. The key ingredient in the proof is the concavity of the Hamiltonian
L(p) = (p1 . . . pd )1/d .

Theorem 8.2. Let Ω ⊂ [0, M ]d be an open set, and assume that ∂Ω = Γ ∪ ∂ ∗ Ω


where Γ ⊂ ∂Ω is closed and ∂ ∗ Ω is as in (4.1). Given ε > 0, let Ωε =

x ∈ Rd : dist(x, Ω) < ε and suppose that u ∈ C(Ωε ) satisfies (ux1 . . . uxd )1/d =
ρ1/d in the viscosity sense on Ωε , where ρ ∈ C(Ωε ) satisfies (3.1) and ρ1/d is
semiconvex on Ω with semiconvexity constant Kρ . Suppose there exists h ∈ Rd
such that |h| < ε and

2
u(x + h) − 2u(x) + u(x − h) ≥ −Ku |h| for x ∈ Γ. (8.1)

Then we have

−1/d 2
u(x+h)−2u(x)+u(x−h) ≥ −(1+∥u∥L∞ (Ωε ) ) max(Ku , ρmin Kρ ) |h| for x ∈ Ω.

u(x+h)+u(x−h)
Proof. Set L(p) = (p1 . . . pd )1/d and w(x) = 2 , and we show that w
2
satisfies L(Dw) ≥ ρ 1/d
− Kρ |h| on Ω. Given x0 ∈ Ω, let φ ∈ C ∞ (Rd ) such that
w − φ has a local minimum at x0 . Without loss of generality we may assume
that w(x0 ) = φ(x0 ) and w − φ has a strict global minimum at x0 . Let
 
1 1 x+y α 2
Φ(x, y) = u(x) + u(y) − φ + |x − y − 2h| .
2 2 2 2

Since Φ is continuous, it attains its minimum at some (xα , yα ) ∈ Ω × Ω. Fur-

46
thermore, we have

Φ(xα , yα ) ≤ Φ(x0 + h, x0 − h) = w(x0 ) − φ(x0 ) = 0. (8.2)

As u and φ are bounded on Ω, we have

α 2
|xα − yα − 2h| ≤ Cd . (8.3)
2

By compactness of Ω × Ω, there exists a sequence αn → ∞ such that {xαn }


and {yαn } are convergent sequences. Set xn = xαn , yn = yαn , and let (b
x, yb) =
b − yb = 2h, and now we verify that (b
limn→∞ (xn , yn ). By (8.3) we have x x, yb) =
(x0 + h, x0 − h). By lower semicontinuity of Φ, we have

lim inf Φ(xn , yn ) ≥ Φ(b y + h) − φ(b


x, yb) = w(b y + h) ≥ 0.
n→∞

By (8.2) we have

y + h) − φ(b
lim Φ(xn , yn ) = 0 = w(b y + h).
n→∞

Since w − φ has a strict global minimum at x0 , we conclude that x0 = yb + h.


Since x x, yb) = (x0 + h, x0 − h). As x0 ± h ∈ Ωε , there exists
b = yb + 2h, we have (b
N > 0 such that(xn , yn ) ∈ Ωε × Ωε when n > N . Let ψ1 and ψ2 be given by
 
x + yn 2
ψ1 (x) = −u(yn ) + 2φ − αn |x − yn − 2h|
2
 
xn + y 2
ψ2 (y) = −u(xn ) + 2φ − αn |xn − y − 2h| .
2

By construction, u−ψ1 has a local minimum at xn and u−ψ2 has a local minimum
at yn , Dψ1 (xn ) = pn − 2qn , and Dψ2 (yn ) = pn + 2qn where pn = Dφ xn +yn

2
and qn = αn (xn − yn − 2h). Since u satisfies L(Du) = ρ1/d on Ωε , we have
L(pn + 2qn ) ≥ ρ(yn )1/d and L(pn − 2qn ) ≥ ρ(xn )1/d for n > N . Using concavity
of L, we have
 
pn + 2qn pn − 2qn
L(pn ) = L +
2 2
1
≥ (L (pn + 2qn ) + L (pn − 2qn ))
2
1
≥ (ρ(xn )1/d + ρ(yn )1/d ).
2

47
x, yb) = (x0 + h, x0 − h), lower semicontinuity of L, and semiconvexity of
Using (b
ρ1/d , we have
  
xn + yn 1
L(Dφ(x0 )) ≥ lim sup L Dφ ≥ (ρ(x0 + h)1/d + ρ(x0 − h)1/d )
n→∞ 2 2
Kρ 2
≥ ρ(x0 )1/d − |h| .
2
Kρ 2 1 2 −1/d
It follows that L(Dw) ≥ ρ1/d − 2 |h| on Ω. Given θ > 0, set θ = 2 |h| max(Ku , ρmin Kρ )
and let wθ = (1 + θ)w + θ. By Proposition 4.2 we have

Kρ 2 1/d
L(Dwθ ) ≥ ρ1/d − |h| + dρmin θ ≥ ρ1/d on Ω.
2
Ku 2
Furthermore, by (8.1) we have w ≥ u − 2 |h| on Γ. By choice of θ, we have
wθ ≥ u on Γ. As ∂Ω = Γ ∪ ∂ ∗ Ω by assumption, we may apply Theorem 4.3 to
conclude wθ ≥ u on Ω. Hence for all x ∈ Ω we have

u(x + h) + u(x − h)
w(x) = ≥ u(x) − θ(1 + ∥w∥L∞ (Ω) )
2
1 2 −1/d
≥ u(x) − ∥u∥L∞ (Ωε ) |h| max(Ku , ρmin Kρ ).
2

and the result follows.

Next we establish the existence of an approximate solution to (3.6) for R = 1


in a neighborhood of the boundary when ρ(x)1/d = a + p · (x − x0 ). The
approximate solution is constructed as w + v where w is the solution to (3.6)
when ρ = 1 and v solves a related PDE. Given x0 ∈ ∂1,M , p ∈ Rd , and a > 0, let

1 − d−1
v(x) = a d ((p · x)((x1 . . . xd )1/d − (x1 . . . xd )−1/d ) − 2(p · x0 )((x1 . . . xd )1/d − 1))
2
(8.4)

and

w(x) = a1/d d(x1 . . . xd )1/d − a1/d d (8.5)

and

u = w + v. (8.6)

48
Theorem 8.3. Given x0 ∈ ∂1,M Ω, p ∈ Rd , and a > 0, let v and w be as in
(8.4) and (8.5).

(a) We have
  
Xd Y


  wxj  vxi = p · (x − x0 ) in Ω1,M
i=1 j̸=i (8.7)



 v=0 on ∂1,M Ω.

(b) Let u = w + v. Then for all ε ≤ Cd (aM d (1 + |p|))−1 we have


(
ux1 . . . uxd = a + p · (x − x0 ) + E(x) in B(x0 , ε)
(8.8)
u=0 on B(x0 , ε) ∩ ∂1,M Ω

2
where |E| ≤ Cd a−1 |p| M 2 ε2 . Furthermore, u is nondecreasing in each
coordinate within B(x0 , ε) ∩ Ω1,M .

Proof. We first prove (a). Using (8.4), it is clear that v = 0 on ∂1,M Ω. Further-
more, we have
 
d−1 p · (x − 2x0 )  p · x
2a d x i vx i = xi pi + (x1 . . . xd )1/d + −xi pi + (x1 . . . xd )−1/d
d d

and
d
d−1 X
2a d xi vxi = 2p · (x − x0 )(x1 . . . xd )1/d .
i=1

and it follows that (8.7) is satisfied. To see that (8.8) holds, observe that
 
d
Y d
X Y
(wxi + vxi ) = wx1 . . . wxd + vx i  wxj  + E = a + p · (x − x0 ) + E
i=1 i=1 j̸=i

where
d
X X Y Y
E= vx i wxj .
ℓ=2 K⊂{1,...,d} i∈K j ∈K
/
|K|=ℓ

To bound |E| within B(x0 , ε), we establish some estimates on vxi . Using that

49
1
maxx∈Ω1,M xi = M d−1 , it is straightforward to verify that Dv(x0 ) = 0 and

d−1
Cd a− d M d |p|
vxi xj (x0 ) ≤ (8.9)
x0,i

for 1 ≤ j ≤ d. Hence, we have


d−1
Cd a− d M d |p| ε
|vxi | ≤ in B(x0 , ε).
x0,i

Since wxi = a1/d (x1 . . . xd )1/d x−1


i , for x ∈ B(x0 , ε) and ε ≤ M
−(d−1)
we have

a1/d Ca1/d ε Ca1/d


|wxi | ≤ + 2 ≤ .
x0,i x0,i x0,i

Then for any K ⊂ {1, . . . , d} with |K| = ℓ we have

Y Y d−1 Y Y Ca1/d
|vxi | wxj ≤ Cd (a− d M d |p| ε)ℓ x−1
0,i
x0,j
i∈K j ∈K
/ i∈K j ∈K
/
ℓ
≤ Cd a1−ℓ M d |p| ε .

2
It follows that |E| ≤ Cd a−1 |p| M 2d ε2 for ε ≤ 1
M d (1+|p|)
. To verify that u is
nondecreasing in each coordinate within B(x0 , ε) ∩ Ω1,M , observe that in B(x0 , ε)
we have

Cd a1/d M d
wxi xj (x0 ) ≤ .
x0,i

Using (8.9), we have


d−1
Cd a− d M d |p|
uxi xj (x0 ) ≤ .
x0,i

Using uxi (x0 ) = a1/d x−1


0,i , when ε ≤ Cd aM
−d
(1 + |p|) and x ∈ B(x0 , ε) we have

d−1
uxi (x) ≥ a1/d x−1
0,i − Cd εa
− d |p| x−1
0,i ≥ 0.

We can now apply the comparison principle to show that u approximates u


near the boundary.

50
Proposition 8.4. Let u denote the solution to (3.6), where ρ ∈ C 2 (Ω1,M )
satisfies (3.1). Given x0 ∈ ∂1,M Ω, let u = v + w where v and w are as in
(8.4) and (8.5) with a = ρ(x0 ) and p = Dρ(x0 ). Then there exists a constant
Cd,ρ,M > 0 such that when ε ≤ Cd,ρ,M we have

sup |u − u| ≤ Cd ε3 M 3d−1 Aρ
B(x0 ,ε)∩Ω1,M

where
 
−2 2
Aρ = [ρ1/d ]C 0,1 (Rd ) ρmin ∥Dρ∥L∞ (∂1,M Ω) + ρ−1 2
min D ρ L∞ (∂1,M Ω)
.

Proof. Letting H(p) = p1 . . . pd , by Theorem 8.3 we have

H(Du) = ρ(x0 ) + Dρ(x0 ) · (x − x0 ) + E(x) in Ω1,M

2
where |E| ≤ Cd ρ−1 2d 2 2
min ∥Dρ(x0 )∥ M ε in B(x0 , ε). As ρ ∈ C (Ω1,M ), for x ∈
B(x0 , ε) ∩ Ω1,M we have

ρ(x) ≥ ρ(x0 ) + Dρ(x0 ) · (x − x0 ) − D2 ρ(x0 ) L∞


ε2 . (8.10)

Given λ > 0, by Proposition 4.2 and (8.10) we have in B(x0 , ε) ∩ Ω1,M that

H((1 + λ)Du) ≥ (1 + λ)ρ


≥ ρ + dρmin λ
≥ ρ(x0 ) + Dρ(x0 ) · (x − x0 ) − D2 ρ(x0 ) L∞
ε2 + dρmin λ.

2
Letting λ = Cd ρ−2 2d 2 −1 2
min ∥Dρ∥L∞ (∂1,M Ω) M ε + ρmin D ρ L∞ (∂1,M Ω)
ε2 , we have

H((1 + λ)Du) ≥ H(Du) in B(x0 , ε) ∩ Ω1,M .

Observe that (1+λ)u = u = 0 on ∂1,M Ω and u is nondecreasing in each coordinate.


We may apply Theorem 4.3 with Ω = B(x0 , ε) ∩ Ω1,M and Γ = B(x0 , ε) ∩ ∂1,M Ω
to conclude that (1 + λ)u ≥ u on B(x0 , ε) ∩ Ω1,M . By Lemma 4.4 we have
∥u∥L∞ (B(x0 ,ε)∩Ω1,M ) ≤ Cd M d−1 ρ1/d C 0,1 (Rd )
ε. Hence, we have

u ≥ u − ∥u∥L∞ (B(x0 ,ε)∩Ω1,M ) λ ≥ u − Cd M d−1 ρ1/d λ.


C 0,1 (Rd )

From Theorem 8.3, u is nondecreasing in each coordinate within B(x0 , ε). An

51
analogous application of Theorem 4.3 shows that (1−λ)u ≤ u on B(x0 , ε)∩Ω1,M ,
and we conclude that supB(x0 ,ε)∩Ω1,M |u − u| ≤ Cd M d−1 ρ1/d C 0,1 (Rd )
λ.

Now we establish semiconvexity estimates on u in a neighborhood of ∂1,M .

Lemma 8.5. Given x0 ∈ ∂1,M Ω, a > 0, and p ∈ Rd , let u be as in (8.6). Then


the following statements hold.

(a) For all η ∈ Rd we have

2 d−1
η ⊤ (D2 u(x0 ))η ≥ −Cd |η| (a− d M 2d−1 |p| + a1/d M 2d−2 ).

(b) There exist values of a > 0, p ∈ Rd , x0 ∈ ∂1,M Ω and η ∈ Rd such that

2 d−1
η ⊤ (D2 u(x0 ))η ≤ −Cd |η| (a− d M 2d−1 |p| + a1/d M 2d−2 ).

Proof. (a) We have


 
1 − d−1 dxj pj + dxi pi + p · (x − 2x0 ) + da da + p · (x − 2x0 )
uxi xj = a d − δij (x1 . . . xd )1/d
2 d2 xi xj dx2i
 
1 − d−1 dxj pj + dxi pi − p · x p·x
+ a d − δij 2 (x1 . . . xd )−1/d .
2 d2 xi xj dxi

In the following calculation we shall employ the shorthand notation η · x−1 :=


Pd −1 2 Pd
i=1 ηi xi and x−1 := i=1 x−2 d
i . Let η ∈ R with |η| = 1. Then we have

d
!
d−1
⊤ 2
X 2dx0,i pi + 2dx0,j pj − 2p · x0 + da a
2a d η (D u(x0 ))η = ηi ηj − δij 2
i,j=1
d2 x0,i x0,j x0,i
d d d
ηi2
 
2 X pi pj p · x0 a X ηi ηj X
= ηi ηj + − + −a
d i,j=1 x0,j x0,i dx0,i x0,j d i,j=1 xi xj x2
i=1 0,i
 
2 1 a 2
= (η · x−1 0 ) 2(η · p) − (p · x0 )(η · x −1
0 ) + (η · x−1 2 −1
0 ) − a η · x0
d d d
 2 2

≥ −Cd |p| x−1 0 + |p| |x0 | x−10 + a x−1 0

Using that minx∈Ω1,M xi = M −d+1 , we have

d−1
η ⊤ (D2 u(x0 ))η ≥ −Cd M 2d−1 |p| + aM 2d−2 .

2a d

1
(b) Let a = 1, η = e1 , and define x0 ∈ ∂1,M Ω by x0,1 = M d−1
, x0,j = M for

52
v
where v = 2e1 − dxx0,1

j = 2, . . . , d and p = − ∥v∥ 0
. Then M 2d−1 |p| + aM 2d−2 ≤
2M 2d−1 . We have
 
⊤ 2 2 x0 1 − (1/d)
η (D u(x0 ))η = p · 2ei − −
dx0,1 dx0,1 x20,1
2
≤ − M d−1 ∥v∥
d
2
q
= − M d−1 (2 − d−1 )2 + (d − 1)d−1 M 2d
d
≤ −Cd M 2d−1 .

Remark 8.6. Result (a) establishes an upper bound on the semiconvexity


constant of u, while result (b) shows that this is the best bound (up to a constant
Cd ) that can hold without additional restrictions on ρ. In (a) if we assume also
that p · x0 ≤ 0 the result can be improved to

d−1
η ⊤ (D2 u(x0 ))η ≥ −Cd (a− d |p| M d−1 + a1/d M 2d−2 ).

Proof of Theorem 3.5. We will prove the result in two steps, first considering
the R = 1 case, and then proving the general case using a scaling argument.
Given M ≥ 1, let u denote the solution of
(
(ux1 ux2 . . . uxd )1/d = ρ1/d in Ω1,2M
(8.11)
u=0 on ∂1,2M Ω.

1/d 1/d
Letting w(x) = dρmax (x1 . . . xd )1/d − dρmax , we have w = 0 on ∂1,2M and
1/d
(wx1 . . . wxd )1/d = ρmax ≥ u on Ω1,2M . By Theorem 4.3 we have u ≤ w on
1/d
Ω1,2M , hence ∥u∥Ω1,2M ≤ Cd M ρmax . Given ε > 0, let h ∈ Rd with |h| = ε and
set Γε = {x ∈ Ω1,M : dist(x, ∂1,M Ω) < ε} and Uε = Γ3ε \ Γε . Given x ∈ Uε ,
there exists x0 ∈ ∂1,M Ω such that x ∈ B(x0 , 3ε). Let u be as in Proposition 8.4,
with a = ρ(x0 ) and p = Dρ(x0 ). By Proposition 8.4, there exists a constant
Cd,ρ,M > 0 such that when ε ≤ Cd,ρ,M we have

sup |u − u| ≤ Cd,ρ,M ε3 .
B(x0 ,3ε)∩Ω1,M

By Lemma 8.5 (a) we have

η ⊤ D2 u(x0 )η ≥ −Cd Ku

53
where

−(d−1)/d
Ku = ρmin ∥Dρ∥L∞ (∂1,M Ω) M 2d−1 + ρ1/d
max M
2d−2
.

As u is smooth, there exists a constant Cd,ρ,M > 0 such that

inf η ⊤ D2 u(y)η ≥ −Cd Ku − Cd,ρ,M ε.


y∈B(x0 ,3ε)

and it follows that

2
u(x + h) − 2u(x) + u(x − h) ≥ − |h| (Cd Ku + Cd,ρ,M ε).

Hence, we have

u(x + h) − 2u(x) + u(x − h) u(x + h) − 2u(x) + u(x − h)


2 ≥ 2 − 4ε−2 sup |u − u|
|h| |h| B(x0 ,3ε)∩Ω1,M

≥ −(Cd Ku + Cd,ρ,M ε)

This holds for all x ∈ Uε , hence also for all x ∈ Uε . Letting Ω = Ω1,M \
Γε and Γ = {x ∈ Ω1,M : dist(x, ∂1,M Ω) = ε}, we have ∂Ω = Γ ∪ ∂ ∗ Ω and

y ∈ Rd : dist(y, Ω) < ε ⊂ Ω1,2M for ε ≤ M . As ρ1/d is semiconvex on Ω1,M
with semiconvexity constant D2 (ρ1/d ) L∞ (Ω1,M )
, we may apply Theorem 8.2
to conclude that for all x ∈ Ω1,M \ Γε we have

u(x + h) − 2u(x) + u(x − h)


2 ≥ −Cd (1 + ∥u∥L∞ (Ω1,2M ) )(Ku + Cd,ρ,M ε)
|h|
≥ −Cd (1 + M ρ1/d
max )(Ku + Cd,ρ,M ε)

where
 
−1/d
Ku = max Ku , ρmin D2 (ρ1/d ) .
L∞ (Ω1,M )

As this holds for all ε < Cd,ρ,M and h ∈ Rd with |h| = ε, we conclude that for
all x ∈ Ω1,M and h ∈ Rd with x ± h ∈ Ω1,M we have

u(x + h) − 2u(x) + u(x − h)


2 ≥ −Cd (1 + M ρ1/d
max )Ku .
|h|

Now we prove Theorem 3.5 in full generality. Let u denote the solution of (3.6)

54
and let q(x) = u(Rx). Then q satisfies
(
qx1 qx2 . . . qxd = g in Ω1,R−1 M
q=0 on ∂1,R−1 M Ω

where g = Rd ρ(Rx). By our R = 1 result, q satisfies


 
q(x + h) − 2q(x) + q(x − h) −1/d
2 ≥ −Cd (1 + M ρ1/d
max ) max K, R2 ρmin D2 (ρ1/d )
|h| L∞ (ΩR,M )

−(d−1)/d
where K = R−2d+3 (E1 + E2 ) with E1 = ρmin ∥Dρ∥L∞ (∂R,M Ω) M 2d−1 and
1/d
E2 = ρmax M 2d−2 . Then there exists a constant Cρ > 0 such that when R ≤ Cρ
we have

q(x + h) − 2q(x) + q(x − h) −2d+3


2 ≥ −Cd (1 + M ρ1/d
max )R (E1 + E2 ) .
|h|

Hence for all y ∈ Ω1,R−1 M and h′ ∈ Rd with h′ ̸= 0, we have

u(Ry + Rh′ ) − 2u(Ry + Rh′ ) + u(Ry − Rh′ ) −2d+1


2 ≥ −Cd (1 + M ρ1/d
max )R (E1 + E2 )
|Rh′ |

Replacing Rh′ with h and Ry with x, we conclude that for all x ∈ ΩR,M and
h ̸= 0 we have

u(x + h) − 2u(x) + u(x − h) −2d+1


2 ≥ −Cd (1 + M ρ1/d
max )R (E1 + E2 ) .
|h|

55
Part II

Poisson Learning: Graph Based


Semi-Supervised Learning
At Very Low Label Rates

56
9 Introduction
Semi-supervised learning uses both labeled and unlabeled data in learning tasks.
For problems where very few labels are available, geometric or topological
structure in unlabeled data can be used to greatly improve the performance of
classification, regression, or clustering algorithms. One of the most widely used
methods in graph-based semi-supervised learning is Laplace learning, originally
proposed in [71], which seeks a graph harmonic function that extends the
labels. Laplace learning, and variants thereof, have been widely applied in
semi-supervised learning [67, 66, 69, 2] and manifold ranking [35, 64, 63], among
many other problems.
This paper is concerned with graph-based semi-supervised learning at very
low label rates. In this setting, it has been observed that Laplace learning can
give very poor classification results [55, 28]. The poor results are often attributed
to the fact that the solutions develop localized spikes near the labeled points and
are almost constant far from the labeled points. In particular, label values are
not propagated well by the Laplacian learning approach. To address this issue,
recent work has suggested to consider p-Laplace learning [28]. Several works
have rigorously studied p-Laplace regularization with few labels [57, 14, 12], and
recent numerical results show that p > 2 is superior to Laplace learning at low
label rates [30]. The case of p = ∞ is called Lipschitz learning [48], which seeks
the absolutely minimal Lipschitz extension of the training data. Other methods
to address low label rate problems include higher order Laplacian regularization
[70] and spectral cutoffs [5].
While p-Laplace learning improves upon Laplace learning, the method is
more computationally burdensome than Laplace learning, since the optimality
conditions are nonlinear. Other recent approaches have aimed to re-weight the
graph more heavily near labels, in order to give them wider influence when
the labeling rate is very low. One way to re-weight the graph is the Weighted
Nonlocal Laplacian (WNLL) [56], which amplifies the weights of edges directly
connected to labeled nodes. The WNLL achieves better results at moderately
low label rates, but still performs poorly at very low label rates [30]. To address
this, [19] proposed the Properly Weighted Laplacian, which re-weights the graph
in a way that is well-posed at arbitrarily low label rates.
Much of the recent work on low label rate problems has focused on formulating
and implementing new learning approaches that are well-posed with few labels.
The exact nature of the degeneracy in Laplace learning, and the question of how

57
the tails of the spikes propagate label information, has not been studied and
is still poorly understood. For some problems, the performance of Laplacian
learning is good [71], while for other problems it is catastrophic, yielding very
poor classification results similar to random guessing [56, 30].
In this paper, we carefully analyze Laplace learning at very low label rates,
and we discover that nearly all of the degeneracy of Laplace learning is due to a
large constant bias in the solution of the Laplace equation that is present only at
low label rates. In order to overcome this problem we introduce a new algorithm,
we call Poisson learning, that gives very good classification performance down
to extremely low label rates. We give a random walk interpretation of Poisson
learning that shows how the method uses information from the random walkers
only before they reach the mixing time of the random walk and forget their initial
condition. We also propose a graph-cut enhancement of Poisson learning, called
Poisson MBO, that can incorporate knowledge about class sizes, and further
improves the class-label accuracy.
The rest of the paper is organized as follows. In Section 10 we first briefly
introduce Poisson learning, and then provide a detailed motivation for the
algorithm and a theoretical analysis from both the variational and random walk
perspectives. The Poisson MBO approach is presented in Section 10.4. In Section
11 we present the step-by-step algorithms and discuss implementation details for
the Poisson and Poisson MBO algorithms. In Section 12 we present numerical
experiments with semi-supervised classification on the MNIST, FashionMNIST,
and Cifar-10 datasets. The proofs of the results are available in the supplementary
materials.

10 Poisson learning
Let X = {x1 , x2 , . . . , xn } denote the vertices of a graph with edge weights wij ≥ 0
between xi and xj . We assume the graph is symmetric, so wij = wji . We define
Pn
the degree di = j=1 wij . For a multi-class classification problem with k classes,
we let the standard basis vector ei ∈ Rk represent the ith class. We assume the
first m vertices x1 , x2 , . . . , xm are given labels y1 , y2 , . . . , ym ∈ {e1 , e2 , . . . , ek },
where m < n. The task of graph-based semi-supervised learning is to extend the
labels to the rest of the vertices xm+1 , xm+2 , . . . , xn .
The well-known Laplace learning algorithm [71] extends the labels by solving

58
the problem )
Lu(xi ) = 0, if m + 1 ≤ i ≤ n,
(10.1)
u(xi ) = yi , if 1 ≤ i ≤ m,
where L is the unnormalized graph Laplacian given by
n
X
Lu(xi ) = wij (u(xi ) − u(xj )).
j=1

Here, u : X → Rk and we write the components of u as

u(xi ) = (u1 (xi ), u2 (xi ), . . . , uk (xi )).

The label decision for vertex xi is determined by the largest component of u(xi )

ℓ(xi ) = arg max {uj (x)}. (10.2)


j∈{1,...,k}

We note that Laplace learning is also called label propagation (LP) [72], since
the Laplace equation (10.1) can be solved by repeatedly replacing u(xi ) with
the weighted average of its neighbors, which can be viewed as dynamically
propagating labels.
At very low label rates, we propose to replace the problem (10.1) by Poisson
1
Pm
learning: Let y = m j=1 yj be the average label vector and let δij = 1 if i = j
and δij = 0 if i ̸= j. One computes the solution of the Poisson equation

m
X
Lu(xi ) = (yj − y)δij for i = 1, . . . , n (10.3)
j=1

Pn
satisfying i=1 di u(xi ) = 0. While the label decision can be taken to be the
same as (10.2), it is also simple to account for unbalanced classes or training
data with the modified label decision

ℓ(xi ) = arg max {sj uj (x)}, (10.4)


j∈{1,...,k}

where sj = bj /(y · ej ) and bj is the fraction of data belonging to class j. We


explain this label decision in Remark 10.2. The Poisson equation (10.3) can be
solved efficiently with a simple iteration given in Algorithm 1.
Technically speaking, in Laplace learning, the labels are imposed as boundary

59
conditions in a Laplace equation, while in Poisson learning, the labels appears as
a source term in a graph Poisson equation. In the sections below, we explain why
Poisson learning is a good idea for problems with very few labels. In particular,
we give random walk and variational interpretations of Poisson learning, and
we illustrate how Poisson learning arises as the low label rate limit of Laplace
learning.

10.1 Random walk interpretation


We present a random walk interpretation of Poisson learning and compare to the
random walk interpretation of Laplace learning to explain its poor performance
at low label rates. We note that Laplace learning works very well in practice for
semi-supervised learning problems with a moderate amount of labeled data. For
example, on the MNIST dataset we obtained around 95% accuracy at 16 labels
per class (0.23% label rate). However, at very low label rates the performance
is poor. At 1 label per class, we find the average performance is around 16%
accuracy. This phenomenon has been observed in other works recently [55, 28].
However, a clear understanding of the issues with Laplace learning at low label
rates was lacking. The clearest understanding of this phenomenon comes from
the random walk interpretation, and this leads directly to the foundations for
Poisson learning.
Let x ∈ X and let X0x , X1x , X2x , . . . be a random walk on X starting at
X0x = x with transition probabilities

P(Xkx = xj | Xk−1
x
= xi ) = d−1
i wij .

Let u be the solution of the Laplace learning problem (10.1). Define the stopping
time to be the first time the walk hits a label, that is

τ = inf{k ≥ 0 : Xkx ∈ {x1 , x2 , . . . , xm }}.

Let iτ ≤ m denote the index of the point Xτx , so Xτx = xiτ . Then, by Doob’s
optimal stopping theorem, we have

u(x) = E[yiτ ]. (10.5)

This gives the standard representation formula for the solution of Laplace learning
(10.1). The interpretation is that we release a random walker from x and let it

60
Figure 2: Demonstration of spikes in Laplace learning. The graph consists of
n = 104 independent uniform random variables on [0, 1]2 and two points are
given labels of 0 and 1. Most values of the solution u of Laplace learning are
very close to 0.5.

walk until it hits a labeled vertex and then record that label. We average over
many random walkers to get the value of u(x).
When there are insufficiently many labels, the stopping time τ is so large that
we have passed the mixing time of the random walk, and the distribution of Xτx
P
is very close to the invariant distribution of the random walk π(xi ) = di / i di .
In this case, (10.5) gives that
Pm
di yi
u(x) ≈ Pi=1
m =: y w . (10.6)
j=1 dj

Of course, the function u is not exactly constant, and instead it is approximately


constant with sharp spikes at the labeled vertices; see Figure 2. Previous work
has proved rigorously that Laplace learning degenerates in this way at low label
rates [57, 14].
It is important to note that the constant vector y w depends only on the
degrees of the labeled nodes in the graph, which is very sensitive to local graph
structure. When Laplace learning returns a nearly constant label function, it
can be catastrophic for classification, since most datapoints are assigned the
same label. This explains the 16% accuracy in the MNIST experiment described

61
above.
In contrast, the Poisson equation (10.3) has a random walk interpretation
that involves the Green’s function for a random walk, and is in some sense dual
to Laplace learning. The source term on the right hand side of (10.3) represents
random walkers being released from labeled points and exploring the graph,
while carrying their label information with them. For T > 0 let us define
 
T X
X m
wT (xi ) = E  yj 1{X xj =xi }  .
k
k=0 j=1

x
Essentially, each time the random walk starting from xj , denoted Xk j , visits xi
we record the label yj and sum this over all visits from 0 ≤ k ≤ T . For short
time T , this quantity is meaningful, but since the random walk is recurrent (it
is a finite graph), wT (xi ) → ∞ as T → ∞. If we normalize by 1/T , we still
have the same issue as with Laplace learning, where we are only measuring the
invariant distribution. Indeed, we note that

T X
m
x
X
wT (xi ) = yj P(Xk j = xi )
k=0 j=1

x di
and lim P(Xk j = xi ) = Pn .
k→∞ i=1 di
Therefore, when k is large we have
m m
X x di X
yj P(Xk j = xi ) ≈ Pn yj ,
j=1 i=1 di j=1

and so the tail of the sum defining wT (xi ) is recording a blind average of labels.
The discussion above suggests that we should subtract off this average tail
behavior from wT , so that we only record the short-time behavior of the random
walk, before the mixing time is reached. We also normalize by di , which leads
us to define  
T m
X 1 X
uT (xi ) = E  (yj − y)1{X xj =xi }  , (10.7)
di j=1 k
k=0

1
Pm
where y = m j=1 yj . It turns out that as T → ∞, the function uT (xi ) converges
to the solution of (10.3).

62
Theorem 10.1. For every T ≥ 0 we have
 
m
X
uT +1 (xi ) = uT (xi ) + d−1
i
 (yj − y)δij − LuT (xi ) .
j=1

If the graph G is connected and the Markov chain induced by the random walk is
aperiodic, then uT → u as T → ∞, where u is the unique solution of the Poisson
Pn
equation (10.3) satisfying i=1 di u(xi ) = 0.

Theorem 10.1 gives the foundation for Poisson learning through the random
walk perspective, and in fact, it also gives a numerical method for computing
the solution (see Algorithm 1).

Remark 10.2. The representation formula (10.7) for the solution of Poisson
learning (10.3) shows that the solution u is a linear function of the label vectors
y1 , . . . , ym . That is, for any A ∈ Rk×k , the solution uA : X → Rk of
m
X
LuA (xi ) = (Ayj − Ay)δij for i = 1, . . . , n
j=1

Pn
satisfying i=1 di uA (xi ) = 0 is exactly uA = Au, where u is the solution of
(10.3). This shows that any reweighting of the point sources, by which we mean
yj 7→ Ayj , is equivalent to reweighting the solution by u 7→ Au.
If we set A = diag(s1 , . . . , sk ), then Au corresponds to multiplying the point
sources for class i by the weight si . We can use this reweighting to account
for unbalanced classes, or a discrepancy between the balancing of training and
testing data, in the following way. Let nj be the number of training examples
from class j, and let bj denote the true fraction of data in class j. We can choose
sj so that nj sj = bj to ensure that the mass of the point sources for each class,
weighted by sj , is proportional to the true fraction of data in that class. Since
nj is proportional to y · ej , this explains our modified label decision (10.4).

10.1.1 Proofs for Section 10.1

We recall X0x , X1x , X2x , . . . is a random walk on X starting at X0x = x with


transition probabilities

wij
P(Xkx = xj | Xk−1
x
= xi ) = .
di

63
Before giving the proof of Theorem 10.1, we recall some properties of random
walks and Markov chains. The random walk described above induces a Markov
chain with state space X. Since the graph is connected and X is finite, the
Markov chain is positive recurrent. We also assume the Markov chain is aperiodic.
This implies the distribution of the random walker converges to the invariant
distribution of the Markov chain as k → ∞. In particular, choose any initial
Pn
distribution p0 ∈ ℓ2 (X) such that i=1 p0 (xi ) = 1 and p0 ≥ 0, and define

n
X wij
pk+1 (xi ) = pk (xj ). (10.8)
j=1
dj

Then pk is the distribution of the random walker after k steps. Since the Markov
chain is positive recurrent and aperiodic we have that

lim pk (xi ) = π(xi )


k→∞

for all i, where


di
π(xi ) = Pn
i=1 di
is the invariant distribution of the Markov chain. It is simple to check that if
p0 ∈ ℓ2 (X) is any function (i.e., not necesarily a probability distribution), and
we define pk by the iteration (10.8), then

n
X
lim pk (xi ) = π(xi ) p0 (xj ). (10.9)
k→∞
j=1

We now give the proof of Theorem 10.1.

Proof of Theorem 10.1. Define the normalized Green’s function


" T # T
1 X 1 X x
GT (xi , xj ) = E 1{X xj =xi } = P(Xk j = xi ).
di k di
k=0 k=0

64
Then we have
T X
n
X wℓi x
j
di GT (xi , xj ) = δij + P(Xk−1 = xℓ )
dℓ
k=1 ℓ=1
n T
X wℓi X x
j
= δij + P(Xk−1 = xℓ )
dℓ
ℓ=1 k=1
n T −1
X wℓi X x
= δij + P(Xk j = xℓ )
dℓ
ℓ=1 k=0
n
X
= δij + wℓi GT −1 (xℓ , xj ).
ℓ=1

Therefore we have

di (GT (xi , xj ) − GT −1 (xi , xj )) + LGT −1 (xi , xj ) = δij ,

where the Laplacian L is applied to the first variable of GT −1 while the second
variable is fixed (i.e. LGT −1 (xi , xj ) = [LGT −1 (·, xj )]xi ). Since

m
X
uT (xi ) = (yj − y)GT (xi , xj )
j=1

we have
m
X
di (uT (xi ) − uT −1 (xi )) + LuT −1 (xi ) = (yj − y)δij .
j=1

Summing both sides over i = 1, . . . , n we find that


n
X n
X
(uT )d,X = di uT (xi ) = di uT −1 (xi ) = (uT −1 )d,X ,
i=1 i=1

where d = (d1 , d2 , . . . , dn ) is the vector of degrees. Therefore (uT )d,X =


(uT −1 )d,X = · · · = (u0 )d,X . Noting that

m
X
di u0 (xi ) = (yj − y)δij ,
j=1

we have (u0 )d,X = 0, and so (uT )d,X = 0 for all T ≥ 0. Let u ∈ ℓ2 (X) be the

65
solution of
m
X
Lu(xi ) = (yj − y)δij
j=1

satisfying (u)d,X = 0. Define vT (xi ) = di (uT (xi ) − u(xi )). We then check that
vT satisfies
n
X wij
vT (xi ) = vT −1 (xj ),
j=1
dj

and (vT )X = 0. Since the random walk is aperiodic and the graph is connected,
we have by (10.9) that limT →∞ vT (xi ) = π(xi )(v0 )X = 0, which completes the
proof.

10.2 Variational interpretation


We can also interpret Poisson learning (10.3) as a gradient regularized variational
problem. Before proceeding, we briefly review some facts about calculus on
graphs. Let ℓ2 (X) denote the space of functions u : X → Rk equipped with the
inner product
n
X
(u, v)ℓ2 (X) = u(xi ) · v(xi ).
i=1

This induces a norm ∥u∥2ℓ2 (X) = (u, u)ℓ2 (X) . We also define the space of mean-
zero functions
n n
X o
ℓ20 (X) = u ∈ ℓ2 (X) : di u(xi ) = 0
i=1

We define a vector field on the graph to be an antisymmetric function V : X 2 →


Rk (i.e., V (xi , xj ) = −V (xj , xi )). The gradient of a function u ∈ ℓ2 (X), denoted
for simplicity as ∇u, is defined to be the vector field

∇u(xi , xj ) = u(xj ) − u(xi ).

The inner product between vector fields V and W is


n
1 X
(V, W )ℓ2 (X 2 ) = wij V (xi , xj ) · W (xi , xj ).
2 i,j=1

and the norm of V is ∥V ∥2ℓ2 (X 2 ) = (V, V )ℓ2 (X 2 ) .

66
We now consider the variational problem
 m 
1 X
min ∥∇u∥2ℓ2 (X 2 ) − (yj − y)·u(xj ) . (10.10)
2
u∈ℓ0 (X) 2 j=1

The following theorem makes the connection between Poisson learning and the
variational problem (10.10).

Theorem 10.3. Assume G is connected. Then there exists a unique minimizer


u ∈ ℓ20 (X) of (10.10), and u satisfies the Poisson learning equation (10.3).

Theorem 10.3 shows that the Poisson learning equation (10.3) arises as the
necessary conditions for a gradient regularized variational problem with an affine
loss function. We contrast this with the solution of Laplace learning (10.1),
which is the minimizer of the variational problem
n o
min
2
∥∇u∥2ℓ2 (X 2 ) : u(xi ) = yi , 1 ≤ i ≤ m . (10.11)
u∈ℓ (X)

Thus, while both Laplace and Poisson learning are gradient regularized variational
problems, the key difference is how each algorithm handles the labeled data;
Laplace learning enforces hard label constraints while Poisson learning adds an
affine loss function to the energy. Of course, many variants of Laplace learning
have been proposed with various types of soft label constraints in place of hard
constraints. These variations perform similarly poorly to Laplace learning at low
label rates, and the key feature of Poisson learning is the affine loss function
that can be easily centered.
We note that it would be natural to add a weight µ > 0 to one of the terms in
(10.10) to trade-off the importance of the two terms in the variational problem.
However, as we show in the supplementary material (see Lemma A.3), this
weight would have no effect on the final label decision. We also mention that the
variational problem (10.10) has a natural ℓp generalization that we also explore
in the supplementary material (Section A.2).

10.2.1 Proofs for Section 10.2

We first review some additional calculus on graphs. The graph divergence of a


vector field V is defined as
n
X
div V (xi ) = wij V (xi , xj ).
j=1

67
The divergence is the negative adjoint of the gradient; that is, for every vector
field V ∈ ℓ2 (X 2 ) and function u ∈ ℓ2 (X)

(∇u, V )ℓ2 (X 2 ) = −(u, div V )ℓ2 (X) . (10.12)


Pn
We also define ∥u∥pℓp (X) = i=1 |u(xi )|p and

n
1 X
∥V ∥pℓp (X 2 ) = wij |V (xi , xj )|p ,
2 i,j=1

where | · | is the Euclidean norm on Rk .


The graph Laplacian Lu of a function u ∈ ℓ2 (X) is defined as negative of the
composition of gradient and divergence
n
X
Lu(xi ) = −div (∇u)(xi ) = wij (u(xi ) − u(xj )).
j=1

The operator L is the unnormalized graph Laplacian. Using (10.12) we have

(Lu, v)ℓ2 (X) = (−div ∇u, v)ℓ2 (X) = (∇u, ∇v)ℓ2 (X 2 ) .

In particular (Lu, v)ℓ2 (X) = (u, Lv)ℓ2 (X) , and so the graph Laplacian L is self-
adjoint as an operator L : ℓ2 (X) → ℓ2 (X). We also note that

(Lu, u)ℓ2 (X) = (∇u, ∇u)ℓ2 (X 2 ) = ∥∇u∥2ℓ2 (X 2 ) ,

that is, L is positive semi-definite.


The variational interpretation of Poisson learning can be directly extended
to ℓp versions, so we proceed in generality here. For a function u : X → Rk
and a positive vector a ∈ Rn (meaning ai > 0 for all i = 1, . . . , n) we define the
weighted mean value
n
1 X
(u)a,X := Pn ai u(xi ).
i=1 ai i=1

We define the space of weighted mean-zero functions

ℓpa,0 (X) = {u ∈ ℓp (X) : (u)a,X = 0}.

68
For p ≥ 1 and µ > 0 we consider the variational problem
 m 
1 X
min ∥∇u∥pℓp (X 2 ) − µ (yj − y)·u(xj ) (10.13)
p
u∈ℓa,0 (X) p j=1

1
Pm
where y = m j=1 yj . This generalizes the variational problem (10.10) for
Poisson learning, and the theorem below generalizes Theorem 10.3.

Theorem 10.4. Assume G is connected. For any p > 1, positive a ∈ Rn , and


µ ≥ 0, there exists a unique solution u ∈ ℓpa,0 (X) of (10.13). Furthermore, the
minimizer u satisfies the graph p-Laplace equation
m
X
−div (|∇u|p−2 ∇u)(xi ) = µ (yj − y)δij . (10.14)
j=1

We give the proof of Theorem 10.4 below, after some remarks and other
results.

Remark 10.5. When p = 1, solutions of (10.13) may not exist for all µ ≥ 0,
since the variational problem (10.13) may not be bounded from below. We
can show that there exists C > 0 such that if µ < C, the variational problem
is bounded from below and our argument for existence in Theorem 10.4 goes
through.

It turns out that µ > 0 is a redundant parameter when p > 1.

Lemma 10.6. Let p > 1 and for µ > 0 let uµ be the solution of (10.13). Then,
uµ = µ1/(p−1) u1 .

It follows from Lemma 10.6 that when p > 1, the fidelity parameter µ > 0
is completely irrelevant for classification problems, since the identity uµ =
µ1/(p−1) u1 implies that the label decision (10.2) gives the same labeling for any
value of µ > 0. Hence, in Poisson learning with p > 1 we always take µ = 1.
This remark is false for p = 1.
Before proving Theorem 10.4 we first record a Poincaré inequality. The proof
is standard but we include it for completeness. We can prove the Poincaré
inequality for non-negative vectors a ∈ Rn , meaning that ai ≥ 0 for every
Pn
i = 1, . . . , n as long as i=1 ai > 0.

69
Proposition 10.7. Assume G is connected, a ∈ Rd is non-negative with
Pn
i=1 ai > 0, and p ≥ 1. There exists λp > 0 such that

λp ∥u − (u)a,X ∥ℓp (X) ≤ ∥∇u∥ℓp (X 2 ) , (10.15)

for all u ∈ ℓp (X).

Proof. Define
∥∇u∥ℓp (X 2 )
λp = min .
u∈ℓp (X) ∥u − (u)a,X ∥ℓp (X)
u̸≡(u)a,X

Then clearly (10.15) holds for this choice of λp , and λp ≥ 0. If λp = 0, then


there exists a sequence uk ∈ ℓp (X) with uk ̸≡ (uk )a,X such that

∥∇uk ∥ℓp (X 2 ) 1
≤ .
∥uk − (u)a,X ∥ℓp (X) k

We may assume that (uk )a,X = 0 and ∥uk ∥ℓp (X) = 1, and so

1
∥∇uk ∥ℓp (X 2 ) ≤ . (10.16)
k

Since |uk (x)| ≤ ∥uk ∥ℓp (X) = 1, the sequence uk is uniformly bounded and by the
Bolzano-Weierstrauss Theorem there exists a subsequence ukj such that ukj (xi )
is a convergent sequence in Rk for every i. We denote u(xi ) = limj→∞ ukj (xi ).
Since ∥ukj ∥ℓp (X) = 1 we have ∥u∥ℓp (X) = 1, and thus u ̸≡ 0. Similarly, since
(uk )a,X = 0 we have (u)a,X = 0 as well. On the other hand it follows from
(10.16) that ∥∇u∥ℓp (X 2 ) = 0, and so

wij (u(xi ) − u(xj )) = 0 for all i, j.

It follows that u(xi ) = u(xj ) whenever wij > 0. Since the graph is connected,
it follows that u is constant. Since (u)a,X = 0 we must have u ≡ 0, which is
a contradiction, since ∥u∥ℓp (X) = 1. Therefore λp > 0, which completes the
proof.

We can now prove Theorem 10.4.

Proof of Theorem 10.4. Let us write


m
1 X
Ip (u) = ∥∇u∥pℓp (X 2 ) − µ (yj − y) · u(xj ). (10.17)
p j=1

70
By Proposition 10.7 we have
m
1 p X
Ip (u) ≥ λp ∥u∥pℓp (X) − µ (yj − y) · u(xj )
p j=1

for u ∈ ℓpa,0 (X). By Hölder’s inequality we have

m
X m
X
(yj − y) · u(xj ) ≤ |yj − y||u(xj )|
j=1 j=1
 1/q  1/p
Xm Xm
≤ |yj − y|q   |u(xj )|p 
j=1 j=1
 1/q
Xm
≤ |yj − y|q  ∥u∥ℓp (X) ,
j=1

P 1/q
m
where q = p/(p − 1). Letting C = j=1 |yj − y|q we have

1 p
Ip (u) ≥ λ ∥u∥pℓp (X) − Cµ∥u∥ℓp (X) . (10.18)
p p

Since p > 1, we see that Ip is bounded below.


Let uk ∈ ℓpa,0 (X) be a minimizing sequence, that is, we take uk so that

−∞ < inf Ip (u) = lim Ip (uk ).


u∈ℓp
a,0 (X)
k→∞

By (10.18) we have that

1 p
λ ∥uk ∥pℓp (X) − Cµ∥uk ∥ℓp (X) ≤ inf Ip (u) + 1,
p p u∈ℓp
a,0 (X)

for k sufficiently large. Since p > 1, it follows that there exists M > 0 such that
∥uk ∥ℓp (X) ≤ M for all k ≥ 1. Since |uk (xi )| ≤ ∥uk ∥ℓp (X) ≤ M for all i = 1, . . . , n,
we can apply the Bolzano-Weierstrauss Theorem to extract a subsequence ukj
such that ukj (xi ) is a convergent sequence in Rk for all i = 1, . . . , n. We denote
by u∗ (xi ) the limit of ukj (xi ) for all i. By continuity of Ip we have

inf Ip (u) = lim Ip (ukj ) = Ip (u∗ ),


u∈ℓp
a,0 (X)
j→∞

71
and (u∗ )a,X = 0. This shows that there exists a solution of (10.13).

We now show that any solution of (10.13) satisfies −div |∇u|p−2 ∇u = µf .
The proof follows from taking a variation. Let v ∈ ℓpa,0 (X) and consider g(t) :=
Ip (u + tv), where Ip is defined in (10.17). Then g has a minimum at t = 0 and
hence g ′ (0) = 0. We now compute
 
m
d 1 X 
g ′ (0) = ∥∇u + t∇v∥pℓp (X 2 ) − µ (yj − y) · (u(xj ) + tv(xj ))
dt t=0  p j=1

n m
1 X d X
= wij |∇u(xi , xj ) + t∇v(xi , xj )|p − µ (yj − y) · v(xj )
2p i,j=1 dt t=0
j=1
n m
1 X X
= wij |∇u(xi , xj )|p−2 ∇u(xi , xj ) · ∇v(xi , xj ) − µ (yj − y) · v(xj )
2 i,j=1 j=1
m
X
= (|∇u|p−2 ∇u, ∇v)ℓ2 (X 2 ) − µ (yj − y) · v(xj )
j=1
m
X
= (−div (|∇u|p−2 ∇u), v)ℓ2 (X) − µ (yj − y) · v(xj )
j=1

= (−div (|∇u|p−2 ∇u) − µf, v)ℓ2 (X) ,

where
m
X
f (xi ) = (yj − y)δij .
j=1

We choose
1
−div |∇u|p−2 ∇u (xi ) − µf (xi )
 
v(xi ) =
ai
then
n
X
−div |∇u|p−2 ∇u (xi ) − µf (xi ) = 0
 
(v)a,X =
i=1

so v ∈ ℓpa,0 (X).

n

X 1 2
div |∇u|p−2 ∇u (xi ) + µf (xi )

0 = g (0) =
a
i=1 i
1 2
div |∇u|p−2 ∇u (xi ) + µf (xi )

≥ ℓ2 (X)
.
max ai

So, −div |∇u|p−2 ∇u = µf as required.
To prove uniqueness, let u, v ∈ ℓpa,0 (X) be minimizers of (10.13). Then u and

72
v satisfy (10.14) which we write as

−div (|∇u|p−2 ∇u) = µf.

Applying Lemma 10.8 (below) we find that ∥u − v∥ℓp (X) = 0 and so u = v.

In the above proof we used a quantitive error estimate which is of interest in


its own right. The estimate was on equations of the form

−div (|∇u|p−2 ∇u) = f

when f ∈ ℓp0 (X), where we use the notation: if a ∈ Rn is a constant vector


(without loss of generality the vector of ones) then we write (u)X = (u)a,X =
1
Pn p p
n i=1 u(xi ) and ℓ0 (X) = {u ∈ ℓ (X) : (u)X = 0}.

Lemma 10.8. Let p > 1, a ∈ Rn be non-negative, and assume u, v ∈ ℓpa,0 (X)


satisfy
−div (|∇u|p−2 ∇u)(xi ) = f (xi )

and
−div (|∇v|p−2 ∇v)(xi ) = g(xi )

for all i = 1, . . . , n, where f, g ∈ ℓp0 (X). Then,


( 1/(p−1)
Cλ−q
p ∥f − g∥ℓq (X) if p ≥ 2
∥u−v∥ℓp (X) ≤ 2−p
Cλ−2
p ∥∇u∥ℓp (X) + ∥∇v∥ℓp (X) ∥f − g∥ℓ2 (X) if 1 < p < 2

p
where C is a constant depending only on p and q = p−1 .

Remark 10.9. If −div (|∇u|p−2 ∇u) = f then we can write |∇u|p−2 ∇u, ∇φ ℓ2 (X 2 ) =
(f, φ)ℓ2 (X) for any φ ∈ ℓ2 (X). Choosing φ = u implies ∥∇u∥pℓp (X 2 ) = (f, u)ℓ2 (X) ≤
∥f ∥ℓq (X) ∥u∥ℓp (X) so we could write the bound for p ∈ (1, 2) in the above lemma
without ∥∇u∥ℓp (X) and ∥∇v∥ℓp (X) on the right hand side.

Proof. For p ≥ 2 we use the identity

|a − b|p ≤ C(|a|p−2 a − |b|p−2 b) · (a − b)

for vectors a, b ∈ Rk for some constant C depending only on p (which can be

73
found in Lemma 4.4 Chapter I [26]) to obtain

n
1 X
∥∇u − ∇v∥pℓp (X 2 ) = wij |∇u(xi , xj ) − ∇v(xi , xj )|p
2 i,j=1
n
X
wij |∇u(xi , xj )|p−2 ∇u(xi , xj ) − |∇v(xi , xj )|p−2 ∇v(xi , xj )

≤C
i,j=1

· (∇u(xi , xj ) − ∇v(xi , xj ))
= C(|∇u|p−2 ∇u − |∇v|p−2 ∇v, ∇(u − v))ℓ2 (X 2 )
= C(−div (|∇u|p−2 ∇u) + div (|∇v|p−2 ∇v), u − v)ℓ2 (X)
= C(f − g, u − v)ℓ2 (X)
≤ C∥f − g∥ℓq (X) ∥u − v∥ℓp (X) ,

1 1
where in the last line we used Hölder’s inequality, p + q = 1, and the value of C
may change from line-to-line. By Proposition 10.7 we have

λpp ∥u − v∥pℓp (X) ≤ ∥∇u − ∇v∥pℓp (X 2 ) ≤ C∥f − g∥ℓq (X) ∥u − v∥ℓp (X) .

Therefore we deduce

1/(p−1)
∥u − v∥ℓp (X) ≤ Cλ−q
p ∥f − g∥ℓq (X) .

Now for 1 < p < 2 we follow the proof of Lemma 4.4 in Chapter I of [26] to infer
\[
\big(|a|^{p-2}a - |b|^{p-2}b\big)\cdot(a - b) = \int_0^1 |sa + (1-s)b|^{p-2}|a - b|^2\,ds + (p-2)\int_0^1 |sa + (1-s)b|^{p-4}\,\big|(sa + (1-s)b)\cdot(a - b)\big|^2\,ds
\]
for any a, b ∈ R^k. Hence, by the Cauchy–Schwarz inequality,
\[
\begin{aligned}
\big(|a|^{p-2}a - |b|^{p-2}b\big)\cdot(a - b) &\geq (p-1)\int_0^1 |sa + (1-s)b|^{p-2}|a - b|^2\,ds \\
&\geq (p-1)|a - b|^2\int_0^1 \frac{1}{(s|a| + (1-s)|b|)^{2-p}}\,ds \\
&\geq \frac{(p-1)|a - b|^2}{(|a| + |b|)^{2-p}}.
\end{aligned}
\]
In the sequel we make use of the inequality
\[
\big(|a|^{p-2}a - |b|^{p-2}b\big)\cdot(a - b) \geq \frac{C|a - b|^2}{(|a| + |b|)^{2-p}}.
\]

By Hölder's inequality and the above inequality we have (where again the constant C may change from line to line)
\[
\begin{aligned}
\|\nabla u - \nabla v\|_{\ell^p(X^2)}^p &= \frac{1}{2}\sum_{i,j=1}^n w_{ij}\,|\nabla u(x_i,x_j) - \nabla v(x_i,x_j)|^p \\
&\leq \Bigg(\frac{1}{2}\sum_{i,j=1}^n \frac{w_{ij}|\nabla u(x_i,x_j) - \nabla v(x_i,x_j)|^2}{(|\nabla u(x_i,x_j)| + |\nabla v(x_i,x_j)|)^{2-p}}\Bigg)^{\frac{p}{2}}\Bigg(\frac{1}{2}\sum_{i,j=1}^n w_{ij}\big(|\nabla u(x_i,x_j)| + |\nabla v(x_i,x_j)|\big)^p\Bigg)^{\frac{2-p}{2}} \\
&\leq C\Bigg(\sum_{i,j=1}^n w_{ij}\big(|\nabla u(x_i,x_j)|^{p-2}\nabla u(x_i,x_j) - |\nabla v(x_i,x_j)|^{p-2}\nabla v(x_i,x_j)\big)\cdot\big(\nabla u(x_i,x_j) - \nabla v(x_i,x_j)\big)\Bigg)^{\frac{p}{2}} \\
&\qquad\times\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}} \\
&= C\big(|\nabla u|^{p-2}\nabla u - |\nabla v|^{p-2}\nabla v, \nabla(u - v)\big)_{\ell^2(X^2)}^{\frac{p}{2}}\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}} \\
&= C\big(-\operatorname{div}(|\nabla u|^{p-2}\nabla u) + \operatorname{div}(|\nabla v|^{p-2}\nabla v), u - v\big)_{\ell^2(X)}^{\frac{p}{2}}\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}} \\
&= C(f - g, u - v)_{\ell^2(X)}^{\frac{p}{2}}\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}} \\
&\leq C\|f - g\|_{\ell^2(X)}^{\frac{p}{2}}\|u - v\|_{\ell^2(X)}^{\frac{p}{2}}\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}}.
\end{aligned}
\]

Combining the above with Proposition 10.7, and using that ∥u − v∥_{ℓ^2(X)} ≤ ∥u − v∥_{ℓ^p(X)} for p < 2, we have
\[
\lambda_p^p\|u - v\|_{\ell^p(X)}^{p/2} \leq C\|f - g\|_{\ell^2(X)}^{\frac{p}{2}}\big(\|\nabla u\|_{\ell^p(X^2)} + \|\nabla v\|_{\ell^p(X^2)}\big)^{\frac{(2-p)p}{2}},
\]
which implies the result.

The final proof from Section 10.2 is Lemma 10.6.

Proof of Lemma 10.6. Let us write
\[
I_{p,\mu}(u) = \frac{1}{p}\|\nabla u\|_{\ell^p(X^2)}^p - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j).
\]
We note that
\[
I_{p,\mu}(\mu^{1/(p-1)}u) = \mu^{p/(p-1)}I_{p,1}(u).
\]
Therefore
\[
I_{p,\mu}(u_\mu) = \mu^{p/(p-1)}I_{p,1}(u_\mu\mu^{-1/(p-1)}) \geq \mu^{p/(p-1)}I_{p,1}(u_1).
\]
On the other hand,
\[
\mu^{p/(p-1)}I_{p,1}(u_1) = I_{p,\mu}(\mu^{1/(p-1)}u_1) \geq I_{p,\mu}(u_\mu).
\]
Therefore
\[
I_{p,\mu}(\mu^{1/(p-1)}u_1) = I_{p,\mu}(u_\mu).
\]
By uniqueness in Theorem 10.4 we have $u_\mu = \mu^{1/(p-1)}u_1$, which completes the proof.

10.3 Laplace learning at low label rates


Finally, to further motivate Poisson learning, we connect Poisson learning with the limit of Laplace learning at very low label rates. At low label rates, the solution of Laplace learning (10.1) concentrates around a constant ȳ_w, which we interpreted via random walks in Section 10.1. Near labeled nodes, u has sharp spikes (recall Figure 2) in order to attain the labels. From the variational perspective, the constant function has zero cost and the cost of the spikes is very small, so this configuration is less expensive than a function that smoothly attains the labels u(x_i) = y_i.

A natural question is whether the spikes in Laplace learning contain useful information, or whether they are too localized and do not propagate information well. To test this, we changed the label decision (10.2) in the MNIST experiment described in Section 10.1 to subtract off the tail constant ȳ_w identified in (10.6),
\[
\ell(x_i) = \operatorname*{arg\,max}_{j\in\{1,\dots,k\}}\big\{u_j(x_i) - \bar{y}_w\cdot e_j\big\},
\]
where ȳ_w is defined in (10.6). Thus, we are centering the function u at zero. At 1 label per class (10 labeled images total), the accuracy improved from 16% to 85.9%! Hence, the difference u − ȳ_w contains enough information to make informed classification decisions, and therefore the spikes contain useful information.
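As a concrete illustration, the following is a minimal sketch of this centered decision rule; the array names `U` (an n × k matrix whose rows hold the Laplace learning outputs) and `y_w` (the vector from (10.6)) are ours and assume those quantities have already been computed.

```python
import numpy as np

def centered_label_decision(U, y_w):
    """Label decision after subtracting the tail constant y_w from each output vector."""
    # U: (n, k) array with row i holding u(x_i); y_w: (k,) array from (10.6)
    return np.argmax(U - y_w[None, :], axis=1)
```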
This indicates that much of the poor performance of Laplace learning can be explained by a large shift bias that occurs only at very low label rates. Fixing this seems as simple as applying the appropriate shift before making a decision on a label, but this does not lead to a well-grounded method, since the shifted function u − ȳ_w is exactly the graph-harmonic extension of the shifted labels y_j − ȳ_w. Why should we have to use a harmonic extension of the wrong labels in order to achieve a better result? On the other hand, Poisson learning, which we introduced above, provides an intuitive and well-grounded way of fixing the shift bias in Laplace learning.
To see the connection to Poisson learning, let us assume the solution u of the Laplace learning equation (10.1) is nearly equal to the constant vector ȳ_w ∈ R^k from (10.6) at all unlabeled points x_{m+1}, …, x_n. For any labeled node x_i with i = 1, …, m we can compute (assuming w_{ij} = 0 for all j ∈ {1, …, m}) that
\[
Lu(x_i) = \sum_{j=1}^n w_{ij}\big(u(x_i) - u(x_j)\big) \approx \sum_{j=m+1}^n w_{ij}(y_i - \bar{y}_w) = d_i(y_i - \bar{y}_w).
\]
Since Lu(x_i) = 0 for i = m + 1, …, n, we find that u approximately satisfies the Poisson equation
\[
Lu(x_i) = \sum_{j=1}^m d_j(y_j - \bar{y}_w)\,\delta_{ij}\quad\text{for } i = 1,\dots,n.\tag{10.19}
\]
This gives a connection, at a heuristic level, between Laplace equations with hard constraints and Poisson equations with point sources for problems with very low label rates. We note that since constant functions are in the kernel of L, u − ȳ_w also satisfies (10.19). We also note that the labels, and the constant ȳ_w, are weighted by the degree d_j, which does not appear in our Poisson learning equation (10.3). We have found that both models give good results, but that (10.3) works slightly better, which is likely due to the rigorous foundation of (10.3) via random walks.

10.4 The Poisson MBO algorithm


Poisson learning provides a robust method for propagating label information that is stable at very low label rates. After applying Poisson learning to propagate labels, we propose a graph-cut method to incrementally adjust the decision boundary so as to improve the label accuracy and account for prior knowledge of class sizes. The graph-cut method we propose is to apply several steps of gradient descent on the graph-cut problem
\[
\min_{\substack{u:X\to S_k \\ (u)_X = b}}\Big\{\frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j)\Big\},\tag{10.20}
\]
where $S_k = \{e_1, e_2, \dots, e_k\}$, b ∈ R^k is given, and $(u)_X := \frac{1}{n}\sum_{i=1}^n u(x_i)$. Since we are restricting u(x_i) ∈ S_k, the term $\frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2$ is exactly the graph-cut energy of the classification given by u. Likewise, the components of the average (u)_X represent the fraction of points assigned to each class. The constraint (u)_X = b therefore allows us to incorporate prior knowledge about relative class sizes through the vector b ∈ R^k, which should have positive entries and sum to one. If there exists u : X → S_k with (u)_X = b, then (10.20) admits a solution, which in general may not be unique.
On its own, the graph-cut problem (10.20) can admit many local minimizers that would yield poor classification results. The phenomenon is similar to the degeneracy in Laplace learning at low label rates, since it is very inexpensive to violate any of the label constraints. Our overall plan is to first use Poisson learning to robustly propagate the labels, and then project onto the constraint set for (10.20) and perform several steps of gradient descent on (10.20) to improve the classification accuracy. While Poisson learning propagates the labels in a robust way, the cut energy is more suitable for locating the exact decision boundary.

To relax the discrete problem (10.20), we approximate the graph-cut energy with the Ginzburg–Landau approximation
\[
\min_{\substack{u\in\ell^2(X) \\ (u)_X = b}}\Big\{GL_\tau(u) - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j)\Big\},\tag{10.21}
\]
where
\[
GL_\tau(u) = \frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 + \frac{1}{\tau}\sum_{i=1}^n\prod_{j=1}^k |u(x_i) - e_j|^2.
\]
The Ginzburg–Landau approximation allows u ∈ ℓ^2(X) to take on any real values, instead of the discrete values u ∈ S_k, making the approximation (10.21) easier to solve computationally. The graph Ginzburg–Landau approximation GL_τ has been used previously for graph-based semi-supervised learning in [31], and other works have rigorously studied how GL_τ approximates graph-cut energies in the scalar setting [60]. Here, we extend the results to the vector multi-class setting.

Theorem 10.10. Assume G is connected. Let b ∈ Rk and assume there


exists u : X → Sk with (u)X = b. For each τ > 0 let uτ be any solution of
(10.21). Then, the sequence (uτ )τ is precompact in ℓ2 (X) and any convergent
subsequence uτm converges to a solution of the graph-cut problem (10.20) as
τm → 0. Furthermore, if the solution u0 : X → Sk of (10.20) is unique, then
uτ → u0 as τ → 0.

Theorem 10.10 indicates that we can replace the graph-cut energy (10.20) with the simpler Ginzburg–Landau approximation (10.21). To descend on the energy (10.21), we use a time-splitting scheme that alternates gradient descent on
\[
E_1(u) := \frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j)
\quad\text{and}\quad
E_2(u) := \frac{1}{\tau}\sum_{i=1}^n\prod_{j=1}^k |u(x_i) - e_j|^2.
\]
The first term E_1 is exactly the energy for Poisson learning (10.10), and gradient descent amounts to the iteration
\[
u_{t+1}(x_i) = u_t(x_i) - dt\Big(Lu_t(x_i) - \mu\sum_{j=1}^m (y_j - \bar{y})\,\delta_{ij}\Big).
\]

We note that Lu and the source term above both have zero mean value. Hence,
the gradient descent equation for E1 is volume preserving, i.e., (ut+1 )X = (ut )X .
This would not be true for other fidelity terms, such as an ℓ2 fidelity, and this
volume conservation property plays an important role in ensuring the class size
constraint (u)X = b in (10.21).
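A minimal sketch of this descent step is given below; the names L (the sparse graph Laplacian), B (an n × k matrix whose i-th row is µ(y_i − ȳ) at labeled nodes and zero elsewhere), U, dt, and steps are ours, not from the released code.

```python
import numpy as np
import scipy.sparse as sp  # L is assumed to be a scipy sparse matrix

def descend_E1(U, L, B, dt, steps):
    """Explicit gradient descent on E1.  Each step preserves the column means of U,
    since both L @ U and B have columns summing to zero (sum_j (y_j - ybar) = 0)."""
    for _ in range(steps):
        U = U - dt * (L @ U - B)
    return U
```

Because the update direction has zero column sums, (u_{t+1})_X = (u_t)_X holds exactly (up to floating-point error), which is the volume-preservation property noted above.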
Gradient descent on the second term E_2, when τ > 0 is small, amounts to projecting each u(x_i) ∈ R^k to the closest label vector e_j ∈ S_k, while preserving the volume constraint (u)_X = b. We approximate this by the following procedure: let Proj_{S_k} : R^k → S_k be the closest point projection, let s_1, …, s_k > 0 be positive weights, and set
\[
u_{t+1}(x_i) = \operatorname{Proj}_{S_k}\big(\operatorname{diag}(s_1,\dots,s_k)\,u_t(x_i)\big),\tag{10.22}
\]
where diag(s_1, …, s_k) is the diagonal matrix with diagonal entries s_1, …, s_k.

We use a simple gradient descent scheme to choose the weights s1 , . . . , sk > 0 so
that the volume constraint (ut+1 )X = b holds (see Steps 9-14 in Algorithm 2).
By Remark 10.2, this procedure can be viewed as reweighting the point sources
in the Poisson equation (10.3) so that the volume constraint holds. In particular,
increasing or decreasing si grows or shrinks the size of class i.
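The following sketch mirrors Steps 9-14 of Algorithm 2 (the function and variable names are ours); the parameter values dτ = 10, s_min = 0.5, s_max = 2, and 100 inner iterations are the ones used later in the paper.

```python
import numpy as np

def proj_Sk(U):
    """Closest-point projection of each row of U onto S_k = {e_1, ..., e_k}."""
    return np.eye(U.shape[1])[np.argmax(U, axis=1)]

def volume_constrained_projection(U, b, dtau=10.0, s_min=0.5, s_max=2.0, iters=100):
    """Choose weights s so that the classification Proj_Sk(U diag(s)) has class
    fractions close to b, then return the projected (one-hot) labeling."""
    s = np.ones(U.shape[1])
    for _ in range(iters):
        b_hat = proj_Sk(U * s).mean(axis=0)           # current class fractions
        s = np.clip(s + dtau * (b - b_hat), s_min, s_max)
    return proj_Sk(U * s)
```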
We note that the work of [41] provides an alternative way to enforce explicit
class balance constraints with a volume constrained MBO method based on
auction dynamics. Their method uses a graph-cut based approach with a
Voronoi-cell based initialization.

10.4.1 Proofs for Section 10.4

We now turn our attention to the Ginzburg–Landau approximation of the graph


cut problem (10.20).

Proof of Theorem 10.10. Let us redefine GL_τ in a more general form,
\[
GL_\tau(u) = \frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 + \frac{1}{\tau}\sum_{i=1}^n V(u(x_i)),
\]
where V : R^k → [0, +∞) is continuous and V(t) = 0 if and only if t ∈ S_k. Of course, the choice $V(t) = \prod_{j=1}^k |t - e_j|^2$ satisfies these assumptions. We let
\[
E_\tau(u) = \begin{cases} GL_\tau(u) - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j) & \text{if } (u)_X = b, \\ +\infty & \text{else,}\end{cases}
\]
\[
E_0(u) = \begin{cases} \frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j) & \text{if } (u)_X = b \text{ and } u : X \to S_k, \\ +\infty & \text{else.}\end{cases}
\]

The theorem can be restated as showing that the minimisers u_τ of E_τ contain convergent subsequences, and that any convergent subsequence converges to a minimiser of E_0. We divide the proof into two steps: in the first step we show that the sequence of minimisers {u_τ}_{τ>0} is precompact, and in the second step we show that any convergent subsequence converges to a minimiser of E_0.

1. Compactness. We first show that any sequence {u′_τ}_{τ>0} for which there exists M ∈ R with sup_{τ>0} E_τ(u′_τ) ≤ M is precompact. By Proposition 10.7 and the Cauchy–Schwarz inequality,
\[
\begin{aligned}
M &\geq \frac{\lambda_2^2}{2}\|u'_\tau - b\|_{\ell^2(X)}^2 + \underbrace{\frac{1}{\tau}\sum_{i=1}^n V(u'_\tau(x_i))}_{\geq 0} - \mu\underbrace{\Big(\sum_{j=1}^m |y_j - \bar{y}|^2\Big)^{1/2}}_{=:C}\|u'_\tau\|_{\ell^2(X)} \\
&\geq \frac{\lambda_2^2}{2}\|u'_\tau - b\|_{\ell^2(X)}^2 - C\mu\|u'_\tau - b\|_{\ell^2(X)} - C\mu\|b\|_{\ell^2(X)}.
\end{aligned}
\]

Hence,
\[
\|u'_\tau - b\|_{\ell^2(X)} \leq \frac{C\mu}{\lambda_2^2}\Bigg(1 + \sqrt{1 + \frac{2\lambda_2^2\big(M + C\mu\|b\|_{\ell^2(X)}\big)}{C^2\mu^2}}\Bigg) =: \widetilde{C},
\]
so {u′_τ}_{τ>0} is bounded in ℓ^2(X) and therefore, by the Bolzano–Weierstrass theorem, precompact.
To show that the minimisers {u_τ}_{τ>0} are precompact it is enough to show that there exists M ∈ R such that sup_{τ>0} E_τ(u_τ) ≤ M. This follows easily by taking as a candidate any u ∈ ℓ^2(X) satisfying (u)_X = b and u(x_i) ∈ S_k for all i = 1, 2, …, n, which exists by assumption. We have
\[
E_\tau(u_\tau) \leq E_\tau(u) = \frac{1}{2}\|\nabla u\|_{\ell^2(X^2)}^2 - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u(x_j) =: M.
\]
Now that we have shown that convergent subsequences exist, we show that any limit must be a minimiser of E_0.

2. Converging Subsequences. Let u0 be a cluster point of {uτ }τ >0 , i.e.


there exists a subsequence such that uτm → u0 as m → ∞. Pick any v ∈ ℓ2 (X)
with E0 (v) < +∞. We will show

(a) Eτ (v) = E0 (v),

(b) lim inf τ →0 Eτ (uτ ) ≥ E0 (u0 ).

Assuming (a) and (b) hold then, by (a),

E0 (v) = Eτm (v) ≥ Eτm (uτm ).

Taking the limit as m → ∞ and applying (b) we have
\[
E_0(v) \geq \liminf_{m\to\infty}E_{\tau_m}(u_{\tau_m}) \geq E_0(u_0).
\]
It follows that E_0(u_0) ≤ E_0(v) for all v ∈ ℓ^2(X), hence u_0 is a minimiser of E_0.

To show (a), we simply notice that
\[
E_\tau(v) = \frac{1}{2}\|\nabla v\|_{\ell^2(X^2)}^2 + \frac{1}{\tau}\sum_{i=1}^n\underbrace{V(v(x_i))}_{=0} - \mu\sum_{j=1}^m (y_j - \bar{y})\cdot v(x_j) = E_0(v).
\]

For (b) we assume, without loss of generality, that u_τ → u_0 and
\[
\liminf_{\tau\to 0}E_\tau(u_\tau) = \lim_{\tau\to 0}E_\tau(u_\tau) < +\infty.
\]
Since (u_τ)_X = b for all τ > 0 and u_τ(x_i) → u_0(x_i) for every i ∈ {1, …, n}, we have (u_0)_X = b. And since
\[
V(u_\tau(x_i)) \leq \tau\Big(E_\tau(u_\tau) + \mu\sum_{j=1}^m (y_j - \bar{y})\cdot u_\tau(x_j)\Big) \to 0,
\]
we have V(u_0(x_i)) = 0, hence u_0(x_i) ∈ S_k. Now,
\[
E_\tau(u_\tau) = \underbrace{\frac{1}{2}\|\nabla u_\tau\|_{\ell^2(X^2)}^2}_{\to\,\frac{1}{2}\|\nabla u_0\|_{\ell^2(X^2)}^2} + \underbrace{\frac{1}{\tau}\sum_{i=1}^n V(u_\tau(x_i))}_{\geq 0} - \mu\underbrace{\sum_{j=1}^m (y_j - \bar{y})\cdot u_\tau(x_j)}_{\to\,\sum_{j=1}^m (y_j - \bar{y})\cdot u_0(x_j)}.
\]
So lim inf_{τ→0} E_τ(u_τ) ≥ E_0(u_0) as required.

Remark 10.11. If (a) and (b) in the proof of Theorem 10.10 are strengthened
to

(a′ ) for all v ∈ ℓ2 (X) there exists vτ → v such that limτ →0 Eτ (vτ ) = E0 (v),

(b′) for all v ∈ ℓ2 (X) and all vτ → v, lim inf τ →0 Eτ (vτ ) ≥ E0 (v),

then one says that Eτ Γ-converges to E0 (and one can show that (a′ ) and (b′ )
hold in our case with a small modification of the above proof). The notion of
Γ-convergence is fundamental in the calculus of variations and is considered the
variational form of convergence as it implies (when combined with a compactness
result) the convergence of minimisers.

11 Poisson learning algorithms
We now present our proposed Poisson learning algorithms. The Python source
code and simulation environment for reproducing our results is available online.1
We let W = (wij )ni,j=1 denote our symmetric weight matrix. We treat all
vectors as column vectors, and we let 1 and 0 denote the all-ones and all-zeros
column vectors, respectively, of the appropriate size based on context. We assume
that the first m data points x1 , x2 , . . . , xm are given labels y1 , y2 , . . . , ym ∈
{e1 , e2 , . . . , ek }, where the standard basis vector ei ∈ Rk represents the ith class.
We encode the class labels in a k × m matrix F, whose j th column is exactly
yj . Let b ∈ Rk be the vector whose ith entry bi is the fraction of data points belonging to class i. If this information is not available, we set b = (1/k)1.

Poisson learning is summarized in Algorithm 1. The label decision for node i is ℓ_i = arg max_{1≤j≤k} U_{ij}, and the reweighting in Step 11 implements the label decision (10.4). In all our results, we always set b = (1/k)1, so Poisson learning does not use prior knowledge of class sizes (the true value for b is used in PoissonMBO below). The complexity is O(T E), where E is the number of edges in the graph.

We note that before the reweighting in Step 11, the Poisson learning algorithm computes exactly the function u_T defined in (10.7). In view of this, there is little to be gained from running the iterations beyond the mixing time of the random walk. This can be checked within the loop in Steps 8-10 by adding the iteration p_{t+1} = WD^{−1}p_t, where the initial value p_0 is the vector with ones in the positions of all labeled vertices and zeros elsewhere. Up to a constant, p_t is the probability distribution of a random walker starting from a random labeled node after t steps. The mixing-time stopping condition is then to run the iterations until
\[
\|p_t - p_\infty\|_\infty \leq \varepsilon,
\]
where $p_\infty = W\mathbf{1}/(\mathbf{1}^TW\mathbf{1})$ is the invariant distribution. We use this stopping condition with ε = 1/n in all experiments, which usually takes between 100 and 500 iterations.
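For reference, the following is a sketch (not the released implementation) of Algorithm 1 combined with this stopping rule; the reweighting Step 11 is omitted, and we normalize p_t to a probability vector before comparing it with p_∞, since the text only identifies p_t with a distribution up to a constant.

```python
import numpy as np
import scipy.sparse as sp

def poisson_learning(W, F, labeled_idx, max_iter=1000):
    """Sketch of Poisson learning with the random-walk mixing-time stopping rule.
    W: n x n symmetric sparse weight matrix (zero diagonal),
    F: k x m matrix of one-hot labels, labeled_idx: indices of the m labeled nodes."""
    n = W.shape[0]
    k, m = F.shape
    d = np.asarray(W.sum(axis=1)).ravel()          # degrees d_i
    L = sp.diags(d) - W                            # unnormalized graph Laplacian
    y_bar = F.mean(axis=1)                         # average label vector (1/m) F 1
    B = np.zeros((n, k))
    B[labeled_idx, :] = (F - y_bar[:, None]).T     # point sources y_j - y_bar
    U = np.zeros((n, k))
    p = np.zeros(n)
    p[labeled_idx] = 1.0                           # random walk started at labeled nodes
    p_inf = d / d.sum()                            # invariant distribution W1/(1^T W 1)
    eps = 1.0 / n
    for _ in range(max_iter):
        U = U + (B - L @ U) / d[:, None]           # Step 9 of Algorithm 1
        p = W @ (p / d)                            # p_{t+1} = W D^{-1} p_t
        if np.max(np.abs(p / p.sum() - p_inf)) <= eps:
            break
    return U
```

Each iteration costs one sparse matrix-vector product per class, consistent with the O(T E) complexity noted above.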
The Poisson MBO algorithm is summarized in Algorithm 2. The matrices
D, L and B are the same as in Poisson learning, and Poisson MBO requires
an additional fidelity parameter µ and two parameters Ninner and Nouter . In all
experiments in this paper, we set µ = 1, Ninner = 40 and Nouter = 20. Steps
9-14 implement the volume constrained projection described in Section 10.4. We
1 Source Code: https://github.com/jwcalder/GraphLearning

Algorithm 1 PoissonLearning
1: Input: W, F, b, T
2: Output: U ∈ R^{n×k}
3: D ← diag(W1)
4: L ← D − W
5: ȳ ← (1/m)F1
6: B ← [F − ȳ, zeros(k, n − m)]
7: U ← zeros(n, k)
8: for i = 1 to T do
9:   U ← U + D^{−1}(B^T − LU)
10: end for
11: U ← U · diag(b/ȳ)

set the time step as dτ = 10 and set the clipping values in Step 12 to s_min = 0.5 and s_max = 2. We tested on datasets with balanced classes; for datasets with very unbalanced classes, one may wish to enlarge the interval [s_min, s_max].
The additional complexity of PoissonMBO on top of Poisson learning is
O(Ninner Nouter E). On large datasets like MNIST, FashionMNIST and Cifar-10,
our Poisson learning implementation in Python takes about 8 seconds to run on
a standard laptop computer, and about 1 second with GPU acceleration.2 The
additional 20 iterations of PoissonMBO take about 2 minutes on a laptop and
30 seconds on a GPU. These computational times do not include the time taken
to construct the weight matrix.

12 Experimental Results
We tested Poisson learning on three datasets: MNIST [49], FashionMNIST [62]
and Cifar-10 [46]. FashionMNIST is a drop-in replacement for MNIST consisting
of 10 classes of clothing items. To build good quality graphs, we trained
autoencoders to extract important features from the data. For MNIST and
FashionMNIST, we used variational autoencoders with 3 fully connected layers
of sizes (784,400,20) and (784,400,30), respectively, followed by a symmetrically
defined decoder. The autoencoder was trained for 100 epochs on each dataset.
The autoencoder architecture, loss, and training, are similar to [44]. For Cifar-
10, we used the AutoEncodingTransformations architecture from [65], with all
the default parameters from their paper, and we normalized the features to
2 We used an NVIDIA RTX-2070 GPU, and it took 3 seconds to load data to/from the

GPU and 1 second to solve Poisson learning.

Algorithm 2 PoissonMBO
1: Input: W, F, b, T, N_inner, N_outer, µ > 0
2: Output: U ∈ R^{n×k}
3: U ← µ · PoissonLearning(W, F, b, T)
4: dt ← 1/ max_{1≤i≤n} D_ii
5: for i = 1 to N_outer do
6:   for j = 1 to N_inner do
7:     U ← U − dt(LU − µB^T)
8:   end for
9:   s ← ones(1, k)
10:  for j = 1 to 100 do
11:    b̂ ← (1/n) 1^T Proj_{S_k}(U · diag(s))
12:    s ← max(min(s + dτ(b − b̂), s_max), s_min)
13:  end for
14:  U ← Proj_{S_k}(U · diag(s))
15: end for

[Figure 3: two panels, (a) and (b), plotting the difference in accuracy (%) against the number of neighbors k and against the number of labels per even-numbered class, respectively, for MNIST, FashionMNIST, and Cifar-10.]
Figure 3: Accuracy of Poisson Learning for (a) different numbers of neighbors k
used to construct the graph and (b) unbalanced training data. In (a) we used 5
labels per class and in (b) we used 1 label per class for the odd numbered classes,
and m = 1, 2, 3, 4, 5 labels per class for the even numbered classes. Both figures
show the difference in accuracy compared to k = 10 and balanced training data.

unit-vectors.
For each dataset, we constructed a graph over the latent feature space. We
used all available data to construct the graph, giving n = 70, 000 nodes for
MNIST and FashionMNIST, and n = 60, 000 nodes for Cifar-10. The graph was
constructed as a K-nearest neighbor graph with Gaussian weights given by

\[
w_{ij} = \exp\big(-4|x_i - x_j|^2/d_K(x_i)^2\big),
\]




where xi represents the latent variables for image i, and dK (xi ) is the distance

Table 1: MNIST: Average accuracy scores over 100 trials with standard deviation
in brackets.

# Labels per class 1 2 3 4 5


Laplace/LP [71] 16.1 (6.2) 28.2 (10.3) 42.0 (12.4) 57.8 (12.3) 69.5 (12.2)
Nearest Neighbor 55.8 (5.1) 65.0 (3.2) 68.9 (3.2) 72.1 (2.8) 74.1 (2.4)
Random Walk [68] 66.4 (5.3) 76.2 (3.3) 80.0 (2.7) 82.8 (2.3) 84.5 (2.0)
MBO [31] 19.4 (6.2) 29.3 (6.9) 40.2 (7.4) 50.7 (6.0) 59.2 (6.0)
VolumeMBO [41] 89.9 (7.3) 95.6 (1.9) 96.2 (1.2) 96.6 (0.6) 96.7 (0.6)
WNLL [56] 55.8 (15.2) 82.8 (7.6) 90.5 (3.3) 93.6 (1.5) 94.6 (1.1)
Centered Kernel [51] 19.1 (1.9) 24.2 (2.3) 28.8 (3.4) 32.6 (4.1) 35.6 (4.6)
Sparse LP [42] 14.0 (5.5) 14.0 (4.0) 14.5 (4.0) 18.0 (5.9) 16.2 (4.2)
p-Laplace [30] 72.3 (9.1) 86.5 (3.9) 89.7 (1.6) 90.3 (1.6) 91.9 (1.0)
Poisson 90.2 (4.0) 93.6 (1.6) 94.5 (1.1) 94.9 (0.8) 95.3 (0.7)
PoissonMBO 96.5 (2.6) 97.2 (0.1) 97.2 (0.1) 97.2 (0.1) 97.2 (0.1)

in the latent space between xi and its K th nearest neighbor. We used K = 10 in


all experiments. The weight matrix was then symmetrized by replacing W with
W + W T . For Poisson learning, we additionally set wii = 0 for all i. Placing
zeros on the diagonal does not change the solution of the Poisson learning equation
(10.3), but it does accelerate convergence of the iteration in Algorithm 1 by
allowing the random walk to propagate faster.
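A sketch of this construction is given below, using scikit-learn's exact nearest-neighbor search; the released code may use a different neighbor search, so the details here are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

def knn_gaussian_graph(X, K=10):
    """Build the K-nearest-neighbor graph with Gaussian weights
    w_ij = exp(-4 |x_i - x_j|^2 / d_K(x_i)^2), symmetrized, with zero diagonal."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=K + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)                # column 0 is the point itself
    dK = dist[:, -1]                              # distance to the K-th nearest neighbor
    rows = np.repeat(np.arange(n), K)
    cols = idx[:, 1:].ravel()
    vals = np.exp(-4.0 * dist[:, 1:].ravel() ** 2 / dK[rows] ** 2)
    W = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    W = W + W.T                                   # symmetrize: W <- W + W^T
    W.setdiag(0)                                  # w_ii = 0 for Poisson learning
    return W
```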
We compare against Laplace learning (10.11) [71], lazy random walks [68, 66],
multiclass MBO [31, 6], weighted nonlocal Laplacian (WNLL) [56], volume
constrained MBO [41], Centered Kernel Method [51], sparse label propagation
[42], and p-Laplace learning [30]. In the volume constrained MBO method we
used exact volume constraints and temperature of T = 0.1. In the Centered
Kernel Method, we chose α to be 5% larger than the spectral norm of the
centered weight matrix. For a baseline reference, we also compared against a
nearest neighbor classifier that chooses the label of the closest labeled vertex
with respect to the graph geodesic distance. In all experiments, we ran 100 trials
randomly choosing which data points are labeled, with the exception of the
p-Laplace and sparse label propagation methods, which are slower and were run
for 10 trials. The same random label permutations were used for all methods.
Tables 1, 2 and 3 show the average accuracy and standard deviation over all
100 trials for various low label rates. We also ran experiments at higher label
rates on FashionMNIST and Cifar-10, which are reported in the lower half of
their respective tables. We mention that in Tables 1, 2 and 3 the training data is
balanced, so ȳ = (1/10)·1. Thus, the label decisions (10.4) and (10.2) are equivalent.

Table 2: FashionMNIST: Average accuracy scores over 100 trials with standard
deviation in brackets.

# Labels per class 1 2 3 4 5


Laplace/LP [71] 18.4 (7.3) 32.5 (8.2) 44.0 (8.6) 52.2 (6.2) 57.9 (6.7)
Nearest Neighbor 44.5 (4.2) 50.8 (3.5) 54.6 (3.0) 56.6 (2.5) 58.3 (2.4)
Random Walk [68] 49.0 (4.4) 55.6 (3.8) 59.4 (3.0) 61.6 (2.5) 63.4 (2.5)
MBO [31] 15.7 (4.1) 20.1 (4.6) 25.7 (4.9) 30.7 (4.9) 34.8 (4.3)
VolumeMBO [41] 54.7 (5.2) 61.7 (4.4) 66.1 (3.3) 68.5 (2.8) 70.1 (2.8)
WNLL [56] 44.6 (7.1) 59.1 (4.7) 64.7 (3.5) 67.4 (3.3) 70.0 (2.8)
Centered Kernel [51] 11.8 (0.4) 13.1 (0.7) 14.3 (0.8) 15.2 (0.9) 16.3 (1.1)
Sparse LP [42] 14.1 (3.8) 16.5 (2.0) 13.7 (3.3) 13.8 (3.3) 16.1 (2.5)
p-Laplace [30] 54.6 (4.0) 57.4 (3.8) 65.4 (2.8) 68.0 (2.9) 68.4 (0.5)
Poisson 60.8 (4.6) 66.1 (3.9) 69.6 (2.6) 71.2 (2.2) 72.4 (2.3)
PoissonMBO 62.0 (5.7) 67.2 (4.8) 70.4 (2.9) 72.1 (2.5) 73.1 (2.7)

# Labels per class 10 20 40 80 160


Laplace/LP [71] 70.6 (3.1) 76.5 (1.4) 79.2 (0.7) 80.9 (0.5) 82.3 (0.3)
Nearest Neighbor 62.9 (1.7) 66.9 (1.1) 70.0 (0.8) 72.5 (0.6) 74.7 (0.4)
Random Walk [68] 68.2 (1.6) 72.0 (1.0) 75.0 (0.7) 77.4 (0.5) 79.5 (0.3)
MBO [31] 52.7 (4.1) 67.3 (2.0) 75.7 (1.1) 79.6 (0.7) 81.6 (0.4)
VolumeMBO [41] 74.4 (1.5) 77.4 (1.0) 79.5 (0.7) 81.0 (0.5) 82.1 (0.3)
WNLL [56] 74.4 (1.6) 77.6 (1.1) 79.4 (0.6) 80.6 (0.4) 81.5 (0.3)
Centered Kernel [51] 20.6 (1.5) 27.8 (2.3) 37.9 (2.6) 51.3 (3.3) 64.3 (2.6)
Sparse LP [42] 15.2 (2.5) 15.9 (2.0) 14.5 (1.5) 13.8 (1.4) 51.9 (2.1)
p-Laplace [30] 73.0 (0.9) 76.2 (0.8) 78.0 (0.3) 79.7 (0.5) 80.9 (0.3)
Poisson 75.2 (1.5) 77.3 (1.1) 78.8 (0.7) 79.9 (0.6) 80.7 (0.5)
PoissonMBO 76.1 (1.4) 78.2 (1.1) 79.5 (0.7) 80.7 (0.6) 81.6 (0.5)

We see that in nearly all cases, PoissonMBO outperforms all other methods,
with PoissonMBO typically outperforming Poisson learning by a few percentage
points. The most drastic improvements are seen at the ultra low label rates,
and at the moderate label rates shown in Tables 2 and 3, several other methods
perform well. We note that VolumeMBO and PoissonMBO are the only methods
that incorporate prior knowledge of class sizes, and are most suitable for direct
comparison. Our results can be compared to the clustering results of 67.2% on
FashionMNIST [52] and 41.2% on Cifar-10 [32].
Figure 3(a) shows the accuracy of Poisson learning at 5 labels per class as a
function of the number of neighbors K used in constructing the graph, showing
that the algorithm is not particularly sensitive to this. Figure 3(b) shows the
accuracy of Poisson learning for unbalanced training data. We take 1 label
per class for half the classes and m = 1, 2, 3, 4, 5 labels per class for the other
half. Since the training data is unbalanced, y is not a constant vector and the

Table 3: Cifar-10: Average accuracy scores over 100 trials with standard deviation
in brackets.

# Labels per class 1 2 3 4 5


Laplace/LP [71] 10.4 (1.3) 11.0 (2.1) 11.6 (2.7) 12.9 (3.9) 14.1 (5.0)
Nearest Neighbor 31.4 (4.2) 35.3 (3.9) 37.3 (2.8) 39.0 (2.6) 40.3 (2.3)
Random Walk [68] 36.4 (4.9) 42.0 (4.4) 45.1 (3.3) 47.5 (2.9) 49.0 (2.6)
MBO [31] 14.2 (4.1) 19.3 (5.2) 24.3 (5.6) 28.5 (5.6) 33.5 (5.7)
VolumeMBO [41] 38.0 (7.2) 46.4 (7.2) 50.1 (5.7) 53.3 (4.4) 55.3 (3.8)
WNLL [56] 16.6 (5.2) 26.2 (6.8) 33.2 (7.0) 39.0 (6.2) 44.0 (5.5)
Centered Kernel [51] 15.4 (1.6) 16.9 (2.0) 18.8 (2.1) 19.9 (2.0) 21.7 (2.2)
Sparse LP [42] 11.8 (2.4) 12.3 (2.4) 11.1 (3.3) 14.4 (3.5) 11.0 (2.9)
p-Laplace [30] 26.0 (6.7) 35.0 (5.4) 42.1 (3.1) 48.1 (2.6) 49.7 (3.8)
Poisson 40.7 (5.5) 46.5 (5.1) 49.9 (3.4) 52.3 (3.1) 53.8 (2.6)
PoissonMBO 41.8 (6.5) 50.2 (6.0) 53.5 (4.4) 56.5 (3.5) 57.9 (3.2)

# Labels per class 10 20 40 80 160


Laplace/LP [71] 21.8 (7.4) 38.6 (8.2) 54.8 (4.4) 62.7 (1.4) 66.6 (0.7)
Nearest Neighbor 43.3 (1.7) 46.7 (1.2) 49.9 (0.8) 52.9 (0.6) 55.5 (0.5)
Random Walk [68] 53.9 (1.6) 57.9 (1.1) 61.7 (0.6) 65.4 (0.5) 68.0 (0.4)
MBO [31] 46.0 (4.0) 56.7 (1.9) 62.4 (1.0) 65.5 (0.8) 68.2 (0.5)
VolumeMBO [41] 59.2 (3.2) 61.8 (2.0) 63.6 (1.4) 64.5 (1.3) 65.8 (0.9)
WNLL [56] 54.0 (2.8) 60.3 (1.6) 64.2 (0.7) 66.6 (0.6) 68.2 (0.4)
Centered Kernel [51] 27.3 (2.1) 35.4 (1.8) 44.9 (1.8) 53.7 (1.9) 60.1 (1.5)
Sparse LP [42] 15.6 (3.1) 17.4 (3.9) 20.0 (1.9) 21.7 (1.3) 15.0 (1.1)
p-Laplace [30] 56.4 (1.8) 60.4 (1.2) 63.8 (0.6) 66.3 (0.6) 68.7 (0.3)
Poisson 58.3 (1.7) 61.5 (1.3) 63.8 (0.8) 65.6 (0.6) 67.3 (0.4)
PoissonMBO 61.8 (2.2) 64.5 (1.6) 66.9 (0.8) 68.7 (0.6) 70.3 (0.4)

label decision in Step 11 of Algorithm 1 (Poisson Learning) compensates for


unbalanced training data. Note that in Figure 3 we plot the difference in accuracy
compared to (a) the baseline of k = 10 and (b) 1 label per class. In Figure 3 (b),
we see an increase in accuracy when only half the classes get additional labels,
though the increase is not as large as in Tables 1, 2 and 3 where all classes get
additional labels.

13 Continuum limits
We briefly discuss continuum limits for the Poisson learning problem (10.3).
We take p = 2 for simplicity, but the arguments extend similarly to other
values of p ≥ 1. In order to analyze continuum limits of graph-based learning
algorithms, we make the manifold assumption, and assume G is a random
geometric graph sampled from an underlying manifold. To be precise, we assume

the vertices of the graph corresponding to unlabeled points x1 , . . . , xn are a
sequence of i.i.d. random variables drawn from a d-dimensional compact, closed,
and connected manifold M embedded in RD , where d ≪ D. We assume the
probability distribution of the random variables has the form dµ = ρdVolM ,
where VolM is the volume form on the manifold, and ρ is a smooth density. For
the labeled vertices in the graph, we take a fixed finite set of points Γ ⊂ M. The
vertices of the random geometric graph are

Xn := {x1 , . . . , xn } ∪ Γ.

We define the edge weights in the graph by

\[
w_{xy} = \eta_\varepsilon(|x - y|),
\]
where ε > 0 is the length scale on which we connect neighbors, |x − y| is the Euclidean distance in R^D, η : [0, ∞) → [0, ∞) is smooth with compact support, and $\eta_\varepsilon(t) = \frac{1}{\varepsilon^d}\eta\!\left(\frac{t}{\varepsilon}\right)$. We denote the solution of the Poisson learning problem (10.3) for this random geometric graph by u_{n,ε}(x).


The normalized graph Laplacian is given by
\[
L_{n,\varepsilon}u(x) = \frac{2}{\sigma_\eta n\varepsilon^2}\sum_{y\in X_n}\eta_\varepsilon(|x - y|)\,(u(x) - u(y)),
\]
where $\sigma_\eta = \int_{\mathbb{R}^d} |z_1|^2\,\eta(|z|)\,dz$. It is well known (see, e.g., [36]) that L_{n,ε} is consistent with the (negative of the) weighted Laplace–Beltrami operator
\[
\Delta_\rho u := -\rho^{-1}\operatorname{div}_{\mathcal M}(\rho^2\nabla_{\mathcal M}u),
\]
where div_M is the manifold divergence and ∇_M is the manifold gradient. We write div = div_M and ∇ = ∇_M from now on for convenience. In particular, for any u ∈ C^3(M) the estimate
\[
|L_{n,\varepsilon}u(x) - \Delta_\rho u(x)| \leq C\big(\|u\|_{C^3(\mathcal M)} + 1\big)(\lambda + \varepsilon)
\]
holds for all x ∈ X_n with probability at least $1 - Cn\exp\big(-cn\varepsilon^{d+2}\lambda^2\big)$ for any 0 < λ ≤ 1, where C, c > 0 are constants.
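To illustrate the definition, the following is a small dense-matrix sketch of L_{n,ε} applied to a function sampled on the point cloud; it is suitable only for small n, and the profile η and the constant σ_η are passed in as assumptions rather than computed.

```python
import numpy as np

def graph_laplacian(X, u, eps, eta, sigma_eta):
    """Apply L_{n,eps} u(x_i) = 2/(sigma_eta * n * eps^2) *
    sum_j eta_eps(|x_i - x_j|) (u(x_i) - u(x_j)), with eta_eps(t) = eps^{-d} eta(t/eps)."""
    n, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise |x_i - x_j|
    Wmat = eta(dists / eps) / eps ** d                               # eta_eps weights
    return (2.0 / (sigma_eta * n * eps ** 2)) * (Wmat.sum(axis=1) * u - Wmat @ u)
```

Here eta can be any smooth, compactly supported profile as in the text, and sigma_eta is the corresponding constant ∫ z_1² η(|z|) dz.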
Using the normalised graph Laplacian in the Poisson learning problem (10.3) we write
\[
L_{n,\varepsilon}u_{n,\varepsilon}(x) = n\sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_{x=y}\quad\text{for } x\in X_n,\tag{13.1}
\]
where g(y) ∈ R denotes the label associated to y ∈ Γ and $\bar{y} = \frac{1}{|\Gamma|}\sum_{x\in\Gamma}g(x)$. We restrict to the scalar case (binary classification) for now. Note that the normalisation plays no role in the classification problem (10.2). To see what should happen in the continuum, as n → ∞ and ε → 0, we multiply both sides of (13.1) by a smooth test function φ ∈ C^∞(M), sum over x ∈ X_n, and divide by n to obtain
\[
\frac{1}{n}(L_{n,\varepsilon}u_{n,\varepsilon}, \varphi)_{\ell^2(X)} = \sum_{y\in\Gamma}(g(y) - \bar{y})\,\varphi(y).\tag{13.2}
\]

Since L_{n,ε} is self-adjoint (symmetric), we have
\[
(L_{n,\varepsilon}u_{n,\varepsilon}, \varphi)_{\ell^2(X)} = (u_{n,\varepsilon}, L_{n,\varepsilon}\varphi)_{\ell^2(X)} = (u_{n,\varepsilon}, \Delta_\rho\varphi)_{\ell^2(X)} + O\big((\lambda + \varepsilon)\|u_{n,\varepsilon}\|_{\ell^1(X)}\big).
\]
We also note that
\[
\sum_{y\in\Gamma}(g(y) - \bar{y})\,\varphi(y) = \int_{\mathcal M}\sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_y(x)\,\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x),
\]
where δ_y is the Dirac delta distribution centered at y ∈ M, which has the property that
\[
\int_{\mathcal M}\delta_y(x)\,\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x) = \varphi(y)
\]
for every smooth φ ∈ C^∞(M). Combining these observations with (13.2) we see that
\[
\frac{1}{n}(u_{n,\varepsilon}, \Delta_\rho\varphi)_{\ell^2(X)} + O\Big(\frac{\lambda + \varepsilon}{n}\|u_{n,\varepsilon}\|_{\ell^1(X)}\Big) = \int_{\mathcal M}\sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_y(x)\,\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x).
\]
If we can extend u_{n,ε} to a function on M in a suitable way, then the law of large numbers would yield
\[
\frac{1}{n}(u_{n,\varepsilon}, \Delta_\rho\varphi)_{\ell^2(X)} \approx \int_{\mathcal M}u_{n,\varepsilon}(x)\,\rho(x)\,\Delta_\rho\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x).
\]

Hence, if u_{n,ε} → u as n → ∞ and ε → 0 in a sufficiently strong sense, then the function u : M → R would satisfy
\[
-\int_{\mathcal M}u\,\operatorname{div}(\rho^2\nabla\varphi)\,d\mathrm{Vol}_{\mathcal M} = \int_{\mathcal M}\sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_y(x)\,\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x)
\]
for every smooth φ ∈ C^∞(M). If u ∈ C^2(M), then we can integrate by parts on the left hand side to find that
\[
-\int_{\mathcal M}\varphi\,\operatorname{div}(\rho^2\nabla u)\,d\mathrm{Vol}_{\mathcal M} = \int_{\mathcal M}\sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_y(x)\,\varphi(x)\,d\mathrm{Vol}_{\mathcal M}(x).
\]
Since φ is arbitrary, this would show that u is the solution of the Poisson problem
\[
-\operatorname{div}\big(\rho^2\nabla u\big) = \sum_{y\in\Gamma}(g(y) - \bar{y})\,\delta_y\quad\text{on }\mathcal M.\tag{13.3}
\]
We conjecture that the solutions u_{n,ε} converge to the solution of (13.3) as n → ∞ and ε → 0 with probability one.

Conjecture 13.1. Assume ρ is smooth. Assume that n → ∞ and ε = ε_n → 0 so that
\[
\lim_{n\to\infty}\frac{n\varepsilon^{d+2}}{\log n} = \infty.
\]
Let u ∈ C^∞(M \ Γ) be the solution of the Poisson equation (13.3) and let u_{n,ε} solve the graph Poisson problem (13.1). Then, with probability one,
\[
\lim_{n\to\infty}\ \max_{\substack{x\in X_n \\ \operatorname{dist}(x,\Gamma)>\delta}}|u_{n,\varepsilon}(x) - u(x)| = 0
\]
for all δ > 0.

The conjecture states that un,ε converges to u uniformly as long as one


stays a positive distance away from the source points Γ, where the solution u is
singular. We expect the conjecture to be true, since similar results are known to
hold when the source term on the right hand side is a smooth function f . The
fact that the right hand side in (13.3) is highly singular, involving delta-mass
concentration, raises difficult technical problems that will require new insights
that are far beyond the scope of this paper.

Remark 13.2. If Conjecture 13.1 is true, it shows that Poisson learning is


consistent with a well-posed continuum PDE for arbitrarily low label rates.
This is in stark contrast to Laplace learning, which does not have a well-posed

continuum limit unless the number of labels grows to ∞ as n → ∞ sufficiently
fast. This partially explains the superior performance of Poisson learning for
low label rate problems.

14 Conclusion
We proposed a new framework for graph-based semi-supervised learning at very
low label rates called Poisson learning. The method is efficient and simple to
implement. We performed a detailed analysis of Poisson learning, giving random
walk and variational interpretations. We also proposed a graph-cut enhancement
of Poisson learning, called Poisson MBO, that can give further improvements.
We presented numerical results showing that Poisson Learning outperforms all
other methods for semi-supervised learning at low label rates on several common
datasets.

References
[1] B. Abbasi, J. Calder, and A. M. Oberman. Anomaly detection and classifi-
cation for streaming data using partial differential equations. SIAM Journal
on Applied Mathematics, In press, 2017.

[2] R. K. Ando and T. Zhang. Learning on graph with Laplacian regularization.


In Advances in Neural Information Processing Systems, pages 25–32, 2007.

[3] S. Armstrong, P. Cardaliaguet, and P. Souganidis. Error estimates and


convergence rates for the stochastic homogenization of hamilton-jacobi
equations. Journal of the American Mathematical Society, 27(2):479–540,
2014.

[4] M. Bardi and I. Capuzzo-Dolcetta. Optimal control and viscosity solutions


of Hamilton-Jacobi-Bellman equations. Springer, 1997.

[5] M. Belkin and P. Niyogi. Using manifold structure for partially labelled
classification. In Advances in Neural Information Processing Systems, 2002.

[6] A. L. Bertozzi and A. Flenner. Diffuse interface models on graphs for


classification of high dimensional data. Multiscale Modeling & Simulation,
10(3):1090–1118, 2012.

[7] B. Bollobás and G. Brightwell. The height of a random partial order:


concentration of measure. The Annals of Applied Probability, pages 1009–
1018, 1992.

[8] B. Bollobás and P. Winkler. The longest chain among random points
in Euclidean space. Proceedings of the American Mathematical Society,
103(2):347–353, 1988.

[9] J. Calder. Lecture notes on viscosity solutions. Available at http://


www-users.math.umn.edu/~jwcalder/viscosity_solutions.pdf.

[10] J. Calder. Directed last passage percolation with discontinuous weights.


Journal of Statistical Physics, 158(45):903–949, 2015.

[11] J. Calder. A direct verification argument for the Hamilton–Jacobi equation


continuum limit of nondominated sorting. Nonlinear Analysis: Theory,
Methods & Applications, 141:88–108, 2016.

[12] J. Calder. Consistency of Lipschitz learning with infinite unlabeled data
and finite labeled data. arXiv:1710.10364, 2017.

[13] J. Calder. Numerical schemes and rates of convergence for the Hamilton–
Jacobi equation continuum limit of nondominated sorting. Numerische
Mathematik, 137(4):819–856, 2017.

[14] J. Calder. The game theoretic p-Laplacian and semi-supervised learning


with few labels. Nonlinearity, 32(1), 2018.

[15] J. Calder, B. Cook, M. Thorpe, and D. Slepcev. Poisson learning: Graph


based semi-supervised learning at very low label rates. In International
Conference on Machine Learning, pages 1306–1316. PMLR, 2020.

[16] J. Calder, S. Esedoglu, and A. O. Hero. A Hamilton–Jacobi equation for the


continuum limit of nondominated sorting. SIAM Journal on Mathematical
Analysis, 46(1):603–638, 2014.

[17] J. Calder, S. Esedoḡlu, and A. O. Hero III. A Hamilton-Jacobi equa-


tion for the continuum limit of non-dominated sorting. SIAM Journal on
Mathematical Analysis, 46(1):603–638, 2014.

[18] J. Calder, S. Esedoḡlu, and A. O. Hero III. A PDE-based approach to


non-dominated sorting. SIAM Journal on Numerical Analysis, 53(1):82–104,
2015.

[19] J. Calder and D. Slepčev. Properly-Weighted graph Laplacian for semi-


supervised learning. Applied Mathematics & Optimization, Dec 2019.

[20] J. Calder and C. K. Smart. The limit shape of convex hull peeling. To
appear in Duke Mathematical Journal, 2020.

[21] T. Chan and L. Vese. Active contours without edges. IEEE Transactions
on Image Processing, 10(2):266–277, Feb. 2001.

[22] B. Cook and J. Calder. Rates of convergence for the continuum limit of
nondominated sorting. SIAM Journal on Mathematical Analysis, 54(1):872–
911, 2022.

[23] M. Crandall, H. Ishii, and P. Lions. User’s guide to viscosity solutions


of second order partial differential equations. Bulletin of the American
Mathematical Society, 27(1):1–67, July 1992.

[24] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multi-
objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary
Computation, 6(2):182–197, 2002.

[25] J.-D. Deuschel and O. Zeitouni. Limiting curves for i.i.d. records. The
Annals of Probability, 23(2):852–878, 1995.

[26] E. DiBenedetto. Degenerate parabolic equations. Springer Science & Business


Media, 1993.

[27] M. Ehrgott. Multicriteria Optimization. Springer, Berlin, Heidelberg, New


York, 2005.

[28] A. El Alaoui, X. Cheng, A. Ramdas, M. J. Wainwright, and M. I. Jordan.


Asymptotic behavior of ℓp -based Laplacian regularization in semi-supervised
learning. In Conference on Learning Theory, pages 879–906, 2016.

[29] G. Fleury, A. O. Hero III, S. Yoshida, T. Carter, C. Barlow, and A. Swaroop.


Pareto analysis for gene filtering in microarray experiments. In European
Signal Processing Conference (EUSIPCO), 2002.

[30] M. Flores, J. Calder, and G. Lerman. Algorithms for Lp-based semi-


supervised learning on graphs. arXiv:1901.05031, 2019.

[31] C. Garcia-Cardona, E. Merkurjev, A. L. Bertozzi, A. Flenner, and A. G.


Percus. Multiclass data segmentation using diffuse interface methods on
graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence,
36(8):1600–1613, 2014.

[32] K. Ghasedi, X. Wang, C. Deng, and H. Huang. Balanced self-paced learning


for generative adversarial clustering network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4391–4400,
2019.

[33] J. Hammersley. A few seedlings of research. In Proceedings of the Sixth


Berkeley Symposium on Mathematical Statistics and Probability, volume 1,
pages 345–394, 1972.

[34] J. Handl and J. Knowles. Exploiting the trade-off – the benefits of multiple
objectives in data clustering. In Evolutionary Multi-Criterion Optimization,
pages 547–560. Springer, 2005.

[35] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Manifold-ranking
based image retrieval. In Proceedings of the 12th annual ACM International
Conference on Multimedia, pages 9–16. ACM, 2004.

[36] M. Hein, J.-Y. Audibert, and U. v. Luxburg. Graph Laplacians and their
convergence on random neighborhood graphs. Journal of Machine Learning
Research, 8(Jun):1325–1368, 2007.

[37] A. O. Hero III. Gene selection and ranking with microarray data. In
IEEE International Symposium on Signal Processing and its Applications,
volume 1, pages 457–464, 2003.

[38] K.-J. Hsiao, J. Calder, and A. O. Hero III. Pareto-depth for multiple-query
image retrieval. IEEE Transactions on Image Processing, 24(2):583–594,
2015.

[39] K.-J. Hsiao, K. Xu, J. Calder, and A. O. Hero III. Multi-criteria anomaly
detection using Pareto Depth Analysis. In Advances in Neural Information
Processing Systems 25, pages 854–862. 2012.

[40] K.-J. Hsiao, K. Xu, J. Calder, and A. O. Hero III. Multi-criteria similarity-
based anomaly detection using Pareto Depth Analysis. IEEE Transactions
on Neural Networks and Learning Systems, 2015. To appear.

[41] M. Jacobs, E. Merkurjev, and S. Esedoḡlu. Auction dynamics: A volume


constrained MBO scheme. Journal of Computational Physics, 354:288–310,
2018.

[42] A. Jung, A. O. Hero III, A. Mara, and S. Jahromi. Semi-supervised learning


via sparse label propagation. arXiv preprint arXiv:1612.01414, 2016.

[43] N. Katzourakis. An introduction to viscosity solutions for fully nonlinear


PDE with applications to calculus of variations in L infinity. Springer, 2014.

[44] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Pro-


ceedings of the 2nd International Conference on Learning Representations
(ICLR), 2014.

[45] J. F. C. Kingman. Poisson processes, volume 3. Oxford university press,


1992.

[46] A. Krizhevsky and G. Hinton. Learning multiple layers of features from


tiny images. Technical report, Citeseer, 2009.

[47] A. Kumar and A. Vladimirsky. An efficient method for multiobjective
optimal control and optimal control subject to integral constraints. Journal
of Computational Mathematics, 28(4):517–551, 2010.

[48] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman. Algorithms for Lipschitz


learning on graphs. In Conference on Learning Theory, pages 1190–1223,
2015.

[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning


applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,
1998.

[50] B. F. Logan and L. A. Shepp. A variational problem for random Young


tableaux. Advances in Mathematics, 26(2):206–222, 1977.

[51] X. Mai and R. Couillet. Random matrix-inspired improved semi-supervised


learning on graphs. In International Conference on Machine Learning, 2018.

[52] R. McConville, R. Santos-Rodriguez, R. J. Piechocki, and I. Craddock.


N2d:(not too) deep clustering via clustering the local manifold of an au-
toencoded embedding. In 2020 25th International Conference on Pattern
Recognition (ICPR), pages 5145–5152. IEEE, 2021.

[53] I. Mitchell and S. Sastry. Continuous path planning with multiple constraints.
In IEEE Conference on Decision and Control, volume 5, pages 5502–5507,
Dec. 2003.

[54] D. Mumford and J. Shah. Optimal approximations by piecewise smooth


functions and associated variational problems. Communications on Pure
and Applied Mathematics, 42(5):577–685, 1989.

[55] B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the


graph Laplacian: The limit of infinite unlabelled data. Advances in Neural
Information Processing Systems, 22:1330–1338, 2009.

[56] Z. Shi, S. Osher, and W. Zhu. Weighted nonlocal Laplacian on interpolation


from sparse data. Journal of Scientific Computing, 73(2-3):1164–1177, 2017.

[57] D. Slepčev and M. Thorpe. Analysis of p-Laplacian regularization in semisu-


pervised learning. SIAM Journal on Mathematical Analysis, 51(3):2085–
2120, 2019.

[58] W. Thawinrak and J. Calder. High-order filtered schemes for the Hamilton-
Jacobi continuum limit of nondominated sorting. Journal of Mathematics
Research, 10(1), Nov 2017.

[59] S. Ulam. Monte Carlo calculations in problems of mathematical physics.


Modern Mathematics for the Engineers, pages 261–281, 1961.

[60] Y. Van Gennip and A. L. Bertozzi. Gamma-convergence of graph Ginzburg-


Landau functionals. Advances in Differential Equations, 17(11/12):1115–
1180, 2012.

[61] A. Vershik and S. Kerov. Asymptotics of the Plancherel measure of the


symmetric group and the limiting form of Young tables. Soviet Doklady
Mathematics, 18(527-531):38, 1977.

[62] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image


dataset for benchmarking machine learning algorithms. arXiv preprint
arXiv:1708.07747, 2017.

[63] B. Xu, J. Bu, C. Chen, D. Cai, X. He, W. Liu, and J. Luo. Efficient manifold
ranking for image retrieval. In Proceedings of the 34th International ACM
SIGIR Conference on Research and Development in Information Retrieval,
pages 525–534. ACM, 2011.

[64] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection
via graph-based manifold ranking. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3166–3173, 2013.

[65] L. Zhang, G.-J. Qi, L. Wang, and J. Luo. Aet vs. aed: Unsupervised
representation learning by auto-encoding transformations rather than data.
In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2547–2555, 2019.

[66] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with


local and global consistency. In Advances in Neural Information Processing
Systems, pages 321–328, 2004.

[67] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unla-
beled data on a directed graph. In Proceedings of the 22nd International
Conference on Machine Learning, pages 1036–1043. ACM, 2005.

[68] D. Zhou and B. Schölkopf. Learning from labeled and unlabeled data using
random walks. In Joint Pattern Recognition Symposium, pages 237–244.
Springer, 2004.

[69] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf. Ranking


on data manifolds. In Advances in Neural Information Processing Systems,
pages 169–176, 2004.

[70] X. Zhou and M. Belkin. Semi-supervised learning by higher order regu-


larization. In Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics, pages 892–900, 2011.

[71] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning


using Gaussian fields and harmonic functions. In Proceedings of the 20th
International Conference on Machine learning (ICML-03), pages 912–919,
2003.

[72] X. Zhu, J. Lafferty, and R. Rosenfeld. Semi-supervised learning with graphs.


PhD thesis, Carnegie Mellon University, 2005.

