Robust Risk Aggregation With Neural Networks: Stephan Eckstein Michael Kupper Mathias Pohl May 27, 2020

Robust risk aggregation with neural
networks
Stephan Eckstein∗ Michael Kupper† Mathias Pohl‡
May 27, 2020

arXiv:1811.00304v3 [q-fin.MF] 26 May 2020
Abstract. We consider settings in which the distribution of a multivariate

random variable is partly ambiguous. We assume the ambiguity lies on the level
of the dependence structure, and that the marginal distributions are known.
Furthermore, a current best guess for the distribution, called reference measure,
is available. We work with the set of distributions that are both close to the given
reference measure in a transportation distance (e.g. the Wasserstein distance),
and additionally have the correct marginal structure. The goal is to find upper
and lower bounds for integrals of interest with respect to distributions in this
set.
The described problem appears naturally in the context of risk aggregation.
When aggregating different risks, the marginal distributions of these risks are
known and the task is to quantify their joint effect on a given system. This is
typically done by applying a meaningful risk measure to the sum of the individual
risks. For this purpose, the stochastic interdependencies between the risks need
to be specified. In practice the models of this dependence structure are however
subject to relatively high model ambiguity.
The contribution of this paper is twofold: Firstly, we derive a dual representa-
tion of the considered problem and prove that strong duality holds. Secondly, we
propose a generally applicable and computationally feasible method, which relies
on neural networks, in order to numerically solve the derived dual problem. The
latter method is tested on a number of toy examples, before it is finally applied
to perform robust risk aggregation in a real world instance.
∗ Department of Mathematics, University of Konstanz, Universitätsstrae 10, 78464 Konstanz, Germany,

stephan.eckstein@uni-konstanz.de
† Department of Mathematics, University of Konstanz, Universitätsstrae 10, 78464 Konstanz, Germany,
kupper@uni-konstanz.de
‡ Faculty of Business, Economics & Statistics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vi-
enna, Austria, mathias.pohl@univie.ac.at
1
1. Introduction
1.1. Motivation
Risk aggregation is the process of combining multiple types of risk within a firm. The aim
is to obtain meaningful measures for the overall risk the firm is exposed to. The stochastic
interdependencies between the different risk types are crucial in this respect. There is a
variety of different approaches to model these interdependencies. One generally observes
that these models for the dependence structure between the risk types are significantly less
accurate than the models for the individual types of risk.
We take the following approach to address this issue: We assume that the distributions
of the marginal risks are known and fixed. This assumption is justified in many cases
of practical interest. Moreover, risk aggregation is per definition not concerned with the
computation of the marginal risks’ distributions. Additionally, we take a probabilistic model
for the dependence structure linking the marginals risks as given. Note that there are at least
two different approaches in the literature to specify this reference dependence structure: The
construction of copulas and factor models. The particular form of this reference model is not
relevant for our approach as long as it allows us to generate random samples. Independently
of the employed method, the choice of a reference dependence structure is typically subject
to high uncertainty. Our contribution is to model the ambiguity with respect to the specified
reference model, while fixing the marginal distributions. We address the following question
in this paper:
How can we account for model ambiguity with respect to a specific dependence structure
when aggregating different risks?
We propose an intuitive approach to this problem: We compute the aggregated risk with
respect to the worst case dependence structure in a neighborhood around the specified refer-
ence dependence structure. For the construction of this neighborhood we use transportation
distances. These distance measures between probability distributions are flexible enough to
capture different kinds of model ambiguity. At the same time, they allow us to generally
derive numerical methods, which solve the corresponding problem of robust risk aggrega-
tion in reasonable time. To highlight some of the further merits of our approach, we are
able to determine the worst case dependence structure for a problem at hand. Hence, our
method for robust risk measurement is arguably a useful tool also for risk management as it
provides insights about which scenarios stress a given system the most. Moreover, it should
be emphasized that our approach is restricted neither to a particular risk measure nor a
particular aggregation function.1
In summary, the approach presented provides a flexible way to include model ambiguity in
situations where a reference dependence structure is given and the marginals are fixed. It is
generally applicable and computationally feasible. In the subsequent subsection we outline
our approach in some more details before discussing the related literature.
1.2. Overview
We aim to evaluate Z
f dµ̄,
Rd
1 Note also that our methods can be applied to solve completely unrelated problems, such as the portfolio
selection problem under dependence uncertainty introduced in G. C. Pflug and Pohl (2017).
2
for some f : Rd → R in the presence of ambiguity with respect to the probability measure
µ̄ ∈ P(Rd ), where P(Rd ) denotes the set of all Borel probability measures on Rd . In
particular, we assume that the marginals µ̄1 , . . . , µ̄d of µ̄ are known and the ambiguity lies
solely on the level of the dependence structure. Moreover, we assume a reference dependence
structure, namely the one implied by the reference measure µ̄, is given and that the degree
of ambiguity with respect to the reference measure µ̄ can be modeled by the transportation
distance dc , which is defined in (2) below. Hence, we consider the following problem
Z
φ(f ) := max f dµ, (1)
µ∈Π(µ̄1 ,...,µ̄d ) Rd
dc (µ̄,µ)≤ρ
where the set Π(µ̄1 , . . . , µ̄d ) consists of all µ ∈ P(Rd ) satisfying µi = µ̄i for all i = 1, . . . , d,
where µi ∈ P(R) and µ̄i ∈ P(R) denote the i-th marginal distributions of µ and µ̄. We fix
a continuous function c : Rd × Rd → [0, ∞) such that c(x, x) = 0 for all x ∈ R. The cost of
transportation between µ̄ and µ in P(Rd ) with respect to the cost function c is defined as
Z
dc (µ̄, µ) := inf c (x, y) π(dx, dy), (2)
π∈Π(µ̄,µ) Rd ×Rd
where Π(µ̄, µ) denotes the set of all couplings of the marginals µ̄ and µ. For the cost function
1/p
c(x, y) = ||x − y||p with p ≥ 1, the mapping dc corresponds to the Wasserstein distance of
order p.
The numerical methods to solve problem (1), which are developed in this paper, build on
the following dual formulation of problem (1):
n d Z
X Z h d
X i o
inf ρλ + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx) , (3)
λ≥0, hi ∈Cb (R) R Rd y∈Rd
i=1 i=1
where Cb (R) the set of all continuous and bounded functions h : R → R. This dual for-
mulation was initially derived by Gao and Kleywegt (2017a). These authors show that
strong duality holds, i.e. problem (1) and (3) coincide, for upper semicontinuous functions
f (x)
f : X → R satisfying the growth condition supx∈X c(x,y 0)
< ∞ for some y0 ∈ X, where
X = X1 × ... × Xd for possibly non-compact subsets X1 , ..., Xd of R.
Theorem 1 in Section 2 extends the duality in the following aspects: Firstly, the functions
f : X → R need not satisfy a growth condition that depends on the cost c. Our results
allow for upper semicontinuous functions of bounded growth. Secondly, we can consider
a space X = X1 × · · · × Xd , where Xi can be arbitrary polish spaces. We emphasize
that the problem setting can therefore include an information structure where multivariate
marginals are known and fixed. Lastly, Theorem 1 extends the constraint dc (µ̄, µ) ≤ ρ to a
more general way of penalizing with respect to dc (µ̄, µ).
We now turn to the question how the dual formulation (3) can be utilized to solve problem
(1). One approach is to assume that the reference distribution µ̄ is a discrete distribution. In
this context, Gao and Kleywegt (2017a) show that the dual problem (3) can be reformulated
as a linear program under the following assumptions: First, the function f can be written
as the maximum of affine functions. Second, the reference distribution µ̄ is given by an
empirical distribution on n points x1 , . . . , xn in Rd . Third, the cost function c has to be
Pd
additively separable, i.e. c(x, y) = i=1 ci (xi , yi ). For further details we refer to Corollary 3
in Section 2.1.
3
This linear programming approach is especially useful when only few observations are
available to construct the reference distribution µ̄ - a case where accounting for ambiguity
with respect to the dependence structure is often required. Nevertheless, the assumptions
under which problem (3) can be solved by means of linear programming exclude many cases
of practical interest. Even in cases that linear programming is applicable, the resulting size
of the linear program quickly becomes intractable in higher dimensions. Hence, this paper
presents a generally applicable and computationally feasible method to numerically solve
problem (3) which uses neural networks.
The basic idea is to use neural networks to parametrize the functions hi ∈ Cb (R) and then
solve the resulting finite dimensional problem. Theoretically, such an approach is justified by
the universal approximation properties of neural networks, see for example Hornik (1991).
To utilize neural networks, we first dualize the point-wise supremum inside the integral
of (3). Under mild assumptions, this leads to
n d Z
X Z o
inf λρ + hi dµ̄i + g dµ̄ .
λ≥0, R Rd
i=1
hi ∈Cb (R), g∈Cb (Rd ):
Pd
g(x)≥f (y)− i=1 hi (yi )−λc(x,y)
As the pointwise inequality constraint prevents a direct implementation with neural net-
works, the constraint is penalized. This is done by introducing a measure θ ∈ P(R2d ),
which we refer to as the sampling measure. Further, we are given a family of penalty func-
tions (βγ )γ>0 which increases the accuracy of the penalization for increasing γ, e.g. βγ (x) =
γ max{0, x}2 . The resulting optimization problem reads
n d Z
X Z
φθ,γ (f ) := inf λρ + hi dµ̄i + g dµ̄ (4)
λ≥0, R Rd
i=1
hi ∈Cb (R), g∈Cb (Rd ) Z d
X o
+ βγ f (y) − hi (yi ) − λc(x, y) − g(x) θ(dx, dy) .
R2d i=1
Before we develop numerical methods to evaluate φθ,γ (f ) and thereby approximate φ(f ),
we need to study the convergence
φθ,γ (f ) → φ(f ) for γ → ∞. (5)
A sufficient condition for this convergences is derived in Proposition 1. We additionally

give a general instance where this derived condition is satisfied. It states that (5) holds
whenever the cost function c satisfies a mild growth condition and the sampling measure θ
is the product measure between the reference measure and the respective marginals, i.e. θ =
µ̄ ⊗ (µ̄1 ⊗ ... ⊗ µ̄d ).
Besides the optimal value of problem (1) also the corresponding optimizer is of interest.
To this end, we develop duality for problem (4). This duality leads to a simple formula
to obtain approximate optimizers of the initial problem (1) once the dual formulation (4)
is solved. It shows that any optimizer (λ? , (h?i )i=1,...,d , g ? ) of (4) gives an approximate
optimizer µ? of (1) by setting µ? equal to the second marginal of π ? , where π ? is defined by
the Radon-Nikodym derivative
d
dπ ? X
(x, y) := βγ0 f (y) − g ? (x) − h?i (yi ) − λ? c(x, y) . (6)
dθ i=1
4
Problem φθ,γ (f ) fits into the standard framework in which neural networks can be applied
to parametrize the functions hi ∈ Cb (R) and g ∈ Cb (Rd ). We justify this parametrization
theoretically by giving conditions under which the approximation error vanishes for a infinite-
size neural network. In Section 3, we give details concerning the numerical solution of
φθ,γ (f ) using neural networks, which encompasses the choice of the neural network structure,
hyperparameters and optimization method.
This approach based on neural networks is the main reason to derive and study the
penalized problem (4). Nonetheless, problem (4) is interesting in its own right and by
no means limited to the application of neural networks: it may be efficiently solved using
advanced first-order methods, see e.g. Nesterov (2012). We thank an anonymous referee for
pointing this out to us.
The remainder of the paper is structured as follows. In the subsequent Section 1.3, we
provide a overview of the relevant literature. Our main results can be found in Section
2, which consists of three parts: First, we state and prove the general form of the duality
between (1) and (3) and derive some implications thereof. In the second part of Section 2,
we study the penalization introduced in equation (4) above. Third, we give conditions under
which φθ,γ (f ) can be approximated with neural networks. Section 3 gives implementation
details. Section 4 is devoted to four toy examples, which aim to shed some light on the
developed concepts. In the final Section 5, the acquired techniques are applied to a real
world example. We thereby demonstrate how to implement robust risk aggregation with
neural networks in practice.
1.3. Related literature

There are three different strings of literature, which are relevant in the present context:
Firstly, literature on risk aggregation; secondly, literature on model ambiguity and particu-
larly on ambiguity sets constructed using the Wasserstein distance; thirdly, recent applica-
tion of neural networks in finance and related optimization problems.
Risk aggregation
In Section 5, we motivate from an applied point of view why there is interest in risk bounds
for the sum of losses of which the marginal distributions are known. The theoretical in-
terest in this topic started with the following questions: How can one compute bounds for
the distribution function of a sum of two random variables when the marginal distributions
are fixed? This problem was solved in 1982 by Makarov (1982) and Rüschendorf (1982).
Starting with the work of Embrechts and Puccetti (2006) more than 20 years later, the
higher dimensional version of this problem was studied extensively due to its relevance for
risk management. We refer to Embrechts, Wang, and Wang (2015) and Puccetti and Wang
(2015) for an overview of the developments concerning risk aggregation under dependence
uncertainty, as this problem was coined. Let us mention that Puccetti and Rüschendorf
(2012) introduced the so-called rearrangement algorithm, which is a fast procedure to nu-
merically compute the bounds of interest. Applying this algorithm to real-world examples
demonstrates a conceptual drawback of the assumption that no information concerning the
dependence of the marginal risk is available: The implied lower and upper bound for the
aggregated risk are impractically far apart.
5
Hence, some authors recently tried to overcome this drawback and to come up with
more realistic bounds by including partial information about the dependence structure.
For instance, Puccetti and Rüschendorf (2012) discuss how positive, negative or indepen-
dence information influence the above risk bounds; Bernard, Rüschendorf, and Vanduffel
(2017) derive risk bounds with constraints on the variance of the aggregated risk; Bernard,
Rüschendorf, Vanduffel, and Wang (2017) consider partially specified factor models for the
dependence structure. The interested reader is referred to Rüschendorf (2017) for a recent
review of these and related approaches. Finally, we want to point out the intriguing contri-
bution by Lux and Papapantoleon (2016). These authors provide a framework which allows
them to derive VaR-bounds if (a) extreme value information is available, (b) the copula
linking the marginals is known on a subset of its domain and (c) the latter copula lies in the
neighborhood of a reference copula as measured by a statistical distance.
Since our paper aims to contribute to this string of literature, let us point out that the
latter mentioned type of partial information about the dependence structure used in Lux
and Papapantoleon (2016) is similar in spirit to our approach. We emphasize that Lux and
Papapantoleon use statistical distances which are different to the transportation distance dc
defined in the previous subsection.
Model Ambiguity
There is an obvious connection of problem (1), which is studied in this paper, with the
following minimax stochastic optimization problem
min max EQ [f (x, ξ)] , (7)

x∈X Q∈Q
where X ⊂ Rm , f : Rm × Ξ → R, ξ is a random vector whose distribution Q is supported

on Ξ ⊂ Rd and Q is a nonempty set of probability distributions, referred to as ambiguity
set. Problems of this form recently became known as distributionally robust stochastic opti-
mization problems. As pointed out by Shapiro (2017), there are two natural and somewhat
different approaches to constructing the ambiguity set Q. On the one hand, ambiguity sets
have been defined by moment constraints, see Delage and Ye (2010) and references therein.
An alternative approach is to assume a reference probability distribution Q̄ is given and de-
fine the ambiguity set by all distributions which are in the neighborhood of Q̄ as measured
by a statistical distance. To the best of our knowledge two distinct choices of this statistical
distance have been established in the literature: The φ-divergence and the Wasserstein dis-
tance. Concerning ambiguity sets constructed using the φ-divergence we refer toBayraksan
and Love (n.d.) and references therein. In the following, we focus on approaches which rely
on the Wasserstein distance to account for model ambiguity. G. Pflug and Wozabal (2007)
were the first to study these particular ambiguity sets. Mohajerin Esfahani and Kuhn (2018)
showed that distributionally robust stochastic optimization problems over Wasserstein balls
centered around a discrete reference distribution possess a tractable reformulation: under
mild assumptions these problems belong to the same complexity class as their non-robust
counterparts. The duality result driving this insight was also proven by Blanchet and Murthy
(2019),Gao and Kleywegt (2016) and Bartl, Drapeau, and Tangpi (2019) based on differ-
ent techniques and assumptions. These contribution indicate that distributionally robust
stochastic optimization using the Wasserstein distance developed into an active field of re-
search in recent years. For instance, Zhao and Guan (2018) and Hanasusanto and Kuhn
(2018) adapted similar ideas in the context of two-stage stochastic programming and Chen,
6
Yu, and Haskell (2019) and Yang (2017) study distributionally robust Markov decision pro-
cesses using the Wasserstein distance. Obloj and Wiesel (2018) analyze a robust estimation
method for superhedging prices relying on a Wasserstein ball around the empirical measure.
Most relevant in the context of our paper are the following two references: Gao and Kley-
wegt (2017b) put two Wasserstein-type-constraints on the probability distribution Q in (7):
Q has to be close in Wasserstein distance to a reference distribution Q̄, while the dependence
structure implied by Q has to be close, again in Wasserstein distance, to a specified reference
dependence structure. In their follow-up paper, Gao and Kleywegt (2017a) consider problem
(1) in the context of stochastic optimization, i.e. in the framework (7). The contribution of
this paper, i.e. their duality result and the LP-formulation, is already reviewed in the above
overview. In addition, the authors provide numerical experiments in portfolio selection and
nonparametric density estimation.
Neural networks in finance and optimization

Applications of neural networks have vastly increased in recent years. Most of the popularity
arose from successes of neural networks related to data representation tasks, e.g. related to
pattern recognition, image classification, or task-specific artificial intelligence. In contrast
to such a utilization, neural networks have also been applied strictly as a tool to solve
certain optimization problems. This is the way we use neural networks in this paper, and
they have found similar uses in various areas related to finance. Among others, they were
applied to solving high dimensional partial differential equations and stochastic differential
equations (see e.g. Beck, Becker, Grohs, Jaafari, & Jentzen, 2018; Berner, Grohs, & Jentzen,
2018; Weinan, Han, & Jentzen, 2017) as well as backward stochastic differential equation
(Henry-Labordere, 2017), in optimal stopping (Becker, Cheridito, & Jentzen, 2019), optimal
hedging with respect to a risk measure (Buehler, Gonon, Teichmann, & Wood, 2019), and
superhedging (Eckstein & Kupper, 2019).
For more classical learning tasks where neural networks are applied, ideas from optimal
transport and distributional robustness are also used. While the settings are often quite
different in nature to the one in this paper, the optimization problems which are eventually
implemented are nevertheless similar. Most related to the current paper are settings in
which optimal transport type of constraints are solved via a penalization or regularization
method. Examples include generative models for images (see e.g. Gulrajani, Ahmed, Ar-
jovsky, Dumoulin, & Courville, 2017; Roth, Lucchi, Nowozin, & Hofmann, 2017), optimal
transport and calculation of barycenters for images (see e.g. Seguy et al., 2017), martingale
optimal transport (see e.g. Henry-Labordere, 2019), or distributional robustness methods
applied to learning tasks (see e.g. Blanchet, Kang, & Murthy, 2019; Gao, Chen, & Kleywegt,
2017).
2. Results
2.1. Duality
Let X = X1 × X2 × · · · × Xd be a Polish space, and denote by P(X) the set of all Borel
probability measures on X. Throughout, we fix a reference probability measure µ̄ ∈ P(X).
For i = 1, ..., d, we denote by µi := µ ◦ pr−1
i the i-th marginal of µ ∈ P(X), where pri : X →
Xi is the projection pri (x) := xi . Further, let κ : X → [1, ∞) be a growth function of the
7
Pd
form κ(x1 , ..., xd ) = i=1 κi (xi ), where each κi : Xi → [1, ∞) is continuous and satisfies
κ dµ̄i < ∞. We further assume one of the following: Either κ has compact sublevel sets,2
R
Xi i
or Xi = Rdi for all i = 1, ..., d. Denote by Cκ (X) and Uκ (X) the spaces of all continuous,
respectively upper semicontinuous functions f : X → R such that f /κ is bounded. Recall
that Cb (X) denotes the set of all continuous and bounded functions on X.
In the following we fix a continuous function c : X × X → [0, ∞) such that c(x, x) = 0 for
all x ∈ X. The cost of transportation between µ̄ and µ in P(X) with respect to the cost
function c is defined as
Z
dc (µ̄, µ) := inf c (x, y) π(dx, dy), (8)
π∈Π(µ̄,µ) X×X
where Π(µ̄1 , . . . , µ̄d ) denotes the set of all µ ∈ P(X) such that µi = µ̄i for all i = 1, . . . , d.
The elements in Π(µ̄1 , . . . , µ̄d ) are referred to as couplings of the marginals µ̄1 , . . . , µ̄d . Al-
though the computation of the convex conjugate in the following result relies on Bartl,
Drapeau, and Tangpi (2019), we do not need their growth condition on the cost function
c. The main reason we do not require this condition is that continuity from above of the
functional (9) - which corresponds to tightness of the considered set of measures - is already
obtained by the imposed marginal constraints.
Theorem 1. For every convex and lower semicontinuous function ϕ : [0, ∞] → [0, ∞] such
that ϕ(0) = 0 and ϕ(∞) = ∞, and all f ∈ Uκ (X), it holds that
nZ o
max f dµ − ϕ(dc (µ̄, µ)) (9)
µ∈Π(µ̄1 ,...,µ̄d ) X
n d Z
X Z h d
X i o
∗
= inf ϕ (λ) + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx) ,
λ≥0, hi ∈Cκi (Xi ) Xi X y∈X
i=1 i=1
where ϕ∗ denotes the convex conjugate of ϕ, i.e. ϕ∗ (λ) = supx≥0 {λx − ϕ(x)}.
Proof. 1) Define the optimal transport functional ψ1 : Cκ (X) → R by

d Z
nX o
ψ1 (f ) := inf hi dµ̄i : hi ∈ Cκi (Xi ) such that ⊕di=1 hi ≥ f ,
i=1 Xi
Pd
where ⊕di=1 hi : X → R is defined as ⊕di=1 hi (x) := i=1 hi (xi ). We show that ψ1 is
continuous from above on Cκ (X), i.e. for every sequence (f n ) in Cκ (X) such that f n ↓ f ∈
Cκ (X) one has ψ1 (f n ) ↓ ψ1 (f ). Fix ε > 0 and hi ∈ Cκi (Xi ) such that ⊕di=1 hi ≥ f and
Pd R
ψ1 (f ) + 3ε ≥ i=1 Xi hi dµ̄i . Since f 1 ∈ Cκ (X) and hi ∈ Cκi (Xi ), there exists a constant
M > 0 such that |f 1 | ≤ M κ and |hi | ≤ M κi . By assumption, Xi κi dµ̄i < +∞ for all
R
i = 1, . . . , d. We now show that ψ1 (f n0 ) ≤ ψ1 (f ) + ε, which we do separately depending on

whether we assume κ has compact sublevel sets, or Xi = Rdi .
Pd R
• Let κ have compact sublevel sets. Choose z > 0 such that i=1 Xi 4M (κi − dz )+ dµ̄i ≤
ε
3 . By Dini’s lemma there exists n0 ∈ N such that f
n0
≤ ⊕di=1 hi + 3ε on the compact
2 Thisis for example satisfied if all κi have compact sublevel sets, since the sublevel sets are by continuity
closed and further by positivity of κi it holds {κ ≤ c} ⊆ {κ1 ≤ c} × ... × {κd ≤ c}.
8
{κ ≤ 2z}. Since it further holds κ11{κ>2z} ≤ 2(κ − z)+ ≤ 2 ⊕di=1 (κi − dz )+ , one obtains
f n0 = 11{κ≤2z} f n0 + 11{κ>2z} f n0
ε
≤ 11{κ≤2z} ⊕di=1 hi + 11{κ>2z} f n0 +
3
ε
= ⊕di=1 hi + 11{κ>2z} (f n0 − ⊕di=1 hi ) +
3
ε
≤ ⊕di=1 hi + 11{κ>2z} 2M κ +
3
z + ε
≤ ⊕di=1 hi + 4M κi − +
d 3
d
and hence ψ1 (f n0 ) ≤ i=1 Xi hi + 4M (κi − dz )+ dµ̄i + 3ε ≤ ψ1 (f ) + ε.
P R
Pd R
• Let Xi = Rdi . Choose z > 0 such that i=1 Xi 4M (κi − dz )+ dµ̄i ≤ 6ε . Choose Ri > 0
Pd c ε
such that i=1 µ̄i (B(0, Ri ) ) · 4M z < 6 , where B(0, r) is the open euclidean ball
around 0 of radius r. By Dini’s lemma there exists n0 ∈ N such that f n0 ≤ ⊕di=1 hi + 3ε
on the compact K := K1 × ... × Kd := B(0, R1 + 2) × ... × B(0, Rd + 2). Since
11B(0,Ri +1)c is upper semicontinuous, we can find continuous and bounded functions
Pd R
gi such that 11B(0,Ri +1)c ≤ gi and i=1 Xi gi dµ̄i · 4M z < 6ε (since gi approximates
11B(0,Ri +1)c and 11B(0,Ri +1)c ≤ 11B(0,Ri )c ). With some of the same steps as in the case
where κ has compact sublevel sets, one obtains
f n0 = 11K f n0 + 11K c f n0
ε
≤ ⊕di=1 hi + 11K c (f n0 − ⊕di=1 hi ) +
3
ε
≤ ⊕di=1 hi + 11K c 11{κ>2z} 2M κ + 11K c 11{κ≤2z} 2M κ +
3
z + ε
≤ ⊕di=1 hi + 4M κi − + 11K c 4M z +
d 3
d
z + d ε
≤ ⊕i=1 hi + 4M κi − + (⊕i=1 11Kic )4M z +
d 3
d
z + ε
≤ ⊕i=1 hi + 4M κi − + 4M z · gi +
d 3
Pd R
and hence ψ1 (f n0 ) ≤ i=1 Xi hi + 4M (κi − dz )+ + 4M z · gi dµ̄i + 3ε ≤ ψ1 (f ) + ε.
This shows that ψ1 is continuous from above on Cκ (X). Moreover, its convex conjugate is
given by
Z d Z
X
∗
ψ1,Cκ
(µ) = sup f dµ − inf hi dµ̄i
f ∈Cκ (X) X hi ∈Cκi (Xi ) Xi
i=1
⊕di=1 hi ≥f
Z d Z
X
= sup sup f dµ − hi dµ̄i
hi ∈Cκi (Xi ) f ∈Cκ (X) X i=1 Xi
⊕di=1 hi ≥f
d Z
(
0 if µ ∈ Π(µ̄1 , . . . , µ̄d )
X Z
= sup hi dµ − hi dµ̄i = (10)
hi ∈Cκi (Xi ) i=1 X Xi +∞ else
for all µ ∈ Pκ (X), where Pκ (X) denotes the set of all µ ∈ P(X) such that κ ∈ L1 (µ). Notice
that Π(µ̄1 , . . . , µ̄d ) ⊂ Pκ (X).
9
2) Define ψ2 : Cκ (X) → R ∪ {+∞} by
Z
ψ2 (f ) := inf ϕ∗ (λ) +

sup f (y) − λc(x, y) µ̄(dx) .
λ≥0 X y∈X
By definition ψ2 is convex and increasing. Further, since inf λ≥0 ϕ∗ (λ) = ϕ∗ (0) = 0 and
f λc (x) := supy∈X {f (y) − λc(x, y)} ≥ f (x) for all λ ≥ 0, it follows that
Z
∗
ψ2 (f ) ≥ inf ϕ (λ) + f dµ̄ > −∞
λ≥0 X
1
for all f ∈ Cκ (X), where we use that f ∈ L (µ̄). For the convex conjugates one has
Z
∗
ψ2,C κ
(µ) := sup f dµ − ψ2 (f )
f ∈Cκ (X) X
Z
∗
= sup f dµ − ψ2 (f ) =: ψ2,U κ
(µ) = ϕ(dc (µ̄, µ)) (11)
f ∈Uκ (X) X
for all µ ∈ Pκ (X). Indeed, for every µ ∈ Pκ (X) one has

∗ ∗ ∗
ψ2,Uκ
(µ) ≥ ψ2,C κ
(µ) ≥ ψ2,C b
(µ) = ϕ(dc (µ̄, µ)),
where the last equality is shown in Bartl, Drapeau, and Tangpi (2019, Proof of Thm. 2.4,
Step 4), notably without using the growth condition for c imposed in Bartl, Drapeau, and
∗
Tangpi (2019). It remains to show that ψ2,U κ
(µ) ≤ ϕ(dc (µ̄, µ)). Since ϕ(∞) = ∞, the case
dc (µ̄, µ) = ∞ is obvious. Suppose dc (µ̄, µ) < +∞. Note that X f λc dµ̄ is well-defined since
R
f λc ≥ f ∈ L1 (µ̄), so that the negative part of the integral is finite. Further, by eliminating
redundant choices in supremum and infimum of the convex conjugate, one obtains
nZ Z o
∗ ∗
ψ2,U κ
(µ) = sup f dµ − inf
∗
ϕ (λ) + f λc dµ̄ .
f ∈Uκ (X) X λ≥0, ϕ (λ)<∞, X
f λc dµ̄<∞
R
ψ2 (f )<∞ X
For every ε > 0, f ∈ Uκ (X) and λ ≥ 0 such that ψ2 (f ) < +∞, ϕ∗ (λ) < +∞, X f λc dµ̄ <
R
+∞, it follows that X f dµ, ϕ∗ (λ) and X f λc dµ̄ are real numbers, so that
R R
Z Z
∗
f dµ − ϕ (λ) − f λc dµ̄ − ε
X X
Z Z
≤ f dµ − λdc (µ̄, µ) + ϕ(dc (µ̄, µ)) − f λc dµ̄ − ε
X X
Z Z Z
≤ f (y) π(dx, dy) − λc(x, y) π(dx, dy) − f λc (x) π(dx, dy) + ϕ(dc (µ̄, µ))
X×X X×X X×X
Z
λc(x, y) + f λc (x) − λc(x, y) − f λc (x) π(dx, dy) + ϕ(dc (µ̄, µ))

≤
X×X
= ϕ(dc (µ̄, µ)),

R
where π ∈ Π(µ̄, µ) is such that λdc (µ̄, µ) + ε ≥ X×X λc d π, and where we used that
ϕ∗ (λ) ≥ λdc (µ̄, µ) − ϕ(dc (µ̄, µ)) and f (y) ≤ λc(x, y) + f λc (x). Taking the supremum over
∗
all such f and λ implies ψ2,U κ
(µ) ≤ ϕ(dc (µ̄, µ)).
3) For f ∈ Uκ (X) define the convolution
ψ(f ) := inf {ψ1 (g) + ψ2 (f − g)}
g∈Cκ (X)
( d Z Z d
)
X h X i
∗
= inf ϕ (λ) + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx) .
λ≥0, hi ∈Cb (Xi ) Xi X y∈X
i=1 i=1
10
For the associated convex conjugates it follows from (10) and (11) that
Z
∗
ψCκ (µ) = sup sup f dµ − ψ1 (g) − ψ2 (f − g)
f ∈Cκ (X) g∈Cκ (X) X
Z Z
∗ ∗
= sup g dµ − ψ1 (g) + sup f dµ − ψ2 (f ) = ψ1,Cκ
(µ) + ψ2,Cκ
(µ)
g∈Cκ (X) X f ∈Cκ (X) X
Z Z
∗ ∗
= ψ1,C κ
(µ) + ψ2,Uκ
(µ) = sup g dµ − ψ1 (g) + sup f dµ − ψ2 (f )
g∈Cκ (X) X f ∈Uκ (X) X
Z
= sup sup f dµ − ψ1 (g) − ψ2 (f − g)
f ∈Uκ (X) g∈Cκ (X) X
(
∗ ϕ dc (µ̄, µ) if µ ∈ Π(µ̄1 , . . . , µ̄d )
= ψU κ
(µ) =
+∞ else
for all µ ∈ Pκ (X).

4) For every f ∈ Uκ (X) one has
Z Z
∗
ψ(f ) ≥ f dµ̄ − ψUκ (µ̄) = f dµ̄ > −∞
X X
∗ 1
since ψU κ
= ϕ(dc (µ̄, µ̄)) = ϕ(0) = 0 and f ∈ L (µ). This shows that ψ : Uκ (X) → R.
(µ̄)
By definition, ψ is convex and increasing. Moreover, ψ is continuous from above on Cκ (X),
since for every sequence (f n ) in Cκ (X) such that f n ↓ 0 one has
inf ψ(f n ) = inf inf ψ1 (g) + ψ2 (f n − g)

n∈N n∈N g∈Cκ (X)
= inf inf ψ1 (f n − g) + ψ2 (g)

g∈Cκ (X) n∈N

= inf ψ1 (−g) + ψ2 (g) = ψ(0),
g∈Cκ (X)
where we use that ψ1 is continuous from above on Cκ (X) by the first step. Since also
∗ ∗
ψC κ
= ψU κ
on Pκ (X) by the third step, it follows from (Bartl, Cheridito, & Kupper, 2019,
Theorem 2.2.) and (Bartl, Cheridito, & Kupper, 2019, Proposition 2.3.) that ψ has the
dual representation
nZ o nZ o
∗
ψ(f ) = max f dµ − ψC κ
(µ) = max f dµ − ϕ(d c (µ̄, µ))
µ∈Pκ (X) X µ∈Π(µ̄1 ,...,µ̄d ) X
for all f ∈ Uκ (X).
Corollary 1. For every f ∈ Uκ (X) one has

Z
max f dµ (12)
µ∈Π(µ̄1 ,...,µ̄d ) X
dc (µ̄,µ)≤ρ
n d Z
X Z h d
X i o
= inf ρλ + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx)
i=1 i=1
(13)
for each radius ρ ≥ 0.
Proof. This follows directly from Theorem 1 for ϕ given by ϕ(x) = 0 if x ≤ ρ and ϕ(x) = +∞
if x > ρ. In that case the conjugate is given by ϕ∗ (λ) = ρλ.
11
Remark 1. Let us comment on the interpretation of the dual problem (13): Roughly speak-
ing, in case ρ = ∞, the above result collapses to the duality of multi-marginal optimal
transport. On the other hand, if ρ = 0, both the primal problem (12) and the dual problem
R
(13) reduce to f dµ̄. Finally, if one drops the constraint µ ∈ Π(µ̄1 , . . . , µ̄d ) in the primal
formulation (12), the functions h1 = h2 = · · · = 0.
From a computational point of view the penalty function ϕ(x) = x is of particular interest
since the optimization in Theorem 1 over the Lagrange multiplier λ disappears.
Corollary 2. For every f ∈ Uκ (X) one has

nZ o
max f dµ − dc (µ̄, µ)
µ∈Π(µ̄1 ,...,µ̄d ) X
d Z
nX Z h d
X i o
= inf hi dµ̄i + sup f (y) − hi (yi ) − c(x, y) µ̄(dx) .
hi ∈Cκi (Xi ) Xi X y∈X
i=1 i=1
Proof. This follows from Theorem 1 for ϕ(y) = y. Indeed, as the convex conjugate is given
by ϕ∗ (λ) = 0 for 0 ≤ λ ≤ 1 and ϕ∗ (λ) = +∞ for λ > 1, the infimum in Theorem 1 is
attained at λ = 1.
Corollary 3 (Gao and Kleywegt (2017a)). Let f (x) = max1≤m≤M (am )> x + bm for x ∈ Rd ,
Pn
am ∈ Rd , and bm ∈ R. Let µ̄ = n1 j=1 δxj for given points x1 , . . . , xn in Rd .3 Let the same
points x1 , . . . , xn define the sets Xi , i.e. Xi = {x1i , . . . , xni } and X = X1 × · · · × Xd . Let the
Pd
cost function c be additively separable, i.e. c(x, y) = i=1 ci (xi , yi ). Then, the dual problem
(13) is equivalent to the linear program
d n n
n 1 XX 1X o
min λρ + hi (j) + g(j) (14)
λ, hi (j), g(j), ui (j,m) n i=1 j=1 n j=1
d
X
s.t.: g(j) ≥ bm + ui (j, m) j = 1, . . . , n; m = 1, . . . , M (15)
i=1
j
ui (j, m) ≥ am k k
i xi − hi (k) − λci (xi , xi ) i = 1, . . . , d; m = 1, . . . , M ; j, k = 1, . . . , n (16)
λ ≥ 0. (17)
The proof can be found in Gao and Kleywegt (2017a). For the convenience of the reader,
we also present a direct proof of Corollary 3.
Pn
Proof. Due to the assumptions that Xi = {x1i , . . . , xni } and µ̄ = n1 j=1 δxj , the term
Pn
h dµ̄i in (13) can be written as n1 j=1 hi (xj ) and we shall use that hi (xj ) = hi (xji ).
R
Xi i
Pd
Combing these facts with the assumption c(x, y) = i=1 ci (xi , yi ), the dual problem (13)
3 Note that δx (A) = 1 if x ∈ A, and δx (A) = 0 otherwise.
12
can be reformulated as
( d n
1 XX
min λρ + hi (xj )
λ≥0, hi n i=1 j=1
n d d
)
1X n X
m m
X
j
o
+ max max a yi + b − hi (y) − λc(x , y)
n j=1 y∈X 1≤m≤M i=1 i i=1
( d n
1 XX
= min λρ + hi (xj )
λ≥0, hi n i=1 j=1
n d
)
1X n X j
o
+ max max bm + am
i y i − hi (y i ) − λc i (x i , yi )
n j=1 1≤m≤M y∈X i=1
The assumption X = X1 × · · · × Xd implies that for any y ∈ X we can find indices k1 , . . . , kd

with 1 ≤ ki ≤ n for i = 1, . . . , d such that y = (xk11 , . . . , xkdd ). We introduce the auxiliary
variables g(j) ∈ R for j = 1, . . . , n and write the above problem as
d n n
n 1 XX 1X
min λρ + hi (xj ) + g(j) :
λ≥0, hi , g(j) n i=1 j=1 n j=1
d
X o
ki ki j ki
g(j) ≥ max bm + am x
i i − hi (x i ) − λc i (x i , xi ) , 1 ≤ j ≤ n, 1 ≤ m ≤ M
k1 ,...,kd
i=1
d n n
n 1 XX 1X
= min λρ + hi (xj ) + g(j) :
λ≥0, hi , g(j) n i=1 j=1 n j=1
d
X o
ki ki j ki
g(j) ≥ bm + max am
i xi − hi (xi ) − λci (xi , xi ) , 1 ≤ j ≤ n, 1 ≤ m ≤ M ,
1≤k≤n
i=1
where we use that

d
X d
X
ki j ki j
max am ki
i xi − hi (x ) − λci (xi , xi ) = max am x
i i
k
− h (x
i i
k
) − λc (x
i i , xk
i ) .
k1 ,...,kd 1≤k≤n
i=1 i=1
Introducing the auxiliary variables ui (j, m) ∈ R, where i = 1, . . . , d, j = 1, . . . , n and m =

1, . . . , M , in order to remove the remaining max function, together with the notation hi (j) :=
hi (xj ) ∈ R yields the assertion.
2.2. Penalization
The aim of this section is to modify the functional (9), so that it allows for a numerical
solution by neural networks.
To focus on the main ideas, we assume that κ is bounded, i.e. we restrict to continuous
bounded functions, as well as ϕ = ∞11(ρ,∞) as in the overview in Section 1.2. Hence, in line
with Corollary 1 we consider the functional
Z
φ(f ) := max f dµ (18)
µ∈Π(µ̄1 ,...,µ̄d ) X
dc (µ̄,µ)≤ρ
n d Z
X Z h d
X i o
= inf ρλ + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx)
i=1 i=1
13
for all f ∈ Cb (X) and a fixed radius ρ > 0. For simplicity, we assume that the function
f λc (x) = supy∈X {f (y) − λc(x, y)} is continuous for all λ ≥ 0 and f ∈ Cb (X).4 In that case,
the functional φ1 : Cb (X 2 ) → R defined as
n d Z
X Z o
φ1 (f ) := inf λρ + hi dµ̄i + g dµ̄ (19)
λ≥0, hi ∈Cb (Xi ), g∈Cb (X): Xi X
i=1
g(x)≥f (x,y)− d
P
i=1 hi (yi )−λc(x,y)
satisfies φ(f˜) = φ1 (f˜ ◦ pr2 ) for all f˜ ∈ Cb (X), i.e. φ1 is an extension of φ from Cb (X) to
Cb (X 2 ). The functional φ1 can be regularized by penalizing the inequality constraint. To
do so, we consider the functional
n d Z
X Z
φθ,γ (f ) := inf λρ + hi dµ̄i + g dµ̄
λ≥0, hi ∈Cb (Xi ), Xi X
g∈Cb (X) i=1
Z d
X o

+ βγ f (x, y) − g(x) − hi (yi ) − λc(x, y) θ(dx, dy) (20)
X2 i=1
1
for a sampling measure θ ∈ P(X 2 ), and a penalty function βγ (x) := γ β(γx), γ > 0, where
β : R → [0, ∞) is convex, nondecreasing, differentiable, and satisfies β(x)x → ∞ for x → ∞.
Let βγ∗ (y) := supx∈R {xy − βγ (x)} for y ∈ R+ , and notice that βγ∗ (y) = γ1 β ∗ (y).
Notice that the introduced penalization method is in no way specific to the penalized
constraint and hence rather general. It includes as a special case the well-studied entropic
penalization related to the Sinkhorn algorithm, which is often applied to optimal transport
problems. The penalization can also be seen as a regularization since it introduces a slight
smoothness bias for the probability measures in the optimization problem. On the one hand,
this leads to an approximation error, which can be made arbitrarily small theoretically, see
Propositions 1 and 2 below. On the other hand, the resulting smoothness is also seen as
a feature which produces good empirical results (see e.g. Cuturi, 2013; Genevay, Peyré, &
Cuturi, 2017).
The following lemma sets the stage for Proposition 1, in which we provide a duality
result for φθ,γ (f ), study the respective relation of primal and dual optimizers, and outline
convergence φθ,γ (f ) → φ(f ) for γ → ∞.
Lemma 1. For every f ∈ Cb (X 2 ) one has
φ1 (f˜) + φ2 (f − f˜) ,

φθ,γ (f ) = inf (21)
f˜∈Cb (X 2 )
R
where φ2 (f ) := X2
βγ (f ) dθ. Moreover, the convex conjugate of φθ,γ is given by
βγ∗ dπ
R R
dθ if π1 = µ̄, π2 ∈ Π(µ̄1 , ..., µ̄d ) and c dπ ≤ ρ
φ∗θ,γ (π) = X2 dθ X2
∞ else
dπ
for all π ∈ P(X 2 ) with the convention dθ = +∞ if π is not absolutely continuous with
respect to θ.
4 By definition, f λc is lower semicontinuous. Moreover, if c(x, y) = c̄(x − y) for a continuous function
c̄ : X → [0, ∞) with compact sublevel sets, then f λc is upper semicontinuous and therefore continuous.
This for instance holds for c̄(x) = di=1 |xi | or c̄(x) = di=1 |xi |2 corresponding to the first and second
P P
order Wasserstein distance on R . d
14
Proof. Observe that for every f ∈ Cb (X 2 ) one has
φ1 (f˜) + φ2 (f − f˜)

inf
f˜∈Cb (X 2 )
n d Z
X Z Z o
= inf λρ + hi dµ̄i + g dµ̄ + βγ (f − f˜) dθ
λ≥0, hi ∈Cb (Xi ), g∈Cb (X), f˜∈Cb (X 2 ): Xi X X2
i=1
f˜(x,y)≤g(x)+ d
P
i=1 hi (yi )+λc(x,y)
where the right hand side is equal to φθ,γ (f ). This follows from the dominated convergence
Pd
theorem applied on the sequence f˜n (x, y) = min{n, g(x) + i=1 hi (yi ) + λc(x, y)}.
As for the calculation of the convex conjugate, we first show that φ∗θ,γ (π) = ∞ whenever
π1 6= µ̄ or π2 6∈ Π(µ̄1 , ..., µ̄d ). Indeed, since
d Z
nX Z
φθ,γ (f ) ≤ inf hi dµ̄i + g dµ̄
hi ∈Cb (Xi ), g∈Cb (X) Xi X
i=1
Z d
X o

+ βγ f (x, y) − g(x) − hi (yi ) θ(dx, dy)
X2 i=1
d Z
nX Z o
≤ inf hi dµ̄i + g dµ̄ + βγ (0),
hi ∈CP
b (Xi ), g∈Cb (X): Xi X
g(x)+ i hi (yi )≥f (x,y) i=1
it follows that φθ,γ is bounded above by a multi-marginal transport problem. As the re-
spective convex conjugate is +∞, it follows that φ∗θ,γ (π) = ∞ for all π ∈ P(X 2 ) such that
π1 6= µ̄ or π2 6∈ Π(µ̄1 , ..., µ̄d ). Conversely, if π1 = µ̄ and π2 ∈ Π(µ̄1 , ..., µ̄d ) one has
nZ o
∗
φθ,γ (π) = sup f dπ − φθ,γ (f )
f ∈Cb (X 2 ) X2
n Z Z o
= sup sup − λρ + f˜ dπ − βγ (f˜ − λc) dθ
λ≥0 f˜∈Cb (X 2 ) X2 X2
n Z Z Z o
= sup sup − λρ + λ c dπ + f¯dπ − βγ (f¯) dθ
λ≥0 f¯∈Cb (X 2 ) X2 X2 X2
Z Z
βγ∗ dπ

= sup λ c dπ − ρ + dθ dθ.
λ≥0 X2 X2
(R
βγ∗ dπ
R
X2 dθ dθ if X2
c dπ ≤ ρ
= .
+∞ else
Pd
Here, the second equality follows by substituting f˜(x, y) = f (x, y) − i=1 hi (yi ) − g(x)
and using the structure of the marginals of π. The third equality follows by setting f¯n =
f˜ + min{n, λc} and using the dominated convergence theorem. Finally, the fourth equality
follows by a standard selection argument, see e.g. the proof of (Bartl, Cheridito, & Kupper,
2019, Lemma 3.5).
Proposition 1. Suppose there exists π ∈ P(X 2 ) such that φ∗θ,γ (π) < ∞. Then it holds:
(a) For for every f ∈ Cb (X 2 ) one has

Z Z
βγ∗ dπ

φθ,γ (f ) = max f dπ − dθ dθ. (22)
π∈Π(µ̄,µ̄
R 1 ,...,µ̄d ): X2 X2
c dπ≤ρ
15
(b) Let f ∈ Cb (X 2 ). If g ? ∈ Cb (X), h?i ∈ Cb (Xi ), i = 1, ..., d, and λ? ≥ 0 are optimizers
of (20), then the probability measure π ? defined by
d
dπ ? X
(x, y) := βγ0 f (x, y) − g ? (x) − h?i (yi ) − λ? c(x, y)
dθ i=1
is a maximizer of (22). Hence µ? := π ? ◦ pr−1

2 is a feasible solution to (18).
(c) Fix f ∈ Cb (X) and ε > 0. Suppose that µε ∈ P(X) is an ε-optimizer of (18), and
πε ∈ Π(µ̄, µε ) satisfies α := X 2 β ∗ dπ
R R
dθ dθ < ∞, and X 2 c dπε ≤ ρ. Then one has
ε
β(0) α
φθ,γ (f ◦ pr2 ) − ≤ φ(f ) ≤ φθ,γ (f ◦ pr2 ) + ε + .
γ γ
Proof. (a) To show duality, we check condition (R1) from (Bartl, Cheridito, & Kupper,
2019, Theorem 2.2), i.e. we have to show that φθ,γ is real-valued and continuous from above.
That φθ,γ is real-valued follows from the assumption that there exists π ∈ P(X 2 ) such that
φ∗θ,γ (π) < ∞. Indeed, it holds ∞ > φ∗θ,γ (π) ≥ f dπ − φθ,γ (f ) and hence φθ,γ (f ) > −∞
R
(while φθ,γ (f ) < ∞ is clear) for all f ∈ Cb (X 2 ).

To show continuity from above, let (fn ) be a sequence in Cb (X 2 ) such that fn ↓ 0. In
view of (21), one has
inf φ1 (f˜) + φ2 (fn − f˜) = φ1 (f˜) + φ2 (−f˜) = φθ,γ (0),

inf φθ,γ (fn ) = inf inf
n∈N f˜∈Cb (X 2 ) n∈N f˜∈Cb (X 2 )
since inf n∈N φ2 (fn − f˜) = φ2 (−f˜) by dominated convergence. By Lemma 1, the claim follows.
(b) That π ? is a feasible solution in the sense that π1? = µ̄, π2? ∈ Π(µ̄1 , . . . , µ̄d ), and
c dπ ? = ρ whenever λ? > 0, follows from the first order conditions. For instance, since
R
X2
the derivative of (20) in direction g ? + tg vanishes at t = 0, it follows X g dµ̄ − X 2 g ◦
R R
pr1 dπ ? = 0 for all g ∈ Cb (X), which shows that π1? = µ̄. This also implies that π ? is a
probability measure. Similarly, π2? ∈ Π(µ̄1 , ..., µ̄d ) follows by considering the derivative in
direction h?i + thi , and X 2 λ? c dπ ? = λ? ρ from the first order condition for λ. Hence, as π ?
R
is feasible it follows from Lemma 1 that

Z
φθ,γ (f ) ≥ f dπ ? − φ∗θ,γ (π ? )
X2
Z
X X
f βγ0 f − g ? − h?i − λ? c − βγ∗ βγ0 f − g ? − h?i − λ? c dθ

=
X2 i i
Z X Z X
g? + h?i + λ? c dπ ? + h?i − λ? c dθ

= βγ f − ĝ −
X2 i X2 i
XZ Z Z X
?
h?i ?
h?i − λ? c dθ

=λ ρ+ dµ̄i + g dµ̄ + βγ f − ĝ −
i Xi X X2 i
= φθ,γ (f )
where we use that βγ∗ βγ0 (x) = βγ0 (x)x − βγ (x) for all x ∈ R. This shows that π ? is an

optimizer.
(c) By restricting the infimum in (20) to those λ ≥ 0, hi ∈ Cb (Xi ), g ∈ Cb (X) such that
16
P
g(x) ≥ f (y) − i hi (yi ) − λc(x, y), it follows that
n d Z
X Z o
φθ,γ (f ◦ pr2 ) ≤ inf λρ + hi dµ̄i + g dµ̄ + βγ (0)
λ≥0, hi ∈Cb (Xi ), g∈Cb (X): Xi X
i=1
g(x)≥f (y)− d
P
i=1 hi (yi )−λc(x,y)
β(0)
= φ(f ) + ,
γ
where the last equality follows from (19). As for the second inequality, since µε ∈ P(X) is
an ε-optimizer of (18), and πε ∈ Π(µ̄, µε ) one has
Z Z
α
φ(f ) ≤ f dµε + ε = f ◦ pr2 dπε − φ∗θ,γ (πε ) + φ∗θ,γ (πε ) + ε ≤ φθ,γ (f ◦ pr2 ) + + ε.
X X2 γ
The proof is complete.
The following Proposition shows that the convergence result from Proposition 1 (c) can
be applied whenever the sampling measure is chosen as θ = µ̄ ⊗ µ̄1 ⊗ ... ⊗ µ̄d , and a minimal
growth condition on the cost function c is imposed, see Proposition 2 (b)(ii) below. In this
case, existence of π ∈ P(X 2 ) such that φ∗θ,γ (π) < ∞ holds as well, so that Proposition
1 applies in full. It is worth pointing out that this result below trivially transfers to all
1 ⊗...µ̄d
reference measures θ for which the Radon-Nikodym derivative dµ̄⊗µ̄dθ is bounded. As
pointed out by a referee, it is especially desirable to have the values αε := X 2 β ∗ dπ
R
dθ dθ
ε
uniformly bounded in ε (respectively growing in a certain order depending on ε), so that a

linear convergence φθ,γ (f ) → φ(f ) for γ → ∞ (respectively a slower order of convergence)
is implied. The result below does not achieve this, and we believe this to be a non-trivial
task left open for future work.
Proposition 2. (a) Let µi ∈ P(Xi ) for i = 1, ..., d. Let ν ∈ Π(µ1 , ..., µd ) and let µ :=
w
µ1 ⊗ µ2 ⊗ ... ⊗ µd . Then there exist ν n ∈ Π(µ1 , ..., µd ) for n ∈ N such that ν n → ν for
n dν n
n → ∞, ν µ and there exist constants 0 < Cn < ∞ such that dµ ≤ Cn µ-a.s..
R Pd
(b) Let ηi : Xi → [0, ∞) be Borel measurable with Xi ηi dµ̄i < ∞. Let η(x) = i=1 ηi (xi ).
Assume there is a constant C > 0 such that for all x, y ∈ X it holds c(x, y) ≤ C(η(x)+
η(y)). Let θ = µ̄ ⊗ µ̄1 ⊗ ... ⊗ µ̄d . Then it holds:
(i) For π ∗ ∈ Π(µ̄, µ̄1 , ..., µ̄d ) with c dπ ∗ ≤ ρ, there exist πε ∈ Π(µ̄, µ̄1 , ..., µ̄d ) for
R
w
ε > 0 such that πε → π ∗ for ε → 0, πε θ, dπ
R
dθ is θ-a.s. bounded and
ε
c dπε ≤ ρ.
(ii) The condition for Proposition 1 (c) is satisfied, i.e. for every ε > 0 there exists
µ ∈ P(X) which is an ε-optimizer of (18), and πε ∈ Π(µ̄, µε ) satisfying αε :=
R ε ∗ dπε R
X2
β dθ dθ < ∞, and X 2 c dπε ≤ ρ.
Proof. Proof of (a): We endow each Xi by a compatible metric mi and without loss of
Pd
generality we specify the metric on X = X1 × X2 × ... × Xd as m(x, y) = i=1 mi (xi , yi ).
Step 1) Construction of ν n : Let Kin ⊆ Xi be compact for i = 1, ..., d such that µi (Kin ) → 1
for n → ∞, and K n = K1n × ... × Kdn . Notably ν(K n ) → 1 follows.
Further, since each Kin is compact we can choose a Borel partition An of K n , i.e.
.
∪A∈An A = K n ,
where each A ∈ An is Borel measurable and satisfies A = A1 × ... × Ad as well as

1
max sup m(x, y) ≤ .
A∈An x,y∈A n
17
A simple way of obtaining such a partition is to first cover each Kin by countably many
open balls of radius 1/(2dn), choosing a finite subcover, and building a partition of Kin out
of that subcover. For the partition of K n simply choose all product sets that can be formed
from the partitions of the Kin .
Note that (K n )c is the disjoint union of 2d − 1 many product sets, namely Kn,1
c
× Kn,2 ×
... × Kn,d , ..., Kn,1 × ... × Kn,d . We denote the union of An with the family of these 2d − 1
c c
many product sets by An , which is a partition of X. Define

X
ν n := ν(A) · (ν|A )1 ⊗ ... ⊗ (ν|A )d
A∈An
where implicitly the sum is understood to only include those terms where ν(A) > 0 and ν|A
is then defined as ν|A (B) = ν(A ∩ B)/ν(A). We won’t make this explicit, but every time
we divide by ν(A) or µi (Ai ) we will assume it is one of the relevant terms with ν(A) > 0,
where of course ν(A) > 0 implies µi (Ai ) > 0 for all i = 1, ..., d and A ∈ An .
Step 2) Verifying marginals of ν n : We only show that ν1n = µ1 , while the other marginals
follow in the same way by symmetry. Let B1 ⊆ X1 be Borel. It holds
X
ν n (B1 × X2 × ... × Xd ) = ν(A) · (ν|A )1 (B1 )
A∈An
X
= ν(A) · ν|A (B1 × X2 × ... × Xd )
A∈An
X
= ν (A ∩ (B1 × X2 × ... × Xd ))
A∈An
= ν(B1 × X2 × ... × Xd ) = ν1 (B1 ).

w
Step 3) Convergence ν n → ν for n → ∞: Let f : X → R be bounded and Lipschitz
continuous with constant L. We have to show f dν n → f dν for n → ∞. Since ν(A) =
R R
ν n (A) for all A ∈ An (and in particular ν n ((K n )c ) = ν((K n )c )), it holds

Z Z X Z Z
f dν n − f dν ≤ kf k∞ 2ν((K n )c ) + f 11A dν n − f 11A dν
A∈An
X
n c
≤ kf k∞ 2ν((K ) ) + sup |f (x) − f (y)|ν(A)
x,y∈A
A∈An
X 1
≤ kf k∞ 2ν((K n )c ) + ν(A)L
n
A∈An
n→∞
−→ 0.
dν n
Step 4) Absolute continuity and boundedness of dµ : Let
1
Cn = max .
A∈An : ν(A)d−1
ν(A)>0
Given arbitrary Borel sets Bi ⊆ Xi for i = 1, ..., d, we show that ν n (B1 × ... × Bd ) ≤
Cn µ(B1 × ... × Bd ). Once this is shown, ν n (S) ≤ Cn µ(S) follows for all Borel sets S ⊆ X by
the monotone class theorem, which will immediately yield both absolute continuity ν n µ
n
and dν
dµ ≤ Cn .
18
For A ∈ An it holds
(ν|A )i (Bi ) = ν(A)−1 ν(A1 × ... × (Ai ∩ Bi ) × ... × Ad )

≤ ν(A)−1 · νi (Ai ∩ Bi ) = ν(A)−1 · µi (Ai ∩ Bi ).
It follows
X
ν n (B1 × ... × Bd ) = ν(A) · (ν|A )1 (B1 ) · ... · (ν|A )d (Bd )
A∈An
X 1
≤ µ1 (B1 ∩ A1 ) · ... · µd (Bd ∩ Ad )
ν(A)d−1
A∈An
X
≤ Cn µ((B1 × ... × Bd ) ∩ A) = Cn µ(B1 × ... × Bd )
A∈An
where we note that for the last equality to hold, the second to last sum over A ∈ An includes
all terms, not just the ones where ν(A) > 0 (this only makes the sum larger). The proof of
part (a) is complete.
w
Proof of (b), (i): We first show the following: If Π(µ̄, µ̄1 , ..., µ̄d ) 3 πε → π ∈ Π(µ̄, µ̄1 , ..., µ̄d )
R R
for ε → 0, then c dπε → c dπ for ε → 0.
To prove it, note that the growth condition implies that for every δ > 0, we can choose
R R
a compact set K ∈ X × X such that supε>0 K c c dπε ≤ δ and K c c dπ ≤ δ. Restricted to
K, c is bounded from above, say by a constant M > 0 (note c is non-negative). Hence for
R R
all ε > 0, it holds | c dπε − min{c, M } dπε | ≤ 2δ, and the same for π instead of πε . Since
R R R
min{c, M } is continuous and bounded, we get | c dπε − c dπ| ≤ 4δ + | min{c, M } dπε −
R
min{c, M } dπ| → 4δ for ε → 0. Letting δ go to zero yields the claim.
For the statement of the part (b)(i), consider for λ ∈ (0, 1) the coupling πλ := λπ ∗ + (1 −
w
λ) µ̄ ⊗ (x 7→ δx ) . Then it holds c dπλ < ρ since c(x, x) = 0. Further πλ → π ∗ for λ → 1.
R
R
By approximating every πλ by a πλ,ε via part (a), c dπλ,ε ≤ ρ follows automatically for ε
R R
small enough, which follows by c dπλ,ε → c dπλ for ε → 0 as shown above. The claim
hence follows by a diagonal argument.
Proof of (b), (ii): Any optimizer µ? ∈ Π(µ̄1 , ..., µ̄d ) of (18) and a corresponding coupling
π ∈ Π(µ̄, µ̄1 , ..., µ̄d ) with c dπ ∗ ≤ ρ can be approximated via part (b) by (πε )ε>0 which
∗
R
satisfies all required properties. Taking µε as the projection of πε onto the second component
w
of X × X, i.e. µε = πε ◦ ((x, y) 7→ y)−1 , we get that µε → µ? and hence f dµε → f dµ?
R R
which means after a possible change of indices, µε is an ε-optimizer of (18).
2.3. Approximation with neural networks

Let us shortly recap. In Section 2.1, we show that our original problem
Z
φ(f ) := max f dµ,
µ∈Π(µ̄1 ,...,µ̄d ) Rd
dc (µ̄,µ)≤ρ
can be written as
n d Z
X Z h d
X i o
inf ρλ + hi dµ̄i + sup f (y) − hi (yi ) − λc(x, y) µ̄(dx) ,
λ≥0, hi ∈Cb (R) R Rd y∈Rd
i=1 i=1
19
for all continuous and bounded functions f ∈ Cb (X). We then proceed in Section 2.2 with
considering the penalized version of the latter problem
n d Z
X Z
φθ,γ (f ) := inf λρ + hi dµ̄i + g dµ̄
λ≥0, R Rd
i=1
hi ∈Cb (R), g∈Cb (Rd ) Z d
X o
+ βγ f (y) − hi (yi ) − λc(x, y) − g(x) θ(dx, dy) .
R2d i=1
We provide sufficient conditions for the convergence φθ,γ (f ) → φ(f ) for γ → ∞. The
subsequent and final step is to theoretically justify that neural networks can indeed be used
to approximate φθ,γ (f ) and thereby φ(f ).
To do so, let us introduce the following notation: We denote by A0 , . . . , Al affine transfor-
mations with A0 mapping form Rd0 to Rm , A1 , . . . , Al−1 mapping form Rm to Rm and Al
mapping form Rm to R. We further fix a non-constant, continuous and bounded activation
function ϕ : R → R. The evaluation of ϕ at a vector y ∈ Rm is understood pointwise,
i.e. ϕ(y) = (ϕ(y1 ), . . . , ϕ(ym )). Then,
N(m, d0 ) := g : Rd0 → R : x 7→ Al ◦ ϕ ◦ Al−1 ◦ · · · ◦ ϕ ◦ A0 (x)

defines the set of neural network functions mapping to the real numbers R with a fixed
number of layers l ≥ 2 (at least one hidden layer), input dimension d0 and hidden dimension
m. The following is a classical universal approximation theorem for neural networks.
Theorem 2 (Hornik (1991)). Let h ∈ Cb (RN ). For any finite measure ν ∈ P(RN ) and
ε > 0 there exists m ∈ N and hm ∈ Nm,N such that kh − hm kLp (ν) ≤ ε.
For Proposition 3 below, let Xi := RNi for i = 1, ..., d and thus X = RN with N =
Pd
i=1 Ni . Define the function F (λ, h1 , ..., hd , g) by
φθ,γ (f ) = inf F (λ, h1 , ..., hd , g).

λ≥0,
hi ∈Cb (R ), g∈Cb (RN )
Ni
We define the neural network approximation of φθ,γ (f ) by
φm
θ,γ (f ) = inf F (λ, hm m m
1 , ..., hd , g ).
λ≥0,
hm m
i ∈N(m,Ni ), g ∈N(m,N )
The following result showcases a simple, yet general setting in which the neural network
approximation is asymptotically precise:
Proposition 3. Fix f ∈ Cb (RN ). Let p > 1, βγ (x) := γ1 (γx)p+ and assume c ∈ Lp (θ). Then
φm
θ,γ (f ) → φθ,γ (f ) for m → ∞.
Proof. By the choice of the activation function, all network functions are continuous and
bounded, and hence φm θ,γ (f ) ≥ φθ,γ (f ).
It therefore suffices to show that for any ε > 0, there exists an m ∈ N such that
φθ,γ (f ) ≥ φm
θ,γ (f ) − ε.
20
Choose any feasible (λ, h1 , ..., hd , g) for φθ,γ (f ). By Theorem 2 we can find a sequence
(λm , hm m m m
1 , ..., hd , g ) with hi ∈ N(m, Ni ) for i = 1, ..., d and g
m
∈ N(m, N ) such that for
m → ∞ it holds
λm → λ,
hm
i → hi in Lp (µ̄i ) for i = 1, ..., d,
((x, y) 7→ hm
i (yi )) → ((x, y) 7→ hi (yi )) in Lp (θ) for i = 1, ..., d,
m
g →g in Lp (µ̄),
m
((x, y) 7→ g (x)) → ((x, y) 7→ g(x)) in Lp (θ).
As c ∈ Lp (θ), it also holds λm c → λc in Lp (θ) and hence

d
X
((x, y) 7→ f (x, y) − g m (x) − hm m
i (yi ) − λ c(x, y))
i=1
d
X
→ ((x, y) 7→ f (x, y) − g(x) − hi (yi ) − λc(x, y))
i=1
in Lp (θ) as m → ∞. As the mapping x 7→ x+ is Lipschitz-1, taking only the positive parts

lets the above convergence remain valid. As convergence in Lp (θ) implies convergence of the
p-th moment, we obtain F (λm , hm m m
1 , ..., hd , g ) → F (λ, h1 , ..., hd , g) as m → ∞.
For a given ε > 0, choose a feasible (λ, h1 , ..., hd , g) for φθ,γ (f ), such that φθ,γ (f ) ≥
F (λ, h1 , ..., hd , g)− 2ε . Due to the above proven convergence, we can find (λm , hm m m
1 , ..., hd , g )
m m
with hi ∈ N(m, Ni ) for i = 1, ..., d and g ∈ N(m, N ) such that
ε ε ε
φθ,γ (f ) ≥ F (λ, h1 , ..., hd , g) − ≥ F (λm , hm 1 , ..., h m m
d , g ) − − ≥ φθ,γ (f )m − ε.
2 2 2
Remark 2. While the previous result obtains, for a fixed f ∈ Cb (RN ), the convergence
φmθ,γ (f ) → φθ,γ (f ) for m → ∞, approximation errors for finite values of m are also of
interest. In general, there is hope to achieve this. In the setting of Proposition 3 with
p ∈ N≥2 : Assume λ, h1 , ..., hd , g are optimizers of φθ,γ (f ) which have sufficient moments
and hm m m
1 , ..., hd , g are network functions which approximate h1 , ..., hd , g up to ε accuracy
for the respective Lp -norms. Then it holds
φθ,γ (f ) − φm
θ,γ (f ) ≤ C · ε,
where C is a constant only depending on f, λ, c, h1 , ..., hd , g.

Pd
Proof. First, define T (x, y) := f (y) − i=1 hi (yi ) − λc(x, y) − g(x) and T m (x, y) := f (y) −
Pd m m
i=1 hi (yi ) − λc(x, y) − g (x). Using the function F introduced above, we have that
φm m m m
θ,γ (f ) ≤ F (λ, h1 , ..., hd , g ) and hence
φm m m m m m
θ,γ (f ) − φθ,γ (f ) ≤ |F (λ, h1 , ..., hd , g ) − F (λ, h1 , ..., hd , g)|
d
X p m p
≤ khi − hm m
i kL1 (µ̄i ) + kg − g kL1 (µ̄) + kT+ − (T )+ kL1 (θ)
i=1
and by the inequality quoted in Lemma 2 (see Appendix A.1) it holds
kT+p − (T m )p+ kL1 (θ) ≤ C̃kT+ − T+m kLp (θ) ≤ C̃kT − T m kLp (θ) ≤ C̃(d + 1)ε
21
Pp−1
with C̃ = k=0 kT+ kkLp (θ) kT+m kp−1−k m
Lp (θ) . Since φθ,γ (f ) ≤ φθ,γ (f ), we obtain
φθ,γ (f ) − φm m
θ,γ (f ) = φθ,γ (f ) − φθ,γ (f ) ≤ (C̃(d + 1) + (d + 1)) · ε.
While the constant C̃ formally depends on the p-th moments of both T and T m , to eliminate
the dependence on T m one can use kT m kLp (θ) ≤ kT kLp (θ) + kT − T m kLp (θ) ≤ kT kLp (θ) + 1
for ε small enough.
3. Implementation
This section aims to give specifics regarding the implementation of problem (4) as an ap-
proximation of problem (1). In particular, the following points are discussed:
1. The choice of θ, βγ and neural network structure.
2. The optimization method for the parameters of the neural network.
3. How to evaluate the quality of the obtained solution.
4. The typical runtime.
3.1. Choice of θ, βγ and neural network parameters

The neural network structure to approximate the space Cb (Rd ) is chosen as a feedforward
neural network with 5 layers (input, output, 3 hidden layers) with hidden dimension 64 · d.
The basic idea behind this was to increase the size of the neural networks until a further
increase no longer changes the outcome of the optimization. As an activation function we
use the ReLu function.
To be precise, the neural network functions we work with are of the form
x 7→ A4 ◦ ϕ ◦ A3 ◦ ϕ ◦ A2 ◦ ϕ ◦ A1 ◦ ϕ ◦ A0 (x),
|{z} | {z } | {z } | {z } | {z }
output 4th layer 3rd layer 2nd layer input
layer layer
where the activation function ϕ is chosen as ϕ(x) = max{0, x}. The mappings Ai are
affine transformations, i.e. Ai (x) = Mi x + bi for a matrix Mi ∈ Rdi,2 ×di,1 and a vector
bi ∈ Rdi,2 . The matrices M0 , ..., M4 and vectors b0 , ..., b4 are the parameters of the network
that one optimizes for. As described above, the dimensions of these parameters are chosen
as follows: The input dimension d0,1 = d is given by the dimension of the input vector x,
and di,2 = di+1,1 has to hold for compatibility. The hidden dimension di,1 for i = 1, 2, 3, 4
is set to 64 · d, while the output dimension d4,2 is always 1.
The penalization function βγ is set to βγ (x) = γ max{0, x}2 . On the one hand, this choice
has shown to be stable across all examples. On the other hand, the theory in Proposi-
tion 3 applies precisely to penalization functions of this kind. is Regarding the parameter
γ, we usually first solve the problem with a low choice, like γ = 50, which leads to stable
performance. Then, we gradually increase γ until a further increment no longer leads to a
significant change in the objective value of (4). Regarding instabilities when γ is set too
large, see Section 3.3.
Concerning the sampling measure θ, the basic choice is to use θprod = µ̄ ⊗ µ̄1 ⊗ ... ⊗ µ̄d .
Particularly for low values of ρ, this is suboptimal: Indeed, for ρ = 0 in problem (22), we
22
know that the optimizer is always of the form π diag = µ̄⊗K, where K is the stochastic kernel
K : Rd → P(Rd ) given by K(x) = δx . Since π diag is singular with respect to θprod , using only
θprod as sampling measure, one can expect high errors arising from penalization for small
values of ρ. It hence makes sense to use (among other possibilities) θhalf := 21 θprod + 12 π diag .
This is very specific however, and most solutions will not put mass precisely where π diag
puts mass. Hence, we add some noise to π diag , e.g. via a Gaussian measure with covariance
matrix ε2 : θthird := 12 θprod + 41 π diag + 41 (π diag ∗ N (0, ε2 )), where ∗ denotes convolution of
measures. The sampling measure θhalf is used in all four toy examples in Section 4, whereas
we rely on θthird in the final case study in Section 5.
3.2. Optimization method for the parameters of the neural network

This subsection may as well be called “Training”. However, as we do not employ neural
networks in a training-testing kind of environment, this might be misleading.
Regarding this topic, trial and error is especially useful, as the simple goal is to obtain a
stable convergence. For the parameters of the neural network, we use the Adam Optimizer
with parameters β1 = 0.99 and β2 = 0.995. For the learning rate, we start with α = 0.0001
for the first N0 iterations of training, and then decrease it by a factor of 0.98 each 50
iterations for a total of Nfine further iterations. We use a batch size (the number of samples
generated in each iteration for the measures involved) of around 27 to 216 , see Section 3.3
for more details. N0 and Nfine are chosen problem specific: for simple problems in Section
4, N0 = 15000 and Nfine = 5000, while for the DNB case study in Section 5 they are chosen
as N0 = 60000 and Nfine = 30000.
The parameter λ has to be optimized separately from the parameters of the neural net-
work, since the value of λ is clearly more important than any single parameter of the network.
To be precise after a fixed number Nλ of iterations, λ is updated by
1 X i
λ 7→ λ − αλ ∇λ ,
Nλ
i∈I
where I are the previous Nλ many iterations, αλ is the learning rate and ∇iλ is the sample
derivative of the objective function with respect to λ in iteration i. Concerning the choice of
αλ and Nλ , we usually first set αλ to around 0.1 (depending on the problem), and decrease
it in the same fashion as α, while Nλ is set to 200. Before we update λ for the first time,
we wait until the network parameters are in a sensible region, which typically takes around
1000 to 10000 iterations.
If another parameter is involved in the optimization (such as τ in the examples that
calculate the AVaR), we employ the same method as for λ, but we update this parameter
even more rarely (once every 1000-2500 iterations), and wait longer at the start to update
it the first time (between 5000 and 20000 iterations).
3.3. Evaluation of the solution quality

To evaluate the obtained solutions, we found that mostly three aspects have to be considered:
(a) Is the neural network structure rich enough?
(b) How large is the effect of penalization?
(c) Has the numerical optimization procedure converged to a (near) minimum?
23
Section A.2 shows how this is put into practice for an exemplary case.
Part (a) is the seemingly simplest, as we found the choice of network structure described
in Subsection 3.1 to be sufficient for all problems, in the sense that further increasing the
network size does not alter the obtained solution.
Regarding part (b), the most useful observation is the following: As described in Propo-
sition 1, the numerical solution via neural networks can be used to obtain an approximate
solution µ? of the primal problem. If we evaluate the integral f dµ? and compare it to
R
φθ,γ (f ), the difference is φ∗θ,γ (π ? ), which can be seen as the effect of penalization. If φ∗θ,γ (π ? )
has a small value, it indicates a small effect of penalization. The second observation is that
φθ,γ (f ) is increasing in γ, and under the conditions studied in Proposition 1 and 2 converges
to φ(f ). Hence, starting with a low value of γ and increasing it until no further change is
observed is a good strategy. When doing so, values of γ that are too large can of course
be detrimental regarding part (c), and hence when increasing γ a concurrent adaptation of
training parameters (like learning rate or batch size) is often necessary.
Regarding part (c), we found that most instabilities could be solved by increasing the
batch size. This increase naturally comes with longer run times. Especially if γ has to be
increased a lot to allow for a small effect of penalization, very large batch sizes were required
(e.g. in the DNB case study, we use a batch size of 215 ). To obtain structured criteria for
convergence (compared to just evaluating convergence visually), we can again use the dual
relation arising from Proposition 1. Indeed, we can exploit the fact the numerically obtained
µ? (as the second marginal of π ? from Proposition 1 (b)) is an approximately feasible solution
to problem (1) if the algorithm has converged. Hence, as a necessary criteria for convergence,
one can check whether µ? satisfies criteria for feasibility. To this end, one can compare the
marginals of µ? to those of µ̄ (we did this mostly by visually evaluating empirical marginals
of µ? ) as well as estimate dc (µ̄, µ? ).5
3.4. Runtime
Generally speaking, calculations using neural networks can benefit greatly from paralleliza-
tion, e.g. by employing GPUs. For most of our examples, this was not necessary however,
and the respective calculations could be performed quickly (i.e. in between one and five
minutes) even with a regular CPU (intel i5-7200U; dual core with 2.5-3.1 GHz each). For
the DNB case study however, a single run with stable learning parameters takes around 20
hours on a CPU. By utilizing a single GPU (Nvidia GeForce RTX 2080 Ti) this is reduced
to around 30 minutes. Notably, in the smaller examples there was less speed-up when using
GPU compared to CPU, the reason being that the problems were too small to fully utilize
parallel capabilities of a GPU.
4. Examples
The aim of this section is to illustrate how the above introduced concepts can be used to
numerically solve given problems. In particular we demonstrate that neural networks are
able to (1) achieve a satisfactory empirical performance for all problems considered, (2)
naturally determine the structure of the worst-case distribution via Proposition 1 (b) and
(3) deal with problems, that cannot be reformulated as LP. Concerning the latter point, we
5 Clearly,
dc (µ̄, µ? ) should be be bounded by ρ and in the optimum equal to ρ (if one is not already in the
edge case where ρ is so large that it doesn’t have an effect).
24
consider both a function f , which cannot be written as the maximum of affine functions,
as well as a cost function c, which is not additively separable. Additionally, we make a
case for the generality of our duality result: we replace the distance constraint by a distance
penalty and fix the distribution of bivariate, rather than univariate, marginals. Furthermore
by considering unbounded functions f , we shed some light on the necessity of the growth
functions κ used in Theorem 1. To achieve all of these points, we consider three examples
with increasing difficulty.
Concerning the notation in this section, c denotes the cost function
X
c(x, y) = ||x − y||1 = |xi − yi |.
i
This notation implies that

Z d
X
dc (µ̄, µ) := inf |xi − yi | π(dx, dy),
π∈Π(µ̄,µ) Rd ×Rd i=1
is the first order Wasserstein distance with respect to the L1 -metric. On the other hand, we
consider the first order Wasserstein distance with respect to the Euclidean metric
 !1/2 
 Z d
X 
dc2 (µ̄, µ) := inf (xi − yi )2 π(dx, dy) . (23)
π∈Π(µ̄,µ) Rd ×Rd 
i=1
Notice that the cost function c2 (x, y) := ||x − y||2 is not additively separable.6
4.1. Expected maximum of two comonotone standard Uniforms

We start our exemplification with a toy example which is not connected to risk measurement.
Consider the following problem
Z
φ(f1 ) := sup E [max(U, V )] = sup max(x1 , x2 ) µ(dx), (24)
(VU )∼µ∈Π(µ̄1 ,µ̄2 ), µ∈Π(µ̄1 ,µ̄2 ), [0,1]2
dc (µ̄,µ)≤ρ dc (µ̄,µ)≤ρ
where µ̄1 = µ̄2 = U([0, 1]) are (univariate) standard uniformly distributed probability
measures and µ̄ is the comonotone copula. In other words, µ̄ is a bivariate probabil-
ity measure with standard uniformly distributed marginals which are perfectly dependent.
In the notation of the Section 2, we choose the function f as f1 (x) = max(x1 , x2 ) and
X = X1 × X2 = [0, 1] × [0, 1]. Interpreting problem (24), we aim to compute the expected
value of the maximum of two standard Uniforms under ambiguity with respect to the ref-
erence dependence structure, which is given by the comonotone coupling. Problem (24)
possesses the following analytic solution
1 + min(ρ, 0.5)
φ(f1 ) = .
2
The derivation of this solution can be found in Appendix A.3 and is based on the dual-
ity result in Corollary 1. Hence, problem (24) is well suited to benchmark the solution
method based on neural networks. In comparison, we also solve the problem with linear
programming. To be precise, we consider the following two methods:
6 Inthe literature the Wasserstein distance with respect to the Euclidean metric is usually associated with
order two, in which case the underlying cost function is additively separable.
25
0.8 0.8
0.75 0.75
0.7 0.7
0.65 0.65
0.6 0.6
0.55 0.55
0.5 0.5
0.45 0.45
0.4 0.4
0 0.1 0.2 0.3 0.4 0.5 0.6 0 0.1 0.2 0.3 0.4 0.5 0.6
Figure 1: In the left panel, the analytic solution φ(f1 ) of problem (24) is plotted as a function
of ρ and compared to corresponding numerical solutions obtained by method 1.a)
and method 2.a), which are described in Section 4.1. The right panel shows the
same for the improved methods 1.b) and 2.b).
1. We discretize the reference copula µ̄ (and thereby the marginal distributions µ̄1 and µ̄2 )
and solve the resulting dual problem by means of linear programming (see Corollary
3). There are two distinct ways to discretize µ̄:
a) We use Monte Carlo sampling. In the notation of Corollary 3, this means we sample
n points x11 , . . . , xn1 in [0, 1] from the standard Uniform distribution. Then, we set
xj2 = xj1 for j = 1, . . . , n.
b) We set the points xj1 = xj2 = 2j−1
2n for j = 1, . . . , n. As the comonotonic copula lives
only on the main diagonal of the unit square, this deterministic discretization of µ̄
in some sense minimizes the discretization error. The simple geometrical argument
used to find this discretization can be applied only due to the special structure of
the reference distribution at hand.
Let us emphasize that method 1.a) can be applied to any reference distribution µ̄. On
the other hand, method 1.b) can only be used in this particular example as µ̄ is given
by the comonotonic copula.
2. We solve the problem with the neural network approach described in the above Sec-
tion 3. As discussed, some hyperparameters need to be chosen problem specific. In
particular we set: N0 = 15000, Nfine = 5000, γ = 1280, batch size = 27 and αλ = 0.1.7
Concerning the sampling measure θ, for this example we compare
a) the basic choice θ = θprod and
b) the improved choice θ = θhalf .
To better understand these parameter choices and our neural network approach in
general, we provide a detailed convergence analysis for this example in Appendix A.2.
Figure 1 compares the two above mentioned methods to solve problem (24) for different
values of ρ. In the left panel of Figure 1, we observe that method 1.a) yields an unsatisfactory
result even though n = 250 is chosen as large as possible for the resulting LP to be solvable
by a commercial computer. This issue arises due to the poor quality of the discretization
7 We wait for 2500 iterations until updating λ for the first time where λ is initialized to λ = 0.75
26
0.8
0.75
0.7
0.65
0.6
0.55
0.5
0.45
0.4
0 0.1 0.2 0.3 0.4 0.5 0.6
Figure 2: The analytic solution φ(f1 ) of problem (24), which uses the first order Wasserstein
distance with respect to the L1 -metric, is compared to the numerical solution φ̃(f1 )
of problem (25), which uses the first order Wasserstein distance with respect to
the Euclidean metric, i.e. the L2 -metric.
resulting from Monte Carlo simulation. If one chooses the discretization as done in method
1.b), we recover the analytic solution of problem (24) as can be seen in the right panel of
Figure 1. Moreover, Figure 1 indicates that method 2, i.e. the approach presented in this
paper, yields quite good and stable results. The left panel, however, shows that for small ρ
method 2.a) does not rediscover the true solution. The reason for this is that when drawing
random samples from the chosen sampling measure θprod , it is unlikely that we sample from
the relevant region, namely the main diagonal of the unit square. As discussed in Section
3.1, method 2.b) is designed to overcome precisely this weakness and the right panel of
Figure 1 illustrates that it does.
We finalize this example by considering the Wasserstein distance with respect to the
Euclidean metric dc2 , defined in equation (23), rather than the Wasserstein distance with
respect to the L1 metric dc . Thus, we compare problem (24) to
Z
φ̃(f1 ) := sup max(x1 , x2 ) µ(dx). (25)
µ∈Π(µ̄1 ,µ̄2 ), [0,1]2
dc2 (µ̄,µ)≤ρ
Since the cost function c2 is not additively separable, φ̃(f1 ) - other than φ(f1 ) - cannot
be approximated based on Corollary 3, i.e. linear programming. Nevertheless, we can ap-
proximate φ̃(f1 ) using neural networks, which demonstrates the flexibility of our approach.8
Figure 2 compares φ(f1 ) and φ̃(f1 ) for different ρ. Note that as c(x, y) ≥ c2 (x, y) for all x, y,
dc (µ̄, µ) ≥ dc2 (µ̄, µ)1/2 for all µ̄, µ ∈ P(X). Hence, φ(f1 ) ≤ φ̃(f1 ) for fixed ρ. Figure 2 is in
line with this observation.
4.2. Average Value at Risk of two independent standard Uniforms

We increase the level of complexity slightly compared to the previous example, as we now
turn to robust risk aggregation. We aim to compute AVaRα (U +V ), where U and V are inde-
pendent standard Uniforms under ambiguity with respect to the independence assumption.
8 We use the same choices for the hyperparameters as given in point 2. above, but increase N0 considerably
for small ρ to guarantee convergence of λ, and use θhalf as a sampling measure.
27
Note that the Average Value at Risk is defined by
1
AVaRα (Y ) := min τ + E [max(Y − τ, 0)] ,
τ ∈R 1−α
see Rockafellar and Uryasev (2000). Using the first order Wasserstein distance to construct
an ambiguity set around the reference dependence structure, we are led to the following
problem
Φ2 := sup AVaRα (U + V ) (26)

(VU )∼µ∈Π(µ̄1 ,µ̄2 ),
dc (µ̄,µ)≤ρ
( Z )
1
= sup inf τ+ max(x1 + x2 − τ, 0)µ(dx) (27)
µ∈Π(µ̄1 ,µ̄2 ), τ ∈R 1−α [0,1]2
dc (µ̄,µ)≤ρ
= inf φ(f2τ ), (28)

τ ∈R
where µ̄1 = µ̄2 = U([0, 1]) are (univariate) standard uniformly distributed probability mea-
sures and µ̄ is the independence copula. In other words, µ̄ = U([0, 1]2 ) is a bivariate prob-
ability measure with independent, standard uniformly distributed marginals. Moreover, we
1
have that f2τ (x) = τ + 1−α max(x1 + x2 − τ, 0) and φ(·) is defined as in equation (1).
Notice that in the above formulation of the problem we can go from (27) to (28) since the
problem is convex in τ and concave in µ and Wasserstein balls are weakly compact. Thus,
we can apply Sion’s Minimax Theorem to interchange the supremum and the infimum in
(27).
In Appendix A.4, we derive an analytical upper and lower bound for Φ2 in (26). These
bounds are tight enough for the present purpose, which is to evaluate the performance of
the two discussed numerical methods.
Figure 3 supports the latter claim: The analytic bounds for Φ2 are rather tight when
plotted as a function of ρ. The bounds are compared to the same two numerical methods
as discussed in the previous example. With respect to the solution based on Monte Carlo
simulation and linear programming, we now average over 100 simulations for each fixed ρ.
Thus, the results in Figure 3 do not fluctuate as much as those we have seen in the left panel
of Figure 1. Nevertheless, Figure 3 shows that the solution obtained via MC and LP does
not stay within the analytic bounds - other than the solution based on our neural networks
approach. Arguably this is due to the lack of symmetry when discretizing the reference
distribution µ using Monte Carlo. Regarding runtime, both numerical methods take around
the same time to calculate the values needed for Figure 3.
We now want to illustrate a further merit of the neural networks approach, namely that
we can sample from the numerical optimizer µ? of problem (26). By doing so, we obtain
information about the structure of the worst case distribution. The samples are obtained by
acceptance-rejection sampling from the density given by Proposition 1 (b), where we replace
true optimizers by numerical ones. Figure 4 plots samples of this worst case distribution
µ? for different values of ρ. To understand the intriguing nature of the results presented
in Figure 4, we have to describe problem (26) in some more detail. It should be clear that
the comonotone coupling of the Uniforms U and V is maximizing AVaRα (U + V ) among all
possible coupling of U and V . However, one can find many different maximizing couplings.
Notably, the optimizer shown for ρ = 0.2 corresponds to the one which has the lowest relative
28
1.75
1.7
1.65
Φ2
1.6
1.55
analytic bounds
1.5
LP-solution with MC
NN-solution
1.45
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
Figure 3: The analytic upper and lower bounds of problem (26) are compared to two distinct
numerical solutions. The first numerical solution is obtained by Monte Carlo
simulation with n = 100 sample points as well as linear programming and averaged
over 100 simulations for each fixed ρ. The second numerical solution is obtained
by penalization and neural networks. The confidence level of the AVaR considered
in problem (26) is set to α = 0.7.
entropy with respect to the independent coupling among the maximizers of AVaRα (U + V ).
On the other hand, the middle panel for ρ = 0.16 motivated us to derive a coupling which -
among maximizers of AVaRα (U +V ) - we conjecture to have the lowest Wasserstein distance
to the independent coupling. This coupling is used to derive the lower bound for problem
(26) in Appendix A.4. Some features of the others couplings, e.g. for ρ = 0.08 and ρ = 0.12
came as a surprise to us: For example, the curved lines as boundary for the support are
unusual in an L1 -Wasserstein problem.
4.3. Variance of three normally distributed random variables with

distance penalization
We now leave the domain of uniformly distributed, univariate marginals and replace the
distance constraint by a distance penalty. We analyze the following problem
1
χ(f4 ) := sup Var(X1 + X2 + Y ) − dc̃ (µ̄, µ)r
X
( )∼µ∈Π(µ̄12 ,µ̄3 ) r
Y
Z
1
(x1 + x2 + y)2 − m2 µ(dx1 , dx2 , dy) − dc̃ (µ̄, µ)r ,

= sup (29)
µ∈Π(µ̄12 ,µ̄3 ) R3 r
where the cost function c̃(x, y) = 2||x − y||1 .9 We specify the reference distribution function
as follows    
0 1 0.8 0
µ̄ = N  0  ,  0.8 1 0  .
0 0 0 1
In this examples, there are two novelties that are explained in the following.
9 One can think of the factor 2 occurring in the cost function similarly to a particular choice of ρ for the
radius of the Wasserstein ball.
29
ρ = 0.00 ρ = 0.04 ρ = 0.08
ρ = 0.12 ρ = 0.16 ρ = 0.20
Figure 4: Samples from the optimizer µ? of problem (26) as obtained by the neural networks
approach are shown in form of a heatplot for six different levels of ambiguity,
i.e. ρ = 0, 0.04, 0.08, 0.12, 0.16, 0.2.
Firstly, the fact that we set µ ∈ Π(µ̄12 , µ̄3 ), means we are fixing not only the univariate
marginal distributions, which are standard normal, but also the dependence structure be-
tween the first and the second margin X1 and X2 . In this case, we assume that X1 and X2
are jointly normal with correlation 0.8. We use µ̄12 to denote the fixed, bivariate margin.
As a consequence, the model ambiguity concerns solely the dependence structure between
the third margin Y and the other two margins X1 and X2 .
Secondly, rather than a distance constraint dc̃ (µ̄, µ) ≤ ρ, we now use a distance penalty to
account for the described model ambiguity: we set ϕ(x) = 1r xr in Theorem 1. The parameter
r accounts for the degree of penalization and hence is not comparable to the radius ρ of the
Wasserstein balls described above. Instead, for r → ∞ the penalization becomes closer and
closer to the case where we impose the constraint dc̃ (µ̄, µ) ≤ 1.
These two specifications aim to demonstrate the value of the generality of Theorem 1 with
respect to both the choice of polish spaces and the modelling of ambiguity.
Even though Section 2.2 focuses on the Wasserstein ball constraint, the solution method
based on penalization and neural networks is trivially adapted to problems like (29). We
state the resulting numerical solution of problem (29) for different values of r in Table 1.
In order to make these results more concrete, we sampled 20000 values from the respective
worst case distribution µ? and report the corresponding empirical covariance matrix Σ̂µ? .
Notably the covariance matrix does not completely characterize µ? , since µ? does not have
to be a joint normal distribution.
30
2
(x1 + x2 + y) dµ? dc̃ (µ̄, µ? )
R
χ(f4 ) R3
Σ̂µ?
 
1 0.8 0
No
4.6 4.6 0  0.8 1 0 
penalization
0 0 1
 
0.998 0.801 0.847
r=1 6.16 8.08 1.92  0.801 1.008 0.853 
0.847 0.853 1.011
 
0.989 0.806 0.737
r=2 6.50 7.60 1.48  0.806 0.989 0.736 
0.737 0.736 0.997
 
0.975 0.795 0.675
r=3 6.57 7.29 1.29  0.795 0.991 0.682 
0.675 0.682 0.980
 
0.976 0.792 0.652
r=4 6.65 7.19 1.21  0.792 0.970 0.654 
0.652 0.654 0.986
 
0.991 0.803 0.554
r=∞ 6.76 6.76 1.00  0.803 0.998 0.551 
0.554 0.551 0.993
Table 1: Comparison of the numerical solutions χ(f4 ) of problem (24), computed based on
penalization and neural networks, for different values of r. We define the worst case
2
distribution µ? ∈ Π(µ̄12 , µ̄3 ) such that χ(f4 ) = R3 (x1 + x2 + y) µ? (dx1 , dx2 , dy)−
R
1 ? r
r dc̃ (µ̄, µ ) and report also the empirical covariance matrix Σ̂µ computed from
?
N = 20000 samples of µ? . The case r = ∞ corresponds to the constraint dc̃ (µ̄, µ) ≤

1.
5. DNB case study: Aggregation of six given risks

Aas and Puccetti (2014) provide a very illustrative case study of the risk aggregation at
the DNB, Norway’s largest bank. We want to make use of this example to showcase the
applicability of the novel framework presented in this paper.
The DNB is exposed to six different types of risks: credit, market, asset, operational,
business and insurance risk. Let the random variables L1 , . . . , L6 represent the marginal
risk exposures for these six risks. Per definition, risk aggregation is not concerned with the
computation of the distribution of the marginal risks. Hence, we take the corresponding
marginal distribution functions F1 , . . . , F6 as given. In this particular case, F1 , F2 and F3
are empirical cdfs originating from given samples, while L4 , L5 and L6 are assumed to be
log-normally distributed with given parameters, see Table 2.
For the purpose of risk management, the DNB needs to determine the capital to be
reserved. According to the Basel Committee on Banking Supervision (2013), this capital
requirement should be computed by the Average Value at Risk (AVaR) of the sum of these
six losses.10 The AVaR of the sum of these six losses at a specific confidence level α is defined
as

+
1 +
AVaRα L6 = min τ + E [max( L6 − τ, 0 )] , (30)
τ ∈R 1−α
10 Aasand Puccetti (2014) focus on the Value at Risk (VaR) rather than the AVaR. Since the Basel Commit-
tee on Banking Supervision recently shifted the quantitative risk metrics system from VaR to Expected
Shortfall (see Chang, Jiménez-Martı́n, Maasoumi, McAleer, & Pérez-Amaral, 2019), which is equivalent
to the AVaR, we consider the AVaR in our study.
31
Description Type Parameters/Other details
given by 2.5 Million samples;
F1 cdf of credit risk L1 empirical cdf
standard deviation σ̄1 = 644.602
F2 cdf of market risk L2 empirical cdf
F3 cdf of asset risk L3 empirical cdf
mean m̄4 = 840.735;
F4 cdf of operational risk L4 lognormal cdf
mean m̄5 = 743.345;
F5 cdf of business risk L5 lognormal cdf
mean m̄6 = 438.978;
F6 cdf of insurance risk L6 lognormal cdf
reference copula with 6 degrees of freedom
C0 student-t copula
linking L1 , . . . , L6 and correlation matrix Σ0
Table 2: Overview of the information concerning the reference distribution in the DNB case
study. The correlation matrix Σ0 is given in Appendix A.5. Fi denotes the cumu-
lative distribution function of the marginal probability measure µ̄i for i = 1, . . . , 6.
P6
where L+ 6 := i=1 Li . To evaluate expression (30), the joint distribution of L1 , . . . , L6
is needed. As the marginal distributions of L1 , . . . , L6 are known, the DNB relies on the
concept of copulas to model the dependence structure between these risks. From the above
description, it is clear that joint observations of the L1 , . . . , L6 are not available. Hence,
standard techniques to determine the copula, e.g., by fitting a copula family and the cor-
responding parameters to a multivariate data set, do not apply. A panel of experts at the
DNB therefore chooses a specific reference copula C0 , in this case a student-t copula with
six degrees of freedom and a particular correlation matrix. Such an approach is common in
practice and referred to as expert opinion.
From an academic point of view, this method for risk aggregation is not very satisfying due
to the fact that the experts’ choice of a reference dependence structure between the different
risk types might be very inaccurate. Hence, we say that there is model ambiguity with
respect to the dependence structure. It should be emphasized that a misspecification of this
reference copula chosen by expert opinion can have a significant impact on the aggregated
risk and therefore on the required capital. Table 3 supports this statement by comparing
the AVaR implied by the reference copula C0 to the AVaR implied by other dependence
structures: Without any information regarding the dependence structure between the six
risk, the lower (resp. upper) bound for the AVaR with confidence level α = 0.95 is 24165.52
(resp. 36410.12) million Norwegian kroner. Similar bounds are studied in Aas and Puccetti
(2014). As we pointed out in the literature review in Section 1.3, these bounds have been
criticized in the literature since they are too far apart for practical purposes. We therefore
apply the results derived in this paper to compute bounds for the AVaR which depend on
the level ρ of distrust concerning the reference copula C0 . Alternatively, the parameter ρ
can be understood as the level of ambiguity with respect to the reference distribution µ̄.
32
inf C∈C AVaRC +
α (L6 ) AVaRΠ +
α (L6 ) AVaRC +
α (L6 )
0
supC∈C AVaRC +
α (L6 )
24165.52 26980.64 30498.94 36410.12
Table 3: Note that we set α = 0.95. We use the rearrangement algorithm (see Aas and
Puccetti, 2014) to approximate inf C∈C AVaRC + C +
α (L6 ), while supC∈C AVaRα (L6 ) =
P6
i=1 AVaRα (Li ). The two remaining entries are computed by averaging over 50
simulation runs where 10 millions sample points are drawn in each run. Note that
Π denotes the independence copula. Thus, AVaRΠ +
α (L6 ) corresponds to the AVaR
of the sum of the six losses given that they are independent.
We define the probability measure µ̄ of the reference distribution by the following joint
cumulative distribution function
F̄ (x) = C0 (F1 (x1 ), F2 (x2 ), . . . , F6 (x6 ),
for all x ∈ R6 . Hence, the cdfs of the marginals µ̄i are given by Fi (·) for i = 1, 2, . . . , 6. The
problem of interest can be formulated as follows:
ΦC AVaRα L+

4 (α, ρ) := + inf 6 , (31)
0
L6 ∼µ∈Π(µ̄1 ,...,µ̄6 ),
dc (µ̄,µ)≤ρ
C0
AVaRα L+

Φ4 (α, ρ) := sup 6 . (32)
L+
6 ∼µ∈Π(µ̄1 ,...,µ̄6 ),
dc (µ̄,µ)≤ρ
The cost function c defining the transportation distance dc in problem (31) and (32) is set
to
d
X |xi − yi |
c(x, y) = , (33)
i=6
σ̄i
where σ̄i denotes the standard deviation of µ̄i and is given in Table 2. The rational behind
this definition of c is that we want to model the ambiguity such that it concerns solely the
dependence structure of the reference distribution. Definition (33) is a simple way to achieve
this.11
Figure 5 shows the numerical solutions of problems (31) and (32), which are computed
relying on penalization and neural networks, as a function of ρ and for α = 0.95. As a
comparison, the same problem is also solved with respect to the independence coupling Π
rather than the reference copula C0 described in Table 2. The shaded regions outline the
possible levels of risk for a given level of ambiguity ρ and the two reference structures.
On one hand, the evolution of the risk levels in ρ, combined with the given optimizers of
problems (31) and (32) can be used as an informative tool to better understand the risk
the DNB is exposed to. On the other hand, if a certain level of ambiguity is justified in
practice, the bank can assign their capital based on the corresponding worst-case value. If
for example ρ = 0.1 is decided on, the bank would have to assign 32490 capital compared
to 30499 as dictated by the reference structure C0 .
11 It should be mentioned that Gao and Kleywegt (2017a) promote the definition c(x, y) = 6i=1 |Fi (xi ) −
P
Fi (yi )|, which implies that the transportation distance dc is defined directly on the level of copulas. Even
if this approach is arguably more intuitive, we stick to definition (33) mainly for the sake of computational
efficiency.
33
Figure 5: We consider two distinct reference dependence structures, the student-t copula
C0 defined in Table 2 and the independence copula Π. The corresponding robust
C0
solutions ΦC Π
4 (α, ρ) and Φ4 (α, ρ), defined in (31), and (32) resp. Φ4 (α, ρ) and
0
Π
Φ4 (α, ρ), defined analogously, are plotted as a function of the level of ambiguity ρ.
We compare these results, which were computed relying on the concept presented
in this paper, to the known values of AVaRα (L+ 6 ) given in Table 2. Note that we
fix α = 0.95.
Analytically, one striking feature of the numerical solution with respect to C0 is worth
pointing out: The absolute upper bound is attained already for ρ ≈ 0.8, while the distance
from the reference measure to the comonotone joint distribution can be calculated to be
around 1.7. This underlines the fact that even though the comonotone distribution is a
maximizer of the worst case AVaR, there are several more, and they may be significantly
more plausible structurally than the comonotone one.
In conclusion, this paper introduces a flexible framework to aggregate different risks while
accounting for ambiguity with respect to the chosen dependence structure between these
risks. The proposed numerical method allows us to perform this task without making
restrictive assumptions about either the particular form of the aggregation functional, or
the considered distributions, or the specific way to account for the model ambiguity.
A. Appendix
A.1. A basic inequality between difference in p-th moment and
difference in p-norm
The following result is used in Remark 2. The statement and proof are taken from Norbert
(2013).
34
Lemma 2. Let p ∈ N and X, Y ∈ Lp , then
p−1
X
p p
kX − Y k1 ≤ kX − Y kp kXkkp kY kp−1−k
p .
k=0
Proof. For p = 1 the inequality obviously holds. Let p ≥ 2, then

p−1 p−1
X X 1/q
kX p − Y p k1 = (X − Y ) X k Y p−1−k ≤ kX − Y kp |X|kq |Y |(p−1−k)q . (34)
1
k=0 1 k=0
It holds for all k ∈ {0, ..., p − 1} that
|X|kq |Y |(p−1−k)q ≤ kXkkq p−kq

p kY kp . (35)
1
For k ∈ {0, p − 1} (35) is immediate. For 0 < k < p − 1 (35) follows by Hölder’s inequality
applied with r = p/kq. Putting (34) and (35) together, we obtain
p−1 p−1
X 1/q X
kX p − Y p k1 ≤ kX − Y kp kXkkq p−kq
p kY kp ≤ kX − Y kp kXkkp kY kpp−1−k
k=0 k=0
and thus the claim.
A.2. Convergence analysis for Example 4.1

This section is meant to demonstrate how to assess the quality of the obtained numerical
solution. We consider the case ρ = 0.25 in Figure 1. We use the sampling measure θprod
and compare the following three parameter settings:
(i) γ = 100, batch size = 1024, N0 = 15000 and Nfine = 5000.
(ii) γ = 2500, batch size = 1024, N0 = 15000 and Nfine = 5000.
(iii) γ = 2500, batch size = 16, N0 = 7500 and Nfine = 2500.

Figure 6 examines the contrast between setting (i) and (ii). As can be seen, in both settings
the algorithm appears to converge in a stable way. In setting (i), there is however an apparent
difference between the dual value φθ,γ (f1 ) and the primal value f1 dµ? , which is computed
R
using the worst case distribution µ? . We can therefore conclude that the penalization in
setting (i) is insufficient, i.e. the penalization parameter γ is chosen too low. This is clearly
not the case for setting (ii). Figure 7 shows that both a small batch size and a small number
of iterations lead to bad numerical behavior.
A.3. Proof for Section 4.1

We want to derive the analytic solution of problem (24). To do so, the concept of copulas
turns out to be rather useful. We refer to Nelsen (2007) for an introduction to this topic. Let
C denote the set of all copulas and let the comonotonic copula be denoted by M (u1 , u2 ) =
min(u1 , u2 ), for all u1 , u2 ∈ [0, 1]. Using this notation, we can rewrite problem (24) and
show the following:
Z
1 + min(ρ, 0.5)
φ1 (f ) = sup max(u1 , u2 )dC(u1 , u2 ) = .
C∈C, [0,1]2 2
dc (M,C)≤ρ
35
Figure 6: Comparison of the parameter setting (i) with γ = 100 and (ii) with γ = 2500, while
batch size = 1024, N0 = 15000 and Nfine = 5000 in both settings. The upper left
panel shows the dual value φθ,γ (f1 ) as well as the primal value f1 dµ? . The upper
R
right resp. lower left panel illustrates the convergence of λ resp. dc (µ̄, µ? ). The
lower right panel plots 5000 samples from the first marginal µ?1 of the worst case
distribution µ? . Note that this histogram is also representative for the second
marginal µ?2 . The computation time is 205 seconds in both cases.
Proof. First, we derive an upper bound for dc (M, C), where C ∈ C. Since M lives on the
main diagonal of the unit square, the vertical (or horizontal) projection of the mass of an ar-
bitrary copula C ∈ C onto M is feasible transportation plan with costs [0,1]2 |u1 −u2 |dC(u).12
R
The latter expression appears in the definition of a concordance measure called Spearman’s
footrule and it known to be maximized by the countermonotonic copula W (u1 , u2 ) :=
max(u1 + u2 − 1, 0) for all u1 , u2 ∈ [0, 1] cite¡see¿liebscher2014copula. Hence, we obtain
that
Z Z Z 1
dc (M, C) ≤ |u1 − u2 |dC(u) ≤ |u1 − u2 |dW (u) = |2u1 − 1|du1 = 0.5.
[0,1]2 [0,1]2 0
Second, we show that this upper bound is attend for dc (M, W ). The Kantorovich Rubinstein
12 Note that |u1 − u2 | is the distance (and thereby the cost of transportation) of any point-mass C(u1 , u2 )
to the main diagonal M (u1 , u2 ).
36
Figure 7: Convergence analysis of the parameter setting (iii) where γ = 2500, batch size
= 16, N0 = 7500 and Nfine = 2500. The upper left panel shows the dual value
φθ,γ (f1 ) as well as the primal value f1 dµ? . The upper right resp. lower left panel
R
illustrates the convergence of λ resp. dc (µ̄, µ? ). The lower right panel plots 5000
samples from the first marginal µ?1 of the worst case distribution µ? . Note that
this histogram is also representative for the second marginal µ?2 . The computation
time is 45 seconds.
duality yields that

Z Z
dc (M, W ) = sup h(u)dM (u) − h(v)dW (v)
|h(u)−h(v)|≤c(u,v) [0,1]2 [0,1]2
Z Z
≥ u1 + u2 dM (u) − v1 + v2 dW (v) = 1 − 0.5 = 0.5,
[0,1]2 [0,1]2
where we simply set h(u) = u1 + u2 to obtain the inequality. Since dc (M, C) ≤ 0.5 for all
C ∈ C, we have that dc (M, W ) = 0.5.
Combining these two observations yields for ρ > 0.5 that
Z Z
3
φ(f1 ) = sup max(u1 , u2 )dC(u1 , u2 ) = max(u1 , u2 )dW (u1 , u2 ) = .
C∈C [0,1]2 [0,1]2 4
It follows that we can assume ρ ≤ 0.5 for the remainder of the proof.
Let us define the copula Rα as follows:
(
W (u1 , u2 ) if 1−α 1+α
2 ≤ u1 , u2 ≤ 2
Rα (u1 , u2 ) = ,
M (u1 , u2 ) else
37
for α ∈ [0, 1]. Using the same projection-argument as in the beginning of the proof, it follows
that
Z Z (1+α)/2
dc (M, Rα ) ≤ |u1 − u2 |dRα (u) = |2u1 − 1|du1 = α2 /2.
[0,1]2 (1−α)/2
Thus, dc (M, R√2ρ ) ≤ ρ, which implies

Z
1+ρ
φ(f1 ) ≥ max(u1 , u2 )dR√2ρ (u1 , u2 ) = .
[0,1] 2 2
By Corollary 1, we have that

( 2 Z
X 1
φ(f1 ) = inf λρ + hi (ui )dui
λ≥0,hi ∈C([0,1]) 0
i=1
Z " 2 2
# )
X X
+ sup max(v1 , v2 ) − hi (vi ) − λ |ui − vi | dM (u) .
[0,1]2 v∈[0,1]2 i=1 i=1
ρ
Plugging in the value λ = 0.5 and setting h1 (u) = h2 (u) = u/2, yields φ1 (f ) ≤ 2 + 12 +0.
A.4. Proof for Section 4.2

We now derive the analytic bounds for Φ2 , i.e. the solution of problem (26), which are
plotted in Figure 3.
Let us start by proving the following upper bound
2√

ρ
Φ2 ≤ min 1 + α, 2 − 2 − 2α + , (36)
3 2(1 − α)
where Φ2 is defined in (26).
Proof. Due to Corollary 1,

( 2 Z
X 1
Φ2 = inf λρ + hi (ui )dui (37)
τ,λ≥0,hi ∈C([0,1]) 0
i=1
Z " 2 2
# )
1 X X
+ sup τ+ max(v1 + v2 − τ, 0) − hi (vi ) − λ |ui − vi | d(u1 , u2 ) .
[0,1]2 v∈[0,1]2 1−α i=1 i=1
The following choice of optimizers in equation (37) yields the upper bound for Φ2 given in
(36):
√ ατ ?

1 1
λ= , τ = τ ? := 2 − 2 − 2α and hi (v) = v− for i = 1, 2.
2(1 − α) 1−α 2
We now derive the following lower bound

√
2√

2(−3 + 2 2 − 2α + 3α)ρ
Φ2 ≥ min 1 + α, 2 − 2 − 2α + , (38)
3 3(2 − α)(1 − α)α
where Φ2 is defined in (26).
38
Proof. It is straight forward to see that Φ2 is concave in the radius ρ of the considered
Wasserstein ball around µ̄. This is due to the fact that we defined the ground metric c(·, ·)
of the transportation distance dc by the L1 -metric, i.e. c(x, y) = ||x−y||1 . Hence, to establish
the lower bound (38), we only need to show that for ρ? = α(1 − α)(1 − α/2) it holds that
Φ2 ≥ 1 + α.
Therefore, we define the probability measure µα by the following bivariate copula


 u1 u2 if u ∈ [0, α/2]2 ∪ [α/2, α]2

 2−α u u

if u ∈ ([0, α/2] × [α/2, α]) ∪ ([α/2, α] × [0, α/2])
α 1 2
Cα (u1 , u2 ) = 1 2
.


 1−α u 1 u2 if u ∈ [α, 1]

min(u1 , u2 ) else

Tedious calculations show that dc (µ̄, µα ) ≤ α(1 − α)(1 − α/2) = ρ? , where µ̄ is the bivariate
probability measure with independent, standard uniformly distributed marginals defined in
problem (26). Moreover, for VU ∼ µα it holds that AVaRα (U + V ) = 1 + α.

A.5. Correlation Matrix

The purpose of this subsection is to give the correlation matrix Σ0 . Recall that Σ0 defines the
student-t copula C0 with six degrees of freedom used as a reference dependence structure in
the case study by Aas and Puccetti (2014), which we consider in Section 5. As this matrix is
not given in the paper by Aas and Puccetti (2014), we simply choose the following arbitrary
correlation matrix
 
1 0.36 0.35 0.44 0.45 0.30
 0.36 1 0.37 0.36 0.41 0.43 
 
 0.35 0.37 1 0.44 0.32 0.42 
 
Σ0 =  .
 0.44 0.36 0.44 1 0.41 0.29 
 
 0.45 0.41 0.32 0.41 1 0.28 
0.30 0.43 0.42 0.29 0.28 1
Acknowledgments
The authors confirm that the data supporting the findings of this study are available within
the article by Aas and Puccetti (2014).
The authors thank Daniel Bartl, Ludovic Tangpi, Ruodu Wang and the participants of
the numerous conference and seminars, where the authors presented this paper, for help-
ful comments as well as interesting discussions. Moreover, Mathias Pohl thanks Philipp
Schmocker for his help and acknowledges support by the Austrian Science Fund (FWF)
under the grant P28661 and Stephan Eckstein sincerely thanks Jan Oblój for his hospitality.
Finally, we thank the two referees for their helpful comments.
References
Aas, K., & Puccetti, G. (2014). Bounds for total economic capital: the DNB case study.
Extremes, 17 (4), 693–715. doi: 10.1007/s10687-014-0202-0
39
Bartl, D., Cheridito, P., & Kupper, M. (2019). Robust expected utility maximization with
medial limits. Journal of Mathematical Analysis and Applications, 471 (1), 752 - 775.
doi: 0.1016/j.jmaa.2018.11.012
Bartl, D., Drapeau, S., & Tangpi, L. (2019). Computational aspects of robust optimized
certainty equivalents and option pricing. Mathematical Finance, 0 (0), 1 23. doi:
10.1111/mafi.12203
Basel Committee on Banking Supervision. (2013). Consultative document, fun-
damental review of the trading book: A revised market risk framework.
http://www.bis.org/publ/bcbs265.pdf .
Bayraksan, G., & Love, D. K. (n.d.). Data-driven stochastic programming using phi-
divergences. Tutorials in operations research, 1-19. doi: 10.1287/educ.2015.0134
Beck, C., Becker, S., Grohs, P., Jaafari, N., & Jentzen, A. (2018). Solving stochastic
differential equations and Kolmogorov equations by means of deep learning. arXiv
preprint arXiv:1806.00421 .
Becker, S., Cheridito, P., & Jentzen, A. (2019). Deep optimal stopping. Journal of Machine
Learning Research, 20 (74), 1–25.
Bernard, C., Rüschendorf, L., & Vanduffel, S. (2017). Value-at-risk bounds with variance
constraints. Journal of Risk and Insurance, 84 (3), 923–959. doi: 10.1111/jori.12108
Bernard, C., Rüschendorf, L., Vanduffel, S., & Wang, R. (2017). Risk bounds for factor
models. Finance and Stochastics, 21 (3), 631–659. doi: 10.1007/s00780-017-0328-4
Berner, J., Grohs, P., & Jentzen, A. (2018). Analysis of the generalization error: Empirical
risk minimization over deep artificial neural networks overcomes the curse of dimen-
sionality in the numerical approximation of black-scholes partial differential equations.
arXiv preprint arXiv:1809.03062 .
Blanchet, J., Kang, Y., & Murthy, K. (2019). Robust Wasserstein profile inference and
applications to machine learning. Journal of Applied Probability, 56 (3), 830857. doi:
10.1017/jpr.2019.49
Blanchet, J., & Murthy, K. (2019). Quantifying distributional model risk via optimal
transport. Mathematics of Operations Research, 44 (2), 565–600. doi: 10.1287/moor
.2018.0936
Buehler, H., Gonon, L., Teichmann, J., & Wood, B. (2019). Deep hedging. Quantitative
Finance, 1–21. doi: 10.1080/14697688.2019.1571683
Chang, C.-L., Jiménez-Martı́n, J.-Á., Maasoumi, E., McAleer, M., & Pérez-Amaral, T.
(2019). Choosing expected shortfall over var in basel iii using stochastic dominance.
International Review of Economics & Finance, 60 , 95–113. doi: 10.1016/j.iref.2018
.12.016
Chen, Z., Yu, P., & Haskell, W. B. (2019). Distributionally robust optimization for sequen-
tial decision-making. Optimization, 68 (12), 2397-2426. doi: 10.1080/02331934.2019
.1655738
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In
Advances in neural information processing systems (pp. 2292–2300).
Delage, E., & Ye, Y. (2010). Distributionally robust optimization under moment uncertainty
with application to data-driven problems. Operations research, 58 (3), 595–612. doi:
10.1287/opre.1090.0741
Eckstein, S., & Kupper, M. (2019, Feb 20). Computation of optimal transport and re-
lated hedging problems via penalization and neural networks. Applied Mathematics &
Optimization. doi: 10.1007/s00245-019-09558-1
40
Embrechts, P., & Puccetti, G. (2006). Bounds for functions of dependent risks. Finance
and Stochastics, 10 (3), 341–352. doi: 10.1007/s00780-006-0005-5
Embrechts, P., Wang, B., & Wang, R. (2015). Aggregation-robustness and model uncertainty
of regulatory risk measures. Finance and Stochastics, 19 (4), 763–790. doi: 10.1007/
s00780-015-0273-z
Gao, R., Chen, X., & Kleywegt, A. J. (2017). Wasserstein distributional robustness and
regularization in statistical learning. arXiv preprint arXiv:1712.06050 .
Gao, R., & Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with
Wasserstein distance. arXiv preprint arXiv:1604.02199 .
Gao, R., & Kleywegt, A. J. (2017a). Data-driven robust optimization with known marginal
distributions. Working paper .
Gao, R., & Kleywegt, A. J. (2017b). Distributionally robust stochastic optimization with
dependence structure. arXiv preprint arXiv:1701.04200 .
Genevay, A., Peyré, G., & Cuturi, M. (2017). Learning generative models with sinkhorn
divergences. arXiv preprint arXiv:1706.00292 .
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved
training of Wasserstein gans. In Advances in neural information processing systems
(pp. 5767–5777).
Hanasusanto, G. A., & Kuhn, D. (2018). Conic programming reformulations of two-stage
distributionally robust linear programs over Wasserstein balls. Operations Research,
66 (3), 849-869. doi: 10.1287/opre.2017.1698
Henry-Labordere, P. (2017). Deep primal-dual algorithm for BSDEs: Applications of ma-
chine learning to CVA and IM. Available at SSRN: 3071506 .
Henry-Labordere, P. (2019). (Martingale) optimal transport and anomaly detection with
neural networks: A primal-dual algorithm. Available at SSRN 3370910 .
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks, 4 (2), 251 - 257. doi: 10.1016/0893-6080(91)90009-T
Lux, T., & Papapantoleon, A. (2016). Model-free bounds on value-at-risk using partial
dependence information. arXiv preprint arXiv:1610.09734 .
Makarov, G. (1982). Estimates for the distribution function of a sum of two random variables
when the marginal distributions are fixed. Theory of Probability & its Applications,
26 (4), 803–806. doi: 10.1137/1126086
Mohajerin Esfahani, P., & Kuhn, D. (2018). Data-driven distributionally robust optimiza-
tion using the Wasserstein metric: performance guarantees and tractable reformula-
tions. Mathematical Programming, 171 (1), 115–166. doi: 10.1007/s10107-017-1172-1
Nelsen, R. B. (2007). An introduction to copulas. Springer Science & Business Media.
Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization
problems. SIAM Journal on Optimization, 22 (2), 341–362. doi: 10.1137/100802001
Norbert. (2013). When does convergence in Lp imply convergence of the p-th moment?
Mathematics Stack Exchange. Retrieved from https://math.stackexchange.com/
q/475146 (version: 2013-08-25)
Obloj, J., & Wiesel, J. (2018). Statistical estimation of superhedging prices. arXiv preprint
arXiv:1807.04211 .
Pflug, G., & Wozabal, D. (2007). Ambiguity in portfolio selection. Quantitative Finance,
7 (4), 435–442. doi: 10.1080/14697680701455410
Pflug, G. C., & Pohl, M. (2017). A review on ambiguity in stochastic portfolio optimization.
Set-Valued and Variational Analysis, 1–25. doi: 10.1007/s11228-017-0458-z
41
Puccetti, G., & Rüschendorf, L. (2012). Bounds for joint portfolios of dependent risks.
Statistics & Risk Modeling with Applications in Finance and Insurance, 29 (2), 107–
132. doi: 10.1524/strm.2012.1117
Puccetti, G., & Rüschendorf, L. (2012). Computation of sharp bounds on the distribution
of a function of dependent risks. Journal of Computational and Applied Mathematics,
236 (7), 1833 - 1840. doi: 10.1016/j.cam.2011.10.015
Puccetti, G., & Wang, R. (2015). Extremal dependence concepts. Statistical Science, 30 (4),
485–517. doi: 10.1214/15-STS525
Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal
of risk , 2 , 21–42. doi: 10.21314/JOR.2000.038
Roth, K., Lucchi, A., Nowozin, S., & Hofmann, T. (2017). Stabilizing training of genera-
tive adversarial networks through regularization. In Advances in neural information
processing systems (pp. 2018–2028).
Rüschendorf, L. (1982). Random variables with maximum sums. Advances in Applied
Probability, 14 (3), 623-632. doi: 10.2307/1426677
Rüschendorf, L. (2017). Risk bounds and partial dependence information. In From statistics
to mathematical finance (pp. 345–366). Springer. doi: 10.1007/978-3-319-50986-0-17
Seguy, V., Damodaran, B. B., Flamary, R., Courty, N., Rolet, A., & Blondel, M.
(2017). Large-scale optimal transport and mapping estimation. arXiv preprint
arXiv:1711.02283 .
Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM Journal on
Optimization, 27 (4), 2258–2275. doi: 10.1137/16M1058297
Weinan, E., Han, J., & Jentzen, A. (2017). Deep learning-based numerical methods for
high-dimensional parabolic partial differential equations and backward stochastic dif-
ferential equations. Communications in Mathematics and Statistics, 5 (4), 349–380.
doi: 10.1007/s40304-017-0117-6
Yang, I. (2017). A convex optimization approach to distributionally robust Markov decision
processes with Wasserstein distance. IEEE control systems letters, 1 (1), 164–169. doi:
10.1109/LCSYS.2017.2711553
Zhao, C., & Guan, Y. (2018). Data-driven risk-averse stochastic optimization with
Wasserstein metric. Operations Research Letters, 46 (2), 262 - 267. doi: 10.1016/
j.orl.2018.01.011
42

Robust Risk Aggregation With Neural Networks: Stephan Eckstein Michael Kupper Mathias Pohl May 27, 2020

Uploaded by

Copyright:

Available Formats

Robust Risk Aggregation With Neural Networks: Stephan Eckstein Michael Kupper Mathias Pohl May 27, 2020

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Robust Risk Aggregation With Neural Networks: Stephan Eckstein Michael Kupper Mathias Pohl May 27, 2020

Uploaded by

Copyright:

Available Formats

Robust risk aggregation with neural

May 27, 2020

Abstract. We consider settings in which the distribution of a multivariate

∗ Department of Mathematics, University of Konstanz, Universitätsstrae 10, 78464 Konstanz, Germany,

enna, Austria, mathias.pohl@univie.ac.at

φθ,γ (f ) → φ(f ) for γ → ∞. (5)

A sufficient condition for this convergences is derived in Proposition 1. We additionally

1.3. Related literature

min max EQ [f (x, ξ)] , (7)

where X ⊂ Rm , f : Rm × Ξ → R, ξ is a random vector whose distribution Q is supported

Neural networks in finance and optimization

Proof. 1) Define the optimal transport functional ψ1 : Cκ (X) → R by

i = 1, . . . , d. We now show that ψ1 (f n0 ) ≤ ψ1 (f ) + ε, which we do separately depending on

for all µ ∈ Pκ (X). Indeed, for every µ ∈ Pκ (X) one has

= ϕ(dc (µ̄, µ)),

for all µ ∈ Pκ (X).

inf ψ(f n ) = inf inf ψ1 (g) + ψ2 (f n − g)

= inf inf ψ1 (f n − g) + ψ2 (g)

for all f ∈ Uκ (X).

Corollary 1. For every f ∈ Uκ (X) one has

Corollary 2. For every f ∈ Uκ (X) one has

3 Note that δx (A) = 1 if x ∈ A, and δx (A) = 0 otherwise.

The assumption X = X1 × · · · × Xd implies that for any y ∈ X we can find indices k1 , . . . , kd

where we use that

Introducing the auxiliary variables ui (j, m) ∈ R, where i = 1, . . . , d, j = 1, . . . , n and m =

Lemma 1. For every f ∈ Cb (X 2 ) one has

order Wasserstein distance on R . d

(a) For for every f ∈ Cb (X 2 ) one has

is a maximizer of (22). Hence µ? := π ? ◦ pr−1

(while φθ,γ (f ) < ∞ is clear) for all f ∈ Cb (X 2 ).

inf φ1 (f˜) + φ2 (fn − f˜) = φ1 (f˜) + φ2 (−f˜) = φθ,γ (0),

is feasible it follows from Lemma 1 that

uniformly bounded in ε (respectively growing in a certain order depending on ε), so that a

where each A ∈ An is Borel measurable and satisfies A = A1 × ... × Ad as well as

many product sets by An , which is a partition of X. Define

= ν(B1 × X2 × ... × Xd ) = ν1 (B1 ).

ν n (A) for all A ∈ An (and in particular ν n ((K n )c ) = ν((K n )c )), it holds

(ν|A )i (Bi ) = ν(A)−1 ν(A1 × ... × (Ai ∩ Bi ) × ... × Ad )

which means after a possible change of indices, µε is an ε-optimizer of (18).

2.3. Approximation with neural networks

N(m, d0 ) := g : Rd0 → R : x 7→ Al ◦ ϕ ◦ Al−1 ◦ · · · ◦ ϕ ◦ A0 (x)

φθ,γ (f ) = inf F (λ, h1 , ..., hd , g).

We define the neural network approximation of φθ,γ (f ) by

As c ∈ Lp (θ), it also holds λm c → λc in Lp (θ) and hence

in Lp (θ) as m → ∞. As the mapping x 7→ x+ is Lipschitz-1, taking only the positive parts

where C is a constant only depending on f, λ, c, h1 , ..., hd , g.

and by the inequality quoted in Lemma 2 (see Appendix A.1) it holds

1. The choice of θ, βγ and neural network structure.

2. The optimization method for the parameters of the neural network.

3. How to evaluate the quality of the obtained solution.

4. The typical runtime.

3.1. Choice of θ, βγ and neural network parameters

3.2. Optimization method for the parameters of the neural network

3.3. Evaluation of the solution quality

(a) Is the neural network structure rich enough?

(b) How large is the effect of penalization?

(c) Has the numerical optimization procedure converged to a (near) minimum?

This notation implies that

4.1. Expected maximum of two comonotone standard Uniforms