Deconstructing Generative Adversarial Networks

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO.

11, NOVEMBER 2020 7155

Deconstructing Generative Adversarial Networks


Banghua Zhu , Student Member, IEEE, Jiantao Jiao , Member, IEEE, and David Tse, Fellow, IEEE

Abstract— Generative Adversarial Networks (GANs) are a I. I NTRODUCTION


thriving unsupervised machine learning technique that has led
to significant advances in various fields such as computer vision,
natural language processing, among others. However, GANs are
known to be difficult to train and usually suffer from mode
U NSUPERVISED learning has been studied extensively
in the literature. It is intimately related to the prob-
lem of distribution (density) estimation in statistics [1], and
collapse and the discriminator winning problem. To interpret
the empirical observations of GANs and design better ones, dimensionality reduction [2]. Recently, Generative Adversarial
we deconstruct the study of GANs into three components and Networks (GANs) become a thriving unsupervised machine
make the following contributions. learning technique that has led to significant advances in
• Formulation: we propose a perturbation view of the pop- various fields such as computer vision, natural language
ulation target of GANs. Building on this interpretation, processing, among others [3], [4]. In computer vision and
we show that GANs can be connected to the robust sta-
tistics framework, and propose a novel GAN architecture, natural language processing, GANs have resulted in superior
termed as Cascade GANs, to provably recover meaningful empirical performance over standard generative models for
low-dimensional generator approximations when the real images and texts, such as variational autoencoder and deep
distribution is high-dimensional and corrupted by outliers. Boltzmann machine [5]. Various ideas have been proposed
• Generalization: given a population target of GANs,
to further improve the quality of the learned distribution and
we design a systematic principle, projection under admis-
sible distance, to design GANs to meet the population the stability of the training [6]–[19]. Given the success of
requirement using only finite samples. We implement our GANs, there also exist challenges that need timely solutions.
principle in three cases to achieve polynomial and sometimes In particular, GAN training has been reportedly observed to
near-optimal sample complexities: (1) learning an arbitrary be challenging, unstable, and the problems of mode collapse
generator under an arbitrary pseudonorm; (2) learning a and discriminator winning frequently appear [12], [20]–[24].
Gaussian location family under total variation distance,
where we utilize our principle to provide a new proof for To understand the empirical observations in GAN training
the near-optimality of the Tukey median viewed as GANs; and propose theoretically near-optimal GANs algorithms,
(3) learning a low-dimensional Gaussian approximation of a we deconstruct the design of GANs into three fundamental
high-dimensional arbitrary distribution under Wasserstein aspects: formulation, generalization and optimization, and
distance. We demonstrate a fundamental trade-off in the develop systematic approaches to analyze them.
approximation error and statistical error in GANs, and
demonstrate how to apply our principle in practice with
only empirical samples to predict how many samples would
be sufficient for GANs in order not to suffer from the A. Main Contributions
discriminator winning problem.
• Optimization: we demonstrate alternating gradient descent Our main contributions in this paper can be summarized as
is provably not locally asymptotically stable in optimizing follows.
the GAN formulation of PCA. We found that the minimax
duality gap being non-zero might be one of the causes, and • Formulation: we propose a perturbation view in design-
propose a new GAN architecture whose duality gap is zero, ing the population target of GANs. Building on this
where the value of the game is equal to the previous minimax interpretation, we show that GANs can be connected
value (not the maximin value). We prove the new GAN to the robust statistics framework in terms of perturba-
architecture is globally asymptotically stable in solving PCA
under alternating gradient descent. tion beyond the total variation distance. We also pro-
pose a novel GAN architecture, termed as Cascade
Index Terms— Generative Adversarial Networks (GANs), GANs, to provably recover meaningful low-dimensional
wasserstein distance, optimal transport, generalization error,
generator approximations when the real distribution is
information-theoretic limit, robust statistics.
high-dimensional and corrupted by outliers. We also
Manuscript received November 4, 2019; accepted March 20, 2020. Date elaborate on the implicit assumptions used in [6] that
of publication March 27, 2020; date of current version October 21, 2020. motivated the Wasserstein GAN.
This work was supported by NSF under Grant CIF-1909499. (Corresponding
author: Banghua Zhu.)
• Generalization: given a population target of GANs,
Banghua Zhu and Jiantao Jiao are with the Department of Electrical Engi- we design a systematic principle, projection under admis-
neering and Computer Sciences, University of California at Berkeley, Berke- sible distance, to construct GAN algorithms that achieve
ley, CA 94720 USA (e-mail: banghua@berkeley.edu; jiantao@berkeley.edu). polynomial and sometimes near-optimal sample complex-
David Tse is with the Department of Electrical Engineering, Stanford
University, Stanford, CA 94305 USA (e-mail: dntse@stanford.edu). ities given only finite samples. We study three cases in
Communicated by K. Chaudhuri, Associate Editor for Statistical Learning. detail:
Color versions of one or more of the figures in this article are available
online at http://ieeexplore.ieee.org. 1) learning an arbitrary generator under an arbitrary
Digital Object Identifier 10.1109/TIT.2020.2983698 pseudonorm;
0018-9448 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
7156 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 11, NOVEMBER 2020

2) learning Gaussian location family under total varia- may lie in the space of images. Given infinite samples from
tion distance, where we provide a new proof for the the real distribution PX , the goal of GANs can be understood
near-optimality of Tukey median viewed as GANs; intuitively as finding a generator g whose distribution Pg(Z)
3) learning low-dimensional Gaussian approximations is the closest to the real distribution PX . At the minimum
of a high-dimensional arbitrary distribution under we would need a distance1 L(Pg(Z) , PX ) for every g ∈ G.
Wasserstein-2 distance. Here L(·, ·) : M(Rd ) × M(Rd ) → R≥0 is a function
We demonstrate a fundamental trade-off in the approxi- that characterizes the distance between distributions. Naturally
mation error and statistical error in GANs, and demon- with infinite samples and unbounded computation we would
strate how to apply our principle in practice with only ideally like to solve for
empirical samples to predict how many samples would ĝ = arg min L(Pg(Z) , PX ). (1)
be sufficient for GANs in order not to suffer from the g∈G
discriminator winning problem. Note that the minimum in (1) may not be achievable in
• Optimization: we demonstrate alternating gradient general. However, we assume throughout this paper that it is
descent is provably not locally asymptotically stable in attained, since the other case can be dealt with by a standard
optimizing the GAN formulation of PCA. We found limiting argument. The formulation problem looks at the
that the minimax duality gap being non-zero might be design of L and G. As a concrete example, the vanilla GAN [5]
one of the causes, and propose a new GAN architecture defines the distance function L as the Jensen-Shannon diver-
whose duality gap is zero, and the value of the game is gence, and uses neural network as generator family.
equal to the previous minimax value (not the maximin
value). We prove the new GAN architecture is globally
asymptotically stable in solving PCA under alternating A. The Population Target of GANs: a Perturbation View
gradient descent. Intuitively, at minimum the design of L and G needs to
satisfy that
Organization of paper. The rest of the paper is organized as
follows. In Section II, we discuss the formulation of GANs inf L(Pg(Z) , PX ) is “small”. (2)
g∈G
from a perturbation view, and propose Cascade GANs for
Indeed, were it not true, then one should certainly design
capturing the sequential perturbation model. In Section III,
another generator family G that can better approximate the
we study the generalization property of GANs. In Section IV,
real distribution PX . In other words, we implicitly assume
we study the optimization property of GANs. All the proofs
are deferred to the appendices. The real distribution PX is a “slightly” perturbed
Notations. Throughout this paper, we use capital letter X for generator g ∈ G under L.
random variable, blackbold letter P for probability distribution. The perturbation view allows us to unify the field of robust
Generally, the distribution for random variable X is denoted statistics and GANs. In robust statistics, it is usually assumed
as PX , and the corresponding empirical distribution from PX that the observed distribution (in GANs, the true distribu-
with n samples is P̂nX . In all sections except for Section IV, tion PX ) is a slightly perturbed version of a generator member
we use bold lower case letters for vectors, bold upper case under total variation distance. Hence, we can connect GANs
letters for matrices. In Section IV, we use lower case letters to the robust statistics framework2. This viewpoint proves to
for vectors, upper case letters for matrices. We use A† to be beneficial: indeed, we would later utilize a general analysis
denote the pseudo inverse of matrix A. Random vector Z technique we developed for GANs to provide a new proof
follows zero mean normal distribution with identity covari- for the optimal statistical properties of the Tukey median in
ance. We use r(A) to denote the rank of matrix A, R(A) robust estimation. Our perturbation view of GANs is inspired
to denote the range of matrix A. We say A ≥ 0 if A is by [25] which constructed GAN architectures that are provably
positive semidefinite, A > 0 if A is positive definite. We use achieving the information-theoretic limits in robust estimation
TV(P, Q) = supA P(A) − Q(A) to denote the total variation of mean and covariances. However, it also immediately raises
distance between P and Q. For non-negative sequences aγ , bγ , a conflict: it was argued in [6] that the Jensen–Shannon diver-
we use the notation aγ α bγ to denote that there exists a gence (similarly, the total variation) distance is not supposed
a
constant C that only depends on α such that supγ bγγ ≤ C, to be used as the distance function for GANs. How can we
and aγ α bγ is equivalent to bγ α aγ . When the constant unify these two viewpoints?
C is universal we do not write subscripts for  and .
Notation aγ  bγ is equivalent to aγ  bγ and aγ  bγ .
a B. Cascade GANs
Notation aγ  bγ means that lim inf γ bγγ = ∞, and aγ  bγ
is equivalent to bγ  aγ . A careful inspection of the arguments in [6] shows that
the total variation is inappropriate when the generator family
II. F ORMULATION OF GAN S 1 This distance needs to be broadly understood as a function that may not

GANs aim at selecting a generator g(·) from some generator satisfy the axioms of a distance function.
2 For GAN, the perturbation metric and the metric to recover original
family G. The generator function takes Gaussian random noise distribution are the same, but for general robust statistics problem they might
Z ∼ N (0, Ir ) as input and outputs a random variable which be different. The latter problem is beyond the scope of the paper.

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
ZHU et al.: DECONSTRUCTING GANs 7157

In the context of Figure 1, the final generator family G is


one-dimensional distributions, and the intermediate generator
family G  is full-dimensional distributions. By solving equa-
tion (3), we are assuming that a low dimensional Gaussian
distribution Pg(Z) is perturbed under W2 distance to form a
high dimensional Gaussian distribution Pg (Z) , and then per-
turbed under TV distance to get the observed distribution PX .
Generalizing the Cascade GAN architecture, we have mul-
tiple rounds of perturbation, and TV and W2 can be replaced
by any distance functions. A similar formulation was raised
in [26], where it was shown that sometimes Cascade GANs
can be viewed as constraining the discriminator for vanilla
GANs. The chained TV and W2 perturbation problem may
also be cast as a single perturbation under a new distance
function L(PX , PY ) = inf π E[min( X − Y 2 , 1X=Y )]. How-
ever, this would require a new design of GAN, while the
Cascade GAN architecture is readily implementable given two
Fig. 1. Illustration of the perturbation in the robust statistics setting, and the GANs designed specifically for TV and W2 . We mention that
setting of applications of GANs. Top: the generator family is one-dimensional
distributions. The real distribution is a mixture of 99% of generator one AmbientGAN [27] is also proposed to learn under the chained
and 1% of the outliers marked as “TV perturbation” in the figure. Bottom: perturbation model, but it requires that the second perturbation
the generator family is one-dimensional distributions. The real distribution is simulatable.
is a mixture of 99% of the two-dimensional Gaussian whose support covers
generator one, and 1% of the outliers marked as “TV perturbation”. Both the
total variation distance and the Wasserstein distance between generator one
(which intuitively is the correct generator to select in infinite samples) and C. Continuity With Non-Zero Gradient
the real distribution are gigantic. This property requires that the L we construct is neither too
strong nor too weak for the family G. Similar observations
consists of distributions with intrinsically low dimensions. were made in [6] that motivated the W1 distance. For the
Indeed, for any two low-dimensional distributions whose sake of completeness, we illustrate this point via a different
supports do not overlap, the total variation distance between example here. Consider the top figure in Fig. 2. Here the
them is the maximum value one. The Wasserstein distance generators are one-dimensional Gaussian distributions in high-
is proposed in [6] to be a distance that is continuous with dimensions, and we assume that generator one is the real
respect to small local variations and is not sensitive to support distribution. To ensure that our GAN is able to find the correct
mismatch. generator one, it is natural to assume that the distance function
However, the usual robust statistics setting is closer to the L(Pg(Z) , PX ) is providing gradient for us to locate the optimal
top of Figure 1, where the observed distribution shares 99% generator. The middle figure of Figure 2 shows the distance as
of the support of generator one, with 1% outliers far away. a function of θ (the angle between these two one-dimensional
Then, our goal is to successfully recover generator one given Gaussians) under the Jensen–Shannon divergence, while the
the corruption. bottom figure in Figure 2 shows that under the Wasserstein-2
A natural combination of the robust statistics formulation distance. It is clear that the Jensen–Shannon divergence
and the motivation of the Wasserstein GANs would give reaches zero if and only if θ = 0, and then saturates for other
rise to the bottom of Figure 1, where neither total varia- values of θ, while under W2 it is a continuous function of θ
tion nor Wasserstein distance provides a meaningful descrip- with non-vanishing gradient almost everywhere.
tion of the perturbation. Indeed, the generator family is the Intuitively speaking, the Jensen–Shannon divergence is a too
one-dimensional distributions, and the observed distribution strong distance such that it always tells that the distributions
PX has 99% of the points in the two-dimensional distribution are maximally different even if they are slightly apart. It is
covering generator one, plus 1% outliers. Ideally, the best the saturation that causes the discontinuity and the gradient
generator that fits this distribution should be generator one. to vanish. We remark that constructing a too weak distance
However, both Wasserstein distance and total variation dis- could also cause the gradient to vanish while being a contin-
tance between generator one and the observed distribution are uous distance. For example, if we construct a distance like
gigantic. This is precisely due to the fact that this perturbation L(P, Q) = EP [X] − EQ [X] , then L(Pg1 (Z) , Pg2 (Z) ) ≡ 0
is in fact a sequential perturbation: we first perturb generator for the generator family of centered one-dimensional Gaussian
under Wasserstein distance to obtain the two-dimensional distributions.
distribution around generator one, and then we perturb it under
total variation distance to add the 1% outlier.
III. G ENERALIZATION OF GAN S
We naturally have the following Cascade GAN architecture
to capture this perturbation model: Given the population target of the minimization problem
 
min min
 
TV(PX , Pg (Z) ) + λW2 (Pg(Z) , Pg (Z) ) . (3) min L(Pg(Z) , PX ), (4)
g∈G g ∈G g∈G

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
7158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 11, NOVEMBER 2020

Fig. 3. Illustration of poor performance of projecting the empirical distrib-


ution to the general family under distance W2 . The true distribution lies on
the large circle, while one of the generator distributions could be much closer
to empirical distribution. This leads to the large discrepancy between true
distribution and the generator output distribution found by the naive algorithm.

study g̃ that is produced by a GAN architecture, which


proves to achieve near-optimal sample complexity in specific
problems.
Fig. 2. Plot of different discrepancy measures L as a function of θ. Top:
the generator family is one-dimensional Gaussians, and we denote the angle
between any generator with generator one as θ. Middle: the plot of the
Jensen–Shannon divergence as a function of θ; Bottom: the plot of the W2
A. The Naive Algorithm
distance as a function of θ. It is clear that the Jensen–Shannon divergence Intuitively, an approximate method to solve (4) is to replace
reaches zero if and only if θ = 0, and then saturates for other values of θ,
while under W2 it is a continuous function of θ with non-vanishing gradient the real distribution PX with the empirical distribution P̂nX
almost everywhere. formed with n i.i.d. samples from PX and perform

min L(Pg(Z) , P̂nX ). (7)


g∈G
in this section we study the problem of approximately solving
this minimization problem given only finitely many samples It was observed in [22] if L is the W1 distance, then it
from PX . Concretely, given L and G, we observe n i.i.d. requires number of samples n exponential in the dimension d
i.i.d.
samples X1 , X2 , . . . , Xn ∼ PX , and produce a generator of X to ensure that W1 (P̂nX , PX ) vanishes. Hence, it intuitively
g̃(·) ∈ G such that with probability at least 1−δ, the following suggests that P̂nX is a bad proxy for PX for n not big
equation enough. Fig. 3 provides an intuitive explanation for the poor
performance of the naive projection algorithm. The poor per-
L(Pg̃(Z) , PX ) ≤ C · inf L(Pg(Z) , PX ) + n (G, L, PX , δ),
g(·)∈G formance occurs when the true distribution (on the boundary of
(5) the larger circle) is far from the empirical distribution (at the
center), while one is able to find another generator output
holds. From now on, we define distribution (on the boundary of the inner smaller circle) much
OP T = inf L(Pg(Z) , PX ) (6) closer to the center. In this case, the distance between the
g(·)∈G output generator distribution and the real distribution would
as the oracle error. be at least the difference between the radius of two circles.
For fixed L and G, the oracle error is fixed. We focus our Completing the arguments, we show in Theorem 5 that if G
study on finding the best g̃ that minimizes both the factor C on is the Gaussian family and L is W2 , indeed one can find
the oracle error, and also the stochastic error n which should some distribution inside generator family that is closer to
vanish as n → ∞. Intuitively, the constant C should be 1, since the empirical distribution but very far away from the real
when n = ∞ we can perform the minimization in (4) exactly distribution.
to achieve it. Interestingly, we show that forcing C = 1, which
is called sharp oracle inequality in the literature [28], may
B. Projection Under Admissible Distance
be fundamentally in trade-off with the stochastic error. This
phenomenon is made precise in Theorem 7. Similar oracle We propose a general method, termed as projection under
inequality and density estimation performance for Generative admissible distance to reduce the sample complexity, and for
Adversarial Networks are also studied in [19]. Here we focus a wide range of problems to achieve polynomial, even near
on improving the statistical rate via designing a general optimal sample complexities. Our key observation is that the
framework of projection under admissible distances. population algorithm in (4) is not the unique approach to
We emphasize that in formulation (5), we have not specified achieve approximate optimality in searching the best generator
the algorithm family that generates g̃. In this paper we mainly g, even in population. We have the following definition.

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
ZHU et al.: DECONSTRUCTING GANs 7159

Definition 1 (Admissible Distance): Suppose L is a pseudo- in practice. It is claimed in [29] that training under a tighter
metric3 . We say that the function L is admissible with evidence lower bounds can be detrimental to the process of
parameter (c1 , c2 , L ) with respect to (L, G) if it satisfies the learning an inference network by reducing the signal-to-noise
following two properties: ratio of the gradient estimator for variational autoencoder.
1) Approximate resolution within generators: for any In [26] and [30], using a restricted discriminator family
g1 , g2 ∈ G, is shown to be related to a moment-matching effect. The
recent work of [15] is most related to us, where a notion of
c1 (L (Pg1 (Z) , Pg2 (Z) ) − L (Pg2 (Z) , Pg2 (Z) )) “conjoining discriminator” is developed, while we emphasize
≥L(Pg1 (Z) , Pg2 (Z) ) − c2 , (8) that the results in [15] only applies to cases where OP T = 0,
and the conjoining discriminator family is derived for a few
2) Robust to perturbation: for any g ∈ G, P1 , P2 , there specific generators with L being the W1 distance.
exists a function L such that The proof is deferred to Appendix (B).
|L (Pg(Z) , P1 ) − L (Pg(Z) , P2 )| ≤ L (P1 , P2 )
≤ L(P1 , P2 ), (9) C. Arbitrary G and Pseudonorm L
Suppose L(P1 , P2 ) can be written as P1 − P2 L , where ·
In other words, the approximate resolution requirement L is a pseudonorm4 on a vector space M which is a subset of
ensures that the new function L can distinguish between signed measures on Rd . We have the following representation
generators at least as well as the desired distance L, and the of · L .
robustness property shows that observing a slightly different Lemma 1: Assume appropriate topology on M such that
real distribution may not change the value of L by too anycontinuous linear functional of μ ∈ M can be written
much. Here L may not be a pseudometric. Thus L (P, P) as f dμ for some measurable f , and for any μ ∈ M,
may not be 0 in general. the linear functional M : αμ → α μ L is continuous. For
Of course, L = L is admissible with parameter (1, 0, L), any measurable function f on Rd , define
but the key point of Definition 1 is that there exist a large vari- 
ety of functions L that are admissible. In general, we would f L∗ = sup f dμ. (14)
like to choose L such that it has a better convergence rate μL =1,μ∈M
than L, which will be illustrated in later sections. Then, we can represent · as an integral probability metric:
L
The next theorem shows that projecting under an admissible 
function achieves near optimality. μ = sup f dμ. (15)
L
Theorem 1: Suppose L is a pseudometric, and L is admis- f :f L∗ =1
sible with parameter (c1 , c2 , L ) with respect to (L, G). Define
Lemma 1 is a consequence of the Hahn-Banach Theo-
 
g = arg min L (Pg(Z) , PX ). (10) rem [31, Theorem 4.2]. The proof is deferred to Appendix A.
g∈G Note that we already know any integral probability metric is
Then, a pseudonorm.
In fact, the pseudonorm includes a wide range of dis-
L(Pg (Z) , PX ) ≤ (1 + 2c1 )OP T + c2 . (11) crepancy measures used in various GANs. As a concrete
Furthermore, suppose we observe n i.i.d. samples from PX example, MMD-GAN [32] minimizes the kernel maximum
and denote the empirical distribution formed by the samples mean discrepancy distance, which is a pseudonorm and thus
as P̂nX . Define can be further studied under our framework. Denoting

g  = arg min L (Pg(Z) , P̂nX ). (12)


g∈G F = {f : f L∗ = 1}, (16)
Then, we can write
 

L(Pg (Z) , PX ) ≤ (1 + 2c1 )OP T + 2c1 L (PX , P̂nX ) + c2 . L(P1 , P2 ) = sup f dP1 − f dP2 . (17)
(13) f ∈F

We now design an admissible distance for L. Construct the


Remark 1: Theorem 1 shows that projecting the empiri-
-covering of G : G = {g1 , g2 , · · · , gN () } such that ∀g ∈
cal distribution P̂nX to G under L still achieves O(OP T )
G, ∃gi ∈ G , L(Pgi (Z) , Pg(Z) ) ≤ . We define
approximation error, and the finite sample effect is reflected 
by the convergence of L (PX , P̂nX ) but not L(PX , P̂nX ), which
F = {fij : fij = arg max f d(Pgi (Z) − Pgj (Z) ),
could be significantly faster. In concrete examples, we aim at f ∈F
minimizing c1 , and balancing c2 with c1 L (PX , P̂nX ). 1 ≤ i, j ≤ N ()}. (18)
Remark 2: Theorem 1 characterizes the performance of
projecting under any admissible L . Training under an alterna- In other words, interpreting F as the family of discriminators,
tive discrepancy measure is already observed to be beneficial in F we only include those discriminators that are optimally
3 The definition of pseudometric is included in Appendix (). 4 The definition of pseudonorm is included in Appendix ().

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
7160 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 11, NOVEMBER 2020

discriminating between members in the -covering of G. 3) Whenever discriminator wins phenomenon occurs, which
In some sense, F is the minimal set of useful discriminators means that in solving the projection estimator the discrim-
for the generator family G. Naturally, it leads to the definition inator can always distinguish between the real and forged
of L as data, reduce the number of parameters of the discrimina-
 tor and it will not hurt the learning performance as long
L (PX , PY ) = sup f d(PX − PY ). (19) as L is admissible;
f ∈F
4) As long as L(·, ·) is strong enough such that proximity
We have the following result. under L ensures mode collapse does not happen, project-
Theorem 2: Assuming the true distribution is PX , let the ing under L prohibits mode collapse.
GAN estimator be The idea of set design mainly comes from Yatracos’
method [36] (also referred to as Schaffe’s sets by [33]).
g̃ = arg min L (Pg(Z) , P̂nX ), (20)
g(·)∈G In fact, the prefactor 3 in front of the oracle term
inf g(·)∈G L(Pg(Z) , PX ) is likely to be tight. It is shown in [37]
where L is defined in (19). Then that in Yatracos’ method the constant 3 cannot be improved if
we require proper learning.
L(Pg̃(Z) , PX ) ≤ 3 inf L(Pg(Z) , PX ) + 2L (PX , P̂nX ) + 4.
g(·)∈G

(21) D. Robust Estimation: Gaussian Location Model and Total


Variation Distance
Furthermore, suppose the -covering number N () of G under
 d Consider the generator family being Gaussian location
L satisfies N () ≤ C 1 , and there exists a constant B such
model and discrepancy measure being total variation distance,
that for any f ∈ F ,
  i.e., G = {N (µ, I) : µ ∈ Rd }, L = TV. We have allowed
1 ourselves to slightly abuse notation here to denote G by the
n
nt2
PX E[f (X)] − f (Xi ) > t ≤ 2 exp − 2 . family of probability measures rather than the family of func-
n i=1 2B
tions. We denote by Pg some member in the generator family
(22) G. In the robust statistics literature, it is usually assumed that
1 there exists some distribution g ∗ ∈ G, and we observe PX
2
Then, taking  = d log
n
n
, with probability at least 1 − δ, such that
there exists some constant D depending on B, C such that
TV(PX , Pg∗ ) (24)
L(Pg̃(Z) , PX ) ≤3 inf L(Pg(Z) , PX ) + D(B, C) is small. Since g ∗ is unknown, we can equivalently state this
g(·)∈G
1 assumption as
1 1 2
· max d log(n), log . (23) OP T = inf TV(PX , Pg ) (25)
n δ g∈G
Remark 3: Theorem 2 relies on two assumptions: the is small. This model is understood in the robust statistics
covering number of G and the concentration properties of literature as the oblivious adversary contamination model (or
f (X). The behavior of the assumed covering number is Hampel’s contamination model) which is stronger than the
common in practice for parametric families [33]. On the Huber additive contamination model since it allows "deletion".
concentration properties, for any bounded function f , it fol- In robust statistics, the ultimate goal is that given empirical
lows from the Hoeffding inequality [34] that the sub-Gaussian observations from PX , we would like to obtain an estimate
behavior is true. In another case, if f (·) is a Lipschitz function ĝ such that TV(Pĝ , Pg∗ ) is small. Our observation is that it
and the distribution of X satisfies the logarithmic Sobolev suffices to guarantee that TV(Pĝ , PX ) is small, which is the
inequality [35], the result also holds. goal of GANs. Indeed, it follows from the triangle inequality
Here the conclusion does not require G to be parametric that
family. The result holds as long as the covering number of G
is bounded. Thus it can be further extended to non-parametric TV(Pĝ , Pg∗ ) ≤ TV(Pĝ , PX ) + TV(PX , Pg∗ ) (26)
families with the same argument. ≤ OP T + TV(Pĝ , PX ). (27)
The proof is deferred to Appendix (C). Theorem 2 provide
insights into practice. It implies Utilizing the general framework of admissible distances,
we present proofs for two GAN architectures that are both
1) The number of parameters of the discriminator family
admissible distances of (TV, G), and both achieve optimal
should be at most that of the generator family since
statistical error up to a universal constant.
it suffices to only being able to distinguish between
1) Admissible Distance With Parameter (1, 0, TV ): We first
generator family distributions;
construct an admissible distance for (TV, G) with parameter
2) As the number of samples n increases, we need to gradu-
(1, 0, TV ).
ally increase the parameters (discrimination power) of the
1/2 We construct the weakened distance by constraining the set
d log n
discriminators, as shown by the formula  = n of A in the definition of total variation distance. Denote Aij as
in Theorem 2; the set {x ∈ Rd : Pgi (x) > Pgj (x), gi , gj ∈ G}, A = {Aij }.

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
ZHU et al.: DECONSTRUCTING GANs 7161

When gi , gj takes all possible generators in G = {N (μ, I) : condition in (4) to consider arbitrary real distribution, but stick
μ ∈ Rd }, A is the set of all halfspaces {{x ∈ Rd : v, x+b > to the rank-constrained linear map generator and W2 distance.
0} : v 2 = 1, b ∈ Rd }. The weakened distance is thus This model generalizes PCA and provides an intellectually
defined as stimulating playground for studying the generalization and
optimization properties of GANs. We demonstrate a few
TV (Pg , PX ) = sup PX (v T
(X − b) ≥ 0)
v=1,b∈Rd
striking behaviors of the W2 -GAN model in this section.
1) Exponential Sample Complexity for the Naive Algorithm:
− Pg (v T (g(Z) − b) ≥ 0). (28)
The W2 distance is defined as

The optimality of projection under TV is guaranteed in the
W2 (PX , PY ) = inf E X −Y 2, (34)
following theorem:
π:πX =PX ,πY =PY
Theorem 3: Let the GAN estimator be
g̃ = arg min TV (Pg , P̂nX ), (29) which admits a dual form:
g∈G
W22 (PX , PY ) = E X 2
2 +E Y 2
2 +2 sup
where TV is defined in (28). Then, TV is an admissible ψ(·):convex
distance of (TV, G) with parameter (1, 0, TV ), and with prob- − E[ψ(X)] − E[ψ (Y )], 
(35)
ability at least 1 − δ, there exists some universal constant C,
such that where ψ  (x) = supv vT y − ψ(v) is the convex conjugate
d+1 log(1/δ) of ψ.
TV(Pg̃ , PX ) ≤ 3 · OP T + C · +2 . Without loss of generality, we can assume the means of
n 2n
PX , PY are zero. Indeed, for any two distributions PX , PY ,
(30)
assume the corresponding means are µX and µY , and denote
The proof is deferred to Appendix (D). X  = X − µX , Y  = Y − µY . Then, we have
2) Tukey Median: The Tukey median approach can be
equivalently formulated as projecting under another admissible W22 (PX , PY ) = inf E X 2
2 +E Y 2
2 − 2E[X T Y ]
π
distance. It is defined as the following: for two distribu- = inf E[Tr(XX T )]+E[Tr(Y Y T )]−2E[X T Y ]
tions P, Q, we define a different admissible distance as π
  = inf E[Tr(X  X T )]+µXT µX +E[Tr(Y  Y T )]
LTukey (P, Q) = sup Q (X − EP [X])T v > 0 − 1/2. (31) π
v∈Rd + µY T µY − 2E[X T Y  ] − 2µX T µY
2
It has been shown in previous literature that Tukey median = µX − µY 2 + W22 (PX  , PY  ). (36)
is able to achieve optimal convergence rate under Huber’s
additive contamination model [38]. Based on the theory of Thus, the W22 between distributions with non-zero means
admissible distance, We provide a new simple proof for the can be decomposed to the difference between means and
optimality of Tukey median. W22 between the centered distributions. In the rest of the
Theorem 4: Let the GAN estimator (Tukey median) be paper, we mainly consider W2 distance between zero-mean
defined as distributions except for Theorem 6 and 7.
To demonstrate the intricacies of GAN design, we start
g̃ = arg min LTukey (Pg , P̂nX ), (32) with the vanilla GAN architecture. We assume the generator
g∈G
is linear, i.e. G = {g(Z) = AZ : A ∈ Rd×r , Z ∈ Rr },
where LTukey is defined in (31). Then, LTukey is an admis- which means that g(Z) ∼ N (0, K) for some K positive
sible distance of (TV, G) with parameter (2, 0, TV ), where semidefinite with rank at most r. The optimal generator in
TV (P1 , P2 ) is defined in equation (28). With probability at this setting is trivially g(Z) = K1/2 Z if PX = N (0, K)
least 1 − δ, there exists some universal constant C such that and r = d. However, as the next theorem shows, it takes the
d+1 log(1/δ) GAN architecture more than exponential number of samples
TV(Pg̃ , PX ) ≤ 5 · OP T + C · +4 . to approach the optimal generator even when K is identity.
n 2n
(33) It is a refinement of [16].
Theorem 5: Let the GAN estimator be
The proof is deferred to Appendix (E).
g̃ = arg min W2 (Pg(Z) , P̂nX ). (37)
g(·)∈G
E. W2 -GAN Model
It was first shown in [16] that when the generator family is Then we have
rank-constrained linear maps, L is the Wasserstein-2 distance,
W2 (Pg̃(Z) , PX ) ≤ inf W2 (Pg(Z) , PX ) + 2W2 (PX , P̂nX ).
and the real distribution is Gaussian, then the population GAN g∈G
solution (4) produces the PCA of the covariance matrix of (38)
the true distribution. Inspired by this phenomenon, Quadratic
GAN is proposed and both theoretical and empirical results Furthermore, assume EPX [X] = 0, and PX satisfies the
demonstrate its fast generalization. In this section, we relax the following assumption (which N (0, Id ) satisfies when d ≥ 3):

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
7162 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 11, NOVEMBER 2020

• X ∼ μ satisfies logarithmic Sobolev inequality, i.e. for as the Fréchet Inception Distance [40]. The following result
some constant B > 0 and all smooth f on Rd , analyzes the performance of projecting under W̃2 .
Theorem 6 (Guarantee for quadratic GAN): Assume the
Ent(f 2 ) ≤ 2BE(|∇f |2 ), (39) generator family to search is rank-r affine generators:
where ∇f is the usual gradient of f , Ent(f ) = Gr = {g(Z) = AZ + b : A ∈ Rd×r , Z ∈ Rr , b ∈ Rd }.
Eμ (f log f ) − Eμ (f ) log Eμ (f ).
(44)
• Upper bound on expected W2 convergence rate5 for some
constant C1 depending on PX : Denote the family of full rank Gaussian distribution generator
as G = {g(Z) = AZ + b : A ∈ Rd×d , Z ∈ Rd , b ∈ Rd }. Let
E[W22 (PX , P̂nX )] ≤ C1 (PX ) · n−2/d . (40) the GAN estimator be
Then, there exists some constant C depending on PX , such g̃ = arg min W̃2 (Pg(Z) , P̂nX ). (45)
that with probability at least 1 − δ, g(·)∈Gr

W2 (Pg̃(Z) , PX ) ≤ inf W2 (Pg(Z) , PX ) + C(PX ) · n−1/d Then, W̃2 is an admissible distance for (W2 , G) with parameter
g∈G (1, 0, W̃2 ), and
2B log(1/δ) 1/2
+2 . (41) W2 (Pg̃(Z) , PX ) ≤ ΣX − Σ1/2
r F+ 2 inf W2 (Pg(Z) , PX )
n g(·)∈G

For lower bound, when PX = N (0, Id ), we have with + 2W̃2 (P̂nX , PX ), (46)
probability at least 0.99, there exist some constants C2 , C3 (d),
such that when n > C3 (d) and d ≥ 5, here μX , ΣX are the mean and covariance matrix of X,
respectively, Σr is the best rank-r approximation of ΣX under
W2 (Pg̃(Z) , N (0, Id )) ≥ C2 d1/2 n−3/d . (42) Frobenius norm (the r-PCA solution).
Furthermore, if the minimum eigenvalue of the covariance
Remark 4: Equation (41) depends on two assumptions: matrix of X is lower bounded by a constant B 6 , then
the logarithmic Sobolev inequality (39) and the expected
1/2
convergence rate (40). The logarithmic Sobolev inequality W2 (Pg̃(Z) , PX ) ≤ ΣX −Σ1/2
r F +2 inf W2 (Pg(Z) , PX )
g(·)∈G
holds for a wide range of distributions including Gaussian dis-
tribution [35]. The upper bound on the expected convergence d
+2 ΣX − Σ̂ + 2 µX − µ̂ , (47)
rate holds for any distribution PX with finite 2d/(d − 2) B
moment and d ≥ 5 [39]. It also holds for N (0, Id ) when
where µ̂, Σ̂ are the empirical mean and covariance matrix
d ≥ 3.
of P̂nX , respectively.
The proof is deferred to Appendix F. The proof of the
Remark 5: It follows from [41] that if PX is sub-Gaussian,
theorem also provides a lower bound for convergence rate
i.e., there exists some constant L such that
of Wasserstein-2 distance between Gaussian and its empirical
2
/L2
distribution, which appears to be new. PX (|X − E[X], x| > t) ≤ 2e−t (48)
Theorem 5 shows that even when there exists a generator
that can perfectly reconstruct the real distribution that produces for t > 0 and x 2 = 1, then with probability at least 1 − δ,
the samples, the sample complexity grows at least exponen- we have

tially with the dimension of the data distribution. The result 2 d
2
in Theorem 5 can also be understood as an instantiation of ΣX − Σ̂  L log (49)
δ n
the discriminator winning phenomenon. Indeed, unless we
are given samples at least exponential in the dimension d, and
the discriminator in the W2 dual representation can always
d
easily tell the difference between the empirical distribution µX − µ̂ δ,L . (50)
P̂nX and the solved optimal generator distribution g̃. Intuitively, n
to alleviate this phenomenon, one needs to design discrimina- More generally, it follows from [42, Theorem 5.6.1] that if
tors to properly weaken their discriminating power. E[X] = 0, and for some K > 0, X ≤ K almost surely,
2) Polynomial Sample Complexity GAN Design: It was then for every n ≥ 1,
proposed in [16] that when PX is Gaussian, one shall project
K 2 Σ log d (K 2 + Σ ) log d
under the Quadratic GAN distance defined as E ΣX − Σ̂  + . (51)
n n
W̃2 (P1 , P2 ) = W2 (N (µ1 , Σ1 ), N (µ2 , Σ2 )), (43) The proof of theorem is deferred to Appendix G. The results
in Theorem 6 can be interpreted in the following way.
where µ1 , µ2 are means of P1 , P2 , and Σ1 , Σ2 are covari- 1/2
ances of P1 , P2 . This distance is also used in the literature 1) ΣX − Σ1/2 r F : the PCA error introduced by rank
mismatch between generator family and true distribution;
5 When the dimension of the support of P
X is less than the ambient
6 If the minimum eigenvalue is not lower bounded by a positive constant,
dimension d, the convergence rate is depending on the intrinsic dimension
but not the ambient dimension. we can show the stochastic error  d1/2 ΣX − Σ̂1/2 + +µX − µ̂.

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
ZHU et al.: DECONSTRUCTING GANs 7163

2) 2 · inf g(·)∈G W2 (Pg(Z) , PX ): the non-Gaussianness of


distribution PX ;
3) 2· W̃2 (P̂nX , PX ): stochastic error that vanishes as n → ∞.
It is interesting to note that the when the distrib-
ution PX is indeed Gaussian, which is equivalent to
inf g(·)∈G W2 (Pg(Z) , PX ) = 0, the PCA error term cannot be
improved even by a constant when n → ∞. Indeed, it is due
to the observation [16] that
1/2
inf W2 (Pg(Z) , N (µX , ΣX )) = ΣX − Σ1/2
r F. (52)
g∈Gr

However, when r = d, it is not clear whether the factor 2


in front of the term inf g(·)∈G W2 (Pg(Z) , PX ) is necessary.
Can we reduce the constant 2 to 1 perhaps by using other Fig. 4. The trade-off between oracle error OP T and stochastic error n .
algorithms? Although Quadratic GAN suffers from a larger prefactor C = 2 for the oracle
3) The Near-Optimality of the Quadratic GAN: Motivated error term, it achieves significantly smaller overall error when the generator
family is close to the target data distribution. ’Limit for C = 1’ line represents
by Theorem 6, we first study the problem of whether it is the best possible performance we can achieve if we want the slope to be 1,
possible to reduce the prefactor 2 of the non-Gaussianness which still requires exponential sample complexity when OP T = 0. For any
term in (47). The following theorem shows that it is impossible OP T < 12 (n−1/d − n−1/2 ) (left of the vertical line), the projection under
the quadratic GAN method achieves smaller overall error. Beyond the vertical
to achieve constant 1 without sacrificing the stochastic error line, one may consider switching to a more appropriate generator family.
term.
Theorem 7: Consider full-rank affine generator family
G = {g(Z) = AZ + b : A ∈ Rd×d , b ∈ Rd , Z ∈ Rd }. and the GAN estimator g̃ is defined in (45). Then, for any
Define  > 0, there exists some distribution PX , EPX [X] = 0,
PK = {N (0, I)} ∪ {PX : X − E[X] ≤ K a.s., inf g∈G W2 (Pg(Z) , PX ) > 0, such that with infinite sample
1 size, we have
λmin (E[(X − E[X])(X − E[X])T ]) ≥ }. (53) √
2
W2 (Pg̃(Z) , PX ) ≥ ( 2 − ) inf W2 (Pg(Z) , PX ). (56)
Then, there exists a constant D > 0 such that there is no g∈G
algorithm
√ with the following property: for any PX ∈ PK , K ≥

D( d + ln (n)), algorithm A makes n draws from PX
F. Admissible Distances for Cascade GANs
and with probability at least 51/100 outputs a hypothesis P̂X
satisfying In the Cascade GAN setting in (3), we have motivated the
usage of the distance
W2 (P̂X , PX ) ≤ inf W2 (Pg(Z) , PX ) + (n, d), (54)
g(·)∈G
L(P, Q) = min L1 (Pg (Z) , Q) + λL2 (P, Pg (Z) ) (57)
where (n, d) d n−6/d as n → ∞.
  g ∈G

The proof is relegated to Appendix (H). As an illustrating when the overall perturbation cannot be represented as only
example in Figure 4, we specify the discrepancy measure perturbing under L1 or L2 , but a cascaded perturbation L2
L to be W2 , G = {g(Z) = AZ + b : A ∈ Rd×d , b ∈ followed by L1 . In this section, we consider the generalization
Rd , Z ∈ Rd }, and plot E[W2 (Pg̃(Z) , PX )] as a function of problem (5) for this new distance.
OP T = inf g∈G W2 (PX , Pg(Z) ), for naive projection and pro- Theorem 9: Assume we have L1 as an admissible distance
posed algorithm. Furthermore, we show the minimal stochastic for (L1 , G  ) with parameter (c1 , c2 , L1 ), L2 as an admissible
error for keeping the slope 1, which means in order to make distance for (L2 , G  ) with parameter (d1 , d2 , L2 ) for generator
the stochastic error n in equation (5) smaller, it is necessary family G  , and G ⊂ G  . We further assume that L1 (Pg , Pg ) =
to sacrifice the constant C. L2 (Pg , Pg ) = 0 for all g ∈ G  . Define the new distance as
Combining Theorem 6 and 7, we know that for the
non-Gaussianness term, the prefactor 2 is achievable by the L (P, Q) = min L1 (Pg (Z) , Q) + λL2 (P, Pg (Z) ). (58)
Quadratic GAN, while the constant 1 is not achievable without  g ∈G

sacrificing the stochastic error. It is also shown in [37] that the


Then, L is an admissible distance for L in (57) and G with
sacrifice of constant factor in general cannot be avoided if we
parameter
require proper learning. Can we pin down the exact constant
while still keeping the stochastic error term small? The next c2 d2
theorem shows that the Quadratic GAN method cannot achieve max(c1 , d1 ), max(c1 , d1 ) · + , L1 ,
√ c1 d1
any constant that is strictly less than 2 even with infinite
sample size. and if we define the GAN estimator as
Theorem 8: Suppose
ĝ = arg min L (Pg(Z) , P̂nX ), (59)
G = {g(Z) = AZ : A ∈ Rd×d , Z ∈ Rd }, (55) g∈G

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
7164 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 66, NO. 11, NOVEMBER 2020

we have
L(Pĝ(Z) , PX ) ≤ (1 + 2 max(c1 , d1 ))OP T
+ 2 max(c1 , d1 )L1 (PX , P̂nX )
c2 d2
+ max(c1 , d1 ) · + , (60)
c1 d1
where
Fig. 5. The estimated W2 and W̃2 distances between a high dimensional
OP T = inf L(Pg(Z) , PX ). (61) isotropic Gaussian distribution and its empirical distribution for original W2
g∈G GAN and Quadratic GAN. The left figure is for 10-dimensional Gaussian
and the right figure is for 20-dimensional Gaussian. The x and y axes are
all logarithmic. The fitted coefficients of linear model from two lines show
that the matching method produces results that are consistent with theoretical
G. Testing the Strength of Discriminators With predictions with reasonably small sample sizes.
Empirical Samples
In order to achieve a fast convergence rate, it is crucial that an open problem in optimal transport theory. Based on the
L in Theorem 1 is weak enough. As illustrated in Theorem 5, matching method, we can estimate the number of samples
with inappropriate choice of discriminator, discriminator win- required for L(PX , P̂nX ) to achieve certain error , which helps
ning phenomenon will lead to poor performance of GANs. predict when discriminator winning problem occurs.
In this section, we propose a principled way to predict
the number of samples needed to alleviate the discriminator IV. O PTIMIZATION OF GAN S
winning phenomenon from empirical samples by a matching GANs are usually optimized using alternating gradient
theory [43]. Concretely, we aim at estimating the distance descent (AGD), which is known to be difficult to train, with
between empirical distribution and its population distribution little known stability or convergence guarantees. An unstable
L(P̂nX , PX ) using only empirical samples. training procedure may lead to mode collapse. Given a learn-
The method we propose is as follows. Given N samples ing algorithm proposed from statistical considerations, it is
from true distribution, we select the first n < N/2 samples to crucial to ensure that it is provably computationally efficient.
form empirical distribution P̂1 , and another disjoint n samples Thus it is of paramount importance to ensure that widely used
to form empirical distribution P̂2 . Then we compute L(P̂1 , P̂2 ) algorithms such as AGD are able to provably solve the GAN
as a proxy for L(PX , P̂nX ). A special case of this method optimization problem.
is known as maximum discrepancy [44] in the literature. It was shown in prior work [16] that the optimal generated
However, as is shown below, our method is more general and distribution under (45) is a Gaussian distribution whose covari-
can be applied to any pseudometric that is convex in its first ance matrix is the r-PCA of the empirical covariance matrix Σ̂.
argument. Our non-asymptotic result below can be seen as However, in practice, PCA is solved using specialized algo-
a generalization of the asymptotic justification for matching rithms but not alternating gradient descent. Recent work in
method in [45]. Fréchet Inception distance [40] also suggests training GAN
Theorem 10: Given N samples from true distribution, under the quadratic GAN distance W̃2 between coding layers.
we select the first n < N/2 samples to form empirical distri- Hence, it is natural to answer whether particularizing the
bution P̂1 , and another disjoint n samples to form empirical alternating gradient descent algorithm to this specific setting
distribution P̂2 . Suppose L(P, Q) is a pseudometric which is produces a convergent, stable, and efficient algorithm for the
convex in its first argument. Then, r-PCA. We mention that [16] has obtained some results in this
E[L(PX , P̂nX )] ≤ E[L(P̂1 , P̂2 )] ≤ 2E[L(PX , P̂nX )]. (62) direction but here we explore the question in greater depth.
In this section, we show that for the linear generator family
The proof is deferred to Appendix K. In the experiments
below, we use the matching method to successfully recover Gr = {g(Z) = AZ : A ∈ Rd×r , Z ∈ Rr } (63)
the exponent as predicted in Theorem 5 and 6. We conduct and W2 distance, AGD is globally asymptotically stable when
the experiments for estimating W2 and W̃2 distances between r = d, but is generally not locally asymptotically stable
an isotropic Gaussian distribution and its empirical distribution when r < d. We propose another GAN architecture based
with d = 10 and d = 20, where the result is shown in Figure 5. on parameter sharing which is provably globally asymptoti-
We can see that the matching method succesfully recovers cally stable7 and successfully computes the PCA using AGD.
the exponent of sample size n. In our analysis, it is shown Intuitively, one reason why AGD is not locally asymptotically
that for d ≥ 3 and n large enough, the quadratic GAN stable in solving the W2 -GAN formulation of PCA is that the
can achieve convergence rate as dn−1/2 , and the fitted rate minimax value is not the same as the maximin value in the
from two figures are d1.00 n−0.53 . Meanwhile the convergence GAN formulation, but AGD does not favor minimax explicitly
rate for the naive W2 GAN structure is lower bounded by over maximin. However, we do emphasize that this is not the
d1/2 n−3/d and upper bounded by C(d)n−1/d . The fitted rate sole reason, since the example in Figure 7 has duality gap zero,
is d0.48 n−1.24/d . We conjecture that the convergence rate is but alternating gradient descent is not asymptotically stable in
exactly d1/2 n−1/d up to universal multiplicative constants for
the W2 distance when d ≥ 3, whose validity is currently 7 The definition of local and global asymptotic stability is in Appendix ().

Authorized licensed use limited to: Politecnico di Torino. Downloaded on July 29,2024 at 15:57:33 UTC from IEEE Xplore. Restrictions apply.
ZHU et al.: DECONSTRUCTING GANs 7165

TABLE I
M INIMAX AND M AXIMIN VALUES OF O RIGINAL Q UADRATIC GAN
IN (65) FOR r = 1 AND PARAMETER S HARING GAN IN (72)

solving the minimax. It is still an open problem what all the


key reasons are that lead to the instability of GAN training. Fig. 6. The illustration of the instability of alternating gradient descent in
The key insight we draw is that utilizing information on GAN optimization in Theorem 12. We add 0.1 perturbation on both a11
where the maximum is attained can help us design a new and a22 , and plot the gradient flow for v. The star in the middle of the
figure is the optimal point for v. As long as A is perturbed a little, it would
minimax formulation whose duality gap (minimax value minus be repelled from the optimal point. This intuitively explains why alternating
maximin value) is zero and solve the inner maximum problem gradient descent is not locally asymptotically stable.
easily, while at the same time the new minimax value is equal
to the old minimax value. It is in constrast with other convexi-
fication approaches such as MIX-GAN [22], MAD-GAN [46]
and Stackelberg GAN [47], where new minimax problems
are proposed with smaller duality gap, but the new minimax
problem shares neither the same minimax value nor the same
generator set as the old formulation. Indeed, applying their formulations to the PCA setting mentioned above would trivialize the problem: by using multiple generators it is equivalent to finding the best full-dimensional Gaussian approximation of an arbitrary distribution, and the Quadratic GAN would simply output the Gaussian distribution whose mean and covariance match the observed distribution.

We summarize the minimax and maximin values in Table I for the original Quadratic GAN in Theorem 12 and the proposed Parameter Sharing GAN in Theorem 13. One can see that for the original Quadratic GAN the duality gap is non-zero, while the Parameter Sharing GAN circumvents this issue, resulting in global stability for alternating gradient descent.

A. Instability for Optimization in Degenerate Case

Let G_r be defined in (63). Theorem 6 shows that the algorithm specified in (64) below,

g̃ = arg min_{g(·)∈G_r} W_2(P_{g(Z)}, N(μ̂, Σ̂)),   (64)

enjoys desirable statistical properties. Can this problem be solved efficiently numerically? First, we show two equivalent minimax formulations of (64) via the following theorem.

Theorem 11: The GAN optimization problem in equation (64) with μ̂ = 0 can be cast in several equivalent forms. We can either solve

min_{U ≥ 0, r(U) ≤ r}  sup_{A > 0}  Tr((I − A)K + (I − A^{-1})U)   (65)

or solve

min_{U ≥ 0, r(U) ≤ r}  sup_{A ≥ 0, R(U) ⊂ R(A)}  Tr((I − A)K + (I − A†)U).   (66)

Here K = Σ̂, A is the discriminator, and U is the covariance matrix of the generator distribution.

It follows from [48, Pg. 129] that when r = d, alternating gradient descent converges to the optimal solution U* = K [16]. However, one of the key features of GANs is that usually the generator distribution is supported on a low-dimensional manifold embedded in a space of high ambient dimension. What happens when r < d? It is not difficult to see that alternating gradient descent is not globally convergent in this case [16]. We show here that one does not even have local asymptotic stability.

Theorem 12: The optimization problem (65) is generally not locally asymptotically stable when r(U) = 1 using alternating gradient descent.

As an illustrating example, we consider d = 2 and decompose U = vv^T. We denote A = [a_11, a_12; a_21, a_22] and let K = [1, 0; 0, 0]. The corresponding optimal solution is v = (1, 0), a_11 = 1, a_12 = a_22 = 0. Figure 6 shows that, starting from the optimal solution pair (A*, U*), if we let A deviate slightly from A*, then U is immediately repulsed away from U*. Furthermore, we simulate the dynamics of both A and U after such a perturbation in Figure 7 under the same setting. It can be seen that a_22 is monotonically increasing while the other values fluctuate around the equilibrium. Thus the system is not locally asymptotically stable.

Fig. 7. The simulation of the dynamics of alternating gradient descent in GAN optimization in Theorem 12. We add a 0.0001 perturbation on a_11, a_12, a_22, v_1 and a −0.0001 perturbation on v_2, and plot the change of the values with time. We can see that a_22 is monotonically increasing while the other values fluctuate around the equilibrium, which shows that alternating gradient descent is not locally asymptotically stable.

Our result provides a provable example in the same spirit as the example in [49], which showed experimentally that the standard gradient dynamics of a GAN learning a mixture of Gaussians often fail to converge, while training the generator against the optimal discriminator leads to convergence. It is also shown in [50] that employing a gradient-based learning algorithm for general-sum and potential games will avoid a non-negligible subset of the local Nash equilibria, and will lead to convergence to non-Nash strategies.
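The instability in Theorem 12 is easy to reproduce numerically. The following is a minimal sketch (ours, not the authors' original simulation code) that integrates the gradient-flow equations (242)–(246) derived later in the proof of Theorem 12 with a simple Euler scheme, starting from the perturbed equilibrium used for Figure 7; the step size and horizon are illustrative choices and may need tuning.

    import numpy as np

    def flow(a11, a12, a22, v1, v2):
        # Gradient-flow field from (242)-(246) with K = [1, 0; 0, 0].
        det = a11 * a22 - a12 ** 2
        p = (a22 * v1 - a12 * v2) / det      # first coordinate of A^{-1} v
        q = (a11 * v2 - a12 * v1) / det      # second coordinate of A^{-1} v
        return np.array([
            p ** 2 - 1.0,                    # da11/dt, since K_11 = 1
            p * q,                           # da12/dt
            q ** 2,                          # da22/dt
            2.0 * p - 2.0 * v1,              # dv1/dt
            2.0 * q - 2.0 * v2,              # dv2/dt
        ])

    # Equilibrium (a11, a12, a22, v1, v2) = (1, 0, 0, 1, 0), perturbed as in Fig. 7.
    state = np.array([1.0, 0.0, 0.0, 1.0, 0.0]) + np.array([1e-4, 1e-4, 1e-4, 1e-4, -1e-4])
    dt = 1e-4
    for step in range(1, 50001):
        state = state + dt * flow(*state)
        if step % 10000 == 0:
            print(step, state)

If the equilibrium were locally asymptotically stable, the state would return to it; instead the a_22 coordinate (third entry) keeps drifting away from zero rather than returning, qualitatively mirroring Figure 7.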
B. Stability Results for Parameter Sharing GAN

Ideally, in order to solve the minimax GAN optimization problem, we need to first solve the max and then the min, since the minimax value may not equal the maximin value in general. The failure of alternating gradient descent in Theorem 12 shows that treating the min and max optimization problems on an equal footing may cause instability problems. The idea of solving the maximization problem more and the minimization problem less is implicitly suggested in [51] and in the two time-scale update rule [40].

The next result shows that if we embed some knowledge of the optimal solution of the inner maximization problem, we can create a new minimax formulation with the following properties:
• Zero duality gap: the minimax value equals the maximin value;
• Relation to the old minimax formulation: the minimax value of the new formulation is equal to the minimax value of the old formulation without changing the generator;
• Optimization stability: AGD is globally asymptotically stable in solving the r-PCA for r = 1.

Naturally, one shall embed the connections between the optimal values A* and U* in the optimization problem. It can be shown that when r = 1, the optimal solutions in (65) can be written in the following form:

K = QΣQ^T,   (67)
U* = λ_1 v_1 v_1^T,   (68)
A* = v_1 v_1^T,   (69)

where (67) is the eigenvalue decomposition of K, Q = [v_1, v_2, ..., v_d], and Σ = diag(λ_1, λ_2, ..., λ_d) ranks the eigenvalues of K in descending order. In other words, U* is the 1-PCA of K, and A* shares the same linear subspace as U* with the eigenvalue being one. This naturally leads to the following parameter sharing GAN architecture. ([16] considered another parameter sharing architecture.)

Theorem 13: Alternating gradient descent is globally asymptotically stable in solving the GAN in the parameter sharing formulation, where

A = λvv^T,   (70)
U = bvv^T,   (71)

and we use alternating gradient descent to solve

min_{‖v‖=1, b≥0}  sup_{λ≥0}  Tr((I − A)K + (I − A†)U).   (72)

APPENDIX

Definition 2 (Pseudometric): A non-negative real-valued function d : X × X → R_+ is a pseudometric if for any x, y, z ∈ X, the following conditions hold:

d(x, x) = 0.   (73)
d(x, y) = d(y, x).   (74)
d(x, z) ≤ d(x, y) + d(y, z).   (75)

Unlike a metric, one may have d(x, y) = 0 for distinct values x ≠ y.

Definition 3 (Pseudonorm): A real-valued function ‖·‖ : X → R is a pseudonorm if for any x, y ∈ X and any scalar c, the following conditions hold:

‖x‖ ≥ 0, and x = 0 ⇒ ‖x‖ = 0.   (76)
‖cx‖ = |c| ‖x‖.   (77)
‖x + y‖ ≤ ‖x‖ + ‖y‖.   (78)

A pseudonorm differs from a norm in that a pseudonorm can be zero for nonzero vectors (functions).

Definition 4 (ε-Covering Number [33]): Let (V, ‖·‖) be a normed space, and G ⊂ V. We say {V_1, ..., V_N} is an ε-covering of G if ∀g ∈ G, ∃i such that ‖g − V_i‖ ≤ ε.

Definition 5 (Equilibrium Point [52]): Consider a nonlinear time-invariant system dx/dt = f(x), where f : R^d → R^d. A point x* ∈ R^d is an equilibrium point of the system if f(x*) = 0.

Definition 6 (Local Stability [52]): An equilibrium point in a system is locally stable if any point initialized near the equilibrium point stays near that equilibrium point. Formally, we say that an equilibrium point x* is locally stable if for all ε > 0, there exists a δ > 0 such that

‖x(0) − x*‖ < δ ⇒ ‖x(t) − x*‖ < ε, ∀t > 0.   (79)

Definition 7 (Local Asymptotic Stability [52]): An equilibrium point x* in a system is locally asymptotically stable if it is locally stable, and there exists a δ > 0 such that

‖x(0) − x*‖ < δ ⇒ lim_{t→∞} ‖x(t) − x*‖ = 0.   (80)

Definition 8 (Global Asymptotic Stability [52]): An equilibrium point x* is globally asymptotically stable if it is locally stable, and lim_{t→∞} ‖x(t) − x*‖ = 0 for all x(0) ∈ R^d.

A. Proof of Lemma 1

It follows from the definition of the pseudonorm that ‖x‖_L is a sublinear function on the vector space. It follows from the definition of ‖f‖_{L*} that for any f such that ‖f‖_{L*} = 1, we have

∫ f dμ ≤ ‖μ‖_L.   (81)

Hence, it suffices to show that for every μ we can indeed find some f such that ‖f‖_{L*} = 1 and ∫ f dμ = ‖μ‖_L.

For any μ, consider a continuous linear functional M defined on the one-dimensional subspace {α · μ : α ∈ R} such that M(αμ) = α‖μ‖_L. Clearly, we have

M(αμ) ≤ ‖αμ‖_L = |α| ‖μ‖_L.   (82)

It follows from the Hahn–Banach theorem [31, Theorem 4.2] that we can extend M to the whole vector space to obtain another linear functional T such that T(μ) = ‖μ‖_L and T(μ′) ≤ ‖μ′‖_L for any μ′. The proof is completed by noting that we have assumed that any continuous linear functional can be written as ∫ f dμ for some measurable f.
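To complement Theorem 13 above, here is a minimal numerical sketch (ours, not the authors' code) that integrates the parameter-sharing gradient flow written out later in the proof of Theorem 13, equation (250): dv/dt = 2λKv − 2λ(v^T K v)v, db/dt = 1/λ − 1, dλ/dt = −v^T K v + b/λ². For a random PSD matrix K it should approach the equilibrium (251), i.e., v → ±v_1, b → λ_1, λ → 1; the explicit Euler step size and the renormalization of v back onto the unit sphere are our own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    M = rng.standard_normal((d, d))
    K = M @ M.T
    K /= np.linalg.eigvalsh(K).max()          # normalize so the top eigenvalue is 1
    w, V = np.linalg.eigh(K)
    v1, lam1 = V[:, -1], w[-1]                # top eigenpair, for comparison only

    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    b, lam, dt = 0.5, 1.0, 1e-3
    for _ in range(300000):
        Kv = K @ v
        vKv = float(v @ Kv)
        v = v + dt * (2 * lam * Kv - 2 * lam * vKv * v)    # descent in v along the sphere
        v /= np.linalg.norm(v)                             # keep ||v|| = 1 numerically
        b = max(b + dt * (1.0 / lam - 1.0), 0.0)           # descent in b
        lam = max(lam + dt * (-vKv + b / lam ** 2), 1e-6)  # ascent in lambda
    print(abs(v @ v1), b, lam, lam1)   # expect roughly 1, lam1, 1

Printed values near (1, λ_1, 1) are consistent with the global asymptotic stability asserted by Theorem 13; with this crude integrator the convergence is only approximate.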
B. Proof of Theorem 1

Proof: It follows from the definition of OPT in (6) that there exists some g* such that

OPT = L(P_{g*(Z)}, P_X).   (83)

We have

L(P_{g′(Z)}, P_X) ≤ L(P_{g′(Z)}, P_{g*(Z)}) + L(P_{g*(Z)}, P_X)   (84)
≤ c_2 ε + c_1 (L′(P_{g′(Z)}, P_{g*(Z)}) − L′(P_{g*(Z)}, P_{g*(Z)})) + OPT   (85)
≤ c_2 ε + c_1 (L′(P_{g′(Z)}, P_X) − L′(P_{g*(Z)}, P_{g*(Z)}) + OPT) + OPT   (86)
= c_1 (L′(P_{g′(Z)}, P_X) − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε   (87)
≤ c_1 (L′(P_{g*(Z)}, P_X) − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε   (88)
≤ c_1 (L′(P_{g*(Z)}, P_{g*(Z)}) + OPT − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε   (89)
= (1 + 2c_1) OPT + c_2 ε.   (90)

Concretely, (84) follows from the triangle inequality of L, (85) follows from the first property of admissibility of L′, (86) follows from the second property of admissibility and the fact that the distance appearing in that property is upper bounded by L, (88) follows from the definition of g′, and (89) follows from the second property of admissibility.

For the estimator g′′, following similar lines of argument as in (84)–(87), we have

L(P_{g′′(Z)}, P_X) ≤ c_1 (L′(P_{g′′(Z)}, P_X) − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε
≤ c_1 (L′(P_{g′′(Z)}, P̂^n_X) + L′(P_X, P̂^n_X) − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε
≤ c_1 (L′(P_{g*(Z)}, P̂^n_X) + L′(P_X, P̂^n_X) − L′(P_{g*(Z)}, P_{g*(Z)})) + (1 + c_1) OPT + c_2 ε
= c_1 L′(P_{g*(Z)}, P̂^n_X) + c_1 L′(P_X, P̂^n_X) − c_1 L′(P_{g*(Z)}, P_{g*(Z)}) + (1 + c_1) OPT + c_2 ε
≤ c_1 (L′(P_{g*(Z)}, P_X) + L′(P_X, P̂^n_X)) + c_1 L′(P_X, P̂^n_X) − c_1 L′(P_{g*(Z)}, P_{g*(Z)}) + (1 + c_1) OPT + c_2 ε
= c_1 L′(P_{g*(Z)}, P_X) + 2c_1 L′(P_X, P̂^n_X) − c_1 L′(P_{g*(Z)}, P_{g*(Z)}) + (1 + c_1) OPT + c_2 ε
≤ c_1 (L′(P_{g*(Z)}, P_{g*(Z)}) + OPT) + 2c_1 L′(P_X, P̂^n_X) − c_1 L′(P_{g*(Z)}, P_{g*(Z)}) + (1 + c_1) OPT + c_2 ε
= (1 + 2c_1) OPT + 2c_1 L′(P_X, P̂^n_X) + c_2 ε.   (91)

C. Proof of Theorem 2

We first show that the first property is satisfied with c_1 = 1, c_2 = 4. Apparently, L′(P_X, P_X) = 0 for any P_X. Indeed, for any g̃, g* ∈ G, by the definition of G_ε, there exist g_1, g_2 ∈ G_ε such that

L(P_{g̃(Z)}, P_{g_1(Z)}) ≤ ε,   L(P_{g_2(Z)}, P_{g*(Z)}) ≤ ε.   (92)

Thus by the triangle inequality,

L(P_{g̃(Z)}, P_{g*(Z)}) ≤ L(P_{g̃(Z)}, P_{g_1(Z)}) + L(P_{g_1(Z)}, P_{g_2(Z)}) + L(P_{g_2(Z)}, P_{g*(Z)})
≤ L(P_{g_1(Z)}, P_{g_2(Z)}) + 2ε
= L′(P_{g_1(Z)}, P_{g_2(Z)}) + 2ε
≤ L′(P_{g_1(Z)}, P_{g̃(Z)}) + L′(P_{g̃(Z)}, P_{g*(Z)}) + L′(P_{g*(Z)}, P_{g_2(Z)}) + 2ε
≤ L′(P_{g̃(Z)}, P_{g*(Z)}) + 4ε,   (93)

where in the last inequality we used the fact that L′ ≤ L. The second property of admissibility is satisfied since L′ satisfies the triangle inequality and we can take the distance in that property to be L′ itself. Applying Theorem 1, we have

L(P_{g̃(Z)}, P_X) ≤ 3 · OPT + 2L′(P_X, P̂^n_X) + 4ε.   (94)

In order to bound L′(P_X, P̂^n_X), we write it as

L′(P_X, P̂^n_X) = sup_{f∈F} E[f(X)] − (1/n) Σ_{i=1}^n f(X_i).   (95)

Suppose there exists a constant B such that for any f ∈ F,

P(E[f(X)] − (1/n) Σ_{i=1}^n f(X_i) > t) ≤ 2 exp(−nt²/(2B²)).   (96)

Using the union bound, we can get

P(sup_{f∈F} E[f(X)] − (1/n) Σ_{i=1}^n f(X_i) > t) ≤ 2|F| exp(−nt²/(2B²)) ≤ 2N² exp(−nt²/(2B²)) ≤ 2C² (1/ε)^{2d} exp(−nt²/(2B²)),   (97)

since we have assumed that N ≤ C (1/ε)^d. Denoting the right-hand side of equation (97) by δ, we have

t = ((2B²/n) log(2C²/(δ ε^{2d})))^{1/2}.   (98)

Then with probability at least 1 − δ, we have

L(P_{g̃(Z)}, P_X) ≤ 3 · OPT + ((2B²/n) log(2C²/(δ ε^{2d})))^{1/2} + 4ε.   (99)

Taking ε = ((d log n)/n)^{1/2}, we can derive

L(P_{g̃(Z)}, P_X) ≤ 3 · OPT + D(B, C) · max{(d log n)/n, (1/n) log(1/δ)}^{1/2}.   (100)
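The finite-class bound (96)–(97) is easy to check numerically. The sketch below is our illustration with a hypothetical finite class F of halfspace indicators, for which (96) holds with B = 1/2 by Hoeffding's inequality; it compares the empirical (1 − δ)-quantile of sup_{f∈F} (E[f(X)] − (1/n)Σ_i f(X_i)) against the union-bound threshold t from (98).

    import numpy as np

    rng = np.random.default_rng(1)
    d, num_f, n, trials, delta = 3, 64, 500, 2000, 0.05
    W = rng.standard_normal((num_f, d))          # hypothetical finite class F of
    t0 = rng.standard_normal(num_f)              # halfspace indicators f_j(x) = 1{w_j . x > t_j}

    def F(x):                                    # (m, d) -> (m, num_f), values in {0, 1}
        return (x @ W.T > t0).astype(float)

    true_mean = F(rng.standard_normal((200000, d))).mean(axis=0)   # Monte Carlo proxy for E[f(X)]
    sups = [np.max(true_mean - F(rng.standard_normal((n, d))).mean(axis=0))
            for _ in range(trials)]
    B = 0.5                                      # f takes values in [0, 1], so (96) holds with B = 1/2
    t = np.sqrt(2 * B ** 2 / n * np.log(2 * num_f / delta))        # threshold from (97)-(98)
    print(np.quantile(sups, 1 - delta), t)       # the empirical quantile should fall below the bound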
D. Proof of Theorem 3

From the definition of TV′ in (28), there is TV′(P_X, P_X) = 0 for any P_X. By the definition of the total variation distance, we have for g_1, g_2 ∈ G,

TV′(P_{g_1}, P_{g_2}) = TV(P_{g_1}, P_{g_2}),   (101)

since the optimal set to distinguish between P_{g_1} and P_{g_2} is a halfspace. Thus the first property is satisfied with c_1 = 1, c_2 = 0.

The second property of admissibility is satisfied since TV′ satisfies the triangle inequality and we can take the distance in that property to be TV′. Applying Theorem 1, we have

TV(P_{g̃}, P_X) ≤ 3 · inf_{g∈G} TV(P_g, P_X) + 2TV′(P_X, P̂^n_X) ≤ 3 · OPT + 2TV′(P_X, P̂^n_X).   (102)

Since the class A = {{x ∈ R^d : ⟨v, x⟩ + b > 0} : ‖v‖_2 = 1, b ∈ R^d} is all halfspaces in R^d, it has VC dimension d + 1 [53]. By the VC inequality [33, Theorem 2.2, Chap. 4], regardless of the distribution P_X, we have with probability at least 1 − δ, there exists some absolute constant C such that

TV′(P_X, P̂^n_X) ≤ C · ((d+1)/n)^{1/2} + (log(1/δ)/(2n))^{1/2}.   (103)

Combining equations (102) and (103), we can derive that with probability at least 1 − δ, there exists some universal constant C such that

TV(P_{g̃}, P_X) ≤ 3 · OPT + C · ((d+1)/n)^{1/2} + 2 (log(1/δ)/(2n))^{1/2}.   (104)

E. Proof of Theorem 4

We first show that the first property is satisfied. From the definition of L_Tukey in (31), there is L_Tukey(P_X, P_X) = 0 for any P_X ∈ G, since each distribution in G is symmetric. For any two distributions P_1 = N(μ_1, I), P_2 = N(μ_2, I), it follows from straightforward computation that

TV(P_1, P_2) = 2Φ(‖μ_1 − μ_2‖_2 / 2) − 1.   (105)

We also have

L_Tukey(P_1, P_2) = Φ(‖μ_1 − μ_2‖_2) − 1/2.   (106)

Here Φ(x) = ∫_{−∞}^{x} φ(t) dt is the standard normal CDF. We notice that for any P_1, P_2 ∈ G, we have

L_Tukey(P_1, P_2) ≥ (1/2) TV(P_1, P_2),   (107)

which follows from the fact that Φ(x) as a function on [0, ∞) is non-decreasing. Hence the first property is satisfied with c_1 = 2, c_2 = 0.

As for the second property, we notice that

|L_Tukey(P_g, P_1) − L_Tukey(P_g, P_2)|
= |sup_{v∈R^d} P_1((X − E_{P_g}[X])^T v > 0) − sup_{v∈R^d} P_2((X − E_{P_g}[X])^T v > 0)|
≤ sup_{A ∈ {{x : (x−b)^T v > 0} : v, b ∈ R^d}} P_1(A) − P_2(A)
= TV′(P_1, P_2),   (108)

where TV′(P_1, P_2) is defined in equation (28). Thus L_Tukey satisfies the second property with the distance in that property taken to be TV′. Applying Theorem 1 we can derive

TV(P_{g̃}, P_X) ≤ 5 · OPT + 4TV′(P_X, P̂^n_X).   (109)

It follows from arguments similar to those in the proof of Theorem 3 that there exists an absolute constant C > 0 such that with probability at least 1 − δ,

TV(P_{g̃}, P_X) ≤ 5 · OPT + C · ((d+1)/n)^{1/2} + 4 (log(1/δ)/(2n))^{1/2}.   (110)

F. Proof of Theorem 5

1) Concentration of the W_2 Distance for Distributions That Satisfy a Logarithmic Sobolev Inequality: Following the idea of [54, Corollary 5.5], we can prove the following lemma.

Lemma 2: For any measure μ on a Polish space X, consider a random variable X ∼ μ and corresponding empirical samples x ∈ X^n, with each x_i sampled i.i.d. from the distribution μ. The map x ↦ W_2(P_X, P̂^n_X) is (1/√n)-Lipschitz under the Euclidean distance, i.e., viewing W_2(P_X, P̂^n_X) as a function W(x), for different sets of samples x, x′,

W(x) − W(x′) ≤ (1/√n) ‖x − x′‖.   (111)

Proof: By the triangle inequality of the W_2 distance,

W(x) − W(x′) ≤ W_2(P̂^n_X, P̂^n_{X′}).   (112)

We know that the Wasserstein distance is related to the Talagrand function [54] as follows:

W_2(μ, ν) = T_2(μ, ν)^{1/2} = inf_π { (∫_{X²} |x − y|² dπ(x, y))^{1/2} : π(x, y) ∈ P(X²), π(x) = μ, π(y) = ν }.   (113)

According to the convexity of T_2(·, ·) [55, Theorem 4.8], one has

W_2(P̂^n_X, P̂^n_{X′}) = (T_2(P̂^n_X, P̂^n_{X′}))^{1/2}
≤ ((1/n) Σ_{i=1}^n T_2(δ(x_i), δ(x′_i)))^{1/2}
= (1/√n) (Σ_{i=1}^n (x_i − x′_i)²)^{1/2}
= (1/√n) ‖x − x′‖.   (114)
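The halfspace distance TV′ that drives (102)–(104) and (108)–(110) can be approximated directly from samples. The following sketch is ours; the random search over directions and thresholds only yields a lower estimate of the supremum over halfspaces. It compares such an estimate between two unit-covariance Gaussians with the closed form (105), which by (101) also equals TV′ for this family.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    d, n = 5, 4000
    mu = 0.5 * np.ones(d)
    X = rng.standard_normal((n, d))              # samples from P1 = N(0, I)
    Y = rng.standard_normal((n, d)) + mu         # samples from P2 = N(mu, I)

    best = 0.0
    for _ in range(2000):                        # crude random search over halfspaces
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        sx, sy = X @ v, Y @ v
        for b in np.quantile(np.concatenate([sx, sy]), np.linspace(0.02, 0.98, 25)):
            best = max(best, abs((sx > b).mean() - (sy > b).mean()))

    closed_form = 2 * norm.cdf(np.linalg.norm(mu) / 2) - 1   # TV(P1, P2) from (105)
    print(best, closed_form)                     # the search estimate approaches the closed form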
Thus we have

W(x) − W(x′) ≤ (1/√n) ‖x − x′‖.   (115)

Under the same setting, we can derive the concentration property of the W_2 distance between any distribution that satisfies a logarithmic Sobolev inequality and its empirical distribution as follows.

Lemma 3: Let μ be a probability measure on R^n such that for some constant B > 0 and all smooth f on R^n,

Ent(f²) ≤ 2B E(|∇f|²),   (116)

where ∇f is the usual gradient of f and Ent(f) = E_μ(f log f) − E_μ(f) log E_μ(f). Assume X ∼ μ. The Wasserstein distance W_2(P_X, P̂^n_X) satisfies

P(W_2(P_X, P̂^n_X) ≥ E[W_2(P_X, P̂^n_X)] + t) ≤ exp(−nt²/2B).   (117)

From Lemma 2, we know that W_2(P_X, P̂^n_X) is a Lipschitz function of the samples x. Thus by the Herbst argument [35, 2.3], we can directly derive the inequality (117).

2) Proof of the Upper Bound: Denote

g* = arg min_{g∈G} W_2²(P_{g(Z)}, P_X).   (118)

By the triangle inequality, we have for any P_X,

W_2(P_X, P_{g̃(Z)}) ≤ W_2(P_X, P̂^n_X) + W_2(P_{g̃(Z)}, P̂^n_X)
≤ W_2(P_X, P̂^n_X) + W_2(P_{g*(Z)}, P̂^n_X)
≤ 2W_2(P_X, P̂^n_X) + W_2(P_{g*(Z)}, P_X)
= inf_{g∈G} W_2(P_{g(Z)}, P_X) + 2W_2(P_X, P̂^n_X).   (119)

We can upper bound the term W_2(P_X, P̂^n_X) given the following two assumptions:
• X ∼ μ satisfies the logarithmic Sobolev inequality in (116).
• Upper bound on the expected convergence rate for some constant C_1 depending on the distribution P_X:

E[W_2²(P_X, P̂^n_X)] ≤ C_1(P_X) · n^{−2/d}.   (120)

The first assumption leads to the concentration inequality (117) by Lemma 3. From (120), by Jensen's inequality, we can show that

E[W_2(P_X, P̂^n_X)] ≤ (C_1(P_X))^{1/2} · n^{−1/d}.   (121)

Taking the right-hand side in equation (117) as δ and solving for t, we have that with probability at least 1 − δ, there exists some constant C_1 depending on P_X such that

W_2(P_X, P̂^n_X) ≤ C_1(P_X) · n^{−1/d} + (2B log(1/δ)/n)^{1/2}.   (122)

Now we can derive the conclusion: for any distribution satisfying (116) and (120), there exists some constant C_1(P_X) such that with probability at least 1 − δ, we have

W_2(P_X, P_{g̃(Z)}) ≤ inf_{g∈G} W_2(P_{g(Z)}, P_X) + C_1(P_X) · n^{−1/d} + 2 (2B log(1/δ)/n)^{1/2}.   (123)

3) Proof of the Lower Bound for the Gaussian Distribution: We first lower bound the term W_2(N(0, I), P̂^n_X). We bound the W_2 distance from below by the W_1 distance via Hölder's inequality and then use Kantorovich duality:

W_2(N(0, I), P̂^n_X) ≥ W_1(N(0, I), P̂^n_X) = sup_{f : ‖f‖_L ≤ 1} (E f(X) − E f(Y)),   (124)

where X ∼ N(0, I), Y ∼ P̂^n_X. We take a specific Lipschitz-1 function to further lower bound it:

f(x) = min_i ‖x − X_i‖,   (125)

where i ∈ [n] and X_i is the i-th sample. We know that E f(Y) = 0 since Y follows the empirical distribution, thus

W_2(N(0, I), P̂^n_X) ≥ E_{X∼N(0,I)} min_i ‖X − X_i‖.   (126)

This is the nearest neighbor distance for the Gaussian distribution, which is extensively studied in the literature [56]–[58]. We know that there exist constants C_2(d), C_3 such that when n > C_2(d) and d > 1, with probability at least 0.999,

E_{X∼N(0,I)}[min_i ‖X − X_i‖] ≥ C_3 (Γ(1 + 1/d) / V_d^{1/d}) ∫ f^{1−1/d}(x) dx · n^{−1/d} ≥ C_3′ d^{1/2} n^{−1/d}   (127)

as n → ∞. Thus, we can conclude that for n > C_2(d) and d > 1, there exists a constant C_3′ such that with probability at least 0.999,

W_2(N(0, I), P̂^n_X) ≥ C_3′ d^{1/2} n^{−1/d}.   (128)

Note that when d = 1, it is known from [59, Corollary 6.14] that the convergence rate of E[W_2(N(0, I), P̂^n_X)] is of order (log log n / n)^{1/2}. The exact convergence rate for the case of d = 2 remains open.

Before proving the exponential convergence rate lower bound for the Gaussian distribution, we first consider a simpler case which searches for the generator in a subset of the linear generator family, G = {N(0, cI_d) : c ∈ R}, with real distribution N(0, I_d). The naive estimator looks for the c that minimizes the W_2 distance between the empirical distribution and the generator:

ĉ = arg min_{c≥0} W_2²(N(0, cI_d), P̂^n_X).   (129)

Assume X̂ ∼ P̂^n_X, Z ∼ N(0, I_d), and denote ρ_n = sup_{π : π_Z = N(0, I_d), π_X̂ = P̂^n_X} E[Z^T X̂]. The W_2² distance in
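The nearest-neighbor quantity in (126)–(128) can be checked empirically. The sketch below is ours, with illustrative sample sizes: it estimates E[min_i ‖X − X_i‖] for Gaussian samples and multiplies it by n^{1/d}; the product staying roughly constant as n grows reflects the d^{1/2} n^{−1/d} scaling used in the lower bound.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(3)
    d, m = 5, 2000                               # m fresh query points X ~ N(0, I_d)
    for n in [100, 1000, 10000]:
        samples = rng.standard_normal((n, d))    # the empirical sample X_1, ..., X_n
        queries = rng.standard_normal((m, d))
        nn_dist, _ = cKDTree(samples).query(queries, k=1)   # min_i ||X - X_i|| for each query
        print(n, nn_dist.mean(), nn_dist.mean() * n ** (1.0 / d))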
equation (129) is given by Lemma 4 [60]: For any symmetric positive definite matri-
ces A, B, we have
W22 (N (0, ĉI), P̂nX ) = min W22 (N (0, cI), P̂nX ) 1 1 1
c≥0
A2 − B2  A − B 2 , (136)
2 2 2
= min E Z 2 c + E X̂ 2
c≥0 where · is the operator norm.
− 2 sup E[Z T X̂]c Proof: Since equation (136) is homogeneous in scaling,
1
π it suffices to prove that for A − B ≤ 1, there is A 2 −
1
n 1
B 2 ≤ C. Now consider the function
= min dc2 + xi 2
− 2ρn c  1 
c≥0 n i=1 1
f (x) = 1− t−3/2 dt. (137)
1 1 +
n tx
2 0
= xi + min dc2 − 2ρn c Let s = tx, we get
n i=1 c≥0
 x

n s −3/2
1 2 ρ2n f (x) = x 1/2
s ds
= xi − . (130) 0 1+s
n d  ∞  ∞
i=1
1/2 s −3/2 s −3/2
=x s ds − s ds
The optimal c that minimizes the distance between empirical 0 1 + s x 1 + s
distribution and real distribution is 1
= Kx 2 + g(x), (138)
ρn  ∞ s −3/2 
ĉ = . (131) where K = 0 1+s s

ds, g(x) = x1/2 x 1+s
s
s−3/2 ds.
d
Furthermore, we know that It can be easily seen that K < ∞ and |g(x)| ≤ 2. Thus,
we know
W22 (N (0, I), P̂nX ) = E Z 2
2 + E X̂ 2
2 − 2 sup E[Z T X̂]
π KA1/2 − f (A) = g(A) ≤ 2. (139)
1 
n
This would imply
=d+ xi 2 − 2ρn . (132)
n i=1 1 1 1 f (A) f (A) f (B)
n A2 − B2 ≤ A2 − + −
We denote A = n1 i=1 xi 2 , which satisfies E[A] = d K K K
2d −2/d f (B) 1

 Var(A) = n . We note that for d ≥ 5, we have dn


and  +
K
− B2
d
as n → ∞. Hence, with probability at least 0.99, there 4 1
n
≤ + f (A) − f (B) . (140)
exists some C2 (d) > 0, C3 > 0, C
4 > 0 such that for n > K K
C2 (d), d ≥ 5, we have A ≤ d + C3 nd , d − ρn ≥ C4 dn−2/d . Since we know that A − B ≤ 1, as long as f is Lipschitz in
1 1
For the lower bound of W2 (N (0, I), Pg̃(Z) ), we use operating norm, we can conclude that A 2 −B 2 is bounded,
triangle inequality and the fact that W2 (Pg̃(Z) , P̂nX ) ≤ thus leading to uniform continuity. We show f is Lipschitz in
W2 (N (0, ĉI), P̂nX ). With probability at least 0.99, for the rest of the proof. Note that integration and matrix operation
n > C2 (d) and d ≥ 5, commute. Thus,
 1
W2 (N (0, I), Pg̃(Z) ) f (A) = (I − (I + tA)−1 )t−3/2 dt, (141)
0
 1
≥ W2 (N (0, I), P̂nX ) − W2 (Pg̃(Z) , P̂nX )
f (A) − f (B) = ((I + tB)−1 − (I + tA)−1 )t−3/2 dt
≥ W2 (N (0, I), P̂nX ) − W2 (N (0, ĉI), P̂nX ) 0 1
 ρ2 = (I + tB)−1 (A − B)(I + tA)−1 t−1/2 dt.
= d + A − 2ρn − A − n 0
d
ρ2n (142)
d + A − 2ρn − (A − d )
=√  The last equation comes from the identity X − Y = −1 −1
ρ2
d + A − 2ρn + A − dn X−1 (Y − X)Y−1 . Then we can derive
 1
(d − ρn )2
=  (133) f (A)−f (B) = (I+tB)−1 (A−B)(I+tA)−1 t−1/2 dt
√ ρ2
d( d + A − 2ρn + A − dn )  10
√ ≤ (I+tB)−1 (A−B)(I+tA)−1 t−1/2 dt
(d − ρn )3/2 d − ρn
= √  (134) 0
 1
d ρ2n
d + A − 2ρn + A − d ≤ (I + tB)−1 · (A − B)
≥ C5 d1/2 n−3/d . (135) 0
· (I + tA)−1 t−1/2 dt
 1
G. Proof of Theorem 6 ≤ (A − B) t−1/2 dt
0
1) Uniform Continuity of Matrix Square Root Operator in
= 2 (A − B) . (143)
Operating Norm: We need the following lemma for the proof
of main theorem. This finishes the proof of the whole lemma.
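Lemma 4 above (the operator-norm 1/2-Hölder continuity of the matrix square root) also admits a quick numerical sanity check. The sketch below is ours: it draws random PSD pairs and verifies that ‖A^{1/2} − B^{1/2}‖ / ‖A − B‖^{1/2} stays bounded by an absolute constant, consistent with (136).

    import numpy as np

    def psd_sqrt(A):
        w, V = np.linalg.eigh(A)                 # A is symmetric PSD
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

    rng = np.random.default_rng(4)
    worst = 0.0
    for _ in range(1000):
        k = int(rng.integers(2, 8))
        M1, M2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
        A, B = M1 @ M1.T, M2 @ M2.T
        ratio = (np.linalg.norm(psd_sqrt(A) - psd_sqrt(B), 2)     # operator norms
                 / np.sqrt(np.linalg.norm(A - B, 2)))
        worst = max(worst, ratio)
    print(worst)                                 # remains bounded over all trials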

2) Proof of Main Theorem: Similar to Theorem 2, Thus we have


the moment matching Wasserstein-2 distance W̃2 is an admis- 1 1 1
sible distance of the original Wasserstein-2 distance with W̃22 (PX , P̂nX ) = Tr(KX ) + Tr(K̂X ) − 2 Tr((KX
2
K̂X KX
2
)2 )
2
parameters (1, 0). Define + µX − µ̂X
1 1
≤ Tr(KX ) + Tr(K̂X ) − 2 Tr(K̂X
2
KX
2
)
gr∗ = arg min W̃2 (Pg(Z) , PX ), (144) 2
g∈Gr + µX − µ̂X
g ∗ = arg min W2 (Pg(Z) , PX ).
1 1
(145) 2 2 2
g∈G
= Tr((KX
2
− K̂X ) ) + µX − µ̂X
1 1
2 2
= KX − K̂X
2 2
F + µX − µ̂X
We have 1 1
2 2
≤ d KX
2
− K̂X
2
+ µX − µ̂X . (150)
W2 (Pg̃(Z) , PX ) ≤ W2 (Pg̃(Z) , Pg∗ (Z) ) + W2 (Pg∗ (Z) , PX ) If the minimum eigenvalue of KX is lower bounded by a
= W̃2 (Pg̃(Z) , Pg∗ (Z) ) + W2 (Pg∗ (Z) , PX ) constant B, by Ando-Hemmen inequality [62, Thoerem 6.2],
we have
≤ W̃2 (Pg̃(Z) , PX ) + W̃2 (PX , Pg∗ (Z) )
1 1

+ W2 (Pg∗ (Z) , PX ) W̃22 (PX , P̂nX ) ≤ d KX 2


− K̂X2 2
+ µX − µ̂X 2
≤ W̃2 (Pg̃(Z) , PX ) + 2W2 (Pg∗ (Z) , PX ) d
≤ 1/2 1/2
KX − K̂X 2
(λmin (KX ) + λmin (K̂X ))2
≤ W̃2 (Pg̃(Z) , P̂nX ) + W̃2 (P̂nX , PX )
+ µX − µ̂X 2 (151)
+ 2W2 (Pg∗ (Z) , PX )
d
≤ W̃2 (Pgr∗ (Z) , P̂nX ) + W̃2 (P̂nX , PX ) ≤ KX − K̂X 2 + µX − µ̂X 2 . (152)
B
+ 2W2 (Pg∗ (Z) , PX ) If the minimum eigenvalue of KX is not lower bounded,
≤ W̃2 (Pgr∗ (Z) , PX ) + 2W̃2 (P̂nX , PX ) it follows from Lemma 4 that
1 1
+ 2W2 (Pg∗ (Z) , PX ) W̃22 (PX , P̂nX ) ≤ d KX
2
− K̂X
2 2
+ µX − µ̂X 2

2
= inf W̃2 (Pg(Z) , PX )+2 inf W2 (Pg(Z) , PX)  d KX − K̂X + µX − µ̂X . (153)
g∈Gr g∈G

+ 2W̃2 (P̂nX , PX )
1/2
H. Proof of Theorem 7
≤ ΣX − Σ1/2
r F + 2 inf W2 (Pg(Z) , PX )
g∈G We aim at justifying the tightness of Theorem 2. Consider
+ 2W̃2 (P̂nX , PX ). (146) the following two sampling models:
• Case 1. We draw n i.i.d. samples directly from N (0, I).
Here ΣX = E[XX T ] is the covariance matrix of X, and Denote the resulting empirical distribution as E1 .
Σr is the best rank-r approximation of ΣX under Frobenius • Case 2. We draw N elements i.i.d. from N (0, I). Denote
norm. Next, we bound the term W̃2 (P̂nX , PX ). Denote the the empirical distribution as DN . Then, we sample n
covariance matrices for PX and P̂nX as KX and K̂X . The i.i.d. points from DN . Denote the resulting empirical
distance W̃22 (PX , P̂nX ) can be written as [61] distribution as E2 .
For any distribution E we write AE to indicate that algo-
W̃22 (PX , P̂nX ) = Tr(KX ) + Tr(K̂X ) − 2 Tr((KX
2
K̂X KX
2
)2)
1 1 1 rithm A is only given access to the information contained in E.
2
The following lemma states that no algorithm
√ can successfully
+ µX − µ̂X . (147) distinguish these two cases when n  N .
Lemma 5: There is an absolute
√ constant c > 0 such that the
We know following holds. If n ≤ c N , and B is any “distinguishing”
1 1 1 1 1 1
algorithm that aims outputting either “1” or “2”. Then,
1 1
Tr((KX
2
K̂X KX
2
) 2 ) = Tr(((K̂X
2
KX ) K̂X
2 T 2
KX
2
)2 )
1 1 |P(B E1 outputs“1”) − P(B E2 outputs“1”)| ≤ 0.01. (154)
≥ Tr(K̂X KX ).
2 2
(148)
Proof: We √ show that in both cases, with probability at
1 1 least√0.99, the c N draws received by B are a set consisting
Denote X = K̂X KX . The last inequality follows from the
2 2
of c N i.i.d. samples from N (0, I). This can be shown
fact that for any matrix X and its singular value decomposition
using a birthday paradox type argument below. The empirical
X = UΣV,
distribution E2 can be understood as drawing n i.i.d. samples
1 1 from DN with replacement. On the other hand, the empirical
Tr((XT X) 2 ) = Tr((VT Σ2 V) 2 ) = Tr(VT ΣV) distribution E1 can be understood as drawing n i.i.d. samples
≥ Tr(UΣV) = Tr(X). (149) from DN without replacement. It follows from the birthday

paradox that the probability of sampling all distinct points in It follows from the triangle inequality that
DN in sampling with replacement is
W2 (P̂X , N (0, I)) ≥ W2 (DN , N (0, I)) − W2 (P̂X , DN )
1 2 3 n (162)
1− 1− 1− ··· 1 −
N N N N
2 4 6 2n
≥ W2 (DN , N (0, I))
>e− N e− N e− 2N · · · e− N − inf W2 (Pg(Z) , DN ) − (n, d). (163)
n(n−1) g(·)∈G
=e− N . (155)
According to Theorem 5, following the same procedure
The inequality comes from the fact that 1 − x ≥ e−2x for as (133), we have with probability at least 0.999
n(n−1)
x ∈ [0, 12 ]. Taking c = 0.01, we have e− N > 0.99, which
gives the conclusion of the lemma. W2 (DN , N (0, I)) − inf W2 (Pg(Z) , DN ) d N −3/d
g(·)∈G
Now we use Lemma 5 to prove Theorem 7. Suppose √ that
(164)
there exists an algorithm A, such that given n = c N
−6/d
independent points draws from any PX ∈ PK , A outputs a d n ,
hypothesis P̂X with probability at least 51/100 satisfying (165)

W2 (P̂X , PX ) ≤ inf W2 (Pg(Z) , PX ) + (n, d), (156) where in the last step we used the assumption that n = c N .
g(·)∈G
It implies that with probability at least 0.507
where (n, d) d n−6/d . We will describe how the existence W2 (P̂X , N (0, I)) > (n, d) (166)
of A yields a distinguishing algorithm B violating Lemma 5.
The algorithm B works as follows. Given access to i.i.d. draws if (n, d) d n−6/d as n → ∞.
from distribution PX , it first runs algorithm A, obtaining with For algorithm B, it outputs “1" if W2 (P̂X , N (0, 1)) ≤
probability at least 51/100 a distribution P̂X satisfying equa- (n, d), hence we know
tion (156), and then computes the value W2 (P̂X , N (0, I)).  
If W2 (P̂X , N (0, I)) ≤ (n, d) then it outputs “1", and other- P B E2 outputs “1” ≤ 0.493, (167)
wise it outputs “2". which implies that
Since G is the family of full-dimensional affine genera-
tors, inf g(·)∈G W2 (Pg(Z) , N (0, I)) = 0, which shows that in |P(B E1 outputs“1”) − P(B E2 outputs“1”)| > 0.01. (168)
Case 1, we have with probability 51/100 W2 (P̂X , N (0, I)) ≤
This contradicts with Lemma 5 and proves the theorem.
(n, d). Hence,
 
P B E1 outputs “1” ≥ 0.51. (157) I. Proof of Theorem 8
Now let’s consider Case 2. In this case the true underlying For infinite sample size, we have access to the real distribu-
distribution is DN , and for N  d [41], we have the tion PX , and our GAN estimator would output a Gaussian
minimum eigenvalue of the covariance matrix of DN is bigger distribution N (µX , ΣX ), where µX = EPX [X], ΣX =
than 1/2 with probability at least 0.999. It follows from the EPX [XX T ] are the mean and covariance of the distri-
concentration of the norm [42, Theorem 3.1.1] there exists a bution PX . It suffices to show that for any  > 0,
universal constant L > 0 such that if Y ∼ N (0, I), we can find some distribution PX with µX = 0,
inf µ,Σ W2 (N (µ, Σ), PX ) > 0, such that
√ t2 √
P(| Y − d| ≥ t) ≤ 2 exp − , (158) W2 (N (0, ΣX ), PX ) ≥ ( 2 − ) inf W2 (N (µ, Σ), PX ).
L µ,Σ

which implies that with probability at least 1 − δ, we have (169)



√ We construct distribution PX in the following way. Denote
2
Y ≤ d + L ln . (159) X = (X1 , X2 , . . . , Xd )T , we assume that the d random
δ
variables {X1 , X2 , . . . , Xd } are mutually independent, Xi ∼
Since DN consists of N vectors, if we set δ = 1 N (0, 1) for 1 ≤ i ≤ d − 1, and
1000N , then
with probability at least 0.999, we know 1 1 1
√  PXd = Qa = 1− δ0 + δ−a + 2 δa , (170)
a2 2a2 2a
X ≤ d + L ln (2000N ) (160)
where a > 1 is some parameter that will be chosen later. Here
 if X ∼ DN . In other words, we have
almost√ surely δy denotes the point mass distribution that puts probability one
K = d + L ln (2000 N ). for the point y. Clearly, EPX [X] = 0, EPX [XX T ] = I.
Hence, with probability at least 0.508, by running algorithm We first show an upper bound on the right hand side
A we obtain a distribution P̂X such that of (169). Indeed, for the PX we constructed,
W2 (P̂X , DN ) ≤ inf W2 (Pg(Z) , DN ) + (n, d). (161) inf W22 (N (µ, Σ), PX ) ≤ inf W22 (N (0, c), Qa ), (171)
g(·)∈G µ,Σ c≥0

where we have reduced the set of µ, Σ we take the infimum We have


over and only considered couplings that are independent across
the indices of X. Using the tensorization property of W22 , E[f (Z)] = E[a(|Z| − τa )½(|Z| ≥ τa )] (186)
we know that = aE[|Z|½(|Z| ≥ τa )]−aτa P(|Z| ≥ τa ) (187)
τa
W2 (N (0, I), PX ) = W2 (N (0, 1), Qa ). (172) = aE[|Z|½(|Z| ≥ τa )] − (188)
a
τa
Hence, it suffices to show that for any  > 0, there exists ≤ a(E[Z ]) (P(|Z| ≥ τa ))3/4 −
4 1/4
(189)
some a > 1 such that a
4 1/4 a τa
√ ≤ (E[Z ]) − (190)
W2 (N (0, 1), Qa ) ≥ ( 2 − ) inf W2 (N (0, c), Qa ). (173) a3/2 a
c≥0
as well as
It follows from the definition of W2 that
1 1
EQa [|Y |] = a= . (191)
W22 (N (0, c), Qa ) = inf E(X − Y ) 2
(174) a 2 a
π:πX =N (0,c),πY =Qa
2 2 Hence,
= E[X ] + E[Y ]
−2 sup Eπ [XY ] (175) 1 τa τa
ρa ≤ (E[Z 4 ])1/4 − + (192)
π:πX =N (0,c),πY =Qa
√ a1/2 a a
= c+1−2 sup Eπ [ cZY ]. 4 1/4 1
= (E[Z ]) , (193)
π:πZ=N (0,1),πY =Qa a1/2
(176)
which can be made arbitrarily close to zero if we take a → ∞.
Note that supπ:πZ =N (0,1),πY =Qa Eπ [ZY ] is independent of c.
Denote
J. Proof of Theorem 9
ρa = sup Eπ [ZY ].
π:πZ =N (0,1),πY =Qa
We verify the two properties of admissible distances.
First, for any g ∈ G, since G ⊂ G  ,
Then,
√ L (Pg , Pg ) = min L1 (Pg , Pg ) + λL2 (Pg , Pg )
inf W22 (N (0, c), Qa ) = inf (c + 1 − 2 cρa ) (177)  
g ∈G
c≥0 c≥0
= 0, (194)
= 1 − ρ2a , (178)
where the infimum achieving c = ρa . Taking c = 1, we know since we can take g  = g. There is L (Pg , Pg ) = L1 (Pg , Pg ) =
that L2 (Pg , Pg ) = 0. Since G ⊂ G  , for g1 , g2 ∈ G, we know that
g1 , g2 ∈ G  . Denote
W22 (N (0, 1), Qa ) = 2 − 2ρa . (179)
g ∗ = arg min L1 (Pg , Pg1 ) + λL2 (Pg2 , Pg ). (195)
Hence, we have g ∈G 

W2 (N (0, 1), Qa ) 2 − 2ρa From the fact that L1 and L2 are admissible distances, we have
= (180)
inf c≥0 W2 (N (0, c), Qa ) 1 − ρ2a
L (Pg1 , Pg2 ) − L (Pg2 , Pg2 ) = L1 (Pg∗ , Pg1 ) + λL2 (Pg2 , Pg∗ )
2 1 c2
= . (181) ≥ L1 (Pg∗ , Pg1 ) −
1 + ρa c1 c1
Hence, it suffices to show that one can make ρa → 0 λ d2
+ L2 (Pg2 , Pg∗ ) − .
as a → ∞. Define τa such that for Z ∼ N (0, 1), we have d1 d1
P(|Z| ≥ τa ) = a12 . It follows from the Gaussian tail abound (196)
that
√ By the non-negativity of the distance function, we can derive
τa  1 + ln a. (182)
max(c1 , d1 ) · L (Pg1 , Pg2 ) ≥ L1 (Pg∗ , Pg1 ) + λL2 (Pg2 , Pg∗ )
Consider function f : R → R defined as
 c2 d2
− max(c1 , d1 ) · ( + )
0 0 ≤ |z| ≤ τa c1 d1
f (z) = (183)
a(|z| − τa ) |z| ≥ τa ≥ min
 
L 1 (Pg  , Pg )+λL2 (Pg , Pg  )
1 2
g ∈G

The key observation is the following inequality for any z, x c2 d2


− max(c1 , d1 ) · ( + )
such that |x| = a or x = 0: c1 d1
= L(Pg1 , Pg2 ) − max(c1 , d1 )
xz ≤ f (z) + τa |x|. (184)
c2 d2
· ( + ). (197)
Hence, we can upper bound c1 d1
ρa ≤ EZ∼N (0,1) [f (Z)] + τa EQa [|Y |]. (185) Hence the first property is satisfied.

Now we verify the second property. For any distribu- which by definition is an
upper bound
 on
tion P1 , P2 and any g ∈ G, denote A Ψ
W22 (N (0, A), N (0, B))
for any Ψ such that ≥ 0.
ΨT B
g1∗ = arg min L1 (Pg , P1 ) + λL2 (Pg , Pg ), (198) We now show that for any S ≥ 0, R(B) ⊂ R(S),
g ∈G 
g2∗ = arg min L1 (Pg , P2 ) + λL2 (Pg , Pg ). (199) Tr((I − S)A + (I − S † )B) ≤ W22 (N (0, A), N (0, B)).
g ∈G  (211)
Without loss of generality, assume L (Pg , P1 ) ≥ L (Pg , P2 ), Indeed,
then there is
W22 (N (0, A), N (0, B)) = Tr(A) + Tr(B) − 2 sup Eπ [X T Y ]
|L (Pg , P1 ) − L (Pg , P2 )| = (L1 (Pg1∗ , P1 ) + λL2 (Pg , Pg1∗ )) π

− (L1 (Pg2∗ , P2 ) + λL2 (Pg , Pg2∗ )) (212)


≤ (L1 (Pg2∗ , P1 ) + λL2 (Pg , Pg2∗ )) ≥ Tr(A) + Tr(B) − 2(E[f (X)]
− (L1 (Pg2∗ , P2 ) + λL2 (Pg , Pg2∗ )) + E[f ∗ (Y )]), (213)
= |L1 (Pg2∗ , P1 ) − L1 (Pg2∗ , P2 )| where we have used Young’s inequality xT y ≤ f (x) + f ∗ (y)
(200) where f is an arbitrary convex function, and f ∗ is its convex
conjugate. Taking f (x) = 12 xT Sx, S ≥ 0, we have
= L1 (P1 , P2 ). (201)  
f ∗ (y) = sup xT y − f (x) (214)
x∈Rd
K. Proof of Theorem 10 
1 T †
y S y y ∈ R(S)
The right-hand side can be seen via the triangle inequality = 2 . (215)
upon noting that P̂1 and P̂2 are independent copies of the ∞ y∈/ R(S)
empirical distribution P̂nX . Evaluating the expectations leads to the desired inequality.
E[L(P̂1 , P̂2 )] ≤ E[L(P̂1 , PX )] + E[L(PX , P̂2 )]]  Now, we have shown that for any Ψ such that
A Ψ
= 2E[L(PX , P̂nX )]. ≥ 0, any S ≥ 0 such that R(B) ⊂ R(S), we have
(202) ΨT B
The left hand side can be proved via Jensen’s inequality and Tr(A)+Tr(B)−2 Tr(Ψ) ≥ W22 (N (0, A), N (0, B))
convexity of L,
≥ Tr(A)+Tr(B)−Tr(SA+S † B).
E[L(P̂1 , P̂2 )] ≥ E[L(PX , P̂2 )] (216)
= E[L(PX , P̂nX )]. (203) The following lemma constructs an explicit coupling.
Lemma 6: Define

L. Proof of Theorem 11 S0 = B 1/2 (B 1/2 AB 1/2 )1/2 B 1/2 (217)
It suffices to show the following: for any two Gaussian
Ψ0 = AS0 . (218)
distributions X ∼ N (0, A), Y ∼ N (0, B), we have
Then,
W22 (N (0, A), N (0, B))  
= inf Eπ X − Y 2 A Ψ0
(204) ≥ 0, (219)
π:πX =N (0,A),πY =N (0,B) ΨT0 B
   
A Ψ
= min Tr(A) + Tr(B) − 2 Tr(Ψ) : ≥0 (205) and
ΨT B
= sup Tr((I − S)A + (I − S −1 )B) (206) Tr(Ψ0 ) = Tr((A1/2 BA1/2 )1/2 ). (220)
S>0
Proof: We first remind ourselves of the Schur complement
= sup Tr((I − S)A + (I − S † )B) (207) characterization of PSD matrices.
S≥0,R(B)⊂R(S)
Lemma 7 [61]: Suppose A ≥ 0, B ≥ 0. Then,
= Tr(A) + Tr(B) − 2 Tr((A1/2 BA1/2 )1/2 ). (208)  
A Ψ
Σ= ≥0 (221)
We first consider equation (205). Consider jointly Gaussian ΨT B
coupling between X and Y such that the joint covari-
ance matrix of the vector (X; Y ) is given by the positive if and only if
semi-definite matrix Ψ ∈ Ω = {Ψ : R(ΨT ) ⊂ R(B), A − ΨB † ΨT ≥ 0}. (222)
 
A Ψ
. (209) Now we check that Ψ0 is a legitimate covariance matrix
ΨT B
using Lemma 7. Indeed, denoting C = (B 1/2 AB 1/2 )1/2 ,
The corresponding value under this specific coupling is given we have S0 = B 1/2 C † B 1/2 , Ψ0 = AB 1/2 C † B 1/2 , and it
by is clear that R(ΨT0 ) ⊂ R(B). It suffices to verify that
Tr(A) + Tr(B) − 2 Tr(Ψ), (210) A − Ψ0 B † ΨT0 ≥ 0. (223)

It suffices to verify that Consider the special case when


I≥A 1/2
B 1/2 †
C B 1/2 †
B B 1/2 †
C B 1/2 1/2
A . (224)  
1 0
K= . (239)
Denote the SVD of A1/2 B 1/2 = U ΣV T , we have 0 0
1/2 1/2
B A = V ΣU T , C = V ΣV T . Building on the obser-
We can derive the optimal solution as
vation that B 1/2 B † B 1/2 ≤ I, we have
   
A1/2 B 1/2 C † B 1/2 B † B 1/2 C † B 1/2 A1/2 1 0 1
A∗ = , v∗ = (240)
0 0 0
≤A1/2 B 1/2 C † C † B 1/2 A1/2
=U ΣV T V Σ† V T V Σ† V T V ΣU T (225) Note that A∗ is not inside the feasible region of positive
definite matrices, but at the boundary of that. However, one can
≤U ΣΣ† Σ† ΣU T (226)
verify that the gradient flow is 0 for the optimal solution if at
≤I, (227) the boundary we interpret the matrix inverse as matrix pseudo-
where in the last step we used the inequality ΣΣ† Σ† Σ ≤ I. inverse. We are interested in whether the system will leave the
Lemma 8: [63] Suppose A, B are PSD matrices of the optimal point after being slightly perturbed. Consider A, U is
same size. Then, searching within
   
inf Tr(AS + BS −1 ) = 2 Tr((A1/2 BA1/2 )1/2 ). (228) a a v
S>0 A = 11 12 , v = 1 (241)
a21 a22 v2
The infimum is achievable for some S > 0 if and only if
r(A) = r(B) = r(A1/2 BA1/2 ). Here r(·) is the rank of Note that A must be symmetric, thus we also have a12 =
matrix. a21 . Denote x = [vec(A), v] as the system state vector, and
Now we can prove the main results. We have the following x∗ = [vec(A∗ ), v ∗ ] as the equilibrium point of system defined
chain of inequalities and equalities as a consequence of (216), in equation (238). We can compute the derivative of a11 , a12 ,
Lemma 6, and Lemma 8. a22 and v as follows.
2 Tr((A1/2 BA1/2 )1/2 ) = inf Tr(AS + BS −1 ) (229) da11 a22 v1 − a12 v2
2
S>0
= −1 (242)
≥ inf Tr(AS + BS † ) dt a11 a22 − a212
R(B)⊂R(S),S≥0
da12 (a22 v1 − a12 v2 )(a11 v2 − a12 v1 )
(230) = (243)
dt (a11 a22 − a212 )2
≥ sup  2 Tr(Ψ) (231) 2
da22 a11 v2 − a12 v1
A Ψ =
Ψ: T
(244)
Ψ B
≥0 dt a11 a22 − a212
dv1 a22 v1 − a12 v2
≥ 2 Tr(Ψ0 ) (232) =2 − 2v1 (245)
dt a11 a22 − a212
= 2 Tr((A1/2 BA1/2 )1/2 ). (233) dv2 a11 v2 − a12 v1
=2 − 2v2 (246)
dt a11 a22 − a212
M. Proof of Theorem 12
The optimal solutions in (65) when r = 1 can be written in Thus for any δ > 0, we can intialize v = (1, 0), a11 = 1,
the following form: a12 = 0, a22 = δ/2. Under this case, all the derivatives
equal 0. Then x(0) − x∗ 2 = δ/2 < δ, but limt→∞ x(t) −
K = QΣQT (234) x∗ 2 = δ/2 = 0. Since δ can be arbitrarily small, it violates

U = λ1 v1 v1T (235) the definition of local asymptotic stability in Appendix .

A = v1 v1T , (236) We can conclude that the system is not locally asymptotically
stable.
where Q = [v1 , v2 , . . . , vd ], and Σ = diag(λ1 , λ2 , . . . , λd )
ranks the eigenvalues of K in a descending order. In other
words, U ∗ is the 1-PCA of K, and A∗ shares the linear N. Proof of Theorem 13
subspace as U ∗ with the eigenvalue being one. We have assumed that r = 1. We assume without loss
We show the optimization problem (65) is in general not of generality that K = 0. We parametrize U = bvv T , A =
locally asymptotically stable with r(U ) ≤ 1. We parametrize λvv T , v 2 = 1, b > 0, λ > 0. In this case, the objective
U = vv T , i.e. function reduces to
min sup Tr((I − A)K + (I − A−1 )vv T ). (237) b
v A>0 Tr(K) − λv T Kv + b − . (247)
λ
Taking the derivative of A and v, the gradient flow differential
equation is given by Denote
   
d A −K + A−1 U A−1 b
= . (238) f (v, b, λ) = Tr(K) − λv T Kv + b − , (248)
dt v −2(I − A−1 )v λ

we now analyze the alternating gradient descent flow for this The case of b = 0, λ > 0 can also be solved and it is
problem. The gradient flow differential equation is given by straightforward to show that dLdt ≤ 0 for any v such that
⎡ ⎤ v T v = 1, b = 0, λ > 0.
⎡ ⎤ − ∂f
v Taking the part corresponding to b out, we have
d ⎣ ⎦ ⎢ ⎥
∂v
b =⎢ ∂f ⎥
⎣− ∂b ⎦ . (249)
dt 1 1 1 b  
λ ∂f b −1+ − 2 − 2
2v T K 2λKv − 2λ(v T Kv)v
∂λ λ λ λ 4λ
2
Note that we have constrained v 2 = 1, in other words, v bb
lies in the unit sphere. Suppose the unconstrained derivative of − − v T Kv (254)
2λ3
λ2
f at v is g, then the gradient of f constrained on the Stiefel 2
1 b T 2 
manifold [64] v 2 = 1 is g−vg T v. Indeed, the direction = −b 1 − − v K v − (v T Kv)2
g−vg T v is orthogonal to v. λ λ
2
Concrete computation shows when b > 0, λ > 0, b b
− − v T Kv (255)
⎡ ⎤ ⎡ ⎤ 2λ3 λ2
v 2λKv − 2λ(v T Kv)v
d ⎣ ⎦ ⎢ 1 ⎥ ≤ 0, (256)
b =⎣ λ −1 ⎦. (250)
dt
λ −v Kv + λ2 .
T b
where equality holds if and only if λ = 1, v T K 2 v =
 1
 (v T Kv)2 , b = v T Kv, here we have used the inequality that
whenT b = 0,b dt = λ − 1 ∨ 0, and when λ = 0, dt =
db dλ
v T K 2 v ≥ (v T Kv)2 .
−v Kv + λ2 ∨ 0.
We decompose v as
The saddle point we aim for is
⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 
d
v ±v1 ±v1
⎣ b ⎦ = ⎣v1T Kv1 ⎦ = ⎣ λ1 ⎦ v= ai vi , (257)
(251)
i=1
λ 1, 1,
where {v1 , v2 , . . . , vd } are the eigenvectors of matrix K. Here

Here v1 is an eigenvector corresponding to the largest eigen- 2
value of K, and λ1 is the largest eigenvalue of K, which is i ai = 1.
The terms that remain in the derivative of L is
assumed to be positive. We also assume that at the beginning
of the optimization, we initialize v such that v1T v = 0. 1 v1T 1
− + λ21 (2λKv − 2λv T Kvv) + λ1 1 −
Hence, we construct the Lyapunov function L(v, b, λ) as 8 v T v1 λ
1 1 1 1   
L(v, b, λ) = + λ21 ln + (b − λ1 )2 + (1−λ)v T Kv+ v T Kv 2v T K 2λKv−2λ(v T Kv)v ,
8 |v T v1 | 2 4
2 2 (258)
1 1 1 1 b
+ (λ−1)2 + −1 + −v T Kv .
2 8 λ 8 λ2 which is equivalent to
(252)  
1  λ1 
− + λ21 2λ λ1 − λi a2i + λ1 − + λi a2i
Note that if λ → 0+ , L → ∞, so if we are able to prove the 8 λ
descent property of L then λ will not hit zero in the trajectory.  i
 i

   
Solving for the gradient of L and denoting v(t), b(t), λ(t) 2 2 2 2 2
−λ λi ai + λ λi ai λi ai − ( λi ai ) .
as v, b, λ when b > 0, λ > 0, we have i i i i
d (259)
L(v(t), b(t), λ(t))
dt
1 v1T Hence, it suffices to show that
=− + λ21 (2λKv − 2λv T Kvv) 
8 v T v1 λ1 + λi a2i
1
+ (b − λ1 ) −1
i

λ λ1  1 
b ≤ +λ λi a2i + 2 + λ21 (λ1 − λi a2i )
+ (λ − 1) −v T Kv + 2 λ 8
λ  i   i
  
1 1 −1 b − λi a2i λ2i a2i − ( λi a2i ) . (260)
+ −1 − v T Kv
4 λ λ2 λ2 i i i
1 b 1 1 To simplify notation, we introduce
+ − v T Kv −1
4 λ2 λ2 λ 
1 b   x= λi a2i (261)
+ v T Kv − 2 2v T K 2λKv − 2λ(v T Kv)v i
4 λ 
1 b −2b b y= λ2i a2i , (262)
+ 2
− v T Kv − v T Kv . (253)
4 λ λ3 λ2 i

clearly we have To apply the Lasalle Theorem [65], it suffices to show that
over the trajectory of (v, b, λ) is constrained in a compact
λ1 ≥ y ≥ x ≥ 0. (263)
dt ≤ 0 and L(v, b, λ) > 0 except for
set, and over that set dL
The inequality that we need to prove is λ = 1, b = v T Kv = λ1 . Indeed, the compact set can be
chosen to be
λ1 1
λ1 +x ≤ +λ x+2 +λ21 (λ1 − x) − x(y 2 − x2 ) .
λ 8 {(v, b, λ) : v T v = 1, b ∈ [0, bc ], λ ∈ [λc1 , λc2 ]}. (277)
(264)
Indeed, the constraint bc comes from the fact that L → ∞ as
We note that b → ∞, hence the trajectory of b is bounded by a constant bc
1 which depends on the initial condition. The constraint λc1 , λc2
x+2 + λ21 (λ1 − x) − x(y 2 − x2 ) comes from similar considerations since L → ∞ whenever
8
λ → 0+ or λ → ∞.
1
≥x+2 + λ21 (λ1 − x) − x(y + x)(λ1 − x) (265) Furthermore, one can check that L is radially
8
unbounded, i.e. L(v, b, λ) → ∞ as (v, b, λ) → ∞.
1
≥ x + (λ1 − x) + 2λ21 − x(y + x) (266) Invoking [66, Corollary 5.2], we know that the
4 dynamics converge to a single point in the set
1 {(v, b, λ) : v T v = 1, λ = 1, b = λ1 , v T Kv = λ1 }
≥ x + (λ1 − x) + 2λ21 − λ1 (2λ1 ) (267)
4 since every point in this set is globally asymptotically stable.
1
≥ x + (λ1 − x) (268)
4 ACKNOWLEDGMENT
> 0, (269)
The authors would like to thank Farzan Farnia and Soheil
since we have assumed that λ1 > 0. Feizi for helpful discussions at the initial stage of this project.
Minimizing the right hand side with respect to λ shows that They would like to thank Luigi Ambrosio and Yihong Wu for
it suffices to show that discussions related to optimal transport, Jacob Steinhardt and
1 Ilias Diakonikolas for discussions related to robust statistics.
(λ1 + x)2 ≤ 4λ1 x + 2 + λ21 (λ1 − x)−x(y 2 −x2 ) , They are grateful to Chao Gao for stimulating discussions
8
(270) on the relationship between GAN and robust statistics at the
workshop of robust and high-dimensional statistics held at the
which is equivalent to Simons Institute at the University of California, Berkeley in
1 fall, 2018.
(λ1 − x)2 ≤ 8λ1 + λ21 (λ1 − x) − 4λ1 x(y + x)(y − x).
8
(271) R EFERENCES
[1] B. W. Silverman, Density Estimation for Statistics and Data Analysis.
Since λ1 ≥ y, it suffices to show that Evanston, IL, USA: Routledge, 2018.
1 [2] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik,
(λ1 − x)2 ≤ 8λ1 + λ21 (λ1 − x) − 4λ1 x(y +x)(λ1 −x). “Dimensionality reduction: A comparative review,” J. Mach. Learn. Res.,
8 vol. 10, nos. 1–41, pp. 66–71, Oct. 2009.
(272) [3] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for
high fidelity natural image synthesis,” 2018, arXiv:1809.11096. [Online].
Available: http://arxiv.org/abs/1809.11096
Clearly, this inequality holds when λ1 = x. If λ1 > x, [4] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in
it suffices to show that deep learning based natural language processing,” IEEE Comput. Intell.
Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
λ1 − x ≤ λ1 (1 + 8λ21 ) − 4λ1 x(y + x). (273) [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
It is true since Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[6] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” 2017,
λ1 (1 + 8λ21 ) − 4λ1 x(y + x) ≥ λ1 + 8λ31 − 4λ1 λ1 (λ1 + λ1 ) arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875
[7] I. Durugkar, I. Gemp, and S. Mahadevan, “Generative multi-
(274) adversarial networks,” 2016, arXiv:1611.01673. [Online]. Available:
= λ1 (275) http://arxiv.org/abs/1611.01673
[8] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked
≥ λ1 − x. (276) generative adversarial networks,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), vol. 2, Jul. 2017, p. 3.
The above chain of inequalities holds equality if and only [9] D. Jiwoong Im, H. Ma, C. Dongjoo Kim, and G. Taylor, “Generative
if λ1 = x. adversarial parallelization,” 2016, arXiv:1612.04021. [Online]. Avail-
able: http://arxiv.org/abs/1612.04021
Hence, we have showed that the derivative of L is strictly [10] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with
negative in the regime v T v = 1, λ > 0, b ≥ 0 except auxiliary classifier GANs,” 2016, arXiv:1610.09585. [Online]. Available:
when λ = 1, v T Kv = λ1 , b ∈ {0, λ1 }. The largest invariant http://arxiv.org/abs/1610.09585
[11] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
set of (v, b, λ) satisfying the condition above is {(v, b, λ) : learning with deep convolutional generative adversarial networks,” 2015,
v T v = 1, λ = 1, b = λ1 , v T Kv = λ1 }. arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434

[12] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and [38] M. Chen, C. Gao, and Z. Ren, “Robust covariance and scatter matrix
X. Chen, “Improved techniques for training GANs,” in Proc. Adv. Neural estimation under Huber’s contamination model,” Ann. Statist., vol. 46,
Inf. Process. Syst., 2016, pp. 2234–2242. no. 5, pp. 1932–1960, Oct. 2018.
[13] I. O. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and [39] N. Fournier and A. Guillin, “On the rate of convergence in Wasserstein
B. Schölkopf, “AdaGAN: Boosting generative models,” in Proc. Adv. distance of the empirical measure,” Probab. Theory Rel. Fields, vol. 162,
Neural Inf. Process. Syst., 2017, pp. 5424–5433. nos. 3–4, pp. 707–738, Aug. 2015.
[14] T. Xu et al., “AttnGAN: Fine-grained text to image generation with [40] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter,
attentional generative adversarial networks,” 2017, arXiv:1711.10485. “GANs trained by a two time-scale update rule converge to a local
[Online]. Available: http://arxiv.org/abs/1711.10485 Nash equilibrium,” in Proc. Adv. Neural Inf. Process. Syst., 2017,
[15] Y. Bai, T. Ma, and A. Risteski, “Approximability of discriminators pp. 6626–6637.
implies diversity in GANs,” 2018, arXiv:1806.10586. [Online]. Avail- [41] R. Vershynin, “How close is the sample covariance matrix to the actual
able: http://arxiv.org/abs/1806.10586 covariance matrix?” J. Theor. Probab., vol. 25, no. 3, pp. 655–686,
[16] S. Feizi, F. Farnia, T. Ginart, and D. Tse, “Understanding GANs: Sep. 2012.
The LQG setting,” 2017, arXiv:1710.10793. [Online]. Available: [42] R. Vershynin, High-Dimensional Probability: An Introduction With
http://arxiv.org/abs/1710.10793 Applications in Data Science, vol. 47. Cambridge, U.K.: Cambridge
[17] K. Li and J. Malik, “On the implicit assumptions of GANs,” 2018, Univ. Press, 2018.
arXiv:1811.12402. [Online]. Available: http://arxiv.org/abs/1811.12402 [43] L. Ambrosio and F. Glaudo, “Finer estimates on the 2-dimensional
matching problem,” 2018, arXiv:1810.07002. [Online]. Available:
[18] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee, “On the
http://arxiv.org/abs/1810.07002
convergence and robustness of training GANs with regularized
[44] P. L. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and
optimal transport,” 2018, arXiv:1802.08249. [Online]. Available:
error estimation,” Mach. Learn., vol. 48, nos. 1–3, pp. 85–113, 2002,
http://arxiv.org/abs/1802.08249
doi: 10.1023/A:1013999503812.
[19] T. Liang, “On how well generative adversarial networks learn densi- [45] T. Rippl, A. Munk, and A. Sturm, “Limit laws of the empirical Wasser-
ties: Nonparametric and parametric results,” 2018, arXiv:1811.03179.
stein distance: Gaussian distributions,” J. Multivariate Anal., vol. 151,
[Online]. Available: http://arxiv.org/abs/1811.03179 pp. 90–109, Oct. 2016.
[20] T. Che, Y. Li, A. Paul Jacob, Y. Bengio, and W. Li, “Mode regularized [46] A. Ghosh, V. Kulharia, V. Namboodiri, P. H. S. Torr, and P. K. Dokania,
generative adversarial networks,” 2016, arXiv:1612.02136. [Online]. “Multi-agent diverse generative adversarial networks,” 2017,
Available: http://arxiv.org/abs/1612.02136 arXiv:1704.02906. [Online]. Available: http://arxiv.org/abs/1704.02906
[21] B. Neyshabur, S. Bhojanapalli, and A. Chakrabarti, “Stabilizing GAN [47] H. Zhang, S. Xu, J. Jiao, P. Xie, R. Salakhutdinov, and E. P. Xing,
training with multiple random projections,” 2017, arXiv:1705.07831. “Stackelberg GAN: Towards provable minimax equilibrium via multi-
[Online]. Available: http://arxiv.org/abs/1705.07831 generator architectures,” 2018, arXiv:1811.08010. [Online]. Available:
[22] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “General- https://arxiv.org/abs/1811.08010
ization and equilibrium in generative adversarial nets (GANs),” [48] K. J. Arrow, L. Hurwicz, and H. Uzawa, “Studies in linear and
2017, arXiv:1703.00573. [Online]. Available: http://arxiv.org/abs/1703. non-linear programming,” Can. Math. Bull., vol. 3, no. 3. Cam-
00573 bridge, U.K.: Cambridge Univ. Press, 1960, pp. 196–198, doi:
[23] S. Arora and Y. Zhang, “Do GANs actually learn the distribution? 10.1017/S0008439500025522.
An empirical study,” 2017, arXiv:1706.08224. [Online]. Available: [49] J. Li, A. Madry, J. Peebles, and L. Schmidt, “On the limitations of
https://arxiv.org/abs/1706.08224 first-order approximation in GAN dynamics,” 2017, arXiv:1706.09884.
[24] V. Dumoulin et al., “Adversarially learned inference,” 2016, [Online]. Available: http://arxiv.org/abs/1706.09884
arXiv:1606.00704. [Online]. Available: http://arxiv.org/abs/1606.00704 [50] E. Mazumdar, L. J. Ratliff, and S. Shankar Sastry, “On gradient-
[25] C. Gao, J. Liu, Y. Yao, and W. Zhu, “Robust estimation and gener- based learning in continuous games,” 2018, arXiv:1804.05464. [Online].
ative adversarial nets,” 2018, arXiv:1810.02030. [Online]. Available: Available: http://arxiv.org/abs/1804.05464
http://arxiv.org/abs/1810.02030 [51] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee, “On the convergence
[26] F. Farnia and D. Tse, “A convex duality framework for GANs,” in Proc. and robustness of training GANs with regularized optimal transport,” in
Adv. Neural Inf. Process. Syst., 2018, pp. 5254–5263. Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7091–7101.
[27] A. Bora, E. Price, and A. G. Dimakis, “AmbientGAN: Generative models [52] S. Sastry, Nonlinear Systems: Analysis, Stability, and Control, vol. 10.
from lossy measurements,” in Proc. Int. Conf. Learn. Represent. (ICLR), New York, NY, USA: Springer-Verlag, 2013.
2018, pp. 1–22. [53] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence
[28] A. S. Dalalyan and A. B. Tsybakov, “Aggregation by exponential of relative frequencies of events to their probabilities,” in Measures of
weighting and sharp oracle inequalities,” in Proc. Int. Conf. Comput. Complexity. Cham, Switzerland: Springer, 2015, pp. 11–30.
Learn. Theory. Berlin, Germany: Springer, 2007, pp. 97–111. [54] N. Gozlan and C. Léonard, “Transport inequalities. A survey,” 2010,
[29] T. Rainforth et al., “Tighter variational bounds are not neces- arXiv:1003.3852. [Online]. Available: http://arxiv.org/abs/1003.3852
sarily better,” 2018, arXiv:1802.04537. [Online]. Available: http:// [55] C. Villani, Optimal Transport, Old and New, vol. 338. Berlin, Germany:
arxiv.org/abs/1802.04537 Springer-Verlag, 2008.
[30] S. Liu, O. Bousquet, and K. Chaudhuri, “Approximation and [56] A. R. Wade, “Explicit laws of large numbers for random nearest-
convergence properties of generative adversarial learning,” 2017, neighbour-type graphs,” Adv. Appl. Probab., vol. 39, no. 2, pp. 326–342,
arXiv:1705.08991. [Online]. Available: http://arxiv.org/abs/1705.08991 Jun. 2007.
[57] M. D. Penrose and J. Yukich, “Laws of large numbers and nearest
[31] E. Kreyszig, Introductory Functional Analysis With Applications, vol. 1.
neighbor distances,” in Advances in Directional and Linear Statistics.
New York, NY, USA: Wiley, 1978.
Berlin, Germany: Physica-Verlag HD, 2011, pp. 189–199.
[32] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos, “MMD GAN: [58] E. Liitiäinen, “Asymptotic moments of near neighbor distances for the
Towards deeper understanding of moment matching network,” in Proc. Gaussian distribution,” Electron. J. Probab., vol. 16, pp. 2545–2573,
Adv. Neural Inf. Process. Syst., 2017, pp. 2203–2213. Dec. 2011.
[33] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estima- [59] S. Bobkov and M. Ledoux, One-Dimensional Empirical Measures,
tion. New York, NY, USA: Springer, 2012. Order Statistics, and Kantorovich Transport Distances, vol. 261,
[34] W. Hoeffding, “Probability inequalities for sums of bounded ran- no. 1259. Providence, RI, USA: AMS, 2019.
dom variables,” J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, [60] Fedja. Is the Matrix Square Root Uniformly Continuous?—Mathematics
1963. Stack Exchange. Accessed: Sep. 25, 2016. [Online]. Available:
[35] M. Ledoux, “Concentration of measure and logarithmic Sobolev inequal- https://math.stackexchange.com/q/1940274
ities,” in Séminaire de Probabilités XXXIII. Berlin, Germany: Springer, [61] I. Olkin and F. Pukelsheim, “The distance between two random vectors
1999, pp. 120–216. with given dispersion matrices,” Linear Algebra its Appl., vol. 48,
[36] Y. G. Yatracos, “Rates of convergence of minimum distance estimators pp. 257–263, Dec. 1982.
and Kolmogorov’s entropy,” Ann. Statist., vol. 13, no. 2, pp. 768–774, [62] N. J. Higham, Functions of Matrices: Theory and Computation, vol. 104.
Jun. 1985. Philadelphia, PA, USA: SIAM, 2008.
[37] O. Bousquet, D. Kane, and S. Moran, “The optimal approximation factor [63] T. W. Anderson and I. Olkin, “An extremal problem for positive definite
in density estimation,” 2019, arXiv:1902.05876. [Online]. Available: matrices,” Stanford Univ., Stanford, CA, USA, Tech. Rep. OLKNSF130,
http://arxiv.org/abs/1902.05876 Jul. 1978.

[64] H. D. Tagare, “Notes on optimization on Stiefel manifolds,” Yale Univ., David Tse (Fellow, IEEE) received the B.A.Sc. degree in systems design
New Haven, CT, USA, 2011. engineering from the University of Waterloo in 1989, and the M.S. and
[65] J. LaSalle, “Some extensions of Liapunov’s second method,” IRE Trans. Ph.D. degrees in electrical engineering from the Massachusetts Institute of
Circuit Theory, vol. 7, no. 4, pp. 520–527, 1960. Technology in 1991 and 1994, respectively.
[66] S. P. Bhat and D. S. Bernstein, “Nontangency-based Lyapunov tests From 1994 to 1995, he was a Post-Doctoral Member of Technical Staff with
for convergence and stability in systems having a continuum of AT&T Bell Laboratories. From 1995 to 2014, he was on the Faculty Member
equilibria,” SIAM J. Control Optim., vol. 42, no. 5, pp. 1745–1775, of the Department of Electrical Engineering and Computer Sciences, Univer-
Jan. 2003. sity of California at Berkeley (U.C. Berkeley). He is currently a Professor
with Stanford University. He is a coauthor, with Pramod Viswanath, of the
text Fundamentals of Wireless Communication, which has been used in over
60 institutions around the world. He is the inventor of the proportional-fair
scheduling algorithm used in all third- and fourth-generation cellular systems.
His research interests are in information theory and its applications in various
fields, including wireless communications, energy, computational biology, and
blockchain.
Prof. Tse was an Elected Member of the U.S. National Academy of Engi-
neering in 2018. He was the recipient of the IEEE Claude E. Shannon Award
in 2017 and the IEEE Richard W. Hamming Medal in 2019. He received
Banghua Zhu (Student Member, IEEE) received the B.Eng. degree in the 1967 NSERC Graduate Fellowship from the Government of Canada
electronic engineering from Tsinghua University, Beijing, China, in 2018. in 1989, the NSF CAREER Award in 1998, the Best Paper Awards at the
He is currently pursuing the Ph.D. degree with the Department of Electrical Infocom Conference in 1998 and 2001, the Erlang Prize in 2000 from the
Engineering and Computer Sciences, University of California at Berkeley. His INFORMS Applied Probability Society, the IEEE Communications Society
research interests include machine learning, statistics, and information theory. and Information Theory Society Joint Paper Awards in 2000 and 2013,
the Information Theory Society Paper Award in 2003, the Gilbreth Lectureship
from the National Academy of Engineering in 2012, the Signal Processing
Society Best Paper Award in 2012, the EURASIP Best Paper Award in 2012,
and the IEEE Communications Society Stephen O. Rice Prize in 2013.
For his contributions to education, he received the Outstanding Teaching
Award from the Department of Electrical Engineering and Computer Sciences,
U.C. Berkeley, in 2008 and the Frederick Emmons Terman Award from the
American Society for Engineering Education in 2009. He was an Associate
Editor of the IEEE T RANSACTIONS ON I NFORMATION T HEORY from 2001 to
2003, the Technical Program Co-Chair in 2004, and the General Co-Chair
Jiantao Jiao (Member, IEEE) received the B.Eng. degree (Hons.) in electronic of the International Symposium on Information Theory in 2015. He served
engineering from Tsinghua University, Beijing, China, in 2012, and the M.S. on the Board of Governors of the IEEE Information Theory Society from
and Ph.D. degrees from Stanford University in 2014 and 2018, respectively. 2003 to 2008 and from 2010 to 2013. He was the plenary speaker for many
He is currently an Assistant Professor with the Department of Electrical international conferences and workshops, including the IEEE International
Engineering and Computer Sciences, University of California at Berkeley. Symposium on Information Theory in 2009, the ACM International Confer-
His research interests are in statistical machine learning, mathematical data ence on Mobile Computing and Networking (MobiCom) in 2007, and the
science, optimization, applied probability, information theory, and their appli- IEEE International Conference on Acoustics, Speech and Signal Processing
cations in science and engineering. in 2006.
