Deconstructing Generative Adversarial Networks
2) learning Gaussian location family under total variation distance, where we provide a new proof for the near-optimality of the Tukey median viewed as a GAN;
3) learning low-dimensional Gaussian approximations of a high-dimensional arbitrary distribution under Wasserstein-2 distance.
We demonstrate a fundamental trade-off between the approximation error and the statistical error in GANs, and demonstrate how to apply our principle in practice with only empirical samples to predict how many samples would be sufficient for GANs in order not to suffer from the discriminator winning problem.
• Optimization: we demonstrate that alternating gradient descent is provably not locally asymptotically stable in optimizing the GAN formulation of PCA. We find that the non-zero minimax duality gap might be one of the causes, and propose a new GAN architecture whose duality gap is zero and whose game value equals the previous minimax value (not the maximin value). We prove the new GAN architecture is globally asymptotically stable in solving PCA under alternating gradient descent.

Organization of paper. The rest of the paper is organized as follows. In Section II, we discuss the formulation of GANs from a perturbation view, and propose Cascade GANs for capturing the sequential perturbation model. In Section III, we study the generalization property of GANs. In Section IV, we study the optimization property of GANs. All proofs are deferred to the appendices.

Notations. Throughout this paper, we use a capital letter X for a random variable and a blackboard letter P for a probability distribution. Generally, the distribution of a random variable X is denoted as PX, and the corresponding empirical distribution formed by n samples from PX is P̂nX. In all sections except Section IV, we use bold lower case letters for vectors and bold upper case letters for matrices; in Section IV, we use lower case letters for vectors and upper case letters for matrices. We use A† to denote the pseudo-inverse of a matrix A. The random vector Z follows a zero-mean normal distribution with identity covariance. We use r(A) to denote the rank of A and R(A) to denote the range of A. We say A ≥ 0 if A is positive semidefinite and A > 0 if A is positive definite. We use TV(P, Q) = sup_A P(A) − Q(A) to denote the total variation distance between P and Q. For non-negative sequences aγ, bγ, we write aγ ≲_α bγ if there exists a constant C depending only on α such that sup_γ aγ/bγ ≤ C, and aγ ≳_α bγ is equivalent to bγ ≲_α aγ. When the constant C is universal we omit the subscript on ≲ and ≳. The notation aγ ≍ bγ means both aγ ≲ bγ and aγ ≳ bγ. The notation aγ ≫ bγ means lim inf_γ aγ/bγ = ∞, and aγ ≪ bγ is equivalent to bγ ≫ aγ.

II. FORMULATION OF GANS

GANs aim at selecting a generator g(·) from some generator family G. The generator function takes Gaussian random noise Z ∼ N(0, I_r) as input and outputs a random variable which may lie in the space of images. Given infinite samples from the real distribution PX, the goal of GANs can be understood intuitively as finding a generator g whose distribution Pg(Z) is the closest to the real distribution PX. At the minimum we would need a distance¹ L(Pg(Z), PX) for every g ∈ G. Here L(·, ·) : M(R^d) × M(R^d) → R≥0 is a function that characterizes the distance between distributions. Naturally, with infinite samples and unbounded computation, we would ideally like to solve for

  ĝ = arg min_{g ∈ G} L(Pg(Z), PX).   (1)

Note that the minimum in (1) may not be achievable in general. However, we assume throughout this paper that it is attained, since the other case can be dealt with by a standard limiting argument. The formulation problem looks at the design of L and G. As a concrete example, the vanilla GAN [5] defines the distance function L as the Jensen-Shannon divergence, and uses neural networks as the generator family.

A. The Population Target of GANs: A Perturbation View

Intuitively, at a minimum the design of L and G needs to satisfy that

  inf_{g ∈ G} L(Pg(Z), PX) is "small".   (2)

Indeed, were it not true, then one should certainly design another generator family G that can better approximate the real distribution PX. In other words, we implicitly assume:

  The real distribution PX is a "slightly" perturbed generator g ∈ G under L.

The perturbation view allows us to unify the field of robust statistics and GANs. In robust statistics, it is usually assumed that the observed distribution (in GANs, the true distribution PX) is a slightly perturbed version of a generator member under total variation distance. Hence, we can connect GANs to the robust statistics framework². This viewpoint proves to be beneficial: indeed, we later utilize a general analysis technique developed for GANs to provide a new proof of the optimal statistical properties of the Tukey median in robust estimation. Our perturbation view of GANs is inspired by [25], which constructed GAN architectures that provably achieve the information-theoretic limits in robust estimation of means and covariances. However, it also immediately raises a conflict: it was argued in [6] that the Jensen–Shannon divergence (and similarly the total variation distance) is not supposed to be used as the distance function for GANs. How can we unify these two viewpoints?

B. Cascade GANs

A careful inspection of the arguments in [6] shows that the total variation is inappropriate when the generator family

¹ This distance needs to be broadly understood as a function that may not satisfy the axioms of a distance function.
² For GANs, the perturbation metric and the metric used to recover the original distribution are the same, but for general robust statistics problems they might be different. The latter problem is beyond the scope of this paper.
Definition 1 (Admissible Distance): Suppose L is a pseudometric³. We say that the function L′ is admissible with parameter (c1, c2, L̃′) with respect to (L, G) if it satisfies the following two properties:
1) Approximate resolution within generators: for any g1, g2 ∈ G,

  c1 (L′(P_{g1(Z)}, P_{g2(Z)}) − L′(P_{g2(Z)}, P_{g2(Z)})) ≥ L(P_{g1(Z)}, P_{g2(Z)}) − c2;   (8)

2) Robust to perturbation: for any g ∈ G and any P1, P2, there exists a function L̃′ such that

  |L′(P_{g(Z)}, P1) − L′(P_{g(Z)}, P2)| ≤ L̃′(P1, P2) ≤ L(P1, P2).   (9)

In other words, the approximate resolution requirement ensures that the new function L′ can distinguish between generators at least as well as the desired distance L, and the robustness property shows that observing a slightly different real distribution may not change the value of L′ by too much. Here L′ may not be a pseudometric; thus L′(P, P) may not be 0 in general.

Of course, L′ = L is admissible with parameter (1, 0, L), but the key point of Definition 1 is that there exists a large variety of functions L′ that are admissible. In general, we would like to choose L′ such that it has a better convergence rate than L, which will be illustrated in later sections.

The next theorem shows that projecting under an admissible function achieves near optimality.

Theorem 1: Suppose L is a pseudometric, and L′ is admissible with parameter (c1, c2, L̃′) with respect to (L, G). Define

  g∗ = arg min_{g ∈ G} L′(P_{g(Z)}, PX).   (10)

Then,

  L(P_{g∗(Z)}, PX) ≤ (1 + 2c1) OPT + c2.   (11)

Furthermore, suppose we observe n i.i.d. samples from PX and denote the empirical distribution formed by the samples as P̂nX. Define

in practice. It is claimed in [29] that training under tighter evidence lower bounds can be detrimental to the process of learning an inference network by reducing the signal-to-noise ratio of the gradient estimator for variational autoencoders. In [26] and [30], using a restricted discriminator family is shown to be related to a moment-matching effect. The recent work of [15] is most related to ours, where a notion of "conjoining discriminator" is developed; we emphasize that the results in [15] only apply to cases where OPT = 0, and the conjoining discriminator family is derived for a few specific generators with L being the W1 distance.

The proof is deferred to Appendix B.

C. Arbitrary G and Pseudonorm L

Suppose L(P1, P2) can be written as ‖P1 − P2‖_L, where ‖·‖_L is a pseudonorm⁴ on a vector space M which is a subset of the signed measures on R^d. We have the following representation of ‖·‖_L.

Lemma 1: Assume appropriate topology on M such that any continuous linear functional of μ ∈ M can be written as ∫ f dμ for some measurable f, and for any μ ∈ M, the linear functional αμ → α‖μ‖_L is continuous. For any measurable function f on R^d, define

  ‖f‖_{L∗} = sup_{‖μ‖_L = 1, μ ∈ M} ∫ f dμ.   (14)

Then, we can represent ‖·‖_L as an integral probability metric:

  ‖μ‖_L = sup_{f : ‖f‖_{L∗} = 1} ∫ f dμ.   (15)

Lemma 1 is a consequence of the Hahn-Banach theorem [31, Theorem 4.2]. The proof is deferred to Appendix A. Note that we already know any integral probability metric is a pseudonorm.

In fact, the pseudonorm includes a wide range of discrepancy measures used in various GANs. As a concrete example, MMD-GAN [32] minimizes the kernel maximum mean discrepancy distance, which is a pseudonorm and thus can be further studied under our framework. Denoting
discriminating between members in the ε-covering of G. In some sense, F_ε is the minimal set of useful discriminators for the generator family G. Naturally, it leads to the definition of L′ as

  L′(PX, PY) = sup_{f ∈ F_ε} ∫ f d(PX − PY).   (19)

We have the following result.

Theorem 2: Assuming the true distribution is PX, let the GAN estimator be

  g̃ = arg min_{g(·) ∈ G} L′(P_{g(Z)}, P̂nX),   (20)

where L′ is defined in (19). Then

  L(P_{g̃(Z)}, PX) ≤ 3 inf_{g(·) ∈ G} L(P_{g(Z)}, PX) + 2L′(PX, P̂nX) + 4ε.

3) Whenever the discriminator winning phenomenon occurs, which means that in solving the projection estimator the discriminator can always distinguish between the real and forged data, one can reduce the number of parameters of the discriminator without hurting the learning performance, as long as L′ is admissible;
4) As long as L(·, ·) is strong enough that proximity under L ensures mode collapse does not happen, projecting under L′ prohibits mode collapse.

The idea of set design mainly comes from Yatracos' method [36] (also referred to as Scheffé sets by [33]). In fact, the prefactor 3 in front of the oracle term inf_{g(·)∈G} L(P_{g(Z)}, PX) is likely to be tight. It is shown in [37] that in Yatracos' method the constant 3 cannot be improved if we require proper learning.
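The restricted-discriminator construction above is an integral probability metric, i.e., a pseudonorm in the sense of Section II-C. One concrete, sample-computable instance mentioned there is the kernel maximum mean discrepancy of MMD-GAN [32]. The following is a minimal sketch of the standard biased MMD estimate between two samples; the RBF kernel and the bandwidth are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # Pairwise k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    # Biased estimate of MMD^2(P, Q) = E k(X,X') + E k(Y,Y') - 2 E k(X,Y),
    # an integral probability metric over the RKHS unit ball (a pseudonorm).
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

# toy usage: real samples vs. a mismatched "generator"
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))             # samples from P_X
y = 0.5 * rng.standard_normal((500, 2)) + 1.0 # samples from a candidate P_g(Z)
print(mmd2_biased(x, y))
```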
When gi, gj take all possible generators in G = {N(μ, I) : μ ∈ R^d}, A is the set of all halfspaces {{x ∈ R^d : ⟨v, x⟩ + b > 0} : ‖v‖₂ = 1, b ∈ R^d}. The weakened distance is thus defined as

  TV′(Pg, PX) = sup_{‖v‖₂ = 1, b ∈ R^d} PX(vᵀ(X − b) ≥ 0) − Pg(vᵀ(g(Z) − b) ≥ 0).   (28)

The optimality of projection under TV′ is guaranteed in the following theorem.

Theorem 3: Let the GAN estimator be

  g̃ = arg min_{g ∈ G} TV′(Pg, P̂nX),   (29)

where TV′ is defined in (28). Then, TV′ is an admissible distance of (TV, G) with parameter (1, 0, TV′), and with probability at least 1 − δ, there exists some universal constant C such that

  TV(P_{g̃}, PX) ≤ 3 · OPT + C · √((d+1)/n) + 2√(log(1/δ)/(2n)).   (30)

The proof is deferred to Appendix D.

2) Tukey Median: The Tukey median approach can be equivalently formulated as projecting under another admissible distance, defined as follows: for two distributions P, Q, we define a different admissible distance as

  LTukey(P, Q) = sup_{v ∈ R^d} Q((X − E_P[X])ᵀ v > 0) − 1/2.   (31)

It has been shown in previous literature that the Tukey median is able to achieve the optimal convergence rate under Huber's additive contamination model [38]. Based on the theory of admissible distances, we provide a new simple proof of the optimality of the Tukey median.

Theorem 4: Let the GAN estimator (Tukey median) be defined as

  g̃ = arg min_{g ∈ G} LTukey(Pg, P̂nX),   (32)

where LTukey is defined in (31). Then, LTukey is an admissible distance of (TV, G) with parameter (2, 0, TV′), where TV′(P1, P2) is defined in equation (28). With probability at least 1 − δ, there exists some universal constant C such that

  TV(P_{g̃}, PX) ≤ 5 · OPT + C · √((d+1)/n) + 4√(log(1/δ)/(2n)).   (33)

The proof is deferred to Appendix E.

E. W2-GAN Model

It was first shown in [16] that when the generator family is rank-constrained linear maps, L is the Wasserstein-2 distance, and the real distribution is Gaussian, then the population GAN solution (4) produces the PCA of the covariance matrix of the true distribution. Inspired by this phenomenon, the Quadratic GAN is proposed, and both theoretical and empirical results demonstrate its fast generalization. In this section, we relax the condition in (4) to consider an arbitrary real distribution, but stick to the rank-constrained linear map generator and the W2 distance. This model generalizes PCA and provides an intellectually stimulating playground for studying the generalization and optimization properties of GANs. We demonstrate a few striking behaviors of the W2-GAN model in this section.

1) Exponential Sample Complexity for the Naive Algorithm: The W2 distance is defined as

  W2(PX, PY) = inf_{π : πX = PX, πY = PY} (E‖X − Y‖₂²)^{1/2},   (34)

which admits a dual form:

  W2²(PX, PY) = E‖X‖₂² + E‖Y‖₂² + 2 sup_{ψ(·) : convex} { −E[ψ(X)] − E[ψ*(Y)] },   (35)

where ψ*(y) = sup_v vᵀy − ψ(v) is the convex conjugate of ψ.

Without loss of generality, we can assume the means of PX, PY are zero. Indeed, for any two distributions PX, PY, assume the corresponding means are μX and μY, and denote X̄ = X − μX, Ȳ = Y − μY. Then, we have

  W2²(PX, PY) = inf_π E‖X‖₂² + E‖Y‖₂² − 2E[XᵀY]
    = inf_π E[Tr(XXᵀ)] + E[Tr(YYᵀ)] − 2E[XᵀY]
    = inf_π E[Tr(X̄X̄ᵀ)] + μXᵀμX + E[Tr(ȲȲᵀ)] + μYᵀμY − 2E[X̄ᵀȲ] − 2μXᵀμY
    = ‖μX − μY‖₂² + W2²(P_{X̄}, P_{Ȳ}).   (36)

Thus, the W2² between distributions with non-zero means decomposes into the squared difference between the means plus the W2² between the centered distributions. In the rest of the paper, we mainly consider the W2 distance between zero-mean distributions, except for Theorems 6 and 7.

To demonstrate the intricacies of GAN design, we start with the vanilla GAN architecture. We assume the generator is linear, i.e. G = {g(Z) = AZ : A ∈ R^{d×r}, Z ∈ R^r}, which means that g(Z) ∼ N(0, K) for some positive semidefinite K with rank at most r. The optimal generator in this setting is trivially g(Z) = K^{1/2}Z if PX = N(0, K) and r = d. However, as the next theorem shows, it takes the GAN architecture more than an exponential number of samples to approach the optimal generator even when K is the identity. It is a refinement of [16].

Theorem 5: Let the GAN estimator be

  g̃ = arg min_{g(·) ∈ G} W2(P_{g(Z)}, P̂nX).   (37)

Then we have

  W2(P_{g̃(Z)}, PX) ≤ inf_{g ∈ G} W2(P_{g(Z)}, PX) + 2W2(PX, P̂nX).   (38)

Furthermore, assume E_{PX}[X] = 0, and PX satisfies the following assumptions (which N(0, I_d) satisfies when d ≥ 3):
• X ∼ μ satisfies a logarithmic Sobolev inequality, i.e., for some constant B > 0 and all smooth f on R^d,

  Ent(f²) ≤ 2B E(|∇f|²),   (39)

where ∇f is the usual gradient of f and Ent(f) = E_μ(f log f) − E_μ(f) log E_μ(f).
• Upper bound on the expected W2 convergence rate⁵: for some constant C1 depending on PX,

  E[W2²(PX, P̂nX)] ≤ C1(PX) · n^{−2/d}.   (40)

Then, there exists some constant C depending on PX such that with probability at least 1 − δ,

  W2(P_{g̃(Z)}, PX) ≤ inf_{g ∈ G} W2(P_{g(Z)}, PX) + C(PX) · n^{−1/d} + 2 (2B log(1/δ) / n)^{1/2}.   (41)

For the lower bound, when PX = N(0, I_d), we have with probability at least 0.99 that there exist some constants C2, C3(d) such that when n > C3(d) and d ≥ 5,

  W2(P_{g̃(Z)}, N(0, I_d)) ≥ C2 d^{1/2} n^{−3/d}.   (42)

Remark 4: Equation (41) depends on two assumptions: the logarithmic Sobolev inequality (39) and the expected convergence rate (40). The logarithmic Sobolev inequality holds for a wide range of distributions including the Gaussian distribution [35]. The upper bound on the expected convergence rate holds for any distribution PX with finite 2d/(d − 2) moment and d ≥ 5 [39]. It also holds for N(0, I_d) when d ≥ 3.

The proof is deferred to Appendix F. The proof of the theorem also provides a lower bound on the convergence rate of the Wasserstein-2 distance between a Gaussian and its empirical distribution, which appears to be new.

Theorem 5 shows that even when there exists a generator that can perfectly reconstruct the real distribution that produces the samples, the sample complexity grows at least exponentially with the dimension of the data distribution. The result in Theorem 5 can also be understood as an instantiation of the discriminator winning phenomenon. Indeed, unless we are given a number of samples at least exponential in the dimension d, the discriminator in the W2 dual representation can always easily tell the difference between the empirical distribution P̂nX and the solved optimal generator distribution g̃. Intuitively, to alleviate this phenomenon, one needs to design discriminators that properly weaken their discriminating power.

2) Polynomial Sample Complexity GAN Design: It was proposed in [16] that when PX is Gaussian, one shall project under the Quadratic GAN distance defined as

  W̃2(P1, P2) = W2(N(µ1, Σ1), N(µ2, Σ2)),   (43)

where µ1, µ2 are the means of P1, P2, and Σ1, Σ2 are the covariances of P1, P2. This distance is also used in the literature as the Fréchet Inception Distance [40]. The following result analyzes the performance of projecting under W̃2.

Theorem 6 (Guarantee for Quadratic GAN): Assume the generator family to search is the rank-r affine generators:

  Gr = {g(Z) = AZ + b : A ∈ R^{d×r}, Z ∈ R^r, b ∈ R^d}.   (44)

Denote the family of full rank Gaussian distribution generators as G = {g(Z) = AZ + b : A ∈ R^{d×d}, Z ∈ R^d, b ∈ R^d}. Let the GAN estimator be

  g̃ = arg min_{g(·) ∈ Gr} W̃2(P_{g(Z)}, P̂nX).   (45)

Then, W̃2 is an admissible distance for (W2, G) with parameter (1, 0, W̃2), and

  W2(P_{g̃(Z)}, PX) ≤ ‖ΣX^{1/2} − Σr^{1/2}‖_F + 2 inf_{g(·) ∈ G} W2(P_{g(Z)}, PX) + 2W̃2(P̂nX, PX),   (46)

where μX, ΣX are the mean and covariance matrix of X, respectively, and Σr is the best rank-r approximation of ΣX under the Frobenius norm (the r-PCA solution).

Furthermore, if the minimum eigenvalue of the covariance matrix of X is lower bounded by a constant B⁶, then

  W2(P_{g̃(Z)}, PX) ≤ ‖ΣX^{1/2} − Σr^{1/2}‖_F + 2 inf_{g(·) ∈ G} W2(P_{g(Z)}, PX) + 2√(d/B) ‖ΣX − Σ̂‖ + 2‖µX − µ̂‖,   (47)

where µ̂, Σ̂ are the empirical mean and covariance matrix of P̂nX, respectively.

Remark 5: It follows from [41] that if PX is sub-Gaussian, i.e., there exists some constant L such that

  PX(|⟨X − E[X], x⟩| > t) ≤ 2e^{−t²/L²}   (48)

for all t > 0 and ‖x‖₂ = 1, then with probability at least 1 − δ we have

  ‖ΣX − Σ̂‖ ≲_L log(2/δ) √(d/n)   (49)

and

  ‖µX − µ̂‖ ≲_{δ,L} √(d/n).   (50)

More generally, it follows from [42, Theorem 5.6.1] that if E[X] = 0 and, for some K > 0, ‖X‖ ≤ K almost surely, then for every n ≥ 1,

  E‖ΣX − Σ̂‖ ≲ √(K² ‖Σ‖ log d / n) + (K² + ‖Σ‖) log d / n.   (51)

The proof of the theorem is deferred to Appendix G. The results in Theorem 6 can be interpreted in the following way.
1) ‖ΣX^{1/2} − Σr^{1/2}‖_F: the PCA error introduced by the rank mismatch between the generator family and the true distribution;

⁵ When the dimension of the support of PX is less than the ambient dimension d, the convergence rate depends on the intrinsic dimension, not the ambient dimension.
⁶ If the minimum eigenvalue is not lower bounded by a positive constant, we can show the stochastic error is ≲ d^{1/2} ‖ΣX − Σ̂‖^{1/2} + ‖µX − µ̂‖.
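Since the quadratic GAN distance (43) depends only on means and covariances, both W̃2 and the rank-r projection of Theorem 6 are directly computable. The sketch below is a minimal illustration, not the authors' code: it evaluates W2 between two Gaussians via the standard closed form ‖μ1 − μ2‖² + Tr(Σ1 + Σ2 − 2(Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2}), and it builds the rank-r generator from the top-r eigenpairs of the empirical covariance (using the empirical mean for b is an assumed, standard choice).

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(mu1, cov1, mu2, cov2):
    # Closed-form W2 between N(mu1, cov1) and N(mu2, cov2).
    s1 = np.real(sqrtm(cov1))
    cross = np.real(sqrtm(s1 @ cov2 @ s1))
    val = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    return np.sqrt(max(val, 0.0))

def quadratic_gan_rank_r(samples, r):
    # Rank-r affine generator fitted under (45): g(Z) = A Z + b with
    # A A^T the rank-r truncation of the empirical covariance (r-PCA).
    b = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    idx = np.argsort(eigval)[::-1][:r]
    A = eigvec[:, idx] * np.sqrt(np.clip(eigval[idx], 0, None))
    return A, b

# toy usage: W~2 between the fitted rank-2 generator and the full Gaussian fit
rng = np.random.default_rng(1)
x = rng.multivariate_normal(np.zeros(5), np.diag([5, 4, 3, 2, 1]), size=2000)
A, b = quadratic_gan_rank_r(x, r=2)
print(w2_gaussian(b, A @ A.T, x.mean(0), np.cov(x, rowvar=False)))
```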
The proof is relegated to Appendix H. As an illustrating example in Figure 4, we specify the discrepancy measure L to be W2, G = {g(Z) = AZ + b : A ∈ R^{d×d}, b ∈ R^d, Z ∈ R^d}, and plot E[W2(P_{g̃(Z)}, PX)] as a function of OPT = inf_{g ∈ G} W2(PX, P_{g(Z)}), for the naive projection and the proposed algorithm. Furthermore, we show the minimal stochastic error for keeping the slope 1, which means that in order to make the stochastic error term in equation (5) smaller, it is necessary to sacrifice the constant C. Combining Theorems 6 and 7, we know that for the non-Gaussianness term, the prefactor 2 is achievable by the Quadratic GAN, while the constant 1 is not achievable without

when the overall perturbation cannot be represented as perturbing under only L1 or L2, but rather as a cascaded perturbation under L2 followed by L1. In this section, we consider the generalization problem (5) for this new distance.

Theorem 9: Assume L1′ is an admissible distance for (L1, G′) with parameter (c1, c2, L̃1), L2′ is an admissible distance for (L2, G′) with parameter (d1, d2, L̃2) for generator family G′, and G ⊂ G′. We further assume that L1′(Pg, Pg) = L2′(Pg, Pg) = 0 for all g ∈ G′. Define the new distance as

  L′(P, Q) = min_{g′ ∈ G′} L1′(P_{g′(Z)}, Q) + λ L2′(P, P_{g′(Z)}).   (58)
we have

  L(P_{ĝ(Z)}, PX) ≤ (1 + 2 max(c1, d1)) OPT + 2 max(c1, d1) L1′(PX, P̂nX) + max(c1, d1) · (c2/c1 + d2/d1),   (60)

where

  OPT = inf_{g ∈ G} L(P_{g(Z)}, PX).   (61)

G. Testing the Strength of Discriminators With Empirical Samples

In order to achieve a fast convergence rate, it is crucial that L′ in Theorem 1 is weak enough. As illustrated in Theorem 5, with an inappropriate choice of discriminator, the discriminator winning phenomenon will lead to poor performance of GANs. In this section, we propose a principled way to predict, from empirical samples, the number of samples needed to alleviate the discriminator winning phenomenon, via matching theory [43]. Concretely, we aim at estimating the distance between the empirical distribution and its population distribution, L(P̂nX, PX), using only empirical samples.

The method we propose is as follows. Given N samples from the true distribution, we select the first n < N/2 samples to form an empirical distribution P̂1, and another disjoint n samples to form an empirical distribution P̂2. Then we compute L(P̂1, P̂2) as a proxy for L(PX, P̂nX). A special case of this method is known as maximum discrepancy [44] in the literature. However, as is shown below, our method is more general and can be applied to any pseudometric that is convex in its first argument. Our non-asymptotic result below can be seen as a generalization of the asymptotic justification for the matching method in [45].

Theorem 10: Given N samples from the true distribution, we select the first n < N/2 samples to form an empirical distribution P̂1, and another disjoint n samples to form an empirical distribution P̂2. Suppose L(P, Q) is a pseudometric which is convex in its first argument. Then,

  E[L(PX, P̂nX)] ≤ E[L(P̂1, P̂2)] ≤ 2E[L(PX, P̂nX)].   (62)

The proof is deferred to Appendix K. In the experiments below, we use the matching method to successfully recover the exponent predicted by Theorems 5 and 6. We conduct the experiments for estimating the W2 and W̃2 distances between an isotropic Gaussian distribution and its empirical distribution with d = 10 and d = 20; the result is shown in Figure 5. We can see that the matching method successfully recovers the exponent of the sample size n. In our analysis, it is shown that for d ≥ 3 and n large enough, the quadratic GAN can achieve a convergence rate of d n^{−1/2}, and the fitted rate from the two figures is d^{1.00} n^{−0.53}. Meanwhile, the convergence rate for the naive W2 GAN structure is lower bounded by d^{1/2} n^{−3/d} and upper bounded by C(d) n^{−1/d}. The fitted rate is d^{0.48} n^{−1.24/d}. We conjecture that the convergence rate is exactly d^{1/2} n^{−1/d} up to universal multiplicative constants for the W2 distance when d ≥ 3, whose validity is currently an open problem in optimal transport theory. Based on the matching method, we can estimate the number of samples required for L(PX, P̂nX) to achieve a certain error ε, which helps predict when the discriminator winning problem occurs.

Fig. 5. The estimated W2 and W̃2 distances between a high dimensional isotropic Gaussian distribution and its empirical distribution for the original W2 GAN and the Quadratic GAN. The left figure is for a 10-dimensional Gaussian and the right figure is for a 20-dimensional Gaussian. The x and y axes are all logarithmic. The fitted coefficients of the linear model from the two lines show that the matching method produces results consistent with the theoretical predictions at reasonably small sample sizes.

IV. OPTIMIZATION OF GANS

GANs are usually optimized using alternating gradient descent (AGD), which is known to be difficult to train, with few known stability or convergence guarantees. An unstable training procedure may lead to mode collapse. Given a learning algorithm proposed from statistical considerations, it is crucial to ensure that it is provably computationally efficient. Thus it is of paramount importance to ensure that widely used algorithms such as AGD are able to provably solve the GAN optimization problem.

It was shown in prior work [16] that the optimal generated distribution under (45) is a Gaussian distribution whose covariance matrix is the r-PCA of the empirical covariance matrix Σ̂. However, in practice, PCA is solved using specialized algorithms, not alternating gradient descent. Recent work on the Fréchet Inception distance [40] also suggests training GANs under the quadratic GAN distance W̃2 between coding layers. Hence, it is natural to ask whether particularizing the alternating gradient descent algorithm to this specific setting produces a convergent, stable, and efficient algorithm for the r-PCA. We mention that [16] has obtained some results in this direction, but here we explore the question in greater depth.

In this section, we show that for the linear generator family

  Gr = {g(Z) = AZ : A ∈ R^{d×r}, Z ∈ R^r}   (63)

and the W2 distance, AGD is globally asymptotically stable when r = d, but is generally not locally asymptotically stable when r < d. We propose another GAN architecture based on parameter sharing which is provably globally asymptotically stable⁷ and successfully computes the PCA using AGD. Intuitively, one reason why AGD is not locally asymptotically stable in solving the W2-GAN formulation of PCA is that the minimax value is not the same as the maximin value in the GAN formulation, but AGD does not favor minimax explicitly over maximin. However, we do emphasize that this is not the sole reason, since the example in Figure 7 has duality gap zero, but alternating gradient descent is not asymptotically stable in

⁷ The definition of local and global asymptotic stability is given in the Appendix.
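The split-sample matching estimate of Theorem 10 above is simple enough to sketch in code. The following is a minimal illustration under stated assumptions (it is not the paper's experimental code): L is taken to be the empirical W2 distance, which for two equal-size samples with uniform weights reduces to an optimal assignment problem, and the sample sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(x, y):
    # Exact W2 between two empirical measures with n points each (uniform
    # weights): optimal transport reduces to an optimal assignment problem.
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())

def matching_estimate(samples, dist=empirical_w2):
    # Theorem 10: split 2n samples into two disjoint halves P_hat_1, P_hat_2;
    # E[dist(P_hat_1, P_hat_2)] is sandwiched between E[L(P_X, P_hat_n)] and
    # twice that quantity, so it serves as a data-driven proxy.
    n = len(samples) // 2
    return dist(samples[:n], samples[n:2 * n])

# toy usage: proxy for W2(P_X, P_hat_n) for an isotropic Gaussian, d = 10
rng = np.random.default_rng(2)
for n in [50, 100, 200, 400]:
    x = rng.standard_normal((2 * n, 10))
    print(n, matching_estimate(x))
```

Plotting these proxies against n on log-log axes is how the exponents reported around Figure 5 can be recovered in practice.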
TABLE I. Minimax and maximin values of the original Quadratic GAN in (65) for r = 1 and the parameter sharing GAN in (72).

non-negligible subset of the local Nash equilibria, and will lead to convergence to non-Nash strategies.

B. Stability Results for Parameter Sharing GAN

Ideally, in order to solve the minimax GAN optimization problem, we need to first solve the max and then the min, since minimax may not equal maximin in general. The failure of alternating gradient descent in Theorem 12 shows that treating the min and max optimization problems on an equal footing may cause instability problems. The idea of solving the maximization problem more and the minimization problem less is implicitly suggested in [51] and in the two time-scale update rule [40].

The next result shows that if we embed some knowledge of the optimal solution of the inner maximization problem, we can create a new minimax formulation with the following properties:
• Zero duality gap: the minimax value equals the maximin value;
• Relation to the old minimax formulation: the minimax value of the new formulation is equal to the minimax value of the old formulation without changing the generator;
• Optimization stability: AGD is globally asymptotically stable in solving the r-PCA for r = 1.

Naturally, one shall embed the connections between the optimal values A∗ and U∗ in the optimization problem. It can be shown that when r = 1, the optimal solutions in (65) can be written in the following form:

  K = QΣQᵀ,   (67)
  U∗ = λ1 v1 v1ᵀ.   (68)

  L(P_{g̃(Z)}, PX) ≤ 3 · OPT + D(B, C) max{ √(d log(n)/n), √(log(2/δ)/n) }.   (100)

Unlike a metric, one may have d(x, y) = 0 for distinct values x ≠ y.

Definition 3 (Pseudonorm): A real-valued function ‖·‖ : X → R is a pseudonorm if for any x, y ∈ X and any scalar c, the following conditions hold:

  ‖x‖ ≥ 0, and x = 0 ⇒ ‖x‖ = 0;   (76)
  ‖cx‖ = |c| ‖x‖;   (77)
  ‖x + y‖ ≤ ‖x‖ + ‖y‖.   (78)

A pseudonorm differs from a norm in that a pseudonorm can be zero for nonzero vectors (functions).

Definition 4 (ε-Covering Number [33]): Let (V, ‖·‖) be a normed space, and G ⊂ V. We say {V1, ..., VN} is an ε-covering of G if ∀g ∈ G, ∃i such that ‖g − Vi‖ ≤ ε.

Definition 5 (Equilibrium Point [52]): Consider a nonlinear time-invariant system dx/dt = f(x), where f : R^d → R^d. A point x∗ ∈ R^d is an equilibrium point of the system if f(x∗) = 0.

Definition 6 (Local Stability [52]): An equilibrium point in a system is locally stable if any point initialized near the equilibrium point stays near that equilibrium point. Formally, we say that an equilibrium point x∗ is locally stable if for all ε > 0, there exists a δ > 0 such that

  ‖x(0) − x∗‖ < δ ⇒ ‖x(t) − x∗‖ < ε, ∀t > 0.   (79)

Definition 7 (Local Asymptotic Stability [52]): An equilibrium point x∗ in a system is locally asymptotically stable if it is locally stable, and there exists a δ > 0 such that

  ‖x(0) − x∗‖ < δ ⇒ lim_{t→∞} ‖x(t) − x∗‖ = 0.   (80)
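These stability notions are exactly what Section IV establishes for the parameter-sharing GAN solving the r = 1 PCA problem. As a hedged numerical illustration (not part of the paper), the sketch below integrates the alternating gradient flow written out in (250) of the Appendix with a crude explicit-Euler scheme. The initialization, step size, renormalization of v after each step, and the clamping of b and λ are all illustrative assumptions; the paper's Lyapunov analysis is for the continuous-time flow.

```python
import numpy as np

def pca_gan_flow(K, steps=200_000, dt=1e-3, seed=0):
    # Euler discretization of the flow in (250), for b > 0, lam > 0:
    #   dv/dt   = 2*lam*K v - 2*lam*(v'Kv) v   (gradient restricted to ||v|| = 1)
    #   db/dt   = 1/lam - 1
    #   dlam/dt = -v'Kv + b/lam**2
    # The Appendix shows the continuous flow converges to
    # v = +/- top eigenvector of K, b = lambda_1, lam = 1.
    rng = np.random.default_rng(seed)
    d = K.shape[0]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    b, lam = 1.0, 1.0
    for _ in range(steps):
        vKv = v @ K @ v
        v = v + dt * (2 * lam * (K @ v) - 2 * lam * vKv * v)
        v /= np.linalg.norm(v)                      # stay on the unit sphere
        b = max(b + dt * (1 / lam - 1), 0.0)        # crude handling of b >= 0
        lam = max(lam + dt * (-vKv + b / lam**2), 1e-6)
    return v, b, lam

K = np.diag([3.0, 2.0, 1.0])
v, b, lam = pca_gan_flow(K)
print(v, b, lam)   # expect v close to +/- e_1, b close to 3, lam close to 1
```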
  TV(P_{g̃}, PX) ≤ 3 · inf_{g ∈ G} TV(Pg, PX) + 2TV′(PX, P̂nX)
    ≤ 3 · OPT + 2TV′(PX, P̂nX).   (102)

Since the class A = {{x ∈ R^d : ⟨v, x⟩ + b > 0} : ‖v‖₂ = 1, b ∈ R^d} is all halfspaces in R^d, it has VC dimension d + 1 [53]. By the VC inequality [33, Theorem 2.2, Chap. 4], regardless of the distribution PX, with probability at least 1 − δ there exists some absolute constant C such that

  TV′(PX, P̂nX) ≤ C · √((d+1)/n) + √(log(1/δ)/(2n)).   (103)

Combining equations (102) and (103), we can derive that with probability at least 1 − δ, there exists some universal constant C such that

  TV(P_{g̃}, PX) ≤ 3 · OPT + C · √((d+1)/n) + 2√(log(1/δ)/(2n)).   (104)

E. Proof of Theorem 4

We first show the first property is satisfied. From the definition of LTukey in (31), there is LTukey(PX, PX) = 0 for any PX ∈ G, since each distribution in G is symmetric. For any two distributions P1 = N(μ1, I), P2 = N(μ2, I), it follows from straightforward computation that

  TV(P1, P2) = 2Φ(‖μ1 − μ2‖₂ / 2) − 1.   (105)

We also have

  LTukey(P1, P2) = Φ(‖μ1 − μ2‖₂) − 1/2.   (106)

Here Φ(x) = ∫_{−∞}^{x} φ(t) dt is the standard normal CDF. We notice that for any P1, P2 ∈ G, we have

  LTukey(P1, P2) ≥ (1/2) TV(P1, P2),   (107)

which follows from the fact that Φ(x), as a function on [0, ∞), is non-decreasing. Hence the first property is satisfied with c1 = 2, c2 = 0.

  TV(P_{g̃}, PX) ≤ 5 · OPT + 4TV′(PX, P̂nX).   (109)

It follows from arguments similar to those in the proof of Theorem 3 that there exists an absolute constant C > 0 such that with probability at least 1 − δ,

  TV(P_{g̃}, PX) ≤ 5 · OPT + C · √((d+1)/n) + 4√(log(1/δ)/(2n)).   (110)

F. Proof of Theorem 5

1) Concentration for W2 Distance for Distributions That Satisfy Logarithmic Sobolev Inequality: Following an idea similar to [54, Corollary 5.5], we can prove the following lemma.

Lemma 2: For any measure μ on a Polish space X, consider a random variable X ∼ μ and corresponding empirical samples x ∈ X^n, with each xi sampled i.i.d. from distribution μ. The map x → W2(PX, P̂nX) is (1/√n)-Lipschitz under the Euclidean distance, i.e., viewing W2(PX, P̂nX) as a function W(x), for different sets of samples x, x′,

  |W(x) − W(x′)| ≤ (1/√n) ‖x − x′‖.   (111)

Proof: By the triangle inequality for the W2 distance,

  |W(x) − W(x′)| ≤ W2(P̂nX, P̂′nX),   (112)

where P̂′nX denotes the empirical distribution of x′. We know that the Wasserstein distance is related to the Talagrand function [54] as follows:

  W2(μ, ν) = T2(μ, ν)^{1/2}
    = [ inf_π { ∫_{X²} |x − y|² dπ(x, y) : π(x, y) ∈ P(X²), π(x) = μ, π(y) = ν } ]^{1/2}.   (113)

According to the convexity of T2(·, ·) [55, Theorem 4.8], one has

  W2(P̂nX, P̂′nX) = (T2(P̂nX, P̂′nX))^{1/2}
    ≤ ( (1/n) Σ_{i=1}^{n} T2(δ(xi), δ(x′i)) )^{1/2}
    = (1/√n) ( Σ_{i=1}^{n} ‖xi − x′i‖² )^{1/2}
    = (1/√n) ‖x − x′‖.   (114)
equation (129) is given by

  W2²(N(0, ĉI), P̂nX) = min_{c ≥ 0} W2²(N(0, cI), P̂nX)
    = min_{c ≥ 0} E‖Z‖₂² c² + E‖X̂‖₂² − 2 sup_π E[ZᵀX̂] c
    = min_{c ≥ 0} dc² + (1/n) Σ_{i=1}^{n} ‖xi‖² − 2ρn c
    = (1/n) Σ_{i=1}^{n} ‖xi‖² + min_{c ≥ 0} (dc² − 2ρn c)
    = (1/n) Σ_{i=1}^{n} ‖xi‖² − ρn²/d.   (130)

The optimal c that minimizes the distance between the empirical distribution and the real distribution is

  ĉ = ρn / d.   (131)

Furthermore, we know that

  W2²(N(0, I), P̂nX) = E‖Z‖₂² + E‖X̂‖₂² − 2 sup_π E[ZᵀX̂]
    = d + (1/n) Σ_{i=1}^{n} ‖xi‖² − 2ρn.   (132)

We denote A = (1/n) Σ_{i=1}^{n} ‖xi‖², which satisfies E[A] = d

Lemma 4 [60]: For any symmetric positive definite matrices A, B, we have

  ‖A^{1/2} − B^{1/2}‖ ≲ ‖A − B‖^{1/2},   (136)

where ‖·‖ is the operator norm.

Proof: Since equation (136) is homogeneous in scaling, it suffices to prove that for ‖A − B‖ ≤ 1, there is ‖A^{1/2} − B^{1/2}‖ ≤ C. Now consider the function

  f(x) = ∫_0^1 (1 − 1/(1 + tx)) t^{−3/2} dt.   (137)

Letting s = tx, we get

  f(x) = x^{1/2} ∫_0^x (s/(1+s)) s^{−3/2} ds
    = x^{1/2} ( ∫_0^∞ (s/(1+s)) s^{−3/2} ds − ∫_x^∞ (s/(1+s)) s^{−3/2} ds )
    = K x^{1/2} + g(x),   (138)

where K = ∫_0^∞ (s/(1+s)) s^{−3/2} ds and g(x) = −x^{1/2} ∫_x^∞ (s/(1+s)) s^{−3/2} ds.

It can be easily seen that K < ∞ and |g(x)| ≤ 2. Thus, we know

  ‖K A^{1/2} − f(A)‖ = ‖g(A)‖ ≤ 2.   (139)

This would imply

  ‖A^{1/2} − B^{1/2}‖ ≤ ‖A^{1/2} − f(A)/K‖ + ‖f(A)/K − f(B)/K‖ + ‖f(B)/K − B^{1/2}‖
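As a quick numerical sanity check of Lemma 4 (not part of the original argument), the sketch below draws random symmetric positive definite pairs A, B and prints the ratio ‖A^{1/2} − B^{1/2}‖ / ‖A − B‖^{1/2}, which should remain bounded. The dimension, perturbation size, and regularization constants are illustrative choices.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
d = 15
for _ in range(5):
    M = rng.standard_normal((d, d))
    A = M @ M.T + 0.1 * np.eye(d)                  # symmetric positive definite
    E = rng.standard_normal((d, d))
    B = A + 0.05 * (E + E.T)                       # symmetric perturbation of A
    shift = 1e-3 - min(0.0, np.linalg.eigvalsh(B).min())
    B = B + shift * np.eye(d)                      # keep B positive definite
    lhs = np.linalg.norm(np.real(sqrtm(A)) - np.real(sqrtm(B)), 2)
    rhs = np.linalg.norm(A - B, 2) ** 0.5
    print(lhs / rhs)   # stays O(1), consistent with (136)
```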
  = inf_{g ∈ Gr} W̃2(P_{g(Z)}, PX) + 2 inf_{g ∈ G} W2(P_{g(Z)}, PX) + 2W̃2(P̂nX, PX)
  ≤ ‖ΣX^{1/2} − Σr^{1/2}‖_F + 2 inf_{g ∈ G} W2(P_{g(Z)}, PX) + 2W̃2(P̂nX, PX).   (146)

Here ΣX = E[XXᵀ] is the covariance matrix of X, and Σr is the best rank-r approximation of ΣX under the Frobenius norm. Next, we bound the term W̃2(P̂nX, PX). Denote the covariance matrices of PX and P̂nX as KX and K̂X. The distance W̃2²(PX, P̂nX) can be written as [61]

  W̃2²(PX, P̂nX) = Tr(KX) + Tr(K̂X) − 2 Tr((KX^{1/2} K̂X KX^{1/2})^{1/2}) + ‖µX − µ̂X‖².   (147)

We know

  Tr((KX^{1/2} K̂X KX^{1/2})^{1/2}) = Tr(((K̂X^{1/2} KX^{1/2})ᵀ K̂X^{1/2} KX^{1/2})^{1/2}) ≥ Tr(K̂X^{1/2} KX^{1/2}).   (148)

Denote X = K̂X^{1/2} KX^{1/2}. The last inequality follows from the fact that for any matrix X with singular value decomposition X = UΣV,

  Tr((XᵀX)^{1/2}) = Tr((VᵀΣ²V)^{1/2}) = Tr(VᵀΣV) ≥ Tr(UΣV) = Tr(X).   (149)

  ≲ √(d/B) ‖KX − K̂X‖ + ‖µX − µ̂X‖.   (153)

H. Proof of Theorem 7

We aim at justifying the tightness of Theorem 2. Consider the following two sampling models:
• Case 1. We draw n i.i.d. samples directly from N(0, I). Denote the resulting empirical distribution as E1.
• Case 2. We draw N elements i.i.d. from N(0, I). Denote the empirical distribution as DN. Then, we sample n i.i.d. points from DN. Denote the resulting empirical distribution as E2.

For any distribution E we write A^E to indicate that algorithm A is only given access to the information contained in E. The following lemma states that no algorithm can successfully distinguish these two cases when n ≪ √N.

Lemma 5: There is an absolute constant c > 0 such that the following holds. If n ≤ c√N, and B is any "distinguishing" algorithm that aims at outputting either "1" or "2", then

  |P(B^{E1} outputs "1") − P(B^{E2} outputs "1")| ≤ 0.01.   (154)

Proof: We show that in both cases, with probability at least 0.99, the c√N draws received by B are a set consisting of c√N i.i.d. samples from N(0, I). This can be shown using a birthday paradox type argument below. The empirical distribution E2 can be understood as drawing n i.i.d. samples from DN with replacement. On the other hand, the empirical distribution E1 can be understood as drawing n i.i.d. samples from DN without replacement. It follows from the birthday
paradox that the probability of sampling all distinct points in DN in sampling with replacement is

  (1 − 1/N)(1 − 2/N) ··· (1 − (n−1)/N) > e^{−2/N} e^{−4/N} ··· e^{−2(n−1)/N} = e^{−n(n−1)/N}.   (155)

The inequality comes from the fact that 1 − x ≥ e^{−2x} for x ∈ [0, 1/2]. Taking c = 0.01, we have e^{−n(n−1)/N} > 0.99, which gives the conclusion of the lemma.

Now we use Lemma 5 to prove Theorem 7. Suppose that there exists an algorithm A such that, given n = c√N independent points drawn from any PX ∈ PK, A outputs a hypothesis P̂X satisfying, with probability at least 51/100,

  W2(P̂X, PX) ≤ inf_{g(·) ∈ G} W2(P_{g(Z)}, PX) + ε(n, d),   (156)

where ε(n, d) ≪_d n^{−6/d}. We will describe how the existence of A yields a distinguishing algorithm B violating Lemma 5. The algorithm B works as follows. Given access to i.i.d. draws from distribution PX, it first runs algorithm A, obtaining with probability at least 51/100 a distribution P̂X satisfying equation (156), and then computes the value W2(P̂X, N(0, I)). If W2(P̂X, N(0, I)) ≤ ε(n, d) then it outputs "1", and otherwise it outputs "2".

Since G is the family of full-dimensional affine generators, inf_{g(·) ∈ G} W2(P_{g(Z)}, N(0, I)) = 0, which shows that in Case 1 we have, with probability 51/100, W2(P̂X, N(0, I)) ≤ ε(n, d). Hence,

  P(B^{E1} outputs "1") ≥ 0.51.   (157)

Now let us consider Case 2. In this case the true underlying distribution is DN, and for N ≫ d [41], the minimum eigenvalue of the covariance matrix of DN is bigger than 1/2 with probability at least 0.999. It follows from the concentration of the norm [42, Theorem 3.1.1] that there exists a universal constant L > 0 such that if Y ∼ N(0, I),

  P(| ‖Y‖ − √d | ≥ t) ≤ 2 exp(−t²/L).   (158)

It follows from the triangle inequality that

  W2(P̂X, N(0, I)) ≥ W2(DN, N(0, I)) − W2(P̂X, DN)   (162)
    ≥ W2(DN, N(0, I)) − inf_{g(·) ∈ G} W2(P_{g(Z)}, DN) − ε(n, d).   (163)

According to Theorem 5, following the same procedure as (133), we have with probability at least 0.999,

  W2(DN, N(0, I)) − inf_{g(·) ∈ G} W2(P_{g(Z)}, DN) ≳_d N^{−3/d}   (164)
    ≳_d n^{−6/d},   (165)

where in the last step we used the assumption that n = c√N. It implies that with probability at least 0.507,

  W2(P̂X, N(0, I)) > ε(n, d)   (166)

if ε(n, d) ≪_d n^{−6/d} as n → ∞.

Algorithm B outputs "1" if W2(P̂X, N(0, I)) ≤ ε(n, d); hence we know

  P(B^{E2} outputs "1") ≤ 0.493,   (167)

which implies that

  |P(B^{E1} outputs "1") − P(B^{E2} outputs "1")| > 0.01.   (168)

This contradicts Lemma 5 and proves the theorem.

I. Proof of Theorem 8

For infinite sample size, we have access to the real distribution PX, and our GAN estimator would output a Gaussian distribution N(µX, ΣX), where µX = E_{PX}[X] and ΣX = E_{PX}[XXᵀ] are the mean and covariance of the distribution PX. It suffices to show that for any ε > 0, we can find some distribution PX with µX = 0 and inf_{µ,Σ} W2(N(µ, Σ), PX) > 0 such that

  W2(N(0, ΣX), PX) ≥ (√2 − ε) inf_{µ,Σ} W2(N(µ, Σ), PX).
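The birthday-paradox bound (155) used in the proof of Lemma 5 above is easy to verify numerically. The short sketch below evaluates the lower bound e^{−n(n−1)/N} for n = c√N with c = 0.01 and a few illustrative values of N, confirming it exceeds 0.99.

```python
import numpy as np

def all_distinct_lower_bound(n, N):
    # Lower bound on the probability that n draws with replacement from N
    # items are all distinct, as in (155): prod_k (1 - k/N) > exp(-n(n-1)/N).
    return np.exp(-n * (n - 1) / N)

for N in [10_000, 1_000_000, 100_000_000]:
    n = int(0.01 * np.sqrt(N))      # n = c * sqrt(N) with c = 0.01
    print(N, n, all_distinct_lower_bound(n, N))
```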
Now we verify the second property. For any distributions P1, P2 and any g ∈ G, denote

  g1∗ = arg min_{g′ ∈ G′} L1′(P_{g′}, P1) + λL2′(P_g, P_{g′}),   (198)
  g2∗ = arg min_{g′ ∈ G′} L1′(P_{g′}, P2) + λL2′(P_g, P_{g′}).   (199)

Without loss of generality, assume L′(P_g, P1) ≥ L′(P_g, P2); then there is

  |L′(P_g, P1) − L′(P_g, P2)| = (L1′(P_{g1∗}, P1) + λL2′(P_g, P_{g1∗}))

which by definition is an upper bound on W2²(N(0, A), N(0, B)) for any Ψ such that the block matrix [A, Ψ; Ψᵀ, B] ⪰ 0. We now show that for any S ⪰ 0 with R(B) ⊂ R(S),

  Tr((I − S)A + (I − S†)B) ≤ W2²(N(0, A), N(0, B)).   (211)

Indeed,

  W2²(N(0, A), N(0, B)) = Tr(A) + Tr(B) − 2 sup_π E_π[XᵀY]
we now analyze the alternating gradient descent flow for this problem. The gradient flow differential equation is given by

  d/dt [v; b; λ] = [ −∂f/∂v ; −∂f/∂b ; ∂f/∂λ ].   (249)

Note that we have constrained ‖v‖₂ = 1; in other words, v lies in the unit sphere. Suppose the unconstrained derivative of f at v is g; then the gradient of f constrained on the Stiefel manifold [64] ‖v‖₂ = 1 is g − vgᵀv. Indeed, the direction g − vgᵀv is orthogonal to v.

Concrete computation shows that when b > 0, λ > 0,

  d/dt [v; b; λ] = [ 2λKv − 2λ(vᵀKv)v ; 1/λ − 1 ; −vᵀKv + b/λ² ].   (250)

When b = 0, db/dt = (1/λ − 1) ∨ 0, and when λ = 0, dλ/dt = (−vᵀKv + b/λ²) ∨ 0.

The saddle point we aim for is

  [v; b; λ] = [ ±v1 ; v1ᵀKv1 ; 1 ] = [ ±v1 ; λ1 ; 1 ].   (251)

Here v1 is an eigenvector corresponding to the largest eigenvalue of K, and λ1 is the largest eigenvalue of K, which is assumed to be positive. We also assume that at the beginning of the optimization, we initialize v such that v1ᵀv ≠ 0.

Hence, we construct the Lyapunov function L(v, b, λ) as

  L(v, b, λ) = (1/8 + λ1²) ln(1/|vᵀv1|) + (1/2)(b − λ1)² + (1/2)(λ − 1)² + (1/8)(1/λ − 1)² + (1/8)(b/λ² − vᵀKv)².   (252)

Note that if λ → 0+, L → ∞, so if we are able to prove the descent property of L, then λ will not hit zero along the trajectory. Solving for the gradient of L and denoting v(t), b(t), λ(t) as v, b, λ, when b > 0, λ > 0 we have

  d/dt L(v(t), b(t), λ(t))
    = −(1/8 + λ1²) (v1ᵀ/(vᵀv1)) (2λKv − 2λ(vᵀKv)v)
      + (b − λ1)(1/λ − 1)
      + (λ − 1)(−vᵀKv + b/λ²)
      + (1/4)(1/λ − 1)(−1/λ²)(b/λ² − vᵀKv)
      + (1/4)(b/λ² − vᵀKv)(1/λ²)(1/λ − 1)
      + (1/4)(vᵀKv − b/λ²) · 2vᵀK (2λKv − 2λ(vᵀKv)v)
      + (1/4)(b/λ² − vᵀKv)(−2b/λ³)(b/λ² − vᵀKv).   (253)

The case of b = 0, λ > 0 can also be solved, and it is straightforward to show that dL/dt ≤ 0 for any v such that vᵀv = 1, b = 0, λ > 0.

Taking the part corresponding to b out, we have

  b (1/λ − 1 + 1/λ − 1/λ²) − (b/(4λ²)) · 2vᵀK (2λKv − 2λ(vᵀKv)v) − (b/(2λ³))(b/λ² − vᵀKv)²   (254)
    = −b (1 − 1/λ)² − (b/λ)(vᵀK²v − (vᵀKv)²) − (b/(2λ³))(b/λ² − vᵀKv)²   (255)
    ≤ 0,   (256)

where equality holds if and only if λ = 1, vᵀK²v = (vᵀKv)², and b = vᵀKv; here we have used the inequality vᵀK²v ≥ (vᵀKv)².

We decompose v as

  v = Σ_{i=1}^{d} aᵢ vᵢ,   (257)

where {v1, v2, ..., vd} are the eigenvectors of the matrix K. Here Σᵢ aᵢ² = 1.

The terms that remain in the derivative of L are

  −(1/8 + λ1²)(v1ᵀ/(vᵀv1))(2λKv − 2λ(vᵀKv)v) + λ1(1 − 1/λ) + (1 − λ)vᵀKv + (1/4) vᵀKv · 2vᵀK(2λKv − 2λ(vᵀKv)v),   (258)

which is equivalent to

  −(1/8 + λ1²) · 2λ (λ1 − Σᵢ λᵢaᵢ²) + λ1 − λ1/λ + Σᵢ λᵢaᵢ² − λ Σᵢ λᵢaᵢ² + λ (Σᵢ λᵢaᵢ²)(Σᵢ λᵢ²aᵢ² − (Σᵢ λᵢaᵢ²)²).   (259)

Hence, it suffices to show that

  λ1 + Σᵢ λᵢaᵢ² ≤ λ1/λ + λ [ Σᵢ λᵢaᵢ² + 2(1/8 + λ1²)(λ1 − Σᵢ λᵢaᵢ²) − Σᵢ λᵢaᵢ² (Σᵢ λᵢ²aᵢ² − (Σᵢ λᵢaᵢ²)²) ].   (260)

To simplify notation, we introduce

  x = Σᵢ λᵢaᵢ²,   (261)
  y² = Σᵢ λᵢ²aᵢ²;   (262)
clearly we have

  λ1 ≥ y ≥ x ≥ 0.   (263)

The inequality that we need to prove is

  λ1 + x ≤ λ1/λ + λ [ x + 2(1/8 + λ1²)(λ1 − x) − x(y² − x²) ].   (264)

We note that

  x + 2(1/8 + λ1²)(λ1 − x) − x(y² − x²)
    ≥ x + 2(1/8 + λ1²)(λ1 − x) − x(y + x)(λ1 − x)   (265)
    ≥ x + (λ1 − x)[ 1/4 + 2λ1² − x(y + x) ]   (266)
    ≥ x + (λ1 − x)[ 1/4 + 2λ1² − λ1(2λ1) ]   (267)
    ≥ x + (1/4)(λ1 − x)   (268)
    > 0,   (269)

since we have assumed that λ1 > 0.

Minimizing the right-hand side with respect to λ shows that it suffices to show that

  (λ1 + x)² ≤ 4λ1 [ x + 2(1/8 + λ1²)(λ1 − x) − x(y² − x²) ],   (270)

which is equivalent to

  (λ1 − x)² ≤ 8λ1 (1/8 + λ1²)(λ1 − x) − 4λ1 x(y + x)(y − x).   (271)

Since λ1 ≥ y, it suffices to show that

  (λ1 − x)² ≤ 8λ1 (1/8 + λ1²)(λ1 − x) − 4λ1 x(y + x)(λ1 − x).   (272)

Clearly, this inequality holds when λ1 = x. If λ1 > x, it suffices to show that

  λ1 − x ≤ λ1(1 + 8λ1²) − 4λ1 x(y + x).   (273)

It is true since

  λ1(1 + 8λ1²) − 4λ1 x(y + x) ≥ λ1 + 8λ1³ − 4λ1 · λ1(λ1 + λ1)   (274)
    = λ1   (275)
    ≥ λ1 − x.   (276)

The above chain of inequalities holds with equality if and only if λ1 = x.

Hence, we have shown that the derivative of L is strictly negative in the regime vᵀv = 1, λ > 0, b ≥ 0 except when λ = 1, vᵀKv = λ1, b ∈ {0, λ1}. The largest invariant set of (v, b, λ) satisfying the condition above is {(v, b, λ) : vᵀv = 1, λ = 1, b = λ1, vᵀKv = λ1}.

To apply the LaSalle theorem [65], it suffices to show that the trajectory of (v, b, λ) is constrained in a compact set, and that over that set dL/dt ≤ 0 and L(v, b, λ) > 0 except for λ = 1, b = vᵀKv = λ1. Indeed, the compact set can be chosen to be

  {(v, b, λ) : vᵀv = 1, b ∈ [0, bc], λ ∈ [λc1, λc2]}.   (277)

Indeed, the constraint bc comes from the fact that L → ∞ as b → ∞; hence the trajectory of b is bounded by a constant bc which depends on the initial condition. The constraints λc1, λc2 come from similar considerations, since L → ∞ whenever λ → 0+ or λ → ∞.

Furthermore, one can check that L is radially unbounded, i.e. L(v, b, λ) → ∞ as ‖(v, b, λ)‖ → ∞. Invoking [66, Corollary 5.2], we know that the dynamics converge to a single point in the set {(v, b, λ) : vᵀv = 1, λ = 1, b = λ1, vᵀKv = λ1} since every point in this set is globally asymptotically stable.

ACKNOWLEDGMENT

The authors would like to thank Farzan Farnia and Soheil Feizi for helpful discussions at the initial stage of this project. They would like to thank Luigi Ambrosio and Yihong Wu for discussions related to optimal transport, and Jacob Steinhardt and Ilias Diakonikolas for discussions related to robust statistics. They are grateful to Chao Gao for stimulating discussions on the relationship between GANs and robust statistics at the workshop on robust and high-dimensional statistics held at the Simons Institute at the University of California, Berkeley, in fall 2018.

REFERENCES

[1] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Evanston, IL, USA: Routledge, 2018.
[2] L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik, "Dimensionality reduction: A comparative review," J. Mach. Learn. Res., vol. 10, nos. 1–41, pp. 66–71, Oct. 2009.
[3] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," 2018, arXiv:1809.11096. [Online]. Available: http://arxiv.org/abs/1809.11096
[4] T. Young, D. Hazarika, S. Poria, and E. Cambria, "Recent trends in deep learning based natural language processing," IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," 2017, arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875
[7] I. Durugkar, I. Gemp, and S. Mahadevan, "Generative multi-adversarial networks," 2016, arXiv:1611.01673. [Online]. Available: http://arxiv.org/abs/1611.01673
[8] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, "Stacked generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jul. 2017, p. 3.
[9] D. Jiwoong Im, H. Ma, C. Dongjoo Kim, and G. Taylor, "Generative adversarial parallelization," 2016, arXiv:1612.04021. [Online]. Available: http://arxiv.org/abs/1612.04021
[10] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," 2016, arXiv:1610.09585. [Online]. Available: http://arxiv.org/abs/1610.09585
[11] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434
[12] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
[13] I. O. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, "AdaGAN: Boosting generative models," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5424–5433.
[14] T. Xu et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," 2017, arXiv:1711.10485. [Online]. Available: http://arxiv.org/abs/1711.10485
[15] Y. Bai, T. Ma, and A. Risteski, "Approximability of discriminators implies diversity in GANs," 2018, arXiv:1806.10586. [Online]. Available: http://arxiv.org/abs/1806.10586
[16] S. Feizi, F. Farnia, T. Ginart, and D. Tse, "Understanding GANs: The LQG setting," 2017, arXiv:1710.10793. [Online]. Available: http://arxiv.org/abs/1710.10793
[17] K. Li and J. Malik, "On the implicit assumptions of GANs," 2018, arXiv:1811.12402. [Online]. Available: http://arxiv.org/abs/1811.12402
[18] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee, "On the convergence and robustness of training GANs with regularized optimal transport," 2018, arXiv:1802.08249. [Online]. Available: http://arxiv.org/abs/1802.08249
[19] T. Liang, "On how well generative adversarial networks learn densities: Nonparametric and parametric results," 2018, arXiv:1811.03179. [Online]. Available: http://arxiv.org/abs/1811.03179
[20] T. Che, Y. Li, A. Paul Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," 2016, arXiv:1612.02136. [Online]. Available: http://arxiv.org/abs/1612.02136
[21] B. Neyshabur, S. Bhojanapalli, and A. Chakrabarti, "Stabilizing GAN training with multiple random projections," 2017, arXiv:1705.07831. [Online]. Available: http://arxiv.org/abs/1705.07831
[22] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, "Generalization and equilibrium in generative adversarial nets (GANs)," 2017, arXiv:1703.00573. [Online]. Available: http://arxiv.org/abs/1703.00573
[23] S. Arora and Y. Zhang, "Do GANs actually learn the distribution? An empirical study," 2017, arXiv:1706.08224. [Online]. Available: https://arxiv.org/abs/1706.08224
[24] V. Dumoulin et al., "Adversarially learned inference," 2016, arXiv:1606.00704. [Online]. Available: http://arxiv.org/abs/1606.00704
[25] C. Gao, J. Liu, Y. Yao, and W. Zhu, "Robust estimation and generative adversarial nets," 2018, arXiv:1810.02030. [Online]. Available: http://arxiv.org/abs/1810.02030
[26] F. Farnia and D. Tse, "A convex duality framework for GANs," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 5254–5263.
[27] A. Bora, E. Price, and A. G. Dimakis, "AmbientGAN: Generative models from lossy measurements," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–22.
[28] A. S. Dalalyan and A. B. Tsybakov, "Aggregation by exponential weighting and sharp oracle inequalities," in Proc. Int. Conf. Comput. Learn. Theory. Berlin, Germany: Springer, 2007, pp. 97–111.
[29] T. Rainforth et al., "Tighter variational bounds are not necessarily better," 2018, arXiv:1802.04537. [Online]. Available: http://arxiv.org/abs/1802.04537
[30] S. Liu, O. Bousquet, and K. Chaudhuri, "Approximation and convergence properties of generative adversarial learning," 2017, arXiv:1705.08991. [Online]. Available: http://arxiv.org/abs/1705.08991
[31] E. Kreyszig, Introductory Functional Analysis With Applications, vol. 1. New York, NY, USA: Wiley, 1978.
[32] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos, "MMD GAN: Towards deeper understanding of moment matching network," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 2203–2213.
[33] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. New York, NY, USA: Springer, 2012.
[34] W. Hoeffding, "Probability inequalities for sums of bounded random variables," J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
[35] M. Ledoux, "Concentration of measure and logarithmic Sobolev inequalities," in Séminaire de Probabilités XXXIII. Berlin, Germany: Springer, 1999, pp. 120–216.
[36] Y. G. Yatracos, "Rates of convergence of minimum distance estimators and Kolmogorov's entropy," Ann. Statist., vol. 13, no. 2, pp. 768–774, Jun. 1985.
[37] O. Bousquet, D. Kane, and S. Moran, "The optimal approximation factor in density estimation," 2019, arXiv:1902.05876. [Online]. Available: http://arxiv.org/abs/1902.05876
[38] M. Chen, C. Gao, and Z. Ren, "Robust covariance and scatter matrix estimation under Huber's contamination model," Ann. Statist., vol. 46, no. 5, pp. 1932–1960, Oct. 2018.
[39] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probab. Theory Rel. Fields, vol. 162, nos. 3–4, pp. 707–738, Aug. 2015.
[40] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626–6637.
[41] R. Vershynin, "How close is the sample covariance matrix to the actual covariance matrix?" J. Theor. Probab., vol. 25, no. 3, pp. 655–686, Sep. 2012.
[42] R. Vershynin, High-Dimensional Probability: An Introduction With Applications in Data Science, vol. 47. Cambridge, U.K.: Cambridge Univ. Press, 2018.
[43] L. Ambrosio and F. Glaudo, "Finer estimates on the 2-dimensional matching problem," 2018, arXiv:1810.07002. [Online]. Available: http://arxiv.org/abs/1810.07002
[44] P. L. Bartlett, S. Boucheron, and G. Lugosi, "Model selection and error estimation," Mach. Learn., vol. 48, nos. 1–3, pp. 85–113, 2002, doi: 10.1023/A:1013999503812.
[45] T. Rippl, A. Munk, and A. Sturm, "Limit laws of the empirical Wasserstein distance: Gaussian distributions," J. Multivariate Anal., vol. 151, pp. 90–109, Oct. 2016.
[46] A. Ghosh, V. Kulharia, V. Namboodiri, P. H. S. Torr, and P. K. Dokania, "Multi-agent diverse generative adversarial networks," 2017, arXiv:1704.02906. [Online]. Available: http://arxiv.org/abs/1704.02906
[47] H. Zhang, S. Xu, J. Jiao, P. Xie, R. Salakhutdinov, and E. P. Xing, "Stackelberg GAN: Towards provable minimax equilibrium via multi-generator architectures," 2018, arXiv:1811.08010. [Online]. Available: https://arxiv.org/abs/1811.08010
[48] K. J. Arrow, L. Hurwicz, and H. Uzawa, "Studies in linear and non-linear programming," Can. Math. Bull., vol. 3, no. 3. Cambridge, U.K.: Cambridge Univ. Press, 1960, pp. 196–198, doi: 10.1017/S0008439500025522.
[49] J. Li, A. Madry, J. Peebles, and L. Schmidt, "On the limitations of first-order approximation in GAN dynamics," 2017, arXiv:1706.09884. [Online]. Available: http://arxiv.org/abs/1706.09884
[50] E. Mazumdar, L. J. Ratliff, and S. Shankar Sastry, "On gradient-based learning in continuous games," 2018, arXiv:1804.05464. [Online]. Available: http://arxiv.org/abs/1804.05464
[51] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee, "On the convergence and robustness of training GANs with regularized optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7091–7101.
[52] S. Sastry, Nonlinear Systems: Analysis, Stability, and Control, vol. 10. New York, NY, USA: Springer-Verlag, 2013.
[53] V. N. Vapnik and A. Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," in Measures of Complexity. Cham, Switzerland: Springer, 2015, pp. 11–30.
[54] N. Gozlan and C. Léonard, "Transport inequalities. A survey," 2010, arXiv:1003.3852. [Online]. Available: http://arxiv.org/abs/1003.3852
[55] C. Villani, Optimal Transport, Old and New, vol. 338. Berlin, Germany: Springer-Verlag, 2008.
[56] A. R. Wade, "Explicit laws of large numbers for random nearest-neighbour-type graphs," Adv. Appl. Probab., vol. 39, no. 2, pp. 326–342, Jun. 2007.
[57] M. D. Penrose and J. Yukich, "Laws of large numbers and nearest neighbor distances," in Advances in Directional and Linear Statistics. Berlin, Germany: Physica-Verlag HD, 2011, pp. 189–199.
[58] E. Liitiäinen, "Asymptotic moments of near neighbor distances for the Gaussian distribution," Electron. J. Probab., vol. 16, pp. 2545–2573, Dec. 2011.
[59] S. Bobkov and M. Ledoux, One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances, vol. 261, no. 1259. Providence, RI, USA: AMS, 2019.
[60] Fedja, "Is the matrix square root uniformly continuous?" Mathematics Stack Exchange. Accessed: Sep. 25, 2016. [Online]. Available: https://math.stackexchange.com/q/1940274
[61] I. Olkin and F. Pukelsheim, "The distance between two random vectors with given dispersion matrices," Linear Algebra Appl., vol. 48, pp. 257–263, Dec. 1982.
[62] N. J. Higham, Functions of Matrices: Theory and Computation, vol. 104. Philadelphia, PA, USA: SIAM, 2008.
[63] T. W. Anderson and I. Olkin, "An extremal problem for positive definite matrices," Stanford Univ., Stanford, CA, USA, Tech. Rep. OLKNSF130, Jul. 1978.
[64] H. D. Tagare, "Notes on optimization on Stiefel manifolds," Yale Univ., New Haven, CT, USA, 2011.
[65] J. LaSalle, "Some extensions of Liapunov's second method," IRE Trans. Circuit Theory, vol. 7, no. 4, pp. 520–527, 1960.
[66] S. P. Bhat and D. S. Bernstein, "Nontangency-based Lyapunov tests for convergence and stability in systems having a continuum of equilibria," SIAM J. Control Optim., vol. 42, no. 5, pp. 1745–1775, Jan. 2003.

Banghua Zhu (Student Member, IEEE) received the B.Eng. degree in electronic engineering from Tsinghua University, Beijing, China, in 2018. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley. His research interests include machine learning, statistics, and information theory.

Jiantao Jiao (Member, IEEE) received the B.Eng. degree (Hons.) in electronic engineering from Tsinghua University, Beijing, China, in 2012, and the M.S. and Ph.D. degrees from Stanford University in 2014 and 2018, respectively. He is currently an Assistant Professor with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley. His research interests are in statistical machine learning, mathematical data science, optimization, applied probability, information theory, and their applications in science and engineering.

David Tse (Fellow, IEEE) received the B.A.Sc. degree in systems design engineering from the University of Waterloo in 1989, and the M.S. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology in 1991 and 1994, respectively.

From 1994 to 1995, he was a Post-Doctoral Member of Technical Staff with AT&T Bell Laboratories. From 1995 to 2014, he was on the faculty of the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley (U.C. Berkeley). He is currently a Professor with Stanford University. He is a coauthor, with Pramod Viswanath, of the text Fundamentals of Wireless Communication, which has been used in over 60 institutions around the world. He is the inventor of the proportional-fair scheduling algorithm used in all third- and fourth-generation cellular systems. His research interests are in information theory and its applications in various fields, including wireless communications, energy, computational biology, and blockchain.

Prof. Tse was elected a member of the U.S. National Academy of Engineering in 2018. He was the recipient of the IEEE Claude E. Shannon Award in 2017 and the IEEE Richard W. Hamming Medal in 2019. He received the 1967 NSERC Graduate Fellowship from the Government of Canada in 1989, the NSF CAREER Award in 1998, the Best Paper Awards at the Infocom Conference in 1998 and 2001, the Erlang Prize in 2000 from the INFORMS Applied Probability Society, the IEEE Communications Society and Information Theory Society Joint Paper Awards in 2000 and 2013, the Information Theory Society Paper Award in 2003, the Gilbreth Lectureship from the National Academy of Engineering in 2012, the Signal Processing Society Best Paper Award in 2012, the EURASIP Best Paper Award in 2012, and the IEEE Communications Society Stephen O. Rice Prize in 2013. For his contributions to education, he received the Outstanding Teaching Award from the Department of Electrical Engineering and Computer Sciences, U.C. Berkeley, in 2008 and the Frederick Emmons Terman Award from the American Society for Engineering Education in 2009. He was an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION THEORY from 2001 to 2003, the Technical Program Co-Chair in 2004, and the General Co-Chair of the International Symposium on Information Theory in 2015. He served on the Board of Governors of the IEEE Information Theory Society from 2003 to 2008 and from 2010 to 2013. He was a plenary speaker for many international conferences and workshops, including the IEEE International Symposium on Information Theory in 2009, the ACM International Conference on Mobile Computing and Networking (MobiCom) in 2007, and the IEEE International Conference on Acoustics, Speech and Signal Processing in 2006.