
The Autoencoding Variational Autoencoder

Taylan Cemgil, Sumedh Ghaisas, Krishnamurthy Dvijotham, Sven Gowal, Pushmeet Kohli
DeepMind
taylancemgil@google.com, sghaisas@google.com, dvij@google.com, sgowal@google.com, pushmeet@google.com

arXiv:2012.03715v1 [cs.LG] 7 Dec 2020
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Abstract
Does a Variational AutoEncoder (VAE) consistently encode typical samples generated from its decoder? This paper shows that the perhaps surprising answer to this question is 'No'; a (nominally trained) VAE does not necessarily amortize inference for typical samples that it is capable of generating. We study the implications of this behaviour on the learned representations, and also the consequences of fixing it by introducing a notion of self-consistency. Our approach hinges on an alternative construction of the variational approximation distribution to the true posterior of an extended VAE model with a Markov chain alternating between the encoder and the decoder. The method can be used to train a VAE model from scratch or, given an already trained VAE, it can be run as a post-processing step in an entirely self-supervised way without access to the original training data. Our experimental analysis reveals that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks. We provide experimental results on the colorMNIST and CelebA benchmark datasets that quantify the properties of the learned representations and compare the approach with a baseline that is specifically trained for the desired property.

https://github.com/deepmind/deepmind-research/tree/master/avae

1 Introduction
The Variational AutoEncoder (VAE) is a deep generative model [10, 15] where one can simultaneously
learn a decoder and an encoder from data. An attractive feature of the VAE is that while it estimates
an implicit density model for a given dataset via the decoder, it also provides an amortized inference
procedure for computing a latent representation via the encoder. While learning a generative model
for data, the decoder is the key object of interest. However, when the goal is extracting useful features
from data and learning a good representation, the encoder plays a more central role [20]. In this paper,
we will focus primarily on the encoder and its representation capabilities.
Learning good representations is one of the fundamental problems in machine learning to facilitate
data-efficient learning and to boost the ability to transfer to new tasks. The surprising effectiveness of
representation learning in various domains such as natural language processing [7] or computer vision
[11] has motivated several research directions, in particular learning representations with desirable
properties like adversarial robustness, disentanglement or compactness [1, 3–5, 12].
In this paper, our starting point is based on the assumption that if the learned decoder can provide
a good approximation to the true data distribution, the exact posterior distribution (implied by the
decoder) tends to possess many of the mentioned desired properties of a good representation, such as robustness.

Figure 1: Iteratively encoding and decoding a color MNIST image using a decoder and encoder fitted (a) by a VAE with no observation noise, (b) with AVAE. We find the drift in the generated images is an indicator of an inconsistency between the encoder and the decoder.

On a high level, we want to approximate properties of the exact posterior, in a way that supports representation learning.
One may naturally ask to what extent this goal is different from learning a standard VAE, where the
encoder is doing amortized posterior inference and is directly approximating the true posterior. As
we are using finite data for fitting within a restricted family of approximation distributions using local
optimization, there will be a gap between the exact posterior and the variational approximation. We
will illustrate that for finite data, even global minimization of the VAE objective is not sufficient to
enforce natural properties which we would like a representation to have.
We identify the source of the problem as an inconsistency between the decoder and encoder, attributed
to the lack of autoencoding, see also [2]. We argue that, from a probabilistic perspective, autoencoding
for a VAE should ideally mean that samples generated by the decoder can be consistently encoded.
More precisely, given any typical sample from the decoder model, the approximate conditional
posterior over the latents should be concentrated on the values that could be used to generate the
sample. In this paper, we show (through analysis and experiments) that this is not the case for a
VAE learned with normal training and we propose an additional specification to enforce the model to
autoencode, bringing us to the choice of the title of this paper.
Our Contributions: The key contributions of our work can be summarized as follows:
• We uncover that widely used VAE models are not autoencoding: samples generated by the
decoder of a VAE are not mapped to the corresponding representations by the encoder, and
an additional constraint on the approximating distribution is required.
• We derive a novel variational approach, Autoencoding VAE (AVAE), that is based on a new
lower bound of the true marginal likelihood, and also enables data augmentation and self
supervised density estimation in a theoretically principled way.
• We demonstrate that the learned representations achieve adversarial robustness. We show
robustness of the learned representations in downstream classification tasks on two bench-
mark datasets: colorMNIST and CelebA. Our results suggest that high performance can
be achieved without adversarial training.

2 The Variational Autoencoder


The VAE is a latent variable model that has the form
$$Z \sim p(Z) = \mathcal{N}(Z; 0, I), \qquad X \mid Z \sim p(X \mid Z, \theta) = \mathcal{N}(X; g(Z; \theta), vI) \tag{1}$$
where $\mathcal{N}(\cdot; \mu, \Sigma)$ denotes a Gaussian density with mean and covariance parameters $\mu$ and $\Sigma$, $v$ is a positive scalar variance parameter and $I$ is an identity matrix of suitable size. The mean function $g(Z; \theta)$ is typically parametrized by a deep neural network with parameters $\theta$ and the conditional distribution is known as the decoder. We use here a conditionally Gaussian observation model but other choices are possible, such as Bernoulli or Poisson, with mean parameters given by $g$.
To learn this model by maximum likelihood, one can use a variational approach to maximize the evidence lower bound (ELBO), defined as
$$\log p(X\mid\theta) \ge \langle \log p(X\mid Z,\theta)\rangle_{q(Z\mid X,\eta)} + \langle \log p(Z)\rangle_{q(Z\mid X,\eta)} + H[q(Z\mid X,\eta)] \equiv \mathcal{B}(\theta,\eta) \tag{2}$$
where $\langle h \rangle_q \equiv \int h(a)\, q(a)\, da$ denotes the expectation of the test function $h$ with respect to the distribution $q$, and $H$ is the entropy functional $H[q] = -\langle \log q\rangle_q$. Here, $q$ is an instrumental distribution, also known as the encoder, defined as
$$q(Z\mid X,\eta) = \mathcal{N}\!\left(Z;\, f^\mu(X;\eta),\, f^\Sigma(X;\eta)\right)$$
Here, the functions $f^\mu$ and $f^\Sigma$ are also chosen as deep neural networks with parameters $\eta$. Using the
reparametrization trick [10], it is possible to optimize θ and η jointly by maximization of the ELBO
using stochastic gradient descent aided by automatic differentiation. This mechanism is known as amortised inference: once an encoder-decoder pair is trained, the representation for a new data point can in principle be readily calculated without running a costly iterative inference procedure.
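As a concrete sketch, the following computes a single-sample Monte Carlo estimate of the bound (2) for the Gaussian model (1), dropping additive constants. The `encoder` and `decoder` callables are hypothetical stand-ins for $(f^\mu, f^\Sigma)$ (returned as mean and log-variance) and $g$; this is an illustration, not the exact training code.

```python
import torch

def elbo_estimate(x, encoder, decoder, v=0.1):
    """Single-sample Monte Carlo estimate of the ELBO in (2), up to constants."""
    mu, log_var = encoder(x)                    # parameters of q(Z|X)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps     # reparametrization trick
    recon = -((x - decoder(z)) ** 2).sum(-1) / (2 * v)  # <log p(X|Z)> term
    log_prior = -0.5 * (z ** 2).sum(-1)                 # <log p(Z)> term
    entropy = 0.5 * log_var.sum(-1)                     # H[q(Z|X)] term
    return (recon + log_prior + entropy).mean()
```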
The ELBO in (2) is defined for a single data point $X$. It can be shown that in a batch setting, maximization of the ELBO is equivalent to minimization of the Kullback-Leibler divergence
$$\mathrm{KL}(\mathcal{Q}\,\|\,\mathcal{P}) = \langle \log(\mathcal{Q}/\mathcal{P})\rangle_{\mathcal{Q}}, \qquad \mathcal{Q} = \pi(X)\, q(Z\mid X,\eta), \qquad \mathcal{P} = p(X\mid Z,\theta)\, p(Z) \tag{3}$$
with respect to $\theta$ and $\eta$. Here, $\pi$ is ideally the true data distribution, but in practice it is replaced by the dataset, i.e., the empirical data distribution $\hat\pi(X) = \frac{1}{N}\sum_i \delta(X - x^{(i)})$. This form also has an intuitive interpretation as the minimization of the divergence between two alternative factorizations of a joint distribution with the desired marginals. See [21] for alternative interpretations of the ELBO, including the one in (3). In the sequel, we will build our approach on this interpretation.

2.1 Extended VAE model for a pair of observations

In this section, we will justify the VAE from an alternative perspective of learning a variational
approximation in an extended model, essentially first arriving at an identical algorithm to the original
VAE. This alternative perspective enables us to justify additional specifications that a consistent
encoder/decoder pair should satisfy.
Imagine the following extended VAE model for a pair of observations $X, X'$:
$$(Z, Z') \sim p_\rho(Z, Z') = \mathcal{N}\!\left( \begin{pmatrix} Z \\ Z' \end{pmatrix}; \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} I & \rho I \\ \rho I & I \end{pmatrix} \right) \tag{4}$$
$$X \mid Z \sim p(X\mid Z,\theta) = \mathcal{N}(X; g(Z;\theta), vI), \qquad X' \mid Z' \sim p(X'\mid Z',\theta) = \mathcal{N}(X'; g(Z';\theta), vI) \tag{5}$$
where the prior distribution $p_\rho(Z', Z)$ is chosen as symmetric with $p(Z') = p(Z)$, that is, both marginals (here unit Gaussians) are equal in distribution. Here, $I$ is the identity matrix and $\rho$ is a hyperparameter, $|\rho| \le 1$, that we will refer to as the coupling strength. Note that we have $p_\rho(Z'\mid Z) = \mathcal{N}\!\left(Z'; \rho Z, (1-\rho^2) I\right)$. The two border cases $\rho = 1$ and $\rho = 0$ correspond to $Z' = Z$ and independence $p(Z', Z) = p(Z')p(Z)$, respectively. This model defines the joint distribution
$$\bar{\mathcal{P}} \equiv p(X'\mid Z';\theta)\, p(X\mid Z;\theta)\, p_\rho(Z', Z) \tag{6}$$
Proposition 2.1. Approximating $\bar{\mathcal{P}}$ with a distribution of the form
$$\bar{\mathcal{Q}} \equiv \hat\pi(X)\, q(Z\mid X,\eta)\, p_\rho(Z'\mid Z)\, p(X'\mid Z',\theta) \tag{7}$$
where $\hat\pi$ is the empirical data distribution and $q$ and $p$ are the encoder and the decoder models respectively, gives the original VAE objective in (3).
Proof: See Appendix A.1.
Proposition 2.1 shows that we could have derived the VAE from this extended model as well. This
derivation is of course completely redundant since we introduce and cancel out terms. However,
we will argue that this extended model is actually more relevant from a representation learning
perspective where a coupling ρ ≈ 1 is preferable as this is the desired behaviour of the encoder.
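For concreteness, a draw from the coupled prior (4) can be generated by first sampling $Z$ from the unit Gaussian and then sampling $Z'$ from $p_\rho(Z'\mid Z) = \mathcal{N}(Z'; \rho Z, (1-\rho^2) I)$; a minimal sketch (the function name and interface are ours):

```python
import numpy as np

def sample_coupled_prior(rho, nz, rng):
    """Draw (Z, Z') from the coupled prior (4) via p(Z) and p_rho(Z'|Z)."""
    z = rng.standard_normal(nz)                       # Z ~ N(0, I)
    z_prime = rho * z + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(nz)
    return z, z_prime                                 # Z'|Z ~ N(rho Z, (1-rho^2) I)
```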
Example 2.1. Conditional Generation: Imagine that we generate a latent z ∼ p(Z) and an observation x ∼ p(X|Z = z; θ) from the decoder of a VAE, but subsequently discard z. Clearly, x is a sample from the marginal p(X; θ). Now suppose we are asked to generate a new sample using the same z, as x′ ∼ p(X′|Z′ = z; θ). As we have discarded z, our best bet would be to use the extended model with ρ = 1 and sample from P̄(X′|X = x; θ, ρ = 1) instead.
Example 2.2. Representation Learning: Imagine, as in the previous example, that we generate a latent z and the corresponding observation x, but discard z. Suppose we are asked to solve a classification task based on a classifier p(y|z). As we have discarded z, we would ideally compute an expectation under the true posterior, $\int p(y\mid Z)\,\bar{\mathcal{P}}(Z\mid X = x;\eta)\, dZ$. If our goal is learning the representation, and the task is unknown at training time, we wish to retain as much information about x as possible. One strategy is to ask for a faithful reconstruction x′ given x, i.e., we would like to be able to sample from P̄(X′|X = x; θ, ρ = 1), so the goal is not very different from conditional generation.

In practice, $\bar{\mathcal{P}}(X'\mid X = x;\theta,\rho)$ is not available and we would be using the approximate transition kernel $\bar{\mathcal{Q}}(X'\mid X;\cdot) \equiv \int \bar{\mathcal{Q}}\, dZ\, dZ' / \hat\pi(X)$ obtained from the variational distribution. A natural question is how good the approximation $\bar{\mathcal{P}}(X'\mid X = x;\cdot) \approx \bar{\mathcal{Q}}(X'\mid X;\cdot)$ is for different $\rho$, and in particular for $\rho \approx 1$. Therefore, we first investigate some properties of this conditional distribution.
Proposition 2.2. The marginal $\bar{\mathcal{P}}(X', X;\theta,\rho)$ is symmetric in $X'$ and $X$, and its marginal does not depend on $\rho$, i.e., $\bar{\mathcal{P}}(X = x;\theta,\rho) = p(X;\theta)$ for any $|\rho| \le 1$.
Proof: See Appendix A.2.

The next proposition shows that if the encoder $q$ is equal to the exact posterior, then $\bar{\mathcal{Q}}(X'\mid X;\cdot)$ is also exact.
Proposition 2.3. If the encoder $q$ and decoder $p$ satisfy the consistency condition
$$q(Z\mid X,\eta)\, p(X;\theta) = p(X\mid Z;\theta)\, p(Z) \tag{8}$$
then, for all $|\rho| \le 1$, we have $\bar{\mathcal{Q}}(X'\mid X;\cdot)\, p(X;\theta) = \bar{\mathcal{P}}(X', X;\theta,\rho)$. We will say that $\bar{\mathcal{Q}}(X'\mid X;\cdot)$ is $p(X;\theta)$-invariant.
Proof: See Appendix A.3.

The subtle point of Proposition 2.3 is that the exact posterior is valid for any coupling strength parameter ρ. However, we have seen that the approximation computed by the VAE would be completely agnostic to our choice of ρ. Moreover, even if the original VAE objective is globally minimized to get KL(π̂(X)q(Z|X, η)‖p(X|Z, θ)p(Z)) = 0, this may not be sufficient to ensure that the transition kernel Q̄(X′|X; ·) is p(X; θ)-invariant. The empirical distribution π̂ has a discrete support and we have no control over the encoder outside this support. Intuitively, we need to introduce additional terms to the objective to steer the encoder if we would like to use the model as a conditional generator, or as a representation, especially in the regime ρ ≈ 1. For this purpose, we will need to make our approximating distribution match the transition P̄(Z′|Z; ·) = pρ(Z′|Z), which is by construction p(Z)-invariant.
In Figure 1, we compare what can happen when a VAE is learned nominally with an example of
expected behaviour. Here, we show a sequence of images generated by iteratively sampling from a
nominally learned encoder-decoder pair on colorMNIST. The drift in the generated images points to
a potential inconsistency between the decoder and the encoder. For further insight, we also discuss the special case of probabilistic PCA (Principal Component Analysis) in Appendix B, which gives an analytically tractable example.
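The drift diagnostic of Figure 1 can be reproduced by iterating the deterministic encode-decode map; a minimal sketch with the same hypothetical `encoder`/`decoder` interfaces as above:

```python
def encode_decode_chain(x0, encoder, decoder, steps=10):
    """Iterate x -> f_mu(x) -> g(z), as in Figure 1; with a consistent
    encoder/decoder pair the images should quickly settle instead of drifting."""
    xs = [x0]
    for _ in range(steps):
        mu, _ = encoder(xs[-1])   # encode with the mean mapping f_mu
        xs.append(decoder(mu))    # decode back to image space
    return xs
```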

3 Autoencoding Variational Autoencoder (AVAE)


Motivated by our analysis, we propose using the following extended model as a target distribution,
$$\mathcal{P}_\rho = p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, u(\tilde X) \tag{9}$$
where $\rho$ is considered a fixed hyper-parameter, not to be learned from data. Here $\tilde X$ is an additional auxiliary observation (a delusion) that we introduce. We want the target model to be agnostic to its value, hence we choose a flat distribution $u(\tilde X) = 1$. We choose as the approximating distribution
$$\mathcal{Q}_{\text{AVAE}} = q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)\, q(Z\mid X;\eta)\, \pi(X)$$
The central idea of AVAE is to make the encoder and the decoder consistent both on the training data and on the auxiliary observations generated by the decoder. We use the notation $p_\theta$ to highlight that the decoder is considered constant when used as a factor of $\mathcal{Q}_{\text{AVAE}}$; otherwise the original VAE bound would be invalid. Intuitively, the self-generated delusion $\tilde X$ should not change the true log likelihood, as otherwise we would be modifying the original objective. The following proposition justifies our choice:
Proposition 3.1. Assume that the consistency condition (8) holds. Then, the transition kernel defined as
$$\mathcal{Q}_{\text{AVAE}}(Z'\mid Z;\eta,\theta) \equiv \int q(Z'\mid \tilde X,\eta)\, p(\tilde X\mid Z;\theta)\, d\tilde X$$
is $p(Z)$-invariant. Moreover, assume that the latent space has a lower dimension than the observation space ($Z \in \mathbb{R}^{N_z}$ and $X \in \mathbb{R}^{N_x}$, with $N_z < N_x$), and the decoder mean mapping $g: \mathbb{R}^{N_z} \to \mathcal{X}_g$ is one-to-one, where $\mathcal{X}_g \subset \mathbb{R}^{N_x}$ is the image of $g$. Then, in the limit when the observation noise variance $v$ goes to zero ($v \to 0$), we have $\mathcal{Q}_{\text{AVAE}}(Z'\mid Z;\eta,\theta) = p(Z'\mid Z;\rho = 1)$.

Figure 2: Graphical models of the extended target distribution $\mathcal{P}_\rho$ and the variational approximation $\mathcal{Q}_{\text{AVAE}}$. Here $\tilde X$ is a sample generated by the decoder that is subsequently encoded by the encoder.
Proof: See Appendix A.4.
The proposition shows that our choice of the variational approximation is natural: it forces $\mathcal{Q}_{\text{AVAE}}(Z', Z)$ to be close to the marginal of the exact posterior $\bar{\mathcal{P}}(Z', Z) = p_\rho(Z'\mid Z)\, p(Z)$. The choice $\mathcal{Q}_{\text{AVAE}}$ is also convenient because it uses the original encoder and decoder as building blocks. In Appendix C.1, we derive the variational objective $\mathcal{B}_{\text{AVAE}} = -\mathrm{KL}(\mathcal{Q}_{\text{AVAE}}\,\|\,\mathcal{P}_\rho)$. The result is
$$\mathcal{B}_{\text{AVAE}} \stackrel{+}{=} -\mathrm{KL}(\pi(X)\, q(Z\mid X;\eta)\,\|\, p(X\mid Z;\theta)\, p(Z)) + \langle \log p_\rho(Z'\mid Z)\rangle_{\tilde q(Z;\eta)\,\tilde q_\theta(Z'\mid Z;\eta)} - \langle \log q(Z'\mid \tilde X;\eta)\rangle_{q(Z'\mid \tilde X;\eta)\,\tilde q_\theta(\tilde X;\eta)} \tag{10}$$
where $\tilde q(Z;\eta) \equiv \int q(Z\mid X;\eta)\, \pi(X)\, dX$, $\tilde q_\theta(\tilde X;\eta) \equiv \int p_\theta(\tilde X\mid Z)\, \tilde q(Z;\eta)\, dZ$ and $\tilde q_\theta(Z'\mid Z;\eta) \equiv \int q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)\, d\tilde X$. When $a$ and $b$ are proportional, we write $\log a \stackrel{+}{=} \log b$. In Algorithm 1 in the appendix, we provide pseudocode with a stop-gradient primitive to avoid using contributions of $\theta$-dependent terms of $\tilde q_\theta$ in the gradient computation.
The resulting objective (10) is intuitive. The first term is identical to the standard VAE ELBO. The second term is a smoothness term that measures the distance between z (the representation that is used to generate the delusion) and z′ (the encoding of the delusion). The third term is an extra entropy term on the auxiliary observations.
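For concreteness, here is a minimal single-sample sketch of the objective in (10)/(13), assuming the same hypothetical `encoder`/`decoder` interfaces as before; `detach()` plays the role of the stop-gradient primitive of Algorithm 1, and constants are dropped. It is an illustrative sketch, not the exact training code.

```python
import torch

def avae_objective(x, encoder, decoder, rho, v=0.1):
    """Single-sample sketch of B_AVAE in (10)/(13), up to additive constants."""
    # Standard ELBO terms on the training point x.
    mu, log_var = encoder(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    elbo = (-((x - decoder(z)) ** 2).sum(-1) / (2 * v)
            - 0.5 * (z ** 2).sum(-1)
            + 0.5 * log_var.sum(-1))
    # Self-generated delusion; detach() treats the decoder as a constant
    # factor of Q_AVAE, as required for the bound to remain valid.
    x_tilde = decoder(z).detach()
    mu_t, log_var_t = encoder(x_tilde)
    z_t = mu_t + torch.exp(0.5 * log_var_t) * torch.randn_like(mu_t)
    # Smoothness term <log p_rho(Z'|Z)>: log N(z'; rho z, (1 - rho^2) I).
    coupling = -((z_t - rho * z) ** 2).sum(-1) / (2 * (1 - rho ** 2))
    # Entropy term of q(Z'|x_tilde) on the auxiliary observation.
    entropy_t = 0.5 * log_var_t.sum(-1)
    return (elbo + coupling + entropy_t).mean()
```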

3.1 Illustration

As an illustration, we show results for a discrete VAE, where both the encoder and the decoder can be represented as parametrized probability tables. Our goal in choosing a discrete model is to visualize the behaviour of the algorithms in a way that is independent of any particular neural network architecture. The details of this model are explained in Appendix D.
Figure 3: Each panel shows (from north-east, in clockwise order) heatmaps of probabilities (darker is higher): i) the decoder p(X′|Z′; θ); ii) Q(X′|X; θ, η)p(X; θ); iii) the encoder q(Z|X; η); iv) Q(Z|Z′; θ, η)p(Z′). Left panel: VAE; right panel: AVAE. See text for definitions.

We can visually compare the learned transition models of the learned encoders and decoders for VAE and AVAE in Figure 3, where we show for each model (starting from north-east, in clockwise order) i) the estimated decoder p(X′|Z′; θ) (equal in distribution to p(X|Z; θ) due to parameter tying); ii) the joint distribution Q(X′|X; θ, η)p(X; θ), which should ideally be symmetric, where
$$Q(X'\mid X;\theta,\eta) = \sum_z p(X'\mid Z' = z;\theta)\, q(Z = z\mid X;\eta)$$
together with iii) the encoder q(Z|X; η); and iv) the joint distribution Q(Z|Z′; θ, η)p(Z′), which should also ideally be symmetric and additionally be close to identity, where
$$Q(Z\mid Z';\theta,\eta) = \sum_x q(Z\mid X = x;\eta)\, p(X' = x\mid Z';\theta).$$

We see in Figure 3 that for the VAE the joint distribution Q(Z|Z′; θ, η)p(Z′) is far from an identity mapping, and it is also not symmetric. In contrast, the distributions learned by AVAE are enforced to be symmetric, and the joint distribution is concentrated on the diagonal.
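The transition models shown in Figure 3 can be computed directly from the probability tables; a minimal sketch (the array layout conventions are ours):

```python
import numpy as np

def transition_kernels(p_x_given_z, q_z_given_x):
    """Compose tabular encoder/decoder into the two transition kernels.

    p_x_given_z: (Nx, Nz) column-stochastic decoder table p(X|Z)
    q_z_given_x: (Nz, Nx) column-stochastic encoder table q(Z|X)
    """
    Q_x = p_x_given_z @ q_z_given_x   # Q(X'|X) = sum_z p(X'|z) q(z|X)
    Q_z = q_z_given_x @ p_x_given_z   # Q(Z|Z') = sum_x q(Z|x) p(x|Z')
    return Q_x, Q_z

# For an AVAE-consistent pair, Q_z should be close to the identity matrix,
# e.g. np.abs(Q_z - np.eye(Q_z.shape[0])).max() should be small.
```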
Intermediate summary: Learning a VAE from data is a highly ill-posed problem and regularization is necessary. Here, instead of modifying the decoder model, we have proposed to constrain the encoder in such a way that it approaches the desired properties of the exact posterior. Our argument started with the extended model in Proposition 2.1 that admits the VAE as a marginal. In Proposition 2.2, we show that the exact decoder model is by construction independent of the choice of the coupling strength ρ. In plain language this means that, if we had access to the exact decoder and if we were able to do exact inference, any ρ would be acceptable. In this ideal case, the extended model would actually be redundant, and in Proposition 2.3, we highlight the properties of an exact encoder.
In practice, however, we will be learning the decoder from data while doing only approximate amortized inference. Naturally, we still want to retain the properties of the exact posterior. We argue that the extended model is relevant for representation learning (see also Example 2.2) and incorporate this explicitly into the encoder by AVAE; in Proposition 3.1 we provide the justification that our choice coincides with the exact target conditional for ρ = 1 (the representation learning case) when we encode and decode consistently.
We argue that encoder/decoder consistency is related to robustness. The existence of certain 'surprising' adversarial examples [17], where an input image is classified as a very different class after slightly changing input pixels, can be attributed to non-smoothness of the representation, such as having a large Lipschitz constant; see [5]. The smooth encoder (SE) method proposed in [5] attempted to fix this by data augmentation while training the encoder. In this paper, we introduce a more general framework (of which SE is a special case) and investigate methods that circumvent the need for computationally costly adversarial attacks during training. The data augmentation is achieved by using the learned generative model itself, as a factor of the approximating distribution Q_AVAE. The AVAE objective ensures that samples that can be generated by the decoder in the vicinity of the representations corresponding to the training inputs are consistently encoded. In the next experimental section, we will show that this consistency translates to a nontrivial adversarial robustness performance. While our approach does not provide formal guarantees, learning an encoder that retains properties of the exact posterior seems to be central in achieving adversarial robustness.

4 Experimental Results
In this section, we will experimentally explore consequences of training with the AVAE objective
in (10), implemented as Algorithm 1 in Appendix C. Optimizing the AVAE objective will change
the learned encoder and decoder, and we expect that the additional terms will enforce a smooth
representation leading to input perturbation robustness in downstream tasks. To test this claim, we
will evaluate the encoder in terms of adversarial robustness, using an approach that will be described
in the evaluation protocol. To see the effect of the new objective on the decoder and reconstruction
quality, we will report the Fréchet Inception Distance (FID) [9], as well as the test mean squared error (MSE).
Models: AVAE will be compared to two models: i) a VAE trained using the standard ELBO (2), and ii) Smooth Encoder (SE), a method recently proposed by [5] that uses the same target model as in (6) but a different variational approximation, computed using adversarial attacks to generate the auxiliary observations X̃. We provide a self-contained derivation of this algorithm in Appendix C.2. We also include two hybrid models in our simulations: iii) AVAE-SS (Self-Supervised), a post-training method where a VAE is first trained normally and then post-trained only on samples generated from the decoder (without training data; see Figure 4 for the graphical model and Appendix C.4 for details). Finally, we also provide results for iv) SE-AVAE, a model that combines the SE and AVAE objectives and is concurrently trained using both adversarial attacks and self-generated auxiliary observations (for details see Appendix C.5). In all models, we use a latent space dimension of 64.
Data and Tasks: Experiments are conducted on the colorMNIST and CelebA datasets, using both MLP and ConvNet architectures for the former and only a ConvNet for the latter (for details see Appendix F). The colorMNIST dataset has 2 separate downstream classification tasks (color, digit). For CelebA, we have 17 different binary classification tasks, such as 'has mustache?' or 'wearing hat?'.

Figure 4: Graphical models of the AVAE-SS target distribution $\mathcal{P}_{\rho,\text{AVAE-SS}}$ and the variational approximation $\mathcal{Q}_{\text{AVAE-SS}}$. Here both $X$ and $\tilde X$ are samples generated by the decoder. The decoder factors are fixed (as pretrained by a normal VAE).

Task              digit                      color
ε                 0.0    0.1    0.2         0.0    0.1    0.2       Time     MSE      FID
VAE               93.8   5.8    0.0         100.0  19.9   2.0       ×1       1369.2   12.44
SE_0.1            94.3   89.6   1.8         100.0  100.0  21.8      ×4       1372.5   13.01
SE_0.2            95.7   92.6   87.3        100.0  99.9   99.9      ×4       1374.9   11.72
AVAE              97.3   88.1   54.8        100.0  99.8   87.7      ×1.5     1371.9   15.46
SE_0.1-AVAE       97.4   93.6   24.5        100.0  100.0  60.0      ×4.7     1373.3   13.90
SE_0.2-AVAE       97.6   94.2   79.8        100.0  100.0  83.2      ×4.7     1374.3   13.89
AVAE-SS           94.1   72.8   20.8        100.0  99.6   56.8      ×1.5     1379.3   12.44

Table 1: Adversarial test accuracy (in percent) of the representations for the digit and color classification tasks on colorMNIST (ConvNet). The evaluation attack radius is ε, where pixels are normalized to [0, 1]. Time is the ratio of the wall-clock time of each method to the time taken by the VAE. We also show the performance of the decoder in terms of MSE and FID score. For AVAE and SE, the coupling strengths are ρ = 0.975 and ρ_SE = 0.95. The subscript of SE_ε′ is the radius ε′ used during adversarial training of SE.

Evaluation Protocol: The learned representations will be evaluated by their robustness against adversarial perturbations. For the evaluation of adversarial accuracy we employ a three-step process: i) first we train an encoder-decoder pair agnostic to any classification task, and ii) subsequently, we freeze the encoder parameters and use the mean mapping f^μ as a representation to train a linear classifier on top. Thus, each task-specific classifier shares the common representation learned by the encoder. This linear classifier is trained normally, without any adversarial attacks. Finally, iii) we evaluate the adversarial accuracy of the resulting classifiers. For this, we compute an adversarial perturbation δ such that ‖δ‖∞ ≤ ε using projected gradient descent (PGD). Here, ε is the attack radius and the optimization goal is to change the classification decision to a different class. The adversarial accuracy is reported as the percentage of examples where the attack is unable to find an adversarial example.
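A minimal sketch of step iii), assuming hypothetical `encoder` (returning the mean mapping f^μ first) and `classifier` callables; hyper-parameter values such as the step size are illustrative, not the exact settings used in the paper:

```python
import torch
import torch.nn.functional as F

def pgd_attack(x, y, encoder, classifier, eps, steps=40, alpha=0.01):
    """L-inf PGD against a linear classifier on the frozen encoder mean f_mu."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        mu, _ = encoder(x + delta)                 # representation f_mu(x + delta)
        loss = F.cross_entropy(classifier(mu), y)  # try to flip the decision
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()     # gradient ascent step
            delta.clamp_(-eps, eps)                # project onto the L-inf ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()
```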
Results: In Table 1, we show a comparison of the models on colorMNIST, where we evaluate adversarial accuracy for different attack radii ε (ε = 0.0 means no attack). Our key observation is that AVAE increases the nominal accuracy and achieves high adversarial robustness that extends well to strong attacks with a large radius. The surprising fact is that the method is trained completely agnostic to the particular attack type (e.g., PGD) or the attack radius ε, and it is able to achieve comparable performance to SE. Due to the computational burden of adversarial attacks, a single iteration of SE takes almost 2.7 times more computation time than a single AVAE iteration, making AVAE practical for large networks. We also observe that for small ε, both objectives can be combined to provide improved adversarial accuracy (digit, ε = 0.1); however, for a larger radius (ε = 0.2), the advantage seems to disappear. Another notable algorithm in Table 1 is AVAE-SS, where a pretrained VAE model can be further trained to significantly improve the robustness of the encoder. While this approach does not seem competitive in terms of the final adversarial accuracy, the fact that it can improve robustness entirely in a self-supervised way (Section C.4) is an attractive property.
In Figure 5, we show the summary of several experiments with different architectures, models and
various choices of the coupling strength hyper-parameter ρ, on the colorMNIST dataset. The first
column shows the effect of varying ρ, and as expected from our analysis, the adversarial accuracy
is high for ρ close to one. The middle column compares different architectures in terms of training
dynamics. For MLP, we see that adversarial accuracy slowly increases, while for Convnet, the
improvement is very rapid. We also observe that AVAE-SS can significantly improve the robustness
of a pretrained VAE, but not to the level of the other methods (results for further ρ values are in Appendix G).
The third column highlights the behaviour of the algorithms with increasing attack radius. We see

Figure 5: Adversarial accuracy on the 'digit' classification task of colorMNIST. (Top row) MLP architecture; (bottom row) ConvNet architecture. (Left column) Adversarial accuracy as a function of the coupling strength ρ for several attack radii ε. (Middle column) Comparison of algorithms: test adversarial accuracy over training iterations for attack radius ε = 0.1, with ρ = 0.975 and ρ_SE = 0.95. (Right column) Adversarial accuracy as a function of the evaluation attack radius; the SE model is trained with ε′ = 0.1 (SE_0.1).

                AVAE           SE^5           SE^20 [5]      SE^5-AVAE      AVAE-SS
Task / ε        0.0    0.1     0.0    0.1     0.0    0.1     0.0    0.1     0.0    0.1
Bald            97.9   85.2    97.9   72.0    97.4   86.5    97.9   87.0    97.8   70.0
Mustache        96.1   91.5    95.0   69.5    95.7   84.4    96.0   92.3    94.9   74.3
Necklace        86.1   78.4    87.8   56.7    88.0   78.9    86.1   80.3    88.0   59.7
Eyeglasses      95.4   68.9    95.9   20.3    95.7   33.0    95.4   67.5    94.4   57.1
Smiling         77.7   3.6     87.0   3.1     85.7   1.1     77.9   6.3     81.4   0.9
Lipstick        81.0   7.3     83.9   2.0     80.3   0.6     80.2   11.5    80.7   0.9
Time            ×2.2           ×3.1           ×7.8           ×4.3           ×2.2
MSE             7276.6         7208.8         N/A            7269.2         7347.3
FID             97.92          98.00          N/A            109.4          99.8

Table 2: Adversarial test accuracy (in percent) of the representations for a subset of classification tasks on CelebA. For the SE methods, the superscript L in SE^L denotes the number of PGD iterations used during training of the model.

that for SE, the robustness does not extend beyond the radius that the model was trained for, whereas the accuracy of the AVAE-trained model degrades gracefully.
In Table 2, we show that the robustness of a representation learned with AVAE extends to a complex dataset such as CelebA. The VAE results are omitted as they achieve around 0.0 adversarial accuracy; they can be found in Table 3 in Appendix H, along with complete results on all tasks. We see that AVAE performs robustly in downstream tasks, surpasses the robustness of SE^5, and even challenges SE^20 (reported by [5]), which requires substantially more computation. We observe that AVAE-SS also achieves non-trivial adversarial accuracy across tasks, in some cases even beating SE^5. Finally, we also report results for the SE-AVAE model, showing non-trivial improvements over both the SE and AVAE models in most of the downstream tasks. Yet, the increased robustness seems to come with a cost: in all the experiments, we observe an increase in FID (lower is better) and MSE for AVAE models, indicating a tendency of reduced decoder quality and test set reconstruction. We conjecture that this tradeoff may be the result of the encoder being much more constrained than in the VAE case.

5 Conclusions

We show that VAE models derived from the canonical formulation in (1) are not unique. We used an alternative model (5) to show that the standard model is unable to capture some desired properties of the exact posterior that are important for representation learning. We proposed a principled alternative, the AVAE, and observed that it gives rise to robust representations without requiring adversarial training. In addition, we presented a self-supervised variant (AVAE-SS) that also exhibits robustness. The paper justifies, theoretically and experimentally, the modelling choices of AVAE. Using two benchmark datasets, we demonstrate that the approach can achieve surprisingly high adversarial accuracy without adversarial training. An important feature of the AVAE is that the likelihood function is still equal to the standard VAE likelihood. In this way, we also provide a principled justification for data augmentation in a density estimation context, where naive data augmentation is not valid as it alters the data distribution. In our framework, the generated data points act like nuisance parameters of the approximating distribution and do not have an effect on the estimated decoder.
Although we are not modifying the original likelihood function, we currently observe a tradeoff: while the learnt encoder becomes more robust, the corresponding decoder seems to slightly suffer in quality, as measured by FID and MSE. It remains to be seen if there is actually a fundamental reason behind this, and we believe that with more carefully designed training methods, both the decoder and the encoder could in principle be improved.
The autoencoding specification is loosely related to the concept of cycle consistency, which is often enforced between two data modalities [22]. In [8], the authors propose a supervised VAE model for pairs of data points with the same labels to enforce disentanglement by using extra relational information (similar relational models include [18, 13]). In contrast, our approach only changes the encoder. Our method is a VAE-based representation learning approach, and there is a large literature on the subject (see, e.g., [20]); however, models are often not evaluated on their robustness properties. Another fruitful approach in representation learning, especially for the image modality, is contrastive learning [14, 6]. These approaches can challenge and surpass supervised approaches in terms of label efficiency and accuracy, and it remains future work to investigate the links with our work and whether or not these models can also be used for learning robust representations.

6 Societal Impact

General Research Direction. In recent years, researchers have trained deep generative models that can generate synthetic examples, often indistinguishable from natural data. The high quality of these samples suggests that these models may be able to learn latent representations useful for other downstream tasks. Learning such representations without task-specific supervision facilitates transfer to yet unseen, future tasks. It also fosters label efficiency and interpretability. Unsupervised representation learning would have a high societal impact as it could enable learning representations from data that can be shared with a wider community of researchers who do not have the computational resources for training such representations, or do not have direct access to the training data due to privacy, security or commercial considerations. However, the properties of these representations in terms of test accuracy, robustness and privacy preservation must be carefully studied before their release, especially if systems will be deployed in the real world. In the current work, we have taken a step towards learning representations in an unsupervised way that exhibit robustness against transformations of the input.
Ethical Considerations. The current work studies representations learned by a specific generative model, the VAE, and shares the finding that training a VAE while enforcing an additional, natural autoencoding specification is able to provide significant robustness of the learned representation without adversarial training. The study does not propose a particular system for a specific application. The human face dataset CelebA is chosen as a standard benchmark dataset with several attributes to illustrate the viability of the approach. Still, we have decided to exclude potentially sensitive and subjective attributes from the original dataset, such as 'big-nose' or 'Asian', and use only 17 neutral attributes that we have selected using our own judgment.

Acknowledgments and Disclosure of Funding
We would like to thank Andriy Mnih for the fruitful discussions and excellent feedback.

References
[1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep
representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.

[2] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-
generating distribution. J. Mach. Learn. Res., 15(1):3563–3593, January 2014.

[3] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In
Proceedings of ICML workshop on unsupervised and transfer learning, pages 17–36, 2012.

[4] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume
Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv e-prints,
page arXiv:1804.03599, Apr 2018.

[5] A. T. Cemgil, S. Ghaisas, K. Dvijotham, and P. Kohli. Adversarially robust representations with smooth encoders. In Proceedings of the Eighth International Conference on Learning Representations, ICLR 2020, 2020.

[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. In International Conference on Machine
Learning, 2020.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

[8] Ananya Harsh Jha, Saket Anand, Maneesh Singh, and VSR Veeravasarapu. Disentangling
factors of variation with cycle-consistent variational auto-encoders. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 805–820, 2018.

[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings
of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page
6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc.

[10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Yoshua Bengio and
Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014,
Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran
Associates, Inc., 2012.

[12] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard
Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning
of disentangled representations. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceed-
ings of Machine Learning Research, pages 4114–4124, Long Beach, California, USA, 09–15
Jun 2019. PMLR.

[13] Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The Functional Neural
Process. arXiv e-prints, page arXiv:1906.08324, Jun 2019.

[14] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive
predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[15] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In Eric P. Xing and Tony Jebara,
editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of
Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, 22–24 Jun 2014.
PMLR.
[16] Sam T. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, pages 626–632, 1998.
[17] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel-
low, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199,
2013.
[18] Da Tang, Dawen Liang, Tony Jebara, and Nicholas Ruozzi. Correlated Variational Auto-
Encoders. arXiv e-prints, page arXiv:1905.05335, May 2019.
[19] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622,
1999.
[20] Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based
representation learning. In Third workshop on Bayesian Deep Learning (NeurIPS 2018), 2018.
[21] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference
in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 33, pages 5885–5892, 2019.
[22] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international
conference on computer vision, pages 2223–2232, 2017.

A Proofs
In this section, we provide the proofs of the propositions stated in the main text.

A.1 Proposition 2.1

Proof. By definition,
$$\mathrm{KL}(\bar{\mathcal{Q}}\,\|\,\bar{\mathcal{P}}) = \left\langle -\log \frac{p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, p(X'\mid Z';\theta)}{\hat\pi(X)\, q(Z\mid X,\eta)\, p_\rho(Z'\mid Z)\, p(X'\mid Z',\theta)} \right\rangle_{\bar{\mathcal{Q}}} = \mathrm{KL}(\mathcal{Q}\,\|\,\mathcal{P})$$
The equality follows since for any test function of the form $f(X, Z)$ that does not depend on $X'$ and $Z'$, we have $\langle f(X,Z)\rangle_{\bar{\mathcal{Q}}} = \langle f(X,Z)\rangle_{\mathcal{Q}}$.

A.2 Proposition 2.2

Proof. $\bar{\mathcal{P}}$ is symmetric (exchangeable) in $(X', Z')$ and $(X, Z)$. Hence, the pairwise marginal
$$\bar{\mathcal{P}}(X', X;\theta,\rho) = \int \bar{\mathcal{P}}(X, X', Z, Z';\theta,\rho)\, dZ\, dZ' \tag{11}$$
is also symmetric by the parametrization, and $X'$ and $X$ are exchangeable with identical marginal densities. We have $\bar{\mathcal{P}}(X;\theta,\rho) = p(X;\theta)$ as the marginals of $p_\rho(Z', Z)$ do not depend on $\rho$. Hence we have $\bar{\mathcal{P}}(X'\mid X = x;\theta,\rho) = \bar{\mathcal{P}}(X', X;\theta,\rho)/p(X;\theta)$.

A.3 Proposition 2.3

Proof. Consider the joint distribution
$$\bar{\mathcal{Q}}(X'\mid X;\eta,\theta,\rho)\, p(X;\theta) = \int \left( q(Z\mid X,\eta)\, p(X;\theta) \right) p_\rho(Z'\mid Z)\, p(X'\mid Z',\theta)\, dZ\, dZ'$$
$$= \int p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, p(X'\mid Z',\theta)\, dZ\, dZ' = \bar{\mathcal{P}}(X, X';\theta,\rho)$$
The invariance of $p(X;\theta)$ follows from symmetry, as $\int \bar{\mathcal{Q}}(X'\mid X;\cdot)\, p(X;\theta)\, dX = p(X';\theta)$.

A.4 Proposition 3.1

Proof. Consider
$$\mathcal{Q}_{\text{AVAE}}(Z'\mid Z;\eta,\theta)\, p(Z) \equiv \int q(Z'\mid \tilde X,\eta)\, p(\tilde X\mid Z;\theta)\, p(Z)\, d\tilde X = \int q(Z'\mid \tilde X,\eta)\, q(Z\mid \tilde X,\eta)\, p(\tilde X;\theta)\, d\tilde X$$
As the expression on the right is symmetric in $Z$ and $Z'$, $p(Z)$-invariance follows. When the observation noise $v \to 0$, the probability of observing $\tilde x$ outside the image of $g$ vanishes, i.e., $\Pr\{\tilde x \notin \mathcal{X}_g\} = 0$. When $g$ is continuous and differentiable, the image of $g$ is a manifold $\mathcal{X}_g$. As $g$ is one-to-one, it is invertible on $\mathcal{X}_g$: $\tilde x = g(z;\theta) \Leftrightarrow g^{-1}(\tilde x) = z$, and the optimal encoder will have the mean mapping $f^\mu(\tilde x;\eta^*) = g^{-1}(\tilde x)$ and variance mapping $f^\Sigma(\tilde x;\eta^*) = 0$. As $z' = g^{-1}(\tilde x)$ and $\tilde x = g(z)$, we have $z' = z$, and hence $\mathcal{Q}_{\text{AVAE}}(Z'\mid Z;\eta,\theta) = p(Z'\mid Z;\rho = 1)$.

B Example: PCA case


As a special case, it is informative to consider probabilistic principal component analysis (pPCA) [16, 19], a special case of the VAE where $g$ and $f^\mu$ are constrained to be linear functions. When $g(Z; \theta = W) = Wz$, using standard results about Gaussian distributions, the optimal encoder is given by the exact posterior and is available in closed form as
$$q_*(Z \mid X = x) = \mathcal{N}\!\left(Z;\, f_*^\mu(x),\, f_*^\Sigma\right)$$
where $f_*^\mu(x) = (W^\top W + vI)^{-1} W^\top x$ and $f_*^\Sigma = v\,(W^\top W + vI)^{-1}$. The transition kernel can be shown to have the form
$$Q(X' \mid X = x) = \mathcal{N}\!\left(X';\, P_{W,v}\, x,\; v\,(P_{W,v} + I)\right)$$
where $P_{W,v} = W (W^\top W + vI)^{-1} W^\top$.
In the limit when $v$ is zero, we recover the PCA case, and the mean mapping of the transition kernel $g \circ f_*^\mu$ corresponds to a projection matrix $P_{W,0} = P_W = W(W^\top W)^{-1} W^\top$, as $P_W = P_W^2$. Hence all noise vanishes, and iterating the encode-decode steps generates the sequence $P_W x_0 = x_1 = x_2 = \dots$; any initial input is first projected into the range space of the projector, and subsequent points are confined to this invariant subspace.
Note that the encoder and the decoder are also consistent, as any possible sample $x$ that can be generated by the decoder is mapped to a representation in the vicinity of the original representation. To see this, consider the transition kernel, which can be shown to be
$$Q(Z' \mid Z = z) = \mathcal{N}\!\left(Z';\, J_{W,v}\, z,\, S\right)$$
where $J_{W,v} = (W^\top W + vI)^{-1} W^\top W$ and $S = v\,(J_{W,v} + I)(W^\top W + vI)^{-1}$. In the limit when $v$ goes to zero, we have $J_{W,v} \approx I$, or equivalently $z' = z$.
However, if an 'inconsistent' decoder-encoder pair were used, say an encoder with a perturbed mean transform $(W^\top W + vI)^{-1} W^\top + \Delta$ for some nonzero matrix $\Delta$, the resulting transition matrix of $Q(X'\mid X = x)$ will not in general be a projection, and the chains will 'drift away' from the original invariant subspace, depending on the norm of the perturbation and the spectrum of the resulting matrix $W\Delta$.
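The projection and drift arguments above are easy to verify numerically; a small sketch (dimensions and perturbation scale chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))                  # linear decoder, Nx=10, Nz=3
P = W @ np.linalg.inv(W.T @ W) @ W.T          # mean map of Q(X'|X) at v = 0

x0 = rng.normal(size=10)
x1 = P @ x0                                   # one encode-decode step
print(np.allclose(P @ x1, x1))                # True: P is idempotent, no drift

# An inconsistent encoder (perturbed mean transform) is no longer a
# projection, so the iterated chain leaves the invariant subspace.
Delta = 0.1 * rng.normal(size=(3, 10))
M = W @ (np.linalg.inv(W.T @ W) @ W.T + Delta)
xs = x1
for _ in range(20):
    xs = M @ xs
print(np.linalg.norm(xs - x1))                # nonzero: the chain has drifted
```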
In the PCA case, the invariant subspace is explicitly known thanks to linearity. In contrast, for a VAE with flexible nonlinear decoder and encoder functions, the analogous object to an invariant subspace would be a global invariant manifold (sometimes referred to as the data manifold), embedded in $\mathcal{X}$, with each manifold point charted by $x = g(z;\theta)$, where $z$ is the coordinate of $x$ in $\mathcal{Z}$, given by $f^\mu(x)$. While it seems hard to characterize the invariant manifold, we can argue that 'autoencoding' requires that realizations generated by the decoder are approximately invariant when encoded again; i.e., autoencoding means that the coordinate of the point $g(z)$ is $z$, or equivalently $z \approx f^\mu(g(z))$. However, this requirement is enforced by the VAE only for the samples in the dataset, but not for typical samples $z$ from the prior $p(Z)$. So the name autoencoder is perhaps a misnomer, as the resulting model is not necessarily autoencoding the samples it can generate.

C Variational Approximations
In this appendix, we provide the derivation of the variational objective for the AVAE and then an
alternative derivation for the smooth encoder [5], and other hybrid models that we used in our
experiments as contender models.

C.1 Autoencoding Variational Autoencoder QAVAE

The AVAE objective is
$$(\theta^*, \eta^*) = \arg\max_{\theta,\eta} \mathcal{B}_{\text{AVAE}}(\theta,\eta)$$
where
$$\mathcal{B}_{\text{AVAE}}(\theta,\eta) = -\mathrm{KL}(\mathcal{Q}_{\text{AVAE}}\,\|\,\mathcal{P}_\rho) = \langle \log(\mathcal{P}_\rho/\mathcal{Q}_{\text{AVAE}})\rangle_{\mathcal{Q}_{\text{AVAE}}}$$
The target distribution is
$$\mathcal{P}_\rho = p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, u(\tilde X) \tag{12}$$
where $\rho$ is considered a fixed hyper-parameter and $u(\tilde x) = 1$. The approximating distribution is
$$\mathcal{Q}_{\text{AVAE}} = \pi(X)\, q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)$$
so that
$$\mathcal{B}_{\text{AVAE}}(\theta,\eta) = \left\langle \log \frac{p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, u(\tilde X)}{\pi(X)\, q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)} \right\rangle_{\mathcal{Q}_{\text{AVAE}}}$$
$$= -\mathrm{KL}(\pi(X)\, q(Z\mid X;\eta)\,\|\, p(X\mid Z;\theta)\, p(Z)) + \left\langle \log \frac{p_\rho(Z'\mid Z)}{q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)} \right\rangle_{\tilde q(Z;\eta)\, q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)}$$
We have used $p_\theta$ in the subscript to denote the fact that the decoder parameters are assumed to be fixed. Cancelling the constant terms, we obtain
$$\mathcal{B}_{\text{AVAE}} \stackrel{+}{=} \langle \log p(X\mid Z;\theta)\rangle_{\pi(X) q(Z\mid X;\eta)} + \langle \log p(Z)\rangle_{\pi(X) q(Z\mid X;\eta)} - \langle \log q(Z\mid X;\eta)\rangle_{\pi(X) q(Z\mid X;\eta)}$$
$$+ \langle \log p(Z'\mid Z;\rho)\rangle_{\tilde q(Z;\eta)\, \tilde q_\theta(Z'\mid Z;\eta)} - \langle \log q(Z'\mid \tilde X;\eta)\rangle_{q(Z'\mid \tilde X;\eta)\, \tilde q_\theta(\tilde X;\eta)}$$
where $\tilde q_\theta(Z'\mid Z;\eta) \equiv \int q(Z'\mid \tilde X;\eta)\, p_\theta(\tilde X\mid Z)\, d\tilde X$, $\tilde q_\theta(\tilde X;\eta) \equiv \int p_\theta(\tilde X\mid Z)\, \tilde q(Z;\eta)\, dZ$ and $\tilde q(Z;\eta) \equiv \int q(Z\mid X;\eta)\, \pi(X)\, dX$.
The algorithm is shown in Algorithm 1. We approximate the required expectations by their Monte Carlo estimates, and the objective function to maximize becomes
$$\hat{\mathcal{B}}_{\text{AVAE}}(\theta,\eta;\, x, \tilde x, z) = \log p(X = x\mid Z = z;\theta) + \langle \log p(Z)\rangle_{q(Z\mid X=x;\eta)} + \langle \log p(Z'\mid Z;\rho)\rangle_{\tilde q(Z\mid X=x;\eta)\, \tilde q(Z'\mid \tilde X=\tilde x;\eta)}$$
$$- \langle \log q(Z'\mid \tilde X = \tilde x;\eta)\rangle_{q(Z'\mid \tilde X=\tilde x;\eta)} - \langle \log q(Z\mid X = x;\eta)\rangle_{q(Z\mid X=x;\eta)} \tag{13}$$

Algorithm 1: AVAE Training
  function TrainAVAE(x, iterations)
    encoder ← (f^μ(·; η), f^Σ(·; η))
    decoder ← g(·; θ)
    for i ← 1 to iterations do
      μ_x, Σ_x ← encoder(x)
      z ← μ_x + Σ_x^{1/2} N(0, 1)
      x̃ ← stopgradient(decoder(z))
      μ_x̃, Σ_x̃ ← encoder(x̃)
      Maximize_{θ,η} B̂_AVAE(θ, η; x, x̃, z)    ▷ see (13)
    end for
  end function

Algorithm 2: SE Training
  function TrainSE(x, iterations)
    encoder ← (f^μ(·; η), f^Σ(·; η))
    decoder ← g(·; θ)
    for i ← 1 to iterations do
      x̃ ← stopgradient(PGD(x))
      μ_x, Σ_x ← encoder(x)
      z ← μ_x + Σ_x^{1/2} N(0, 1)
      Maximize_{θ,η} B̂_SE(θ, η; x, x̃, z)    ▷ see (32)
    end for
  end function

Algorithm 3: SE-AVAE Training
  function TrainSE-AVAE(x, iterations)
    encoder ← (f^μ(·; η), f^Σ(·; η))
    decoder ← g(·; θ)
    for i ← 1 to iterations do
      x̃̃ ← stopgradient(PGD(x))
      μ_x, Σ_x ← encoder(x)
      z ← μ_x + Σ_x^{1/2} N(0, 1)
      x̃ ← stopgradient(decoder(z))
      Maximize_{θ,η} B̂_SE-AVAE(θ, η; x, x̃, x̃̃, z)    ▷ see (35)
    end for
  end function

Algorithm 4: AVAE-SS Training
  function TrainAVAE-SS(x, iterations)
    encoder ← (f^μ(·; η), f^Σ(·; η))
    decoder ← g(·; θ)
    for i ← 1 to iterations do
      z′′ ← N(0, 1)
      x ← stopgradient(decoder(z′′))
      μ_x, Σ_x ← encoder(x)
      z ← μ_x + Σ_x^{1/2} N(0, 1)
      x̃ ← stopgradient(decoder(z))
      Maximize_{θ,η} B̂_AVAE-SS(θ, η; x, x̃, z)    ▷ see (34)
    end for
  end function
C.2 Smooth Encoder

The smooth encoder (SE) model proposed in [5] employs an alternative variational approximation strategy to learn a representation. While SE introduced an 'external selection mechanism' to generate adversarial examples, the analysis in this appendix shows that the approach can be viewed as a robust Bayesian approach to variational inference, choosing a different variational distribution than AVAE.
SE aims at learning a model that is insensitive to a class of input transformations, in particular small input perturbations $T_\alpha(x) = x + \alpha$ where $\alpha \in \mathcal{A} = \{\alpha : \|\alpha\| \le \epsilon\}$. The variational distribution has the form
$$\mathcal{Q}_{\text{SE}} = \pi(X)\, q(Z\mid X;\eta)\, q_T(\tilde X\mid X; u_\mathcal{A})\, q(Z'\mid \tilde X;\eta) \tag{14}$$
where $q_T(\tilde X\mid X; u_\mathcal{A})$ is the conditional distribution defined by
$$\alpha \sim u_\mathcal{A}(\alpha), \qquad \tilde X = T_\alpha(X) \tag{15}$$
Here, $u_\mathcal{A}$ is an arbitrary distribution with $u_\mathcal{A} \in \mathcal{U}_\mathcal{A}$, where $\mathcal{U}_\mathcal{A}$ is the set of distributions defined on $\mathcal{A}$. The notation suggests that $u_\mathcal{A}$ is now taken as a parameter. The bound is
$$\tilde{\mathcal{B}}_{\text{SE}}(\eta,\theta; u_\mathcal{A}) = -\mathrm{KL}(\mathcal{Q}_{\text{SE}}(\eta, u_\mathcal{A})\,\|\,\mathcal{P}_\rho(\theta)) \tag{16}$$

We can employ a robust Bayesian approach to define a 'pessimistic' bound, in the sense of selecting the worst prior distribution $u_\mathcal{A}$:
$$\mathcal{B}_{\text{SE}}(\eta,\theta) = \min_{u_\mathcal{A} \in \mathcal{U}_\mathcal{A}} \tilde{\mathcal{B}}_{\text{SE}}(\eta,\theta; u_\mathcal{A}) \tag{17}$$
This optimization is still computationally feasible, as the minimum is attained by a degenerate distribution concentrated on an adversarial example $x_a = x + \alpha$, i.e., $u_\mathcal{A}(\alpha) = \delta(\alpha - (x_a - x))$, which can be computed using projected gradient descent to find $x_a$. Once $u_\mathcal{A}$ is fixed, we can optimize the model parameters in the outer maximization.
The resulting algorithm has two steps: i) (Augmentation) generate a new empirical data distribution $\hat\pi_a(\tilde X\mid X)$ adversarially, by finding the worst-case transformation for each data point in the sense of maximizing the change between representations; and ii) (Maximization) maximize the bound denoted by $\mathcal{B}_{\text{SE}}$, which has the following form:
$$\mathcal{B}_{\text{SE}} \stackrel{+}{=} \langle \log p(X\mid Z;\theta)\rangle_{\pi(X) q(Z\mid X;\eta)} - \mathrm{KL}(\tilde\pi_a(X, \tilde X)\, q(Z, Z'\mid X, \tilde X;\eta)\,\|\, p(Z, Z';\rho)) \tag{18}$$
where $q(Z, Z'\mid X, \tilde X;\eta) \equiv q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)$ and $\tilde\pi_a(X, \tilde X) \equiv \pi(X)\, \hat\pi_a(\tilde X\mid X)$.
This objective measures the data fidelity (first term) and the divergence of the joint encoder mapping from the pairwise coupling target (second term). The second term forces the encoder mapping to be smooth when $\rho \approx 1$.
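As an illustration of the augmentation step, here is a PGD sketch that maximizes a simple surrogate of the discrepancy in (25): the squared distance between the encoder means of x and x + δ. The full SE objective also includes the coupling and entropy terms; all names and hyper-parameter values here are ours.

```python
import torch

def se_attack(x, encoder, eps, steps=20, alpha=0.01):
    """PGD sketch of the augmentation step: perturb x to maximize a surrogate
    discrepancy between q(Z|x) and q(Z'|x + delta) (squared mean distance)."""
    mu, _ = encoder(x)
    mu = mu.detach()                  # fixed reference representation of x
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        mu_t, _ = encoder(x + delta)
        loss = ((mu_t - mu) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascent on the discrepancy
            delta.clamp_(-eps, eps)              # project onto the L-inf ball
        delta.grad.zero_()
    return (x + delta).detach()
```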

C.3 Derivation Details

• The SE objective:
$$(\theta^*, \eta^*) = \arg\max_{\theta,\eta}\; \min_{u_\mathcal{A} \in \mathcal{U}_\mathcal{A}} \tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta; u_\mathcal{A}), \qquad \tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta; u_\mathcal{A}) = -\mathrm{KL}(\mathcal{Q}_{\text{SE}}(\eta; u_\mathcal{A})\,\|\,\mathcal{P}_\rho(\theta)) = \langle \log(\mathcal{P}_\rho/\mathcal{Q}_{\text{SE}})\rangle_{\mathcal{Q}_{\text{SE}}}$$

• Target:
$$\mathcal{P}_\rho = p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\, u(\tilde X) \tag{19}$$
with fixed hyper-parameter $\rho \in [0, 1)$ and $u(\tilde x) = 1$.

• The variational distribution:
$$\mathcal{Q}_{\text{SE}}(\eta; u_\mathcal{A}) = \pi(X)\, q(Z\mid X;\eta)\, q_T(\tilde X\mid X; u_\mathcal{A})\, q(Z'\mid \tilde X;\eta) \tag{20}$$
  – $\pi(X)$: the data distribution, to be replaced by the empirical data distribution
  – $q(Z\mid X;\eta)$, $q(Z'\mid \tilde X;\eta)$: encoders tied with the same parameters $\eta$
  – $q_T(\tilde X\mid X; u_\mathcal{A})$: the distribution induced by random translations
$$\alpha \sim u_\mathcal{A}(\alpha), \qquad \tilde X = X + \alpha \tag{21}$$
    Here, $u_\mathcal{A}$ is an arbitrary distribution on $\mathcal{A} = \{a : \|a\|_\infty \le \epsilon\}$ with $u_\mathcal{A} \in \mathcal{U}_\mathcal{A}$.

The bound expands as
$$\tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta; u_\mathcal{A}) \stackrel{+}{=} \langle \log p(X\mid Z;\theta)\, p(Z)\, p_\rho(Z'\mid Z)\rangle_{\mathcal{Q}_{\text{SE}}} - \langle \log q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)\rangle_{\mathcal{Q}_{\text{SE}}} - \langle \log q_T(\tilde X\mid X; u_\mathcal{A})\rangle_{\mathcal{Q}_{\text{SE}}} \tag{22}$$
$$= \langle \log p(X\mid Z;\theta)\rangle_{\pi(X) q(Z\mid X;\eta)} + \langle \log p(Z)\rangle_{\pi(X) q(Z\mid X;\eta)} - \langle \log q(Z\mid X;\eta)\rangle_{\pi(X) q(Z\mid X;\eta)} + \langle \log p_\rho(Z'\mid Z)\rangle_{\pi(X) q(Z\mid X;\eta)\, q_T(\tilde X\mid X; u_\mathcal{A})\, q(Z'\mid \tilde X;\eta)} - \langle \log q(Z'\mid \tilde X;\eta)\rangle_{\pi(X) q(Z\mid X;\eta)\, q_T(\tilde X\mid X; u_\mathcal{A})\, q(Z'\mid \tilde X;\eta)} - \langle \log q_T(\tilde X\mid X; u_\mathcal{A})\rangle_{\pi(X)\, q_T(\tilde X\mid X; u_\mathcal{A})} \tag{23}$$

The objective
$$(\eta^*, \theta^*) = \arg\max_{\eta,\theta}\; \min_{u_\mathcal{A} \in \mathcal{U}_\mathcal{A}} \tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta; u_\mathcal{A}) \tag{24}$$
is solved by iterative optimization for $\tau = 1, 2, \dots$:

• Augmentation step: solve or improve $u_\mathcal{A}^{(\tau)} = \arg\min_{u_\mathcal{A} \in \mathcal{U}_\mathcal{A}} \tilde{\mathcal{B}}_{\text{SE}}(\theta^{(\tau-1)}, \eta^{(\tau-1)}; u_\mathcal{A})$
• Maximization step: solve or improve $(\eta^{(\tau)}, \theta^{(\tau)}) = \arg\max_{\eta,\theta} \tilde{\mathcal{B}}_{\text{SE}}(\theta, \eta; u_\mathcal{A}^{(\tau)})$

Augmentation step: In the inner optimization we seek the worst-case $u_\mathcal{A}$ that minimizes the ELBO, or equivalently
$$u_\mathcal{A}^* = \arg\min_{u_\mathcal{A} \in \mathcal{U}_\mathcal{A}} \tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta; u_\mathcal{A})$$
While this optimization appears to be over the space of all distributions, we can see that the last term of the bound (23) is the entropy of $q_T$, $H[q_T]$. Consequently, the bound $\tilde{\mathcal{B}}_{\text{SE}}$ is minimized when the entropy $H[q_T]$ is minimized, i.e., when $q_T$ is degenerate and concentrated on a point. Focusing on the remaining terms, we can therefore rewrite this optimization problem for each data point as
$$\alpha^* = \arg\max_{\alpha \in \mathcal{A}} \left\{ -\langle \log p(Z'\mid Z;\rho)\rangle_{q(Z\mid X=x;\eta)\, q(Z'\mid \tilde X=T_\alpha(x);\eta)} - H[q(Z'\mid \tilde X = T_\alpha(x);\eta)] - H[q(Z\mid X = x;\eta)] \right\} \tag{25–26}$$
$$\tilde x_a = x + \alpha^* \tag{27}$$
This term can be identified as a lower bound to the entropy-regularized $\ell_2$ optimal transport [5], hence it measures the discrepancy between $q(Z\mid X = x;\eta)$ and $q(Z'\mid \tilde X = \tilde x;\eta)$. It can be interpreted as an adversarial attack trying to maximize the change in the representations. We will denote the empirical distribution obtained by attacking each sample adversarially as $\hat\pi_a(\tilde X\mid X) = \sum_i \delta(\tilde X - \tilde x_a^{(i)})$.

Maximization step: Given the empirical distribution of inputs and their augmentations via adversarial attacks, $\hat\pi_a(X, \tilde X) \equiv \pi(X)\, \hat\pi_a(\tilde X\mid X)$, held fixed within an iteration, the objective is
$$\tilde{\mathcal{B}}_{\text{SE}}(\theta,\eta) \stackrel{+}{=} \langle \log p(X\mid Z;\theta)\rangle_{\pi(X) q(Z\mid X;\eta)} + \langle \log p(Z)\rangle_{\pi(X) q(Z\mid X;\eta)} - \langle \log q(Z\mid X;\eta)\rangle_{\pi(X) q(Z\mid X;\eta)} + \langle \log p_\rho(Z'\mid Z)\rangle_{q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)\, \hat\pi_a(X,\tilde X)} - \langle \log q(Z'\mid \tilde X;\eta)\rangle_{q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)\, \hat\pi_a(X,\tilde X)}$$
$$= \langle \log p(X\mid Z;\theta)\rangle_{\pi(X) q(Z\mid X;\eta)} - \mathrm{KL}(\hat\pi_a(X, \tilde X)\, q(Z, Z'\mid X, \tilde X;\eta)\,\|\, p_\rho(Z', Z)) \tag{28}$$
where $q(Z, Z'\mid X, \tilde X;\eta) \equiv q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)$.

C.3.1 A Tighter bound


Even though we derived the above bound using a factorized distribution assumption $q(Z, Z'\mid X, \tilde X;\eta) = q(Z\mid X;\eta)\, q(Z'\mid \tilde X;\eta)$, [5] uses a tighter bound based on the correlated Gaussian
$$q(Z', Z\mid X = x, \tilde X = \tilde x;\eta) = \mathcal{N}\!\left( \begin{pmatrix} f^\mu(x;\eta) \\ f^\mu(\tilde x;\eta) \end{pmatrix}, \begin{pmatrix} f^\Sigma(x;\eta) & \psi \\ \psi & f^\Sigma(\tilde x;\eta) \end{pmatrix} \right)$$
Here, $\psi$ is a diagonal matrix with $i$'th diagonal element
$$\psi_i = \frac{1}{2\gamma}\left( \sqrt{1 + 4\gamma^2 f^\Sigma(x;\eta)_i\, f^\Sigma(\tilde x;\eta)_i} - 1 \right)$$
where $\gamma \equiv \rho/(1-\rho^2)$. With this modification we get
$$\langle \log p_\rho(Z', Z)\rangle_{\tilde q(Z',Z\mid X=x,\tilde X=\tilde x;\eta)} = -\frac{1}{2(1-\rho^2)} \operatorname{Tr}\!\left( \Sigma + \tilde\Sigma + \mu\mu^\top + \tilde\mu\tilde\mu^\top \right) + \frac{\rho}{1-\rho^2} \operatorname{Tr}\!\left( \psi + \tilde\mu\mu^\top \right) + C \tag{29}$$
Under the $q$ distribution, the desired expectations are
$$\left\langle \operatorname{Tr} Z'Z'^\top \right\rangle = \operatorname{Tr}(\tilde\Sigma + \tilde\mu\tilde\mu^\top), \qquad \left\langle \operatorname{Tr} ZZ^\top \right\rangle = \operatorname{Tr}(\Sigma + \mu\mu^\top), \qquad \left\langle \operatorname{Tr} Z'Z^\top \right\rangle = \operatorname{Tr}(\psi + \tilde\mu\mu^\top) \tag{30}$$
with $\mu \equiv f^\mu(x;\eta)$, $\tilde\mu \equiv f^\mu(\tilde x;\eta)$, $\Sigma \equiv f^\Sigma(x;\eta)$, $\tilde\Sigma \equiv f^\Sigma(\tilde x;\eta)$, so that
$$\langle \log p_\rho(Z', Z)\rangle_{\tilde q} = -\frac{1}{2(1-\rho^2)} \operatorname{Tr}\!\left( \tilde\Sigma + \Sigma + \tilde\mu\tilde\mu^\top + \mu\mu^\top - 2\rho(\psi + \tilde\mu\mu^\top) \right) \tag{31}$$
With this tighter bound, the algorithm for SE is shown in Algorithm 2. Following Equation (18), we approximate the required expectations by their Monte Carlo estimates, and the objective function to maximize becomes
$$\hat{\mathcal{B}}_{\text{SE}}(\theta,\eta;\, x, \tilde x, z) = \langle \log p(X = x\mid Z = z)\rangle_{q(Z\mid X=x;\eta)} + \langle \log p_\rho(Z', Z)\rangle_{\tilde q(Z,Z'\mid X=x,\tilde X=\tilde x)} - \langle \log \tilde q(Z, Z')\rangle_{\tilde q(Z,Z'\mid \tilde X=\tilde x, X=x)} \tag{32}$$

C.4 AVAE-SS (Self Supervised)

This algorithm can be used for post-training an already trained VAE. Figure 6 shows the graphical model describing the AVAE-SS model.

Figure 6: Graphical models of the AVAE-SS target distribution $\mathcal{P}_{\rho,\text{AVAE-SS}}$ and the variational approximation $\mathcal{Q}_{\text{AVAE-SS}}$. Here both $X$ and $\tilde X$ are samples generated by the decoder.

Figure 7: Graphical models of the target distribution $\mathcal{P}_{\rho,\rho_{\text{SE}}}$ and the variational approximation $\mathcal{Q}_{\text{SE-AVAE}}$. Here $\tilde X$ is a sample generated by the decoder that is subsequently encoded by the encoder.

 
$$\mathcal{B}_{\text{AVAE-SS}} \stackrel{+}{=} -\mathrm{KL}\!\left( p(Z'')\, p(X\mid Z'')\, q(Z\mid X;\eta)\, p(\tilde X\mid Z)\, q(Z'\mid \tilde X) \,\|\, p(Z'')\, p(Z'\mid Z)\, p(Z)\, u(\tilde X)\, u(X) \right)$$
$$= \langle \log p(Z)\rangle_{q(Z\mid X)} + \langle \log p(Z'\mid Z)\rangle_{q(Z\mid X)\, q(Z'\mid \tilde X)} - \left\langle \log p(Z'')\, p(X\mid Z'')\, q(Z\mid X;\eta)\, p(\tilde X\mid Z)\, q(Z'\mid \tilde X) \right\rangle_{p(X\mid Z'')\, q(Z\mid X;\eta)\, p(\tilde X\mid Z)\, q(Z'\mid \tilde X)} \tag{33}$$

The algorithm is shown in Algorithm 4. We approximate the required expectations by their Monte Carlo estimates, and the objective function to maximize becomes
$$\hat{\mathcal{B}}_{\text{AVAE-SS}}(\theta,\eta;\, x, \tilde x, z) = \langle \log p(Z)\rangle_{q(Z\mid X=x;\eta)} + \langle \log p(Z'\mid Z;\rho)\rangle_{\tilde q(Z\mid X=x;\eta)\, \tilde q(Z'\mid \tilde X=\tilde x;\eta)}$$
$$- \langle \log q(Z'\mid \tilde X = \tilde x;\eta)\rangle_{q(Z'\mid \tilde X=\tilde x;\eta)} - \langle \log q(Z\mid X = x;\eta)\rangle_{q(Z\mid X=x;\eta)} \tag{34}$$

Also see Section C.1 for further expansion of the terms.

C.5 SE-AVAE

Figure 7 shows the graphical model describing the SE-AVAE model.
$$\mathcal{B}_{\text{SE-AVAE}} \stackrel{+}{=} -\mathrm{KL}\!\left( \pi(X)\, p(\tilde{\tilde X}\mid X)\, q(Z''\mid \tilde{\tilde X};\eta)\, q(Z\mid X)\, p_\theta(\tilde X\mid Z)\, q(Z'\mid \tilde X) \,\|\, p(Z'\mid Z)\, p(Z''\mid Z)\, p(X\mid Z)\, p(Z) \right)$$

The algorithm is shown in Algorithm 3. We approximate the required expectations by their Monte Carlo estimates, and the objective function to maximize becomes
$$\hat{\mathcal{B}}_{\text{SE-AVAE}}(\theta,\eta;\, x, \tilde x, \tilde{\tilde x}, z) = \langle \log p(Z)\rangle_{q(Z\mid X=x;\eta)} + \langle \log p(Z'\mid Z;\rho)\rangle_{\tilde q(Z\mid X=x;\eta)\, \tilde q(Z'\mid \tilde X=\tilde x;\eta)} + \langle \log p(Z''\mid Z;\rho_{\text{SE}})\rangle_{\tilde q(Z\mid X=x;\eta)\, \tilde q(Z''\mid \tilde{\tilde X}=\tilde{\tilde x};\eta)}$$
$$- \langle \log q(Z'\mid \tilde X = \tilde x;\eta)\rangle_{q(Z'\mid \tilde X=\tilde x;\eta)} - \langle \log q(Z\mid X = x;\eta)\rangle_{q(Z\mid X=x;\eta)} - \langle \log q(Z''\mid \tilde{\tilde X} = \tilde{\tilde x};\eta)\rangle_{q(Z''\mid \tilde{\tilde X}=\tilde{\tilde x};\eta)} \tag{35}$$

Figure 8: Results of a VAE. (Left to right) i) the empirical data distribution π̂(X); ii) the encoder weighted by the empirical distribution π̂(X)q(Z|X; η); iii) the encoder weighted by the model distribution p(X; θ)q(Z|X; η); iv) the decoder p(X|Z)p(Z); v) the model distribution p(X; θ), obtained by marginalizing the decoder.
D Details of the 1-D Example


In the example of Section 3.1, we assume that both the observations x and latents z can only take values from discrete sets, and to avoid boundary effects we adopt a parametrization reminiscent of a von Mises distribution with normalization constant J:
$$\mathrm{VM}(X;\mu,v) \equiv \frac{1}{J(\mu,v)} \exp\left( \cos(X - \mu)/v \right), \qquad J(\mu,v) = \sum_{x \in \mathcal{X}} \exp\left( \cos(x - \mu)/v \right)$$
As these densities (up to quantization effects) are unimodal and symmetric around their means with a bell shape, this example is qualitatively similar to a standard conditionally Gaussian VAE. We define the following system of conditional distributions as the decoder and encoder models:
$$p(X\mid Z = z;\theta) = \mathrm{VM}(X;\, g(z;\theta),\, v), \qquad q(Z\mid X = x;\eta) = \mathrm{VM}(Z;\, f^\mu(x;\eta),\, f^\Sigma(x;\eta))$$
where we let $X \in \{c_x, 2c_x, 3c_x, \dots, N_x c_x\}$ with $c_x = 2\pi/N_x$, and $Z \in \{c_z, 2c_z, 3c_z, \dots, N_z c_z\}$ with $c_z = 2\pi/N_z$, where $N_x$ and $N_z$ are the cardinalities of each set. As both $X$ and $Z$ are discrete, we can store $g$, $f^\mu$, $f^\Sigma$ as tables; hence the trainable parameters are just the function values at each point, $\theta = (g_1, \dots, g_{N_z})$ and $\eta = (\mu_1, \dots, \mu_{N_x}, \sigma_1, \dots, \sigma_{N_x})$. This emulates a high-capacity network that can model any functional relationship between latents and observations. The prior $p(Z)$ is chosen as uniform and the coupling term is
$$p(Z'\mid Z = z) = \mathrm{VM}(Z';\, z,\, \nu_\rho)$$
where the spread term $\nu_\rho$ is chosen on the order of $10^{-3}$.
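A minimal sketch of how such tables can be constructed (here the mean tables are randomly initialized purely for illustration; in the experiments they are trained, and the grid sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, Nz, v = 64, 16, 0.5
X = 2 * np.pi * np.arange(1, Nx + 1) / Nx     # observation grid
Z = 2 * np.pi * np.arange(1, Nz + 1) / Nz     # latent grid

def vm_table(support, mu, spread):
    """Column j holds VM(support; mu[j], spread), normalized over the support."""
    p = np.exp(np.cos(support[:, None] - mu[None, :]) / spread)
    return p / p.sum(axis=0, keepdims=True)

g = rng.choice(X, size=Nz)                    # tabular decoder means g(z)
f_mu = rng.choice(Z, size=Nx)                 # tabular encoder means f_mu(x)
p_x_given_z = vm_table(X, g, v)               # decoder table, shape (Nx, Nz)
q_z_given_x = vm_table(Z, f_mu, 0.1)          # encoder table, shape (Nz, Nx)
```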
In Figure 8, we illustrate an example where we fit a VAE to the empirical data distribution π̂(X). The encoder-empirical data joint distribution π̂(X)q(Z|X; η) and the decoder joint distribution p(X|Z)p(Z) are in fact closely matching, but only on the support of π̂. In the next panel, we show the encoder-model data joint distribution p(X; θ)q(Z|X; η). The non-smooth nature of the encoder is evident: the conditional distributions at each row are quite different from one another. This reveals that samples that can still be generated with high probability by the decoder would be mapped to unrelated states when encoded again.

E Wasserstein Distance
The $\ell_2$-Wasserstein distance $W_2$ between two Gaussians $\mathcal{P}_a$ and $\mathcal{P}_b$ with means $\mu_a, \mu_b$ and covariance matrices $\Sigma_a, \Sigma_b$ is given by
$$W_2^2(\mathcal{P}_a, \mathcal{P}_b) \equiv \|\mu_a - \mu_b\|_2^2 + \operatorname{Tr}\!\left( \Sigma_a + \Sigma_b - 2\left( \Sigma_b^{1/2} \Sigma_a \Sigma_b^{1/2} \right)^{1/2} \right)$$
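A direct implementation of this formula (a sketch using scipy's matrix square root; the function name is ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(mu_a, sigma_a, mu_b, sigma_b):
    """Squared l2-Wasserstein distance between two Gaussians, per the formula above."""
    sb = sqrtm(sigma_b)
    cross = sqrtm(sb @ sigma_a @ sb)   # (Sigma_b^{1/2} Sigma_a Sigma_b^{1/2})^{1/2}
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(sigma_a + sigma_b - 2 * np.real(cross)))
```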

F Experimental Details
All experiments are performed on NVIDIA Tesla P100 GPU. Optimization is performed using
AdamOptimizer with learning rate 1e-4.
Color Mnist For experiments with the Color MNIST dataset with MLP architecture, a 4 layer multi
layer perceptron (MLP), with 200 neurons at each layer, is used for both encoder and decoder. For
experiments with the Color MNIST dataset and conv architecture, a 7 layer VGG network is used for

19
80 80 80

60 AVAE 60 AVAE 60 AVAE


Adv. Accuracy

Adv. Accuracy

Adv. Accuracy
SE SE SE
SE AVAE SE AVAE SE AVAE
40 VAE 40 VAE 40 VAE
AVAE SS AVAE SS AVAE SS
20 20 20

0 0 0
0.00 500.00K 1.00M 1.50M 2.00M 2.50M 3.00M 3.50M 4.00M 0.00 500.00K 1.00M 1.50M 2.00M 2.50M 3.00M 3.50M 4.00M 0.00 500.00K 1.00M 1.50M 2.00M 2.50M 3.00M 3.50M 4.00M
Iterations Iterations Iterations

Figure 9: Comparison of Algorithms (on ColorMnist with Conv arch) for various different ρ and
ρSE values. (LEFT) ρ = 0.95 and ρSE = 0.975. (MIDDLE) ρ = 0.97 and ρSE = 0.97. (RIGHT)
ρ = 0.97 and ρSE = .975

the encoder, with output channels 8, 16, 32, 64, 128, 256 and 512, strides 2, 1, 2, 1, 2, 1 and 2, and kernel shape (3, 3). For the decoder, a de-convolutional architecture with 3 de-conv layers, output channels 64, 32 and 3, strides 1, 2 and 1, and kernel shape (3, 3) is used. Convolutional architectures are stabilized using BatchNorm between convolutional layers. In training where an adversarial attack is required, PGD with an L-inf perturbation radius and an iteration budget of 20 is used. No random restarts are used in PGD during training, while 10 random restarts are used during evaluation. In evaluation, PGD with an iteration budget of 40 is used.

Figure 9: Comparison of algorithms (on colorMNIST with the Conv architecture) for various ρ and ρ_SE values. (Left) ρ = 0.95 and ρ_SE = 0.975. (Middle) ρ = 0.97 and ρ_SE = 0.97. (Right) ρ = 0.97 and ρ_SE = 0.975.
CelebA: CelebA experiments are performed using a VGG encoder with 4 convolutional layers, output channels 128, 256, 512 and 1024, stride 2 and kernel size (5, 5). The CelebA decoder is a VGG network with 4 layers, output channels 512, 256, 128 and 3, stride 2 and kernel shape (5, 5). All convolutional layers are normalized with BatchNorm. In training where an adversarial attack is required, a PGD attack with an iteration budget of 5 is used with an L-inf perturbation radius. No random restarts are used in PGD during training, whereas in evaluation 10 random restarts are used with an iteration budget of 20.

G Comparing Algorithms for different ρ settings


Figure 9 shows a comparison of all algorithms for different ρ and ρ_SE values. The figure solidifies the claim that SE-AVAE achieves better downstream adversarial accuracy than both SE and AVAE for smaller ε values, for reasonable ρ and ρ_SE values.

H CelebA Results for All Downstream Tasks


Table 3 shows the adversarial downstream accuracy for all 17 downstream tasks of CelebA, compared across the models VAE, AVAE, SE with a PGD iteration budget of 5, SE with a PGD iteration budget of 20, SE-AVAE and AVAE-SS. Here ε = 0.0 represents the nominal downstream accuracy.

                 VAE           AVAE          SE^5          SE^20         SE-AVAE       AVAE-SS
Task             0.0    0.1    0.0    0.1    0.0    0.1    0.0    0.1    0.0    0.1    0.0    0.1
Bald             97.9   2.1    97.8   83.9   97.9   71.0   97.4   86.5   97.8   87.0   97.8   70.0
Mustache         94.9   0.7    94.8   89.8   95.0   69.5   95.7   84.4   96.3   92.3   95.2   74.0
Eyeglasses       95.4   0.0    94.9   71.7   95.9   20.3   95.7   33.0   95.3   67.5   94.3   56.8
Necklace         87.7   0.7    88.2   76.7   88.0   56.0   88.0   78.9   86.1   80.3   87.9   59.9
Smiling          87.7   0.4    78.5   3.0    86.9   3.1    85.7   1.1    77.9   6.3    81.6   1.0
Lipstick         84.5   0.0    80.4   6.5    84.5   2.0    80.3   0.6    80.7   11.2   80.3   0.9
Bangs            90.3   0.2    89.7   36.0   90.8   19.3   89.6   27.0   89.6   45.9   89.6   22.2
Black Hair       83.5   0.0    82.5   31.7   83.4   23.0   81.4   31.4   79.5   33.3   83.3   26.2
Blond Hair       91.5   0.2    91.1   46.4   92.5   35.9   90.7   53.5   92.4   55.8   90.6   37.6
Brown Hair       78.4   1.0    77.8   35.9   77.9   24.1   80.5   41.5   82.9   46.0   78.2   19.3
Gender           87.6   0.0    82.6   5.4    87.8   1.5    81.6   0.7    82.1   10.7   82.1   0.7
Beard            85.8   0.0    83.8   39.2   85.4   18.3   85.3   24.3   86.0   50.1   84.4   16.3
Straight Hair    79.6   2.3    79.2   72.2   79.6   64.8   78.7   77.3   78.8   74.9   79.2   64.8
Wavy Hair        75.8   0.1    75.8   17.8   76.1   10.0   72.8   10.2   72.7   22.1   75.5   9.3
Earrings         81.3   0.1    81.0   60.9   81.0   26.8   81.3   55.3   79.5   64.1   81.2   24.7
Hat              96.5   0.2    96.6   75.2   96.9   55.3   96.4   77.3   97.3   82.7   96.0   65.5
Necktie          92.6   0.2    92.5   61.9   92.7   41.0   92.7   51.7   93.0   72.5   92.6   39.8
Time Factor      ×1            ×2.2          ×3.1          ×7.8          ×4.3          ×2.2
MSE              7203.9        7276.6        7208.8        N/A           7269.2        7347.3
FID              99.86         97.92         98.00         N/A           109.4         99.86

Table 3: Adversarial test accuracy (in percent) of the representations for all classification tasks on CelebA.
