Covariances, Robustness, and Variational Bayes: Ryan Giordano
Covariances, Robustness, and Variational Bayes: Ryan Giordano
Covariances, Robustness, and Variational Bayes: Ryan Giordano
Abstract
Mean-field Variational Bayes (MFVB) is an approximate Bayesian posterior inference technique that is in-
creasingly popular due to its fast runtimes on large-scale data sets. However, even when MFVB provides
accurate posterior means for certain parameters, it often mis-estimates variances and covariances. Further-
more, prior robustness measures have remained undeveloped for MFVB. By deriving a simple formula for the
effect of infinitesimal model perturbations on MFVB posterior means, we provide both improved covariance
estimates and local robustness measures for MFVB, thus greatly expanding the practical usefulness of MFVB
posterior approximations. The estimates for MFVB posterior covariances rely on a result from the classical
Bayesian robustness literature that relates derivatives of posterior expectations to posterior covariances and
includes the Laplace approximation as a special case. Our key condition is that the MFVB approximation
provides good estimates of a select subset of posterior means—an assumption that has been shown to hold in
many practical settings. In our experiments, we demonstrate that our methods are simple, general, and fast,
providing accurate posterior uncertainty estimates and robustness measures with runtimes that can be an order
of magnitude faster than MCMC.
Keywords: Variational Bayes; Bayesian robustness; Mean field approximation; Linear response theory;
Laplace approximation; Automatic differentiation
1. Introduction
Most Bayesian posteriors cannot be calculated analytically, so in practice we turn to approximations. Vari-
ational Bayes (VB) casts posterior approximation as an optimization problem in which the objective to be
minimized is the divergence, among a sub-class of tractable distributions, from the exact posterior. For
example, one widely-used and relatively simple flavor of VB is “mean field variational Bayes” (MFVB),
which employs Kullback-Leibler (KL) divergence and a factorizing exponential family approximation for the
tractable sub-class of posteriors (Wainwright and Jordan, 2008). MFVB has been increasingly popular as an
alternative to Markov Chain Monte Carlo (MCMC) in part due to its fast runtimes on large-scale data sets.
Although MFVB does not come with any general accuracy guarantees (except asymptotic ones in special
cases (Westling and McCormick, 2015; Wang and Blei, 2017)), MFVB produces posterior mean estimates of
certain parameters that are accurate enough to be useful in a number of real-world applications (Blei et al.,
2016). Despite this ability to produce useful point estimates for large-scale data sets, MFVB is limited as
an inferential tool; in particular, MFVB typically underestimates marginal variances (MacKay, 2003; Wang
and Titterington, 2004; Turner and Sahani, 2011). Moreover, to the best of our knowledge, techniques for
assessing Bayesian robustness have not yet been developed for MFVB. It is these inferential issues that are
the focus of the current paper.
Unlike the optimization approach of VB, an MCMC posterior estimate is an empirical distribution formed
with posterior draws. MCMC draws lend themselves naturally to the approximate calculation of posterior
moments, such as those required for covariances. In contrast, VB approximations lend themselves naturally to
sensitivity analysis, since we can analytically differentiate the optima with respect to perturbations. However,
as has long been known in the Bayesian robustness literature, the contrast between derivatives and moments is
not so stark since, under mild regularity conditions that allow the exchange of integration and differentiation,
there is a direct correspondence between derivatives and covariance (Gustafson, 1996b; Basu et al., 1996;
Efron, 2015, Section 2.2 below).
Thus, in order to calculate local sensitivity to model hyperparameters, the Bayesian robustness liter-
ature re-casts derivatives with respect to hyperparameters as posterior covariances that can be calculated
with MCMC. In order to provide covariance estimates for MFVB, we turn this idea on its head and use the
sensitivity of MFVB posterior expectations to estimate their covariances. These sensitivity-based covariance
estimates are referred to as “linear response” estimates in the statistical mechanics literature (Opper and Saad,
2001), so we refer to them here as linear response variational Bayes (LRVB) covariances. Additionally, we
derive straightforward MFVB versions of hyperparameter sensitivity measures from the Bayesian robustness
literature. Under the assumption that the posterior means of interest are well-estimated by MFVB for all the
perturbations of interest, we establish that LRVB provides a good estimate of local sensitivities. In our exper-
iments, we compare LRVB estimates to MCMC, MFVB, and Laplace posterior approximations. We find that
the LRVB covariances, unlike the MFVB and Laplace approximations, match the MCMC approximations
closely while still being computed over an order of magnitude more quickly than MCMC.
In Section 2 we first discuss the general relationship between Bayesian sensitivity and posterior covari-
ance and then define local robustness and sensitivity. Next, in Section 3, we introduce VB and derive the
linear system for the MFVB local sensitivity estimates. In Section 4, we show how to use the MFVB local
sensitivity results to estimate covariances and calculate canonical Bayesian hyperparameter sensitivity mea-
sures. Finally, in Section 5, we demonstrate the speed and effectiveness of our methods with simple simulated
data, an application of automatic differentiation variational inference (ADVI), and a large-scale industry data
set.
We will assume that we are interested in a posterior expectation of some function g (θ) (e.g., a parameter
mean, a posterior predictive value, or squared loss): Epα [g (θ)]. In the current work, we will quantify
the uncertainty of g (θ) by the posterior variance, Varpα (g (θ)). Other measures of central tendency (e.g.,
posterior medians) or uncertainty (e.g., posterior quantiles) may also be good choices but are beyond the
scope of the current work.
2
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Note the dependence of Epα [g (θ)] on both the likelihood and prior, and hence on α, through Bayes’
Theorem. The choice of a prior and choice of a likelihood are made by the modeler and are almost invariably
a simplified representation of the real world. The choices are therefore to some extent subjective, and so
one hopes that the salient aspects of the posterior would not vary under reasonable variation in either choice.
Consider the prior, for example. The process of prior elicitation may be prohibitively time-consuming; two
practitioners may have irreconcilable subjective prior beliefs, or the model may be so complex and high-
dimensional that humans cannot reasonably express their prior beliefs as formal distributions. All of these
circumstances might give rise to a range of reasonable prior choices. A posterior quantity is “robust” to the
prior to the extent that it does not change much when calculated under these different prior choices.
Quantifying the sensitivity of the posterior to variation in the likelihood and prior is one of the central
concerns of the field of robust Bayes (Berger et al., 2000). (We will not discuss the other central concern,
which is the selection of priors and likelihoods that lead to robust estimators.) Suppose that we have deter-
mined that the hyperparameter α belongs to some open set A, perhaps after expert prior elicitation. Ideally,
we would calculate the extrema of Epα [g (θ)] as α ranges over all of A. These extrema are a measure of
global robustness, and their calculation is intractable or difficult except in special cases (Moreno, 2000; Hu-
ber, 2011, Chapter 15). A more practical alternative is to examine how much Epα [g (θ)] changes locally in
response to small perturbations in the value of α near some tentative guess, α0 ∈ A. To this end we define
the local sensitivity at α0 (Gustafson, 2000).
dEpα [g (θ)]
Sα0 := . (1)
dα α0
Sα0 , the local sensitivity, can be considered a measure of local robustness (Gustafson, 2000). Throughout the
paper we will distinguish between sensitivity, which comprises objectively defined quantities such as Sα0 ,
and robustness, which we treat as a more subjective concept that may be informed by the sensitivity as well
as other considerations. For example, even if one knows Sα0 precisely, how much posterior change is too
much change and how much prior variation is reasonable remain decisions to be made by the modeler. For a
more in-depth discussion of how we use the terms sensitivity and robustness, see Appendix C.
The quantity Sα0 can be interpreted as measuring sensitivity to hyperparameters within a small region
near α = α0 where the posterior dependence on α is approximately linear. Then local sensitivity provides an
approximation to global sensitivity in the sense that, to first order,
Generally, the dependence of Epα [g (θ)] on α is not given in any closed form that is easy to differentiate.
However, as we will now see, the derivative Sα0 is equal, under mild regularity conditions, to a particular
posterior covariance that can easily be estimated with MCMC draws.
Assumption 1 For all α ∈ A, ρ (θ, α) is continuously differentiable with respect to α, and, for a given λ-
measurable g (θ) there exist λ-integrable functions f0 (θ) and f1 (θ) such that |p0 (θ) exp (ρ (θ, α)) g (θ)| <
f0 (θ) and |p0 (θ) exp (ρ (θ, α))| < f1 (θ).
Under Assumption 1 we can normalize the log-perturbed quantity p0 (θ) exp (ρ (θ, α)) to get a density in θ
with respect to λ.
3
G IORDANO , B RODERICK , AND J ORDAN
Theorem 1 is a straightforward consequence of the Lebesgue dominated convergence theorem; see Appendix
A for a detailed proof. Versions of Theorem 1 have appeared many times before; e.g., Diaconis and Freedman
(1986); Basu et al. (1996); Gustafson (1996b); Pérez et al. (2006) have contributed variants of this result to
the robustness literature.
By using MCMC draws from p0 (θ) to calculate the covariance on the right-hand side of Eq. (4), one
can form an estimate of dEpα [g (θ)] /dα| at α = α0 . One might also approach the problem of calculating
dEpα [g (θ)] /dα| using importance sampling as follows (Owen, 2013, Chapter 9). First, an importance
sampling estimate of the dependence of Epα [g (θ)] on α can be constructed with weights that depend on α.
Then, differentiating the weights with respect to α provides a sample-based estimate of dEpα [g (θ)] /dα| .
We show in Appendix B that this importance sampling approach is equivalent to using MCMC samples to
estimate the covariance in Theorem 1.
An immediate corollary of Theorem 1 allows us to calculate Sα0 as a covariance.
Corollary 1 Suppose that Assumption 1 holds for some α0 ∈ A, some g (θ), and for
Corollary 1 can be found in Basu et al. (1996), in which a version of Corollary 1 is stated in the proof of
their Theorem 1, as well as in Pérez et al. (2006) and Efron (2015). Note that the definition of ρ (θ, α) does
not contain any normalizing constants and so can typically be easily calculated. Given Ns MCMC draws
{θn }Nn=1 from a chain that we assume to have reached equilibrium at the stationary distribution p0 (θ), one
s
can calculate an estimate of Sα0 using the sample covariance version of Eq. (4):
Ns Ns
! Ns
!
1 X ∂ρ (θn , α) 1 X 1 X ∂ρ (θn , α)
Ŝα0 := g (θn ) − g (θn ) (6)
Ns n=1 ∂α α0 Ns n=1 Ns n=1 ∂α α0
4
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Given Q, we define the optimal q ∈ Q, which we call qα (θ), as the distribution that minimizes the KL
divergence KL (q (θ; η) ||pα (θ)) from pα (θ). We denote the corresponding optimal variational parameters
as η ∗ .
Definition 4 The variational approximation qα (θ) to pα (θ) is defined by
where
In the KL divergence, the (generally intractable) normalizing constant for pα (θ) does not depend on q (θ)
and so can be neglected when optimizing. In order for the KL divergence to be well defined, we assume
that both p0 (θ) and q (θ) are given with respect to the same base measure, λ, and that the support of q (θ) is
contained in the support of pα (θ). We will require some additional mild regularity conditions in Section 3.2
below.
A common choice for the approximating family Q in Eq. (7) is the “mean field family” (Wainwright and
Jordan, 2008; Blei et al., 2016),
( )
Y
Qmf := q (θ) : q (θ) = q (θk ; ηk ) , (9)
k
where k indexes a partition of the full vector θ and of the parameter vector η. That is, Qmf approximates
the posterior pα (θ) as a distribution that factorizes across sub-components of θ. This approximation is
commonly referred to as “MFVB,” for “mean field variational Bayes.” Note that, in general, each function
q (θk ; ηk ) in the product is different. For notational convenience we write q (θk ; ηk ) instead of qk (θk ; ηk )
when the arguments make it clear which function we are referring to, much as the same symbol p is used to
refer to many different probability distributions without additional indexing.
One may additionally assume that the components q (θk ; ηk ) are in a convenient exponential family.
Although the exponential family assumption does not in general follow from a factorizing assumption, for
compactness we will refer to both the factorization and the exponential family assumption as MFVB.
In an MFVB approximation, Ωη could be a stacked vector of the natural parameters of the exponential
families, or the moment parameterization, or perhaps a transformation of these parameters into an uncon-
strained space (e.g., the entries of the log-Cholesky decomposition of a positive definite information matrix).
For more concrete examples, see Section 5. Although all of our experiments and much of our motivating
intuition will use MFVB, our results extend to other choices of Q that satisfy the necessary assumptions.
5
G IORDANO , B RODERICK , AND J ORDAN
Assumption 3 There exists a strict local minimum, η ∗ (α), of KL (q (θ; η) ||pα (θ)) in Eq. (8) such that
η ∗ (α) is interior to Ωη .
Since g (θ), α, and η are all vectors, the quantities Hηη , fαη , and gη are matrices. We are now ready to state
a VB analogue of Theorem 1.
Theorem 2 Consider a variational approximation qα (θ) to pα (θ) as given in Definition 4 and a λ-measurable
function g (θ). Then, under Assumptions 1–4 , using the definitions given in Definition 5, we have
dEqα [g (θ)]
= gη H−1 |
ηη fαη . (10)
dα| α0
A proof of Theorem 2 is given in Appendix D. As with Theorem 1, by choosing the appropriate ρ (θ, α) and
evaluating fαη , we can use Theorem 2 to calculate the exact sensitivity of VB solutions to any arbitrary local
perturbations that satisfy the regularity conditions. Assumptions 1–4 are typically not hard to verify. For an
example, see Appendix E, where we establish Assumptions 1–4 for a multivariate normal target distribution
and a mean-field approximation.
Eq. (10) is formally similar to frequentist sensitivity estimates. For example, the pioneering paper of Cook
(1986) contains a formula for assessing the curvature of a marginal likelihood surface (Cook, 1986, Equation
15) that, like our Theorem 2, represents the sensitivity as a linear system involving the Hessian of an objective
function at its optimum. The geometric interpretation of local robustness suggested by Cook (1986) has been
extended to Bayesian settings (see, for example, Zhu et al. (2007, 2011)). In addition to generality, one
attractive aspect of their geometric approach is its invariance to parameterization. Investigating geometric
interpretations of the present work may be an interesting avenue for future research.
6
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Condition 1 Under Assumptions 1–4 and the quantities defined therein, we additionally have, for all α ∈ A,
We will not attempt to be precise about what we mean by the “approximately equal” sign, since we are
not aware of any practical tools for evaluating quantitatively whether Condition 1 holds other than running
both VB and MCMC (or some other slow but accurate posterior approximation) and comparing the results.
However, VB has been useful in practice to the extent that Condition 1 holds true for at least some parameters
of interest. We provide some intuition for when Condition 1 might hold in Section 5.1, and will evaluate
Condition 1 in each of our experiments below by comparing the VB and MCMC posterior approximate
means and sensitivities.
Since Condition 1 holds only for a particular choice of g (θ), it is weaker than the assumption that qα is
close to pα in KL divergence, or even that all the posterior means are accurately estimated. For example,
as discussed in Appendix B of Giordano et al. (2015) and in Section 10.1.2 of Bishop (2006), a mean-field
approximation to a multivariate normal posterior produces inaccurate covariances and may have an arbitrarily
bad KL divergence from pα , but Condition 1 holds exactly for the location parameters. We discuss the
multivariate normal example further in Section 4.1 and Section 5.1 below.
We can consequently use Theorem 2 to provide an estimate of Covp0 (g (θ)) that may be superior to Covq0 (g (θ)).
With this motivation in mind, we make the following definition.
Definition 6 The linear response variational Bayes (LRVB) approximation, CovLR
q0 (g (θ)), is given by
−1 |
CovLR
q0 (g (θ)) := gη Hηη gη . (14)
Corollary 2 For a given p0 (θ), class Q, and function g (θ), when Assumptions 1–4 and Condition 1 hold for
ρ (θ, α) = α| g (θ) and α0 = 0, then
CovLR
q0 (g (θ)) ≈ Covp0 (g (θ)) .
7
G IORDANO , B RODERICK , AND J ORDAN
The strict optimality of η0∗ in Assumption 3 guarantees that Hηη will be positive definite and symmetric, and,
as desired, the covariance estimate CovLR q0 (g (θ)) will be positive semidefinite and symmetric. Since the op-
timal value of every component of Eqα [g (θ)] may be affected by the log perturbation α| g (θ), CovLR q0 (g (θ))
can estimate non-zero covariances between elements of g (θ) even when they have been partitioned into sep-
arate factors of the mean field approximation.
Note that CovLR q0 (g (θ)) and Covq0 (g (θ)) differ only when there are at least some moments of p0 that
q0 fails to accurately estimate. In particular, if qα provided a good approximation to pα for both the first and
second moments of g (θ), then we would have CovLR q0 (g (θ)) ≈ Covq0 (g (θ)) since, for q0 and p0 ,
Putting these two approximate equalities together, we see that, when the first and second moments of qα
approximately match those of pα ,
One can show (see Appendix E) that the optimal MFVB approximation to pα in the family Qmf is given by
Note that the posterior mean of θ1 is exactly estimated by the MFVB procedure:
However, if Σ12 6= 0, then Λ−1 11 < Σ11 , and the variance of θ1 is underestimated. It follows that the
expectation of θ12 is not correctly estimated by the MFVB procedure:
8
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
An analogous statement holds for θ2 . Of course, the covariance is also mis-estimated if Σ12 6= 0 since, by
construction of the MFVB approximation,
Now let us take the log perturbation ρ (θ, α) = θ1 α1 + θ2 α2 . For all α in a neighborhood of zero, the log-
perturbed posterior given by Eq. (2) remains multivariate normal, so it remains the case that, as a function of
α, Eqα [θ1 ] = Epα [θ1 ] and Eqα [θ2 ] = Epα [θ2 ]. Again, see Appendix E for a detailed proof. Consequently,
Condition 1 holds with equality (not approximate equality) when g (θ) = θ. However, since the second
|
moments are not accurate (irrespective of α), Condition 1 does not hold exactly when g (θ) = θ12 , θ22 , nor
when g (θ) = θ1 θ2 . (Condition 1 may still hold approximately for second moments when Σ12 is small.) The
fact that Condition 1 holds with equality for g (θ) = θ allows us to use Theorem| 1 and Theorem 2 to calculate
CovLRq0 (g (θ)) = Covp0 (g (θ)), even though Ep0 [θ1 θ2 ] and Ep0 θ12 , θ22 are mis-estimated.
In fact, when Condition 1 holds with equality for some θi , then the estimated covariance in Eq. (14)
for all terms involving θi will be exact as well. Condition 1 holds with equality for the means of θi in
the bivariate normal model above, and in fact holds for the general multivariate normal case, as described
in Appendix E. Below, in Section 5, in addition to robustness measures, we will also report the accuracy
of Eq. (14) for estimating posterior covariances. We find that, for most parameters of interest, particularly
location parameters, CovLR q0 (g (θ)) provides a good approximation to Covp0 (g (θ)).
9
G IORDANO , B RODERICK , AND J ORDAN
Since both LRVB and the Laplace approximation require the solution of an optimization problem (Eq. (8) and
Eq. (16) respectively) and the estimation of covariances via an inverse Hessian of the optimization objective
(Eq. (14) and Eq. (17) respectively), it will be instructive to compare the two approaches.
Following Neal and Hinton (1998), we can, in fact, view the MAP estimator as a special variational
approximation, where we define
n Z
QLap := q (θ; θ0 ) : q (θ; θ0 ) log p0 (θ) λ (dθ) = log p0 (θ0 ) and
Z o
q (θ; θ0 ) log q (θ; θ0 ) λ (dθ) = Constant ,
where the Constant term is constant in θ0 . That is, QLap consists of “point masses” at θ0 with constant en-
tropy. Generally such point masses may not be defined as densities with respect to λ, and the KL divergence
in Eq. (8) may not be formally defined for q ∈ QLap . However, if QLap can be approximated arbitrarily well
by well-defined densities (e.g., normal distributions with variance fixed at an arbitrarily small number), then
we can use QLap as a heuristic tool for understanding the MAP estimator.
Since QLap contains only point masses, the covariance of the variational approximation is the zero ma-
trix: CovqLap (θ) = 0. Thus, as when one uses the mean field assumption, CovqLap (θ) underestimates the
marginal variances and magnitudes of the covariances of Covp0 (θ). Of course, the standard Laplace approx-
imation uses CovLap Lap
qLap (θ), not CovqLap (θ), to approximate Covp0 (θ). In fact, CovqLap (θ) is equivalent to a
linear response covariance matrix calculated for the approximating family QLap :
KL (q (θ; θ0 ) ||p0 (θ)) = − log p0 (θ0 ) − Constant ⇒
θ̂Lap = argmax p0 (θ) = argmin KL (q (θ; θ0 ) ||p0 (θ)) = θ0∗
θ θ0
2
∂ p0 (θ) ∂ 2 KL (q (θ; θ0 ) ||p0 (θ))
HLap = − =− = Hηη .
∂θ∂θ| θ̂Lap ∂θ0 ∂θ0| θ0∗
10
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Corollary 3 Suppose that Assumptions 1–4 and Condition 1 hold for some α0 ∈ A and for
Corollary 3 states that, as with the covariance approximations in Section 4.1, Sqα0 is a useful approximation to
Sα0 to the extent that Condition 1 holds—that is, to the extent that the MFVB means are good approximations
to the exact means for the prior perturbations α ∈ A0 .
Under the ρ (θ, α) given in Corollary 3, Theorem 2 gives the following formula for the variational local
sensitivity:
" #
q −1 ∂ ∂ρ (θ, α)
Sα0 = gη Hηη Eq(θ;η) . (19)
∂η | ∂α α0 ∗
η0
We now use Eq. (19) to reproduce MFVB versions of some standard robustness measures found in the
existing literature. A simple case is when the prior p (θ|α) is believed to be in a given parametric family,
and we are simply interested in the effect of varying the parametric family’s parameters (Basu et al., 1996;
Giordano et al., 2016). For illustration, we first consider a simple example where p (θ|α) is in the exponential
family, with natural sufficient statistic θ and log normalizer A (α), and we take g (θ) = θ. In this case,
Note that when fαη = gη , Eq. (19) is equivalent to Eq. (14). So we see that
Sqα0 = CovLR
q0 (θ) .
In this case, the sensitivity is simply the linear response covariance estimate of the covariance, CovLR
q0 (θ).
By the same reasoning, the exact posterior sensitivity is given by
Thus, Sqα0 ≈ Sα0 to the extent that CovLRq0 (θ) ≈ Covp0 (θ), which again holds to the extent that Condition 1
holds. Note that if we had used a mean field assumption and had tried to use the direct, uncorrected response
covariance Covq0 (θ) to try to evaluate Sqα0 , we would have erroneously concluded that the prior on one
component, θk1 , would not affect the posterior mean of some other component, θk2 , for k2 6= k1 .
Sometimes it is easy to evaluate the derivative of the log prior even when it is not easy to normalize it. As
an example, we will show how to calculate the local sensitivity to the concentration parameter of an LKJ prior
(Lewandowski et al., 2009) under an inverse Wishart variational approximation. The LKJ prior is defined as
follows. Let Σ (as part of θ) be an unknown K × K covariance matrix. Define the K × K scale matrix M
such that
(p
Σij if i = j
Mij =
0 otherwise.
11
G IORDANO , B RODERICK , AND J ORDAN
R = M−1 ΣM−1 .
The LKJ prior on the covariance matrix R with concentration parameter α > 0 is given by:
α−1
pLKJ (R|α) ∝ |R| .
The Stan manual recommends the use of pLKJ , together with an independent prior on the diagonal entries of
the scaling matrix M, for the prior on a covariance matrix that appears in a hierarchical model (Stan Team,
2015, Chapter 9.13).
Suppose that we have chosen the variational approximation
where Ψ is a positive definite scale matrix and ν is the number of degrees of freedom. In this case, the
variational parameters are η = (Ψ, ν). We write η with the understanding that we have stacked only the
upper-diagonal elements of Ψ since Ψ is constrained to be symmetric and η ∗ must be interior. As we show
in Appendix G,
K !
ν X 1 ν−K +1
Eq [log pLKJ (R|α)] = (α − 1) log |Ψ| − ψK − log Ψkk + Kψ + Constant,
2 2 2
k=1
where Constant contains terms that do not depend on α, and where ψK denotes the multivariate digamma
function. Consequently, we can evaluate
∂ ∂
fαη = Eq(θ;η) log p (Σ|α)
∂η | ∂α η=η0∗ ,α=α0
K !
∂ n X 1 n−K +1
= log |Ψ| − ψK − log Ψkk + Kψ . (20)
∂η | 2 2 2
k=1 ∗ η0
This derivative has a closed form, but the bookkeeping required to represent an unconstrained parameteri-
zation of the matrix Ψ within η would be tedious. In practice, we evaluate terms like fαη using automatic
differentiation tools (Baydin et al., 2018).
Finally, in cases where we cannot evaluate Eq(θ;η) [log p (θ|α)] in closed form as a function of η, we
can use numerical techniques as described in Section 4.5. We thus view Sqα0 as the exact sensitivity to an
approximate KL divergence.
12
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
quickly. Consider, for example, a model with “global” parameters, θglob , that are shared by all the individual
datapoint likelihoods, and “local” parameters, θloc,n , associated with likelihood of a single datapoint indexed
by n. By “global” and “local” we mean the likelihood and assumed variational distribution factorize as
N
Y
p (x, θglob , θloc,1 , ..., θloc,N ) = p (θglob ) p (x|θloc,n , θglob ) p (θloc,n |θglob ) (21)
n=1
N
Y
q (θ; η) = q (θglob ; ηglob ) q (θloc,n ; ηn ) for all q (θ; η) ∈ Q.
n=1
In this case, the second derivatives of the variational objective between the parameters for local variables
vanish:
∂ 2 KL (q (θ; η) ||p0 (θ))
for all n 6= m, | = 0.
∂ηloc,n ∂ηloc,m
The model in Section 5.3 has such a global / local structure; see Section 5.3.2 for more details. Additional
discussion, including the use of Schur complements to take advantage of sparsity in the log likelihood, can
be found in Giordano et al. (2015).
When even calculating or instantiating Hηη is prohibitively time-consuming, one can use conjugate gradi-
ent algorithms to approximately compute H−1 |
ηη gη (Wright and Nocedal, 1999, Chapter 5). The advantage of
conjugate gradient algorithms is that they approximate H−1 | |
ηη gη using only the Hessian-vector product Hηη gη ,
which can be computed efficiently using automatic differentiation without ever forming the full Hessian Hηη .
See, for example, the hessian vector product method of the Python autograd package (Maclaurin
et al., 2015). Note that a separate conjugate gradient problem must be solved for each column of gη| , so if the
parameter of interest g (θ) is high-dimensional it may be faster to pay the price for computing and inverting
the entire matrix Hηη . See 5.3.2 for more discussion of a specific example.
In Theorem 2, we require η0∗ to be at a true local optimum. Otherwise the estimated sensitivities may not
be reliable (e.g., the covariance implied by Eq. (14) may not be positive definite). We find that the classical
MFVB coordinate ascent algorithms (Blei et al. (2016, Section 2.4)) and even quasi-second order methods,
such as BFGS (e.g., Regier et al., 2015), may not actually find a local optimum unless run for a long time
with very stringent convergence criteria. Consequently, we recommend fitting models using second-order
Newton trust region methods. When the Hessian is slow to compute directly, as in Section 5, one can use the
conjugate gradient trust region method of Wright and Nocedal (1999, Chapter 7), which takes advantage of
fast automatic differentiation Hessian-vector products without forming or inverting the full Hessian.
5. Experiments
We now demonstrate the speed and effectiveness of linear response methods on a number of simulated and
real data sets. We begin with simple simulated data to provide intuition for how linear response methods
can improve estimates of covariance relative to MFVB and the Laplace approximation. We then develop
linear response covariance estimates for ADVI and apply them to four real-world models and data sets
taken from the Stan examples library (Stan Team, 2017). Finally, we calculate both linear response co-
variances and prior sensitivity measures for a large-scale industry data set. In each case, we compare linear
response methods with ordinary MFVB, the Laplace approximation, and MCMC. We show that linear re-
sponse methods provide the best approximation to MCMC while still retaining the speed of approximate
methods. Code and instructions to reproduce the results of this section can be found in the git repository
rgiordan/CovariancesRobustnessVBPaper.
13
G IORDANO , B RODERICK , AND J ORDAN
the sensitivity analysis motivating CovLRq0 (g (θ)) differs from the local posterior approximation motivating
CovLap
qLap (g (θ)).
For each example, we will explicitly specify the target posterior p0 (θ) using a mixture of normals. This
will allow us to define known target distributions with varying degrees of skewness, over-dispersion, or
correlation and compare the truth with a variational approximation. Formally, for some fixed Kz , component
indicators zk , k = 1, ..., Kz , component probabilities πk , locations µk , and covariances Σk , we set
Kz
Y
p (z) = πkzk
k=1
X X Kz
Y zk
p0 (θ) = p (z) p (θ|z) = p (z) N (θ; mk , Σk ) .
z z k=1
The values π, m and Σ will be chosen to achieve the desired shape for each example using up to Kz = 3
components. There will be no need to state the precise values of π, m, and Σ; rather, we will show plots of
the target density and report the marginal means and variances, calculated by Monte Carlo.1
We will be interested in estimating the mean and variance of the first component, so we take g (θ) = θ1 .
Consequently, in order to calculate CovLRq0 (θ1 ), we will be considering the perturbation ρ (θ, α) = αθ1 with
scalar α and α0 = 0.
For the variational approximations, we will use a factorizing normal approximation:
( K
)
Y
2
Qmf = q (θ) : q (θ) = N θk ; µk , σk .
k=1
|
In terms of Eq. (7), we take η = (µ1 , ..., µK , log σ1 , ..., log σK ) . Thus Eq(θ;η) [g (θ)] = Eq(θ;η) [θ1 ] = µ1 .
In the examples below, we will use multiple distinct components in the definition of p0 (θ), so that p0 (θ) is
non-normal and p0 (θ) ∈ / Qmf .
Since the expectation Eq(θ;η) [log p (θ)] is intractable, we replace the exact KL divergence with a Monte
Carlo approximation using the “re-parameterization trick” (Kingma and Welling, 2013; Rezende et al., 2014;
iid
Titsias and Lázaro-Gredilla, 2014). Let ◦ denote the Hadamard (component-wise) product. Let ξm ∼
N (0, IK ) for m = 1, ..., M . We define
θm := σ ◦ ξm + µ
M K
1 X X
KLapprox (q (θ; η) ||p0 (θ)) := − log p0 (θm ) − log σk ,
M m=1
k=1
which is a Monte Carlo estimate of KL (q (θ; η) ||p0 (θ)). We found M = 10000 to be more than adequate
for our present purposes of illustration. Note that we used the same draws ξm for both optimization and for
the calculation of Hηη in order to ensure that the η0∗ at which Hηη was evaluated was in fact an optimum.
This approach is similar to our treatment of ADVI; see Section 5.2 for a more detailed discussion.
14
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Figure 1: A univariate skewed distribution. Vertical lines show the location of the means.
the variational approximation, the Laplace approximation, and the exact p0 (θ) all coincide in their estimates
of E [θ], and by, Corollary 2, Σ = Covp0 (θ) = CovLR Lap
q0 (θ) = CovqLap (θ). Of course, if Σ is not diagonal,
Covq0 (θ) 6= Σ because of the mean field assumption. Since this argument holds for the whole vector θ, it
holds a fortiori for our quantity of interest, the first component g (θ) = θ1 .
In other words, the Laplace approximation will differ only from the LRVB approximation when p0 (θ) is
not multivariate normal, a situation that we will now bring about by adding new components to the mixture;
i.e., by increasing Kz .
15
G IORDANO , B RODERICK , AND J ORDAN
Figure 3: A univariate over-dispersed distribution. Vertical lines show the location of the means.
The perturbation ρ (θ, α) = αθ1 is sometimes also described as a “tilting,” and the right panel of Fig. 1
shows the effect of tilting on this posterior approximation. Tilting increases skew, but the MFVB approxima-
tion remains accurate, as shown in Fig. 2. Since local sensitivity of the expectation of θ1 to α is the variance
of θ1 (see Eq. (13)), we have in Fig. 2 that:
• The slope of the exact distribution’s line is Covp0 (θ1 );
Since the MFVB and exact lines nearly coincide, we expect the LRVB variance estimate to be quite accurate
for this example. Similarly, since the slope of the Laplace approximation line is lower, we expect the Laplace
variance to underestimate the exact variance. This outcome, which can be seen visually in the left-hand panel
of Fig. 2, is shown quantitatively in the corresponding table in the right-hand panel. The columns of the table
contain information for the exact distribution and the three approximations. The first row, labeled “mean,”
shows E [θ1 ] and the second row, labeled “var,” shows Cov (θ1 ). (The “LRVB” entry for the mean is blank
because LRVB differs from MFVB only in covariance estimates.) We conclude that, in this case, Condition 1
holds for Qmf but not for QLap .
16
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
In this case, Condition 1 holds for Qmf . For the Laplace approximation, EqLap [g (θ)] = Ep0 [g (θ)] for
α = 0, so QLap satisfies Eq. (11) of Condition 1 for α near zero, the derivatives of the two expectations with
respect to α are quite different, so Eq. (12) of Condition 1 does not hold for QLap .
17
G IORDANO , B RODERICK , AND J ORDAN
One might say, in this case, that Condition 1 does not hold for either Qmf or QLap , or, if it does, it is
with a liberal interpretation of the “approximately equals” sign. However, the expressiveness of Qmf allows
LRVB to improve on the Laplace approximation, and the linear response allows it to improve over the MFVB
approximation, and so LRVB gives the best of both worlds.
Thinking about problems in terms of these three simple models can provide intuition about when and
whether Condition 1 might be expected to hold in a sense that is practically useful.
That is, Qad is a fully factorizing normal family with means µk and log standard deviations ζk . Because we
are including exponential family assumptions in the definition of MFVB (as described in Section 3.1), Qad
is an instance of a mean-field family Qmf . In the notation of Eq. (7),
|
η = (µ1 , ..., µK , ζ1 , ..., ζK ) , (22)
2. Kucukelbir et al. (2017) describe a non-factorizing version of ADVI, which is called “fullrank” ADVI in Stan. The factorizing
version that we describe here is called “meanfield” ADVI in Stan. On the examples we describe, in the current Stan implementation,
we found that fullrank ADVI provided much worse approximations to the MCMC posterior means than the meanfield version, and
so we do not consider it further.
18
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Ωη = R2K , λ is the Lebesgue measure, and the objective function Eq. (8) is
Z K
X
KL (q (θ; η) ||p0 (θ)) = − N (θk |µk , exp (2ζk )) log p0 (θ) λ (dθ) − ζk ,
k=1
where we have used the form of the univariate normal entropy up to a constant.
The unconstraining parameterization is required because the use of a normal variational approximation
dictates that the base measure on the parameters θ ∈ RK be supported on all of RK . Although many param-
eters of interest, such as covariance matrices, are not supported on RK , there typically exist differentiable
maps from an unconstrained parameterization supported on RK to the parameter of interest. Software pack-
ages such as Stan automatically provide such transforms for a broad set of parameter types. In our notation,
we will take these constraining maps to be the function of interest, g (θ), and take θ to be unconstrained. Note
that, under this convention, the prior p (θ|α) must be a density in the unconstrained space. In practice (e.g.,
in the Stan software package), one usually specifies the prior density in the constrained space and converts
it to a density p (θ|α) in the unconstrained space using the determinant of the Jacobian of the constraining
transform g (·).
The re-parameterization trick allows easy approximation of derivatives of the (generally intractable) ob-
jective KL (q (θ; η) ||p0 (θ)). By defining zk using the change of variable
KL (q (θ; η) ||p0 (θ)) can be re-written as an expectation with respect to a standard normal distribution. We
write θ = exp (ζ) ◦ z + µ by using the component-wise Hadamard product ◦. Then
K
X
KL (q (θ; η) ||p0 (θ)) = −Ez [log p0 (exp (ζ) ◦ z + µ)] − ζk + Constant.
k=1
The expectation is still typically intractable, but it can be approximated using Monte Carlo and draws from
a K-dimensional standard normal distribution. For a fixed number M of draws z1 , ..., zM from a standard
K-dimensional normal, we can define the approximate KL divergence
M K
d (η) := − 1
X X
KL log p0 (exp (ζ) ◦ zm + µ) − ζk + Constant. (24)
M m=1
k=1
so gradients of KLd (η) are unbiased for gradients of the exact KL divergence. Furthermore, for fixed draws
z1 , ..., zM , KL
d (η) can be easily differentiated (using, again, the re-parameterization trick). Standard ADVI
∂ d
uses this fact to optimize KL (q (θ; η) ||p0 (θ)) using the unbiased gradient draws ∂η KL (η) and a stochastic
gradient optimization method, where the stochasticity comes from draws of the standard normal random
variable z. Note that stochastic gradient methods typically use a new draw of z at every gradient step.
19
G IORDANO , B RODERICK , AND J ORDAN
To apply linear response to an ADVI approximation, we need to be able to approximate the Hessian of
KL (q (θ; η) ||p0 (θ)) and to be assured that we have found an optimal η0∗ . But, by using a stochastic gradient
method, ADVI avoids ever actually calculating the expectation in KL (q (θ; η) ||p0 (θ)). Furthermore even if
a stochastic gradient method finds an point that is close to the optimal value of KL (q (θ; η) ||p0 (θ)) it may
not be close to an optimum of KL d (η) for a particular finite M . Indeed, we found that, even for very large
M , the optimum found by ADVI’s stochastic gradient method is typically not close enough to an optimum
of the approximate KL d (η) for sensitivity calculations to be useful. Sensitivity calculations are based on
differentiating the fixed point equation given by the gradient being zero (see the proof in Appendix D), and
do not apply at points for which the gradient is not zero either in theory nor in practice.
Consequently, in order to calculate the local sensitivity, we simply eschew the stochastic gradient method
and directly optimize KLd (η) for a particular choice of M . (We will discuss shortly how to choose M .) We
can then use KL (η) in Eq. (10) rather than the exact KL divergence. Directly optimizing KL
d d (η) both frees
us to use second-order optimization methods, which we found to converge more quickly to a high-quality
optimum than first-order methods, and guarantees that we are evaluating the Hessian Hηη at an optimum of
the objective function used to calculate Eq. (10).
As M approaches infinity, we expect the optimum of KL d (η) to approach the optimum of KL (q (θ; η) ||p0 (θ))
by the standard frequentist theory of estimating equations (Keener, 2010, Chapter 9). In practice we must
fix a particular finite M , with larger M providing better approximations of the true KL divergence but at
increased computational cost. We can inform this tradeoff between accuracy and computation by considering
the frequentist variability of η0∗ when randomly sampling M draws of the random variable z used to approx-
d (η). Denoting this frequentist variability by Covz (η ∗ ), standard results
imate the intractable integral in KL 0
(Keener, 2010, Chapter 9) give that
!
∗ −1 ∂ d
Covz (η0 ) ≈ Hηη Covz KL (η) H−1
ηη . (25)
∂η η ∗
0
A sufficiently large M will be one for which Covz (η0∗ ) is adequately small. One notion of “adequately small”
might be that the ADVI means found with KL (η) are within some fraction of a posterior standard deviation
d
of the optimum of KL (q (θ; η) ||p0 (θ)). Having chosen a particular M , we can calculate the frequentist
variability of µ∗ using CovLR q0 (g (θ)) and estimate the posterior standard deviation using Eq. (14). If we find
that each µ∗ is probably within 0.5 standard deviations of the optimum of KL (q (θ; η) ||p0 (θ)), we can keep
the results; otherwise, we increase M and try again. In the examples we consider here, we found that the
relatively modest M = 10 satisfies this condition and provides sufficiently accurate results.
Finally, we note a minor departure from Eq. (14) when calculating CovLR q0 (g (θ)) from Hηη . Recall
that, in this case, we are taking g (·) to be ADVI’s constraining transform, and that Eq. (14) requires the
Jacobian, gη , of this transform. At the time of writing, the design of the Stan software package did not readily
support automatic calculation of gη , though it did support rapid evaluation of g (θ) at particular values of θ.
Consequently, we used linear response to estimate CovLR q0 (θ), drew a large number Ns of Monte Carlo draws
from θn ∼ N µ, CovLR
q0 (θ) for n = 1, ..., N s , and then used these draws to form a Monte Carlo estimate
of the sample covariance of g (θ). Noting that Eqα [θ] = µ, and recalling the definition of η for ADVI in
Eq. (22), by Eq. (14) we have
∂Eqα [θ] −1 ∂Eqα [θ| ]
IK 0 −1 IK 0
CovLRq0 (θ) = H ηη = H ηη ,
∂η | ∂η 0 0 0 0
which is the upper-left quarter of the matrix H−1
ηη . In addition to obviating the need for gη , this approach also
allowed us to take into account possible nonlinearities in g (·) at little additional computational cost.
5.2.2 R ESULTS
We present results from four models taken from the Stan example set, namely the models election88
(“Election model”), sesame street1 (“Sesame Street model”), radon vary intercept floor (“Radon
20
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
model”), and cjs cov randeff (“Ecology model”). We experimented with many models from the Stan
examples and selected these four as representative of the type of model where LR-ADVI can be expected
to provide a benefit—specifically, they are models of a moderate size. For very small models, MCMC runs
quickly enough in Stan that fast approximations are not necessary, and for very large models (with thousands
of parameters) the relative advantages of LR-ADVI and the Laplace approximation diminish due to the need
to calculate Hηη or HLap using automatic differentiation.3 The size of the data and size of the parameter
space for our four chosen models are shown in Fig. 11. We also eliminated from consideration models where
Stan’s MCMC algorithm reported divergent transitions or where Stan’s ADVI algorithm returned wildly in-
accurate posterior mean estimates.
For brevity, we do not attempt to describe the models or data in any detail here; rather, we point to the
relevant literature in their respective sections. The data and Stan implementations themselves can be found
on the Stan website (Stan Team, 2017) as well as in Appendix F.
To assess the accuracy of each model, we report means and standard deviations for each of Stan’s model
parameters as calculated by Stan’s MCMC and ADVI algorithms and a Laplace approximation, and we report
the standard deviations as calculated by CovLR q0 (g (θ)). Recall that, in our notation, g (·) is the (generally
nonlinear) map from the unconstrained latent ADVI parameters to the constrained space of the parameters
of interest. The performance of ADVI and Laplace vary, and only LR-ADVI provides a consistently good
approximation to the MCMC standard deviations. LR-ADVI was somewhat slower than a Laplace approxi-
mation or ADVI alone, but it was typically about five times faster than MCMC; see Section 5.2.7 for detailed
timing results.
21
G IORDANO , B RODERICK , AND J ORDAN
approximation and ADVI do a reasonable job of matching to MCMC, though LR-ADVI is slightly more
accurate for standard deviations.
22
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Finally, we consider a more complicated mark-recapture model from ecology known as the Cormack-Jolly-
Seber (CJS) model. This model is described in detail in Kéry and Schaub (2011, Chapter 7), and discussion
of the Stan implementation can be found in Stan Team (2015, Section 15.3).
The Laplace approximation is again degenerate, and the ADVI standard deviations again deviate consid-
erably from MCMC. In this case, the ADVI means are also somewhat inaccurate, and some of the LR-ADVI
standard deviations are mis-estimated in turn. However, LR-ADVI remains by far the most accurate method
for approximating the MCMC standard errors.
23
G IORDANO , B RODERICK , AND J ORDAN
24
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
These data-cleaning decisions were made for convenience. The goal of the present paper is to demonstrate
our inference methods, not to draw conclusions about online advertising.
Although the meaning of the covariates has been obfuscated, for the purpose of discussion we will imag-
ine that the single retained factor-valued covariate represents the identity of the advertiser, and the numeric
covariates represent salient features of the user and/or the advertiser (e.g., how often the user has clicked
or converted in the past, a machine learning rating for the advertisement quality, etc.). As such, it makes
sense to model the probability of each row’s binary outcome (whether or not the user converted) as a function
of the five numeric covariates and the advertiser identity using a logistic GLMM. Specifically, we observe
binary conversion outcomes, yit , for click i on advertiser t, with probabilities given by observed numerical
explanatory variables, xit , each of which are vectors of length Kx = 5. Additionally, the outcomes within
a given value of t are correlated through an unobserved random effect, ut , which represents the “quality” of
advertiser t, where the value of t for each observation is given by the factor-valued covariate. The random
effects ut are assumed to follow a normal distribution with unknown mean and variance. Formally,
N µ0 , τµ−1
µ|µ0 , τµ ∼
τ |ατ , βτ ∼ Gamma (ατ , βτ )
−1
β0 τβ γβ γβ
. ..
β|β0 , τβ , γβ ∼ N .. , γβ . γβ .
β0 γβ γβ τβ
Note that we initially take γβ = 0 so that the prior information matrix on β is diagonal. Nevertheless, by
retaining γβ as a hyperparameter we will be able to assess the sensitivity to the assumption of a diagonal
prior in Section 5.3.6. The remaining prior values are given in Appendix H. It is reasonable to expect that
a modeler would be interested both in the effect of the numerical covariates and in the quality of individual
|
advertisers themselves, so we take the parameter of interest to be g (θ) = (β | , u1 , ..., uT ) .
To produce a data set small enough to be amenable to MCMC but large and sparse enough to demonstrate
our methods, we subsampled the data still further. We randomly chose 5000 distinct advertisers to analyze,
and then subsampled each selected advertiser to contain no more than 20 rows each. The resulting data set
had N = 61895 total rows. If we had more observations per advertiser, the “random effects” ut would have
been estimated quite precisely, and the nonlinear nature of the problem would not have been important; these
changes would thus have obscured the benefits of using MFVB versus the Laplace approximation. In typical
internet data sets a large amount of data comes from advertisers with few observations each, so our subsample
is representative of practically interesting problems.
25
G IORDANO , B RODERICK , AND J ORDAN
Method Seconds
MAP (optimum only) 12
VB (optimum only) 57
VB (including sensitivity for β) 104
VB (including sensitivity for β and u) 553
MCMC (Stan) 21066
For the MCMC estimates, we used Stan to draw 5000 MCMC draws (not including warm-up), which
took 351 minutes. We estimated all the prior sensitivities of Section 5.3.6 using the Monte Carlo version of
the covariance in Eq. (5).
For the MFVB approximation, we use the following mean field exponential family approximations:
With these choices, evaluating the variational objective requires the following intractable univariate varia-
tional expectation:
eρit
Eq(θ;η) [log (1 − pit )] = Eq(θ;η) log 1 − .
1 + eρit
We used the re-parameterization trick and four points of Gauss-Hermite quadrature to estimate this integral
for each observation. See Appendix H for more details.
We optimized the variational objective using the conjugate gradient Newton’s trust region method, trust-ncg,
of scipy.optimize. One advantage of trust-ncg is that it performs second-order optimization but
requires only Hessian-vector products, which can be computed quickly by autograd without constructing
the full Hessian. The MFVB fit took 57 seconds, roughly 370 times faster than MCMC with Stan.
With variational parameters for each random effect ut , Hηη is a 10014 × 10014 dimensional matrix.
Consequently, evaluating Hηη directly as a dense matrix using autograd would have been prohibitively
time-consuming. Fortunately, our model can be decomposed into global and local parameters, and the Hes-
| |
sian term Hηη in Theorem 2 is extremely sparse. In the notation of Section 4.5, take θglob = (β , µ, τ ) ,
|
|
take θloc,t = ut , and stack the variational parameters as η = ηglob , ηloc,1 , ..., ηloc,T . The cross terms in
Hηη between the local variables vanish:
Equivalently, note that the full likelihood in Appendix H, Eq. (31), has no cross terms between ut1 and ut2
for t1 6= t2 . As the dimension T of the data grows, so does the length of η. However, the dimension of ηglob
remains constant, and Hηη remains easy to invert. We show an example of the sparsity pattern of the first few
rows and columns of Hηη in Fig. 12 .
Taking advantage of this sparsity pattern, we used autograd to calculate the Hessian of the KL diver-
gence one group at a time and assembled the results in a sparse matrix using the scipy.sparse Python
26
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Figure 12: Sparsity pattern of top-left sub-matrix of Hηη for the logit GLMM model. The axis numbers
represent indices within η, and black indicates non-zero entries of Hηη .
package. Even so, calculating the entire sparse Hessian took 323 seconds, and solving the system H−1 |
ηη gη
using scipy.sparse.linalg.spsolve took an additional 173 seconds. These results show that the
evaluation and inversion of Hηη was several times more costly than optimizing the variational objective itself.
(Of course, the whole procedure remains much faster than running MCMC with Stan.)
We note, however, that instead of the direct approach to calculating H−1 |
ηη gη one can use the conjugate
gradient algorithm of sp.sparse.linalg.cg (Wright and Nocedal, 1999, Chapter 5) together with the
fast Hessian-vector products of autograd to query one column at a time of H−1 |
ηη gη . On a typical column
−1 |
of Hηη gη in our experiment, calculating the conjugate gradient took only 9.4 seconds (corresponding to 81
Hessian-vector products in the conjugate gradient algorithm). Thus, for example, one could calculate the
columns of H−1 |
ηη gη corresponding to the expectations of the global variables β in only 9.4 × Kx = 46.9
seconds, which is much less time than it would take to compute the entire H−1 |
ηη gη for both β and every
random effect in u.
For the Laplace approximation, we calculated the MAP estimator and HLap using Python code similar to
that used for the MFVB estimates. We observe that the MFVB approximation to posterior means would be
expected to improve on the MAP estimator only in cases when there is both substantial uncertainty in some
parameters and when this uncertainty, through nonlinear dependence between parameters, affects the values
of posterior means. These circumstances obtain in the logistic GLMM model with sparse per-advertiser data
since the random effects ut will be quite uncertain and the other posterior means depend on them through the
nonlinear logistic function.
27
G IORDANO , B RODERICK , AND J ORDAN
Parameter MCMC MFVB MAP MCMC std. err. Eff. # of MCMC draws
β1 1.454 1.447 1.899 0.02067 33
β2 0.031 0.033 0.198 0.00025 5000
β3 0.110 0.110 0.103 0.00028 5000
β4 -0.172 -0.173 -0.173 0.00016 5000
β5 0.273 0.273 0.280 0.00042 5000
µ 2.041 2.041 3.701 0.04208 28
τ 0.892 0.823 827.724 0.00051 1232
u1431 1.752 1.757 3.700 0.00937 5000
u4150 1.217 1.240 3.699 0.01022 5000
u4575 2.427 2.413 3.702 0.00936 5000
u4685 3.650 3.633 3.706 0.00862 5000
and, perhaps, excessive measured prior sensitivity, as discussed below in Section 5.3.6. Although we will
report the results for both β1 and µ without further comment, the reader should bear in mind that the MCMC
“ground truth” for these two parameters is somewhat suspect.
The results in Table 2 and Fig. 13 show that MFVB does an excellent job of approximating the posterior
means in this particular case, even for the random effects u and the related parameters µ and τ . In contrast,
the MAP estimator does reasonably well only for certain components of β and does extremely poorly for
the random effects parameters. As can be seen in Fig. 14, the MAP estimate dramatically overestimates the
information τ of the random effect distribution (that is, it underestimates the variance). As a consequence, it
estimates all the random effects to have essentially the same value, leading to mis-estimation of some location
parameters, including both µ and some components of β. Since the MAP estimator performed so poorly at
estimating the random effect means, we will not consider it any further.
28
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
estimated covariances between the global parameters and all other parameters, including the random effects,
and the right panel shows only the covariances amongst the random effects. The LRVB covariances are quite
accurate, particularly when we recall that the MCMC draws of µ may be inaccurate due to poor mixing.
diagonal vector of a matrix, and the division is element-wise. Note that we use the sensitivity-based variance
estimates CovLRq0 , not the uncorrected MFVB estimates Covq0 , to normalize the variational sensitivities. We
refer to a sensitivity divided by a standard deviation as a “normalized” sensitivity.
The comparison between the MCMC and MFVB sensitivity measures is shown in Fig. 17. The MFVB and
MCMC sensitivities correspond very closely, though the MFVB means appear to be slightly more sensitive to
the prior parameters than the MCMC means. This close correspondence should not be surprising. As shown
in Section 5.3.3, the MFVB and MCMC posterior means match quite closely. If we assume, reasonably, that
they continue to match to first order in a neighborhood of our original prior parameters, then Condition 1 will
hold and we would expect Ŝα0 ≈ Sqα0 .
29
G IORDANO , B RODERICK , AND J ORDAN
Table 4 shows the detailed MFVB normalized sensitivity results. Each entry is the sensitivity of the
MFVB mean of the row’s parameter to the column’s prior parameter. One can see that several parameters
are quite sensitive to the information parameter prior τµ . In particular, Epα [µ] and Epα [β1 ] are expected to
change approximately −0.39 and −0.35 standard deviations, respectively, for every unit change in τµ . This
size of change could be practically significant (assuming that such a change in τµ is subjectively plausible).
To investigate this sensitivity further, we re-fit the MFVB model at a range of values of the prior parameter
τµ , assessing the accuracy of the linear approximation to the sensitivity. The results are shown in Fig. 18.
Even for very large changes in τµ —resulting in changes to Epα [µ] and Epα [β1 ] far in excess of two standard
deviations—the linear approximation holds up reasonably well. Fig. 18 also shows a (randomly selected)
random effect to be quite sensitive, though not to a practically important degree relative to its posterior
standard deviation. The insensitivity of Epα [β2 ] is also confirmed. Of course, the accuracy of the linear
approximation cannot be guaranteed to hold as well in general as it does in this particular case, and the quick
30
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
β0 τβ γβ µ0 τµ ατ βτ
µ 0.0094 -0.1333 -0.0510 0.0019 -0.3920 0.0058 -0.0048
τ 0.0009 -0.0086 -0.0142 0.0003 -0.0575 0.0398 -0.0328
β1 0.0089 -0.1464 -0.0095 0.0017 -0.3503 0.0022 -0.0018
β2 0.0012 -0.0143 -0.0113 0.0003 -0.0516 0.0062 -0.0051
β3 -0.0035 0.0627 -0.0081 -0.0006 0.1218 -0.0003 0.0002
β4 0.0018 -0.0037 -0.0540 0.0004 -0.0835 0.0002 -0.0002
β5 0.0002 0.0308 -0.0695 0.0002 -0.0383 0.0011 -0.0009
u1431 0.0028 -0.0397 -0.0159 0.0006 -0.1169 0.0018 -0.0015
u4150 0.0026 -0.0368 -0.0146 0.0005 -0.1083 0.0022 -0.0018
u4575 0.0028 -0.0406 -0.0138 0.0006 -0.1153 0.0011 -0.0009
u4685 0.0028 -0.0409 -0.0142 0.0006 -0.1163 0.0003 -0.0002
Figure 17: Comparison of MCMC and MFVB normalized parametric sensitivity results
and reliable evaluation of the linearity assumption without re-fitting the model remains interesting future
work.
Since we started the MFVB optimization close to the new, perturbed optimum, each new MFVB fit took
only 27.2 seconds on average. Re-estimating the MCMC posterior so many times would have been extremely
time-consuming. (Note that importance sampling would be useless for prior parameter changes that moved
the posterior so far from the original draws.) The considerable sensitivity of this model to a particular prior
parameter, which is perhaps surprising on such a large data set, illustrates the value of having fast, general
tools for discovering and evaluating prior sensitivity. Our framework provides just such a set of tools.
6. Conclusion
By calculating the sensitivity of MFVB posterior means to model perturbations, we are able to provide two
important practical tools for MFVB posterior approximations: improved variance estimates and measures of
prior robustness. When MFVB models are implemented in software that supports automatic differentiation,
our methods are fast, scalable, and require little additional coding beyond the MFVB objective itself. In our
experiments, we were able to calculate accurate posterior means, covariances, and prior sensitivity measures
orders of magnitude more quickly than MCMC.
31
G IORDANO , B RODERICK , AND J ORDAN
Figure 18: MFVB sensitivity as measured both by linear approximation (blue) and re-fitting (red)
Acknowledgments
We are grateful to the anonymous reviewers for their insightful comments and suggestions. Ryan Giordano’s
research was funded in part by the National Energy Research Scientific Computing Center, a DOE Office of
Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract
number DE-AC02- 05CH11231, and in part by the Gordon and Betty Moore Foundation through Grant
GBMF3834 and by the Alfred P. Sloan Foundation through Grant 2013-10-27 to the University of California,
Berkeley. Tamara Broderick’s research was supported in part by an NSF CAREER Award, an ARO YIP
Award, and a Google Faculty Research Award. This work was also supported by the DARPA program on
Lifelong Learning Machines, the Office of Naval Research under contract/grant number N00014-17-1-2072,
and the Army Research Office under grant number W911NF-17-1-0304.
32
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Appendices
Appendix A. Proof of Theorem 1
In this section we prove Theorem 1.
∂
R
Proof Under Assumption 1, we can exchange differentiation and integration in ∂α | p0 (θ) exp (ρ (θ, α)) g (θ) λ (dθ)
∂
R
and ∂α| p0 (θ) exp (ρ (θ, α)) λ (dθ) by Fleming (1965, Chapter 5-11, Theorem 18), which ultimately de-
pends on the Lebesgue dominated convergence theorem. By Assumption 1, Epα [g (θ)] is well-defined for
α ∈ A0 and
∂p0 (θ) exp (ρ (θ, α)) ∂ρ (θ, α)
= p0 (θ) exp (ρ (θ, α)) λ-almost everywhere.
∂α ∂α
Armed with these facts, we can directly compute
R
dEpα [g (θ)] d g (θ) p0 (θ) exp (ρ (θ, α)) λ (dθ)
= R
dα| α0 dα | p0 (θ) exp (ρ (θ, α)) λ (dθ) α0
∂ ∂
R R
∂α| g (θ) p0 (θ) exp (ρ (θ, α)) λ (dθ) α0 ∂α| p0 (θ) exp (ρ (θ, α)) λ (dθ) α0
= R − Ep0 [g (θ)] R
p0 (θ) exp (ρ (θ, α0 )) λ (dθ) p0 (θ) exp (ρ (θ, α0 )) λ (dθ)
R ∂ρ(θ,α)
g (θ) p0 (θ) exp (ρ (θ, α)) ∂α λ (dθ)
" #
α0 ∂ρ (θ, α)
= R − Ep0 [g (θ)] Ep0
p0 (θ) exp (ρ (θ, α0 )) λ (dθ) ∂α α0
!
∂ρ (θ, α)
= Covp0 g (θ) , .
∂α α0
Typically we cannot compute the dependence of the normalizing constant p (θ0 ) exp (ρ (θ0 , α)) λ (dθ0 ) on
R
α, so we use the following importance sampling estimate for Epα [g (θ)] (Owen, 2013, Chapter 9):
Ns
X
Epα [g (θ)] ≈ w̃n g (θn ) .
n=1
33
G IORDANO , B RODERICK , AND J ORDAN
Note that w̃n |α0 = N1s , so the importance sampling estimate recovers the ordinary sample mean at α0 . The
derivatives of the weights are given by
∂wn ∂ρ (θn , α)
= wn
∂α ∂α
PN
∂ w̃n ∂wn
∂α wn n0s=1 ∂w
∂α
n0
= PNs − P 2
∂α n0 =1 wn
0 Ns
n0 =1 wn
0
Ns
wn ∂ρ (θn , α) wn X wn ∂ρ (θn0 , α)
= PNs − PNs PNs
n0 =1 w n0 ∂α n0 =1 wn
0 0 n0 =1 wn
0 ∂α
n =1
Ns
∂ρ (θn , α) X ∂ρ (θn0 , α)
= w̃n − w̃n w̃n0 .
∂α ∂α
n0 =1
It follows that
Ns Ns Ns
!
∂ X X ∂ρ (θn , α) X ∂ρ (θn0 , α)
w̃n g (θn ) = w̃n − w̃n w̃n0 g (θn )
∂α n=1 n=1
∂α 0
∂α
α0 n =1 α0
Ns
" Ns
#" Ns
#
1 X ∂ρ (θn , α) 1 X ∂ρ (θn , α) 1 X
= g (θn ) − g (θn ) ,
Ns n=1 ∂α α0 Ns n=1 ∂α α0 Ns n=1
The set of plausible prior values necessarily remains a subjective decision.4 Whether or not a particular
change in Epα [g (θ)] is important depends on the ultimate use of the posterior mean. For example, the
posterior standard deviation can be a guide: if the prior sensitivity is swamped by the posterior uncertainty
then it can be neglected when reporting our subjective uncertainty about g (θ), and the model is robust.
Similarly, even if the prior sensitivity is much larger than the posterior standard deviation but small enough
that it would not affect any actionable decision made on the basis of the value of Epα [g (θ)], then the model
is robust. Intermediate values remain a matter of judgment. An illustration of the relationship between
sensitivity and robustness is shown in Fig. 19.
Finally, we note that if A is small enough that Epα [g (θ)] is roughly linear in α for α ∈ A, then calculating
Eq. (1) for all α ∈ A and finding the worst case can be thought of as a first-order approximation to a global
robustness estimate. Depending on the problem at hand, this linearity assumption may not be plausible except
4. This decision can be cast in a formal decision theoretic framework based on a partial ordering of subjective beliefs (Insua and
Criado, 2000).
34
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
for very small A. This weakness is inherent to the local robustness approach. Nevertheless, even when
the perturbations are valid only for a small A, these easily-calculable measures may still provide valuable
intuition about the potential modes of failure for a model.
If g (θ) is a scalar, it is natural to attempt to summarize the high-dimensional vector Sα0 in a single easily
reported number such as
Ssup
α0 := sup S|α0 (α − α0 ) .
α:kα−α0 k≤1
For example, the calculation of Ssupα0 is the principal ambition of Basu et al. (1996). The use of such sum-
maries is also particularly common in work that considers function-valued perturbations (e.g., Gustafson,
1996b; Roos et al., 2015). (Function-valued perturbations can be connected to the finite-dimensional per-
turbations of the present work through the notion of the Gateaux derivative (Huber, 2011, Chapter 2.5), the
elaboration of which we leave to future work.) Although the summary Ssup α0 has obvious merits, in the present
work we emphasize the calculation only of Sα0 in the belief that its interpretation is likely to vary from ap-
plication to application and require some critical thought and subjective judgment. For example, the unit
ball kα − α0 k ≤ 1 (as in Basu et al. (1996)) may not make sense as a subjective description of the range of
plausible variability of p (θ|α). Consider, e.g.: why should the off-diagonal term of a Wishart prior plausibly
vary as widely as the mean of some other parameter, when the two might not even have the same units? This
problem is easily remedied by choosing an appropriate scaling of the parameters and thereby making the unit
ball an appropriate range for the problem at hand, but the right scaling will vary from problem to problem
and necessarily be a somewhat subjective choice, so we refrain from taking a stand on this decision. As
another example, the worst-case function-valued perturbations of Gustafson (1996a,b) require a choice of a
metric ball in function space whose meaning may not be intuitively obvious, may provide worst-case pertur-
bations that depend on the data to a subjectively implausible degree, and may exhibit interesting but perhaps
counter-intuitive asymptotic behavior for different norms and perturbation dimensions. Consequently, we
do not attempt to prescribe a particular one-size-fits-all summary measure. The local sensitivity Sα0 is a
well-defined mathematical quantity. Its relationship to robustness must remain a matter of judgment.
35
G IORDANO , B RODERICK , AND J ORDAN
By Assumption 3, η ∗ (α) is both optimal and interior for all α ∈ A0 , and by Assumption 2, KL (η, α) is
continuously differentiable in η. Therefore, the first-order conditions of the optimization problem in Eq. (8)
give:
∂KL (η, α)
= 0 for all α ∈ A0 . (26)
∂η η=η ∗ (α)
∂ 2 KL(η,α) ∂ 2 KL(η,α)
∂η∂η | is positive definite by the strict optimality of η ∗ in Assumption 3, and ∂η∂α| is continuous
α0
∗
by Assumption 2. It follows that η (α) is a continuously differentiable function of α by application of the
implicit function theorem to the first-order condition in Eq. (26) (Fleming, 1965, Chapter 4.6). So we can use
the chain rule to take the total derivative of Eq. (26) with respect to α.
!
d ∂KL (η, α)
= 0 for all α ∈ A0 ⇒
dα ∂η η=η ∗ (α)
∂ 2 KL(η,t)
The strict optimality of KL (η, α) at η ∗ (α) in Assumption 3 requires that ∂η∂η T
be invertible.
η=η ∗ (α)
So we can evaluate at α = α0 and solve to find that
−1 2
dη ∗ (α)
2
∂ KL (η, α) ∂ KL (η, α)
= − .
dα| α0 ∂η∂η | ∂η∂α|
η=η0∗ ,α=α0
Eqα [g (θ)] is a continuously differentiable function of η ∗ (α) by Assumption 4. So by the chain rule and
Assumption 2, we have that
dEq(θ;η) [g (θ)] ∂Eq(θ;η) [g (θ)] dη ∗ (α)
= .
dα| α0 ∂η dα| η=η0∗ ,α=α0
Here, the term Constant contains quantities that do not depend on η. Plugging in gives the desired result.
p0 (θ) = N (θ; µ, Σ)
36
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
Q
for full-rank Σ. This posterior arises, for instance, given a multivariate normal likelihood p (x|µ) = n=1:N N (xn |θ, Σx )
with known covariance Σx and a conjugate multivariate normal prior on the unknown mean parameter
θ ∈ RK . Additionally, even when the likelihood is non-normal or the prior is not conjugate, the posterior
may be closely approximated by a multivariate normal distribution when a Bayesian central limit theorem
can be applied (Le Cam and Yang, 2012, Chapter 8).
We will consider an MFVB approximation to p0 (θ). Specifically, let the elements of the vector θ be given
by scalars θk for k = 1, ..., K, and take the MFVB normal approximation with means mk and variances vk :
( K
)
Y
Q = q (θ) : q (θ) = N (θk ; mk , vk ) .
k=1
|
In the notation of Eq. (9), we have ηk = (mk , vk ) with Ωη = {η : vk > 0, ∀k = 1, ..., K}. The optimal
|
variational parameters are given by ηk∗ = (m∗k , vk∗ ) .
n QK o
Lemma 1 Let p0 (θ) = N (θ; µ, Σ) for full-rank Σ and let Q = q (θ) : q (θ) = k=1 N (θk ; mk , vk ) be
the mean field approximating family. Then there exists an η ∗ = (m∗ , v ∗ ) that solves
η ∗ = argmin KL (q (θ; η) ||pα (θ))
η:q(θ;η)∈Q
∗
with m = µ.
Proof Let diag (v) denote the K × K matrix with the vector v on the diagonal and zero elsewhere. Using
the fact that the entropy of a univariate normal distribution with variance v is 21 log v plus a constant, the
variational objective in Eq. (8) is given by
1 | 1X
KL (q (θ; η) ||pα (θ)) = Eq(θ;η) (θ − µ) Σ−1 (θ − µ) − log vk + Constant
2 2
k
1 1X
= trace Σ−1 Eq(θ;η) [θθ| ] − µ| Σ−1 Eq(θ;η) [θ] −
log vk + Constant
2 2
k
1 1X
= trace Σ−1 (mm| + diag (v)) − µ| Σ−1 m −
log vk + Constant
2 2
k
1 1 1X
= trace Σ−1 diag (v) + m| Σ−1 m − µ| Σ−1 m − log vk + Constant.
2 2 2
k
(27)
The first-order condition for the optimal m∗ is then
∂KL (q (θ; η) ||p0 (θ))
=0⇒
∂m m=m∗ ,v=v ∗
−1 ∗ −1
Σ m −Σ µ=0⇒
m∗ = µ.
The optimal variances follow similarly:
∂KL (q (θ; η) ||p0 (θ))
=0⇒
∂vk m=m∗ ,v=v ∗
1 1 1
Σ−1 kk −
=0⇒
2 2 vk∗
1
vk∗ = .
Σ−1 kk
37
G IORDANO , B RODERICK , AND J ORDAN
Listing 1: election88.stan
1 data {
2 int<lower=0> N;
3 int<lower=0> n_state;
4 vector<lower=0,upper=1>[N] black;
5 vector<lower=0,upper=1>[N] female;
6 int<lower=1,upper=n_state> state[N];
7 int<lower=0,upper=1> y[N];
8 }
9 parameters {
10 vector[n_state] a;
11 vector[2] b;
12 real<lower=0,upper=100> sigma_a;
13 real mu_a;
14 }
38
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
15 transformed parameters {
16 vector[N] y_hat;
17
18 for (i in 1:N)
19 y_hat[i] <- b[1] * black[i] + b[2] * female[i] + a[state[i]];
20 }
21 model {
22 mu_a ˜ normal(0, 1);
23 a ˜ normal (mu_a, sigma_a);
24 b ˜ normal (0, 100);
25 y ˜ bernoulli_logit(y_hat);
26 }
F.2 Sesame Street Model (sesame street1)
27 //data level
28 Sigma_yt[1,1] <- pow(sigma_y,2);
29 Sigma_yt[2,2] <- pow(sigma_t,2);
30 Sigma_yt[1,2] <- rho_yt*sigma_y*sigma_t;
31 Sigma_yt[2,1] <- Sigma_yt[1,2];
32
33 // group level
34 Sigma_ag[1,1] <- pow(sigma_a,2);
35 Sigma_ag[2,2] <- pow(sigma_g,2);
39
G IORDANO , B RODERICK , AND J ORDAN
39 for (j in 1:J) {
40 a[j] <- ag[j,1];
41 g[j] <- ag[j,2];
42 }
43
44 for (i in 1:N) {
45 yt_hat[i,2] <- g[siteset[i]] + d * z[i];
46 yt_hat[i,1] <- a[siteset[i]] + b * yt[i,2];
47 }
48
49 //data level
50 sigma_y ˜ uniform (0, 100);
51 sigma_t ˜ uniform (0, 100);
52 rho_yt ˜ uniform(-1, 1);
53 d ˜ normal (0, 31.6);
54 b ˜ normal (0, 31.6);
55
56 //group level
57 sigma_a ˜ uniform (0, 100);
58 sigma_g ˜ uniform (0, 100);
59 rho_ag ˜ uniform(-1, 1);
60 mu_ag ˜ normal (0, 31.6);
61
62 for (j in 1:J)
63 ag[j] ˜ multi_normal(mu_ag,Sigma_ag);
64
65 //data model
66 for (i in 1:N)
67 yt[i] ˜ multi_normal(yt_hat[i],Sigma_yt);
68
69 }
F.3 Radon Model (radon vary intercept floor)
40
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
14 real<lower=0,upper=100> sigma_y;
15 }
16 transformed parameters {
17 vector[N] y_hat;
18
19 for (i in 1:N)
20 y_hat[i] <- a[county[i]] + u[i] * b[1] + x[i] * b[2];
21 }
22 model {
23 mu_a ˜ normal(0, 1);
24 a ˜ normal(mu_a, sigma_a);
25 b ˜ normal(0, 1);
26 y ˜ normal(y_hat, sigma_y);
27 }
F.4 Ecology Model (cjs cov randeff)
4 functions {
5 int first_capture(int[] y_i) {
6 for (k in 1:size(y_i))
7 if (y_i[k])
8 return k;
9 return 0;
10 }
11
28 for (i in 1:nind) {
29 chi[i, n_occasions] = 1.0;
30 for (t in 1:(n_occasions - 1)) {
31 // Compoud declaration was enabled in Stan 2.13
32 int t_curr = n_occasions - t;
33 int t_next = t_curr + 1;
41
G IORDANO , B RODERICK , AND J ORDAN
34 /*
35 int t_curr;
36 int t_next;
37
38 t_curr = n_occasions - t;
39 t_next = t_curr + 1;
40 */
41 chi[i, t_curr] = (1 - phi[i, t_curr])
42 + phi[i, t_curr] * (1 - p[i, t_next - 1]) * chi[
i, t_next];
43 }
44 }
45 return chi;
46 }
47 }
48
49 data {
50 int<lower=0> nind; // Number of individuals
51 int<lower=2> n_occasions; // Number of capture occasions
52 int<lower=0,upper=1> y[nind, n_occasions]; // Capture-history
53 vector[n_occasions - 1] x; // Covariate
54 }
55
56 transformed data {
57 int n_occ_minus_1 = n_occasions - 1;
58 // int n_occ_minus_1;
59 int<lower=0,upper=n_occasions> first[nind];
60 int<lower=0,upper=n_occasions> last[nind];
61 vector<lower=0,upper=nind>[n_occasions] n_captured;
62
63 // n_occ_minus_1 = n_occasions - 1;
64 for (i in 1:nind)
65 first[i] = first_capture(y[i]);
66 for (i in 1:nind)
67 last[i] = last_capture(y[i]);
68 n_captured = rep_vector(0, n_occasions);
69 for (t in 1:n_occasions)
70 for (i in 1:nind)
71 if (y[i, t])
72 n_captured[t] = n_captured[t] + 1;
73 }
74
75 parameters {
76 real beta; // Slope parameter
77 real<lower=0,upper=1> mean_phi; // Mean survival
78 real<lower=0,upper=1> mean_p; // Mean recapture
79 vector[n_occ_minus_1] epsilon;
80 real<lower=0,upper=10> sigma;
81 // In case a weakly informative prior is used
82 // real<lower=0> sigma;
83 }
42
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
84
85 transformed parameters {
86 matrix<lower=0,upper=1>[nind, n_occ_minus_1] phi;
87 matrix<lower=0,upper=1>[nind, n_occ_minus_1] p;
88 matrix<lower=0,upper=1>[nind, n_occasions] chi;
89 // Compoud declaration was enabled in Stan 2.13
90 real mu = logit(mean_phi);
91 // real mu;
92
93 // mu = logit(mean_phi);
94 // Constraints
95 for (i in 1:nind) {
96 for (t in 1:(first[i] - 1)) {
97 phi[i, t] = 0;
98 p[i, t] = 0;
99 }
100 for (t in first[i]:n_occ_minus_1) {
101 phi[i, t] = inv_logit(mu + beta * x[t] + epsilon[t]);
102 p[i, t] = mean_p;
103 }
104 }
105
109 model {
110 // Priors
111 // Uniform priors are implicitly defined.
112 // mean_phi ˜ uniform(0, 1);
113 // mean_p ˜ uniform(0, 1);
114 // sigma ˜ uniform(0, 10);
115 // In case a weakly informative prior is used
116 // sigma ˜ normal(5, 2.5);
117 beta ˜ normal(0, 100);
118 epsilon ˜ normal(0, sigma);
119
43
G IORDANO , B RODERICK , AND J ORDAN
R = M−1 ΣM−1 .
Define the LKJ prior on R with concentration parameter ξ (Lewandowski et al., 2009):
ξ−1
pLKJ (R|ξ) ∝ |R| .
Let q Σ|V−1 , ν be an inverse Wishart distribution with matrix parameter V−1 and ν degrees of freedom.
Then
K
−1
ν X
−1
ν−K +1
Eq [log |R|] = log V − ψK − log V kk
+ Kψ + Constant
2 2
k=1
Eq [log pLKJ (R|ξ)] = (ξ − 1) Eq [log |R|] + Constant,
where Constant does not depend on V or ν. Here, ψK is the multivariate digamma function.
Proof First note that
By Eq. B.81 in (Bishop, 2006), a property of the inverse Wishart distribution is the following relation.
ν
Eq [log |Σ|] = log V−1 − ψK − K log 2, (29)
2
44
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
where ψK is the multivariate digamma function. By the marginalization property of the inverse Wishart
distribution,
Plugging Eq. (29) and Eq. (30) into Eq. (28) gives the desired result.
µ0 = 0.000
σµ−2 = 0.010
β0 = 0.000
σβ−2 = 0.100
ατ = 3.000
βτ = 3.000.
Under the variational approximation, ρit is normally distributed given xit , with
ρit = xTit β + ut
Eq [ρit ]= xTit Eq [β] + Eq [ut ]
T
Varq (ρit ) = Eq β T xit xTit β − Eq [β] xit xTit Eq [β] + Varq (ut )
T
= Eq tr β T xit xTit β − tr Eq [β] xit xTit Eq [β] + Varq (ut )
T
= tr xit xTit Eq ββ T − Eq [β] Eq [β]
+ Varq (ut ) .
45
G IORDANO , B RODERICK , AND J ORDAN
h i
eρ
We can thus use nM C = 4 points of Gauss-Hermite quadrature to numerically estimate Eq log 1 − 1+eρ :
q
ρit,s := Varq (ρit )zs + Eq [ρit ]
nM C
eρit eρit,s
1 X
Eq log 1 − ≈ log 1 −
1 + eρit nM C s=1 1 + eρit,s
We found that increasing the number of points used for the quadrature did not measurably change any of
the results. The integration points and weights were calculated using the numpy.polynomial.hermite
module in Python (Jones et al., 2001).
46
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
References
A. Agresti and M. Kateri. Categorical Data Analysis. Springer, 2011.
S. Basu, S. Rao Jammalamadaka, and W. Liu. Local posterior robustness with parametric priors: Maximum
and average sensitivity. In Maximum Entropy and Bayesian Methods, pages 97–106. Springer, 1996.
A. Baydin, B. Pearlmutter, A. Radul, and J. Siskind. Automatic differentiation in machine learning: a survey.
Journal of Machine Learning Research, 18:1–43, 2018.
J. O. Berger, D. R. Insua, and F. Ruggeri. Robust Bayesian analysis. In D. R. Insua and F. Ruggeri, editors,
Robust Bayesian Analysis, volume 152. Springer Science & Business Media, 2000.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006. Chapter 10.
D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:
993–1022, 2003.
D. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv preprint
arXiv:1601.00670, 2016.
B. Carpenter, M. D. Hoffman, M. Brubaker, D. Lee, P. Li, and M. Betancourt. The Stan math library:
Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015.
R. D. Cook. Assessment of local influence. Journal of the Royal Statistical Society: Series B, 28(2):133–169,
1986.
B. Efron. Frequentist accuracy of Bayesian estimates. Journal of the Royal Statistical Society: Series B, 77
(3):617–646, 2015.
W. H. Fleming. Functions of Several Variables. Addison-Wesley Publishing Company, Inc., 1965.
A. Gelman and J. Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge
university press, 2006.
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, volume 2. Chapman &
Hall/CRC, 2014.
R. J. Giordano, T. Broderick, and M. I. Jordan. Linear response methods for accurate covariance estimates
from mean field variational Bayes. In Advances in Neural Information Processing Systems, pages 1441–
1449, 2015.
R. J. Giordano, T. Broderick, R. Meager, J. Huggins, and M. I. Jordan. Fast robustness quantification with
variational Bayes. arXiv preprint arXiv:1606.07153, 2016.
P. Gustafson. Local sensitivity of inferences to prior marginals. Journal of the American Statistical Associa-
tion, 91(434):774–781, 1996a.
P. Gustafson. Local sensitivity of posterior expectations. The Annals of Statistics, 24(1):174–195, 1996b.
P. Gustafson. Local robustness in Bayesian analysis. In D. R. Insua and F. Ruggeri, editors, Robust Bayesian
Analysis, volume 152. Springer Science & Business Media, 2000.
47
G IORDANO , B RODERICK , AND J ORDAN
48
C OVARIANCES , ROBUSTNESS , AND VARIATIONAL BAYES
D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep
generative models. arXiv preprint arXiv:1401.4082, 2014.
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Con-
ference on Machine Learning, pages 1530–1538, 2015.
M. Roos, T. G. Martins, L. Held, and H. Rue. Sensitivity analysis for Bayesian hierarchical models. Bayesian
Analysis, 10(2):321–349, 2015.
Stan Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.8.0, 2015. URL http:
//mc-stan.org/.
Stan Team. Stan example models wiki, 2017. URL https://github.com/stan-dev/
example-models/wiki. Referenced on May 19th, 2017.
T. Tanaka. Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2):2302, 1998.
T. Tanaka. Information geometry of mean-field approximation. Neural Computation, 12(8):1951–1968,
2000.
Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate infer-
ence. In International Conference on Machine Learning, pages 1971–1979, 2014.
D. Tran, D. Blei, and E. M. Airoldi. Copula variational inference. In Advances in Neural Information
Processing Systems, pages 3564–3572, 2015a.
D. Tran, R. Ranganath, and D. Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499,
2015b.
R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models.
In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models. 2011.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foun-
dations and Trends in Machine Learning, 1(1-2):1–305, 2008.
B. Wang and M. Titterington. Inadequacy of interval estimates corresponding to variational Bayesian ap-
proximations. In Workshop on Artificial Intelligence and Statistics, pages 373–380, 2004.
Y. Wang and D. M. Blei. Frequentist consistency of variational Bayes. arXiv preprint arXiv:1705.03439,
2017.
M. Welling and Y. W. Teh. Linear response algorithms for approximate inference in graphical models. Neural
Computation, 16(1):197–221, 2004.
T. Westling and T. H. McCormick. Establishing consistency and improving uncertainty estimates of varia-
tional inference through m-estimation. arXiv preprint arXiv:1510.08151, 2015.
S. Wright and J. Nocedal. Numerical optimization. Springer Science, 35:67–68, 1999.
H. Zhu, J. G. Ibrahim, S. Lee, and H. Zhang. Perturbation selection and influence measures in local influence
analysis. The Annals of Statistics, 35(6):2565–2588, 2007.
H. Zhu, J. G. Ibrahim, and N. Tang. Bayesian influence analysis: A geometric approach. Biometrika, 98(2):
307–323, 2011.
49