
A Stein variational Newton method

arXiv:1806.03085v1 [stat.ML] 8 Jun 2018

Gianluca Detommaso (University of Bath & The Alan Turing Institute), gd391@bath.ac.uk
Tiangang Cui (Monash University), Tiangang.Cui@monash.edu
Alessio Spantini (Massachusetts Institute of Technology), spantini@mit.edu
Youssef Marzouk (Massachusetts Institute of Technology), ymarz@mit.edu
Robert Scheichl (University of Bath), R.Scheichl@bath.ac.uk

Abstract
Stein variational gradient descent (SVGD) was recently proposed as a general
purpose nonparametric variational inference algorithm [Liu & Wang, NIPS 2016]:
it minimizes the Kullback–Leibler divergence between the target distribution and
its approximation by implementing a form of functional gradient descent on a
reproducing kernel Hilbert space. In this paper, we accelerate and generalize the
SVGD algorithm by including second-order information, thereby approximating
a Newton-like iteration in function space. We also show how second-order in-
formation can lead to more effective choices of kernel. We observe significant
computational gains over the original SVGD algorithm in multiple test cases.

1 Introduction
Approximating an intractable probability distribution via a collection of samples—in order to evaluate
arbitrary expectations over the distribution, or to otherwise characterize uncertainty that the distribu-
tion encodes—is a core computational challenge in statistics and machine learning. Common features
of the target distribution can make sampling a daunting task. For instance, in a typical Bayesian
inference problem, the posterior distribution might be strongly non-Gaussian (perhaps multimodal)
and high dimensional, and evaluations of its density might be computationally intensive.
There exist a wide range of algorithms for such problems, ranging from parametric variational
inference [2] to Markov chain Monte Carlo (MCMC) techniques [8]. Each algorithm offers a
different computational trade-off. At one end of the spectrum, we find the parametric mean-field
approximation—a cheap but potentially inaccurate variational approximation of the target density.
At the other end, we find MCMC—a nonparametric sampling technique yielding estimators that are
consistent, but potentially slow to converge. In this paper, we focus on the Stein variational gradient
descent (SVGD) algorithm [15], which lies somewhere in the middle of the spectrum and can be
described as a particular nonparametric variational inference method [2].
The SVGD algorithm seeks a deterministic coupling between a tractable reference distribution of
choice (e.g., a standard normal) and the intractable target. This coupling is induced by a transport
map T that can transform a collection of reference samples into samples from the desired target
distribution. For a given pair of distributions, there may exist infinitely many such maps [26]; several
existing algorithms (e.g., [25, 22, 19]) aim to approximate feasible transport maps of various forms.
The distinguishing feature of the SVGD algorithm lies in its definition of a suitable map T. Its central
idea is to approximate T as a growing composition of simple maps, computed sequentially:
    T = T_1 ◦ · · · ◦ T_k ◦ · · · ,    (1)
where each map Tk is a perturbation of the identity map along the steepest descent direction of a
functional J that describes the Kullback–Leibler (KL) divergence between the pushforward of the
reference distribution through the composition T1 ◦ · · · ◦ Tk and the target distribution. The steepest
descent direction is further projected onto a reproducing kernel Hilbert space (RKHS) in order to
give Tk a nonparametric closed form [1]. Even though the resulting map Tk is available explicitly
without any need for numerical optimization, the SVGD algorithm implicitly approximates a steepest
descent iteration on a space of maps of given regularity.
A primary goal of this paper is to explore the use of second-order information (e.g., Hessians)
within the SVGD algorithm. Our idea is to develop the analogue of a Newton iteration—rather
than gradient descent—for the purpose of sampling distributions more efficiently. Specifically, we
design an algorithm where each map Tk is now computed as the perturbation of the identity function
along the direction that minimizes a certain local quadratic approximation of J. Accounting for
second-order information can dramatically accelerate convergence to the target distribution, at the
price of additional work per iteration. The tradeoff between speed of convergence and cost per
iteration is resolved in favor of the Newton-like algorithm—which we call a Stein variational Newton
method (SVN)—in several numerical examples.
The efficiency of the SVGD and SVN algorithms depends further on the choice of reproducing kernel.
A second contribution of this paper is to design geometry-aware Gaussian kernels that also exploit
second-order information, yielding substantially faster convergence towards the target distribution
than SVGD or SVN with an isotropic kernel.
In the context of parametric variational inference, second-order information has been used to acceler-
ate the convergence of certain variational approximations, e.g., [12, 11, 19]. In this paper, however, we
focus on nonparametric variational approximations, where the corresponding optimisation problem
is defined over an infinite-dimensional RKHS of transport maps. More closely related to our work is
the Riemannian SVGD algorithm [16], which generalizes a gradient flow interpretation of SVGD
[13] to Riemannian manifolds, and thus also exploits geometric information within the inference task.
The rest of the paper is organized as follows. Section 2 briefly reviews the SVGD algorithm, and
Section 3 introduces the new SVN method. In Section 4 we introduce geometry-aware kernels for
the SVN method. Numerical experiments are described in Section 5. Proofs of our main results and
further numerical examples addressing scaling to high dimensions are given in the supplementary
material. Code and all numerical examples are collected in our GitHub repository.¹

2 Background
Suppose we wish to approximate an intractable target distribution with density π on R^d via an
empirical measure, i.e., a collection of samples. Given samples {x_i} from a tractable reference
density p on R^d, one can seek a transport map T : R^d → R^d such that the pushforward density of
p under T, denoted by T_* p, is a close approximation to the target π.² There exist infinitely many
such maps [26]. The image of the reference samples through the map, {T(x_i)}, can then serve as an
empirical measure approximation of π (e.g., in the weak sense [15]).

Variational approximation. Using the KL divergence to measure the discrepancy between the
target π and the pushforward T_* p, one can look for a transport map T that minimises the functional

    T ↦ D_{KL}(T_* p ‖ π)    (2)

over a broad class of functions. The Stein variational method breaks the minimization of (2) into
several simple steps: it builds a sequence of transport maps {T_1, T_2, . . . , T_l, . . .} to iteratively push
an initial reference density p_0 towards π. Given a scalar-valued RKHS H with a positive definite
kernel k(x, x′), each transport map T_l : R^d → R^d is chosen to be a perturbation of the identity map
I(x) = x along the vector-valued RKHS H^d ≅ H × · · · × H, i.e.,

    T_l(x) := I(x) + Q(x)  for  Q ∈ H^d.    (3)
¹ http://github.com/gianlucadetommaso/Stein-variational-samplers
² If T is invertible, then T_* p(x) = p(T^{-1}(x)) |det(∇_x T^{-1}(x))|.

The transport maps are computed iteratively. At each iteration l, our best approximation of π is given
by the pushforward density p_l = (T_l ◦ · · · ◦ T_1)_* p_0. The SVGD algorithm then seeks a transport
map T_{l+1} = I + Q that further decreases the KL divergence between (T_{l+1})_* p_l and π,

    Q ↦ J_{p_l}[Q] := D_{KL}((I + Q)_* p_l ‖ π),    (4)

for an appropriate choice of Q ∈ H^d. In other words, the SVGD algorithm seeks a map Q ∈ H^d
such that

    J_{p_l}[Q] < J_{p_l}[0],    (5)

where 0(x) = 0 denotes the zero map. By construction, the sequence of pushforward densities
{p_0, p_1, p_2, . . . , p_l, . . .} comes increasingly close (in KL divergence) to the target π. Recent
results on the convergence of the SVGD algorithm are presented in [13].

Functional gradient descent. Assuming that the objective function J_{p_l} : H^d → R is Fréchet
differentiable, the first variation of J_{p_l} at S ∈ H^d along V ∈ H^d can be defined as

    DJ_{p_l}[S](V) := lim_{τ→0} (1/τ) ( J_{p_l}[S + τV] − J_{p_l}[S] ).    (6)

The functional gradient of J_{p_l} at S ∈ H^d is the element ∇J_{p_l}[S] of H^d such that

    DJ_{p_l}[S](V) = ⟨∇J_{p_l}[S], V⟩_{H^d}  ∀ V ∈ H^d,    (7)

where ⟨·, ·⟩_{H^d} denotes an inner product on H^d.
In order to satisfy (5), the SVGD algorithm defines T_{l+1} as a perturbation of the identity map along
the steepest descent direction of the functional J_{p_l} evaluated at the zero map, i.e.,

    T_{l+1} = I − ε ∇J_{p_l}[0],    (8)

for a small enough ε > 0. It was shown in [15] that the functional gradient at 0 has a closed-form
expression given by

    −∇J_{p_l}[0](z) = E_{x∼p_l}[ k(x, z) ∇_x log π(x) + ∇_x k(x, z) ].    (9)

Empirical approximation. There are several ways to approximate the expectation in (9). For
instance, a set of particles {x_i^0}_{i=1}^n can be generated from the initial reference density p_0 and pushed
forward by the transport maps {T_1, T_2, . . .}. The pushforward density p_l can then be approximated
by the empirical measure given by the particles {x_i^l}_{i=1}^n, where x_i^l = T_l(x_i^{l−1}) for i = 1, . . . , n, so
that

    −∇J_{p_l}[0](z) ≈ G(z) := (1/n) Σ_{j=1}^n [ k(x_j^l, z) ∇_{x_j^l} log π(x_j^l) + ∇_{x_j^l} k(x_j^l, z) ].    (10)
The first term in (10) corresponds to a weighted average steepest descent direction of the log-target
density π with respect to pl . This term is responsible for transporting particles towards high-
probability regions of π. In contrast, the second term can be viewed as a “repulsion force” that
spreads the particles along the support of π, preventing them from collapsing around the mode of π.
The SVGD algorithm is summarised in Algorithm 1.

Algorithm 1: One iteration of the Stein variational gradient algorithm

Input: Particles {x_i^l}_{i=1}^n at previous iteration l; step size ε_{l+1}
Output: Particles {x_i^{l+1}}_{i=1}^n at new iteration l + 1
1: for i = 1, 2, . . . , n do
2:   Set x_i^{l+1} ← x_i^l + ε_{l+1} G(x_i^l), where G is defined in (10).
3: end for
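
To make the update above concrete, here is a minimal Python/NumPy sketch of one SVGD iteration
implementing (10) with an isotropic Gaussian kernel. The callback grad_log_target, the step size,
and the bandwidth h are placeholders to be supplied by the user, not prescriptions from the paper.

import numpy as np

def svgd_step(X, grad_log_target, eps=1e-1, h=1.0):
    """One SVGD iteration, cf. (10): X is an (n, d) array of particles and
    grad_log_target(x) returns the gradient of log pi at a single point x."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]        # diff[j, i] = x_j - x_i
    sq_dists = np.sum(diff**2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * h))           # K[j, i] = k(x_j, x_i)
    gradK = -diff / h * K[:, :, None]           # gradient of k(x_j, x_i) w.r.t. x_j
    grads = np.stack([grad_log_target(x) for x in X])
    G = (K.T @ grads + gradK.sum(axis=0)) / n   # G[i] = G(x_i) in (10)
    return X + eps * G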

3 Stein variational Newton method


Here we propose a new method that incorporates second-order information to accelerate the conver-
gence of the SVGD algorithm. We replace the steepest descent direction in (8) with an approximation
of the Newton direction.

Functional Newton direction. Given a differentiable objective function J_{p_l}, we can define the
second variation of J_{p_l} at 0 along directions V, W ∈ H^d as

    D²J_{p_l}[0](V, W) := lim_{τ→0} (1/τ) ( DJ_{p_l}[τW](V) − DJ_{p_l}[0](V) ).

At each iteration, the Newton method seeks to minimize a local quadratic approximation of J_{p_l}, so
that the corresponding Newton direction W ∈ H^d can be defined as the solution of the following
system of equations:

    D²J_{p_l}[0](V, W) = −DJ_{p_l}[0](V),  ∀ V ∈ H^d.    (11)
We can then look for a transport map T_{l+1} that is a local perturbation of the identity map along the
Newton direction, i.e.,

    T_{l+1} = I + εW,    (12)

for some ε > 0 that satisfies (5). W is guaranteed to be a descent direction if the bilinear form
D²J_{p_l}[0] in (11) is positive definite. The following theorem gives an explicit form for D²J_{p_l}[0] and
is proved in Appendix A.
Theorem 1. The second variation of J_{p_l} evaluated at the zero map along directions V, W ∈ H^d is
given by

    D²J_{p_l}[0](V, W) = − ⟨ E_{x∼p_l}[ H_l(x, ·) W(x) ], V(·) ⟩_{H^d},

where

    H_l(x, z) = ∇²_x log π(x) k(x, z) + ∇²_x k(x, z) + ∇_x k(x, z) ∇_x log p_l(x)^⊤.

We can then use Theorem 1 to rewrite the variational characterization of the Newton direction W as
follows:

    ⟨ E_{x∼p_l}[ H_l(x, ·) W(x) ] − ∇J_{p_l}[0], V ⟩_{H^d} = 0,    (13)

for all maps V ∈ H^d. A practical characterization of W, however, requires further approximations of
(13).

Empirical approximation. First, we use the current empirical approximation of p_l given by the
ensemble {x_i^l}_{i=1}^n to replace the expectation in (13) by its discrete counterpart, i.e.,

    ⟨ (1/n) Σ_{j=1}^n H_l(x_j^l, ·) W(x_j^l) − ∇J_{p_l}[0](·), V(·) ⟩_{H^d} = 0,  ∀ V ∈ H^d.    (14)

Then, we notice that we really only need to approximate the action of W on the particles {x_i^l}_{i=1}^n
in order to evaluate the map T_{l+1} in (12) for the next iteration of the algorithm. One idea is to
enforce (14) only for a finite subset of functions in H^d, namely {k(x_i^l, ·)}_{i=1}^n, and use the reproducing
property of the kernel to obtain a coupled linear system for the desired actions, z_j^{l+1} := W(x_j^l) for
j = 1, . . . , n, as

    (1/n) Σ_{j=1}^n H_l(x_j^l, x_i^l) z_j^{l+1} = ∇J_{p_l}[0](x_i^l),  i = 1, . . . , n.    (15)

We can also use (10) to make the previous equation a bit more explicit,

    Σ_{j=1}^n H_l(x_j^l, x_i^l) z_j^{l+1} = Σ_{j=1}^n [ k(x_j^l, x_i^l) ∇_{x_j^l} log π(x_j^l) + ∇_{x_j^l} k(x_j^l, x_i^l) ].    (16)

Inexact Newton. Solving the linear system in (16) poses several challenges. First, the matrix
H_l cannot be computed directly, as the pushforward density p_l is unavailable. Second, the linear
system is enormous: the unknown is of dimension n × d. We then employ two approximations
to obtain an inexact Newton direction that can alleviate these difficulties. First, we drop the
term ∇_x k(x, z) ∇_x log p_l(x)^⊤ in H_l and replace ∇²_x log π(x) and ∇²_x k(x, z) by their respective
Gauss–Newton approximations N_π(x) (equivalent to the Fisher information matrix for certain target
densities) and N_k(x, z), which results in

    Ĥ(x, z) = N_π(x) k(x, z) + N_k(x, z).    (17)

Second, we consider a strategy reminiscent of the “mass lumping” technique common in the finite
element literature to approximate the (nd) × (nd) coupled linear system (16) by the solution of n
decoupled d × d systems. In particular, each z_i^{l+1} is approximated independently by the solution of
the following d × d linear system:

    [ Σ_{j=1}^n Ĥ(x_j^l, x_i^l) ] z_i^{l+1} = Σ_{j=1}^n [ k(x_j^l, x_i^l) ∇_{x_j^l} log π(x_j^l) + ∇_{x_j^l} k(x_j^l, x_i^l) ].    (18)

In essence, (18) corresponds to a “block-diagonal” approximation of the original system in (16),
obtained by averaging the blocks Ĥ(x_j^l, x_i^l) along each row. We note that by construction each matrix
Ĥ(x_j^l, x_i^l) is symmetric and positive definite.
The approximation of the Newton direction implicitly reshapes the local geometry of the objective
function (4), while providing a more natural scaling for the constant step size, ε = O(1), than the
original SVGD algorithm. Although the choice ε = 1 performs reasonably well in all our numerical
experiments (Section 5 and Appendix B), in future work we might consider performing an inexact
line search along the descent direction. The resulting Stein variational Newton method is summarised
in Algorithm 2.

Algorithm 2: One iteration of the Stein variational Newton algorithm

Input: Particles {x_i^l}_{i=1}^n at stage l; step size ε
Output: Particles {x_i^{l+1}}_{i=1}^n at stage l + 1
1: for i = 1, 2, . . . , n do
2:   Solve the linear system (18) for z_i^{l+1}
3:   Set x_i^{l+1} ← x_i^l + ε z_i^{l+1}
4: end for
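
For illustration, the following Python sketch implements the block-diagonal update (18) with an
isotropic Gaussian kernel. The callbacks grad_log_target and gauss_newton_hess (returning N_π(x))
are user-supplied placeholders, and the kernel term N_k is approximated here by an outer product of
kernel gradients purely to obtain a runnable example; that particular surrogate is our assumption, not
a prescription of the paper.

import numpy as np

def svn_step(X, grad_log_target, gauss_newton_hess, eps=1.0, h=1.0):
    """One inexact Newton iteration, cf. (18): X is an (n, d) array of particles;
    gauss_newton_hess(x) returns a d x d positive definite approximation N_pi(x)."""
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))
    gradK = -diff / h * K[:, :, None]               # gradient of k(x_j, x_i) w.r.t. x_j
    grads = np.stack([grad_log_target(x) for x in X])
    N_pi = np.stack([gauss_newton_hess(x) for x in X])

    X_new = np.empty_like(X)
    for i in range(n):
        rhs = K[:, i] @ grads + gradK[:, i, :].sum(axis=0)            # right-hand side of (18)
        H = np.einsum('j,jab->ab', K[:, i], N_pi) \
            + np.einsum('ja,jb->ab', gradK[:, i, :], gradK[:, i, :])  # surrogate for the N_k term
        z = np.linalg.solve(H + 1e-8 * np.eye(d), rhs)                # small jitter for stability
        X_new[i] = X[i] + eps * z
    return X_new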

4 Scaled Hessian kernel


In the Stein variational method, the kernel weighs the contribution of each particle to a locally
averaged steepest descent direction of the target distribution, and it also spreads the particles along
the support of the target distribution. Thus it is essential to choose a kernel that can capture the
underlying geometry of the target distribution, so the particles can traverse the support of the target
distribution efficiently. To this end, we can use the curvature information characterised by the Hessian
of the logarithm of the target density to design anisotropic kernels.
Consider a positive definite matrix A(x) that approximates the local Hessian of the negative logarithm
of the target density, i.e., A(x) ≈ −∇²_x log π(x). We introduce the metric

    M_π := E_{x∼π}[A(x)],    (19)
to characterise the average curvature of the target density, stretching and compressing the parameter
space in different directions. There are a number of computationally efficient ways to evaluate such
an A(x)—for example, the generalised eigenvalue approach in [18] and the Fisher information-based
approach in [9]. The expectation in (19) is taken against the target density π, and thus cannot be
directly computed. Utilising the ensemble {x_i^l}_{i=1}^n in each iteration, we introduce an alternative
metric

    M_{p_l} := (1/n) Σ_{i=1}^n A(x_i^l),    (20)
to approximate Mπ . Similar approximations have also been introduced in the context of dimension
reduction for statistical inverse problems; see [5]. Note that the computation of the metric (20) does
not incur extra computational cost, as we already calculated (approximations to) ∇2x log π(x) at each
particle in the Newton update.
Given a kernel of the generic form k(x, x′) = f(‖x − x′‖²), we can then use the metric M_{p_l} to
define an anisotropic kernel

    k_l(x, x′) = f( (1/g(d)) ‖x − x′‖²_{M_{p_l}} ),

where the norm ‖·‖_{M_{p_l}} is defined as ‖x‖²_{M_{p_l}} = x^⊤ M_{p_l} x and g(d) is a positive and real-valued
function of the dimension d. For example, with g(d) = d, the Gaussian kernel used in the SVGD of
[15] can be modified as

    k_l(x, x′) := exp( −(1/(2d)) ‖x − x′‖²_{M_{p_l}} ).    (21)

The metric Mpl induces a deformed geometry in the parameter space: distance is greater along
directions where the (average) curvature is large. This geometry directly affects how particles in
SVGD or SVN flow—by shaping the locally-averaged gradients and the “repulsion force” among the
particles—and tends to spread them more effectively over the high-probability regions of π.
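
As a minimal sketch (assuming the matrices A(x_i^l) have already been formed, e.g., as the Gauss–Newton
approximations used in the SVN update), the metric (20) and the kernel (21) with g(d) = d can be
computed as follows; these quantities can replace the isotropic kernel in the sketches above.

import numpy as np

def scaled_hessian_kernel(X, A_list):
    """Metric M_{p_l} of (20) and the Gaussian kernel (21), with gradients of
    k_l(x_j, x_i) with respect to the first argument x_j."""
    n, d = X.shape
    M = sum(A_list) / n                             # metric (20)
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    Mdiff = diff @ M                                # M (x_j - x_i), using symmetry of M
    sq_M = np.einsum('jia,jia->ji', diff, Mdiff)    # ||x_j - x_i||_M^2
    K = np.exp(-sq_M / (2.0 * d))                   # kernel (21) with g(d) = d
    gradK = -(Mdiff / d) * K[..., None]             # gradient w.r.t. x_j
    return M, K, gradK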
The dimension-dependent scaling factor g(d) plays an important role in high dimensional problems.
Consider a sequence of target densities that converges to a limit as the dimension of the parameter
space increases. For example, in the context of Bayesian inference on function spaces, e.g., [24],
the posterior density is often defined on a discretisation of a function space, whose dimensionality
increases as the discretisation is refined. In this case, the g(d)-weighted norm ‖·‖²/d is the square
of the discretised L² norm under certain technical conditions (e.g., the examples in Section 5.2 and
Appendix B.2) and converges to the functional L² norm as d → ∞. With an appropriate scaling g(d),
the kernel may thus exhibit robust behaviour with respect to discretisation if the target distribution
has appropriate infinite-dimensional limits. For high-dimensional target distributions that do not
have a well-defined limit with increasing dimension, an appropriately chosen scaling function g(d)
can still improve the ability of the kernel to discriminate inter-particle distances. Further numerical
investigation of this effect is presented in Appendix B.
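
As a quick numerical aside (not from the paper), the following snippet illustrates the claim for the
scaling g(d) = d: for a fixed function discretised on finer and finer grids, ‖x‖²/d approaches the
squared L² norm of the underlying function.

import numpy as np

for d in [50, 200, 800]:
    s = np.linspace(0.0, 1.0, d, endpoint=False)
    x = np.exp(s)                          # discretised function on [0, 1]
    print(d, np.sum(x**2) / d)             # tends to the squared L2 norm (e^2 - 1)/2 ~ 3.1945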

5 Test cases
We evaluate our new SVN method with the scaled Hessian kernel on a set of test cases drawn from
various Bayesian inference tasks. For these test cases, the target density π is the (unnormalised)
posterior density. We assume the prior distributions are Gaussian, that is, π0 (x) = N (mpr , Cpr ),
where mpr ∈ Rd and Cpr ∈ Rd×d are the prior mean and prior covariance, respectively. Also, we
assume there exists a forward operator F : Rd → Rm mapping from the parameter space to the
data space. The relationship between the observed data and unknown parameters can be expressed
as y = F(x) + ξ, where ξ ∼ N (0, σ 2 I) is the measurement error and I is the identity matrix.
This relationship defines the likelihood function L(y|x) = N (F(x), σ 2 I) and the (unnormalised)
posterior density π(x) ∝ π0 (x)L(y|x).
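
For this Gaussian-prior, Gaussian-noise setting, the ingredients needed by SVGD and SVN follow
directly from the chain rule. A sketch, assuming user-supplied callables F and jac_F for the forward
operator and its Jacobian (placeholders, not part of the paper's specification):

import numpy as np

def make_posterior(F, jac_F, y, m_pr, C_pr, sigma):
    """Log-density, gradient, and Gauss-Newton Hessian of the unnormalised
    posterior pi(x) proportional to pi_0(x) L(y | x)."""
    C_pr_inv = np.linalg.inv(C_pr)

    def log_post(x):
        r = y - F(x)
        dx = x - m_pr
        return -0.5 * (r @ r) / sigma**2 - 0.5 * dx @ C_pr_inv @ dx

    def grad_log_post(x):
        J = jac_F(x)                        # (m, d) Jacobian of F at x
        return J.T @ (y - F(x)) / sigma**2 - C_pr_inv @ (x - m_pr)

    def gauss_newton_hess(x):               # N_pi(x), symmetric positive definite
        J = jac_F(x)
        return J.T @ J / sigma**2 + C_pr_inv

    return log_post, grad_log_post, gauss_newton_hess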
We will compare the performance of SVN and SVGD, both with the scaled Hessian kernel (21) and
the heuristically-scaled isotropic kernel used in [15]. We refer to these algorithms as SVN-H, SVN-I,
SVGD-H, and SVGD-I, where ‘H’ or ‘I’ designate the Hessian or isotropic kernel, respectively.
Recall that the heuristic used in the ‘-I’ algorithms involves a scaling factor based on the number
of particles n and the median pairwise distance between particles [15]. Here we present two test
cases, one multi-modal and the other high-dimensional. Appendix B reports on additional tests that
investigate the dimension-scalability of our scaled kernel and illustrate further aspects of the SVN
method. Code for all test cases is available in our GitHub repository.
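
For completeness, a sketch of that median-distance heuristic (the precise constant follows the usual
convention attributed to [15]; treat it as one common choice rather than a unique definition):

import numpy as np

def median_bandwidth(X):
    """Bandwidth heuristic of [15]: h = med^2 / log n, where med is the median
    pairwise distance between the n particles."""
    n = X.shape[0]
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1))
    med = np.median(dists[np.triu_indices(n, k=1)])
    return med**2 / np.log(n)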

5.1 Two-dimensional double banana

The first test case is a two-dimensional bimodal and “banana” shaped posterior density. The prior is a
standard multivariate Gaussian, i.e., mpr = 0 and Cpr = I, and the observational error has standard
deviation σ = 0.3. The forward operator is taken to be a scalar logarithmic Rosenbrock function, i.e.,

    F(x) = log( (1 − x_1)² + 100 (x_2 − x_1²)² ),

where x = (x_1, x_2). We take a single observation y = F(x_true) + ξ, with x_true being a random
variable drawn from the prior and ξ ∼ N(0, σ² I).
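
A sketch of this forward operator and its Jacobian, which could be passed to the posterior constructor
sketched above:

import numpy as np

def F_banana(x):
    """Scalar log-Rosenbrock forward operator."""
    return np.array([np.log((1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2)])

def jac_F_banana(x):
    """1 x 2 Jacobian of F_banana."""
    q = (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
    dq_dx1 = -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2)
    dq_dx2 = 200.0 * (x[1] - x[0]**2)
    return np.array([[dq_dx1 / q, dq_dx2 / q]])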
Figure 1 summarises the outputs of four algorithms at selected iteration numbers, each with n = 1000
particles initially sampled from the prior π0 . The rows of Figure 1 correspond to the choice of
algorithms and the columns of Figure 1 correspond to the outputs at different iteration numbers.
We run 10, 50, and 100 iterations of SVN-H. To make a fair comparison, we rescale the number of
iterations for each of the other algorithms so that the total cost (CPU time) is approximately the same.

Figure 1: Particle configurations superimposed on contour plots of the double-banana density.

The first row of Figure 1 displays the performance of SVN-H, where second-order information is
exploited both in the optimisation and in the kernel. After only 10 iterations, the algorithm has
already converged, and the configuration of particles does not visibly change afterwards. Here, all the
particles quickly reach the high probability regions of the posterior distribution, due to the Newton
acceleration in the optimisation. Additionally, the scaled Hessian kernel seems to spread the particles
into a structured and precise configuration.
The second row shows the performance of SVN-I, where the second-order information is used
exclusively in the optimisation. We can see the particles quickly moving towards the high-probability
regions, but the configuration is much less structured. After 47 iterations, the algorithm has essentially
converged, but the configuration of the particles is noticeably rougher than that of SVN-H.
SVGD-H in the third row exploits second-order information exclusively in the kernel. Compared to
SVN-I, the particles spread more quickly over the support of the posterior, but not all the particles
reach the high probability regions, due to slower convergence of the optimisation. The fourth row
shows the original algorithm, SVGD-I. The algorithm lacks both of the benefits of second-order
information: with slower convergence and a more haphazard particle distribution, it appears less
efficient for reconstructing the posterior distribution.

5.2 100-dimensional conditioned diffusion

The second test case is a high-dimensional model arising from a Langevin SDE, with state u :
[0, T] → R and dynamics given by

    du_t = ( β u_t (1 − u_t²) / (1 + u_t²) ) dt + dx_t,  u_0 = 0.    (22)

Here x = (x_t)_{t≥0} is a Brownian motion, so that x ∼ π_0 = N(0, C), where C(t, t′) = min(t, t′).
This system represents the motion of a particle with negligible mass trapped in an energy potential,
with thermal fluctuations represented by the Brownian forcing; it is often used as a test case for
MCMC algorithms in high dimensions [4]. Here we use β = 10 and T = 1. Our goal is to infer the
driving process x and hence its pushforward to the state u.
The forward operator is defined by F(u) = [u_{t_1}, u_{t_2}, . . . , u_{t_{20}}]^⊤ ∈ R^20, where the t_i are equispaced
observation times in the interval (0, 1], i.e., t_i = 0.05 i. By taking σ = 0.1, we define an observation
y = F(x_true) + ξ ∈ R^20, where x_true is a Brownian motion path and ξ ∼ N(0, σ² I). For discretization,
we use an Euler–Maruyama scheme with step size ∆t = 10^{−2}; the dimensionality of the problem is
therefore d = 100. The prior distribution is given by the Brownian motion x = (x_t)_{t≥0} described above.

Figure 2: Reconstructions from SVN-H (10, 50, 100 iterations), SVN-I (11, 54, 108 iterations),
SVGD-H (40, 198, 395 iterations), and SVGD-I (134, 668, 1336 iterations). In each plot, the magenta
path is the true solution of the discretised Langevin SDE; the blue line is the reconstructed posterior
mean; the shaded area is the 90% marginal posterior credible interval at each time step.
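
A sketch of the discretised forward model: Euler–Maruyama integration of (22) driven by the
discretised Brownian path, followed by extraction of the 20 observed states. The indexing and the
parameterisation by the Brownian path are our assumptions for illustration, not necessarily the
authors' implementation.

import numpy as np

def forward_conditioned_diffusion(x, beta=10.0, dt=1e-2, n_obs=20):
    """Integrate (22) with Euler-Maruyama on [0, 1], driven by the discretised
    Brownian path x of length d = 100, and return the state at the observation
    times t_i = 0.05 i."""
    d = x.shape[0]
    u = np.zeros(d + 1)                      # u_0 = 0
    x_prev = 0.0                             # the Brownian path starts at 0
    for k in range(d):
        drift = beta * u[k] * (1.0 - u[k]**2) / (1.0 + u[k]**2)
        u[k + 1] = u[k] + drift * dt + (x[k] - x_prev)   # dx_t = increment of the path
        x_prev = x[k]
    obs_idx = np.arange(d // n_obs, d + 1, d // n_obs)   # indices of t_i on the dt grid
    return u[obs_idx]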
Figure 2 summarises the outputs of four algorithms, each with n = 1000 particles initially sampled
from π0 . Figure 2 is presented in the same way as Figure 1 from the first test case. The iteration
numbers are scaled, so that we can compare outputs generated by various algorithms using approxi-
mately the same amount of CPU time. In Figure 2, the path in magenta corresponds to the solution of
the Langevin SDE in (22) driven by the true Brownian path xtrue . The red points correspond to the
20 noisy observations. The blue path is the reconstruction of the magenta path, i.e., it corresponds
to the solution of the Langevin SDE driven by the posterior mean of (xt )t≥0 . Finally, the shaded
area represents the marginal 90% credible interval of each dimension (i.e., at each time step) of the
posterior distribution of u.
We observe excellent performance of SVN-H. After 50 iterations, the algorithm has already converged,
accurately reconstructing the posterior mean (which in turn captures the trends of the true path) and
the posterior credible intervals. (See Figure 3 and below for a validation of these results against
a reference MCMC simulation.) SVN-I manages to provide a reasonable reconstruction of the
target distribution: the posterior mean shows fair agreement with the true solution, but the credible
intervals are slightly overestimated, compared to SVN-H and the reference MCMC. The overestimated
credible interval may be due to the poor dimension scaling of the isotropic kernel used by SVN-I.
With the same amount of computational effort, SVGD-H and SVGD-I cannot reconstruct the posterior
distribution: both the posterior mean and the posterior credible intervals depart significantly from
their true values.

In Figure 3, we compare the posterior distribution approximated with SVN-H (using n = 1000
particles and 100 iterations) to that obtained with a reference MCMC run (using the DILI algorithm
of [4] with an effective sample size of 105 ), showing an overall good agreement. The thick blue
and green paths correspond to the posterior means estimated by SVN-H and MCMC, respectively.
The blue and green shaded areas represent the marginal 90% credible intervals (at each time step)
produced by SVN-H and MCMC. In this example, the posterior mean of SVN-H matches that of
MCMC quite closely, and both are comparable to the data-generating path (thick magenta line). (The
posterior means are much smoother than the true path, which is to be expected.) The estimated
credible intervals of SVN-H and MCMC also match fairly well along the entire path of the SDE.

Figure 3: Comparison of reconstructed distributions from SVN-H and MCMC

6 Discussion
In general, the use of Gaussian reproducing kernels may be problematic in high dimensions, due to
the locality of the kernel [6]. While we observe in Section 4 that using a properly rescaled Gaussian
kernel can improve the performance of the SVN method in high dimensions, we also believe that
a truly general purpose nonparametric algorithm using local kernels will inevitably face further
challenges in high-dimensional settings. A sensible approach to coping with high dimensionality
is also to design algorithms that can detect and exploit essential structure in the target distribution,
whether it be decaying correlation, conditional independence, low rank, multiple scales, and so on.
See [23, 27] for recent efforts in this direction.

A Proof of Theorem 1
The following proposition is used to prove Theorem 1.
Proposition 1. Define the directional derivative of J_p as the first variation of J_p at S ∈ H^d along a
direction V ∈ H^d,

    DJ_p[S](V) := lim_{τ→0} (1/τ) ( J_p[S + τV] − J_p[S] ).

The first variation takes the form

    DJ_p[S](V) = −E_{x∼p}[ ∇_x log π(x + S(x))^⊤ V(x) + trace( (I + ∇_x S(x))^{−1} ∇_x V(x) ) ].    (23)

Proof. Given the identity map I and a transport map of the form T = I + S + τV, the pullback
density of π is defined as

    T^*π(x) = π(T(x)) |det ∇_x T(x)| = π(x + S(x) + τV(x)) |det(I + ∇_x S(x) + τ∇_x V(x))|.

The perturbed objective function J_p[S + τV] takes the form

    J_p[S + τV] = D_{KL}((I + S + τV)_* p ‖ π)
                = D_{KL}(p ‖ (I + S + τV)^* π)
                = ∫ p(x) log p(x) dx − ∫ p(x) [ log π(x + S(x) + τV(x))
                  + log |det(I + ∇_x S(x) + τ∇_x V(x))| ] dx.

Thus we have

    J_p[S + τV] − J_p[S] = − ∫ p(x) [ log π(x + S(x) + τV(x)) − log π(x + S(x)) ] dx
                           − ∫ p(x) [ log |det(I + ∇_x S(x) + τ∇_x V(x))| − log |det(I + ∇_x S(x))| ] dx,    (24)

where we denote the first and second bracketed differences by (i) and (ii), respectively. Performing a
Taylor expansion of the terms (i) and (ii) in (24), we have

    (i)  = τ ∇_x log π(x + S(x))^⊤ V(x) + O(τ²),
    (ii) = τ trace( (I + ∇_x S(x))^{−1} ∇_x V(x) ) + O(τ²),

where ∇_x log π(x + S(x)) is the partial derivative of log π evaluated at x + S(x). Plugging the above
expressions into (24) and the definition of the directional derivative, we obtain

    DJ_p[S](V) = −E_{x∼p}[ ∇_x log π(x + S(x))^⊤ V(x) + trace( (I + ∇_x S(x))^{−1} ∇_x V(x) ) ].    (25)

The functional gradient of J_p evaluated at S ∈ H^d is the element ∇J_p[S] of H^d that satisfies

    DJ_p[S](V) = ⟨∇J_p[S], V⟩_{H^d},  ∀ V ∈ H^d,

and thus we can use Proposition 1 to prove Theorem 1.

Proof of Theorem 1. The second variation of J_p at 0 along directions V, W ∈ H^d takes the form

    D²J_p[0](V, W) := lim_{τ→0} (1/τ) ( DJ_p[τW](V) − DJ_p[0](V) ).

Following Proposition 1, we have

    D²J_p[0](V, W) = lim_{τ→0} (1/τ) ( DJ_p[τW](V) − DJ_p[0](V) )
                   = − E_{x∼p}[ ( lim_{τ→0} (1/τ) ( ∇_x log π(x + τW(x)) − ∇_x log π(x) ) )^⊤ V(x) ]
                     − E_{x∼p}[ trace( lim_{τ→0} (1/τ) [ (I + τ∇_x W(x))^{−1} − I ] ∇_x V(x) ) ],    (26)

where we denote the two limits by (i) and (ii), respectively. By Taylor expansion, these limits can be
written as

    (i)  = ∇²_x log π(x) W(x),
    (ii) = −∇_x W(x).

Thus, the second variation of J_p at 0 along directions V, W ∈ H^d becomes

    D²J_p[0](V, W) = −E_{x∼p}[ W(x)^⊤ ∇²_x log π(x) V(x) − trace( ∇_x W(x) ∇_x V(x) ) ].    (27)

Using the reproducing property of V ∈ H^d,

    v_i(x) = ⟨k(x, ·), v_i(·)⟩_{H^d},
    ∇_x v_i(x) = ⟨∇_x k(x, ·), v_i(·)⟩_{H^d},

we then have

    E_{x∼p}[ W(x)^⊤ ∇²_x log π(x) V(x) ] = ⟨ E_{x∼p}[ ∇²_x log π(x) k(x, ·) W(x) ], V(·) ⟩_{H^d}.

Applying integration by parts and the reproducing property of V ∈ H^d, we have

    E_{x∼p}[ −trace( ∇_x W(x) ∇_x V(x) ) ] = ⟨ E_{x∼p}[ ( ∇²_x k(x, ·) + ∇_x k(x, ·) ∇_x log p(x)^⊤ ) W(x) ], V(·) ⟩_{H^d}.

Plugging the above identities into (27), the second variation can be expressed as

    D²J_p[0](V, W) = − ⟨ E_{x∼p}[ H(x, ·) W(x) ], V(·) ⟩_{H^d},

where H(x, z) = ∇²_x log π(x) k(x, z) + ∇²_x k(x, z) + ∇_x k(x, z) ∇_x log p(x)^⊤.

B Additional test cases


B.1 Nonlinear regression

Similar to the test case in Section 5.1, we present another two-dimensional example with a twisted
geometry. We adopt the same prior and likelihood function as in Section 5.1, but with a different
forward operator given by

    F(x) = c_1 x_1³ + c_2 x_2,

where the coefficients c = [c_1, c_2]^⊤ ∈ R² are drawn from a standard Gaussian N(0, I) and fixed for
the purpose of the example.
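
A sketch of this forward operator and its Jacobian, in the same style as the double-banana example
(the fixed random seed is an arbitrary choice for illustration):

import numpy as np

c = np.random.default_rng(0).standard_normal(2)   # fixed coefficients c ~ N(0, I)

def F_cubic(x):
    return np.array([c[0] * x[0]**3 + c[1] * x[1]])

def jac_F_cubic(x):
    """1 x 2 Jacobian of F_cubic."""
    return np.array([[3.0 * c[0] * x[0]**2, c[1]]])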

Figure 4: Particle configurations superimposed on the contour plots of the target density.

As in the test case of Section 5.1, we perform a comparison of the four algorithms with n = 1000
particles using comparable total computational budgets. Figure 4 shows the results, where each
column corresponds to a common level of computational effort. After only 10 iterations, both SVN-H
and SVN-I are nearly converged. In contrast, both SVGD-H and SVGD-I seem quite far from
convergence even after 100 iterations. In this example, while the choice of kernel is not particularly
influential, using second-order information in the optimisation significantly improves the convergence
of the Stein variational method.

B.2 Scalability of kernels in high dimensions

Discretization-invariant posterior distribution. Here we illustrate the dimension scalability of
the scaled Hessian kernel in (21), compared to the isotropic kernel used in [15]. We consider a linear
Bayesian inverse problem in a function space setting [24]: the forward operator is a linear functional
F(x) = ⟨sin(πs), x(s)⟩, where the function x is defined for s ∈ [0, 1]. The scalar observation is
y = F(x) + ξ, where ξ is Gaussian with zero mean and standard deviation σ = 0.3. The prior
is a Gaussian measure N(0, K^{−1}), where K is the Laplace operator x ↦ −x″(s), s ∈ [0, 1], with zero
essential boundary conditions.
Discretising this problem with finite differences on a uniform grid with d degrees of freedom, we
obtain a Gaussian prior density π_0(x) with zero mean and covariance matrix K^{−1}, where K is the
finite difference approximation of the Laplacian. Let the vector a denote the discretised function
sin(πs), s ∈ [0, 1], and let the corresponding discretised parameter be denoted by x (overloading
notation for convenience). Then the finite-dimensional forward operator can be written as F(x) =
a^⊤ x. After discretization, the posterior has a Gaussian density of the form π = N(m_pos, C_pos), where

    m_pos = (y/σ²) C_pos a,   C_pos = ( K + (1/σ²) a a^⊤ )^{−1}.
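
A sketch of these analytic reference quantities (the quadrature weighting of a and the boundary
treatment of the discrete Laplacian are our assumptions here, so the numbers need not match the
tables below to all digits):

import numpy as np

def linear_posterior_summaries(d, y, sigma=0.3):
    """Analytic posterior mean average and covariance trace for the linear
    problem F(x) = a^T x with prior N(0, K^{-1})."""
    h = 1.0 / (d + 1)
    s = np.linspace(h, 1.0 - h, d)                    # interior grid points
    # Finite-difference Laplacian with zero (essential) boundary conditions
    K = (2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)) / h**2
    a = np.sin(np.pi * s) * h                         # discretised functional <sin(pi s), x>
    C_pos = np.linalg.inv(K + np.outer(a, a) / sigma**2)
    m_pos = C_pos @ a * y / sigma**2
    return m_pos.mean(), np.trace(C_pos)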

To benchmark the performance of various kernels, we construct certain summaries of the posterior
distribution. In particular, we use our SVN methods with the scaled Hessian kernel (SVN-H)
and the isotropic kernel (SVN-I) to estimate the component-wise average of the posterior mean,
(1/d) Σ_{i=1}^d m_pos,i, and the trace of the posterior covariance, trace(C_pos), for problems discretised at
different resolutions d ∈ {40, 60, 80, 100}. We run each experiment with n = 1000 particles and 50
iterations of SVN. We compare the numerical estimates of these quantities to the analytically known
results. These comparisons are summarised in Tables 1 and 2.
From Table 1, we can observe that all algorithms almost perfectly recover the average of the posterior
mean up to the first three significant figures. However, Table 2 shows that SVN-H does a good job
in estimating the trace of the posterior covariance consistently for all dimensions, whereas SVN-I
under-estimates the trace—suggesting that particles are under-dispersed and not correctly capturing
the uncertainty in the parameter x. This example suggests that the scaled Hessian kernel can lead to a
more accurate posterior reconstruction for high-dimensional distributions than the isotropic kernel.

Table 1: Comparison of theoretical and estimated averages of the posterior mean, (1/d) Σ_{i=1}^d m_pos,i

d            40      60      80      100
Theoretical  0.4658  0.4634  0.4622  0.4615
SVN-H        0.4658  0.4634  0.4623  0.4614
SVN-I        0.4657  0.4633  0.4622  0.4615

A posterior distribution that is not discretization invariant. Now we examine the dimension-
scalability of various kernels in a problem that does not have a well-defined limit with increasing
parameter dimension. We modify the linear Bayesian inverse problem introduced above: now the
prior covariance is the identity matrix, i.e., K^{−1} = I, and the vector a used to define the forward
operator is drawn from a uniform distribution, a_i ∼ U(2, 10), i = 1, . . . , d. This way, the posterior
is not discretization invariant. We perform the same set of numerical experiments as above and
summarise the results in Tables 3 and 4. Although the target distribution used in this case is not

Table 2: Comparison of theoretical and estimated traces of the posterior covariance, trace(C_pos)

d            40      60      80      100
Theoretical  0.1295  0.1297  0.1299  0.1299
SVN-H        0.1271  0.1281  0.1304  0.1293
SVN-I        0.0925  0.0925  0.0925  0.0923

discretization invariant, the scaled Hessian kernel (21) is still reasonably effective in reconstructing
the target distributions of increasing dimension (according to the summary statistics below), whereas
the isotropic kernel under-estimates the target variances for all values of dimension d that we have
tested.
Table 3: Comparison of theoretical and estimated averages of the posterior mean, (1/d) Σ_{i=1}^d m_pos,i

d            40      60      80      100
Theoretical  0.0037  0.0025  0.0019  0.0015
SVN-H        0.0037  0.0025  0.0019  0.0015
SVN-I        0.0037  0.0025  0.0019  0.0015

Table 4: Comparison of theoretical and estimated traces of the posterior covariance, trace(C_pos)

d            40       60       80       100
Theoretical  39.0001  59.0000  79.0000  99.0000
SVN-H        37.7331  55.8354  73.6383  90.7689
SVN-I        8.7133   8.2588   7.9862   7.6876

References
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, pp. 337–404, 1950.
[2] D. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, pp. 859–877, 2017.
[3] W. Y. Chen, L. Mackey, J. Gorham, F. X. Briol, and C. J. Oates. Stein points. arXiv preprint arXiv:1803.10161, 2018.
[4] T. Cui, K. J. H. Law, and Y. M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
[5] T. Cui, J. Martin, Y. M. Marzouk, A. Solonen, and A. Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
[6] D. Francois, V. Wertz, and M. Verleysen. About the locality of kernels in high-dimensional spaces. International Symposium on Applied Stochastic Models and Data Analysis, pp. 238–245, 2005.
[7] S. Gershman, M. Hoffman, and D. Blei. Nonparametric variational inference. arXiv preprint arXiv:1206.4665, 2012.
[8] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. CRC Press, 1995.
[9] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[10] J. Han and Q. Liu. Stein variational adaptive importance sampling. arXiv preprint arXiv:1704.05201, 2017.
[11] M. E. Khan, Z. Liu, V. Tangkaratt, and Y. Gal. Vprop: Variational inference using RMSprop. arXiv preprint arXiv:1712.01038, 2017.
[12] M. E. Khan, W. Lin, V. Tangkaratt, Z. Liu, and D. Nielsen. Adaptive-Newton method for explorative learning. arXiv preprint arXiv:1711.05560, 2017.
[13] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (I. Guyon et al., Eds.), Vol. 30, pp. 3118–3126, 2017.
[14] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
[15] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (D. D. Lee et al., Eds.), Vol. 29, pp. 2378–2386, 2016.
[16] C. Liu and J. Zhu. Riemannian Stein variational gradient descent for Bayesian inference. arXiv preprint arXiv:1711.11216, 2017.
[17] D. G. Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.
[18] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.
[19] Y. M. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. Handbook of Uncertainty Quantification, Springer, pp. 1–41, 2016.
[20] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (S. Brooks et al., Eds.), Chapman & Hall/CRC, 2011.
[21] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. Stein variational autoencoder. arXiv preprint arXiv:1704.05155, 2017.
[22] D. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[23] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings. arXiv preprint arXiv:1703.06131, 2017.
[24] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[25] E. G. Tabak and T. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, pp. 145–164, 2013.
[26] C. Villani. Optimal Transport: Old and New. Springer-Verlag Berlin Heidelberg, 2009.
[27] D. Wang, Z. Zeng, and Q. Liu. Structured Stein variational inference for continuous graphical models. arXiv preprint arXiv:1711.07168, 2017.
[28] J. Zhuo, C. Liu, N. Chen, and B. Zhang. Analyzing and improving Stein variational gradient descent for high-dimensional marginal inference. arXiv preprint arXiv:1711.04425, 2017.
