A Beginner’s Guide to Variational Inference
Haziq Jamil
Social Statistics
London School of Economics and Political Science
1 February 2018
http://socialstats.haziqj.ml
Outline
1 Introduction
Idea
Comparison to EM
Mean-field distributions
Coordinate ascent algorithm
2 Examples
Univariate Gaussian
Gaussian mixtures
3 Discussion
Exponential families
Zero-forcing vs Zero-avoiding
Quality of approximation
Advanced topics
Introduction
• Consider a statistical model where we have observations
y = (y1 , . . . , yn ) and also some latent variables z = (z1 , . . . , zm ).
• Want to evaluate the intractable integral

  I := ∫ p(y | z) p(z) dz
In the literature
[Figure: Google Scholar hits per year for 'variational inference', 1980–2010.]
• Well known in the machine learning community.
• In social statistics:
  – E. A. Erosheva et al. (2007). “Describing disability through individual-level mixture models for multivariate binary data”. Ann. Appl. Stat. 1.2, p. 346
  – J. Grimmer (2010). “An introduction to Bayesian inference via variational approximations”. Political Analysis 19.1, pp. 32–47
  – Y. S. Wang et al. (2017). “A variational EM method for mixed membership models with multivariate rank data: An analysis of public policy preferences”. arXiv: 1512.08731
Haziq Jamil - http://haziqj.ml Variational inference 1 Feb 2018 2 / 25
Recommended texts
Idea
[Diagram: VI posits a family q(z; ν) and moves ν from an initial value ν_init to an optimum ν* that minimises the divergence KL(q‖p) to the posterior p(z|y).]

[Diagram: EM (M-step) vs VI (M-step) — in each case, log p(y; θ*) decomposes into the lower bound L(q̃; θ*) plus the gap KL(q̃‖p).]
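Both diagrams rest on the standard decomposition of the log marginal likelihood into the variational lower bound and a KL gap:

```latex
\log p(y; \theta)
= \underbrace{\int q(z) \log \frac{p(y, z; \theta)}{q(z)} \,\mathrm{d}z}_{\mathcal{L}(q;\, \theta)}
+ \underbrace{\operatorname{KL}\bigl(q(z) \,\big\|\, p(z \mid y; \theta)\bigr)}_{\geq\, 0}
```

Since the KL term is non-negative, maximising L(q; θ) over q tightens the bound; EM can set KL to zero exactly, whereas VI minimises it within the chosen family.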
Algorithm 1 CAVI
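CAVI can be made concrete on the univariate Gaussian example that follows. A minimal sketch, assuming a Normal–Gamma conjugate prior (μ | τ ~ N(μ0, (λ0 τ)⁻¹), τ ~ Gamma(a0, b0)) and the factorisation q(μ, τ) = q(μ) q(τ); the prior settings and data below are illustrative:

```python
import numpy as np

def cavi_gaussian(y, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """CAVI for a univariate Gaussian with Normal-Gamma prior.
    Variational factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)."""
    n, ybar = len(y), np.mean(y)
    a_n = a0 + (n + 1) / 2            # shape update is fixed across iterations
    E_tau = a0 / b0                   # initialise E[tau] from the prior
    for _ in range(iters):
        # update q(mu) given the current E[tau]
        mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
        lam_n = (lam0 + n) * E_tau
        # update q(tau) given the current q(mu)
        E_sq = np.sum((y - mu_n) ** 2) + n / lam_n
        b_n = b0 + 0.5 * (E_sq + lam0 * ((mu_n - mu0) ** 2 + 1 / lam_n))
        E_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(0)
y = rng.normal(2.0, 0.5, size=500)    # simulated data: mean 2, precision 4
mu_n, lam_n, a_n, b_n = cavi_gaussian(y)
print(mu_n, a_n / b_n)                # variational posterior means of mu and tau
```

Each coordinate update holds the other factor fixed, so the lower bound increases monotonically until convergence.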
2 Examples
[Figure: successive CAVI iterations for the univariate Gaussian — contours of the lower bound L(q) plotted against the precision ψ, with the approximation tightening at each iteration.]
Comparison of solutions
[Figure: Old Faithful geyser data — time to next eruption (minutes) against eruption length (minutes).]
• Let x_i ∈ R^d and assume x_i ~ iid Σ_{k=1}^K π_k N_d(μ_k, Ψ_k^{-1}) for i = 1, …, n.
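For concreteness, data from such a mixture can be simulated as follows (the weights, means, and precisions are hypothetical values loosely mimicking the Old Faithful panels):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.35, 0.65])                    # mixing weights pi_k
mu = np.array([[2.0, 54.0], [4.3, 80.0]])      # component means mu_k (d = 2)
Psi = np.array([np.diag([11.0, 0.03]),         # component precisions Psi_k
                np.diag([6.0, 0.03])])

n = 272
z = rng.choice(len(pi), size=n, p=pi)          # latent labels z_i
x = np.stack([rng.multivariate_normal(mu[k], np.linalg.inv(Psi[k])) for k in z])
print(x.shape)  # (272, 2)
```

The latent labels z_i are exactly the quantities the variational E-step later infers through the responsibilities r_ik.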
[Figures: repeated panels of the Old Faithful data — time to next eruption against eruption length (minutes) — from the Gaussian mixture example, shown over successive iterations.]
3 Discussion
Exponential families
• For the mean-field variational method, suppose that each complete conditional is in the exponential family:

  p(z_j | z_{−j}, y) = h(z_j) exp{ η_j(z_{−j}, y)ᵀ t(z_j) − A(η_j) }
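Under this assumption the mean-field coordinate update has a standard closed form: the optimal q_j keeps the exponential-family shape of the complete conditional, with the natural parameter replaced by its expectation under the remaining factors:

```latex
q_j^*(z_j) \;\propto\; \exp\!\left\{ \operatorname{E}_{-j}\!\left[ \log p(z_j \mid z_{-j}, y) \right] \right\},
\qquad
\tilde{\eta}_j \;=\; \operatorname{E}_{-j}\!\left[ \eta_j(z_{-j}, y) \right]
```

This is why conjugate models admit closed-form CAVI updates: each coordinate step reduces to computing an expected natural parameter.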
Non-convexity of ELBO
[Figure: lower-bound (ELBO) traces over 100 iterations from different random initialisations, converging to different local optima.]
Zero-forcing vs Zero-avoiding
[Figure: univariate sketches — minimising KL(q‖p) (zero-forcing) keeps q(z) small wherever p(z) is near zero, whereas minimising KL(p‖q) (zero-avoiding) forces q(z) to cover all of p's support.]

[Figure: contours of a bimodal p(z) over (z₁, z₂); the zero-forcing solution q(z) concentrates on a single mode.]
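The two behaviours can be checked numerically. A sketch that fits a single Gaussian q to a bimodal mixture p by brute-force grid search over (m, s) for each KL direction; the mixture, grid, and tolerances are illustrative choices:

```python
import numpy as np

z = np.linspace(-8, 8, 4001)
dz = z[1] - z[0]
normal = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
p = 0.5 * normal(z, -2, 0.7) + 0.5 * normal(z, 2, 0.7)   # bimodal target

def kl(a, b):  # KL(a || b) by quadrature, guarding against underflow in b
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dz

best_qp = (np.inf, None, None)   # minimiser of KL(q || p): zero-forcing
best_pq = (np.inf, None, None)   # minimiser of KL(p || q): zero-avoiding
for m in np.linspace(-3, 3, 61):
    for s in np.linspace(0.3, 4.0, 38):
        q = normal(z, m, s)
        best_qp = min(best_qp, (kl(q, p), m, s))
        best_pq = min(best_pq, (kl(p, q), m, s))

print("zero-forcing  (KL(q||p)): m = %.1f, s = %.1f" % best_qp[1:])
print("zero-avoiding (KL(p||q)): m = %.1f, s = %.1f" % best_pq[1:])
```

The zero-forcing direction locks onto one mode (|m| near 2, narrow s), while the zero-avoiding direction straddles both modes (m near 0, broad s) — the behaviour sketched in the figures above.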
Quality of approximation
[Diagram: a spectrum of approximate-inference methods by quality of approximation — Laplace, variational inference, expectation propagation, MCMC.]
Advanced topics
End
Thank you!
Laplace’s method
• The second-order expansion about the mode f̃,

  Q(f) ≈ Q(f̃) − ½ (f − f̃)ᵀ A (f − f̃),

  is recognised as the logarithm of an unnormalised Gaussian density, with A = −D²Q(f̃), the negative Hessian of Q evaluated at f̃.
• The posterior p(f|y) is approximated by N(f̃, A⁻¹), and the marginal by

  p(y) ≈ (2π)^{n/2} |A|^{−1/2} p(y|f̃) p(f̃)
• Won’t scale with large n; difficult to find modes in high dimensions.
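A worked check of the approximation with n = 1, on a toy target whose normalising constant is known exactly (a Beta(6, 4) kernel — an illustrative choice, not from the slides):

```python
import numpy as np
from math import lgamma

a, b = 6.0, 4.0
Q = lambda f: (a - 1) * np.log(f) + (b - 1) * np.log(1 - f)  # log unnormalised density
f_tilde = (a - 1) / (a + b - 2)                              # mode of Q
A = (a - 1) / f_tilde ** 2 + (b - 1) / (1 - f_tilde) ** 2    # -Q''(f_tilde)

# Laplace: integral of exp(Q) ~ sqrt(2*pi / A) * exp(Q(f_tilde))
logZ_laplace = 0.5 * np.log(2 * np.pi / A) + Q(f_tilde)
logZ_exact = lgamma(a) + lgamma(b) - lgamma(a + b)           # log Beta(a, b)
print(logZ_laplace, logZ_exact)
```

The two log normalisers agree to about a tenth of a nat here; the mismatch grows when the target is skewed or multimodal, which is the failure mode the figures that follow illustrate.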
R. Kass and A. Raftery (1995). “Bayes Factors”. Journal of the American Statistical Association 90.430, pp. 773–795, §4.1, pp. 777–778.
Additional material
[Figure: a true density with its mode and mean marked, compared against its variational and Laplace approximations.]
Additional material
[Figure: a second comparison of a true density with its variational and Laplace approximations.]
Additional material
Variational E-step

q̃(z) = ∏_{i=1}^n ∏_{k=1}^K r_ik^{z_ik},  r_ik = ρ_ik / Σ_{k=1}^K ρ_ik

log ρ_ik = E[log π_k] + ½ E[log|Ψ_k|] − (d/2) log 2π − ½ E[(x_i − μ_k)ᵀ Ψ_k (x_i − μ_k)]
Variational M-step

q̃(π_1, …, π_K) = Dir_K(π | α̃),  α̃_k = α_{0k} + Σ_{i=1}^n r_ik

q̃(μ, Ψ) = ∏_{k=1}^K N_d(μ_k | m̃_k, (κ̃_k Ψ_k)^{-1}) Wis_d(Ψ_k | W̃_k, ν̃_k)
Additional material
W̃_k^{-1} = W_0^{-1} + Σ_{i=1}^n r_ik (x_i − x̄_k)(x_i − x̄_k)ᵀ

x̄_k = Σ_{i=1}^n r_ik x_i / Σ_{i=1}^n r_ik

ν̃_k = ν_0 + Σ_{i=1}^n r_ik

Also useful:

E[(x_i − μ_k)ᵀ Ψ_k (x_i − μ_k)] = d/κ̃_k + ν̃_k (x_i − m̃_k)ᵀ W̃_k (x_i − m̃_k)

E[log|Ψ_k|] = Σ_{i=1}^d ψ((ν̃_k + 1 − i)/2) + d log 2 + log|W̃_k|

E[log π_k] = ψ(α̃_k) − ψ(Σ_{k=1}^K α̃_k), where ψ(·) is the digamma function