
A Beginner’s Guide to Variational Inference

Haziq Jamil

Social Statistics
London School of Economics and Political Science

1 February 2018

Social Statistics Meeting

http://socialstats.haziqj.ml
Outline

1 Introduction
Idea
Comparison to EM
Mean-field distributions
Coordinate ascent algorithm
2 Examples
Univariate Gaussian
Gaussian mixtures
3 Discussion
Exponential families
Zero-forcing vs Zero-avoiding
Quality of approximation
Advanced topics

Introduction
• Consider a statistical model with observations y = (y1, . . . , yn) and latent variables z = (z1, . . . , zm).
• We want to evaluate the intractable integral

  I := ∫ p(y|z) p(z) dz

  – Bayesian posterior analysis
  – Random effects models
  – Mixture models
• Variational inference approximates the “posterior” p(z|y) by a tractable distribution that is close in the Kullback-Leibler sense.
• Advantages:
  – Computationally fast
  – Convergence easily assessed
  – Works well in practice
In the literature
[Figure: Google Scholar hits per year for 'variational inference', 1980–2010, rising steeply after 2000 to several thousand hits per year.]
• Well known in the machine learning community.
• In social statistics:
  – E. A. Erosheva et al. (2007). “Describing disability through individual-level mixture models for multivariate binary data”. Ann. Appl. Stat. 1.2, p. 346
  – J. Grimmer (2010). “An introduction to Bayesian inference via variational approximations”. Political Analysis 19.1, pp. 32–47
  – Y. S. Wang et al. (2017). “A variational EM method for mixed membership models with multivariate rank data: An analysis of public policy preferences”. arXiv: 1512.08731
Recommended texts

• M. J. Beal and Z. Ghahramani (2003). “The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures”. In: Bayesian Statistics 7. Proceedings of the Seventh Valencia International Meeting. Ed. by J. M. Bernardo et al. Oxford: Oxford University Press, pp. 453–464
• C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
• K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press
• D. M. Blei et al. (2017). “Variational inference: A review for statisticians”. J. Am. Stat. Assoc., to appear

Idea
[Figure: the target posterior p(z|y) and a family of tractable densities q(z; ν); starting from an initial value ν init, the variational parameters are moved to ν∗, the member of the family closest to p(z|y) in KL(q‖p).]
• Minimise the Kullback-Leibler divergence (using calculus of variations)

  KL(q‖p) = − ∫ log{ p(z|y) / q(z) } q(z) dz.

• ISSUE: KL(q‖p) is intractable.

D. M. Blei (2017). “Variational Inference: Foundations and Innovations”. URL:
https://simons.berkeley.edu/talks/david-blei-2017-5-1
The Evidence Lower Bound (ELBO)

• Let q(z) be some density function to approximate p(z|y). Then the


log-marginal density can be decomposed as follows:

log p(y) = log p(y, z) − log p(z|y)


Z  
p(y, z) p(z|y)
= log − log q(z) dz
q(z) q(z)
= L(q) + KL(qkp)
≥ L(q)

• L is referred to as the “lower-bound”, and it serves as a surrogate


function to the marginal.
• Maximising L(q) is equivalent to minimising KL(qkp).
• ISSUE: L(q) is (generally) not convex.
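
As a sanity check on this decomposition, here is a small numerical sketch on a conjugate Gaussian toy model of my own choosing (not one of the talk's examples): for any choice of q, the ELBO plus the KL term recovers log p(y) exactly.

    import numpy as np
    from scipy.stats import norm

    # Toy model (my own example): y | theta ~ N(theta, sigma^2), theta ~ N(0, tau^2).
    y, sigma, tau = 1.3, 0.8, 2.0
    m, s = 0.4, 0.6                          # an arbitrary q(theta) = N(m, s^2)

    # L(q) = E_q[log p(y, theta)] - E_q[log q(theta)], using Gaussian expectations
    E_loglik = norm.logpdf(y, m, sigma) - s**2 / (2 * sigma**2)
    E_logprior = norm.logpdf(m, 0, tau) - s**2 / (2 * tau**2)
    entropy_q = 0.5 * np.log(2 * np.pi * np.e * s**2)
    L = E_loglik + E_logprior + entropy_q

    # Exact posterior p(theta|y) and KL(q || p(theta|y)) for two Gaussians
    post_var = 1 / (1 / sigma**2 + 1 / tau**2)
    post_mean = post_var * y / sigma**2
    KL = np.log(np.sqrt(post_var) / s) + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5

    print(L + KL, norm.logpdf(y, 0, np.sqrt(sigma**2 + tau**2)))   # both equal log p(y)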

Comparison to the EM algorithm


• Suppose for this part, the marginal density p(y|θ) depends on parameters θ.
• In the EM algorithm, the true posterior density is used, i.e. q(z) ≡ p(z|y, θ).
• Thus,

  log p(y|θ) = ∫ [ log p(y, z|θ) − log p(z|y, θ) ] p(z|y, θ^{(t)}) dz
             = E_{θ^{(t)}}[log p(y, z|θ)] − E_{θ^{(t)}}[log p(z|y, θ)]
             = Q(θ|θ^{(t)}) + entropy.

• Minimising the KL divergence corresponds to the E-step.
• For any θ,

  log p(y|θ) − log p(y|θ^{(t)}) = Q(θ|θ^{(t)}) − Q(θ^{(t)}|θ^{(t)}) + ∆entropy
                                ≥ Q(θ|θ^{(t)}) − Q(θ^{(t)}|θ^{(t)}).
[Figure: bar diagrams comparing the M-step in EM with the M-step in VI. In both, log p(y; θ∗) decomposes into L(q̃; θ∗) plus KL(q̃‖p); in EM the KL term vanishes after the E-step, whereas in VI a non-zero KL(q̃‖p) gap remains.]
Factorised distributions (Mean-field theory)


• Maximising L over all possible q is not feasible. We need some restrictions, but only enough to achieve tractability.
• Suppose we partition the elements of z into M disjoint groups z = (z^{(1)}, . . . , z^{(M)}), and assume

  q(z) = ∏_{j=1}^{M} qj(z^{(j)}).

• Under this restriction, the solution to arg max_q L(q) is

  q̃j(z^{(j)}) ∝ exp{ E−j[log p(y, z)] }          (1)

  for j ∈ {1, . . . , M}.
• In practice, these unnormalised densities are of recognisable form (especially if conjugacy is considered).
Coordinate ascent mean-field variational inference (CAVI)


• The optimal distributions are coupled with one another, i.e. each q̃j(z^{(j)}) depends on the optimal moments of z^{(k)}, k ∈ {1, . . . , M : k ≠ j}.
• One way around this is to employ an iterative procedure.
• Assess convergence by monitoring the lower bound L(q) = Eq[log p(y, z)] − Eq[log q(z)].

Algorithm 1 CAVI
1: initialise variational factors qj(z^{(j)}), j = 1, . . . , M
2: while L(q) not converged do
3:   for j = 1, . . . , M do
4:     log qj(z^{(j)}) ← E−j[log p(y, z)] + const.        ▷ from (1)
5:   end for
6:   L(q) ← Eq[log p(y, z)] − Eq[log q(z)]
7: end while
8: return q̃(z) = ∏_{j=1}^{M} q̃j(z^{(j)})
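
In code, Algorithm 1 is just a double loop; a minimal sketch, where update_factor and elbo are hypothetical model-specific functions supplied by the user (they are not defined on these slides):

    def cavi(factors, update_factor, elbo, tol=1e-6, max_iter=1000):
        """Generic coordinate ascent: cycle through the factors until L(q) stabilises.
        `update_factor` and `elbo` are user-supplied placeholders, not from the slides."""
        bound = -float("inf")
        for _ in range(max_iter):
            for j in range(len(factors)):
                factors[j] = update_factor(j, factors)   # optimal q_j from (1)
            new_bound = elbo(factors)                    # E_q[log p(y,z)] - E_q[log q(z)]
            if abs(new_bound - bound) < tol:
                break
            bound = new_bound
        return factors, bound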

1 Introduction

2 Examples

3 Discussion

Estimation of a 1-dim Gaussian mean and variance


• GOAL: Bayesian inference for the mean µ and variance ψ^{-1} of

  yi ∼ iid N(µ, ψ^{-1}),  i = 1, . . . , n                      (data)
  µ|ψ ∼ N(µ0, (κ0 ψ)^{-1}),   ψ ∼ Γ(a0, b0)                     (priors)

• Substitute p(µ, ψ|y) with the mean-field approximation q(µ, ψ) = qµ(µ) qψ(ψ).
• From (1), we can work out the solutions

  q̃µ(µ) ≡ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) Eq[ψ]) )   and   q̃ψ(ψ) ≡ Γ(ã, b̃),

  ã = a0 + n/2,   b̃ = b0 + (1/2) Eq[ Σ_{i=1}^{n} (yi − µ)^2 + κ0 (µ − µ0)^2 ].
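
These two updates are easy to code up; a minimal sketch following the formulas above (the function name and default hyperparameters are mine, not from the slides):

    import numpy as np

    def cavi_normal_gamma(y, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, max_iter=100, tol=1e-10):
        """CAVI for the univariate Normal-Gamma model, following the updates above (a sketch)."""
        n, ybar = len(y), np.mean(y)
        mu_q = (kappa0 * mu0 + n * ybar) / (kappa0 + n)   # Eq[mu], fixed across iterations
        a_q = a0 + n / 2                                  # shape of q(psi), also fixed
        b_q = b0                                          # rate of q(psi), to be iterated
        for _ in range(max_iter):
            var_q = 1.0 / ((kappa0 + n) * (a_q / b_q))    # Var_q[mu] given current Eq[psi]
            # Eq[ sum_i (y_i - mu)^2 + kappa0 (mu - mu0)^2 ] under q(mu) = N(mu_q, var_q)
            ess = np.sum((y - mu_q) ** 2) + n * var_q + kappa0 * ((mu_q - mu0) ** 2 + var_q)
            b_new = b0 + 0.5 * ess
            converged = abs(b_new - b_q) < tol
            b_q = b_new
            if converged:
                break
        return {"mu_mean": mu_q, "mu_var": 1.0 / ((kappa0 + n) * a_q / b_q), "a": a_q, "b": b_q}

    # e.g. cavi_normal_gamma(np.random.default_rng(0).normal(0.3, 1.0, size=50))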

Estimation of a 1-dim Gaussian mean and variance (cont.)


[Figure: contour plots over the mean µ and precision ψ showing the variational approximation q(µ, ψ) against the target, at initialisation and after each µ and ψ update in iterations 1 and 2; the lower bound L(q) rises towards log p(y) as the factors are updated in turn.]
Comparison of solutions

Variational posterior:

  µ ∼ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) E[ψ]) )
  ψ ∼ Γ( a0 + n/2, b0 + c/2 ),   c = E[ Σ_{i=1}^{n} (yi − µ)^2 + κ0 (µ − µ0)^2 ]

True posterior:

  µ|ψ ∼ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) ψ) )
  ψ ∼ Γ( a0 + n/2, b0 + c′/2 ),   c′ = Σ_{i=1}^{n} (yi − ȳ)^2 + nκ0 (ȳ − µ0)^2/(κ0 + n)

• Cov(µ, ψ) = 0 by design in the VI solution.
• For this simple example, it is possible to decouple and solve explicitly.
• The VI solution leads to the unbiased MLE if κ0 = µ0 = a0 = b0 = 0.

Gaussian mixture model (Old Faithful data set)



● ●


90 ● ●● ● ● ●
Time to next eruption (minutes)

● ● ●
● ● ● ● ● ●
● ●
● ●● ● ●
● ● ● ● ●●
● ● ●● ●●● ●● ●
● ● ● ● ● ●● ● ● ●
● ● ● ●●●● ●● ● ● ●
● ● ● ● ● ● ●● ● ●● ●
80 ● ●



● ●●
● ● ● ●
●● ● ●

● ●● ● ● ●● ● ● ●● ●● ● ●
●● ●● ● ● ●● ● ●
● ● ● ● ● ● ● ● ●
● ● ●● ● ●● ●
● ● ● ● ● ●
● ● ● ●● ●

● ● ● ● ●
70 ●
● ● ●●



● ●
● ● ●
● ● ● ●
● ● ●
● ● ● ●
60 ● ●●
● ● ● ●●●
●● ● ●
● ●● ●
● ●
●● ●
● ●● ● ● ●
●● ●●●● ● ●
● ● ● ● ● ●
● ● ● ● ●
● ●● ● ● ●
50 ●
● ●● ● ●
● ● ●
● ● ●
● ● ●
●●● ●
●● ●

2 3 4 5
Eruption length (minutes)

iid PK
• Let xi ∈ Rd and assume xi ∼ k=1 πk Nd (µk , Ψ−1
k ) for i = 1, . . . , n.

Gaussian mixture model


[Figure: directed graphical model with plates k = 1, . . . , K and i = 1, . . . , n; nodes π, zi, xi and µk, Ψk.]

  p(x, z, π, µ, Ψ) = p(x|z, µ, Ψ) p(z|π) × p(π) p(µ|Ψ) p(Ψ)
                   = p(x|z, µ, Ψ) p(z|π)
                     × DirK(π|α01, . . . , α0K)
                     × ∏_{k=1}^{K} Nd(µk|m0, (κ0 Ψk)^{-1})
                     × ∏_{k=1}^{K} Wisd(Ψk|W0, ν0)

• Introduce zi = (zi1, . . . , ziK), a 1-of-K binary vector, where each zik ∼ Bern(πk).
• Assuming z = {z1, . . . , zn} are observed along with x = {x1, . . . , xn},

  p(x|z, µ, Ψ) = ∏_{i=1}^{n} ∏_{k=1}^{K} Nd(xi|µk, Ψk^{-1})^{zik}.
Variational inference for GMM


• Assume the mean-field posterior density
q(z, π, µ, Ψ) = q(z)q(π, µ, Ψ)
= q(z)q(π)q(µ|Ψ)q(Ψ).

Algorithm 2 CAVI for GMM

1: initialise Variational factors q(z), q(π) and q(µ, Ψ)


2: while L(q) not converged do
3: q(zik ) ← Bern(·)
4: q(π) ← DirK (·)
5: q(µ|Ψ) ← Nd (·, ·)
6: q(Ψ) ← Wisd (·, ·)
7: L(q) ← Eq [log p(x, z, π, µ, Ψ)] − Eq [log q(z, π, µ, Ψ)]
8: end while
9: return q̃(z, π, µ, Ψ) = q̃(z)q̃(π)q̃(µ|Ψ)q̃(Ψ)
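
For reference, essentially this mean-field scheme (Dirichlet prior on the weights, Gaussian-Wishart priors on the components) is what scikit-learn implements in BayesianGaussianMixture; a small sketch on synthetic two-cluster data (the data-generating numbers below are made up, not the Old Faithful values):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Synthetic two-cluster data (made-up numbers, not the Old Faithful measurements)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([2.0, 54.0], [0.3, 6.0], size=(100, 2)),
                   rng.normal([4.3, 80.0], [0.4, 6.0], size=(170, 2))])

    vb_gmm = BayesianGaussianMixture(
        n_components=6,                       # deliberately more components than needed
        covariance_type="full",
        weight_concentration_prior_type="dirichlet_distribution",  # Dir(alpha_0) prior on pi
        weight_concentration_prior=1e-2,      # small alpha_0 encourages pruning of components
        max_iter=500,
        random_state=0,
    ).fit(X)

    print(np.round(vb_gmm.weights_, 3))       # surplus components end up with ~zero weight
    print(vb_gmm.predict(X[:5]))              # cluster assignments via the responsibilities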

Variational inference for GMM (cont.)


[Figure: the Old Faithful scatter plot (eruption length vs time to next eruption, in minutes) with the fitted variational GMM at iterations 0, 20 and 158 of Algorithm 2.]
Final thoughts on variational GMM

• The algorithm is similar to EM, and therefore has similar computational cost.
• Can be extended to mixtures of Bernoullis, a.k.a. latent class analysis.
• PROS:
  – Automatic selection of the number of mixture components.
  – Fewer pathological special cases than EM solutions, because the estimates are regularised by the prior.
  – Less sensitive to the number of parameters/components.
• CONS:
  – Hyperparameter tuning.

1 Introduction

2 Examples

3 Discussion

Exponential families
• For the mean-field variational method, suppose that each complete conditional is in the exponential family:

  p(z^{(j)}|z−j, y) = h(z^{(j)}) exp{ ηj(z−j, y) · z^{(j)} − A(ηj) }.

• Then, from (1),

  q̃j(z^{(j)}) ∝ exp{ E−j[log p(z^{(j)}|z−j, y)] }
             = exp{ log h(z^{(j)}) + E[ηj(z−j, y)] · z^{(j)} − E[A(ηj)] }
             ∝ h(z^{(j)}) exp{ E[ηj(z−j, y)] · z^{(j)} }

  is also in the same exponential family.
• Cf. Gibbs conditional densities.
• ISSUE: What if the complete conditionals are not in the exponential family? Resort to importance sampling or Metropolis sampling.
Non-convexity of ELBO

[Figure: two trace plots of the lower bound against iteration (0–100).]

• CAVI only guarantees convergence to a local optimum.


• Multiple local optima may exist.
Zero-forcing vs Zero-avoiding

• Back to the KL divergence:

  KL(q‖p) = ∫ log{ q(z) / p(z|y) } q(z) dz

• KL(q‖p) is large when p(z|y) is close to zero, unless q(z) is also close to zero (zero-forcing).
• What about other measures of closeness? For instance,

  KL(p‖q) = ∫ log{ p(z|y) / q(z) } p(z|y) dz.

• This gives the Expectation Propagation (EP) algorithm.
• It is zero-avoiding: KL(p‖q) is large when q(z) is close to zero in regions where p(z|y) is not, so q is pushed to remain non-zero wherever p(z|y) has mass.
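
A quick numerical sketch of the two directions of the KL divergence (my own toy densities, evaluated on a grid), comparing a mode-locking q with a mode-covering q against a bimodal target:

    import numpy as np

    z = np.linspace(-10, 10, 4001)
    dz = z[1] - z[0]

    def normal(z, mu, sd):
        return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    p = 0.5 * normal(z, -3, 1) + 0.5 * normal(z, 3, 1)   # toy bimodal "posterior"
    q_mode = normal(z, 3, 1)                             # sits on one mode (zero-forcing fit)
    q_cover = normal(z, 0, 3.5)                          # spreads over both modes (zero-avoiding fit)

    def kl(a, b):
        return np.sum(a * np.log(a / b)) * dz            # grid approximation to KL(a || b)

    print(kl(q_mode, p), kl(p, q_mode))     # KL(q||p) small, KL(p||q) large
    print(kl(q_cover, p), kl(p, q_cover))   # KL(q||p) larger, KL(p||q) small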

Zero-forcing vs Zero-avoiding (cont.)

[Figure: three panels showing a target p(z) and an approximating q(z). In the first two panels KL(q‖p) is small, with q concentrated on part of p; in the third panel KL(p‖q) is small, with q spread out to cover p.]
Distortion of higher order moments


[Figure: contours of a correlated bivariate Gaussian p(z) over (z1, z2), together with its axis-aligned mean-field approximation q(z).]
• Consider z = (z1, z2)ᵀ ∼ N2(µ, Ψ^{-1}) with Cov(z1, z2) ≠ 0.
• Approximating p(z) by q(z) = q(z1)q(z2) yields

  q̃(z1) = N(z1|µ1, ψ11^{-1})   and   q̃(z2) = N(z2|µ2, ψ22^{-1}),

  where ψjj is the jth diagonal entry of the precision matrix Ψ, and by definition Cov(z1, z2) = 0 under q̃.
• This leads to underestimation of the variances (widely reported in the literature; see Zhao and Marriott 2013).
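
To see the size of the effect: the mean-field variance 1/ψ11 equals the conditional variance Var(z1|z2), which is smaller than the marginal variance whenever the correlation is non-zero. A quick check (with a correlation of 0.9, my own choice):

    import numpy as np

    rho = 0.9
    Sigma = np.array([[1.0, rho], [rho, 1.0]])   # true covariance of p(z)
    Psi = np.linalg.inv(Sigma)                   # precision matrix

    print(Sigma[0, 0])      # true marginal Var(z1) = 1.0
    print(1.0 / Psi[0, 0])  # mean-field Var(z1) = 1 - rho^2 = 0.19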
Quality of approximation

• Variational inference converges to a different optimum than ML, except for certain models (Gunawardana and Byrne 2005).
• But not much can be said about the quality of approximation.
• Statistical properties not well understood—what is its statistical profile
relative to the exact posterior?
• Speed trumps accuracy?

[Figure: a speed-accuracy spectrum. Fast and inaccurate: variational inference, Laplace. Slow and accurate: MCMC, expectation propagation.]
Advanced topics

• Local variational bounds
  – Do not use the mean-field assumption.
  – Instead, find a bound for the marginalising integral I.
  – Used for Bayesian logistic regression, for example:

    I = ∫ expit(xᵀβ) p(β) dβ ≥ ∫ f(xᵀβ, ξ) p(β) dβ,

    where f(·, ξ) is a tractable lower bound on the logistic function (see the sketch after this list).
• Stochastic variational inference
  – Uses ideas from stochastic optimisation: gradient-based improvement of the ELBO from subsamples of the data.
  – Scales to massive data.
• Black box variational inference
  – Goes beyond exponential families and model-specific derivations.
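
The slides do not name the bound f; one common choice (an assumption on my part) is the Jaakkola–Jordan quadratic bound on the logistic sigmoid, which is easy to verify numerically:

    import numpy as np
    from scipy.special import expit

    # Jaakkola-Jordan bound: one common choice for f(x, xi), not specified on the slides.
    def lam(xi):
        return np.tanh(xi / 2.0) / (4.0 * xi)

    def jj_bound(x, xi):
        """Quadratic-in-x lower bound on expit(x), with variational parameter xi."""
        return expit(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

    x = np.linspace(-6, 6, 13)
    for xi in (0.5, 2.0, 4.0):
        assert np.all(jj_bound(x, xi) <= expit(x) + 1e-12)   # valid lower bound for every xi
    print(expit(2.0), jj_bound(2.0, 2.0))                    # the bound is tight at x = +/- xi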

End

Thank you!

Slides and source code are made available at: http://socialstats.haziqj.ml


References I

Beal, M. J. and Z. Ghahramani (2003). “The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures”. In: Bayesian Statistics 7. Proceedings of the Seventh Valencia International Meeting. Ed. by J. M. Bernardo, A. P. Dawid, J. O. Berger, M. West, D. Heckerman, M. Bayarri, and A. F. Smith. Oxford: Oxford University Press, pp. 453–464.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D. M. (2017). “Variational Inference: Foundations and Innovations”.
URL:
https://simons.berkeley.edu/talks/david-blei-2017-5-1.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). “Variational
inference: A review for statisticians”. Journal of the American Statistical
Association, to appear.

References II
Erosheva, E. A., S. E. Fienberg, and C. Joutard (2007). “Describing
disability through individual-level mixture models for multivariate binary
data”. Annals of Applied Statistics, 1.2, p. 346.
Grimmer, J. (2010). “An introduction to Bayesian inference via variational
approximations”. Political Analysis 19.1, pp. 32–47.
Gunawardana, A. and W. Byrne (2005). “Convergence theorems for
generalized alternating minimization procedures”. Journal of Machine
Learning Research 6, pp. 2049–2073.
Kass, R. and A. Raftery (1995). “Bayes Factors”. Journal of the American
Statistical Association 90.430, pp. 773–795.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The
MIT Press.

References III
Wang, Y. S., R. Matsueda, and E. A. Erosheva (2017). “A variational EM
method for mixed membership models with multivariate rank data: An
analysis of public policy preferences”. arXiv: 1512.08731.
Zhao, H. and P. Marriott (2013). “Diagnostics for variational Bayes
approximations”. arXiv: 1309.5117.

4 Additional material
The variational principle
Laplace’s method
Solutions to Gaussian mixture

The variational principle


• The name derives from the calculus of variations, which deals with maximising or minimising functionals.

  Functions   p : θ ↦ R   (standard calculus)
  Functionals H : p ↦ R   (variational calculus)

• Using standard calculus, we can solve

  arg max_θ p(θ) =: θ̂,

  e.g. p is a likelihood function, and θ̂ is the ML estimate.
• Using variational calculus, we can solve

  arg max_p H(p) =: p̃,

  e.g. H is the entropy H(p) = −∫ p(x) log p(x) dx, and p̃ is the entropy-maximising distribution.
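
As a concrete functional-optimisation example (my own, not on the slides), here the entropy functional is maximised numerically over distributions on {1, . . . , 6} subject to a mean constraint; the solution has the exponential-family (Gibbs) form p_i ∝ exp(λ x_i):

    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(1, 7)                       # support {1, ..., 6} (a made-up example)

    def neg_entropy(p):
        return np.sum(p * np.log(p))          # -H(p)

    constraints = ({"type": "eq", "fun": lambda p: p.sum() - 1.0},   # sums to one
                   {"type": "eq", "fun": lambda p: p @ x - 4.5})     # E[X] = 4.5
    result = minimize(neg_entropy, x0=np.full(6, 1 / 6), bounds=[(1e-9, 1)] * 6,
                      constraints=constraints, method="SLSQP")
    print(np.round(result.x, 4))              # weights increase geometrically in x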
Laplace’s method

• Interested in p(f|y) ∝ p(y|f)p(f) =: e^{Q(f)}, with normalising constant p(y) = ∫ e^{Q(f)} df. The Taylor expansion of Q about its mode f̃,

  Q(f) ≈ Q(f̃) − (1/2)(f − f̃)ᵀ A (f − f̃),

  is recognised as the logarithm of an unnormalised Gaussian density, with A = −D²Q(f̃) being the negative Hessian of Q evaluated at f̃.
• The posterior p(f|y) is approximated by N(f̃, A^{-1}), and the marginal by

  p(y) ≈ (2π)^{n/2} |A|^{-1/2} p(y|f̃) p(f̃).

• Won't scale with large n; difficult to find modes in high dimensions.

R. Kass and A. Raftery (1995). “Bayes Factors”. Journal of the American Statistical Association 90.430, pp. 773–795, §4.1, pp. 777–778.
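
A minimal sketch of this recipe on a one-dimensional toy model of my own (Poisson counts with a Gaussian prior on the log-rate), so the dimension of f is 1 in the formula above; constants log(y_i!) are dropped throughout:

    import numpy as np
    from scipy.optimize import minimize

    y = np.array([3, 5, 4, 6, 2])             # toy counts: y_i ~ Poisson(exp(f)), f ~ N(0, 1)

    def negQ(f):                              # -Q(f) = -log p(y|f) - log p(f), up to constants
        f = float(f[0])
        return -(np.sum(y * f - np.exp(f)) - 0.5 * f ** 2)

    f_mode = minimize(negQ, x0=np.array([0.0])).x[0]      # mode f~
    A = len(y) * np.exp(f_mode) + 1.0                     # -Q''(f~), here a 1x1 "Hessian"

    posterior_approx = (f_mode, 1.0 / A)                  # p(f|y) ~= N(f~, 1/A)
    log_marginal = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A) - negQ(np.array([f_mode]))
    print(posterior_approx, log_marginal)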
Comparison of approximations (density)

[Figure: densities of the truth, the variational approximation and the Laplace approximation over z; the mode and the mean are marked on the z axis.]
Comparison of approximations (deviance)


[Figure: deviance (−2 × log-density) of the truth, the variational approximation and the Laplace approximation over z.]
Variational solutions to Gaussian mixture model


Variational E-step

  q̃(z) = ∏_{i=1}^{n} ∏_{k=1}^{K} rik^{zik},   rik = ρik / Σ_{k=1}^{K} ρik

  log ρik = E[log πk] + (1/2) E[log |Ψk|] − (d/2) log 2π − (1/2) E[(xi − µk)ᵀ Ψk (xi − µk)]

Variational M-step

  q̃(π1, . . . , πK) = DirK(π|α̃),   α̃k = α0k + Σ_{i=1}^{n} rik

  q̃(µ, Ψ) = ∏_{k=1}^{K} Nd( µk | m̃k, (κ̃k Ψk)^{-1} ) Wisd( Ψk | W̃k, ν̃k )
Variational solutions to Gaussian mixture model (cont.)


  κ̃k = κ0 + Σ_{i=1}^{n} rik

  m̃k = ( κ0 m0 + Σ_{i=1}^{n} rik xi ) / κ̃k

  W̃k^{-1} = W0^{-1} + Σ_{i=1}^{n} rik (xi − x̄k)(xi − x̄k)ᵀ,   x̄k = Σ_{i=1}^{n} rik xi / Σ_{i=1}^{n} rik

  ν̃k = ν0 + Σ_{i=1}^{n} rik

Also useful

  E[(xi − µk)ᵀ Ψk (xi − µk)] = d/κ̃k + ν̃k (xi − m̃k)ᵀ W̃k (xi − m̃k)

  E[log |Ψk|] = Σ_{i=1}^{d} ψ( (ν̃k + 1 − i)/2 ) + d log 2 + log |W̃k|

  E[log πk] = ψ(α̃k) − ψ( Σ_{k=1}^{K} α̃k ),   where ψ(·) is the digamma function
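
The expectations above are all that is needed to code the responsibility update; a sketch (the function name and argument layout are my own):

    import numpy as np
    from scipy.special import digamma

    def responsibilities(X, alpha, m, kappa, W, nu):
        """r_ik from the current q(pi, mu, Psi), following the expectations listed above.
        Shapes (a sketch): X (n, d), alpha (K,), m (K, d), kappa (K,), W (K, d, d), nu (K,)."""
        n, d = X.shape
        K = len(alpha)
        E_log_pi = digamma(alpha) - digamma(alpha.sum())
        log_rho = np.empty((n, K))
        for k in range(K):
            E_logdet = (digamma((nu[k] + 1 - np.arange(1, d + 1)) / 2).sum()
                        + d * np.log(2) + np.linalg.slogdet(W[k])[1])
            diff = X - m[k]
            quad = d / kappa[k] + nu[k] * np.einsum("ij,jl,il->i", diff, W[k], diff)
            log_rho[:, k] = E_log_pi[k] + 0.5 * E_logdet - 0.5 * d * np.log(2 * np.pi) - 0.5 * quad
        log_rho -= log_rho.max(axis=1, keepdims=True)      # normalise in log space for stability
        r = np.exp(log_rho)
        return r / r.sum(axis=1, keepdims=True)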