
A Beginner’s Guide to Variational Inference

Haziq Jamil

Social Statistics
London School of Economics and Political Science

1 February 2018

Social Statistics Meeting

http://socialstats.haziqj.ml
Outline

1 Introduction
Idea
Comparison to EM
Mean-field distributions
Coordinate ascent algorithm
2 Examples
Univariate Gaussian
Gaussian mixtures
3 Discussion
Exponential families
Zero-forcing vs Zero-avoiding
Quality of approximation
Advanced topics

Introduction
• Consider a statistical model with observations y = (y1, . . . , yn) and latent variables z = (z1, . . . , zm).
• We want to evaluate the intractable integral

  I := ∫ p(y|z) p(z) dz

  – Bayesian posterior analysis
  – Random effects models
  – Mixture models
• Variational inference approximates the “posterior” p(z|y) by a tractable distribution that is close in the Kullback-Leibler sense.
• Advantages:
  – Computationally fast
  – Convergence easily assessed
  – Works well in practice
In the literature
[Figure: Google Scholar hits per year for 'variational inference', 1980–2010, rising steeply after 2000 to several thousand hits per year.]
• Well known in the machine learning community.
• In social statistics:
  – E. A. Erosheva et al. (2007). “Describing disability through individual-level mixture models for multivariate binary data”. Ann. Appl. Stat. 1.2, p. 346
  – J. Grimmer (2010). “An introduction to Bayesian inference via variational approximations”. Political Analysis 19.1, pp. 32–47
  – Y. S. Wang et al. (2017). “A variational EM method for mixed membership models with multivariate rank data: An analysis of public policy preferences”. arXiv: 1512.08731
Recommended texts

• M. J. Beal and Z. Ghahramani (2003). “The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures”. In: Bayesian Statistics 7. Proceedings of the Seventh Valencia International Meeting. Ed. by J. M. Bernardo et al. Oxford: Oxford University Press, pp. 453–464
• C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer
• K. P. Murphy (2012). Machine Learning: A Probabilistic Perspective. The MIT Press
• D. M. Blei et al. (2017). “Variational inference: A review for statisticians”. J. Am. Stat. Assoc., to appear

Idea
[Figure: the target posterior p(z|y) and a family of tractable densities q(z; ν); starting from an initial value ν init, the variational parameters are moved to ν∗, the member of the family closest to p(z|y) in KL(q‖p).]
• Minimise the Kullback-Leibler divergence (using calculus of variations)

  KL(q‖p) = − ∫ log{ p(z|y) / q(z) } q(z) dz.

• ISSUE: KL(q‖p) is intractable.

D. M. Blei (2017). “Variational Inference: Foundations and Innovations”. URL:
https://simons.berkeley.edu/talks/david-blei-2017-5-1
The Evidence Lower Bound (ELBO)

• Let q(z) be some density function to approximate p(z|y). Then the


log-marginal density can be decomposed as follows:

log p(y) = log p(y, z) − log p(z|y)


Z  
p(y, z) p(z|y)
= log − log q(z) dz
q(z) q(z)
= L(q) + KL(qkp)
≥ L(q)

• L is referred to as the “lower-bound”, and it serves as a surrogate


function to the marginal.
• Maximising L(q) is equivalent to minimising KL(qkp).
• ISSUE: L(q) is (generally) not convex.
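
As a sanity check on this decomposition, here is a small numerical sketch on a conjugate Gaussian toy model of my own choosing (not one of the talk's examples): for any choice of q, the ELBO plus the KL term recovers log p(y) exactly.

    import numpy as np
    from scipy.stats import norm

    # Toy model (my own example): y | theta ~ N(theta, sigma^2), theta ~ N(0, tau^2).
    y, sigma, tau = 1.3, 0.8, 2.0
    m, s = 0.4, 0.6                          # an arbitrary q(theta) = N(m, s^2)

    # L(q) = E_q[log p(y, theta)] - E_q[log q(theta)], using Gaussian expectations
    E_loglik = norm.logpdf(y, m, sigma) - s**2 / (2 * sigma**2)
    E_logprior = norm.logpdf(m, 0, tau) - s**2 / (2 * tau**2)
    entropy_q = 0.5 * np.log(2 * np.pi * np.e * s**2)
    L = E_loglik + E_logprior + entropy_q

    # Exact posterior p(theta|y) and KL(q || p(theta|y)) for two Gaussians
    post_var = 1 / (1 / sigma**2 + 1 / tau**2)
    post_mean = post_var * y / sigma**2
    KL = np.log(np.sqrt(post_var) / s) + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5

    print(L + KL, norm.logpdf(y, 0, np.sqrt(sigma**2 + tau**2)))   # both equal log p(y)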

Comparison to the EM algorithm


• Suppose for this part, the marginal density p(y|θ) depends on parameters θ.
• In the EM algorithm, the true posterior density is used, i.e. q(z) ≡ p(z|y, θ).
• Thus,

  log p(y|θ) = ∫ [ log p(y, z|θ) − log p(z|y, θ) ] p(z|y, θ^{(t)}) dz
             = E_{θ^{(t)}}[log p(y, z|θ)] − E_{θ^{(t)}}[log p(z|y, θ)]
             = Q(θ|θ^{(t)}) + entropy.

• Minimising the KL divergence corresponds to the E-step.
• For any θ,

  log p(y|θ) − log p(y|θ^{(t)}) = Q(θ|θ^{(t)}) − Q(θ^{(t)}|θ^{(t)}) + ∆entropy
                                ≥ Q(θ|θ^{(t)}) − Q(θ^{(t)}|θ^{(t)}).
[Figure: bar diagrams comparing the M-step in EM with the M-step in VI. In both, log p(y; θ∗) decomposes into L(q̃; θ∗) plus KL(q̃‖p); in EM the KL term vanishes after the E-step, whereas in VI a non-zero KL(q̃‖p) gap remains.]
Factorised distributions (Mean-field theory)


• Maximising L over all possible q is not feasible. We need some restrictions, but only enough to achieve tractability.
• Suppose we partition the elements of z into M disjoint groups z = (z^{(1)}, . . . , z^{(M)}), and assume

  q(z) = ∏_{j=1}^{M} qj(z^{(j)}).

• Under this restriction, the solution to arg max_q L(q) is

  q̃j(z^{(j)}) ∝ exp{ E−j[log p(y, z)] }          (1)

  for j ∈ {1, . . . , M}.
• In practice, these unnormalised densities are of recognisable form (especially if conjugacy is considered).
Coordinate ascent mean-field variational inference (CAVI)


• The optimal distributions are coupled with one another, i.e. each q̃j(z^{(j)}) depends on the optimal moments of z^{(k)}, k ∈ {1, . . . , M : k ≠ j}.
• One way around this is to employ an iterative procedure.
• Assess convergence by monitoring the lower bound L(q) = Eq[log p(y, z)] − Eq[log q(z)].

Algorithm 1 CAVI
1: initialise variational factors qj(z^{(j)}), j = 1, . . . , M
2: while L(q) not converged do
3:   for j = 1, . . . , M do
4:     log qj(z^{(j)}) ← E−j[log p(y, z)] + const.        ▷ from (1)
5:   end for
6:   L(q) ← Eq[log p(y, z)] − Eq[log q(z)]
7: end while
8: return q̃(z) = ∏_{j=1}^{M} q̃j(z^{(j)})
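
In code, Algorithm 1 is just a double loop; a minimal sketch, where update_factor and elbo are hypothetical model-specific functions supplied by the user (they are not defined on these slides):

    def cavi(factors, update_factor, elbo, tol=1e-6, max_iter=1000):
        """Generic coordinate ascent: cycle through the factors until L(q) stabilises.
        `update_factor` and `elbo` are user-supplied placeholders, not from the slides."""
        bound = -float("inf")
        for _ in range(max_iter):
            for j in range(len(factors)):
                factors[j] = update_factor(j, factors)   # optimal q_j from (1)
            new_bound = elbo(factors)                    # E_q[log p(y,z)] - E_q[log q(z)]
            if abs(new_bound - bound) < tol:
                break
            bound = new_bound
        return factors, bound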

1 Introduction

2 Examples

3 Discussion

Estimation of a 1-dim Gaussian mean and variance


• GOAL: Bayesian inference for the mean µ and variance ψ^{-1} of

  yi ∼ iid N(µ, ψ^{-1}),  i = 1, . . . , n                      (data)
  µ|ψ ∼ N(µ0, (κ0 ψ)^{-1}),   ψ ∼ Γ(a0, b0)                     (priors)

• Substitute p(µ, ψ|y) with the mean-field approximation q(µ, ψ) = qµ(µ) qψ(ψ).
• From (1), we can work out the solutions

  q̃µ(µ) ≡ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) Eq[ψ]) )   and   q̃ψ(ψ) ≡ Γ(ã, b̃),

  ã = a0 + n/2,   b̃ = b0 + (1/2) Eq[ Σ_{i=1}^{n} (yi − µ)^2 + κ0 (µ − µ0)^2 ].
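
These two updates are easy to code up; a minimal sketch following the formulas above (the function name and default hyperparameters are mine, not from the slides):

    import numpy as np

    def cavi_normal_gamma(y, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, max_iter=100, tol=1e-10):
        """CAVI for the univariate Normal-Gamma model, following the updates above (a sketch)."""
        n, ybar = len(y), np.mean(y)
        mu_q = (kappa0 * mu0 + n * ybar) / (kappa0 + n)   # Eq[mu], fixed across iterations
        a_q = a0 + n / 2                                  # shape of q(psi), also fixed
        b_q = b0                                          # rate of q(psi), to be iterated
        for _ in range(max_iter):
            var_q = 1.0 / ((kappa0 + n) * (a_q / b_q))    # Var_q[mu] given current Eq[psi]
            # Eq[ sum_i (y_i - mu)^2 + kappa0 (mu - mu0)^2 ] under q(mu) = N(mu_q, var_q)
            ess = np.sum((y - mu_q) ** 2) + n * var_q + kappa0 * ((mu_q - mu0) ** 2 + var_q)
            b_new = b0 + 0.5 * ess
            converged = abs(b_new - b_q) < tol
            b_q = b_new
            if converged:
                break
        return {"mu_mean": mu_q, "mu_var": 1.0 / ((kappa0 + n) * a_q / b_q), "a": a_q, "b": b_q}

    # e.g. cavi_normal_gamma(np.random.default_rng(0).normal(0.3, 1.0, size=50))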

Estimation of a 1-dim Gaussian mean and variance (cont.)


[Figure: contour plots over the mean µ and precision ψ showing the variational approximation q(µ, ψ) against the target, at initialisation and after each µ and ψ update in iterations 1 and 2; the lower bound L(q) rises towards log p(y) as the factors are updated in turn.]
Comparison of solutions

Variational posterior:

  µ ∼ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) E[ψ]) )
  ψ ∼ Γ( a0 + n/2, b0 + c/2 ),   c = E[ Σ_{i=1}^{n} (yi − µ)^2 + κ0 (µ − µ0)^2 ]

True posterior:

  µ|ψ ∼ N( (κ0 µ0 + nȳ)/(κ0 + n), 1/((κ0 + n) ψ) )
  ψ ∼ Γ( a0 + n/2, b0 + c′/2 ),   c′ = Σ_{i=1}^{n} (yi − ȳ)^2 + nκ0 (ȳ − µ0)^2/(κ0 + n)

• Cov(µ, ψ) = 0 by design in the VI solution.
• For this simple example, it is possible to decouple and solve explicitly.
• The VI solution leads to the unbiased MLE if κ0 = µ0 = a0 = b0 = 0.

Gaussian mixture model (Old Faithful data set)



● ●


90 ● ●● ● ● ●
Time to next eruption (minutes)

● ● ●
● ● ● ● ● ●
● ●
● ●● ● ●
● ● ● ● ●●
● ● ●● ●●● ●● ●
● ● ● ● ● ●● ● ● ●
● ● ● ●●●● ●● ● ● ●
● ● ● ● ● ● ●● ● ●● ●
80 ● ●



● ●●
● ● ● ●
●● ● ●

● ●● ● ● ●● ● ● ●● ●● ● ●
●● ●● ● ● ●● ● ●
● ● ● ● ● ● ● ● ●
● ● ●● ● ●● ●
● ● ● ● ● ●
● ● ● ●● ●

● ● ● ● ●
70 ●
● ● ●●



● ●
● ● ●
● ● ● ●
● ● ●
● ● ● ●
60 ● ●●
● ● ● ●●●
●● ● ●
● ●● ●
● ●
●● ●
● ●● ● ● ●
●● ●●●● ● ●
● ● ● ● ● ●
● ● ● ● ●
● ●● ● ● ●
50 ●
● ●● ● ●
● ● ●
● ● ●
● ● ●
●●● ●
●● ●

2 3 4 5
Eruption length (minutes)

iid PK
• Let xi ∈ Rd and assume xi ∼ k=1 πk Nd (µk , Ψ−1
k ) for i = 1, . . . , n.

Gaussian mixture model


[Figure: directed graphical model with plates k = 1, . . . , K and i = 1, . . . , n; nodes π, zi, xi and µk, Ψk.]

  p(x, z, π, µ, Ψ) = p(x|z, µ, Ψ) p(z|π) × p(π) p(µ|Ψ) p(Ψ)
                   = p(x|z, µ, Ψ) p(z|π)
                     × DirK(π|α01, . . . , α0K)
                     × ∏_{k=1}^{K} Nd(µk|m0, (κ0 Ψk)^{-1})
                     × ∏_{k=1}^{K} Wisd(Ψk|W0, ν0)

• Introduce zi = (zi1, . . . , ziK), a 1-of-K binary vector, where each zik ∼ Bern(πk).
• Assuming z = {z1, . . . , zn} are observed along with x = {x1, . . . , xn},

  p(x|z, µ, Ψ) = ∏_{i=1}^{n} ∏_{k=1}^{K} Nd(xi|µk, Ψk^{-1})^{zik}.
Variational inference for GMM


• Assume the mean-field posterior density
q(z, π, µ, Ψ) = q(z)q(π, µ, Ψ)
= q(z)q(π)q(µ|Ψ)q(Ψ).

Algorithm 2 CAVI for GMM

1: initialise Variational factors q(z), q(π) and q(µ, Ψ)


2: while L(q) not converged do
3: q(zik ) ← Bern(·)
4: q(π) ← DirK (·)
5: q(µ|Ψ) ← Nd (·, ·)
6: q(Ψ) ← Wisd (·, ·)
7: L(q) ← Eq [log p(x, z, π, µ, Ψ)] − Eq [log q(z, π, µ, Ψ)]
8: end while
9: return q̃(z, π, µ, Ψ) = q̃(z)q̃(π)q̃(µ|Ψ)q̃(Ψ)
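
For reference, essentially this mean-field scheme (Dirichlet prior on the weights, Gaussian-Wishart priors on the components) is what scikit-learn implements in BayesianGaussianMixture; a small sketch on synthetic two-cluster data (the data-generating numbers below are made up, not the Old Faithful values):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Synthetic two-cluster data (made-up numbers, not the Old Faithful measurements)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([2.0, 54.0], [0.3, 6.0], size=(100, 2)),
                   rng.normal([4.3, 80.0], [0.4, 6.0], size=(170, 2))])

    vb_gmm = BayesianGaussianMixture(
        n_components=6,                       # deliberately more components than needed
        covariance_type="full",
        weight_concentration_prior_type="dirichlet_distribution",  # Dir(alpha_0) prior on pi
        weight_concentration_prior=1e-2,      # small alpha_0 encourages pruning of components
        max_iter=500,
        random_state=0,
    ).fit(X)

    print(np.round(vb_gmm.weights_, 3))       # surplus components end up with ~zero weight
    print(vb_gmm.predict(X[:5]))              # cluster assignments via the responsibilities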

Variational inference for GMM (cont.)


[Figure: the Old Faithful scatter plot (eruption length vs time to next eruption, in minutes) with the fitted variational GMM at iterations 0, 20 and 158 of Algorithm 2.]
Final thoughts on variational GMM

• The algorithm is similar to EM, and therefore has similar computational cost.
• Can be extended to mixtures of Bernoullis, a.k.a. latent class analysis.
• PROS:
  – Automatic selection of the number of mixture components.
  – Fewer pathological special cases than EM solutions, because the estimates are regularised by the prior.
  – Less sensitive to the number of parameters/components.
• CONS:
  – Hyperparameter tuning.

1 Introduction

2 Examples

3 Discussion

Exponential families
• For the mean-field variational method, suppose that each complete conditional is in the exponential family:

  p(z^{(j)}|z−j, y) = h(z^{(j)}) exp{ ηj(z−j, y) · z^{(j)} − A(ηj) }.

• Then, from (1),

  q̃j(z^{(j)}) ∝ exp{ E−j[log p(z^{(j)}|z−j, y)] }
             = exp{ log h(z^{(j)}) + E[ηj(z−j, y)] · z^{(j)} − E[A(ηj)] }
             ∝ h(z^{(j)}) exp{ E[ηj(z−j, y)] · z^{(j)} }

  is also in the same exponential family.
• Cf. Gibbs conditional densities.
• ISSUE: What if the complete conditionals are not in the exponential family? Resort to importance sampling or Metropolis sampling.
Non-convexity of ELBO

[Figure: two trace plots of the lower bound against iteration (0–100).]

• CAVI only guarantees convergence to a local optimum.


• Multiple local optima may exist.
Zero-forcing vs Zero-avoiding

• Back to the KL divergence:

  KL(q‖p) = ∫ log{ q(z) / p(z|y) } q(z) dz

• KL(q‖p) is large when p(z|y) is close to zero, unless q(z) is also close to zero (zero-forcing).
• What about other measures of closeness? For instance,

  KL(p‖q) = ∫ log{ p(z|y) / q(z) } p(z|y) dz.

• This gives the Expectation Propagation (EP) algorithm.
• It is zero-avoiding: KL(p‖q) is large when q(z) is close to zero in regions where p(z|y) is not, so q is pushed to remain non-zero wherever p(z|y) has mass.
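
A quick numerical sketch of the two directions of the KL divergence (my own toy densities, evaluated on a grid), comparing a mode-locking q with a mode-covering q against a bimodal target:

    import numpy as np

    z = np.linspace(-10, 10, 4001)
    dz = z[1] - z[0]

    def normal(z, mu, sd):
        return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    p = 0.5 * normal(z, -3, 1) + 0.5 * normal(z, 3, 1)   # toy bimodal "posterior"
    q_mode = normal(z, 3, 1)                             # sits on one mode (zero-forcing fit)
    q_cover = normal(z, 0, 3.5)                          # spreads over both modes (zero-avoiding fit)

    def kl(a, b):
        return np.sum(a * np.log(a / b)) * dz            # grid approximation to KL(a || b)

    print(kl(q_mode, p), kl(p, q_mode))     # KL(q||p) small, KL(p||q) large
    print(kl(q_cover, p), kl(p, q_cover))   # KL(q||p) larger, KL(p||q) small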

Zero-forcing vs Zero-avoiding (cont.)

[Figure: three panels showing a target p(z) and an approximating q(z). In the first two panels KL(q‖p) is small, with q concentrated on part of p; in the third panel KL(p‖q) is small, with q spread out to cover p.]
Distortion of higher order moments


[Figure: contours of a correlated bivariate Gaussian p(z) over (z1, z2), together with its axis-aligned mean-field approximation q(z).]
• Consider z = (z1, z2)ᵀ ∼ N2(µ, Ψ^{-1}) with Cov(z1, z2) ≠ 0.
• Approximating p(z) by q(z) = q(z1)q(z2) yields

  q̃(z1) = N(z1|µ1, ψ11^{-1})   and   q̃(z2) = N(z2|µ2, ψ22^{-1}),

  where ψjj is the jth diagonal entry of the precision matrix Ψ, and by definition Cov(z1, z2) = 0 under q̃.
• This leads to underestimation of the variances (widely reported in the literature; see Zhao and Marriott 2013).
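
To see the size of the effect: the mean-field variance 1/ψ11 equals the conditional variance Var(z1|z2), which is smaller than the marginal variance whenever the correlation is non-zero. A quick check (with a correlation of 0.9, my own choice):

    import numpy as np

    rho = 0.9
    Sigma = np.array([[1.0, rho], [rho, 1.0]])   # true covariance of p(z)
    Psi = np.linalg.inv(Sigma)                   # precision matrix

    print(Sigma[0, 0])      # true marginal Var(z1) = 1.0
    print(1.0 / Psi[0, 0])  # mean-field Var(z1) = 1 - rho^2 = 0.19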
Quality of approximation

• Variational inference converges to a different optimum than ML, except for certain models (Gunawardana and Byrne 2005).
• But not much can be said about the quality of approximation.
• Statistical properties not well understood—what is its statistical profile
relative to the exact posterior?
• Speed trumps accuracy?

[Figure: a speed-accuracy spectrum. Fast and inaccurate: variational inference, Laplace. Slow and accurate: MCMC, expectation propagation.]
Advanced topics

• Local variational bounds
  – Do not use the mean-field assumption.
  – Instead, find a bound for the marginalising integral I.
  – Used for Bayesian logistic regression, for example:

    I = ∫ expit(xᵀβ) p(β) dβ ≥ ∫ f(xᵀβ, ξ) p(β) dβ,

    where f(·, ξ) is a tractable lower bound on the logistic function (see the sketch after this list).
• Stochastic variational inference
  – Uses ideas from stochastic optimisation: gradient-based improvement of the ELBO from subsamples of the data.
  – Scales to massive data.
• Black box variational inference
  – Goes beyond exponential families and model-specific derivations.
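
The slides do not name the bound f; one common choice (an assumption on my part) is the Jaakkola–Jordan quadratic bound on the logistic sigmoid, which is easy to verify numerically:

    import numpy as np
    from scipy.special import expit

    # Jaakkola-Jordan bound: one common choice for f(x, xi), not specified on the slides.
    def lam(xi):
        return np.tanh(xi / 2.0) / (4.0 * xi)

    def jj_bound(x, xi):
        """Quadratic-in-x lower bound on expit(x), with variational parameter xi."""
        return expit(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

    x = np.linspace(-6, 6, 13)
    for xi in (0.5, 2.0, 4.0):
        assert np.all(jj_bound(x, xi) <= expit(x) + 1e-12)   # valid lower bound for every xi
    print(expit(2.0), jj_bound(2.0, 2.0))                    # the bound is tight at x = +/- xi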

End

Thank you!

Slides and source code are made available at: http://socialstats.haziqj.ml


References I

Beal, M. J. and Z. Ghahramani (2003). “The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures”. In: Bayesian Statistics 7. Proceedings of the Seventh Valencia International Meeting. Ed. by J. M. Bernardo, A. P. Dawid, J. O. Berger, M. West, D. Heckerman, M. Bayarri, and A. F. Smith. Oxford: Oxford University Press, pp. 453–464.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D. M. (2017). “Variational Inference: Foundations and Innovations”.
URL:
https://simons.berkeley.edu/talks/david-blei-2017-5-1.
Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). “Variational
inference: A review for statisticians”. Journal of the American Statistical
Association, to appear.

References II
Erosheva, E. A., S. E. Fienberg, and C. Joutard (2007). “Describing
disability through individual-level mixture models for multivariate binary
data”. Annals of Applied Statistics, 1.2, p. 346.
Grimmer, J. (2010). “An introduction to Bayesian inference via variational
approximations”. Political Analysis 19.1, pp. 32–47.
Gunawardana, A. and W. Byrne (2005). “Convergence theorems for
generalized alternating minimization procedures”. Journal of Machine
Learning Research 6, pp. 2049–2073.
Kass, R. and A. Raftery (1995). “Bayes Factors”. Journal of the American
Statistical Association 90.430, pp. 773–795.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The
MIT Press.

References III
Wang, Y. S., R. Matsueda, and E. A. Erosheva (2017). “A variational EM
method for mixed membership models with multivariate rank data: An
analysis of public policy preferences”. arXiv: 1512.08731.
Zhao, H. and P. Marriott (2013). “Diagnostics for variational Bayes
approximations”. arXiv: 1309.5117.

4 Additional material
The variational principle
Laplace’s method
Solutions to Gaussian mixture

The variational principle


• The name derives from the calculus of variations, which deals with maximising or minimising functionals.

  Functions   p : θ ↦ R   (standard calculus)
  Functionals H : p ↦ R   (variational calculus)

• Using standard calculus, we can solve

  arg max_θ p(θ) =: θ̂,

  e.g. p is a likelihood function, and θ̂ is the ML estimate.
• Using variational calculus, we can solve

  arg max_p H(p) =: p̃,

  e.g. H is the entropy H(p) = −∫ p(x) log p(x) dx, and p̃ is the entropy-maximising distribution.
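
As a concrete functional-optimisation example (my own, not on the slides), here the entropy functional is maximised numerically over distributions on {1, . . . , 6} subject to a mean constraint; the solution has the exponential-family (Gibbs) form p_i ∝ exp(λ x_i):

    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(1, 7)                       # support {1, ..., 6} (a made-up example)

    def neg_entropy(p):
        return np.sum(p * np.log(p))          # -H(p)

    constraints = ({"type": "eq", "fun": lambda p: p.sum() - 1.0},   # sums to one
                   {"type": "eq", "fun": lambda p: p @ x - 4.5})     # E[X] = 4.5
    result = minimize(neg_entropy, x0=np.full(6, 1 / 6), bounds=[(1e-9, 1)] * 6,
                      constraints=constraints, method="SLSQP")
    print(np.round(result.x, 4))              # weights increase geometrically in x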
Laplace’s method

• Interested in p(f|y) ∝ p(y|f)p(f) =: e^{Q(f)}, with normalising constant p(y) = ∫ e^{Q(f)} df. The Taylor expansion of Q about its mode f̃,

  Q(f) ≈ Q(f̃) − (1/2)(f − f̃)ᵀ A (f − f̃),

  is recognised as the logarithm of an unnormalised Gaussian density, with A = −D²Q(f̃) being the negative Hessian of Q evaluated at f̃.
• The posterior p(f|y) is approximated by N(f̃, A^{-1}), and the marginal by

  p(y) ≈ (2π)^{n/2} |A|^{-1/2} p(y|f̃) p(f̃).

• Won't scale with large n; difficult to find modes in high dimensions.

R. Kass and A. Raftery (1995). “Bayes Factors”. Journal of the American Statistical Association 90.430, pp. 773–795, §4.1, pp. 777–778.
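
A minimal sketch of this recipe on a one-dimensional toy model of my own (Poisson counts with a Gaussian prior on the log-rate), so the dimension of f is 1 in the formula above; constants log(y_i!) are dropped throughout:

    import numpy as np
    from scipy.optimize import minimize

    y = np.array([3, 5, 4, 6, 2])             # toy counts: y_i ~ Poisson(exp(f)), f ~ N(0, 1)

    def negQ(f):                              # -Q(f) = -log p(y|f) - log p(f), up to constants
        f = float(f[0])
        return -(np.sum(y * f - np.exp(f)) - 0.5 * f ** 2)

    f_mode = minimize(negQ, x0=np.array([0.0])).x[0]      # mode f~
    A = len(y) * np.exp(f_mode) + 1.0                     # -Q''(f~), here a 1x1 "Hessian"

    posterior_approx = (f_mode, 1.0 / A)                  # p(f|y) ~= N(f~, 1/A)
    log_marginal = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A) - negQ(np.array([f_mode]))
    print(posterior_approx, log_marginal)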
Comparison of approximations (density)

[Figure: densities of the truth, the variational approximation and the Laplace approximation over z; the mode and the mean are marked on the z axis.]
Comparison of approximations (deviance)


[Figure: deviance (−2 × log-density) of the truth, the variational approximation and the Laplace approximation over z.]
Variational solutions to Gaussian mixture model


Variational E-step

  q̃(z) = ∏_{i=1}^{n} ∏_{k=1}^{K} rik^{zik},   rik = ρik / Σ_{k=1}^{K} ρik

  log ρik = E[log πk] + (1/2) E[log |Ψk|] − (d/2) log 2π − (1/2) E[(xi − µk)ᵀ Ψk (xi − µk)]

Variational M-step

  q̃(π1, . . . , πK) = DirK(π|α̃),   α̃k = α0k + Σ_{i=1}^{n} rik

  q̃(µ, Ψ) = ∏_{k=1}^{K} Nd( µk | m̃k, (κ̃k Ψk)^{-1} ) Wisd( Ψk | W̃k, ν̃k )
Variational solutions to Gaussian mixture model (cont.)


  κ̃k = κ0 + Σ_{i=1}^{n} rik

  m̃k = ( κ0 m0 + Σ_{i=1}^{n} rik xi ) / κ̃k

  W̃k^{-1} = W0^{-1} + Σ_{i=1}^{n} rik (xi − x̄k)(xi − x̄k)ᵀ,   x̄k = Σ_{i=1}^{n} rik xi / Σ_{i=1}^{n} rik

  ν̃k = ν0 + Σ_{i=1}^{n} rik

Also useful

  E[(xi − µk)ᵀ Ψk (xi − µk)] = d/κ̃k + ν̃k (xi − m̃k)ᵀ W̃k (xi − m̃k)

  E[log |Ψk|] = Σ_{i=1}^{d} ψ( (ν̃k + 1 − i)/2 ) + d log 2 + log |W̃k|

  E[log πk] = ψ(α̃k) − ψ( Σ_{k=1}^{K} α̃k ),   where ψ(·) is the digamma function
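
The expectations above are all that is needed to code the responsibility update; a sketch (the function name and argument layout are my own):

    import numpy as np
    from scipy.special import digamma

    def responsibilities(X, alpha, m, kappa, W, nu):
        """r_ik from the current q(pi, mu, Psi), following the expectations listed above.
        Shapes (a sketch): X (n, d), alpha (K,), m (K, d), kappa (K,), W (K, d, d), nu (K,)."""
        n, d = X.shape
        K = len(alpha)
        E_log_pi = digamma(alpha) - digamma(alpha.sum())
        log_rho = np.empty((n, K))
        for k in range(K):
            E_logdet = (digamma((nu[k] + 1 - np.arange(1, d + 1)) / 2).sum()
                        + d * np.log(2) + np.linalg.slogdet(W[k])[1])
            diff = X - m[k]
            quad = d / kappa[k] + nu[k] * np.einsum("ij,jl,il->i", diff, W[k], diff)
            log_rho[:, k] = E_log_pi[k] + 0.5 * E_logdet - 0.5 * d * np.log(2 * np.pi) - 0.5 * quad
        log_rho -= log_rho.max(axis=1, keepdims=True)      # normalise in log space for stability
        r = np.exp(log_rho)
        return r / r.sum(axis=1, keepdims=True)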