06 Gaussian Distributions

Probabilistic Inference and Learning

Lecture 06
Gaussian Probability Distributions

Philipp Hennig
04 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   Date    Content                          Ex
1   20.04.  Introduction                     1
2   21.04.  Reasoning under Uncertainty
3   27.04.  Continuous Variables             2
4   28.04.  Monte Carlo
5   04.05.  Markov Chain Monte Carlo         3
6   05.05.  Gaussian Distributions
7   11.05.  Parametric Regression            4
8   12.05.  Understanding Deep Learning
9   18.05.  Gaussian Processes               5
10  19.05.  An Example for GP Regression
11  25.05.  Understanding Kernels            6
12  26.05.  Gauss-Markov Models
13  08.06.  GP Classification                7
14  09.06.  Logistic Regression              8
15  15.06.  Exponential Families
16  16.06.  Graphical Models                 9
17  22.06.  Factor Graphs
18  23.06.  The Sum-Product Algorithm        10
19  29.06.  Example: Topic Models
20  30.06.  Mixture Models                   11
21  06.07.  EM
22  07.07.  Variational Inference            12
23  13.07.  Example: Topic Models
24  14.07.  Example: Inferring Topics        13
25  20.07.  Example: Kernel Topic Models
26  21.07.  Revision

The (univariate) Gaussian distribution
an exponentiated square



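$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) =: \mathcal{N}(x; \mu, \sigma^2)$$

▶ $\mu$: the mean of x
▶ $\sigma^2$: the variance of x
▶ $\sigma$: the standard deviation of x

[Figure: plot of p(x) over x ∈ [0, 6], with µ − σ, µ, and µ + σ marked on the x-axis.]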
Univariate Gaussians
some observations and notations, conventions


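$$\mathcal{N}(x; \mu, \sigma^2) := \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \quad \text{with } \mu \in \mathbb{R},\ \sigma > 0$$

will be called the Gaussian or normal distribution of x. We call x the argument or variable, and $\mu, \sigma^2$ the parameters. We write $x \sim \mathcal{N}(\mu, \sigma^2)$ to say that the variable x is distributed with pdf $\mathcal{N}(x; \mu, \sigma^2)$.
▶ $\int \mathcal{N}(x; \mu, \sigma^2)\,dx = 1$ and $\mathcal{N}(x; \mu, \sigma^2) > 0\ \forall x \in \mathbb{R}$. So $\mathcal{N}$ is the density of a probability measure.
▶ Symmetry in x and µ: $\mathcal{N}(x; \mu, \sigma^2) = \mathcal{N}(\mu; x, \sigma^2)$
▶ An exponential of a quadratic polynomial in x, with natural parameters $\eta, \tau$ and normalization constant $a$:
$$\mathcal{N}(x; \mu, \sigma^2) = \exp\left(a + \eta x - \tfrac{1}{2}\tau x^2\right) \quad \text{with } \tau = \sigma^{-2} \text{ (“precision”)},\ \eta = \sigma^{-2}\mu,$$
$$a = -\tfrac{1}{2}\left(\log(2\pi) - \log\tau + \frac{\eta^2}{\tau}\right)$$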
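The equivalence of the standard and the natural-parameter form can be checked numerically. The following is a minimal sketch, assuming the convention in which the quadratic term carries a factor 1/2 (matching the multivariate form later in the lecture); the function names and the numeric values are illustrative, not part of the lecture.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density N(x; mu, sigma^2) in its standard form."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_pdf_natural(x, eta, tau):
    """Same density written as exp(a + eta*x - tau*x^2/2)."""
    # a is fixed by normalization: a = -(log(2*pi) - log(tau) + eta^2/tau) / 2
    a = -0.5 * (math.log(2.0 * math.pi) - math.log(tau) + eta ** 2 / tau)
    return math.exp(a + eta * x - 0.5 * tau * x ** 2)

mu, sigma2 = 1.5, 0.8      # illustrative values (assumption)
tau = 1.0 / sigma2         # precision
eta = mu / sigma2          # precision-weighted mean
p_std = gaussian_pdf(0.3, mu, sigma2)
p_nat = gaussian_pdf_natural(0.3, eta, tau)
```

Both evaluations should agree up to floating-point error, and swapping `x` and `mu` in `gaussian_pdf` leaves the value unchanged, which is the symmetry property listed above.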
Gaussian Inference
The Gaussian is its own conjugate prior.

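$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \mathcal{N}(y; x, \nu^2)$$

$$p(x \mid y) = \frac{p(x)\,p(y \mid x)}{\int p(x)\,p(y \mid x)\,dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$

$$s^2 := \frac{1}{\sigma^{-2} + \nu^{-2}}, \qquad m := \frac{\sigma^{-2}\mu + \nu^{-2} y}{\sigma^{-2} + \nu^{-2}}$$

[Figure: the prior p(x), the likelihood p(y | x), and the posterior p(x | y), plotted over x ∈ [0, 6].]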
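The conjugate update above amounts to adding precisions and averaging means with precision weights. A minimal sketch, with hypothetical function names and illustrative numbers:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_1d(mu, sigma2, y, nu2):
    """Posterior N(x; m, s2) for prior N(x; mu, sigma2) and likelihood N(y; x, nu2)."""
    precision = 1.0 / sigma2 + 1.0 / nu2     # precisions add
    s2 = 1.0 / precision
    m = s2 * (mu / sigma2 + y / nu2)         # precision-weighted average of mu and y
    return m, s2

# Illustrative values (assumption): broad prior, one sharper observation.
m, s2 = posterior_1d(mu=1.0, sigma2=4.0, y=3.0, nu2=1.0)

def evidence_ratio(x):
    """prior * likelihood / posterior; constant in x if the posterior is correct."""
    return gaussian_pdf(x, 1.0, 4.0) * gaussian_pdf(3.0, x, 1.0) / gaussian_pdf(x, m, s2)
```

The ratio `evidence_ratio(x)` does not depend on `x`; its value is the normalization constant $\int p(x)p(y \mid x)\,dx$.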

Gaussian Inference
Least-Squares Estimation

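$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \prod_{i=1}^{n} \mathcal{N}(y_i; x, \nu_i^2)$$

$$p(x \mid y) = \frac{p(x)\,p(y \mid x)}{\int p(x)\,p(y \mid x)\,dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$

$$s^{-2} := \sigma^{-2} + \sum_{i=1}^{n} \nu_i^{-2}, \qquad s^{-2} m := \sigma^{-2}\mu + \sum_{i=1}^{n} \nu_i^{-2} y_i$$

[Figure: densities plotted over x ∈ [−1, 4].]

If $\sigma^{-2} \to 0$ and $\nu_i = \nu\ \forall i$, then m is the arithmetic mean of the $y_i$.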
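The limit statement can be illustrated numerically: with a nearly flat prior and equal noise variances, the posterior mean approaches the arithmetic mean of the observations. This is a minimal sketch; the function name and the data are made up for illustration.

```python
def posterior_many(mu, sigma2, ys, nu2s):
    """Posterior (m, s2) for prior N(x; mu, sigma2) and observations y_i ~ N(x, nu2_i)."""
    precision = 1.0 / sigma2 + sum(1.0 / v for v in nu2s)
    m = (mu / sigma2 + sum(y / v for y, v in zip(ys, nu2s))) / precision
    return m, 1.0 / precision

ys = [2.1, 1.9, 2.6, 2.2]                      # illustrative measurements (assumption)
# A very large prior variance approximates the limit sigma^{-2} -> 0.
m_flat, s2_flat = posterior_many(0.0, 1e12, ys, [0.5] * len(ys))
arithmetic_mean = sum(ys) / len(ys)
```

In this limit the posterior variance approaches $\nu^2 / n$, the familiar variance of the sample mean.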

The Method of Least Squares
The Gaussian distribution is the unique choice of error distribution for which the most probable value of x is the arithmetic mean of the measurements. [image: C.A. Jensen, 1840]

The Multivariate Gaussian distribution
An exponentiated quadratic form

Definition (multivariate Gaussian distribution)

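$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right), \qquad x, \mu \in \mathbb{R}^n,\ \Sigma \in \mathbb{R}^{n\times n} \text{ spd.}$$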

Σ must be symmetric positive definite.

Definition (symmetric positive definite matrix)

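A matrix $A \in \mathbb{R}^{n\times n}$ is called symmetric positive semi-definite if $A = A^\top$ and

$$v^\top A v \geq 0 \quad \forall v \in \mathbb{R}^n.$$

It is called symmetric positive definite if the inequality is strict for all $v \neq 0$.

Equivalent statement: all eigenvalues of the symmetric matrix A are non-negative (strictly positive in the definite case).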
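In practice, positive definiteness is often tested by attempting a Cholesky factorization, which succeeds only for symmetric positive definite matrices. The following is a minimal sketch using NumPy; the helper name `is_spd` and the tolerance are assumptions made for illustration.

```python
import numpy as np

def is_spd(A, tol=1e-10):
    """Return True if A is (numerically) symmetric positive definite."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        return False
    if not np.allclose(A, A.T, atol=tol):      # symmetry check
        return False
    try:
        np.linalg.cholesky(A)                  # raises LinAlgError if not positive definite
        return True
    except np.linalg.LinAlgError:
        return False

good = np.array([[2.0, 1.0], [1.0, 2.0]])      # eigenvalues 1 and 3
bad = np.array([[1.0, 2.0], [2.0, 1.0]])       # eigenvalues -1 and 3
eig_good = np.linalg.eigvalsh(good)            # eigenvalues of a symmetric matrix
eig_bad = np.linalg.eigvalsh(bad)
```

The eigenvalue route (`np.linalg.eigvalsh`) mirrors the equivalent statement in the definition; the Cholesky route is usually cheaper.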

The Multivariate Gaussian distribution
Equiprobability lines are ellipsoids

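$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right), \qquad x, \mu \in \mathbb{R}^n,\ \Sigma \in \mathbb{R}^{n\times n} \text{ spd.}$$

▶ $\int \mathcal{N}(x; \mu, \Sigma)\,dx = 1$ and $\mathcal{N}(x; \mu, \Sigma) > 0\ \forall x \in \mathbb{R}^n$.
▶ Symmetry in x and µ: $\mathcal{N}(x; \mu, \Sigma) = \mathcal{N}(\mu; x, \Sigma)$
▶ An exponential of a quadratic polynomial:
$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \tfrac{1}{2} x^\top \Lambda x\right) = \exp\left(a + \eta^\top x - \tfrac{1}{2} \operatorname{tr}(x x^\top \Lambda)\right)$$
with the natural parameters $\Lambda = \Sigma^{-1}$ (precision matrix), $\eta = \Lambda\mu$, and the sufficient statistics $x, x x^\top$.

[Figure: elliptical equiprobability contours of a bivariate Gaussian, centered at (µ1, µ2).]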
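To make the two parametrizations concrete, the sketch below evaluates the log-density once from $(\mu, \Sigma)$ via a Cholesky factor and once from the natural parameters $(\eta, \Lambda)$. The function names and test values are illustrative assumptions; the constant $a$ is obtained by matching terms with the standard form.

```python
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma), computed with a Cholesky factor Sigma = L L^T."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, x - mu)                 # z^T z = (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |Sigma|
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + z @ z)

def gauss_logpdf_natural(x, eta, Lam):
    """Same log-density as a + eta^T x - tr(x x^T Lam) / 2."""
    n = eta.shape[0]
    _, logdet_Lam = np.linalg.slogdet(Lam)
    # a = -(n log(2 pi) - log|Lam| + eta^T Lam^{-1} eta) / 2
    a = -0.5 * (n * np.log(2.0 * np.pi) - logdet_Lam + eta @ np.linalg.solve(Lam, eta))
    return a + eta @ x - 0.5 * np.trace(np.outer(x, x) @ Lam)

mu = np.array([1.0, -1.0])                         # illustrative values (assumption)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.2])
Lam = np.linalg.inv(Sigma)                         # precision matrix
eta = Lam @ mu
lp_std = gauss_logpdf(x, mu, Sigma)
lp_nat = gauss_logpdf_natural(x, eta, Lam)
```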
Products of Gaussians are Gaussians
Closure under Multiplication

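To multiply Gaussians, add the natural parameters:

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,Z$$

$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b), \qquad Z = \mathcal{N}(a; b, A + B)$$

Note the similarity to the univariate case.

[Figure: contour plot illustrating the product of two bivariate Gaussians.]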
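The product rule above can be verified pointwise. This is a minimal sketch with hypothetical helper names and illustrative parameters; it uses explicit inverses for readability rather than numerical robustness.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Density N(x; mu, Sigma) for symmetric positive definite Sigma."""
    n = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gauss_product(a, A, b, B):
    """Return (c, C, Z) with N(x; a, A) N(x; b, B) = N(x; c, C) Z."""
    A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(A_inv + B_inv)           # precisions add
    c = C @ (A_inv @ a + B_inv @ b)            # precision-weighted means add
    Z = gauss_pdf(a, b, A + B)                 # normalization constant
    return c, C, Z

a = np.array([0.0, 0.0])                       # illustrative values (assumption)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([2.0, 1.0])
B = np.array([[1.0, -0.3], [-0.3, 1.5]])
c, C, Z = gauss_product(a, A, b, B)

x = np.array([0.7, 0.4])
lhs = gauss_pdf(x, a, A) * gauss_pdf(x, b, B)
rhs = gauss_pdf(x, c, C) * Z
```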

Linear Projections of Gaussians are Gaussians
Closure under linear maps

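To linearly project a Gaussian variable, project the parameters:

$$p(z) = \mathcal{N}(z; \mu, \Sigma) \quad\Rightarrow\quad p(Az) = \mathcal{N}(Az; A\mu, A\Sigma A^\top)$$

[Figure: contour plot of a bivariate Gaussian and a linear projection of it.]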
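A quick Monte Carlo check of the projection rule: sample z, apply A, and compare the empirical moments of Az with the projected parameters. This is a sketch with illustrative values; the sample size, seed, and tolerance used for comparison are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])                # illustrative values (assumption)
Sigma = np.array([[2.0, 0.6, 0.0],
                  [0.6, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])

# Projected parameters according to the rule above.
proj_mean = A @ mu
proj_cov = A @ Sigma @ A.T

# Empirical moments of the projected samples.
z = rng.multivariate_normal(mu, Sigma, size=200_000)
Az = z @ A.T
emp_mean = Az.mean(axis=0)
emp_cov = np.cov(Az, rowvar=False)
```

The empirical mean and covariance should match `proj_mean` and `proj_cov` up to sampling error, which shrinks at the usual $1/\sqrt{N}$ rate.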

Marginals of Gaussians are Gaussians
Closure under marginalization

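$$p(z) = \mathcal{N}(z; \mu, \Sigma) \quad\Rightarrow\quad p(Az) = \mathcal{N}(Az; A\mu, A\Sigma A^\top)$$

Choose $A = \begin{pmatrix} I & 0 \end{pmatrix}$:

$$\int \mathcal{N}\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})$$

▶ This is the sum rule:
$$\int p(x, y)\,dy = \int p(y \mid x)\,p(x)\,dy = p(x)$$
▶ So every finite-dimensional Gaussian is a marginal of infinitely many more.

[Figure: contour plot of a bivariate Gaussian with one of its marginals.]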
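Marginalization is the special case of a projection whose matrix selects coordinates. A minimal sketch, where the helper `marginal` and the numbers are illustrative assumptions:

```python
import numpy as np

def marginal(mu, Sigma, keep):
    """Marginal parameters for the coordinates in `keep`, via a selector matrix A."""
    n = len(mu)
    A = np.eye(n)[keep]                        # rows of the identity pick the kept coordinates
    return A @ mu, A @ Sigma @ A.T

mu = np.array([1.0, -2.0, 0.5])                # illustrative values (assumption)
Sigma = np.array([[2.0, 0.6, 0.0],
                  [0.6, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
m_x, S_xx = marginal(mu, Sigma, keep=[0, 1])
```

The result is just the corresponding sub-vector of the mean and sub-block of the covariance, so in code one would normally index directly instead of forming A.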

Cuts through Gaussians are Gaussians
Closure under conditioning

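$$p(x \mid Ax = y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\left(x;\ \mu + \Sigma A^\top (A \Sigma A^\top)^{-1}(y - A\mu),\ \Sigma - \Sigma A^\top (A \Sigma A^\top)^{-1} A \Sigma\right)$$

▶ This is the product rule.
▶ So Gaussians are closed under the rules of probability.

[Figure: contour plot of a bivariate Gaussian with a cut through it (a conditional).]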
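The conditioning formula can be sketched directly in NumPy. Below, A selects the third coordinate, so the result should agree with the familiar block formula for conditionals. The function name and numbers are illustrative assumptions.

```python
import numpy as np

def condition_on_linear(mu, Sigma, A, y):
    """Parameters (m, S) of p(x | Ax = y) for p(x) = N(x; mu, Sigma)."""
    G = A @ Sigma @ A.T                        # covariance of Ax
    K = np.linalg.solve(G, A @ Sigma).T        # Sigma A^T G^{-1} (G and Sigma are symmetric)
    m = mu + K @ (y - A @ mu)
    S = Sigma - K @ A @ Sigma
    return m, S

mu = np.array([1.0, -2.0, 0.5])                # illustrative values (assumption)
Sigma = np.array([[2.0, 0.6, 0.0],
                  [0.6, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[0.0, 0.0, 1.0]])                # observe the third coordinate exactly
y = np.array([2.0])
m, S = condition_on_linear(mu, Sigma, A, y)

# Block formula for comparison: mu_x + Sigma_xy Sigma_yy^{-1} (y - mu_y), etc.
m_block = mu[:2] + Sigma[:2, 2] / Sigma[2, 2] * (y[0] - mu[2])
S_block = Sigma[:2, :2] - np.outer(Sigma[:2, 2], Sigma[2, :2]) / Sigma[2, 2]
```

Note that `S` is singular in the observed direction: the conditioned variable has no remaining uncertainty about Ax.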

Inference with Gaussians
Since conditioning and marginalization are mapped to linear algebra, so is Bayes’ Theorem


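If $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda)$, then

$$p(y) = \mathcal{N}(y; A\mu + b, \Lambda + A \Sigma A^\top)$$

and

$$p(x \mid y) = \mathcal{N}\Big(x;\ \mu + \underbrace{\Sigma A^\top (A \Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\ \underbrace{(y - (A\mu + b))}_{\text{residual}},\ \Sigma - \Sigma A^\top \big(\underbrace{A \Sigma A^\top + \Lambda}_{\text{Gram matrix}}\big)^{-1} A \Sigma\Big)$$

$$= \mathcal{N}\Big(x;\ \big(\underbrace{\Sigma^{-1} + A^\top \Lambda^{-1} A}_{\text{precision matrix}}\big)^{-1} \big(A^\top \Lambda^{-1}(y - b) + \Sigma^{-1}\mu\big),\ \big(\underbrace{\Sigma^{-1} + A^\top \Lambda^{-1} A}_{\text{precision matrix}}\big)^{-1}\Big)$$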
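The two expressions for the posterior (one built on the Gram matrix, one on the precision matrix) can be compared numerically. The sketch below uses hypothetical function names and made-up parameters; it favors clarity over numerical efficiency.

```python
import numpy as np

def posterior_gram_form(mu, Sigma, A, b, Lam, y):
    """Posterior via gain and Gram matrix: inverts a matrix of the size of y."""
    G = A @ Sigma @ A.T + Lam                  # Gram matrix
    K = np.linalg.solve(G, A @ Sigma).T        # gain Sigma A^T G^{-1}
    m = mu + K @ (y - (A @ mu + b))            # mean update with the residual
    S = Sigma - K @ A @ Sigma
    return m, S

def posterior_precision_form(mu, Sigma, A, b, Lam, y):
    """Posterior via the precision matrix: inverts a matrix of the size of x."""
    Sigma_inv = np.linalg.inv(Sigma)
    Lam_inv = np.linalg.inv(Lam)
    P = Sigma_inv + A.T @ Lam_inv @ A          # posterior precision
    S = np.linalg.inv(P)
    m = S @ (A.T @ Lam_inv @ (y - b) + Sigma_inv @ mu)
    return m, S

mu = np.array([0.5, -1.0])                     # illustrative values (assumption)
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
b = np.array([0.1, 0.0, -0.2])
Lam = np.diag([0.5, 0.3, 0.4])                 # observation noise covariance
y = np.array([0.7, 0.2, -1.5])

m1, S1 = posterior_gram_form(mu, Sigma, A, b, Lam, y)
m2, S2 = posterior_precision_form(mu, Sigma, A, b, Lam, y)
```

Which form is cheaper depends on the dimensions: the first inverts a matrix of the size of y, the second one of the size of x.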

The Core Insight for All of This
Gaussian inference is linear algebra at its core [image: Konrad Jacobs]

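$$A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad M := (S - R P^{-1} Q)^{-1}$$

$$A^{-1} = \begin{pmatrix} P^{-1} + P^{-1} Q M R P^{-1} & -P^{-1} Q M \\ -M R P^{-1} & M \end{pmatrix}$$

$$(Z + U W V^\top)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1}$$

$$|Z + U W V^\top| = |Z| \cdot |W| \cdot |W^{-1} + V^\top Z^{-1} U|$$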

Issai Schur (1875–1941)
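These identities (block inverse via the Schur complement, the matrix inversion lemma, and the matrix determinant lemma) are easy to check on random matrices. In the sketch below, all names and the random test matrices are assumptions; V = U is chosen only so that every matrix involved is guaranteed to be invertible.

```python
import numpy as np

def block_inverse(P, Q, R, S):
    """Inverse of [[P, Q], [R, S]] using M = (S - R P^{-1} Q)^{-1}."""
    P_inv = np.linalg.inv(P)
    M = np.linalg.inv(S - R @ P_inv @ Q)       # inverse of the Schur complement of P
    return np.block([[P_inv + P_inv @ Q @ M @ R @ P_inv, -P_inv @ Q @ M],
                     [-M @ R @ P_inv, M]])

def woodbury_inverse(Z, U, W, V):
    """(Z + U W V^T)^{-1} via the matrix inversion lemma."""
    Z_inv = np.linalg.inv(Z)
    inner = np.linalg.inv(W) + V.T @ Z_inv @ U
    return Z_inv - Z_inv @ U @ np.linalg.solve(inner, V.T @ Z_inv)

def det_lemma(Z, U, W, V):
    """|Z + U W V^T| via the matrix determinant lemma."""
    inner = np.linalg.inv(W) + V.T @ np.linalg.solve(Z, U)
    return np.linalg.det(Z) * np.linalg.det(W) * np.linalg.det(inner)

rng = np.random.default_rng(0)
n, k = 5, 2
F = rng.standard_normal((n, n))
Z = F @ F.T + n * np.eye(n)                    # random spd matrix
U = rng.standard_normal((n, k))
V = U                                          # keeps Z + U W V^T spd, hence invertible
G = rng.standard_normal((k, k))
W = G @ G.T + k * np.eye(k)                    # random spd matrix

# Partition Z into blocks to test the block inverse.
P, Q, R, S = Z[:3, :3], Z[:3, 3:], Z[3:, :3], Z[3:, 3:]
```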

Gaussians provide the linear algebra of inference
if all joints are Gaussian and all observations are linear, all posteriors are Gaussian

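▶ Products of Gaussians are Gaussians:
$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B), \qquad C := (A^{-1} + B^{-1})^{-1}, \quad c := C(A^{-1}a + B^{-1}b)$$

▶ Linear projections of Gaussians are Gaussians:
$$p(z) = \mathcal{N}(z; \mu, \Sigma) \quad\Rightarrow\quad p(Az) = \mathcal{N}(Az; A\mu, A\Sigma A^\top)$$

▶ Marginals of Gaussians are Gaussians:
$$\int \mathcal{N}\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})$$

▶ (Linear) conditionals of Gaussians are Gaussians:
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\left(x;\ \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$$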

Bayesian inference becomes linear algebra

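If $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; A^\top x + b, \Lambda)$, then

$$p(B^\top x + c \mid y) = \mathcal{N}\left[B^\top x + c;\ B^\top\mu + c + B^\top \Sigma A (A^\top \Sigma A + \Lambda)^{-1}(y - A^\top\mu - b),\ B^\top \Sigma B - B^\top \Sigma A (A^\top \Sigma A + \Lambda)^{-1} A^\top \Sigma B\right]$$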

Example 1: Conditional Independence, Marginal Correlation
A zero in the precision matrix means independence conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]

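[Figure: directed graph with edges x2 → x1 and x2 → x3.]

$$x_2 = \nu_2, \qquad p(\nu_2) = \mathcal{N}(\nu_2; \mu_2, \sigma_2^2)$$
$$x_1 = w_1 x_2 + \nu_1, \qquad p(\nu_1) = \mathcal{N}(\nu_1; \mu_1, \sigma_1^2)$$
$$x_3 = w_3 x_2 + \nu_3, \qquad p(\nu_3) = \mathcal{N}(\nu_3; \mu_3, \sigma_3^2)$$

To simplify exposition, set µ = 0.

$$p(x_1, x_2, x_3) = p(x_2) \cdot p(x_1 \mid x_2) \cdot p(x_3 \mid x_2)$$

$$= \frac{1}{Z_1 Z_2 Z_3} \exp\left(-\frac{1}{2}\left[\frac{x_2^2}{\sigma_2^2} + \frac{(x_1 - w_1 x_2)^2}{\sigma_1^2} + \frac{(x_3 - w_3 x_2)^2}{\sigma_3^2}\right]\right)$$

$$= \frac{1}{Z_1 Z_2 Z_3} \exp\left(-\frac{1}{2}\left[x_2^2\left(\frac{1}{\sigma_2^2} + \frac{w_1^2}{\sigma_1^2} + \frac{w_3^2}{\sigma_3^2}\right) + x_1^2\frac{1}{\sigma_1^2} - 2 x_1 x_2 \frac{w_1}{\sigma_1^2} + x_3^2\frac{1}{\sigma_3^2} - 2 x_3 x_2 \frac{w_3}{\sigma_3^2}\right]\right)$$

$$= \frac{1}{Z_1 Z_2 Z_3} \exp\left(-\frac{1}{2} \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \begin{pmatrix} \frac{1}{\sigma_1^2} & -\frac{w_1}{\sigma_1^2} & 0 \\ -\frac{w_1}{\sigma_1^2} & \frac{1}{\sigma_2^2} + \frac{w_1^2}{\sigma_1^2} + \frac{w_3^2}{\sigma_3^2} & -\frac{w_3}{\sigma_3^2} \\ 0 & -\frac{w_3}{\sigma_3^2} & \frac{1}{\sigma_3^2} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}\right)$$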
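The example can be reproduced numerically: the precision matrix has a zero in the (x1, x3) position, while the covariance, built directly from the generative model, does not. The weights and variances below are made-up illustrative values.

```python
import numpy as np

w1, w3 = 0.8, -1.5                             # illustrative weights (assumption)
s1, s2, s3 = 0.5, 2.0, 1.0                     # noise variances sigma_1^2, sigma_2^2, sigma_3^2

# Precision matrix read off the quadratic form, ordering (x1, x2, x3).
precision = np.array([
    [1 / s1,   -w1 / s1,                             0.0],
    [-w1 / s1, 1 / s2 + w1 ** 2 / s1 + w3 ** 2 / s3, -w3 / s3],
    [0.0,      -w3 / s3,                             1 / s3],
])

# Covariance from the generative model: x = B nu with nu = (nu1, nu2, nu3), mean zero.
B = np.array([[1.0, w1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, w3, 1.0]])
covariance = B @ np.diag([s1, s2, s3]) @ B.T

cov_13 = covariance[0, 2]                      # marginal covariance of x1 and x3
```

Here `cov_13` equals `w1 * w3 * s2`, which is nonzero whenever both weights are: x1 and x3 are marginally correlated through x2, yet independent once x2 is known.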
Probabilistic ML — P. Hennig, SS 2021 — Lecture 06: Gaussian Probability Distributions— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 18
Example 2: Explaining away
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
▶ Gaussian distributions provide the linear algebra of inference.
▶ products of Gaussians are Gaussians
▶ linear maps of Gaussian variables are Gaussian variables
▶ marginals of Gaussians are Gaussians
▶ linear conditionals of Gaussians are Gaussians
If all variables in a generative model are linearly related, and the distributions of the parent variables are Gaussian, then all conditionals, joints and marginals are Gaussian, with means and covariances computable by linear algebra operations.
▶ A zero off-diagonal element in the covariance matrix implies independence if all other variables are integrated out:
$$[\Sigma]_{ij} = 0 \quad\Rightarrow\quad p(x_i, x_j) = \mathcal{N}(x_i; [\mu]_i, [\Sigma]_{ii}) \cdot \mathcal{N}(x_j; [\mu]_j, [\Sigma]_{jj})$$
▶ A zero off-diagonal element in the precision matrix implies independence conditional on all other variables:
$$[\Sigma^{-1}]_{ij} = 0 \quad\Rightarrow\quad p(x_i, x_j \mid x_{\neq i,j}) = \mathcal{N}(x_i \mid x_{\neq i,j}) \cdot \mathcal{N}(x_j \mid x_{\neq i,j})$$

The Toolbox

$$\int p(x_1, x_2)\,dx_2 = p(x_1), \qquad p(x_1, x_2) = p(x_1 \mid x_2)\,p(x_2), \qquad p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference

