06 Gaussian Distributions
Lecture 06
Gaussian Probability Distributions
Philipp Hennig
04 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 # | date   | content                      | Ex ||  # | date   | content                      | Ex
 1 | 20.04. | Introduction                 |  1 || 14 | 09.06. | Logistic Regression          |  8
 2 | 21.04. | Reasoning under Uncertainty  |    || 15 | 15.06. | Exponential Families         |
 3 | 27.04. | Continuous Variables         |  2 || 16 | 16.06. | Graphical Models             |  9
 4 | 28.04. | Monte Carlo                  |    || 17 | 22.06. | Factor Graphs                |
 5 | 04.05. | Markov Chain Monte Carlo     |  3 || 18 | 23.06. | The Sum-Product Algorithm    | 10
 6 | 05.05. | Gaussian Distributions       |    || 19 | 29.06. | Example: Topic Models        |
 7 | 11.05. | Parametric Regression        |  4 || 20 | 30.06. | Mixture Models               | 11
 8 | 12.05. | Understanding Deep Learning  |    || 21 | 06.07. | EM                           |
 9 | 18.05. | Gaussian Processes           |  5 || 22 | 07.07. | Variational Inference        | 12
10 | 19.05. | An Example for GP Regression |    || 23 | 13.07. | Example: Topic Models        |
11 | 25.05. | Understanding Kernels        |  6 || 24 | 14.07. | Example: Inferring Topics    | 13
12 | 26.05. | Gauss-Markov Models          |    || 25 | 20.07. | Example: Kernel Topic Models |
13 | 08.06. | GP Classification            |  7 || 26 | 21.07. | Revision                     |
The (univariate) Gaussian distribution
an exponentiated square
[Figure: the bell curve of p(x) = N(x; µ, σ²), with µ − σ, µ, and µ + σ marked on the x-axis]

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} =: \mathcal{N}(x; \mu, \sigma^2)$$

µ: the mean of x
σ²: the variance of x
σ: the standard deviation of x
Univariate Gaussians
some observations, notation, and conventions
Definition
$$\mathcal{N}(x; \mu, \sigma^2) := \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \text{with } \mu, \sigma \in \mathbb{R},$$
will be called the Gaussian or normal distribution of x. We call x the argument or variable, µ, σ 2 the
parameters. We write x ∼ N (µ, σ 2 ) to say that the variable x is distributed with pdf N (x; µ, σ 2 ).
▶ ∫ N(x; µ, σ²) dx = 1 and N(x; µ, σ²) > 0 ∀x ∈ ℝ. So N is the density of a probability measure.
▶ Symmetry in x and µ: N(x; µ, σ²) = N(µ; x, σ²)
▶ An exponentiated quadratic polynomial in x, with natural parameters (η, τ) and log-normalization constant a:
$$\mathcal{N}(x; \mu, \sigma^2) = \exp\left(a + \eta x - \tfrac{1}{2}\tau x^2\right) \quad \text{with } \tau = \sigma^{-2} \text{ (“precision”)},\ \eta = \sigma^{-2}\mu,$$
$$a = -\tfrac{1}{2}\left(\log(2\pi) - \log\tau + \eta^2/\tau\right)$$
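Not part of the original slides: a small Python sanity check (with arbitrary numbers) that the (η, τ) parameterization above reproduces the usual density, using scipy.stats.norm as reference.

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.7                 # moment parameters
tau, eta = 1 / sigma2, mu / sigma2    # precision and linear natural parameter
a = -0.5 * (np.log(2 * np.pi) - np.log(tau) + eta**2 / tau)  # log-normalizer

x = np.linspace(-2, 5, 7)
pdf_natural = np.exp(a + eta * x - 0.5 * tau * x**2)
pdf_moment = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
assert np.allclose(pdf_natural, pdf_moment)
```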
Gaussian Inference
The Gaussian is its own conjugate prior.
[Figure: prior p(x), likelihood p(y | x), and posterior p(x | y) as curves over x]

Let
$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \mathcal{N}(y; x, \nu^2).$$
Then
$$p(x \mid y) = \frac{p(x)\, p(y \mid x)}{\int p(x)\, p(y \mid x)\, dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$
$$s^2 := \frac{1}{\sigma^{-2} + \nu^{-2}}, \qquad m := \frac{\sigma^{-2}\mu + \nu^{-2} y}{\sigma^{-2} + \nu^{-2}}.$$
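A minimal numeric sketch of this update (mine, not the lecture's code): fuse a Gaussian prior with one Gaussian observation in precision form.

```python
import numpy as np

def gaussian_posterior(mu, sigma2, y, nu2):
    """Posterior N(m, s2) of x ~ N(mu, sigma2) after observing y ~ N(x, nu2)."""
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / nu2)   # precisions add
    m = s2 * (mu / sigma2 + y / nu2)        # precision-weighted mean
    return m, s2

m, s2 = gaussian_posterior(mu=2.0, sigma2=1.0, y=3.0, nu2=0.5)
print(m, s2)  # the posterior mean shrinks toward the more precise source
```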
Gaussian Inference
Least-Squares Estimation
[Figure: prior p(x) and the increasingly concentrated posterior as observations arrive]

$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \prod_{i=1}^{N} \mathcal{N}(y_i; x, \nu_i^2)$$
$$p(x \mid y) = \frac{p(x)\, p(y \mid x)}{\int p(x)\, p(y \mid x)\, dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$
$$s^{-2} := \sigma^{-2} + \sum_{i=1}^{N} \nu_i^{-2}, \qquad s^{-2} m := \sigma^{-2}\mu + \sum_{i=1}^{N} \nu_i^{-2} y_i.$$
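Extending the sketch above to N observations (again an illustration with made-up data): the posterior precision is the sum of all precisions, and the same answer is obtained by applying the single-observation update sequentially.

```python
import numpy as np

def gaussian_posterior_n(mu, sigma2, y, nu2):
    """Posterior after observing y_i ~ N(x, nu2_i) for arrays y, nu2."""
    s2 = 1.0 / (1.0 / sigma2 + np.sum(1.0 / nu2))
    m = s2 * (mu / sigma2 + np.sum(y / nu2))
    return m, s2

y = np.array([2.9, 3.3, 3.1])
nu2 = np.array([0.5, 0.8, 0.3])
m, s2 = gaussian_posterior_n(0.0, 10.0, y, nu2)

# sequential single-observation updates give the same result
m_seq, s2_seq = 0.0, 10.0
for yi, ni in zip(y, nu2):
    s2_new = 1.0 / (1.0 / s2_seq + 1.0 / ni)
    m_seq = s2_new * (m_seq / s2_seq + yi / ni)
    s2_seq = s2_new
assert np.allclose([m, s2], [m_seq, s2_seq])
```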
The Multivariate Gaussian distribution
An exponentiated quadratic form

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \qquad x, \mu \in \mathbb{R}^n,\ \Sigma \in \mathbb{R}^{n \times n} \text{ spd}$$

Here spd means symmetric positive definite: Σ = Σ⊺ and v⊺Σv ≥ 0 ∀v ∈ ℝⁿ, with equality only for v = 0 (so that Σ⁻¹ exists and |Σ| > 0).
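As an illustration (mine, not the deck's), evaluating this density numerically via a Cholesky factorization of Σ, checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) via the Cholesky factor L with Sigma = L @ L.T."""
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, x - mu)               # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |Sigma|
    n = mu.size
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.3, 0.1])
assert np.isclose(gauss_logpdf(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).logpdf(x))
```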
The Multivariate Gaussian distribution
Equiprobability lines are ellipsoids
[Figure: contour plot over (x₁, x₂); the equiprobability lines of a two-dimensional Gaussian are ellipses centered at the mean]

▶ ∫ N(x; µ, Σ) dx = 1 and N(x; µ, Σ) > 0 ∀x ∈ ℝⁿ.
▶ An exponentiated quadratic polynomial in x:
$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \tfrac{1}{2}\operatorname{tr}(x x^\top \Lambda)\right) \tag{2}$$
with precision matrix Λ = Σ⁻¹ and η = Σ⁻¹µ, in analogy to the univariate case.
Products of Gaussians are Gaussians
Closure under Multiplication

$$\mathcal{N}(x; a, A)\, \mathcal{N}(x; b, B) = Z \cdot \mathcal{N}(x; c, C), \quad \text{with}$$
$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b), \qquad Z = \mathcal{N}(a; b, A + B).$$

Note the similarity to the univariate case.
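A numerical check of the product formula (my sketch, with arbitrary parameters): multiply two Gaussian densities pointwise and compare against Z · N(x; c, C).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
b, B = np.array([2.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.8]])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z = mvn(b, A + B).pdf(a)   # normalization constant N(a; b, A + B)

x = np.array([0.7, 0.4])   # arbitrary test point
lhs = mvn(a, A).pdf(x) * mvn(b, B).pdf(x)
rhs = Z * mvn(c, C).pdf(x)
assert np.isclose(lhs, rhs)
```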
Linear Projections of Gaussians are Gaussians
Closure under linear maps
[Figure: a two-dimensional Gaussian and its image under a linear map]

$$p(z) = \mathcal{N}(z; \mu, \Sigma) \quad \Rightarrow \quad p(Az) = \mathcal{N}(Az;\ A\mu,\ A\Sigma A^\top)$$
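An empirical illustration of the closure (mine; A, µ, Σ are made up, and A need not be square): samples of Az have mean Aµ and covariance AΣA⊺.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])   # maps R^3 -> R^2

z = rng.multivariate_normal(mu, Sigma, size=200_000)
x = z @ A.T
print(x.mean(axis=0))              # ~ A @ mu
print(np.cov(x.T))                 # ~ A @ Sigma @ A.T
```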
Marginals of Gaussians are Gaussians
Closure under marginalization
[Figure: a two-dimensional Gaussian over (x₁, x₂) and its one-dimensional marginal on x₁]

With the projection A = (I  0), the rule above gives the marginal directly:

$$\int \mathcal{N}\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})$$

▶ so every finite-dimensional Gaussian is a marginal of infinitely many larger ones
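In code (my sketch), marginalization is therefore just index selection on µ and Σ:

```python
import numpy as np

mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

idx = [0, 2]                          # keep x1 and x3, integrate out x2
mu_marg = mu[idx]                     # sub-vector of the mean
Sigma_marg = Sigma[np.ix_(idx, idx)]  # sub-block of the covariance
print(mu_marg, Sigma_marg, sep="\n")
```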
Cuts through Gaussians are Gaussians
Closure under conditioning
[Figure: joint density over (x₁, x₂) and the conditional slice along the observed line]

$$p(x \mid Ax = y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\left(x;\ \mu + \Sigma A^\top (A\Sigma A^\top)^{-1}(y - A\mu),\ \Sigma - \Sigma A^\top (A\Sigma A^\top)^{-1} A\Sigma\right)$$
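A small numeric illustration (mine, with arbitrary numbers): conditioning on an exact linear "cut" Ax = y leaves zero variance along the observed direction.

```python
import numpy as np

# condition x ~ N(mu, Sigma) on the exact linear observation A @ x = y
mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.4],
                  [0.1, 0.4, 1.5]])
A = np.array([[1.0, 1.0, 0.0]])   # observe x1 + x2
y = np.array([2.5])

G = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T)  # gain
m_cond = mu + G @ (y - A @ mu)
S_cond = Sigma - G @ A @ Sigma

print(A @ m_cond)        # = y: the conditional mean satisfies the constraint
print(A @ S_cond @ A.T)  # ~0: no remaining uncertainty along A
```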
Inference with Gaussians
Since conditioning and marginalization reduce to linear algebra, so does Bayes' theorem
Theorem
If
$$p(x) = \mathcal{N}(x; \mu, \Sigma) \quad \text{and} \quad p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda),$$
then
$$p(y) = \mathcal{N}(y;\ A\mu + b,\ \Lambda + A\Sigma A^\top)$$
and
$$p(x \mid y) = \mathcal{N}\Big(x;\ \mu + \underbrace{\Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\,\underbrace{(y - (A\mu + b))}_{\text{residual}},\ \Sigma - \Sigma A^\top \underbrace{(A\Sigma A^\top + \Lambda)}_{\text{Gram matrix}}{}^{-1} A\Sigma\Big)$$
$$= \mathcal{N}\Big(x;\ \underbrace{(\Sigma^{-1} + A^\top \Lambda^{-1} A)}_{\text{precision matrix}}{}^{-1}\, (A^\top \Lambda^{-1}(y - b) + \Sigma^{-1}\mu),\ \underbrace{(\Sigma^{-1} + A^\top \Lambda^{-1} A)}_{\text{precision matrix}}{}^{-1}\Big)$$
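The theorem translates directly into a few lines of linear algebra. Below, a sketch (mine, with made-up numbers) implementing both the gain/residual form and the precision form and checking that they agree:

```python
import numpy as np

def posterior_gain_form(mu, Sigma, A, b, Lam, y):
    """p(x | y) via the gain/residual form."""
    G = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T + Lam)  # gain
    m = mu + G @ (y - (A @ mu + b))                          # mean update
    S = Sigma - G @ A @ Sigma                                # covariance update
    return m, S

def posterior_precision_form(mu, Sigma, A, b, Lam, y):
    """p(x | y) via the precision form."""
    P = np.linalg.inv(Sigma) + A.T @ np.linalg.inv(Lam) @ A  # posterior precision
    S = np.linalg.inv(P)
    m = S @ (A.T @ np.linalg.inv(Lam) @ (y - b) + np.linalg.solve(Sigma, mu))
    return m, S

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, -1.0]]); b = np.array([0.5])
Lam = np.array([[0.1]]); y = np.array([2.0])

m1, S1 = posterior_gain_form(mu, Sigma, A, b, Lam, y)
m2, S2 = posterior_precision_form(mu, Sigma, A, b, Lam, y)
assert np.allclose(m1, m2) and np.allclose(S1, S2)
```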
The Core Insight for All of This
Gaussian inference is linear algebra at its core [image: Konrad Jacobs]
$$A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad M := (S - RP^{-1}Q)^{-1}$$
$$A^{-1} = \begin{pmatrix} P^{-1} + P^{-1}QMRP^{-1} & -P^{-1}QM \\ -MRP^{-1} & M \end{pmatrix}$$
$$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U(W^{-1} + V^\top Z^{-1}U)^{-1}V^\top Z^{-1} \qquad \text{(matrix inversion lemma)}$$
$$|Z + UWV^\top| = |Z| \cdot |W| \cdot |W^{-1} + V^\top Z^{-1}U| \qquad \text{(matrix determinant lemma)}$$
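These identities are easy to sanity-check numerically; a quick sketch (mine) with random, well-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2
Z = rng.normal(size=(n, n)) + n * np.eye(n)   # keep Z invertible
U = rng.normal(size=(n, k))
W = rng.normal(size=(k, k)) + k * np.eye(k)   # keep W invertible
V = rng.normal(size=(n, k))

Zi = np.linalg.inv(Z)
inner = np.linalg.inv(W) + V.T @ Zi @ U

# matrix inversion lemma
lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = Zi - Zi @ U @ np.linalg.inv(inner) @ V.T @ Zi
assert np.allclose(lhs, rhs)

# matrix determinant lemma
assert np.isclose(np.linalg.det(Z + U @ W @ V.T),
                  np.linalg.det(Z) * np.linalg.det(W) * np.linalg.det(inner))
```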
Example 1: Conditional Independence, Marginal Correlation
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: x₂ (temperature outside) is a parent of both x₁ (temperature in building 1) and x₃ (temperature in building 2)]

$$p(\nu) = \mathcal{N}(\nu; \mu, \operatorname{diag}(\sigma^2)), \qquad A = \begin{pmatrix} 1 & w_1 & 0 \\ 0 & 1 & 0 \\ 0 & w_3 & 1 \end{pmatrix} \quad \Longrightarrow$$

$$p(x = A\nu) = \mathcal{N}\Bigg(x;\ \underbrace{A\mu}_{=:m},\ \underbrace{\begin{pmatrix} w_1^2\sigma_2^2 + \sigma_1^2 & w_1\sigma_2^2 & w_1 w_3 \sigma_2^2 \\ w_1\sigma_2^2 & \sigma_2^2 & w_3\sigma_2^2 \\ w_1 w_3 \sigma_2^2 & w_3\sigma_2^2 & w_3^2\sigma_2^2 + \sigma_3^2 \end{pmatrix}}_{=:\Sigma}\Bigg)$$

In particular x₂ = ν₂, so p(x₂) = N(x₂; µ₂, σ₂²).
Example 1: Conditional Independence, Marginal Correlation
A zero in the precision matrix means independence conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
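To see this takeaway numerically, here is a sketch (mine, with made-up weights and variances) of the temperature model from the previous slides: the covariance couples x₁ and x₃, but the precision matrix has a zero at the (1, 3) entry.

```python
import numpy as np

# Example 1 (temperature model): x = A @ nu with independent nu
w1, w3 = 0.8, 1.2
sigma2 = np.array([1.0, 4.0, 1.5])   # variances of nu_1, nu_2, nu_3
A = np.array([[1.0, w1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, w3, 1.0]])

Sigma = A @ np.diag(sigma2) @ A.T
Lam = np.linalg.inv(Sigma)           # precision matrix

print(Sigma[0, 2])  # w1*w3*sigma_2^2 != 0: x1, x3 are marginally correlated
print(Lam[0, 2])    # ~0: x1 and x3 are independent given x2
```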
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: x₁ (gas price) and x₃ (emission price) are parents of x₂ (electricity price)]

Generative model:
x₁ = ν₁, with p(ν₁) = N(ν₁; µ₁, σ₁²)
x₃ = ν₃, with p(ν₃) = N(ν₃; µ₃, σ₃²)
x₂ = w₁x₁ + w₃x₃ + ν₂, with p(ν₂) = N(ν₂; µ₂, σ₂²)

$$p(x) = \mathcal{N}\Bigg(x;\ m,\ \underbrace{\begin{pmatrix} \sigma_1^2 & w_1\sigma_1^2 & 0 \\ w_1\sigma_1^2 & \sigma_2^2 + w_1^2\sigma_1^2 + w_3^2\sigma_3^2 & w_3\sigma_3^2 \\ 0 & w_3\sigma_3^2 & \sigma_3^2 \end{pmatrix}}_{\Sigma}\Bigg)$$

$$p(x_1, x_3) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix}\right)$$

Gas and emission prices are marginally independent.
Example 2: Explaining away
a ± value in the precision matrix implies ∓ correlation conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
Probabilistic ML — P. Hennig, SS 2021 — Lecture 06: Gaussian Probability Distributions— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 20
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]

[Figure: scatter of (x₁, x₃) samples; x₁ is the gas price in USD/MMBtu, x₃ the emission price in EUR/t]

$$p(x_1, x_3) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix}\right)$$
$$p(x_2) = \mathcal{N}\left(x_2;\ w_1\mu_1 + w_3\mu_3 + \mu_2,\ \sigma_2^2 + w_1^2\sigma_1^2 + w_3^2\sigma_3^2\right)$$
$$p(x_2 \mid x_1, x_3) = \mathcal{N}(x_2;\ w_1 x_1 + w_3 x_3 + \mu_2,\ \sigma_2^2)$$

With w := (w₁, w₃) and Σ₁,₃ := diag(σ₁², σ₃²), conditioning on x₂ gives

$$p(x_1, x_3 \mid x_2) = \mathcal{N}\left(x_{1,3};\ \mu_{1,3} + \Sigma_{1,3} w^\top\, \frac{x_2 - w\mu_{1,3} - \mu_2}{w\Sigma_{1,3}w^\top + \sigma_2^2},\ \Sigma_{1,3} - \frac{\Sigma_{1,3} w^\top w \Sigma_{1,3}}{w\Sigma_{1,3}w^\top + \sigma_2^2}\right)$$
$$= \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix};\ \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} w_1\sigma_1^2 \\ w_3\sigma_3^2 \end{pmatrix} \frac{x_2 - w_1\mu_1 - w_3\mu_3 - \mu_2}{w_1^2\sigma_1^2 + w_3^2\sigma_3^2 + \sigma_2^2},\ \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix} - \frac{1}{w_1^2\sigma_1^2 + w_3^2\sigma_3^2 + \sigma_2^2} \begin{pmatrix} w_1\sigma_1^2 \\ w_3\sigma_3^2 \end{pmatrix} \begin{pmatrix} w_1\sigma_1^2 & w_3\sigma_3^2 \end{pmatrix}\right)$$

The off-diagonal of the conditional covariance, −w₁σ₁² w₃σ₃² / (w₁²σ₁² + w₃²σ₃² + σ₂²), is negative for positive weights: observing the electricity price makes gas and emission prices anti-correlated. This is explaining away.
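Numerically (my sketch, with made-up weights and variances): the conditional covariance of (x₁, x₃) given x₂ picks up a negative off-diagonal.

```python
import numpy as np

# Example 2: x2 = w1*x1 + w3*x3 + nu2, with x1, x3 a priori independent
w = np.array([[0.7, 1.1]])                    # (w1, w3) as a 1x2 matrix
S13 = np.diag([1.0, 2.0])                     # prior covariance of (x1, x3)
s2_2 = 0.5                                    # variance of nu2

gram = (w @ S13 @ w.T).item() + s2_2
S_cond = S13 - (S13 @ w.T @ w @ S13) / gram   # covariance of (x1, x3) | x2

print(S13[0, 1])     # 0.0: marginally independent
print(S_cond[0, 1])  # < 0: anti-correlated once x2 is observed
```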
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$

Today:
▶ Gaussian distributions provide the linear algebra of inference.
▶ products of Gaussians are Gaussians
▶ linear maps of Gaussian variables are Gaussian variables
▶ marginals of Gaussians are Gaussians
▶ linear conditionals of Gaussians are Gaussians

If all variables in a generative model are linearly related, and the distributions of the parent variables are Gaussian, then all conditionals, joints and marginals are Gaussian, with means and covariances computable by linear algebra operations.

▶ A zero off-diagonal element in the covariance matrix implies independence once all other variables are integrated out:
$$[\Sigma]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j) = \mathcal{N}(x_i; [\mu]_i, [\Sigma]_{ii}) \cdot \mathcal{N}(x_j; [\mu]_j, [\Sigma]_{jj})$$
▶ A zero off-diagonal element in the precision matrix implies independence conditional on all other variables:
$$[\Sigma^{-1}]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j \mid x_{\neq i,j}) = p(x_i \mid x_{\neq i,j}) \cdot p(x_j \mid x_{\neq i,j})$$
The Toolbox
Framework:
$$\int p(x_1, x_2)\, dx_2 = p(x_1) \qquad\quad p(x_1, x_2) = p(x_1 \mid x_2)\, p(x_2) \qquad\quad p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶
▶

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶
▶