06 Gaussian Distributions
Lecture 06
Gaussian Probability Distributions
Philipp Hennig
04 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 # | date   | content                      | Ex ||  # | date   | content                      | Ex
 1 | 20.04. | Introduction                 |  1 || 14 | 09.06. | Logistic Regression          |  8
 2 | 21.04. | Reasoning under Uncertainty  |    || 15 | 15.06. | Exponential Families         |
 3 | 27.04. | Continuous Variables         |  2 || 16 | 16.06. | Graphical Models             |  9
 4 | 28.04. | Monte Carlo                  |    || 17 | 22.06. | Factor Graphs                |
 5 | 04.05. | Markov Chain Monte Carlo     |  3 || 18 | 23.06. | The Sum-Product Algorithm    | 10
 6 | 05.05. | Gaussian Distributions       |    || 19 | 29.06. | Example: Topic Models        |
 7 | 11.05. | Parametric Regression        |  4 || 20 | 30.06. | Mixture Models               | 11
 8 | 12.05. | Understanding Deep Learning  |    || 21 | 06.07. | EM                           |
 9 | 18.05. | Gaussian Processes           |  5 || 22 | 07.07. | Variational Inference        | 12
10 | 19.05. | An Example for GP Regression |    || 23 | 13.07. | Example: Topic Models        |
11 | 25.05. | Understanding Kernels        |  6 || 24 | 14.07. | Example: Inferring Topics    | 13
12 | 26.05. | Gauss-Markov Models          |    || 25 | 20.07. | Example: Kernel Topic Models |
13 | 08.06. | GP Classification            |  7 || 26 | 21.07. | Revision                     |
The (univariate) Gaussian distribution
an exponentiated square
[Figure: the bell curve of p(x) = N(x; µ, σ²), with µ − σ, µ, and µ + σ marked on the x-axis]

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} =: \mathcal{N}(x; \mu, \sigma^2)$$

µ: the mean of x
σ²: the variance of x
σ: the standard deviation of x
Univariate Gaussians
some observations, notation, and conventions
Definition
$$\mathcal{N}(x; \mu, \sigma^2) := \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \text{with } \mu, \sigma \in \mathbb{R},$$
will be called the Gaussian or normal distribution of x. We call x the argument or variable, µ, σ 2 the
parameters. We write x ∼ N (µ, σ 2 ) to say that the variable x is distributed with pdf N (x; µ, σ 2 ).
▶ ∫ N(x; µ, σ²) dx = 1 and N(x; µ, σ²) > 0 ∀x ∈ ℝ. So N is the density of a probability measure.
▶ Symmetry in x and µ: N(x; µ, σ²) = N(µ; x, σ²)
▶ An exponentiated quadratic polynomial in x, with natural parameters (η, τ) and log-normalization constant a:
$$\mathcal{N}(x; \mu, \sigma^2) = \exp\left(a + \eta x - \tfrac{1}{2}\tau x^2\right) \quad \text{with } \tau = \sigma^{-2} \text{ (“precision”)},\ \eta = \sigma^{-2}\mu,$$
$$a = -\tfrac{1}{2}\left(\log(2\pi) - \log\tau + \eta^2/\tau\right)$$
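Not part of the original slides: a small Python sanity check (with arbitrary numbers) that the (η, τ) parameterization above reproduces the usual density, using scipy.stats.norm as reference.

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.7                 # moment parameters
tau, eta = 1 / sigma2, mu / sigma2    # precision and linear natural parameter
a = -0.5 * (np.log(2 * np.pi) - np.log(tau) + eta**2 / tau)  # log-normalizer

x = np.linspace(-2, 5, 7)
pdf_natural = np.exp(a + eta * x - 0.5 * tau * x**2)
pdf_moment = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
assert np.allclose(pdf_natural, pdf_moment)
```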
Gaussian Inference
The Gaussian is its own conjugate prior.
[Figure: prior p(x), likelihood p(y | x), and posterior p(x | y) as curves over x]

Let
$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \mathcal{N}(y; x, \nu^2).$$
Then
$$p(x \mid y) = \frac{p(x)\, p(y \mid x)}{\int p(x)\, p(y \mid x)\, dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$
$$s^2 := \frac{1}{\sigma^{-2} + \nu^{-2}}, \qquad m := \frac{\sigma^{-2}\mu + \nu^{-2} y}{\sigma^{-2} + \nu^{-2}}.$$
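A minimal numeric sketch of this update (mine, not the lecture's code): fuse a Gaussian prior with one Gaussian observation in precision form.

```python
import numpy as np

def gaussian_posterior(mu, sigma2, y, nu2):
    """Posterior N(m, s2) of x ~ N(mu, sigma2) after observing y ~ N(x, nu2)."""
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / nu2)   # precisions add
    m = s2 * (mu / sigma2 + y / nu2)        # precision-weighted mean
    return m, s2

m, s2 = gaussian_posterior(mu=2.0, sigma2=1.0, y=3.0, nu2=0.5)
print(m, s2)  # the posterior mean shrinks toward the more precise source
```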
Gaussian Inference
Least-Squares Estimation
[Figure: prior p(x) and the increasingly concentrated posterior as observations arrive]

$$p(x) = \mathcal{N}(x; \mu, \sigma^2), \qquad p(y \mid x) = \prod_{i=1}^{N} \mathcal{N}(y_i; x, \nu_i^2)$$
$$p(x \mid y) = \frac{p(x)\, p(y \mid x)}{\int p(x)\, p(y \mid x)\, dx} = \mathcal{N}(x; m, s^2), \quad \text{with}$$
$$s^{-2} := \sigma^{-2} + \sum_{i=1}^{N} \nu_i^{-2}, \qquad s^{-2} m := \sigma^{-2}\mu + \sum_{i=1}^{N} \nu_i^{-2} y_i.$$
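Extending the sketch above to N observations (again an illustration with made-up data): the posterior precision is the sum of all precisions, and the same answer is obtained by applying the single-observation update sequentially.

```python
import numpy as np

def gaussian_posterior_n(mu, sigma2, y, nu2):
    """Posterior after observing y_i ~ N(x, nu2_i) for arrays y, nu2."""
    s2 = 1.0 / (1.0 / sigma2 + np.sum(1.0 / nu2))
    m = s2 * (mu / sigma2 + np.sum(y / nu2))
    return m, s2

y = np.array([2.9, 3.3, 3.1])
nu2 = np.array([0.5, 0.8, 0.3])
m, s2 = gaussian_posterior_n(0.0, 10.0, y, nu2)

# sequential single-observation updates give the same result
m_seq, s2_seq = 0.0, 10.0
for yi, ni in zip(y, nu2):
    s2_new = 1.0 / (1.0 / s2_seq + 1.0 / ni)
    m_seq = s2_new * (m_seq / s2_seq + yi / ni)
    s2_seq = s2_new
assert np.allclose([m, s2], [m_seq, s2_seq])
```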
The Multivariate Gaussian distribution
An exponentiated quadratic form

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \qquad x, \mu \in \mathbb{R}^n,\ \Sigma \in \mathbb{R}^{n \times n} \text{ spd}$$

Here spd means symmetric positive definite: Σ = Σ⊺ and v⊺Σv ≥ 0 ∀v ∈ ℝⁿ, with equality only for v = 0 (so that Σ⁻¹ exists and |Σ| > 0).
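As an illustration (mine, not the deck's), evaluating this density numerically via a Cholesky factorization of Σ, checked against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) via the Cholesky factor L with Sigma = L @ L.T."""
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, x - mu)               # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |Sigma|
    n = mu.size
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.3, 0.1])
assert np.isclose(gauss_logpdf(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).logpdf(x))
```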
The Multivariate Gaussian distribution
Equiprobability lines are ellipsoids
[Figure: contour plot over (x₁, x₂); the equiprobability lines of a two-dimensional Gaussian are ellipses centered at the mean]

▶ ∫ N(x; µ, Σ) dx = 1 and N(x; µ, Σ) > 0 ∀x ∈ ℝⁿ.
▶ An exponentiated quadratic polynomial in x:
$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \tfrac{1}{2}\operatorname{tr}(x x^\top \Lambda)\right) \tag{2}$$
with precision matrix Λ = Σ⁻¹ and η = Σ⁻¹µ, in analogy to the univariate case.
Products of Gaussians are Gaussians
Closure under Multiplication

$$\mathcal{N}(x; a, A)\, \mathcal{N}(x; b, B) = Z \cdot \mathcal{N}(x; c, C), \quad \text{with}$$
$$C = (A^{-1} + B^{-1})^{-1}, \qquad c = C(A^{-1}a + B^{-1}b), \qquad Z = \mathcal{N}(a; b, A + B).$$

Note the similarity to the univariate case.
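A numerical check of the product formula (my sketch, with arbitrary parameters): multiply two Gaussian densities pointwise and compare against Z · N(x; c, C).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

a, A = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
b, B = np.array([2.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.8]])

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
Z = mvn(b, A + B).pdf(a)   # normalization constant N(a; b, A + B)

x = np.array([0.7, 0.4])   # arbitrary test point
lhs = mvn(a, A).pdf(x) * mvn(b, B).pdf(x)
rhs = Z * mvn(c, C).pdf(x)
assert np.isclose(lhs, rhs)
```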
Linear Projections of Gaussians are Gaussians
Closure under linear maps
[Figure: a two-dimensional Gaussian and its image under a linear map]

$$p(z) = \mathcal{N}(z; \mu, \Sigma) \quad \Rightarrow \quad p(Az) = \mathcal{N}(Az;\ A\mu,\ A\Sigma A^\top)$$
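An empirical illustration of the closure (mine; A, µ, Σ are made up, and A need not be square): samples of Az have mean Aµ and covariance AΣA⊺.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])   # maps R^3 -> R^2

z = rng.multivariate_normal(mu, Sigma, size=200_000)
x = z @ A.T
print(x.mean(axis=0))              # ~ A @ mu
print(np.cov(x.T))                 # ~ A @ Sigma @ A.T
```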
Marginals of Gaussians are Gaussians
Closure under marginalization
[Figure: a two-dimensional Gaussian over (x₁, x₂) and its one-dimensional marginal on x₁]

With the projection A = (I  0), the rule above gives the marginal directly:

$$\int \mathcal{N}\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right) dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})$$

▶ so every finite-dimensional Gaussian is a marginal of infinitely many larger ones
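In code (my sketch), marginalization is therefore just index selection on µ and Σ:

```python
import numpy as np

mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

idx = [0, 2]                          # keep x1 and x3, integrate out x2
mu_marg = mu[idx]                     # sub-vector of the mean
Sigma_marg = Sigma[np.ix_(idx, idx)]  # sub-block of the covariance
print(mu_marg, Sigma_marg, sep="\n")
```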
Cuts through Gaussians are Gaussians
Closure under conditioning
[Figure: joint density over (x₁, x₂) and the conditional slice along the observed line]

$$p(x \mid Ax = y) = \frac{p(x, y)}{p(y)} = \mathcal{N}\left(x;\ \mu + \Sigma A^\top (A\Sigma A^\top)^{-1}(y - A\mu),\ \Sigma - \Sigma A^\top (A\Sigma A^\top)^{-1} A\Sigma\right)$$
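A small numeric illustration (mine, with arbitrary numbers): conditioning on an exact linear "cut" Ax = y leaves zero variance along the observed direction.

```python
import numpy as np

# condition x ~ N(mu, Sigma) on the exact linear observation A @ x = y
mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.4],
                  [0.1, 0.4, 1.5]])
A = np.array([[1.0, 1.0, 0.0]])   # observe x1 + x2
y = np.array([2.5])

G = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T)  # gain
m_cond = mu + G @ (y - A @ mu)
S_cond = Sigma - G @ A @ Sigma

print(A @ m_cond)        # = y: the conditional mean satisfies the constraint
print(A @ S_cond @ A.T)  # ~0: no remaining uncertainty along A
```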
Inference with Gaussians
Since conditioning and marginalization reduce to linear algebra, so does Bayes' theorem
Theorem
If
$$p(x) = \mathcal{N}(x; \mu, \Sigma) \quad \text{and} \quad p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda),$$
then
$$p(y) = \mathcal{N}(y;\ A\mu + b,\ \Lambda + A\Sigma A^\top)$$
and
$$p(x \mid y) = \mathcal{N}\Big(x;\ \mu + \underbrace{\Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1}}_{\text{gain}}\,\underbrace{(y - (A\mu + b))}_{\text{residual}},\ \Sigma - \Sigma A^\top \underbrace{(A\Sigma A^\top + \Lambda)}_{\text{Gram matrix}}{}^{-1} A\Sigma\Big)$$
$$= \mathcal{N}\Big(x;\ \underbrace{(\Sigma^{-1} + A^\top \Lambda^{-1} A)}_{\text{precision matrix}}{}^{-1}\, (A^\top \Lambda^{-1}(y - b) + \Sigma^{-1}\mu),\ \underbrace{(\Sigma^{-1} + A^\top \Lambda^{-1} A)}_{\text{precision matrix}}{}^{-1}\Big)$$
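The theorem translates directly into a few lines of linear algebra. Below, a sketch (mine, with made-up numbers) implementing both the gain/residual form and the precision form and checking that they agree:

```python
import numpy as np

def posterior_gain_form(mu, Sigma, A, b, Lam, y):
    """p(x | y) via the gain/residual form."""
    G = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T + Lam)  # gain
    m = mu + G @ (y - (A @ mu + b))                          # mean update
    S = Sigma - G @ A @ Sigma                                # covariance update
    return m, S

def posterior_precision_form(mu, Sigma, A, b, Lam, y):
    """p(x | y) via the precision form."""
    P = np.linalg.inv(Sigma) + A.T @ np.linalg.inv(Lam) @ A  # posterior precision
    S = np.linalg.inv(P)
    m = S @ (A.T @ np.linalg.inv(Lam) @ (y - b) + np.linalg.solve(Sigma, mu))
    return m, S

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, -1.0]]); b = np.array([0.5])
Lam = np.array([[0.1]]); y = np.array([2.0])

m1, S1 = posterior_gain_form(mu, Sigma, A, b, Lam, y)
m2, S2 = posterior_precision_form(mu, Sigma, A, b, Lam, y)
assert np.allclose(m1, m2) and np.allclose(S1, S2)
```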
The Core Insight for All of This
Gaussian inference is linear algebra at its core [image: Konrad Jacobs]
$$A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad M := (S - RP^{-1}Q)^{-1}$$
$$A^{-1} = \begin{pmatrix} P^{-1} + P^{-1}QMRP^{-1} & -P^{-1}QM \\ -MRP^{-1} & M \end{pmatrix}$$
$$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U(W^{-1} + V^\top Z^{-1}U)^{-1}V^\top Z^{-1} \qquad \text{(matrix inversion lemma)}$$
$$|Z + UWV^\top| = |Z| \cdot |W| \cdot |W^{-1} + V^\top Z^{-1}U| \qquad \text{(matrix determinant lemma)}$$
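These identities are easy to sanity-check numerically; a quick sketch (mine) with random, well-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2
Z = rng.normal(size=(n, n)) + n * np.eye(n)   # keep Z invertible
U = rng.normal(size=(n, k))
W = rng.normal(size=(k, k)) + k * np.eye(k)   # keep W invertible
V = rng.normal(size=(n, k))

Zi = np.linalg.inv(Z)
inner = np.linalg.inv(W) + V.T @ Zi @ U

# matrix inversion lemma
lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = Zi - Zi @ U @ np.linalg.inv(inner) @ V.T @ Zi
assert np.allclose(lhs, rhs)

# matrix determinant lemma
assert np.isclose(np.linalg.det(Z + U @ W @ V.T),
                  np.linalg.det(Z) * np.linalg.det(W) * np.linalg.det(inner))
```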
Example 1: Conditional Independence, Marginal Correlation
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: x₂ (temperature outside) is a parent of both x₁ (temperature in building 1) and x₃ (temperature in building 2)]

$$p(\nu) = \mathcal{N}(\nu; \mu, \operatorname{diag}(\sigma^2)), \qquad A = \begin{pmatrix} 1 & w_1 & 0 \\ 0 & 1 & 0 \\ 0 & w_3 & 1 \end{pmatrix} \quad \Longrightarrow$$

$$p(x = A\nu) = \mathcal{N}\Bigg(x;\ \underbrace{A\mu}_{=:m},\ \underbrace{\begin{pmatrix} w_1^2\sigma_2^2 + \sigma_1^2 & w_1\sigma_2^2 & w_1 w_3 \sigma_2^2 \\ w_1\sigma_2^2 & \sigma_2^2 & w_3\sigma_2^2 \\ w_1 w_3 \sigma_2^2 & w_3\sigma_2^2 & w_3^2\sigma_2^2 + \sigma_3^2 \end{pmatrix}}_{=:\Sigma}\Bigg)$$

In particular x₂ = ν₂, so p(x₂) = N(x₂; µ₂, σ₂²).
Example 1: Conditional Independence, Marginal Correlation
A zero in the precision matrix means independence conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
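To see this takeaway numerically, here is a sketch (mine, with made-up weights and variances) of the temperature model from the previous slides: the covariance couples x₁ and x₃, but the precision matrix has a zero at the (1, 3) entry.

```python
import numpy as np

# Example 1 (temperature model): x = A @ nu with independent nu
w1, w3 = 0.8, 1.2
sigma2 = np.array([1.0, 4.0, 1.5])   # variances of nu_1, nu_2, nu_3
A = np.array([[1.0, w1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, w3, 1.0]])

Sigma = A @ np.diag(sigma2) @ A.T
Lam = np.linalg.inv(Sigma)           # precision matrix

print(Sigma[0, 2])  # w1*w3*sigma_2^2 != 0: x1, x3 are marginally correlated
print(Lam[0, 2])    # ~0: x1 and x3 are independent given x2
```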
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: x₁ (gas price) and x₃ (emission price) are parents of x₂ (electricity price)]

Generative model:
x₁ = ν₁, with p(ν₁) = N(ν₁; µ₁, σ₁²)
x₃ = ν₃, with p(ν₃) = N(ν₃; µ₃, σ₃²)
x₂ = w₁x₁ + w₃x₃ + ν₂, with p(ν₂) = N(ν₂; µ₂, σ₂²)

$$p(x) = \mathcal{N}\Bigg(x;\ m,\ \underbrace{\begin{pmatrix} \sigma_1^2 & w_1\sigma_1^2 & 0 \\ w_1\sigma_1^2 & \sigma_2^2 + w_1^2\sigma_1^2 + w_3^2\sigma_3^2 & w_3\sigma_3^2 \\ 0 & w_3\sigma_3^2 & \sigma_3^2 \end{pmatrix}}_{\Sigma}\Bigg)$$

$$p(x_1, x_3) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix}\right)$$

Gas and emission prices are marginally independent.
Example 2: Explaining away
a ± value in the precision matrix implies ∓ correlation conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
Probabilistic ML — P. Hennig, SS 2021 — Lecture 06: Gaussian Probability Distributions— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 20
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]

[Figure: scatter of (x₁, x₃) samples; x₁ is the gas price in USD/MMBtu, x₃ the emission price in EUR/t]

$$p(x_1, x_3) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix}\right)$$
$$p(x_2) = \mathcal{N}\left(x_2;\ w_1\mu_1 + w_3\mu_3 + \mu_2,\ \sigma_2^2 + w_1^2\sigma_1^2 + w_3^2\sigma_3^2\right)$$
$$p(x_2 \mid x_1, x_3) = \mathcal{N}(x_2;\ w_1 x_1 + w_3 x_3 + \mu_2,\ \sigma_2^2)$$

With w := (w₁, w₃) and Σ₁,₃ := diag(σ₁², σ₃²), conditioning on x₂ gives

$$p(x_1, x_3 \mid x_2) = \mathcal{N}\left(x_{1,3};\ \mu_{1,3} + \Sigma_{1,3} w^\top\, \frac{x_2 - w\mu_{1,3} - \mu_2}{w\Sigma_{1,3}w^\top + \sigma_2^2},\ \Sigma_{1,3} - \frac{\Sigma_{1,3} w^\top w \Sigma_{1,3}}{w\Sigma_{1,3}w^\top + \sigma_2^2}\right)$$
$$= \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_3 \end{pmatrix};\ \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} w_1\sigma_1^2 \\ w_3\sigma_3^2 \end{pmatrix} \frac{x_2 - w_1\mu_1 - w_3\mu_3 - \mu_2}{w_1^2\sigma_1^2 + w_3^2\sigma_3^2 + \sigma_2^2},\ \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_3^2 \end{pmatrix} - \frac{1}{w_1^2\sigma_1^2 + w_3^2\sigma_3^2 + \sigma_2^2} \begin{pmatrix} w_1\sigma_1^2 \\ w_3\sigma_3^2 \end{pmatrix} \begin{pmatrix} w_1\sigma_1^2 & w_3\sigma_3^2 \end{pmatrix}\right)$$

The off-diagonal of the conditional covariance, −w₁σ₁² w₃σ₃² / (w₁²σ₁² + w₃²σ₃² + σ₂²), is negative for positive weights: observing the electricity price makes gas and emission prices anti-correlated. This is explaining away.
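Numerically (my sketch, with made-up weights and variances): the conditional covariance of (x₁, x₃) given x₂ picks up a negative off-diagonal.

```python
import numpy as np

# Example 2: x2 = w1*x1 + w3*x3 + nu2, with x1, x3 a priori independent
w = np.array([[0.7, 1.1]])                    # (w1, w3) as a 1x2 matrix
S13 = np.diag([1.0, 2.0])                     # prior covariance of (x1, x3)
s2_2 = 0.5                                    # variance of nu2

gram = (w @ S13 @ w.T).item() + s2_2
S_cond = S13 - (S13 @ w.T @ w @ S13) / gram   # covariance of (x1, x3) | x2

print(S13[0, 1])     # 0.0: marginally independent
print(S_cond[0, 1])  # < 0: anti-correlated once x2 is observed
```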
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$

Today:
▶ Gaussian distributions provide the linear algebra of inference.
▶ products of Gaussians are Gaussians
▶ linear maps of Gaussian variables are Gaussian variables
▶ marginals of Gaussians are Gaussians
▶ linear conditionals of Gaussians are Gaussians

If all variables in a generative model are linearly related, and the distributions of the parent variables are Gaussian, then all conditionals, joints and marginals are Gaussian, with means and covariances computable by linear algebra operations.

▶ A zero off-diagonal element in the covariance matrix implies independence once all other variables are integrated out:
$$[\Sigma]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j) = \mathcal{N}(x_i; [\mu]_i, [\Sigma]_{ii}) \cdot \mathcal{N}(x_j; [\mu]_j, [\Sigma]_{jj})$$
▶ A zero off-diagonal element in the precision matrix implies independence conditional on all other variables:
$$[\Sigma^{-1}]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j \mid x_{\neq i,j}) = p(x_i \mid x_{\neq i,j}) \cdot p(x_j \mid x_{\neq i,j})$$
The Toolbox
Framework:
$$\int p(x_1, x_2)\, dx_2 = p(x_1) \qquad\quad p(x_1, x_2) = p(x_1 \mid x_2)\, p(x_2) \qquad\quad p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶
▶

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶
▶