03 Probability Theory
Lecture slides for Chapter 3 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-26
adapted by m.n. for CMPS 392
Probability
• Sample space Ω: set of all outcomes of a random experiment
• Set of events ℱ: a collection of subsets of Ω (each event is a set of outcomes of the experiment).
• Probability measure: 𝑃: ℱ → ℝ
q Axioms of probability
o 𝑃(𝐴) ≥ 0 for all 𝐴 ∈ ℱ
o 𝑃(Ω) = 1
o If 𝐴₁, 𝐴₂, … are disjoint events, then 𝑃(⋃ᵢ 𝐴ᵢ) = ∑ᵢ 𝑃(𝐴ᵢ)
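As a quick sanity check, the three axioms can be verified numerically for a small discrete distribution (a toy fair-die example, not from the slides):

```python
# Toy sample space for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
P = {w: 1 / 6 for w in omega}          # probability of each outcome

def prob(event):
    """P(A) for an event A, i.e. a subset of the sample space."""
    return sum(P[w] for w in event)

# Axiom 1: non-negativity
assert all(prob({w}) >= 0 for w in omega)
# Axiom 2: P(Omega) = 1
assert abs(prob(omega) - 1.0) < 1e-12
# Axiom 3: additivity for disjoint events
evens, odds = {2, 4, 6}, {1, 3, 5}
assert abs(prob(evens | odds) - (prob(evens) + prob(odds))) < 1e-12
```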
(Goodfellow 2016)
Random variable
• Consider an experiment in which we flip 10 coins, and we want to
know the number of coins that come up heads.
• Here, the elements of the sample space Ω are 10-length
sequences of heads and tails.
• For example, we might have
𝑤₀ = ⟨𝐻, 𝐻, 𝑇, 𝐻, 𝑇, 𝐻, 𝐻, 𝑇, 𝑇, 𝑇⟩
• However, in practice, we usually do not care about the probability
of obtaining any particular sequence of heads and tails.
• Instead we usually care about real-valued functions of outcomes,
such as
q the number of heads that appear among our 10 tosses,
q or the length of the longest run of tails.
• These functions, under some technical conditions, are known as
random variables: 𝑋: Ω → ℝ
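The coin-flip example above can be sketched in code: an outcome is a full sequence of flips, and a random variable is just a function mapping that outcome to a number (a minimal illustration, not from the slides):

```python
import random

random.seed(0)

def flip_10():
    """One outcome w: a length-10 sequence of heads (H) and tails (T)."""
    return tuple(random.choice("HT") for _ in range(10))

def num_heads(w):
    """A random variable X: maps an outcome to a real number."""
    return sum(1 for c in w if c == "H")

def longest_tail_run(w):
    """Another random variable: length of the longest run of tails."""
    best = run = 0
    for c in w:
        run = run + 1 if c == "T" else 0
        best = max(best, run)
    return best

w = ("H", "H", "T", "H", "T", "H", "H", "T", "T", "T")
print(num_heads(w), longest_tail_run(w))  # 5 heads, longest tail run of 3
```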
Discrete vs. continuous
• Discrete random variable:
q 𝑃(𝑋 = 𝑘) = 𝑃({𝑤 : 𝑋(𝑤) = 𝑘})
• Continuous random variable:
q 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = 𝑃({𝑤 : 𝑎 ≤ 𝑋(𝑤) ≤ 𝑏})
Probability Mass Function
(discrete variable)
Probability Density Function
(continuous variable)
Conditional Probability
Chain Rule of Probability
Independence
Conditional Independence
Expectation
linearity of expectations:
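The linearity formula itself appeared as an image in the slides; numerically, E[aX + bY] = aE[X] + bE[Y] holds even when X and Y are dependent, which a small discrete joint distribution can demonstrate (an illustrative sketch, not from the slides):

```python
# Joint distribution of two dependent variables (toy example):
# Y is forced to be >= X, so X and Y are clearly not independent.
joint = {}
for x in (1, 2, 3):
    for y in (1, 2, 3):
        joint[(x, y)] = 1.0 if y >= x else 0.0
total = sum(joint.values())
joint = {k: v / total for k, v in joint.items()}  # normalize

def E(f):
    """Expectation of f(X, Y) under the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

a, b = 2.0, -3.0
lhs = E(lambda x, y: a * x + b * y)
rhs = a * E(lambda x, y: x) + b * E(lambda x, y: y)
assert abs(lhs - rhs) < 1e-12  # linearity holds despite the dependence
```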
Variance and Covariance
Covariance matrix:
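The covariance-matrix formula was a figure in the slides; as a sketch of what it computes, the sample covariance matrix of correlated 2-D data is symmetric with the per-component variances on its diagonal (illustrative data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 samples of a 2-D random vector whose second component
# depends on the first, so the off-diagonal covariance is nonzero.
x = rng.normal(size=1000)
data = np.stack([x, 0.5 * x + rng.normal(scale=0.1, size=1000)], axis=1)

C = np.cov(data, rowvar=False)  # 2x2 sample covariance matrix

assert C.shape == (2, 2)
assert np.allclose(C, C.T)            # covariance matrices are symmetric
assert C[0, 0] > 0 and C[1, 1] > 0    # variances on the diagonal
```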
Bernoulli Distribution
Gaussian Distribution
Parametrized by variance:
Parametrized by precision:
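The two density formulas were figures in the slides; as a sketch, the variance parametrization N(x; μ, σ²) and the precision parametrization (β = 1/σ²) give the same density, which can be checked numerically (assumed standard formulas, not taken from the slide images):

```python
import math

def gaussian_pdf_var(x, mu, sigma2):
    """N(x; mu, sigma^2), parametrized by the variance sigma^2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_pdf_prec(x, mu, beta):
    """Same density, parametrized by the precision beta = 1 / sigma^2."""
    return math.sqrt(beta / (2 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

# Both parametrizations agree when beta = 1 / sigma2.
for x in (-1.0, 0.0, 2.5):
    assert abs(gaussian_pdf_var(x, 0.0, 4.0) - gaussian_pdf_prec(x, 0.0, 0.25)) < 1e-12
```

The precision form is often preferred in code because it avoids a division inside the exponent.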
Gaussian Distribution
Figure 3.1
Multivariate Gaussian
More Distributions
Exponential:
Laplace:
Dirac:
Empirical Distribution
Mixture Distributions
Figure: a Gaussian mixture with three components, a smoothed version of the empirical distribution.
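Sampling from a mixture can be sketched by ancestral sampling: first draw the component identity, then draw from that component's Gaussian (illustrative parameters, not the ones in the figure):

```python
import random

random.seed(0)

# A Gaussian mixture with three components (illustrative parameters):
weights = [0.3, 0.4, 0.3]   # mixture weights, sum to 1
means   = [-2.0, 0.0, 3.0]
stds    = [0.5, 1.0, 0.7]

def sample_mixture():
    """Ancestral sampling: pick a component, then sample its Gaussian."""
    i = random.choices(range(3), weights=weights)[0]   # latent component id
    return random.gauss(means[i], stds[i])

samples = [sample_mixture() for _ in range(10_000)]
mean = sum(samples) / len(samples)
# The mixture mean is sum_i w_i * mu_i = 0.3*(-2) + 0.4*0 + 0.3*3 = 0.3
assert abs(mean - 0.3) < 0.1
```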
Useful Properties
• 𝜎(𝑥) = exp(𝑥) / (exp(𝑥) + exp(0))
• d𝜎(𝑥)/d𝑥 = 𝜎(𝑥)(1 − 𝜎(𝑥))
• 1 − 𝜎(𝑥) = 𝜎(−𝑥)
• log 𝜎(𝑥) = −𝜁(−𝑥)
• d𝜁(𝑥)/d𝑥 = 𝜎(𝑥)
• ∀𝑥 ∈ (0, 1), 𝜎⁻¹(𝑥) = log(𝑥 / (1 − 𝑥))
• ∀𝑥 > 0, 𝜁⁻¹(𝑥) = log(exp(𝑥) − 1)
• 𝜁(𝑥) = ∫₋∞ˣ 𝜎(𝑦) d𝑦
• 𝜁(𝑥) − 𝜁(−𝑥) = 𝑥
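Several of these identities for the sigmoid σ and softplus ζ(x) = log(1 + exp(x)) can be checked numerically (a minimal sketch using the standard definitions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    """zeta(x) = log(1 + exp(x)), a smoothed version of max(0, x)."""
    return math.log1p(math.exp(x))

for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # sigma(x) = exp(x) / (exp(x) + exp(0))
    assert abs(sigmoid(x) - math.exp(x) / (math.exp(x) + 1.0)) < 1e-12
    # 1 - sigma(x) = sigma(-x)
    assert abs((1 - sigmoid(x)) - sigmoid(-x)) < 1e-12
    # log sigma(x) = -zeta(-x)
    assert abs(math.log(sigmoid(x)) + softplus(-x)) < 1e-12
    # zeta(x) - zeta(-x) = x
    assert abs(softplus(x) - softplus(-x) - x) < 1e-12

# The inverse identities on their respective domains:
p = 0.7
assert abs(sigmoid(math.log(p / (1 - p))) - p) < 1e-12   # sigma(sigma^-1(p)) = p
y = 2.0
assert abs(softplus(math.log(math.exp(y) - 1.0)) - y) < 1e-12
```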
Bayes’ Rule
Bayes Rule
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) 𝑃(𝐴) / 𝑃(𝐵)
where 𝑃(𝐴|𝐵) is the posterior, 𝑃(𝐵|𝐴) the likelihood, 𝑃(𝐴) the prior, and 𝑃(𝐵) the marginal likelihood:
𝑃(𝐵) = ∑ₐ 𝑃(𝐵|𝐴 = 𝑎) 𝑃(𝐴 = 𝑎)
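The rule can be sketched with a toy diagnostic-test example (illustrative numbers, not from the slides): A is the disease status, B is a positive test result.

```python
prior = {"sick": 0.01, "healthy": 0.99}          # P(A)
likelihood = {"sick": 0.95, "healthy": 0.05}     # P(B = positive | A)

# Marginal likelihood: P(B) = sum_a P(B | A = a) P(A = a)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Posterior: P(A | B) = P(B | A) P(A) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}

assert abs(sum(posterior.values()) - 1.0) < 1e-12
print(posterior["sick"])  # about 0.161: a positive test is far from conclusive
```

Note how the small prior P(sick) = 0.01 dominates: even a fairly accurate test yields a modest posterior.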
Bag-of-words Naïve Bayes:
Information theory
• Learning that an unlikely event has occurred is more
informative than learning that a likely event has occurred!
• Which statement has more information?
q “The sun rose this morning”
q “There was a solar eclipse this morning”
Self-Information
Entropy
Entropy:
Entropy of a Bernoulli
Variable
𝐻(𝜙) = (𝜙 − 1) log(1 − 𝜙) − 𝜙 log(𝜙), where 𝜙 is the Bernoulli parameter
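The formula can be evaluated directly; it peaks at φ = 0.5 (maximal uncertainty, log 2 nats) and vanishes at the deterministic extremes (a minimal sketch of the formula above):

```python
import math

def bernoulli_entropy(phi):
    """H(phi) = (phi - 1) log(1 - phi) - phi log(phi), in nats."""
    if phi in (0.0, 1.0):
        return 0.0  # the limits 0 log 0 are taken as 0
    return (phi - 1) * math.log(1 - phi) - phi * math.log(phi)

# Maximal uncertainty at phi = 0.5: H = log 2 nats (i.e. 1 bit)
assert abs(bernoulli_entropy(0.5) - math.log(2)) < 1e-12
# Symmetric in phi <-> 1 - phi, and zero at the deterministic extremes
assert abs(bernoulli_entropy(0.2) - bernoulli_entropy(0.8)) < 1e-12
assert bernoulli_entropy(0.0) == 0.0 and bernoulli_entropy(1.0) == 0.0
```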
KL-divergence
• It can be used as a distance measure between
distributions
• But it is not a true distance measure since it is not
symmetric:
q 𝐷KL(𝑃 ‖ 𝑄) ≠ 𝐷KL(𝑄 ‖ 𝑃)
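The asymmetry is easy to exhibit for small discrete distributions (toy values chosen for illustration, using the standard definition D_KL(P‖Q) = Σₓ P(x) log(P(x)/Q(x))):

```python
import math

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.3, 0.3, 0.4]

assert kl(P, P) == 0.0                    # zero iff the distributions match
assert kl(P, Q) >= 0 and kl(Q, P) >= 0    # always non-negative
assert abs(kl(P, Q) - kl(Q, P)) > 1e-6    # not symmetric in general
```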
The KL Divergence is
Asymmetric
Mixture of two Gaussians for P, One Gaussian for Q
Directed Model
Figure 3.7
𝑝(x) = ∏ᵢ 𝑝(xᵢ | 𝑃𝑎𝒢(xᵢ))
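The factorization can be sketched for a tiny chain-structured model a → b → c, where each factor conditions only on the node's parents (a toy model with binary variables and illustrative numbers):

```python
# p(a, b, c) = p(a) p(b | a) p(c | b) for the directed chain a -> b -> c
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p_c_given_b[b][c]

def joint(a, b, c):
    """Each factor conditions only on the node's parents in the graph."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factorized joint is a valid distribution: it sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12
```

Storing the three small factors takes 2 + 4 + 4 numbers instead of 2³ for the full joint table; this saving is the point of the factorization.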