Probability and Information Theory
Lecture slides for Chapter 3 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-26
Probability Mass Functions

A probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:
• The domain of P must be the set of all possible states of x.
• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, an event that is guaranteed to happen has probability 1, and no state can have a greater probability than 1.
• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized.

Conditional probability:
P(y = y | x = x) = P(y = y, x = x) / P(x = x). (3.5)
Chain rule of conditional probabilities:
P(a, b, c) = P(a | b, c) P(b | c) P(c).
Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:
∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y). (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional distribution over x and y factorizes in this way for every value of z:
∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z). (3.8)
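To make these definitions concrete, here is a minimal NumPy sketch (added for this write-up, not part of the original slides) that starts from a made-up joint table P(x, y), recovers the marginals and the conditional of equation 3.5, and tests the independence condition of equation 3.7.

```python
import numpy as np

# Hypothetical joint distribution P(x, y) over two binary variables,
# stored as a table with rows indexed by x and columns by y.
P_xy = np.array([[0.10, 0.30],
                 [0.15, 0.45]])
assert np.isclose(P_xy.sum(), 1.0)       # normalized: a PMF must sum to 1

P_x = P_xy.sum(axis=1)                   # marginal P(x)
P_y = P_xy.sum(axis=0)                   # marginal P(y)

# Conditional probability, equation 3.5: P(y | x) = P(x, y) / P(x).
P_y_given_x = P_xy / P_x[:, None]

# Independence, equation 3.7: P(x, y) = P(x) P(y) for all x, y.
independent = np.allclose(P_xy, np.outer(P_x, P_y))
print(independent)                       # True for this particular table
```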
Expectation

The expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:
E_{x∼P}[f(x)] = Σ_x P(x) f(x), (3.9)
while for continuous variables, it is computed with an integral:
E_{x∼p}[f(x)] = ∫ p(x) f(x) dx. (3.10)

When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example,
E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)], (3.11)
when α and β are not dependent on x.
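As a concrete illustration (added here, not from the original slides), a minimal NumPy sketch that evaluates equation 3.9 for a small made-up discrete distribution and checks the linearity property of equation 3.11:

```python
import numpy as np

# Hypothetical discrete distribution P over states x in {0, 1, 2, 3}.
states = np.array([0.0, 1.0, 2.0, 3.0])
P = np.array([0.1, 0.2, 0.3, 0.4])          # sums to 1

def expectation(P, values):
    """E_{x~P}[f(x)] = sum_x P(x) f(x)   (equation 3.9)."""
    return np.sum(P * values)

f = states ** 2          # f(x) = x^2
g = np.cos(states)       # g(x) = cos(x)
alpha, beta = 2.0, -0.5  # constants that do not depend on x

# Linearity of expectation (equation 3.11): the two sides agree.
lhs = expectation(P, alpha * f + beta * g)
rhs = alpha * expectation(P, f) + beta * expectation(P, g)
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```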
Variance and Covariance

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:
Var(f(x)) = E[(f(x) − E[f(x)])^2]. (3.12)
When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])]. (3.13)
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. For example, suppose we sample x from a uniform distribution over [−1, 1] and choose s to be 1 or −1, each with probability 1/2. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.

Covariance matrix:
The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that
Cov(x)_{i,j} = Cov(x_i, x_j). (3.14)
The diagonal elements of the covariance give the variance:
Cov(x_i, x_i) = Var(x_i). (3.15)
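The sketch below (an addition, using the uniform-and-sign construction described above) estimates Cov(x, y) from samples and shows that it is near zero even though y is completely dependent on x:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

x = rng.uniform(-1.0, 1.0, size=m)           # x ~ Uniform(-1, 1)
s = rng.choice([-1.0, 1.0], size=m)          # s is +1 or -1 with probability 1/2
y = s * x                                    # y is fully determined by x and s

# Sample covariance, matching equation 3.13 with f and g the identity:
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)            # close to 0, even though x and y are not independent

# Covariance matrix of the random vector (x, y), equation 3.14:
print(np.cov(np.stack([x, y])))   # diagonal entries are the variances (eq. 3.15)
```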
Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:
P(x = 1) = φ (3.16)
P(x = 0) = 1 − φ (3.17)
P(x = x) = φ^x (1 − φ)^(1−x) (3.18)
E_x[x] = φ (3.19)
Var_x(x) = φ(1 − φ) (3.20)
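A quick numerical check of equations 3.18 to 3.20 (an added sketch; the value of φ is arbitrary):

```python
import numpy as np

phi = 0.3                                    # arbitrary Bernoulli parameter in [0, 1]
rng = np.random.default_rng(0)
x = (rng.random(1_000_000) < phi).astype(float)   # samples with P(x = 1) = phi

print(x.mean())          # ~ phi             (equation 3.19)
print(x.var())           # ~ phi * (1 - phi) (equation 3.20)

def bernoulli_pmf(x, phi):
    """P(x) = phi**x * (1 - phi)**(1 - x) for x in {0, 1}   (equation 3.18)."""
    return phi ** x * (1 - phi) ** (1 - x)

print(bernoulli_pmf(1, phi), bernoulli_pmf(0, phi))   # phi and 1 - phi
```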
Gaussian Distribution

[Figure 3.1 plot: the standard normal density, annotated with its maximum at x = µ and the points x = µ ± σ.]

Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic “bell curve” shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
The normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to section 19.4.2.

Multivariate Gaussian

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ. The parameter µ still gives the mean of the distribution, though now it is vector valued. The parameter Σ gives the covariance matrix of the distribution.

Parametrized by covariance matrix:
N(x; µ, Σ) = √(1 / ((2π)^n det(Σ))) exp(−(1/2) (x − µ)^⊤ Σ^{-1} (x − µ)). (3.23)

As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β.

Parametrized by precision matrix:
N(x; µ, β^{-1}) = √(det(β) / (2π)^n) exp(−(1/2) (x − µ)^⊤ β (x − µ)). (3.24)
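The following sketch (added, not from the slides) evaluates the multivariate normal density both ways, per equations 3.23 and 3.24; the example µ and Σ are made up, and the precision matrix is computed once so that repeated evaluations avoid re-inverting Σ:

```python
import numpy as np

def mvn_pdf_cov(x, mu, Sigma):
    """Equation 3.23: density parametrized by the covariance matrix Sigma."""
    n = mu.shape[0]
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def mvn_pdf_prec(x, mu, beta):
    """Equation 3.24: density parametrized by the precision matrix beta = Sigma^{-1}."""
    n = mu.shape[0]
    diff = x - mu
    norm = np.sqrt(np.linalg.det(beta) / (2 * np.pi) ** n)
    return norm * np.exp(-0.5 * diff @ beta @ diff)

# Arbitrary example parameters (assumptions for the sketch).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
beta = np.linalg.inv(Sigma)        # invert once, reuse for many evaluations

x = np.array([0.3, 0.7])
print(mvn_pdf_cov(x, mu, Sigma), mvn_pdf_prec(x, mu, beta))   # equal values
```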
More Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution, which uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

Exponential:
p(x; λ) = λ 1_{x≥0} exp(−λx). (3.25)

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution.

Laplace:
Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ| / γ). (3.26)

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x), which is defined such that it is zero valued everywhere except 0. We can think of δ as the limit of a series of functions that put less and less mass on all points other than zero. By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.

Dirac:
p(x) = δ(x − µ). (3.27)

Empirical Distribution

A common use of the Dirac delta distribution is as a component of an empirical distribution,
p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^{(i)}), (3.28)
which puts probability mass 1/m on each of the m points x^{(1)}, ..., x^{(m)} forming a given collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.
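A short sketch (added here) of the exponential and Laplace densities from equations 3.25 and 3.26, with the indicator function written out explicitly; the parameter values are arbitrary:

```python
import numpy as np

def exponential_pdf(x, lam):
    """p(x; lambda) = lambda * 1_{x >= 0} * exp(-lambda * x)   (equation 3.25)."""
    indicator = (x >= 0).astype(float)           # 1_{x >= 0}
    return lam * indicator * np.exp(-lam * x)

def laplace_pdf(x, mu, gamma):
    """Laplace(x; mu, gamma) = exp(-|x - mu| / gamma) / (2 * gamma)   (equation 3.26)."""
    return np.exp(-np.abs(x - mu) / gamma) / (2 * gamma)

xs = np.linspace(-3.0, 3.0, 7)
print(exponential_pdf(xs, lam=1.5))          # zero for negative x, sharp point at x = 0
print(laplace_pdf(xs, mu=0.0, gamma=1.0))    # sharp peak at x = mu
```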
Mixture Distributions

A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution should generate the sample is determined by sampling a component identity from a multinoulli distribution:
P(x) = Σ_i P(c = i) P(x | c = i), (3.29)
where P(c) is the multinoulli distribution over component identities.
Example: a Gaussian mixture with three components.
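Sampling from a mixture follows equation 3.29 directly: draw a component identity c from the multinoulli, then draw x from that component. A minimal sketch with three made-up one-dimensional Gaussian components (an added illustration, not the figure's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-component Gaussian mixture (1-D for simplicity).
weights = np.array([0.3, 0.5, 0.2])      # multinoulli P(c = i)
means = np.array([-4.0, 0.0, 3.0])
stds = np.array([1.0, 0.5, 2.0])

m = 10_000
c = rng.choice(3, size=m, p=weights)      # sample component identities
x = rng.normal(means[c], stds[c])         # sample x from the chosen component

print(np.bincount(c) / m)                 # ~ weights
print(x.mean())                           # ~ sum_i weights[i] * means[i]
```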
Figure 3.3: The logistic sigmoid function, σ(x).

Figure 3.4: The softplus function, ζ(x).
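For reference, σ(x) = 1/(1 + exp(−x)) and ζ(x) = log(1 + exp(x)). A small, numerically careful sketch of both functions (added here, not from the slides):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, computed stably for large |x|."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    expx = np.exp(x[~pos])
    out[~pos] = expx / (1.0 + expx)
    return out

def softplus(x):
    """Softplus zeta(x) = log(1 + exp(x)), via the stable identity max(x, 0) + log1p(exp(-|x|))."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

xs = np.linspace(-10.0, 10.0, 5)
print(sigmoid(xs))    # values in (0, 1), as in figure 3.3
print(softplus(xs))   # smooth approximation to max(0, x), as in figure 3.4
```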
le Bayes’ Rule
n a situation where we know P (y | x) and need to know
we also know P (x), we can compute the desired quantity
P (x)P (y | x)
P (x | y) = . (3.42)
P (y)
ppears in the formula, it is usually feasible to compute
, so we do not need to begin with knowledge of P (y).
70
(Goodfellow 2016)
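A tiny discrete sketch of equation 3.42 (added; the numbers are invented for illustration), computing P(x | y) from P(x) and P(y | x) and obtaining P(y) by summing over x:

```python
import numpy as np

P_x = np.array([0.6, 0.4])                 # prior P(x) over two states of x
P_y_given_x = np.array([[0.9, 0.1],        # row i: P(y | x = i)
                        [0.2, 0.8]])

P_y = P_y_given_x.T @ P_x                  # P(y) = sum_x P(y | x) P(x)
P_x_given_y = (P_y_given_x * P_x[:, None]) / P_y   # Bayes' rule, equation 3.42

print(P_x_given_y)
print(P_x_given_y.sum(axis=0))             # each column sums to 1
```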
Change of Variables

p_x(x) = p_y(g(x)) |∂g(x)/∂x|. (3.46)
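A quick numerical check of equation 3.46 (an added sketch): take y uniform on [0, 2] and the invertible map y = g(x) = 2x; the formula then assigns x a uniform density of 1 on [0, 1], which a histogram of mapped samples confirms.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """y = g(x) = 2x."""
    return 2.0 * x

dg_dx = 2.0                                # derivative of g, constant here

def p_y(y):
    """Density of y: uniform on [0, 2]."""
    return np.where((y >= 0.0) & (y <= 2.0), 0.5, 0.0)

# Equation 3.46: p_x(x) = p_y(g(x)) |dg/dx|  ->  density 1 on [0, 1].
xs = np.array([0.1, 0.5, 0.9])
print(p_y(g(xs)) * abs(dg_dx))             # [1. 1. 1.]

# Monte Carlo check: sample y uniformly, map back x = y / 2, estimate its density.
y = rng.uniform(0.0, 2.0, size=100_000)
x = y / 2.0
hist, _ = np.histogram(x, bins=10, range=(0.0, 1.0), density=True)
print(hist)                                # every bin close to 1
```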
Information Theory

To quantify information, we would like unlikely events to be more informative than likely ones, and independent events to have additive information: finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that it has come up as heads once. To satisfy these properties, we define the self-information of an event x = x.

Self-information:
I(x) = −log P(x). (3.48)

We use log to mean the natural logarithm, with base e, so I(x) is written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats. When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy.

Entropy:
H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)], (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See figure 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence.

KL divergence:
D_KL(P ‖ Q) = E_{x∼P}[log (P(x) / Q(x))] = E_{x∼P}[log P(x) − log Q(x)]. (3.50)
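A compact sketch of equations 3.49 and 3.50 for discrete distributions, in nats (added here, not from the slides); it also shows numerically that the KL divergence is not symmetric:

```python
import numpy as np

def entropy(P):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats   (equation 3.49)."""
    P = P[P > 0]                      # treat 0 log 0 as 0
    return -np.sum(P * np.log(P))

def kl_divergence(P, Q):
    """D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]   (equation 3.50)."""
    mask = P > 0
    return np.sum(P[mask] * (np.log(P[mask]) - np.log(Q[mask])))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked = np.array([0.97, 0.01, 0.01, 0.01])

print(entropy(uniform))                 # high entropy: log 4 ~ 1.386 nats
print(entropy(peaked))                  # low entropy: nearly deterministic
print(kl_divergence(peaked, uniform), kl_divergence(uniform, peaked))  # not equal
```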
[Figure 3.5 plot: Shannon entropy, in nats, of a Bernoulli distribution as a function of its parameter p.]

Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy.
The KL Divergence is Asymmetric

Left: q* = argmin_q D_KL(p ‖ q). Right: q* = argmin_q D_KL(q ‖ p).

[Figure 3.6 plots: probability density of p(x) and of the optimal approximation q*(x) under each direction of the KL divergence.]

Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x).
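To reproduce the qualitative behavior in figure 3.6, here is a rough grid-search sketch (an added illustration; the bimodal p and all parameter ranges are made up): when a single Gaussian q approximates a bimodal p, minimizing D_KL(p ‖ q) prefers a broad q that covers both modes, while minimizing D_KL(q ‖ p) prefers a q concentrated on one mode.

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target p(x): a two-component mixture (a stand-in for the p in figure 3.6).
p = 0.5 * gauss(xs, -3.0, 1.0) + 0.5 * gauss(xs, 3.0, 1.0)
p /= p.sum() * dx

def kl(a, b):
    """Discretized D_KL(a || b) ~= sum_x a(x) (log a(x) - log b(x)) dx."""
    mask = a > 0
    b_safe = np.maximum(b[mask], 1e-300)   # avoid log(0) where b underflows
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(b_safe)) * dx)

best = {"p||q": (np.inf, None), "q||p": (np.inf, None)}
for mu in np.linspace(-4.0, 4.0, 81):
    for sigma in np.linspace(0.5, 5.0, 46):
        q = gauss(xs, mu, sigma)
        q /= q.sum() * dx
        for name, d in (("p||q", kl(p, q)), ("q||p", kl(q, p))):
            if d < best[name][0]:
                best[name] = (d, (mu, sigma))

print(best["p||q"])   # best q spreads across both modes (mu near 0, large sigma)
print(best["q||p"])   # best q locks onto a single mode (mu near +/-3, sigma near 1)
```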
Directed Model

Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as
p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c). (3.54)
This graph allows us to quickly see some properties of the distribution.

Undirected Model

Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as
p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e). (3.56)
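As an added sketch of how equation 3.56 behaves, we can enumerate all states of five binary variables, multiply made-up clique potentials φ^(1), φ^(2), φ^(3), and compute the partition function Z by brute force:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Made-up positive clique potentials for binary a, b, c, d, e (indexed by state tuples).
phi1 = rng.uniform(0.5, 2.0, size=(2, 2, 2))    # phi^(1)(a, b, c)
phi2 = rng.uniform(0.5, 2.0, size=(2, 2))       # phi^(2)(b, d)
phi3 = rng.uniform(0.5, 2.0, size=(2, 2))       # phi^(3)(c, e)

def unnormalized(a, b, c, d, e):
    return phi1[a, b, c] * phi2[b, d] * phi3[c, e]

# Partition function Z: sum of the unnormalized measure over all 2^5 states.
Z = sum(unnormalized(*state) for state in itertools.product([0, 1], repeat=5))

def p(a, b, c, d, e):
    """Equation 3.56: p(a, b, c, d, e) = phi1(a,b,c) * phi2(b,d) * phi3(c,e) / Z."""
    return unnormalized(a, b, c, d, e) / Z

total = sum(p(*state) for state in itertools.product([0, 1], repeat=5))
print(Z, total)     # total is 1.0: the distribution is normalized
```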