Ch-1 Probability Distributions
• Example: for N = 20 trials with success probability p = 0.5,
P(X = 6) = (20 choose 6) · 0.5^6 · 0.5^14 ≈ 0.0369644
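The example above can be checked with a short computation of the binomial pmf; this is a minimal sketch using only the standard library (the function name is ours, not from the slides).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): (n choose k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Reproduce the slide's example: N = 20, p = 0.5
print(binomial_pmf(6, 20, 0.5))  # ≈ 0.0369644
```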
• (The gamma function extends the factorial to real numbers, i.e., Γ(n) =
(n − 1)!.) The mean and variance of Beta(μ|a, b) are given by
E[μ] = a / (a + b),  var[μ] = ab / ((a + b)² (a + b + 1))
with l = N − m (the number of observations of x = 0).
• Simple interpretation of hyperparameters a and b as effective
numbers of observations of x = 1 and x = 0 (a priori)
• As we observe new data, a and b are updated
• As N → ∞, the variance (uncertainty) decreases and the mean
converges to the ML estimate
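The conjugate Beta–Bernoulli update described by these bullets can be sketched as follows; the prior values and the data are hypothetical, chosen only for illustration.

```python
def beta_update(a, b, data):
    """Conjugate update: each observation x=1 increments a, each x=0 increments b."""
    ones = sum(data)
    return a + ones, b + len(data) - ones

def beta_mean_var(a, b):
    """Mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)) of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

a, b = 2, 2          # prior: 2 effective observations of each outcome (assumed)
data = [1, 1, 0, 1]  # hypothetical coin flips
a, b = beta_update(a, b, data)
print(a, b, beta_mean_var(a, b))
```

As more data arrive, a + b grows, so the variance shrinks and the mean approaches the ML estimate m/N.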
Figure 2.5 Plots of the Dirichlet distribution over three variables, where the two horizontal axes
are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the
density. Here {αk} = 0.1 on the left plot, {αk} = 1 in the centre plot, and {αk} = 10 in the right plot.
Machine Learning Fantahun B. (PhD) @ AAU-AAiT 20
Multinomial Variables: Dirichlet Distribution
• Some plots of a Dirichlet distribution over 3 variables:
Dirichlet distributions with parameters (clockwise from top left): α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
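Samples like those plotted above can be drawn with the standard construction of normalizing independent Gamma draws; this is a sketch, not code from the slides.

```python
import random

def sample_dirichlet(alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

random.seed(0)
sample = sample_dirichlet([6, 2, 2])
print(sample)  # a point on the simplex: non-negative components summing to 1
```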
Multinomial Variables: Bayesian Way 3/3
• Multiplying the prior (2.38) by the likelihood function (2.34)
yields the posterior:
p(μ|D, α) ∝ p(D|μ) p(μ|α) ∝ ∏k μk^(αk + mk − 1)
i.e., the posterior is again a Dirichlet, Dir(μ|α + m).
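In code, this conjugate update is just adding the observed category counts m to the prior parameters α; the prior and counts below are hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dir(alpha + m) after observing counts m_k per category."""
    return [a + m for a, m in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]   # symmetric prior (assumed)
counts = [5, 2, 3]        # hypothetical observed category counts
post = dirichlet_posterior(alpha, counts)
mean = [a / sum(post) for a in post]   # posterior mean: alpha_k / sum(alpha)
print(post, mean)
```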
Figure 2.6 Histogram plots of the mean of N uniformly distributed numbers for various values
of N. We observe that as N increases, the distribution tends towards a Gaussian.
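The tendency shown in Figure 2.6 can be reproduced numerically: the sample mean of N uniforms concentrates around 0.5 with variance 1/(12N). This is a minimal stdlib sketch; the trial count and seed are arbitrary choices of ours.

```python
import random
import statistics

def mean_of_uniforms(n, trials=2000, seed=1):
    """Return `trials` realizations of the mean of n Uniform(0,1) draws."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]

# As N grows, the empirical variance shrinks toward 1/(12N)
for n in (1, 10):
    samples = mean_of_uniforms(n)
    print(n, statistics.fmean(samples), statistics.variance(samples))
```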
The Gaussian distribution: Properties
• The density is a function of the Mahalanobis distance Δ from x to μ:
Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ)
where Σ is the covariance matrix; Δ reduces to the Euclidean distance when Σ = I.
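The quadratic form above can be computed directly; here is a sketch for the 2-D case, inverting the 2×2 covariance by hand (the function name is ours).

```python
def mahalanobis2_2d(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) in 2-D."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    # inverse of a 2x2 matrix, then the quadratic form
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance
print(mahalanobis2_2d([3, 4], [0, 0], [[1, 0], [0, 1]]))  # 25.0
```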
• Unlike the Gaussian, Student's t-distribution is robust to outliers:
Figure 2.16 Illustration of the robustness of Student’s t-
distribution compared to a Gaussian.
(a) Histogram distribution of 30 data points drawn
from a Gaussian distribution, together with the
maximum likelihood fit obtained from a t-distribution
(red curve) and a Gaussian (green curve, largely
hidden by the red curve). Because the t-distribution
contains the Gaussian as a special case it gives
almost the same solution as the Gaussian.
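The robustness in Figure 2.16 comes from the t-distribution's heavy tails: it assigns far more density to points several standard deviations out, so a few outliers pull its ML fit much less. A sketch comparing the two densities at an outlying point (formulas from the standard definitions; parameter values are ours):

```python
from math import gamma, pi, sqrt, exp

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def student_t_pdf(x, nu, mu=0.0, lam=1.0):
    """Student's t density with nu degrees of freedom, mean mu, precision lam."""
    c = gamma((nu + 1) / 2) / gamma(nu / 2) * sqrt(lam / (pi * nu))
    return c * (1 + lam * (x - mu) ** 2 / nu) ** (-(nu + 1) / 2)

# An outlier at x = 5 is orders of magnitude more plausible under the heavy-tailed t
print(gauss_pdf(5.0), student_t_pdf(5.0, nu=3))
```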
• The von Mises (circular normal) distribution is
p(θ|θ₀, m) = exp{ m cos(θ − θ₀) } / (2π I₀(m))
where
m is the concentration (precision) parameter,
θ₀ is the mean,
and I₀(m) is the zeroth-order modified Bessel function of the first kind.
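The density above can be evaluated with only the standard library by summing the power series of I₀; this is a sketch, and the truncation length is our choice (adequate for moderate m).

```python
from math import cos, exp, pi

def bessel_i0(x, terms=30):
    """Modified Bessel function I_0(x) via its power series sum_k ((x/2)^2k / (k!)^2)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2) ** 2 / k ** 2
        total += term
    return total

def von_mises_pdf(theta, theta0, m):
    """p(theta | theta0, m) = exp(m cos(theta - theta0)) / (2 pi I0(m))."""
    return exp(m * cos(theta - theta0)) / (2 * pi * bessel_i0(m))

# The density peaks at theta = theta0 and is smallest at the antipode
print(von_mises_pdf(0.0, 0.0, m=2.0), von_mises_pdf(pi, 0.0, m=2.0))
```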
Mixtures (of Gaussians) (1/3)
• Data with distinct regimes are better modeled with mixtures
• A linear superposition of K Gaussians gives (2.230):
p(x) = Σk πk N(x | μk, Σk),  with mixing coefficients πk ≥ 0 and Σk πk = 1
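The mixture density and its generative view (pick a component with probability πk, then draw from that Gaussian) can be sketched in one dimension; the weights, means, and standard deviations below are hypothetical.

```python
import random
from math import exp, sqrt, pi

def gauss_pdf(x, mu, sigma):
    """N(x | mu, sigma^2)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def mixture_pdf(x, weights, mus, sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2); weights must sum to 1."""
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(weights, mus, sigmas, rng):
    """Ancestral sampling: pick a component k ~ pi, then draw from N(mu_k, sigma_k^2)."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(mus[k], sigmas[k])

rng = random.Random(0)
weights, mus, sigmas = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]  # two assumed regimes
xs = [sample_mixture(weights, mus, sigmas, rng) for _ in range(5)]
print(xs, mixture_pdf(0.0, weights, mus, sigmas))
```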