Notation
Kevin P. Murphy
Last updated October 25, 2006
1 General
• x∗ ∈ arg max_x f(x) means x∗ is the value of x that maximizes the function f, i.e., f(x∗) = max_x f(x). Note
that there may be multiple global maxima, in which case we break ties randomly.
• Indicator function: I(e) = 1 if event e is true, I(e) = 0 otherwise.
• Delta function: δ(x) = 1 if x = 0 and δ(x) = 0 otherwise.
• Sometimes probability mass functions (for discrete random variables) are written P (X), and probability density
functions (for continuous random variables) are written p(X). We will use p() for both.
• Usually we write random variables as capital letters and values of random variables as lower case, e.g., p(X = x)
is the probability X has value x. However, we do not follow this convention very closely.
• If X is distributed according to distribution f with parameters θ, i.e., p(X) = f (X|θ), then we write X ∼ f (θ).
• We will often write probability distributions up to a constant of proportionality, p(x|θ) ∝ f(x, θ). This
normalization constant is often denoted 1/Z(θ), where Z is called the partition function.
• Vectors are usually column vectors. T denotes transpose, so xT is a row vector. Sometimes we will write vectors
in bold, e.g., x, or as ~x, but usually we will just write x. Matrices will usually be written as capital letters, X.
However, with this convention we cannot distinguish matrices from scalar (or vector) random variables. It
should be clear from context.
• We use the following Matlab notation: 1 : n denotes the sequence of integers {1, 2, . . . , n}, and X(i, j, k) is
element i, j, k of some array, where i, j, k could each be a sequence of indices.
• If Xni is a scalar, then Xni ∈ IR or Xni ∈ IR+. If it is a vector, then Xni ∈ IR^K, where Xnik is the k'th
component of the i'th variable for k = 1 : K, and K is the dimensionality of each variable. (In general, K
may depend on i, but we rarely write Ki.)
• If Xni is binary, then Xni ∈ {0, 1}. If Xni is categorical, then Xni ∈ {1, . . . , K}, where K is the number
of states of variable i. (In general, K may depend on i, but we rarely write Ki.) We write Xni = k if the i'th
variable is in state k, where k ∈ 1 : K. Sometimes you will see a 1-of-K encoding, where Xni ∈ {0, 1}^K
and Xnik = I(Xni = k). We also use j to index states, mostly of variables that are “parents” of Xi.
Put another way, if X has a multinomial (categorical) distribution with parameter vector θ, then p(X = j|θ) = θj.
We denote the sufficient statistics for a multinomial distribution by Nj = Σn I(Xn = j) (see the sketch at the
end of this section).
• We define the Beta distribution θ ∼ Beta(α1 , α0 ) for θ ∈ [0, 1] by
Beta(θ|α1, α0) = [Γ(α1 + α0) / (Γ(α1)Γ(α0))] θ^{α1−1} (1 − θ)^{α0−1}    (3)
where Γ(x) is the gamma function. Here α0, α1 ∈ IR+ are called hyperparameters (pseudo counts) and
α = α0 + α1 is the equivalent sample size (strength) of the prior.
• We define the Dirichlet distribution θ ∼ Dir(α1, . . . , αK) for θ ∈ [0, 1]^K (with Σj θj = 1) by
Dir(θ|α1, . . . , αK) ∝ Π_{j=1}^{K} θj^{αj−1}    (4)
Here αj ∈ IR+ are called hyperparameters (pseudo counts) and α = Σj αj is the equivalent sample size.
• We define the likelihood as
L(θ) = p(D|θ) (5)
and the log-likelihood as
ℓ(θ) = log p(D|θ)    (6)
• We denote the maximum likelihood estimate by
θ̂ML ∈ arg max_θ p(D|θ)    (7)
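To make the counting notation concrete, here is a minimal sketch (assuming NumPy; the data vector x, the number of
states K, and the pseudo counts alpha are invented for illustration and are not part of the original notes). It computes
the sufficient statistics Nj, the maximum likelihood estimate of the multinomial parameters, and the posterior mean
under a Dirichlet prior, which simply adds the pseudo counts αj to the observed counts Nj.

    import numpy as np

    # Made-up data: N draws of a categorical variable X in 1:K, with K = 3.
    x = np.array([1, 3, 2, 3, 3, 1, 2, 3])
    K = 3
    N = len(x)

    # Sufficient statistics N_j = sum_n I(X_n = j).
    Nj = np.array([np.sum(x == j) for j in range(1, K + 1)])

    # Maximum likelihood estimate (the maximizer of p(D|theta), cf. eq. 7): theta_j = N_j / N.
    theta_ml = Nj / N

    # With a Dirichlet(alpha_1, ..., alpha_K) prior, the pseudo counts alpha_j are simply
    # added to the observed counts; the posterior mean is (N_j + alpha_j) / (N + alpha),
    # where alpha = sum_j alpha_j is the equivalent sample size.
    alpha = np.array([1.0, 1.0, 1.0])      # hypothetical hyperparameters
    theta_post = (Nj + alpha) / (N + alpha.sum())

    print(Nj, theta_ml, theta_post)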
4 Naive Bayes classifier
• The 1d Gaussian density is denoted N (x|µ, σ).
• In a generative classifier, the class prior is usually denoted p(Y = c) = πc, if we assume Y has a multinomial
distribution.
• In the naive Bayes model, we have
p(x|y = c) = Π_{i=1}^{D} p(xi|y = c)    (10)
where θick = P (Xi = k|Y = c). The sufficient statistics are Nick , which is the number of times Xi = k
amongst those training cases where Y = c. In the case of binary features, we have
p(xi|y = c) = θic^{I(Xi=1)} (1 − θic)^{I(Xi=0)}    (12)
where θic = P (Xi = 1|Y = c). The sufficient statistics are Nic1, the number of times Xi = 1 amongst cases
where Y = c, and Nic = Nc, the number of times Xi = 0 or Xi = 1 in cases where Y = c (see the sketch at the
end of this section).
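As referenced above, a minimal sketch of how the naive Bayes sufficient statistics could be accumulated for binary
features (assuming NumPy; the toy arrays X and y and all variable names are invented for illustration):

    import numpy as np

    # Made-up training data: N cases, D binary features, class labels y in 1:C.
    X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
    y = np.array([1, 1, 2, 2])
    C = 2
    N, D = X.shape

    # N_c   = number of training cases with Y = c
    # N_ic1 = number of cases with X_i = 1 amongst those with Y = c
    Nc = np.array([np.sum(y == c) for c in range(1, C + 1)])
    Nic1 = np.array([[np.sum((X[:, i] == 1) & (y == c)) for c in range(1, C + 1)]
                     for i in range(D)])

    # ML estimates: class prior pi_c = N_c / N and theta_ic = N_ic1 / N_c (cf. eq. 12).
    pi_ml = Nc / N
    theta_ml = Nic1 / Nc      # shape (D, C); each column c is divided by the matching N_c

    print(pi_ml)
    print(theta_ml)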
5 Markov chains
• The transition matrix is T^i_{jk} = p(Xi = k|Xi−1 = j), which is independent of i if the chain is stationary.
The sufficient statistics to estimate this are the observed number of j→k transitions,
Njk = Σ_{n=1}^{N} Σ_{i=2}^{D} I(Xni = k, Xn,i−1 = j)
(see the sketch at the end of this section). There is no i index on Njk since we assume the parameters are
shared (tied) across time.
• The initial state distribution is π^1_k = p(X1 = k).
• The stationary distribution is π which satisfies πT = π (if we treat π as a row vector).
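As referenced above, a minimal sketch of the transition-count sufficient statistics and the stationary distribution
(assuming NumPy; the toy sequences are invented for illustration, and every state is assumed to appear at least once
so that each row of counts can be normalized):

    import numpy as np

    # Made-up batch of N observed state sequences, states coded 1:K.
    seqs = [[1, 2, 2, 3, 1],
            [2, 3, 3, 1, 2]]
    K = 3

    # Sufficient statistics N_jk = number of observed j -> k transitions,
    # pooled over sequences and over time (parameters tied across time).
    Njk = np.zeros((K, K))
    for x in seqs:
        for t in range(1, len(x)):
            Njk[x[t - 1] - 1, x[t] - 1] += 1

    # ML estimate of the transition matrix: normalize each row of the count matrix.
    T_hat = Njk / Njk.sum(axis=1, keepdims=True)

    # Stationary distribution: the left eigenvector of T_hat with eigenvalue 1,
    # i.e., the row vector pi satisfying pi T = pi, rescaled to sum to 1.
    evals, evecs = np.linalg.eig(T_hat.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()

    print(T_hat)
    print(pi)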
6 Information theory
• The entropy of a random variable X ∈ 1 : K with discrete distribution p is denoted by
H(p) = H(X) = −Σ_{k=1}^{K} p(X = k) log2 p(X = k) = −Σ_k pk log pk    (13)
The joint entropy is denoted H(X, Y) and the conditional entropy H(X|Y). The mutual information is
denoted I(X, Y) (often written as I(X; Y)). The Kullback-Leibler divergence between two distributions is
denoted KL(p||q). (See the sketch below.)
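A minimal sketch of these quantities for a small discrete distribution, as referenced above (assuming NumPy; the joint
distribution pxy and the pmfs p and q are invented for illustration; base-2 logs are used throughout, so everything is
in bits):

    import numpy as np

    # Made-up joint distribution p(X, Y) over X in 1:2 and Y in 1:3 (entries sum to 1).
    pxy = np.array([[0.2, 0.1, 0.1],
                    [0.1, 0.2, 0.3]])
    px = pxy.sum(axis=1)       # marginal p(X)
    py = pxy.sum(axis=0)       # marginal p(Y)

    def H(p):
        """Entropy in bits, -sum_k p_k log2 p_k, with 0 log 0 taken to be 0."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    Hx, Hy, Hxy = H(px), H(py), H(pxy.ravel())
    Hx_given_y = Hxy - Hy            # conditional entropy H(X|Y) = H(X,Y) - H(Y)
    Ixy = Hx + Hy - Hxy              # mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)

    # KL divergence KL(p||q) = sum_k p_k log2(p_k / q_k) between two made-up pmfs.
    p = np.array([0.5, 0.25, 0.25])
    q = np.array([1 / 3, 1 / 3, 1 / 3])
    kl = np.sum(p * np.log2(p / q))

    print(Hx, Hy, Hxy, Hx_given_y, Ixy, kl)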