HW1 Solutions

Homework 1

Q1. Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior β ∼ N(0, τ²I) and Gaussian sampling model y ∼ N(Xβ, σ²I). Find the relationship between the regularization parameter λ in the ridge formula and the variances τ² and σ².

S1. We assume that our input data is centered, which allows us to ignore the intercept term β₀. The posterior distribution is given by

$$\Pr(\beta \mid y, X) = \frac{1}{K}\,\Pr(y \mid \beta, X)\,\Pr(\beta),$$

where K = K(y, X) = ∫ Pr(y | β, X) Pr(β) dβ is clearly independent of β. Using our assumptions, we see that

$$\Pr(\beta \mid y, X) = \frac{1}{Z}\,\frac{1}{(2\pi)^{N/2}\sigma^{N}}\,\exp\!\left(-\frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}}\right)\frac{1}{(2\pi)^{p/2}\tau^{p}}\,\exp\!\left(-\frac{\beta^{T}\beta}{2\tau^{2}}\right). \tag{1}$$

Then,

$$\log\left(\Pr(\beta \mid y, X)\right) = -C - \frac{(y - X\beta)^{T}(y - X\beta)}{2\sigma^{2}} - \frac{\beta^{T}\beta}{2\tau^{2}},$$

where C collects the terms without β dependence. It is then not difficult to see that this expression is maximized for

$$\hat\beta = \left(X^{T}X + \frac{\sigma^{2}}{\tau^{2}}\,I\right)^{-1} X^{T}y.$$

Letting λ = σ²/τ², we see the equivalence of the above approach and ridge regression.

It is clear that Pr(β|y, X) is Gaussian and its mean and mode coincide. We will now show that its mean
m = β̂. To this end, note that (1) implies that its covariance Σ satisfies

$$\Sigma^{-1} = \frac{1}{\sigma^{2}}\left(X^{T}X + \frac{\sigma^{2}}{\tau^{2}}\,I\right).$$

This gives β̂ = (1/σ²) Σ Xᵀy; matching the term linear in β in the exponent of (1) shows that the mean m satisfies Σ⁻¹m = Xᵀy/σ², i.e. m = β̂.
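
As a quick numerical check of this equivalence, here is a minimal sketch (assuming NumPy; the values of σ², τ² and the simulated data are arbitrary illustrative choices) comparing the ridge estimate with λ = σ²/τ² to the posterior mean ΣXᵀy/σ²:

import numpy as np

# Minimal sketch: ridge with lambda = sigma^2 / tau^2 equals the posterior mean.
rng = np.random.default_rng(0)
N, p = 50, 4
sigma2, tau2 = 0.5, 2.0                        # noise and prior variances (assumed)
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                            # centered inputs, as in S1
beta = rng.normal(scale=np.sqrt(tau2), size=p)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)   # posterior covariance
beta_post_mean = Sigma @ (X.T @ y) / sigma2                  # posterior mean = mode

print(np.allclose(beta_ridge, beta_post_mean))               # expected: True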
Q2. Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix X with p additional rows √λ I and augment y with p zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero.

S2. Denote by X̃ and ỹ the augmented data sets, i.e.,

$$\tilde X = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_{p\times p} \end{pmatrix}, \qquad \tilde y = \begin{pmatrix} y \\ 0_{p\times 1} \end{pmatrix}.$$

By (3.6) in [HTF09], an ordinary least squares regression yields the estimate

$$\hat\beta = \left(\tilde X^{T}\tilde X\right)^{-1}\tilde X^{T}\tilde y.$$

Using the definition of X̃ and ỹ, it is not difficult to see that

$$\tilde X^{T}\tilde X = X^{T}X + \lambda I \qquad\text{and}\qquad \tilde X^{T}\tilde y = X^{T}y.$$

So β̂ = (XᵀX + λI)⁻¹ Xᵀy, which is exactly the ridge estimate.
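
Again as a minimal numerical sketch (assuming NumPy; λ and the simulated data are arbitrary), ordinary least squares on the augmented data reproduces the ridge closed form:

import numpy as np

# Sketch: OLS on the sqrt(lambda)-augmented data equals ridge regression.
rng = np.random.default_rng(1)
N, p, lam = 40, 3, 0.7                         # arbitrary illustrative values
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                            # centered X
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Augmented data: p extra rows sqrt(lam) * I and p extra zero responses.
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_tilde = np.concatenate([y, np.zeros(p)])

beta_ols_aug = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]   # OLS on augmented set
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # ridge closed form

print(np.allclose(beta_ols_aug, beta_ridge))                      # expected: True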

Q3. Consider a mixture model density in p-dimensional feature space,

$$g(x) = \sum_{k=1}^{K} \pi_k\, g_k(x),$$

where g_k = N(μ_k, σ²I) and π_k ≥ 0 for all k with Σ_k π_k = 1. Here {μ_k, π_k}, k = 1, ..., K, and σ² are unknown parameters. Suppose we have data x₁, ..., x_N ∼ g(x) and we wish to fit the mixture model.
a. Write down the log-likelihood of the data.
b. Derive an EM algorithm for computing the maximum likelihood estimates.
c. Show that if σ has a known value in the mixture model and we take σ → 0, then in a sense,
this EM algorithm coincides with K-means clustering.
S3.a) The log-likelihood function for {x_i}, i = 1, ..., N, is given by

$$l(\theta, Z) = \log\left(\prod_{i=1}^{N} g(x_i)\right) = \log\left(\prod_{i=1}^{N}\left(\sum_{k=1}^{K}\pi_k\, g_k(x_i)\right)\right) = \sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\pi_k\, g_k(x_i)\right), \tag{2}$$

where θ = (σ², θ₁, ..., θ_K) = (σ², π₁, μ₁, ..., π_K, μ_K).
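
As a side note, the log-likelihood (2) is conveniently evaluated on the log scale with a log-sum-exp; the sketch below is one possible implementation (assuming NumPy; the mixture parameters and data are arbitrary):

import numpy as np

# Sketch: evaluate the log-likelihood (2) of a spherical Gaussian mixture.
# X is N x p, mu is K x p, pi holds the mixing weights (all values assumed).
def log_likelihood(X, pi, mu, sigma2):
    p = X.shape[1]
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
    log_comp = (np.log(pi)[None, :]
                - 0.5 * p * np.log(2 * np.pi * sigma2)
                - sq_dist / (2 * sigma2))                           # log(pi_k g_k(x_n))
    m = log_comp.max(axis=1, keepdims=True)                         # log-sum-exp over k
    return (m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum()

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
print(log_likelihood(X, pi=np.array([0.5, 0.5]),
                     mu=np.array([[0.0, 0.0], [3.0, 3.0]]), sigma2=1.0))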


S3.b) We generalize the ideas in [HTF09, p. 272]. Introduce the random vector ∆ = (∆₁, ..., ∆_K) satisfying ∆_k ∈ {0, 1}, Σ_{k=1}^{K} ∆_k = 1, and Pr(∆_k = 1) = π_k. Note that Pr(∆) = Π_{k=1}^{K} π_k^{∆_k} and Pr(x | ∆_k = 1) = g_k(x). We set

$$\gamma_{nk}(\theta) := \Pr(\Delta_k = 1 \mid \theta, Z = x_n) = \frac{\pi_k\, g_k(x_n)}{\sum_{j=1}^{K} \pi_j\, g_j(x_n)}. \tag{3}$$

In (2) we calculate the derivatives dl/dμ_k, dl/dσ², and dl/dπ_k, determine their zeros, and find the extreme points

$$\mu_k = \frac{\sum_{n=1}^{N}\gamma_{nk}\, x_n}{\sum_{n=1}^{N}\gamma_{nk}}, \qquad \sigma^{2} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N}\gamma_{nk}\,(x_n - \mu_k)^{T}(x_n - \mu_k)}{p\sum_{k=1}^{K}\sum_{n=1}^{N}\gamma_{nk}}, \qquad \pi_k = \frac{\sum_{n=1}^{N}\gamma_{nk}}{N}. \tag{4}$$

Now, guess μ_k^0, σ^0, π_k^0 and calculate γ_nk^0 using (3). With γ_nk^0 at hand, we can use (4) to update our parameters to μ_k^1, σ^1, π_k^1. Repeating this procedure, we improve our estimates. To see why, assume we have determined μ_k^i, σ^i, π_k^i. Note that Σ_k γ_nk^i = 1 and γ_nk^i ≥ 0. Applying Jensen's inequality to (2), we get
$$l(\theta, Z) \;\ge\; \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma_{nk}^{i}\,\log\!\left(\frac{\pi_k\, g_k(x_n)}{\gamma_{nk}^{i}}\right) \;=\; \sum_{n,k}\gamma_{nk}^{i}\,\log(\pi_k\, g_k(x_n)) \;-\; \sum_{n,k}\gamma_{nk}^{i}\,\log(\gamma_{nk}^{i}) \;=:\; B_i(\theta).$$

The extreme points of B_i(θ) turn out to be μ_k^{i+1}, σ^{i+1}, π_k^{i+1} when calculated from (4) using γ_nk^i.

In conclusion, the EM algorithm is given by the following steps (a code sketch is given after the listing):

1. Take initial guesses for the parameters σ̂², μ̂_k, π̂_k for k = 1, ..., K.

2. Expectation step: Compute the responsibilities

$$\hat\gamma_{nk} = \frac{\hat\pi_k\, g_k(x_n)}{\sum_{j=1}^{K}\hat\pi_j\, g_j(x_n)}, \qquad n = 1, ..., N, \quad k = 1, ..., K.$$

3. Maximization step: Compute the weighted means, the common variance, and the mixing proportions

$$\hat\mu_k = \frac{\sum_{n=1}^{N}\hat\gamma_{nk}\, x_n}{\sum_{n=1}^{N}\hat\gamma_{nk}}, \qquad \hat\sigma^{2} = \frac{\sum_{k=1}^{K}\sum_{n=1}^{N}\hat\gamma_{nk}\,(x_n - \hat\mu_k)^{T}(x_n - \hat\mu_k)}{p\sum_{k=1}^{K}\sum_{n=1}^{N}\hat\gamma_{nk}}, \qquad \hat\pi_k = \frac{\sum_{n=1}^{N}\hat\gamma_{nk}}{N}.$$

4. Iterate steps 2 and 3 until convergence.
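
The following minimal sketch implements these four steps for a two-component spherical mixture (assuming NumPy; the simulated data, the initialization from randomly chosen points, and the convergence tolerance are arbitrary choices, not prescribed by [HTF09]):

import numpy as np

# Sketch of the EM iteration above for a spherical Gaussian mixture.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=0.0, size=(60, 2)),
               rng.normal(loc=4.0, size=(60, 2))])          # simulated data, K = 2
N, p = X.shape
K = 2

mu = X[rng.choice(N, K, replace=False)]                      # step 1: initial guesses
sigma2, pi = 1.0, np.full(K, 1.0 / K)

for _ in range(200):
    # E-step: responsibilities gamma_nk, equation (3)
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)                 # N x K
    dens = pi * np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2) ** (p / 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted means, common variance, mixing proportions, equation (4)
    Nk = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / Nk[:, None]
    sigma2 = (gamma * ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(axis=2)).sum() / (N * p)
    pi = Nk / N

    converged = np.abs(mu_new - mu).max() < 1e-8              # step 4: stop when stable
    mu = mu_new
    if converged:
        break

print(mu, sigma2, pi)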

S3.c) For each n, choose j such that (x_n − μ_j)ᵀ(x_n − μ_j) ≤ (x_n − μ_k)ᵀ(x_n − μ_k) for all k with π_k ≠ 0. Note from (3) that, as σ → 0, γ_nk → 0 for k ≠ j while γ_nj → 1. Hence, we can write

$$\gamma_{nk} \to r_{nk} := \begin{cases} 1 & \text{if } k = \arg\min_{j}\,(x_n - \mu_j)^{T}(x_n - \mu_j), \\ 0 & \text{otherwise}, \end{cases}$$

which assigns each data point to the cluster having the closest mean.
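
A small numerical sketch of this hardening of the responsibilities (assuming NumPy; the data point, means, and mixing weights are arbitrary):

import numpy as np

# Sketch: as sigma -> 0 the responsibilities in (3) become hard K-means assignments.
x = np.array([1.0, 0.2])                          # one data point x_n
mu = np.array([[0.0, 0.0], [3.0, 3.0]])           # two cluster means
pi = np.array([0.4, 0.6])

for sigma2 in [4.0, 1.0, 0.1, 0.01]:
    sq = ((x - mu) ** 2).sum(axis=1)
    w = pi * np.exp(-(sq - sq.min()) / (2 * sigma2))   # shift by the minimum for stability
    print(sigma2, w / w.sum())
# The printed responsibilities converge to (1, 0): the cluster with the closest mean.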

Q4. Derive equation (6.8) in [HTF09, p. 195] for multidimensional x.

S4. We want to determine

$$\left(\hat\beta_0, \dots, \hat\beta_p\right) = \arg\min_{\beta_0, \dots, \beta_p}\; \sum_{j=1}^{N} K_\lambda(x_0, x_j)\left(y_j - \beta_0(x_0) - \sum_{i=1}^{p}\beta_i(x_0)\, x_{i,j}\right)^{2}.$$

Define b(x)ᵀ = (1, x), let B be the regression matrix whose ith row is b(x_i)ᵀ, and let W(x₀) be the diagonal matrix with ith diagonal element K_λ(x₀, x_i). Then

$$\hat\beta = \left(\hat\beta_0, \dots, \hat\beta_p\right)^{T} = \arg\min_{\beta = (\beta_0, \dots, \beta_p)}\; (B\beta - y)^{T}\, W(x_0)\,(B\beta - y).$$

It is then not difficult to reduce the problem to ordinary least squares, which yields

$$\hat\beta = \left(B^{T} W(x_0) B\right)^{-1} B^{T} W(x_0)\, y.$$

Consequently,

$$\hat f(x_0) = b(x_0)^{T}\left(B^{T} W(x_0) B\right)^{-1} B^{T} W(x_0)\, y.$$
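
As an illustration of this formula (assuming NumPy and a Gaussian kernel for K_λ; the simulated data and bandwidth are arbitrary choices, not part of the original derivation):

import numpy as np

# Sketch: local linear regression estimate f_hat(x0) for multidimensional x.
rng = np.random.default_rng(4)
N, p, lam = 200, 2, 0.5
X = rng.uniform(-1, 1, size=(N, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=N)

def f_hat(x0, X, y, lam):
    K = np.exp(-((X - x0) ** 2).sum(axis=1) / (2 * lam ** 2))   # kernel weights K_lambda(x0, x_i)
    B = np.hstack([np.ones((len(X), 1)), X])                    # rows b(x_i)^T = (1, x_i)
    W = np.diag(K)                                              # diagonal weight matrix W(x0)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)            # (B^T W B)^{-1} B^T W y
    return np.concatenate([[1.0], x0]) @ beta                   # b(x0)^T beta

x0 = np.array([0.3, -0.2])
print(f_hat(x0, X, y, lam), np.sin(2 * x0[0]) + x0[1] ** 2)     # estimate vs. true regression value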

References
[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition, Springer, 2009.
