HW1 Solutions
Q1. Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior $\beta \sim N(0, \tau^2 I)$ and Gaussian sampling model $y \sim N(X\beta, \sigma^2 I)$. Find the relationship between the regularization parameter $\lambda$ in the ridge formula and the variances $\tau^2$ and $\sigma^2$.
S1. We assume that the input data is centered, which allows us to ignore the intercept term $\beta_0$. The posterior distribution is given by
\[
\Pr(\beta \mid y, X) = \frac{1}{K}\,\Pr(y \mid \beta, X)\,\Pr(\beta),
\]
where $K = K(y, X) = \int \Pr(y \mid \beta, X)\,\Pr(\beta)\,d\beta$ is clearly independent of $\beta$. Using our assumptions, we see that
\[
\Pr(\beta \mid y, X) = \frac{1}{Z} \cdot \frac{1}{(2\pi)^{N/2}\sigma^{N}} \exp\!\left(-\frac{(y - X\beta)^T (y - X\beta)}{2\sigma^2}\right) \cdot \frac{1}{(2\pi)^{p/2}\tau^{p}} \exp\!\left(-\frac{\beta^T \beta}{2\tau^2}\right). \tag{1}
\]
Then,
\[
\log \Pr(\beta \mid y, X) = -C - \frac{(y - X\beta)^T (y - X\beta)}{2\sigma^2} - \frac{\beta^T \beta}{2\tau^2},
\]
where $C$ collects the terms without $\beta$ dependence. Setting the gradient with respect to $\beta$ to zero gives $\frac{1}{\sigma^2} X^T (y - X\beta) - \frac{1}{\tau^2}\beta = 0$, so this expression is maximized at
\[
\hat{\beta} = \left(X^T X + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^T y.
\]
Letting $\lambda = \sigma^2/\tau^2$, we see the equivalence of the above approach and ridge regression.
It is clear that $\Pr(\beta \mid y, X)$ is Gaussian, so its mean and mode coincide. We now show that its mean $m = \hat{\beta}$. To this end, note that (1) implies that its covariance $\Sigma$ satisfies
\[
\Sigma^{-1} = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I.
\]
This gives $\hat{\beta} = \frac{1}{\sigma^2}\Sigma X^T y$. Equating the terms linear in $\beta$ in the exponent of (1) with those of $-\frac{1}{2}(\beta - m)^T \Sigma^{-1} (\beta - m)$, we find $\Sigma^{-1} m = \frac{1}{\sigma^2} X^T y$, so $m = \hat{\beta}$.
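As a quick sanity check, the identity $\hat{\beta} = \frac{1}{\sigma^2}\Sigma X^T y$ with $\lambda = \sigma^2/\tau^2$ can be verified numerically. Below is a minimal NumPy sketch; the synthetic data and the values of $\sigma^2$ and $\tau^2$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.standard_normal((N, p))
X -= X.mean(axis=0)                      # centered inputs, as assumed in S1
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

sigma2, tau2 = 1.0, 2.0                  # arbitrary noise / prior variances
lam = sigma2 / tau2                      # ridge parameter implied by the prior

# Ridge estimate: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mean: Sigma X^T y / sigma^2, with Sigma^{-1} = X^T X / sigma^2 + I / tau^2
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
beta_post_mean = Sigma @ X.T @ y / sigma2

print(np.allclose(beta_ridge, beta_post_mean))   # True
```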
Q2. Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix $X$ with $p$ additional rows $\sqrt{\lambda}\,I$ and augment $y$ with $p$ zeroes. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero.
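The augmented least squares criterion is $\|y - X\beta\|^2 + \|0 - \sqrt{\lambda}\,\beta\|^2 = \|y - X\beta\|^2 + \lambda \beta^T \beta$, i.e. exactly the ridge criterion. A minimal NumPy sketch of this equivalence, on synthetic data with an arbitrary choice of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 60, 5, 3.0
X = rng.standard_normal((N, p))
X -= X.mean(axis=0)                      # centered, so the intercept can be ignored
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# Augment X with sqrt(lambda) * I and y with p zeros
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Ordinary least squares on the augmented data
beta_ols_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Closed-form ridge estimate
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ols_aug, beta_ridge))     # True
```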
Q3. Consider a mixture model density in $p$-dimensional feature space,
\[
g(x) = \sum_{k=1}^{K} \pi_k g_k(x),
\]
where $g_k = N(\mu_k, \sigma^2 I)$ and $\pi_k \ge 0$ for all $k$, with $\sum_k \pi_k = 1$. Here $\{\mu_k, \pi_k\}$, $k = 1, \ldots, K$, and $\sigma^2$ are unknown parameters. Suppose we have data $x_1, \ldots, x_N \sim g(x)$ and we wish to fit the mixture model.
a. Write down the log-likelihood of the data.
b. Derive an EM algorithm for computing the maximum likelihood estimates.
c. Show that if σ has a known value in the mixture model and we take σ → 0, then in a sense,
this EM algorithm coincides with K-means clustering.
S3.a) The log-likelihood function for $\{x_i\}_{i=1}^{N}$ is given by
\[
\ell(\theta) = \log\left(\prod_{i=1}^{N} g(x_i)\right) = \log\left(\prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k g_k(x_i)\right) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k g_k(x_i)\right). \tag{2}
\]
S3.b) Let $\Delta_{nk}$ be the latent indicator that observation $x_n$ was generated by component $k$, and define the responsibilities
\[
\gamma_{nk}(\theta) := \Pr(\Delta_{nk} = 1 \mid \theta, x_n) = \frac{\pi_k g_k(x_n)}{\sum_{j=1}^{K} \pi_j g_j(x_n)}. \tag{3}
\]
Differentiating (2) with respect to $\mu_k$, $\sigma^2$ and $\pi_k$ (the latter subject to $\sum_k \pi_k = 1$) and setting the derivatives to zero, we find the stationary points
\[
\mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk} x_n}{\sum_{n=1}^{N} \gamma_{nk}}, \qquad
\sigma^2 = \frac{\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)^T (x_n - \mu_k)}{p \sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk}}, \qquad
\pi_k = \frac{\sum_{n=1}^{N} \gamma_{nk}}{N}. \tag{4}
\]
The EM algorithm is then:
1. Take initial guesses $\hat{\mu}_k$, $\hat{\sigma}^2$, $\hat{\pi}_k$, $k = 1, \ldots, K$.
2. E-step: with the current parameter estimates, compute
\[
\hat{\gamma}_{nk} = \frac{\hat{\pi}_k \hat{g}_k(x_n)}{\sum_{j=1}^{K} \hat{\pi}_j \hat{g}_j(x_n)}, \qquad n = 1, \ldots, N, \; k = 1, \ldots, K.
\]
3. M-step: update $\hat{\mu}_k$, $\hat{\sigma}^2$ and $\hat{\pi}_k$ using (4) with $\gamma_{nk} = \hat{\gamma}_{nk}$.
4. Iterate steps 2 and 3 until convergence.
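A compact NumPy sketch of this EM loop for the isotropic mixture follows; the initialization scheme, tolerance, and the synthetic two-cluster data are arbitrary illustrative choices.

```python
import numpy as np

def em_isotropic_gmm(X, K, n_iter=200, tol=1e-8, seed=0):
    """EM for a K-component Gaussian mixture with common covariance sigma^2 * I."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, K, replace=False)]       # initialize means at random data points
    sigma2 = X.var()
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k], equation (3), computed in log space
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # squared distances, N x K
        log_dens = -0.5 * d2 / sigma2 - 0.5 * p * np.log(2 * np.pi * sigma2)
        log_w = np.log(pi) + log_dens
        log_norm = np.logaddexp.reduce(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)
        # M-step: updates from equation (4)
        Nk = gamma.sum(axis=0)                    # effective counts per component
        mu = (gamma.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (gamma * d2).sum() / (p * N)
        pi = Nk / N
        ll = log_norm.sum()                       # observed-data log-likelihood (2)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma2, pi, gamma

# Example on synthetic two-cluster data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
mu, sigma2, pi, gamma = em_isotropic_gmm(X, K=2)
print(mu, sigma2, pi)
```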
S3.c) For each $n$ choose $j$ such that $(x_n - \mu_j)^T (x_n - \mu_j) \le (x_n - \mu_k)^T (x_n - \mu_k)$ for all $k$ with $\pi_k \ne 0$. Note from (3) that, as $\sigma \to 0$, $\gamma_{nk} \to 0$ for $k \ne j$ and $\gamma_{nj} \to 1$. Hence, in the limit the E-step becomes the hard assignment
\[
\hat{\gamma}_{nk} \to \begin{cases} 1 & \text{if } k = j, \\ 0 & \text{otherwise,} \end{cases}
\]
which assigns each data point to the cluster having the closest mean. With these hard responsibilities, the update (4) for $\mu_k$ reduces to the average of the points assigned to cluster $k$, so the EM iterations coincide with the assignment and re-centering steps of K-means clustering.
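A quick numerical illustration of how the responsibilities (3) harden into this nearest-mean assignment as $\sigma$ shrinks; the data point, the two fixed means, and the mixing weights below are arbitrary.

```python
import numpy as np

x = np.array([1.0, 0.0])                      # a single data point
mu = np.array([[0.0, 0.0], [3.0, 0.0]])       # two fixed cluster means
pi = np.array([0.5, 0.5])

for sigma in [2.0, 1.0, 0.5, 0.25]:
    d2 = ((x - mu) ** 2).sum(axis=1)
    w = pi * np.exp(-0.5 * d2 / sigma**2)     # unnormalized responsibilities, eq. (3)
    print(sigma, w / w.sum())                 # -> (1, 0): the nearest-mean assignment
```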
Define $b(x)^T = (1, x)$, let $B$ be the regression matrix with $i$th row $b(x_i)^T$, and let $W(x_0)$ be the diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$. Then
\[
\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_p)^T = \arg\min_{\beta = (\beta_0, \ldots, \beta_p)} (B\beta - y)^T W(x_0) (B\beta - y).
\]
Writing $W(x_0) = W(x_0)^{1/2} W(x_0)^{1/2}$, this is an ordinary least squares problem in $W(x_0)^{1/2} B$ and $W(x_0)^{1/2} y$, which yields
\[
\hat{\beta} = \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y.
\]
Consequently,
\[
\hat{f}(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y.
\]
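A minimal sketch of this weighted least squares fit at a single target point, assuming for illustration a one-dimensional input and a Gaussian kernel for $K_\lambda$ (both choices are mine, not specified above):

```python
import numpy as np

def local_linear_fit(x0, x, y, lam=0.3):
    """Local linear estimate f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y."""
    B = np.column_stack([np.ones_like(x), x])            # rows b(x_i)^T = (1, x_i)
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)             # kernel weights K_lambda(x0, x_i)
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)     # weighted least squares coefficients
    return np.array([1.0, x0]) @ beta                    # b(x0)^T beta_hat

# Example: noisy sine curve
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.standard_normal(100)
print(local_linear_fit(np.pi / 2, x, y))                 # roughly sin(pi/2) = 1
```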