HW1 Solutions
Q1. Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior $\beta \sim N(0, \tau^2 I)$ and Gaussian sampling model $y \sim N(X\beta, \sigma^2 I)$. Find the relationship between the regularization parameter $\lambda$ in the ridge formula and the variances $\tau^2$ and $\sigma^2$.
S1. We assume that the input data is centered, which allows us to ignore the intercept term $\beta_0$. The posterior distribution is given by
\[
\Pr(\beta \mid y, X) = \frac{1}{K}\,\Pr(y \mid \beta, X)\,\Pr(\beta),
\]
where $K = K(y, X) = \int \Pr(y \mid \beta, X)\,\Pr(\beta)\,d\beta$ is clearly independent of $\beta$. Using our assumptions, we see that
\[
\Pr(\beta \mid y, X) = \frac{1}{Z} \cdot \frac{1}{(2\pi)^{N/2}\sigma^{N}} \exp\!\left(-\frac{(y - X\beta)^T (y - X\beta)}{2\sigma^2}\right) \cdot \frac{1}{(2\pi)^{p/2}\tau^{p}} \exp\!\left(-\frac{\beta^T \beta}{2\tau^2}\right). \tag{1}
\]
Then,
\[
\log \Pr(\beta \mid y, X) = -C - \frac{(y - X\beta)^T (y - X\beta)}{2\sigma^2} - \frac{\beta^T \beta}{2\tau^2},
\]
where $C$ collects the terms without $\beta$ dependence. Setting the gradient with respect to $\beta$ to zero gives $\frac{1}{\sigma^2} X^T (y - X\beta) - \frac{1}{\tau^2}\beta = 0$, so this expression is maximized at
\[
\hat{\beta} = \left(X^T X + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^T y.
\]
Letting $\lambda = \sigma^2/\tau^2$, we see the equivalence of the above approach and ridge regression.
It is clear that $\Pr(\beta \mid y, X)$ is Gaussian, so its mean and mode coincide. We now show that its mean $m = \hat{\beta}$. To this end, note that (1) implies that its covariance $\Sigma$ satisfies
\[
\Sigma^{-1} = \frac{1}{\sigma^2} X^T X + \frac{1}{\tau^2} I.
\]
This gives $\hat{\beta} = \frac{1}{\sigma^2}\Sigma X^T y$. Equating the terms linear in $\beta$ in the exponent of (1) with those of $-\frac{1}{2}(\beta - m)^T \Sigma^{-1} (\beta - m)$, we find $\Sigma^{-1} m = \frac{1}{\sigma^2} X^T y$, so $m = \hat{\beta}$.
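As a quick sanity check, the identity $\hat{\beta} = \frac{1}{\sigma^2}\Sigma X^T y$ with $\lambda = \sigma^2/\tau^2$ can be verified numerically. Below is a minimal NumPy sketch; the synthetic data and the values of $\sigma^2$ and $\tau^2$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.standard_normal((N, p))
X -= X.mean(axis=0)                      # centered inputs, as assumed in S1
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

sigma2, tau2 = 1.0, 2.0                  # arbitrary noise / prior variances
lam = sigma2 / tau2                      # ridge parameter implied by the prior

# Ridge estimate: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mean: Sigma X^T y / sigma^2, with Sigma^{-1} = X^T X / sigma^2 + I / tau^2
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
beta_post_mean = Sigma @ X.T @ y / sigma2

print(np.allclose(beta_ridge, beta_post_mean))   # True
```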
Q2. Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix $X$ with $p$ additional rows $\sqrt{\lambda}\,I$ and augment $y$ with $p$ zeroes. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero.
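The augmented least squares criterion is $\|y - X\beta\|^2 + \|0 - \sqrt{\lambda}\,\beta\|^2 = \|y - X\beta\|^2 + \lambda \beta^T \beta$, i.e. exactly the ridge criterion. A minimal NumPy sketch of this equivalence, on synthetic data with an arbitrary choice of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 60, 5, 3.0
X = rng.standard_normal((N, p))
X -= X.mean(axis=0)                      # centered, so the intercept can be ignored
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# Augment X with sqrt(lambda) * I and y with p zeros
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Ordinary least squares on the augmented data
beta_ols_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Closed-form ridge estimate
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ols_aug, beta_ridge))     # True
```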
Q3. Consider a mixture model density in $p$-dimensional feature space,
\[
g(x) = \sum_{k=1}^{K} \pi_k g_k(x),
\]
where $g_k = N(\mu_k, \sigma^2 I)$ and $\pi_k \ge 0$ for all $k$, with $\sum_k \pi_k = 1$. Here $\{\mu_k, \pi_k\}$, $k = 1, \ldots, K$, and $\sigma^2$ are unknown parameters. Suppose we have data $x_1, \ldots, x_N \sim g(x)$ and we wish to fit the mixture model.
a. Write down the log-likelihood of the data.
b. Derive an EM algorithm for computing the maximum likelihood estimates.
c. Show that if σ has a known value in the mixture model and we take σ → 0, then in a sense,
this EM algorithm coincides with K-means clustering.
S3.a) The log-likelihood function for $\{x_i\}_{i=1}^{N}$ is given by
\[
\ell(\theta) = \log\left(\prod_{i=1}^{N} g(x_i)\right) = \log\left(\prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k g_k(x_i)\right) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k g_k(x_i)\right). \tag{2}
\]
S3.b) Let $\Delta_{nk}$ be the latent indicator that observation $x_n$ was generated by component $k$, and define the responsibilities
\[
\gamma_{nk}(\theta) := \Pr(\Delta_{nk} = 1 \mid \theta, x_n) = \frac{\pi_k g_k(x_n)}{\sum_{j=1}^{K} \pi_j g_j(x_n)}. \tag{3}
\]
Differentiating (2) with respect to $\mu_k$, $\sigma^2$ and $\pi_k$ (the latter subject to $\sum_k \pi_k = 1$) and setting the derivatives to zero, we find the stationary points
\[
\mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk} x_n}{\sum_{n=1}^{N} \gamma_{nk}}, \qquad
\sigma^2 = \frac{\sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)^T (x_n - \mu_k)}{p \sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk}}, \qquad
\pi_k = \frac{\sum_{n=1}^{N} \gamma_{nk}}{N}. \tag{4}
\]
The EM algorithm is then:
1. Take initial guesses $\hat{\mu}_k$, $\hat{\sigma}^2$, $\hat{\pi}_k$, $k = 1, \ldots, K$.
2. E-step: with the current parameter estimates, compute
\[
\hat{\gamma}_{nk} = \frac{\hat{\pi}_k \hat{g}_k(x_n)}{\sum_{j=1}^{K} \hat{\pi}_j \hat{g}_j(x_n)}, \qquad n = 1, \ldots, N, \; k = 1, \ldots, K.
\]
3. M-step: update $\hat{\mu}_k$, $\hat{\sigma}^2$ and $\hat{\pi}_k$ using (4) with $\gamma_{nk} = \hat{\gamma}_{nk}$.
4. Iterate steps 2 and 3 until convergence.
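A compact NumPy sketch of this EM loop for the isotropic mixture follows; the initialization scheme, tolerance, and the synthetic two-cluster data are arbitrary illustrative choices.

```python
import numpy as np

def em_isotropic_gmm(X, K, n_iter=200, tol=1e-8, seed=0):
    """EM for a K-component Gaussian mixture with common covariance sigma^2 * I."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, K, replace=False)]       # initialize means at random data points
    sigma2 = X.var()
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k], equation (3), computed in log space
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)      # squared distances, N x K
        log_dens = -0.5 * d2 / sigma2 - 0.5 * p * np.log(2 * np.pi * sigma2)
        log_w = np.log(pi) + log_dens
        log_norm = np.logaddexp.reduce(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)
        # M-step: updates from equation (4)
        Nk = gamma.sum(axis=0)                    # effective counts per component
        mu = (gamma.T @ X) / Nk[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        sigma2 = (gamma * d2).sum() / (p * N)
        pi = Nk / N
        ll = log_norm.sum()                       # observed-data log-likelihood (2)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma2, pi, gamma

# Example on synthetic two-cluster data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
mu, sigma2, pi, gamma = em_isotropic_gmm(X, K=2)
print(mu, sigma2, pi)
```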
S3.c) For each $n$ choose $j$ such that $(x_n - \mu_j)^T (x_n - \mu_j) \le (x_n - \mu_k)^T (x_n - \mu_k)$ for all $k$ with $\pi_k \ne 0$. Note from (3) that, as $\sigma \to 0$, $\gamma_{nk} \to 0$ for $k \ne j$ and $\gamma_{nj} \to 1$. Hence, in the limit the E-step becomes the hard assignment
\[
\hat{\gamma}_{nk} \to \begin{cases} 1 & \text{if } k = j, \\ 0 & \text{otherwise,} \end{cases}
\]
which assigns each data point to the cluster having the closest mean. With these hard responsibilities, the update (4) for $\mu_k$ reduces to the average of the points assigned to cluster $k$, so the EM iterations coincide with the assignment and re-centering steps of K-means clustering.
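A quick numerical illustration of how the responsibilities (3) harden into this nearest-mean assignment as $\sigma$ shrinks; the data point, the two fixed means, and the mixing weights below are arbitrary.

```python
import numpy as np

x = np.array([1.0, 0.0])                      # a single data point
mu = np.array([[0.0, 0.0], [3.0, 0.0]])       # two fixed cluster means
pi = np.array([0.5, 0.5])

for sigma in [2.0, 1.0, 0.5, 0.25]:
    d2 = ((x - mu) ** 2).sum(axis=1)
    w = pi * np.exp(-0.5 * d2 / sigma**2)     # unnormalized responsibilities, eq. (3)
    print(sigma, w / w.sum())                 # -> (1, 0): the nearest-mean assignment
```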
Define $b(x)^T = (1, x)$, let $B$ be the regression matrix with $i$th row $b(x_i)^T$, and let $W(x_0)$ be the diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$. Then
\[
\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_p)^T = \arg\min_{\beta = (\beta_0, \ldots, \beta_p)} (B\beta - y)^T W(x_0) (B\beta - y).
\]
Writing $W(x_0) = W(x_0)^{1/2} W(x_0)^{1/2}$, this is an ordinary least squares problem in $W(x_0)^{1/2} B$ and $W(x_0)^{1/2} y$, which yields
\[
\hat{\beta} = \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y.
\]
Consequently,
\[
\hat{f}(x_0) = b(x_0)^T \left(B^T W(x_0) B\right)^{-1} B^T W(x_0)\, y.
\]
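A minimal sketch of this weighted least squares fit at a single target point, assuming for illustration a one-dimensional input and a Gaussian kernel for $K_\lambda$ (both choices are mine, not specified above):

```python
import numpy as np

def local_linear_fit(x0, x, y, lam=0.3):
    """Local linear estimate f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y."""
    B = np.column_stack([np.ones_like(x), x])            # rows b(x_i)^T = (1, x_i)
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)             # kernel weights K_lambda(x0, x_i)
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)     # weighted least squares coefficients
    return np.array([1.0, x0]) @ beta                    # b(x0)^T beta_hat

# Example: noisy sine curve
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.standard_normal(100)
print(local_linear_fit(np.pi / 2, x, y))                 # roughly sin(pi/2) = 1
```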