
Assignment 1

1. Likelihood and Bayesian inference in the linear model:

Suppose that the n × 1 vector Y follows a normal distribution with mean Xβ and variance σ²I:

$$Y \sim N(X\beta,\ \sigma^2 I),$$

i.e. that

$$f(y \mid \beta, \sigma^2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left\{-\frac{1}{2\sigma^2}\,(y - X\beta)^T (y - X\beta)\right\}.$$
(a) The maximum likelihood estimates (β̂, σ̂²) are defined to be the values of β and σ² that simultaneously maximize the likelihood function, or more conveniently the log-likelihood function

$$\ell(\beta, \sigma^2) = \log f(y \mid \beta, \sigma^2).$$

Give expressions for the maximum likelihood estimators of β and σ². You may assume that X has full column rank.
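A quick sanity check on the algebra (a sketch on simulated data; the closed forms β̂ = (XᵀX)⁻¹Xᵀy and σ̂² = RSS/n are the standard results, and a direct numerical maximization of ℓ should agree with them):

    # Sketch: verify the closed-form MLEs against numerical maximization of the
    # log-likelihood on simulated data (all values here are made up).
    set.seed(1)
    n <- 100; p <- 3
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # full column rank
    y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))
    beta_hat   <- solve(crossprod(X), crossprod(X, y))    # (X^T X)^{-1} X^T y
    sigma2_hat <- sum((y - X %*% beta_hat)^2) / n         # RSS / n (not n - p)
    negll <- function(par) {                              # negative log-likelihood
      b <- par[1:p]; s2 <- exp(par[p + 1])                # optimize log(sigma^2)
      0.5 * n * log(2 * pi * s2) + sum((y - X %*% b)^2) / (2 * s2)
    }
    opt <- optim(c(rep(0, p), 0), negll, method = "BFGS")
    rbind(closed_form = c(beta_hat, sigma2_hat),
          numerical   = c(opt$par[1:p], exp(opt$par[p + 1])))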
(b) The likelihood ratio statistic for testing the hypothesis β = β⁽⁰⁾ = (β₀, 0, …, 0) is defined as

$$W = 2\left\{\ell\big(\hat\beta, \hat\sigma^2\big) - \ell\big(\hat\beta^{(0)}, \hat\sigma_0^2\big)\right\},$$

where (β̂⁽⁰⁾, σ̂₀²) is the maximum likelihood estimate of (β, σ²) when β = β⁽⁰⁾. Show that

$$\hat\sigma_0^2 = n^{-1} \sum_{i=1}^{n} (y_i - \bar y)^2,$$

and that W is a function of the F-test for regression.
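A numerical check of the claimed relationship (a sketch continuing the simulated data above; the derivation should reduce W to n log(RSS₀/RSS₁)):

    # Sketch: under beta = beta^(0) only the intercept is fit, so the restricted
    # MLE of the mean is ybar; W reduces to n*log(RSS0/RSS1), a monotone
    # function of the F statistic for regression.
    fit1 <- lm(y ~ X - 1)          # full model (X already contains the 1 column)
    fit0 <- lm(y ~ 1)              # restricted model: intercept only
    RSS1 <- sum(resid(fit1)^2); RSS0 <- sum(resid(fit0)^2)
    W     <- n * log(RSS0 / RSS1)
    Fstat <- ((RSS0 - RSS1) / (p - 1)) / (RSS1 / (n - p))
    all.equal(W, n * log(1 + (p - 1) * Fstat / (n - p)))   # TRUE: W is a function of F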

(c) Assume σ² is known. Suppose that we assume a prior distribution for β that is N(0, τ²I), where τ² is also known:

$$f(\beta) = \frac{1}{(\sqrt{2\pi}\,\tau)^p} \exp\left\{-\frac{1}{2\tau^2}\,\beta^T \beta\right\}.$$

By Bayes' theorem the posterior distribution of β, given y, is

$$f(\beta \mid y) = f(y \mid \beta)\, f(\beta) \Big/ \int f(y \mid \beta)\, f(\beta)\, d\beta.$$

Show that this posterior distribution for β is normal, with

$$E(\beta \mid y) = \big(X^T X + \lambda I\big)^{-1} X^T y, \qquad \operatorname{cov}(\beta \mid y) = \big(X^T X + \lambda I\big)^{-1} \sigma^2,$$

where λ = σ²/τ². What is the limiting posterior distribution as τ² → ∞?
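A numerical illustration of the limit (a sketch on simulated data; λ = σ²/τ² → 0 as τ² → ∞, so the posterior mean should approach the least squares estimate):

    # Sketch: the ridge-form posterior mean approaches OLS as tau^2 grows.
    set.seed(2)
    n <- 50; p <- 3; sigma2 <- 1
    X <- matrix(rnorm(n * p), n, p)
    y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(n, sd = sqrt(sigma2)))
    post <- function(tau2) {
      lambda <- sigma2 / tau2
      A <- crossprod(X) + lambda * diag(p)
      list(mean = solve(A, crossprod(X, y)), cov = sigma2 * solve(A))
    }
    cbind(tau2_1   = post(1)$mean,
          tau2_1e6 = post(1e6)$mean,                    # essentially flat prior
          ols      = solve(crossprod(X), crossprod(X, y)))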

(d) Continuing with the assumption of known σ², show that the mode of the posterior distribution for β under the double-exponential prior

$$f(\beta) = \frac{1}{(2\tau)^p} \exp\big(-|\beta|/\tau\big), \qquad |\beta| = \sum_{j=1}^{p} |\beta_j|,$$

gives the lasso estimator.
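The correspondence can be checked numerically (a sketch, not a proof, reusing X, y, σ² from the snippet above; matching glmnet's objective ‖y − Xβ‖²/(2n) + λ‖β‖₁ requires λ = σ²/(nτ), with glmnet's standardization and intercept switched off):

    # Sketch: the posterior mode minimizes ||y - X b||^2/(2*sigma2) + |b|_1/tau,
    # which matches glmnet's lasso objective when lambda = sigma2 / (n * tau).
    library(glmnet)
    tau <- 0.5
    obj <- function(b) sum((y - X %*% b)^2) / (2 * sigma2) + sum(abs(b)) / tau
    mode_num <- optim(rep(0, p), obj)$par     # Nelder-Mead; crude on a non-smooth
                                              # objective, but adequate for a check
    fit <- glmnet(X, y, lambda = sigma2 / (n * tau),
                  standardize = FALSE, intercept = FALSE)
    cbind(optim = mode_num, glmnet = as.numeric(coef(fit))[-1])  # drop 0 intercept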

2. Exercise 3.2 of HTF:

Given data on two variables X and Y, consider fitting a cubic polynomial regression model

$$f(X) = \sum_{j=0}^{3} \beta_j X^j.$$

In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

(a) At each point x₀, form a 95% confidence interval for the linear function $a^T\beta = \sum_{j=0}^{3} \beta_j x_0^j$.

(b) Form a 95% confidence set for β as in (3.15), which in turn generates confidence intervals for f(x₀).

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment
to compare the two methods.
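One possible simulation sketch (all settings arbitrary): approach (a) uses a pointwise t critical value, while (b) inflates the standard error by the Scheffé-type factor √(p F₀.₉₅(p, n−p)) implied by the confidence set, so band (b) should be wider:

    # Sketch: pointwise band (a) versus the band induced by a 95% confidence
    # set for beta (b); the latter uses a Scheffe-type critical value.
    set.seed(3)
    n <- 40
    x <- sort(runif(n, -2, 2))
    y <- 1 + x - 0.5 * x^3 + rnorm(n, sd = 0.5)
    fit <- lm(y ~ x + I(x^2) + I(x^3))
    xx <- seq(-2, 2, length.out = 200)
    A  <- cbind(1, xx, xx^2, xx^3)                   # rows are a(x0)^T
    pred <- as.vector(A %*% coef(fit))
    se   <- sqrt(rowSums((A %*% vcov(fit)) * A))     # sd of a(x0)^T beta-hat
    p <- 4
    crit_a <- qt(0.975, n - p)                       # (a) pointwise
    crit_b <- sqrt(p * qf(0.95, p, n - p))           # (b) from the confidence set
    matplot(xx, pred + outer(se, c(-crit_a, crit_a, -crit_b, crit_b)),
            type = "l", lty = c(2, 2, 3, 3), col = c(2, 2, 4, 4),
            xlab = "x", ylab = "f(x)")
    points(x, y); lines(xx, pred)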

3. Exercise 3.12 of HTF:

Show that the ridge regression estimate can be obtained by ordinary least squares regression on an augmented data set. We augment the centered matrix X with p additional rows √λ I, and augment y with p zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients towards zero. This is related to the idea of hints due to Abu-Mostafa (1995), where model constraints are implemented by adding artificial data points that satisfy them.
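A small numerical confirmation (a sketch with simulated data; no intercept is fit, consistent with the centered-X formulation):

    # Sketch: ridge coefficients equal OLS on the augmented data
    # (X stacked over sqrt(lambda)*I, y stacked over p zeros).
    set.seed(4)
    n <- 30; p <- 4; lambda <- 2
    X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)   # centered X
    y <- as.vector(X %*% rnorm(p) + rnorm(n))
    ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
    Xa <- rbind(X, sqrt(lambda) * diag(p))                  # p artificial rows
    ya <- c(y, rep(0, p))                                   # p zero responses
    ols_aug <- solve(crossprod(Xa), crossprod(Xa, ya))
    all.equal(as.vector(ridge), as.vector(ols_aug))         # TRUE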

4. The wine quality data:

We will work with the red wine data set. It can be read into R with read.csv("winequality-red.csv", sep=";"). There are 1599 cases and 11 inputs. The output variable is the quality score, a number between 0 and 10. The goal is to use the features to predict the quality score.

(a) Choose 1000 cases at random to be your personal training data set. The remaining 599 cases are the
test data set.
(b) Estimate the coefficients in a linear model using least squares with all 11 features, all-possible-subsets regression, ridge regression, lasso regression, PCR and PLS (a starting sketch follows this list).
(c) Evaluate each method on the test data by computing the mean of the squared prediction error.
(d) Present the results in a table similar to Table 3.3.
(e) Although the quality score can range from 0 to 10, most of the values are 5 and 6. What is the range of quality scores in your test data? How might this affect the estimated mean squared error?
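A possible starting point for (a)-(c) (a sketch: it assumes the glmnet and pls packages and their cross-validation defaults; best-subset selection, e.g. via leaps::regsubsets, is left out for brevity):

    # Sketch: train/test split and test MSE for OLS, ridge, lasso, PCR and PLS.
    library(glmnet)   # ridge and lasso
    library(pls)      # PCR and PLS
    wine <- read.csv("winequality-red.csv", sep = ";")
    set.seed(5)
    train <- sample(nrow(wine), 1000)
    Xtr <- as.matrix(wine[train, -12]);  ytr <- wine$quality[train]
    Xte <- as.matrix(wine[-train, -12]); yte <- wine$quality[-train]
    mse <- function(pred) mean((yte - pred)^2)

    res <- c(OLS = mse(predict(lm(quality ~ ., wine[train, ]), wine[-train, ])))
    res["ridge"] <- mse(predict(cv.glmnet(Xtr, ytr, alpha = 0), Xte, s = "lambda.min"))
    res["lasso"] <- mse(predict(cv.glmnet(Xtr, ytr, alpha = 1), Xte, s = "lambda.min"))

    pcr_fit <- pcr(quality ~ ., data = wine[train, ], scale = TRUE, validation = "CV")
    pls_fit <- plsr(quality ~ ., data = wine[train, ], scale = TRUE, validation = "CV")
    k_pcr <- which.min(RMSEP(pcr_fit)$val["adjCV", 1, -1])   # drop 0-component entry
    k_pls <- which.min(RMSEP(pls_fit)$val["adjCV", 1, -1])
    res["PCR"] <- mse(predict(pcr_fit, wine[-train, ], ncomp = k_pcr))
    res["PLS"] <- mse(predict(pls_fit, wine[-train, ], ncomp = k_pls))
    round(res, 3)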

5. Classification of the wine quality data:

Another approach to analysing the wine quality data is to treat the quality scores as categorical and use the features to classify wines into the resulting categories. For this exercise, create 4 categories: Bad (0-4), Acceptable (5), Good (6), Excellent (8-10). Using the training and test sets as before, compare the classification errors on the test data using linear discriminant analysis, quadratic discriminant analysis, and the naive Bayes classifier (see the sketch below).
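A sketch for the comparison (assuming MASS for lda/qda and e1071 for naiveBayes, and reusing wine and train from the sketch above; note the stated categories leave a score of 7 unassigned, and this sketch groups it with Excellent):

    # Sketch: LDA, QDA and naive Bayes test error on the 4-category wine grades.
    library(MASS)     # lda, qda
    library(e1071)    # naiveBayes
    wine$grade <- cut(wine$quality, breaks = c(-Inf, 4, 5, 6, Inf),
                      labels = c("Bad", "Acceptable", "Good", "Excellent"))
    tr <- wine[train, -12]; te <- wine[-train, -12]   # drop numeric quality
    err <- function(pred) mean(pred != te$grade)
    c(LDA = err(predict(lda(grade ~ ., data = tr), te)$class),
      QDA = err(predict(qda(grade ~ ., data = tr), te)$class),
      NB  = err(predict(naiveBayes(grade ~ ., data = tr), te)))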
Exercise 6.12 of HTF describes a version of local discriminant analysis, which is built on a kernel function Kλ(·, x₀) and uses it to provide a set of weights for linear discriminant analysis. Write a program to implement this on the wine data, and compare the predictions to those above.
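A minimal sketch of one such implementation (assumptions: a Gaussian kernel with bandwidth λ on standardized features, and weighted class means, a pooled weighted covariance and weighted priors plugged into the usual LDA discriminant functions; HTF Exercise 6.12 leaves these choices open):

    # Sketch: locally weighted LDA at a query point x0 (Gaussian kernel weights).
    local_lda <- function(X, g, x0, lambda = 1) {
      w  <- exp(-colSums((t(X) - x0)^2) / (2 * lambda^2))  # weights K(x_i, x0)
      cl <- levels(g); p <- ncol(X)
      mu <- matrix(0, p, length(cl), dimnames = list(NULL, cl))
      Sigma <- matrix(0, p, p)
      for (k in cl) {
        Xk <- X[g == k, , drop = FALSE]; wk <- w[g == k]
        mu[, k] <- colSums(wk * Xk) / sum(wk)              # weighted class mean
        Zk <- sweep(Xk, 2, mu[, k]) * sqrt(wk)
        Sigma <- Sigma + crossprod(Zk)                     # weighted scatter
      }
      Sigma <- Sigma / sum(w)                              # pooled weighted covariance
      delta <- sapply(cl, function(k) {                    # LDA discriminant functions
        a <- solve(Sigma, mu[, k])
        sum(a * x0) - 0.5 * sum(a * mu[, k]) + log(sum(w[g == k]) / sum(w))
      })
      cl[which.max(delta)]
    }
    # e.g. predictions for the first 5 test cases, on standardized features:
    Xtr_m <- scale(as.matrix(tr[, 1:11]))
    Xte_m <- scale(as.matrix(te[, 1:11]),
                   center = attr(Xtr_m, "scaled:center"),
                   scale  = attr(Xtr_m, "scaled:scale"))
    sapply(1:5, function(i) local_lda(Xtr_m, tr$grade, Xte_m[i, ], lambda = 2))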

6. A Bayesian model for smoothing:

Assume yᵢ = f(xᵢ) + εᵢ, i = 1, …, N, for xᵢ ∈ ℝ and εᵢ ∼ N(0, σ²), and that we fit f(·) using a cubic smoothing spline, i.e.

$$f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j,$$

as at (5.10). Show that if the prior distribution for θ is Gaussian with mean 0 and covariance matrix

$$\frac{\sigma^2}{\lambda}\,\Omega_N^{-1},$$

where Ω_N is the penalty matrix of (5.11), then the posterior distribution of θ given y is again Gaussian, and give expressions for the mean and variance. Use this to get expressions for

$$E(N\theta \mid y), \qquad \operatorname{cov}(N\theta \mid y).$$
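One way to organize the calculation (a sketch; it is the same complete-the-square step as in question 1(c), with Ω_N in place of I):

$$f(\theta \mid y) \propto \exp\left\{-\frac{1}{2\sigma^2}\Big[(y - N\theta)^T (y - N\theta) + \lambda\,\theta^T \Omega_N\, \theta\Big]\right\},$$

so completing the square in θ identifies a Gaussian posterior, and E(Nθ | y) and cov(Nθ | y) follow by applying N to its mean and covariance.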
