Tut6 Questions
Reminder: If you need more guidance to get started on a question, seek clarifications and
hints on the class forum. Move on if you’re getting stuck on a part for a long time. Full
answers will be released after the last group meets.
$x^{(n)} \sim \mathcal{N}(m, \sigma^2)$.
a) I don’t give you the raw data, { x (n) }, but tell you the mean of the observations:
$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}.$$
What is the likelihood¹ of m given only this mean $\bar{x}$? That is, what is $p(\bar{x} \mid m)$?²
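Without deriving the answer, you can sanity-check the mean and variance you identify numerically. A minimal sketch (my own illustration, not part of the tutorial; the values m = 1.5, σ = 2, N = 10 are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, N = 1.5, 2.0, 10       # assumed values for illustration
trials = 100_000

# Draw `trials` datasets of N observations each and compute each dataset's mean.
xbar = rng.normal(m, sigma, size=(trials, N)).mean(axis=1)

# The empirical moments of x-bar should match the mean and variance
# you identified for p(xbar | m).
print(xbar.mean())               # close to m = 1.5
print(xbar.var())                # close to sigma^2 / N = 0.4
```

Footnote 2 tells you the exact distribution is Gaussian; the simulation only checks the two moments you need to identify.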
b) A sufficient statistic is a summary of some data that contains all of the information
about a parameter.
ii) If we don’t know the noise variance $\sigma^2$ or the mean, is $\bar{x}$ still a sufficient
statistic in the sense that $p(m \mid \bar{x}) = p(m \mid \{x^{(n)}\}_{n=1}^{N})$? Explain your reasoning.
2. Conjugate priors: (This question sets up some intuitions about the larger picture for
Bayesian methods. But if you’re finding the course difficult, look at Q3 first.)
A conjugate prior for a likelihood function is a prior where the posterior is a distribution
in the same family as the prior. For example, a Gaussian prior on the mean of a
Gaussian distribution is conjugate to Gaussian observations of that mean.
1. I’m using the traditional statistics usage of the word “likelihood”: it’s a function of parameters given data, equal
to the probability of the data given the parameters. Personally I avoid saying “likelihood of the data” (cf. p. 29 of
MacKay’s textbook), although you’ll see that usage too.
2. The sum of Gaussian outcomes is Gaussian distributed; you only need to identify a mean and variance.
3. Numerical libraries often come with a gammaln or lgamma function to evaluate the log of the gamma function.
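As a concrete instance of the Gaussian–Gaussian example above, the posterior update can be written in closed form and checked numerically. A minimal sketch (the prior $\mathcal{N}(0, 1)$, noise σ = 2, and true mean 1.5 are assumed for illustration; this is not the course’s reference code):

```python
import numpy as np

rng = np.random.default_rng(1)
m0, s0 = 0.0, 1.0        # assumed Gaussian prior on the mean m: N(m0, s0^2)
sigma, N = 2.0, 10       # assumed known noise std and number of observations

x = rng.normal(1.5, sigma, size=N)   # data drawn with an assumed true mean of 1.5
xbar = x.mean()

# Conjugacy: the posterior over m is Gaussian again, with
# posterior precision = prior precision + N * (likelihood precision).
post_var = 1.0 / (1.0/s0**2 + N/sigma**2)
post_mean = post_var * (m0/s0**2 + N*xbar/sigma**2)
print(post_mean, np.sqrt(post_var))
```

Because the posterior is Gaussian again, repeating the update with more data keeps you in the same family, which is the defining property of conjugacy.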
b) i) If a conjugate prior exists, then the data can be replaced with sufficient
statistics. Can you explain why?
ii) Explain whether there could be a conjugate prior for the hard classifier:
$$P(y = 1 \mid \mathbf{x}, \mathbf{w}) = \Theta(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} 1 & \mathbf{w}^\top \mathbf{x} + b > 0 \\ 0 & \text{otherwise.} \end{cases}$$
where h is a vector of hidden unit values. These could be hidden units from the neural
network used to compute the function f(x; θ), or there could be a separate network to
model the variances.
a) Assume that h is the final layer of the same neural network used to compute f .
How could we modify the training procedure for a neural network that fits f by
least squares, to fit this new model?
b) In the suggestion above, the activation $a^{(\sigma)} = \mathbf{w}^{(\sigma)\top} \mathbf{h} + b^{(\sigma)}$ sets the log of the
variance of the observations.
i) Why not set the variance directly to this activation value, $\sigma^2 = a^{(\sigma)}$?
ii) Harder (I don’t know if you’ll have an answer, but I’m curious to find out):
Why not set the variance to the square of this activation value, $\sigma^2 = (a^{(\sigma)})^2$?
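Under a Gaussian observation model, the log-variance parametrisation in part b) corresponds to minimising a negative log-likelihood rather than plain squared error. A minimal sketch of that per-datapoint loss (the function name is my own; constant terms are dropped):

```python
import numpy as np

def gaussian_nll(y, f, a_sigma):
    """Per-datapoint negative log-likelihood of y under N(f, sigma^2),
    where the activation a_sigma sets the log variance: log sigma^2 = a_sigma.
    The additive constant 0.5*log(2*pi) is dropped."""
    log_var = a_sigma                  # exponentiating keeps sigma^2 positive
    return 0.5 * (log_var + (y - f)**2 * np.exp(-log_var))

# With a_sigma = 0 the variance is 1 and the loss reduces to half squared error:
print(gaussian_nll(2.0, 0.0, 0.0))    # 0.5 * (2 - 0)^2 = 2.0
```

Note how setting $a^{(\sigma)} = 0$ recovers (half) the usual least-squares loss, so the model strictly generalises the standard fit.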
c) Given a test input $\mathbf{x}^{(*)}$, the model above outputs both a guess of an output, $f(\mathbf{x}^{(*)})$,
and an ‘error bar’ $\sigma(\mathbf{x}^{(*)})$, which indicates how wrong the guess could be.
The Bayesian linear regression and Gaussian process models covered in lectures
also give error bars on their predictions. What are the pros and cons of the neural
network approach in this question? Would you use this neural network to help
guide which experiments to run?