CS236 Homework 1
$$\arg\max_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x,y)} \left[ \log p_\theta(y \mid x) \right] \;=\; \arg\min_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x)} \left[ D_{\mathrm{KL}}\!\left( \hat{p}(y \mid x) \,\|\, p_\theta(y \mid x) \right) \right].$$
where we assume a diagonal covariance structure for modeling each of the Gaussians in the mixture. Such a model is parameterized by $\theta = (\pi_1, \pi_2, \ldots, \pi_k, \mu_1, \mu_2, \ldots, \mu_k, \sigma)$, where $\pi_i \in \mathbb{R}_{++}$, $\mu_i \in \mathbb{R}^n$, and $\sigma \in \mathbb{R}_{++}$. Now consider the multi-class logistic regression model for directly predicting y from x as:
$$p_\gamma(y \mid x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}, \tag{3}$$
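For concreteness, here is a minimal NumPy sketch of how the softmax in Eq. (3) turns the scores $x^\top w_i + b_i$ into class probabilities. The function name and the stacking of the weight vectors into a matrix W are illustrative assumptions, not part of the assignment's starter code.

```python
import numpy as np

def logistic_regression_probs(x, W, b):
    """Multi-class logistic regression as in Eq. (3).

    x: (n,) input vector
    W: (k, n) matrix whose rows are the class weight vectors w_i
    b: (k,) vector of class biases b_i
    Returns a (k,) vector of probabilities p_gamma(y = i | x).
    """
    logits = W @ x + b                    # scores x^T w_i + b_i
    logits = logits - logits.max()        # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # normalize over the k classes
```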
1. [2 points] Without any conditional independence assumptions, what is the total number of independent
parameters needed to describe the joint distribution over (X1 , . . . , Xn )?
2. [12 points] Let 1, 2, . . . , n denote the topological sort for a Bayesian network for the random variables
X1 , X2 , . . . , Xn . Let m be a positive integer in {1, 2, . . . , n − 1}. Suppose, for every i > m, the random
variable Xi is conditionally independent of all of its other ancestors given the m immediately preceding variables in the topological ordering. Mathematically, we impose the independence assumptions
$$X_i \perp \{X_1, \ldots, X_{i-m-1}\} \mid \{X_{i-m}, \ldots, X_{i-1}\}$$
for i > m. For i ≤ m, we impose no conditional independence of Xi with respect to its ancestors.
Derive the total number of independent parameters needed to specify the joint distribution over (X1 , . . . , Xn ).
3. [2 points] Under what independence assumptions is it possible to represent the joint distribution over (X1 , . . . , Xn ) with $\sum_{i=1}^{n} (k_i - 1)$ total independent parameters?
Your friend chooses to factor the model in the reverse order using equally powerful neural networks $\{\hat{\mu}_i\}_{i=1}^{n}$ and $\{\hat{\sigma}_i\}_{i=1}^{n}$ that can represent any function $\hat{\mu}_i : \mathbb{R}^{n-i} \to \mathbb{R}$ and $\hat{\sigma}_i : \mathbb{R}^{n-i} \to \mathbb{R}_{++}$:
$$p_r(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_r(x_i \mid x_{>i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \hat{\mu}_i(x_{>i}), \hat{\sigma}_i^2(x_{>i})\right), \tag{7}$$
Do these models cover the same hypothesis space of distributions? In other words, given any choice of $\{\mu_i, \sigma_i\}_{i=1}^{n}$, does there always exist a choice of $\{\hat{\mu}_i, \hat{\sigma}_i\}_{i=1}^{n}$ such that $p_f = p_r$? If yes, provide a proof. Else, provide a counterexample.
[Hint: Consider the case where n = 2.]
When z is high dimensional, exact evaluation of the marginal likelihood is computationally intractable even if we can tractably evaluate the prior and the conditional likelihood for any given x and z. We can, however, use Monte Carlo to estimate the above integral. To do so, we draw k samples from the prior p(z), and our estimate is given as:
$$A(z^{(1)}, \ldots, z^{(k)}) = \frac{1}{k} \sum_{i=1}^{k} p(x \mid z^{(i)}), \quad \text{where } z^{(i)} \sim p(z). \tag{10}$$
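To make the estimator in Eq. (10) concrete, here is a small sketch under an assumed toy latent-variable model with prior p(z) = N(0, 1) and likelihood p(x | z) = N(x; z, 1); neither this model nor the function name comes from the assignment.

```python
import numpy as np

def mc_marginal_likelihood(x, k, seed=0):
    """Monte Carlo estimate A(z^(1), ..., z^(k)) of p(x), as in Eq. (10).

    Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(x; z, 1).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(k)                                       # z^(i) ~ p(z)
    likelihoods = np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)   # p(x | z^(i))
    return likelihoods.mean()                                        # (1/k) * sum_i p(x | z^(i))
```

For this toy model the exact marginal is p(x) = N(x; 0, 2), so the estimate can be sanity-checked against np.exp(-x**2 / 4) / np.sqrt(4 * np.pi) as k grows.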
1. [5 points] An estimator θ̂ is an unbiased estimator of θ if and only if E[θ̂] = θ. Show that A is an unbiased
estimator of p(x).
2. [5 points] Is log A an unbiased estimator of log p(x)? Explain why or why not.
[Figure 1 shows, for each position i = 0, . . . , T, the pipeline xi → exi (64-dim embedding) → hi (128-dim LSTM output) → li (657-dim logits) → pi (657-dim probabilities).]
Figure 1: The architecture of our model. T is the sequence length of a given input. xi is the index token. exi is the trainable embedding of token xi . hi is the output of the LSTM network. li is the logit vector and pi is the probability vector. Nodes in gray contain trainable parameters.
There are a total of 657 different characters in NIPS 2015 papers, including alphanumeric characters as well as many non-ASCII symbols. During training, we first convert each character to a number in the range 0 to 656. Then, for each number, we use a 64-dimensional trainable vector as its embedding. The embeddings are then fed into a four-layer LSTM network, where each layer contains 128 units. The output vectors of the LSTM network are finally passed through a fully-connected layer to form a 657-way softmax representing the probability distribution of the next token. See Figure 1 for an illustration.
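As a rough sketch of this architecture (an illustrative assumption, not the actual contents of model.py), the pipeline could be written in PyTorch as:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level language model: embedding -> 4-layer LSTM -> 657-way softmax.
    Illustrative sketch only; the provided pretrained model may be organized differently."""

    def __init__(self, vocab_size=657, embed_dim=64, hidden_dim=128, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # x_i -> e_{x_i}
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True)                   # e_{x_i} -> h_i
        self.fc = nn.Linear(hidden_dim, vocab_size)             # h_i -> l_i

    def forward(self, tokens, state=None):
        # tokens: (batch, T) integer indices in [0, vocab_size)
        emb = self.embedding(tokens)            # (batch, T, 64)
        hidden, state = self.lstm(emb, state)   # (batch, T, 128)
        logits = self.fc(hidden)                # (batch, T, 657)
        probs = torch.softmax(logits, dim=-1)   # p_i over the next token
        return logits, probs, state
```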
Training such models can be computationally expensive, requiring specialized GPU hardware. In this particular
assignment, we provide a pretrained generative model. After loading this pretrained model into PyTorch, you are expected to complete the implementation and answer the following questions.
1. [4 points] Suppose we wish to find an efficient bit representation for the 657 characters. That is, every
character is represented as (a1 , a2 , · · · , an ), where ai ∈ {0, 1}, ∀i = 1, 2, · · · , n. What is the minimal n
that we can use?
2. [6 points] If the size of vocabulary increases from 657 to 900, what is the increase in the number of
parameters? [Hint: You don’t need to consider parameters in the LSTM module in Fig. 1.]
Note: For the following questions, you will need to complete the starter code in designated areas. After the code is completed, run main.py to produce the related files for submission. Run the script ./make_submission.sh to generate hw1.zip and upload it to Gradescope.
3. [10 points] In the starter code, complete the method sample in model.py to generate 5 paragraphs each
of length 1000 from this model.
4. [10 points] Complete the method compute_prob in model.py to compute the log-likelihoods for each
string. Plot a separate histogram of the log-likelihoods of strings within each file.
5. [10 points] Can you determine the category of an input string by only looking at its log-likelihood? We
now provide new strings in snippets.pkl. Try to infer whether the string is generated randomly, copied
from Shakespeare’s work or retrieved from NIPS publications. You will need to complete the code in
main.py.