CS236 Homework 1
$$\arg\max_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x,y)} \left[ \log p_\theta(y \mid x) \right] \;=\; \arg\min_{\theta \in \Theta} \; \mathbb{E}_{\hat{p}(x)} \left[ D_{\mathrm{KL}}\!\left( \hat{p}(y \mid x) \,\|\, p_\theta(y \mid x) \right) \right].$$
where we assume a diagonal covariance structure for modeling each of the Gaussians in the mixture. Such a model is parameterized by $\theta = (\pi_1, \pi_2, \ldots, \pi_k, \mu_1, \mu_2, \ldots, \mu_k, \sigma)$, where $\pi_i \in \mathbb{R}_{++}$, $\mu_i \in \mathbb{R}^n$, and $\sigma \in \mathbb{R}_{++}$. Now consider the multi-class logistic regression model for directly predicting y from x as:
$$p_\gamma(y \mid x) = \frac{\exp(x^\top w_y + b_y)}{\sum_{i=1}^{k} \exp(x^\top w_i + b_i)}, \tag{3}$$
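For concreteness, here is a minimal NumPy sketch of how the softmax in Eq. (3) turns the scores $x^\top w_i + b_i$ into class probabilities. The function name and the stacking of the weight vectors into a matrix W are illustrative assumptions, not part of the assignment's starter code.

```python
import numpy as np

def logistic_regression_probs(x, W, b):
    """Multi-class logistic regression as in Eq. (3).

    x: (n,) input vector
    W: (k, n) matrix whose rows are the class weight vectors w_i
    b: (k,) vector of class biases b_i
    Returns a (k,) vector of probabilities p_gamma(y = i | x).
    """
    logits = W @ x + b                    # scores x^T w_i + b_i
    logits = logits - logits.max()        # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # normalize over the k classes
```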
1. [2 points] Without any conditional independence assumptions, what is the total number of independent
parameters needed to describe the joint distribution over (X1 , . . . , Xn )?
2. [12 points] Let 1, 2, . . . , n denote the topological sort for a Bayesian network for the random variables
X1 , X2 , . . . , Xn . Let m be a positive integer in {1, 2, . . . , n − 1}. Suppose, for every i > m, the random
variable Xi is conditionally independent of all of its other ancestors given the m immediately preceding variables in the topological ordering. Mathematically, we impose the independence assumptions
$$X_i \perp \{X_1, \ldots, X_{i-m-1}\} \mid \{X_{i-m}, \ldots, X_{i-1}\}$$
for i > m. For i ≤ m, we impose no conditional independence of Xi with respect to its ancestors.
Derive the total number of independent parameters needed to specify the joint distribution over (X1 , . . . , Xn ).
3. [2 points] Under what independence assumptions is it possible to represent the joint distribution over (X1 , . . . , Xn ) with $\sum_{i=1}^{n} (k_i - 1)$ total independent parameters?
Your friend chooses to factor the model in the reverse order using equally powerful neural networks $\{\hat{\mu}_i\}_{i=1}^{n}$ and $\{\hat{\sigma}_i\}_{i=1}^{n}$ that can represent any function $\hat{\mu}_i : \mathbb{R}^{n-i} \to \mathbb{R}$ and $\hat{\sigma}_i : \mathbb{R}^{n-i} \to \mathbb{R}_{++}$:
$$p_r(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_r(x_i \mid x_{>i}) = \prod_{i=1}^{n} \mathcal{N}\!\left(x_i \mid \hat{\mu}_i(x_{>i}), \hat{\sigma}_i^2(x_{>i})\right), \tag{7}$$
Do these models cover the same hypothesis space of distributions? In other words, given any choice of $\{\mu_i, \sigma_i\}_{i=1}^{n}$, does there always exist a choice of $\{\hat{\mu}_i, \hat{\sigma}_i\}_{i=1}^{n}$ such that $p_f = p_r$? If yes, provide a proof. Else, provide a counterexample.
[Hint: Consider the case where n = 2.]
When z is high dimensional, exact evaluation of the marginal likelihood is computationally intractable even if we can tractably evaluate the prior and the conditional likelihood for any given x and z. We can, however, use Monte Carlo to estimate the above integral. To do so, we draw k samples from the prior p(z), and our estimate is given as:
$$A(z^{(1)}, \ldots, z^{(k)}) = \frac{1}{k} \sum_{i=1}^{k} p(x \mid z^{(i)}), \quad \text{where } z^{(i)} \sim p(z). \tag{10}$$
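To make the estimator in Eq. (10) concrete, here is a small sketch under an assumed toy latent-variable model with prior p(z) = N(0, 1) and likelihood p(x | z) = N(x; z, 1); neither this model nor the function name comes from the assignment.

```python
import numpy as np

def mc_marginal_likelihood(x, k, seed=0):
    """Monte Carlo estimate A(z^(1), ..., z^(k)) of p(x), as in Eq. (10).

    Toy model (an assumption for illustration): p(z) = N(0, 1), p(x|z) = N(x; z, 1).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(k)                                       # z^(i) ~ p(z)
    likelihoods = np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)   # p(x | z^(i))
    return likelihoods.mean()                                        # (1/k) * sum_i p(x | z^(i))
```

For this toy model the exact marginal is p(x) = N(x; 0, 2), so the estimate can be sanity-checked against np.exp(-x**2 / 4) / np.sqrt(4 * np.pi) as k grows.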
1. [5 points] An estimator θ̂ is an unbiased estimator of θ if and only if E[θ̂] = θ. Show that A is an unbiased
estimator of p(x).
2. [5 points] Is log A an unbiased estimator of log p(x)? Explain why or why not.
[Figure 1 shows, for each position i = 0, . . . , T, the pipeline xi → exi (64-dim embedding) → hi (128-dim LSTM output) → li (657-dim logits) → pi (657-dim probabilities).]
Figure 1: The architecture of our model. T is the sequence length of a given input. xi is the index token. exi is the trainable embedding of token xi . hi is the output of the LSTM network. li is the logit vector and pi is the probability vector. Nodes in gray contain trainable parameters.
There are a total of 657 different characters in NIPS 2015 papers, including alphanumeric characters as well as many non-ASCII symbols. During training, we first convert each character to a number in the range 0 to 656. Then, for each number, we use a 64-dimensional trainable vector as its embedding. The embeddings are then fed into a four-layer LSTM network, where each layer contains 128 units. The output vectors of the LSTM network are finally passed through a fully-connected layer to form a 657-way softmax representing the probability distribution of the next token. See Figure 1 for an illustration.
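As a rough sketch of this architecture (an illustrative assumption, not the actual contents of model.py), the pipeline could be written in PyTorch as:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level language model: embedding -> 4-layer LSTM -> 657-way softmax.
    Illustrative sketch only; the provided pretrained model may be organized differently."""

    def __init__(self, vocab_size=657, embed_dim=64, hidden_dim=128, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # x_i -> e_{x_i}
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True)                   # e_{x_i} -> h_i
        self.fc = nn.Linear(hidden_dim, vocab_size)             # h_i -> l_i

    def forward(self, tokens, state=None):
        # tokens: (batch, T) integer indices in [0, vocab_size)
        emb = self.embedding(tokens)            # (batch, T, 64)
        hidden, state = self.lstm(emb, state)   # (batch, T, 128)
        logits = self.fc(hidden)                # (batch, T, 657)
        probs = torch.softmax(logits, dim=-1)   # p_i over the next token
        return logits, probs, state
```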
Training such models can be computationally expensive, requiring specialized GPU hardware. In this particular
assignment, we provide a pretrained generative model. After loading this pretrained model into PyTorch, you are expected to complete the implementation and answer the following questions.
1. [4 points] Suppose we wish to find an efficient bit representation for the 657 characters. That is, every
character is represented as (a1 , a2 , · · · , an ), where ai ∈ {0, 1}, ∀i = 1, 2, · · · , n. What is the minimal n
that we can use?
2. [6 points] If the size of vocabulary increases from 657 to 900, what is the increase in the number of
parameters? [Hint: You don’t need to consider parameters in the LSTM module in Fig. 1.]
Note: For the following questions, you will need to complete the starter code in designated areas. After the code is completed, run main.py to produce the related files for submission. Run the script ./make_submission.sh to generate hw1.zip and upload it to Gradescope.
3. [10 points] In the starter code, complete the method sample in model.py to generate 5 paragraphs each
of length 1000 from this model.
4. [10 points] Complete the method compute_prob in model.py to compute the log-likelihoods for each
string. Plot a separate histogram of the log-likelihoods of strings within each file.
5. [10 points] Can you determine the category of an input string by only looking at its log-likelihood? We
now provide new strings in snippets.pkl. Try to infer whether the string is generated randomly, copied
from Shakespeare’s work or retrieved from NIPS publications. You will need to complete the code in
main.py.