Exam Long Questions
Instructions: Below are 6 questions. Exactly 2 of these questions will appear on your
exam. You can upload two files, one for each answer. You should explain your answers, even if
not explicitly asked to do so. For example, if a question asks “what is the derivative of equation
E”, you should not only give the derivative but also outline how you obtained it.
The exam is open-book, in the sense that you can consult the textbook and on-line resources
like Wikipedia. The university policy on academic dishonesty and plagiarism (cheating) will be
taken very seriously in this course. Everything submitted should be your own writing or coding.
You must not let other students copy your work. If I determine that you have copied, you will
receive 0 marks.
Group Work: Discussing the problems is okay, for example to understand the concepts involved.
If you work in a group, put down the names of all members of your group. There should be no
group submissions. Each group member should write up their own solution to show their own
understanding.
Question on Recurrent Neural Networks
We want to process two binary input sequences with 0-1 entries and determine if they are
equal. For notation, let x1 = x1(1), x1(2),…,x1(T) be the first input sequence and x2 = x2(1), x2(2),…,x2(T)
be the second. We use the RNN architecture shown in the Figure.
h(t) = g(W x(t) + b)
y(t) = g(vᵀ h(t) + r y(t−1) + c) for t > 1
where vᵀ is the transpose of the vector v, x(t) collects the two inputs x1(t) and x2(t) at time t, and the activation function g is defined as follows.
Specify parameter values that correctly implement this function, as in the table shown below. (You
do not have to write your answer in the table.) Justify why you think your parameter values are
correct.
W Your solution
b Your solution
v Your solution
r Your solution
c Your solution
c0 Your solution
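For orientation, here is a minimal sketch of the recurrence in Python with placeholder parameters. It makes three labeled assumptions: x(t) stacks the two inputs, g is a hard threshold (substitute the definition from the figure), and c0 initializes y(0). The placeholder parameter values are not a solution.

```python
import numpy as np

def g(z):
    # Placeholder activation: hard threshold at 0. Replace this with the
    # definition of g given in the exam figure.
    return (np.asarray(z) >= 0).astype(float)

def run_rnn(x1, x2, W, b, v, r, c, c0):
    # Forward pass of h(t) = g(W x(t) + b) and
    # y(t) = g(v^T h(t) + r y(t-1) + c).
    y = c0  # assumed reading: c0 is the initial value y(0)
    for t in range(len(x1)):
        x_t = np.array([x1[t], x2[t]])  # assumed: x(t) stacks both inputs
        h_t = g(W @ x_t + b)
        y = float(g(v @ h_t + r * y + c))
    return y  # should end up 1 iff the two sequences are equal

# Example call with zero placeholder (not correct) parameters:
W = np.zeros((2, 2)); b = np.zeros(2); v = np.zeros(2)
print(run_rnn([0, 1, 1], [0, 1, 0], W, b, v, r=0.0, c=0.0, c0=0.0))
```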
Question on Gated Recurrent Neural Networks
Suppose we want to build an RNN cell that sums its inputs over time.
1. For the LSTM architecture as explained in Section 4.6 of the text, what should be the
values of the input gate and the forget gate?
2. For the GRU architecture, what should be the values of the reset gate and the update
gate? The GRU architecture is described in the slides and in on-line sources like this one:
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
(A sketch for testing candidate gate values appears after this list.)
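As a study aid (not an answer), the sketch below runs the standard LSTM cell-state recurrence c(t) = f(t)·c(t−1) + i(t)·g(t) with constant gate values of your choice, so you can test whether a candidate setting really sums the inputs. To isolate the gating behaviour it assumes the candidate value g(t) is just the raw input x(t), a simplification of the full LSTM.

```python
def cell_state_trace(xs, i_gate, f_gate):
    # LSTM cell-state recurrence: c(t) = f(t) * c(t-1) + i(t) * g(t).
    # Simplifying assumption: the candidate g(t) equals the raw input x(t),
    # so any deviation from a running sum is due to the gates alone.
    c = 0.0
    trace = []
    for x in xs:
        c = f_gate * c + i_gate * x
        trace.append(c)
    return trace

# Does your proposed gate setting reproduce the running sums 1, 3, 6?
print(cell_state_trace([1.0, 2.0, 3.0], i_gate=0.5, f_gate=0.5))
```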
Question on Variational Auto-Encoders
1. Compare the loss function for the Variational Auto-Encoder (Figure 7.8 in the text) to
the loss function for an associative auto-encoder (Figure 7.1 in the text). Which parts are
similar and which are different?
2. How does the VAE architecture allow it to generate new data points, especially
compared to the associative auto-encoder, which cannot generate new data points?
3. Let d be the latent embedding dimension. The VAE encoder outputs a mean vector
μ = (m1, m2, …, md) and a variance vector σ = (s1, …, sd), where each si ≥ 0. The variational loss
function for this output is given by
½ ∑i=1,…,d [(si)² + (mi)² − ln((si)²) − 1].
(This equation is somewhat different from the book.) Show that this variational loss is
minimized when μ = 0 and σ = 1 (i.e. all the means are 0 and all the variances are 1). A quick
numerical check of this claim appears after this list.
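As a sanity check (not a substitute for the requested proof), the snippet below evaluates the loss above and shows that perturbing μ away from 0 or σ away from 1 increases it:

```python
import numpy as np

def variational_loss(mu, sigma):
    # 0.5 * sum_i [ (s_i)^2 + (m_i)^2 - ln((s_i)^2) - 1 ], as in the question.
    s2 = sigma ** 2
    return 0.5 * np.sum(s2 + mu ** 2 - np.log(s2) - 1.0)

d = 4
print(variational_loss(np.zeros(d), np.ones(d)))      # 0.0 at mu=0, sigma=1
print(variational_loss(np.full(d, 0.1), np.ones(d)))  # > 0: nonzero means
print(variational_loss(np.zeros(d), np.full(d, 1.1))) # > 0: variances != 1
```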
Question on Generative Adversarial Networks
Consider training a GAN with generator output G(z) and discriminator output D(G(z)). Figure 1
shows the training losses for two generator loss functions. In detail, for the first m generated
points, i = 1, …, m, each point in the plot shows the value of D(G(zi)) together with J1(G) and
J2(G), defined as follows.
1. J1(G) = −(1/m) ∑i=1,…,m ln(D(G(zi))). Shown as the blue curve.
2. J2(G) = (1/m) ∑i=1,…,m ln(1 − D(G(zi))). Shown as the orange curve.
Figure 1: Cost function of the generator plotted against the output of the discriminator
when given a generated image G(z). Concerning the discriminator’s output, we consider
that 0 (resp. 1) means that the discriminator thinks the input “has been generated by G”
(resp. “comes from real data”).
1. Early in the training, is the value of D(G(z)) closer to 0 or closer to 1? Explain why.
Solution: The value of D(G(z)) is closer to 0 because early in the training D is much better
than G. One reason is that G’s task (generating images that look like real data) is a lot
harder to learn than D’s task (distinguishing fake images from real images).
2. Which of the two cost functions would you choose to train your GAN? Justify your
answer.
Solution: I would use the “non-saturating cost” because it leads to much higher gradients
early in the training and thus helps the generator learn quicker.
3. A GAN is successfully trained when D(G(z)) is close to 1. True or False? Explain your
answer.
Solution: False. At the end of the training G is able to fool D, so D(G(z)) is close to 0.5,
which means that D is randomly guessing.
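To make the gradient comparison in the solution to question 2 concrete, here is a small sketch that treats each per-sample cost as a function of d = D(G(z)) (the quantity on Figure 1's horizontal axis, per the caption) and compares gradient magnitudes where the generator starts out, near d = 0:

```python
import numpy as np

d = np.linspace(0.01, 0.99, 99)  # discriminator output on a fake sample
J1 = -np.log(d)                  # per-sample non-saturating cost (blue curve)
J2 = np.log(1.0 - d)             # per-sample saturating cost (orange curve)

grad_J1 = -1.0 / d               # dJ1/dd: large magnitude near d = 0
grad_J2 = -1.0 / (1.0 - d)       # dJ2/dd: magnitude stays near 1 for small d

# Early in training d is close to 0, so J1 provides much stronger gradients:
print(abs(grad_J1[0]), abs(grad_J2[0]))  # ~100 vs ~1.01
```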
Question on Seq2Seq Problems
Write out the input if the French sentence is “A B C D E F” and the English is “M N O P Q R S T”.
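For reference, one common training-time layout is sketched below. This is an assumption: the exact convention (start/end markers, whether the source is reversed, etc.) should follow the one in the lectures, and the <s> and </s> tokens here are hypothetical placeholders.

```python
# Hypothetical seq2seq training layout (assumed convention, not necessarily
# the lecture's): the decoder input is the target shifted right by a start
# token, and the decoder target ends with a stop token.
french = "A B C D E F".split()
english = "M N O P Q R S T".split()

encoder_input = french                # source sequence fed to the encoder
decoder_input = ["<s>"] + english     # teacher-forcing input to the decoder
decoder_target = english + ["</s>"]   # tokens the decoder must predict

print(encoder_input, decoder_input, decoder_target, sep="\n")
```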
Question on Attention
Consider the original transformer architecture as described in the paper “Attention is All You
Need” (https://arxiv.org/abs/1706.03762) and also in this blog post:
http://jalammar.github.io/illustrated-transformer/. The encoder and the decoder each use
a stack of 6 self-attention modules. For the answers below, assume 1 attention head only (not 8
as in the paper). As shown in the notebook
https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC
the input sentence and output translation are as follows.
Input: “The animal didn't cross the street because it was too tired”
Output: “Das Tier überquerte die Straße nicht, weil es zu müde war”
All the following questions pertain to this example. Treat each word as a separate token, so the
input sequence contains 11 items, and the output sequence also contains 11 items.
1. Consider the third output word “überquerte”. Describe how the representation of this
output word is computed. List the key vectors used (e.g. key vector for first input word),
query vectors used (e.g. query vector for first input word), and value vectors used (e.g.
value vector for first input word). How is each of these key/query/value vectors
computed? (A sketch of the key/query/value computation appears after this question list.)
2. To produce the input and output shown, how many key, query, and value vectors are
computed in total by the encoder? How many by the decoder in total (not just for the
3rd output word)? Fill in the following table and also explain your answer.
3. An attention weight connects one word to other words. For each word in the input
sequence:
a. How many attention weights for other words are computed during encoding?
b. How many attention weights for other words are computed during decoding?
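As a reference point for these questions, here is a minimal single-head scaled dot-product attention sketch in NumPy. The 11×4 toy shapes and random projection matrices are illustrative assumptions, not the paper's dimensions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product attention: each token's query, key, and
    # value vector is a learned linear projection of its representation.
    Q = X @ Wq                                   # one query vector per token
    K = X @ Wk                                   # one key vector per token
    V = X @ Wv                                   # one value vector per token
    scores = Q @ K.T / np.sqrt(K.shape[1])       # attention logits
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # one weight per word pair
    return weights @ V                           # weighted sum of values

# Toy run: 11 tokens (as in the example sentence), 4-dim representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(11, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (11, 4)
```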