GANs + Final practice questions
Lecture 23
CS 753
Instructor: Preethi Jyothi
Final Exam Syllabus
1. WFST algorithms/WFSTs used in ASR
2. HMM algorithms/EM/Tied state Triphone models
3. DNN-based acoustic models
4. N-gram/Smoothing/RNN language models
5. End-to-end ASR (CTC, LAS, RNN-T)
6. MFCC feature extraction
7. Search & Decoding
8. HMM-based speech synthesis models
9. Multilingual ASR
10. Speaker Adaptation
11. Discriminative training of HMMs
Questions can be asked on any of the 11 topics listed above. You will be allowed a single A4 cheat sheet of handwritten notes; content on both sides is permitted.
Final Project
Deliverables
• Break-up of 20 points:
GAN training is formulated as a minimax game between the generator G and the discriminator D over the objective L(G, D), where
L(G, D) = E_{x∈D}[-log D(x)] + E_z[-log(1 - D(G(z)))]
[Excerpt from Algorithm 1 of Goodfellow et al. (2014): each training iteration samples a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z) and applies gradient updates to D and then to G. The gradient-based updates can use any standard gradient-based learning rule; momentum was used in the original experiments.]
• Original generator cost: J_G = L_GEN(G, D) = E_z[log(1 - D(G(z)))]
• Modified generator cost: J_G = L_GEN(G, D) = E_z[-log D(G(z))]
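To make the two costs concrete, here is a minimal PyTorch sketch of one alternating GAN update, assuming toy network sizes, random stand-in data and Adam optimizers (all illustrative choices, not from the slides). It computes the discriminator loss L(G, D) above and both generator costs; the modified (non-saturating) cost is the one typically optimized, since log(1 - D(G(z))) gives weak gradients early in training when D easily rejects generated samples.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8  # numerical safety inside the logs

for step in range(100):
    x_real = torch.randn(64, data_dim)   # stand-in for a minibatch of real data
    z = torch.randn(64, latent_dim)      # minibatch of noise from the prior p_g(z)

    # Discriminator step: minimise L(G, D) = E_x[-log D(x)] + E_z[-log(1 - D(G(z)))]
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Original (saturating) generator cost: J_G = E_z[log(1 - D(G(z)))]
    g_loss_original = torch.log(1 - D(G(z)) + eps).mean()  # shown for comparison only
    # Modified (non-saturating) generator cost: J_G = E_z[-log D(G(z))]
    g_loss_modified = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad()
    g_loss_modified.backward()
    opt_G.step()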
Conditional GANs
• Generator and discriminator receive some additional conditioning information.
[Figure: GAN schematic in which the discriminator sees either x_real or x = G(z); both networks additionally receive the conditioning input C, and the generator also takes the noise Z.]
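A minimal sketch of how this conditioning is often implemented (an illustrative assumption, not the specific architecture on the slide): the conditioning vector c is simply concatenated to the inputs of both networks.

import torch
import torch.nn as nn

latent_dim, cond_dim, data_dim = 16, 10, 32
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(8, latent_dim)                 # noise
c = torch.randn(8, cond_dim)                   # conditioning info (e.g. a class/text/image embedding)
x_fake = G(torch.cat([z, c], dim=1))           # generator sees (z, c)
score = D(torch.cat([x_fake, c], dim=1))       # discriminator sees (x, c)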
Image-to-image Translation using C-GANs
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory, UC Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
[Figure: example input/output pairs such as labels to street scene, labels to facade, BW to color, and aerial to map.]
Figure 1 caption: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data.
Image from Isola et al., CVPR 2017, https://arxiv.org/pdf/1611.07004.pdf
Text-to-Image Synthesis
[Figure 1 from the paper: examples of images generated from text descriptions, e.g. "this small bird has a pink breast and crown, and black primaries and secondaries", "this magnificent fellow is almost all black with a red crest, and white cheek patch", "the flower has petals that are bright pinkish purple with white stigma", "this white and yellow flower have thin white petals and a round yellow stamen".]
Image from Reed et al., ICML 2016, https://arxiv.org/pdf/1605.05396.pdf
Text-to-Image Synthesis
Generative Adversarial Text to Image Synthesis
[Figure: generated flower images for the caption "This flower has small, round violet petals with a dark purple center".]
Excerpt from the paper: the encoded text description is concatenated to the noise vector z. Following this, inference proceeds as in a normal deconvolutional network: we feed-forward it through the generator G and a synthetic image is produced. [Algorithm 1 in the paper: the GAN-CLS training algorithm with step size α, using minibatch SGD for simplicity; its inputs include a minibatch of images x, matching text t, and mismatching text.]
Image from Reed et al., ICML 2016, https://arxiv.org/pdf/1605.05396.pdf
Three Speech Applications of GANs
GANs for speech synthesis
• The Generator produces synthesised speech features; the Discriminator distinguishes them from real speech features.
• During synthesis, random noise + linguistic features are fed to the Generator, which generates the speech features. The Generator is trained by combining an MSE loss with the adversarial (binary classification) loss.
[Figure: GAN setup for speech synthesis, with linguistic features and noise as Generator inputs, and natural vs. predicted samples presented to the Discriminator.]
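A hedged sketch of such a combined generator loss, with made-up feature dimensions and an assumed weighting w_adv (the exact networks and weights used in the referenced systems are not shown on the slide):

import torch
import torch.nn as nn
import torch.nn.functional as F

ling_dim, noise_dim, feat_dim = 40, 8, 60
G = nn.Sequential(nn.Linear(ling_dim + noise_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

ling = torch.randn(32, ling_dim)      # linguistic features for a minibatch of frames
noise = torch.randn(32, noise_dim)    # random noise
natural = torch.randn(32, feat_dim)   # stand-in for natural (real) speech features

pred = G(torch.cat([ling, noise], dim=1))        # predicted speech features
mse_loss = F.mse_loss(pred, natural)             # match the natural features
adv_loss = -torch.log(D(pred) + 1e-8).mean()     # adversarial term: try to fool D
w_adv = 0.1                                      # assumed weighting between the two terms
g_loss = mse_loss + w_adv * adv_loss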
GANs for speech enhancement (SEGAN)
Figure 1 from the paper: GAN training process. First, D back-props a batch of real examples. Then, D back-props a batch of fake examples that come from G, and classifies them as fake. Finally, D's parameters are frozen and G back-props to make D misclassify.
Excerpt from the paper: "... coming from G, X̂, have to be classified as fake. This leads to G trying to fool D, and the way to do so is that G adapts its parameters such that D classifies G's output as real. During back-propagation, D gets better at finding realistic features in its input and, in turn, G corrects its parameters to move towards the real data manifold described by the training data (Fig. 1). This adversarial learning process is formulated as a minimax game."
Figure 2 from the paper: Encoder-decoder architecture for speech enhancement (G network). The arrows between encoder and decoder blocks denote skip connections.
Image from https://arxiv.org/pdf/1703.09452.pdf
Voice Conversion Using Cycle-GANs
Recall that a pronunciation lexicon L maps a sequence of phones to a sequence of words. In this
problem, we shall modify L in order to handle some limited forms of interruptions in speech (a.k.a.
disfluencies). We will consider a dictionary of two words: W1 with the phone sequence “a b c” and W2
with the phone sequence “x y z”.
b) We want to modify L such that it accounts for “breaks” when the speaker stops in the middle of a
word and says the word all over again. For instance, the word W1 may be pronounced as “a b ⟨break⟩
a b c,” where ⟨break⟩ is a special token produced by the acoustic model. In a valid phone sequence,
breaks are allowed to appear only within a word, and not at the end or beginning of a word. Further,
two consecutive ⟨break⟩ tokens are not allowed. But a word can be pronounced with an arbitrary
number of breaks. E.g., W1 can also be pronounced as “a b ⟨break⟩ a ⟨break⟩ a b ⟨break⟩ a b c”. Let L1
be an FST (obtained by modifying L from the previous part) that accepts all valid phone sequences
with breaks, and outputs a corresponding sequence of words. Draw the state diagram of L1.
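For reference, here is a small Python sketch of the unmodified lexicon L for this two-word dictionary, written as a list of FST arcs; the tuple representation and state names are illustrative assumptions, not the required answer to the question.

# Hedged sketch: the base lexicon L (before adding break/pause handling), as arcs
# (src_state, dst_state, input_phone, output_word). "eps" denotes an epsilon output.
# State 0 is initial and state "F" is final; to accept a sequence of words, an
# epsilon arc from "F" back to 0 (i.e. the closure of L) is typically added.
L_arcs = [
    (0, 1, "a", "W1"), (1, 2, "b", "eps"), (2, "F", "c", "eps"),   # W1: a b c
    (0, 3, "x", "W2"), (3, 4, "y", "eps"), (4, "F", "z", "eps"),   # W2: x y z
]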
Handling disfluencies in ASR
Recall that a pronunciation lexicon L maps a sequence of phones to a sequence of words. In this
problem, we shall modify L in order to handle some limited forms of interruptions in speech (a.k.a.
disfluencies). We will consider a dictionary of two words: W1 with the phone sequence “a b c” and
W2 with the phone sequence “x y z”.
c) Next, we want to modify L1 such that it can account for both “breaks” and “pauses.” A pause
corresponds to when the speaker briefly stops in the middle of a word and continues. For instance,
the word W1 may be pronounced as “a b ⟨pause⟩ c”, “a ⟨break⟩ a ⟨pause⟩ b ⟨break⟩ a b c,” etc.
where ⟨pause⟩ is another special token produced by the acoustic model. In a valid phone
sequence, these special tokens are allowed to appear only within a word, and two consecutive
special tokens are not allowed. Let L2 be an FST (obtained by modifying L1 from the previous
part) that accepts all valid phone sequences with breaks and pauses, and outputs a corresponding
sequence of words. Draw the state diagram of L2.
Mixed Bag
An HMM-based speech synthesis system can be described using the following steps:
(A) Spectral and excitation features are extracted from a speech database
(B) Context-dependent HMMs are trained on these features
(C) These HMMs are clustered using a decision tree
(D) Durations of the HMM models are explicitly modeled
At synthesis time, for a given text sequence, the decision tree yields the appropriate HMM state sequence, which in turn determines the output spectral and excitation features (that are passed through a synthesis filter to produce speech). Say we want to add expressivity to the synthesized speech: i.e., we want to make the voice sound happy or sad, friendly or stern. Pick one of the above-mentioned steps (A)-(D) that you would modify to add expressivity and briefly justify your choice.
Mixed Bag
d) Find the probability, Pr(drank|Mohan), given the following bigram counts:
Mohan drank 10
drank coffee 1
Mohan coffee 10
drank Mohan 5
Mohan ate 10
drank water 20
Pr(drank|Mohan) = ________ [1 pts]
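As a reminder of the arithmetic involved (the general maximum-likelihood estimate, not the specific answer), a bigram probability is estimated from counts as
Pr(w2 | w1) = c(w1, w2) / Σ_w c(w1, w),
i.e. the count of the bigram divided by the total count of bigrams starting with w1.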
Say you have an n-gram distribution which is smoothed using add-α smoothing for some α > 0. The entropy of the smoothed distribution is
(A) equal to (B) less than (C) greater than
the entropy of the original unsmoothed n-gram distribution. Pick one of (A), (B) or (C) and briefly justify your choice. [2 pts]
Mixed Bag
Problem 5: Mixed bag (22 points)
a) Recall neural network language models (NNLMs) as shown in the schematic diagram below. For a given context of fixed length, each word in the context (drawn from a vocabulary of size N) is projected onto a P-dimensional projection layer using a common N × P projection matrix, that is shared across the different word positions in the context. The value of the ith node in the output layer corresponds directly to the probability of a word i given its context.
[Figure: NNLM schematic. Context words w_{j-n+1}, ..., w_{j-1} feed shared projections into a projection layer of size P, followed by a hidden layer of size H and an output layer of size O with a softmax over the vocabulary of size N.]
The complexity to calculate probabilities using this NNLM is quite high. Describe one main reason why this evaluation is very costly in processing time. [3 pts]
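For intuition about where the time goes, here is a hedged back-of-the-envelope sketch in Python; all layer sizes are assumptions chosen only for illustration.

# Rough multiply-accumulate counts for one forward pass of an NNLM like the one above,
# split by layer. With a large vocabulary N, the output layer (softmax over all N words)
# dominates the cost. All sizes below are assumed values.
N, P, H, context = 100_000, 100, 500, 3     # vocabulary, projection, hidden sizes, context length

projection_ops = context * P                # copying the shared P-dim projection per context word (cheap lookups)
hidden_ops = (context * P) * H              # projection layer -> hidden layer
output_ops = H * N                          # hidden layer -> output layer, then softmax over N
print(f"hidden: {hidden_ops:,} MACs, output: {output_ops:,} MACs")
# e.g. hidden: 150,000 MACs vs. output: 50,000,000 MACs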
CTC Alignments
Recall that CTC uses a mapping B that takes an alignment a = (a_1, ..., a_T) and produces an output sequence y = (y_1, ..., y_N) by first compressing each run of an identical character to a single character, and then removing all occurrences of the special blank symbol ⟨blank⟩. Here y_j ∈ V and a_i ∈ V ∪ {⟨blank⟩}, where V is the output vocabulary. Given an input sequence x of length T and an output character sequence y of length N, the CTC objective function is given by:
Pr_CTC(y|x) = Σ_{a: B(a)=y} Pr(a|x).
a) CTC makes an independence assumption about the predicted labels at each time step. Explicitly state this assumption, and fill in the blank below by applying this independence condition. [1 pts]
Independence Assumption: P_CTC(y|x) = Σ_{a: B(a)=y} P(a|x) = ________
b) Consider a different definition of B which first removes all occurrences of the blank symbol, and then
compresses each run of an identical character to a run of length 1. Give an example of a sequence y such
that there is no a with B(a) = y, for this new B. Briefly justify your answer. [1 pts]
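For concreteness, here is a small Python sketch of the original mapping B from the setup above (merge each run of identical labels, then delete blanks); the symbol names and helper function are illustrative assumptions.

BLANK = "<blank>"

def collapse_standard(a):
    out = []
    for i, sym in enumerate(a):
        if i > 0 and sym == a[i - 1]:   # first, merge repeated labels within a run
            continue
        if sym == BLANK:                # then, drop blank symbols
            continue
        out.append(sym)
    return out

# e.g. B(h, h, <blank>, e, <blank>, l, l, <blank>, l, o) = (h, e, l, l, o)
print(collapse_standard(["h", "h", BLANK, "e", BLANK, "l", "l", BLANK, "l", "o"]))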
CTC Alignments
c) Now suppose we would like to avoid the use of the blank symbol altogether. Towards this, we define a new B which works as follows. Given a = (a_1, ..., a_T), B defines the sequence ((c_1, ℓ_1), (c_2, ℓ_2), ..., (c_M, ℓ_M)) where c_i ≠ c_{i+1} and ℓ_i > 0 for all i, and a = (c_1, ..., c_1, c_2, ..., c_2, ..., c_M, ..., c_M) with each c_i repeated ℓ_i times.
Then B calculates the average run length ℓ̄ = (1/M) Σ_{i=1}^{M} ℓ_i, and outputs
y = (c_1, ..., c_1, c_2, ..., c_2, ..., c_M, ..., c_M) with each c_i repeated k_i times,
where k_i = max{1, ⌊ℓ_i/ℓ̄⌋}. Here, k_i is an estimate of how many times c_i needs to be repeated, depending on how ℓ_i compares with the average run length ℓ̄.
For example, B(a, a, b, b, b, b, b, b, b, b, c, c) = (a, b, b, c) because ℓ_1 = 2, ℓ_2 = 8, ℓ_3 = 2 and therefore k_1 = 1, k_2 = 2, k_3 = 1.
Give an example of a sequence y such that there is no a with B(a) = y, for this new B. Briefly justify your
answer. [2 pts]
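A hedged Python sketch of the new mapping B defined in part (c), useful for sanity-checking candidate sequences against the worked example above (the function name is an illustrative choice):

from itertools import groupby

def collapse_runs_average(a):
    runs = [(c, len(list(g))) for c, g in groupby(a)]   # ((c_1, l_1), ..., (c_M, l_M))
    avg = sum(l for _, l in runs) / len(runs)           # average run length l-bar
    out = []
    for c, l in runs:
        k = max(1, int(l // avg))                       # k_i = max{1, floor(l_i / l-bar)}
        out.extend([c] * k)
    return out

# Reproduces the example from the problem: B(a,a,b,b,b,b,b,b,b,b,c,c) = (a, b, b, c)
print(collapse_runs_average(list("aabbbbbbbbcc")))      # ['a', 'b', 'b', 'c']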