GANs + Final practice questions
Lecture 23
CS 753
Instructor: Preethi Jyothi
Final Exam Syllabus
1. WFST algorithms/WFSTs used in ASR
2. HMM algorithms/EM/Tied state Triphone models
3. DNN-based acoustic models
4. N-gram/Smoothing/RNN language models
5. End-to-end ASR (CTC, LAS, RNN-T)
6. MFCC feature extraction
7. Search & Decoding
8. HMM-based speech synthesis models
9. Multilingual ASR
10. Speaker Adaptation
11. Discriminative training of HMMs
Questions can be asked on any of the 11 topics listed above. You will be allowed a single A4 cheat sheet of handwritten notes; content on both sides is permitted.
Final Project
Deliverables
• Break-up of 20 points:
GAN training is formulated as a minimax game between the generator G and the discriminator D over the objective L(G, D), where
L(G, D) = E_{x∈D}[-log D(x)] + E_z[-log(1 - D(G(z)))]
[Excerpt from Algorithm 1 of Goodfellow et al. (2014): each training iteration samples a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z) and applies gradient updates to D and then to G. The gradient-based updates can use any standard gradient-based learning rule; momentum was used in the original experiments.]
• Original generator cost: J_G = L_GEN(G, D) = E_z[log(1 - D(G(z)))]
• Modified generator cost: J_G = L_GEN(G, D) = E_z[-log D(G(z))]
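To make the two costs concrete, here is a minimal PyTorch sketch of one alternating GAN update, assuming toy network sizes, random stand-in data and Adam optimizers (all illustrative choices, not from the slides). It computes the discriminator loss L(G, D) above and both generator costs; the modified (non-saturating) cost is the one typically optimized, since log(1 - D(G(z))) gives weak gradients early in training when D easily rejects generated samples.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8  # numerical safety inside the logs

for step in range(100):
    x_real = torch.randn(64, data_dim)   # stand-in for a minibatch of real data
    z = torch.randn(64, latent_dim)      # minibatch of noise from the prior p_g(z)

    # Discriminator step: minimise L(G, D) = E_x[-log D(x)] + E_z[-log(1 - D(G(z)))]
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Original (saturating) generator cost: J_G = E_z[log(1 - D(G(z)))]
    g_loss_original = torch.log(1 - D(G(z)) + eps).mean()  # shown for comparison only
    # Modified (non-saturating) generator cost: J_G = E_z[-log D(G(z))]
    g_loss_modified = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad()
    g_loss_modified.backward()
    opt_G.step()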
Conditional GANs
• Generator and discriminator receive some additional conditioning information.
[Figure: GAN schematic in which the discriminator sees either x_real or x = G(z); both networks additionally receive the conditioning input C, and the generator also takes the noise Z.]
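A minimal sketch of how this conditioning is often implemented (an illustrative assumption, not the specific architecture on the slide): the conditioning vector c is simply concatenated to the inputs of both networks.

import torch
import torch.nn as nn

latent_dim, cond_dim, data_dim = 16, 10, 32
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(8, latent_dim)                 # noise
c = torch.randn(8, cond_dim)                   # conditioning info (e.g. a class/text/image embedding)
x_fake = G(torch.cat([z, c], dim=1))           # generator sees (z, c)
score = D(torch.cat([x_fake, c], dim=1))       # discriminator sees (x, c)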
Image-to-image Translation using C-GANs
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory, UC Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
[Figure: example input/output pairs such as labels to street scene, labels to facade, BW to color, and aerial to map.]
Figure 1 caption: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data.
Image from Isola et al., CVPR 2017, https://arxiv.org/pdf/1611.07004.pdf
Text-to-Image Synthesis
[Figure 1 from the paper: examples of images generated from text descriptions, e.g. "this small bird has a pink breast and crown, and black primaries and secondaries", "this magnificent fellow is almost all black with a red crest, and white cheek patch", "the flower has petals that are bright pinkish purple with white stigma", "this white and yellow flower have thin white petals and a round yellow stamen".]
Image from Reed et al., ICML 2016, https://arxiv.org/pdf/1605.05396.pdf
Text-to-Image Synthesis
Generative Adversarial Text to Image Synthesis
[Figure: generated flower images for the caption "This flower has small, round violet petals with a dark purple center".]
Excerpt from the paper: the encoded text description is concatenated to the noise vector z. Following this, inference proceeds as in a normal deconvolutional network: we feed-forward it through the generator G and a synthetic image is produced. [Algorithm 1 in the paper: the GAN-CLS training algorithm with step size α, using minibatch SGD for simplicity; its inputs include a minibatch of images x, matching text t, and mismatching text.]
Image from Reed et al., ICML 2016, https://arxiv.org/pdf/1605.05396.pdf
Three Speech Applications of GANs
GANs for speech synthesis
• The Generator produces synthesised speech features; the Discriminator distinguishes them from real speech features.
• During synthesis, random noise + linguistic features are fed to the Generator, which generates the speech features. The Generator is trained by combining an MSE loss with the adversarial (binary classification) loss.
[Figure: GAN setup for speech synthesis, with linguistic features and noise as Generator inputs, and natural vs. predicted samples presented to the Discriminator.]
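A hedged sketch of such a combined generator loss, with made-up feature dimensions and an assumed weighting w_adv (the exact networks and weights used in the referenced systems are not shown on the slide):

import torch
import torch.nn as nn
import torch.nn.functional as F

ling_dim, noise_dim, feat_dim = 40, 8, 60
G = nn.Sequential(nn.Linear(ling_dim + noise_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

ling = torch.randn(32, ling_dim)      # linguistic features for a minibatch of frames
noise = torch.randn(32, noise_dim)    # random noise
natural = torch.randn(32, feat_dim)   # stand-in for natural (real) speech features

pred = G(torch.cat([ling, noise], dim=1))        # predicted speech features
mse_loss = F.mse_loss(pred, natural)             # match the natural features
adv_loss = -torch.log(D(pred) + 1e-8).mean()     # adversarial term: try to fool D
w_adv = 0.1                                      # assumed weighting between the two terms
g_loss = mse_loss + w_adv * adv_loss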
GANs for speech enhancement (SEGAN)
Figure 1 from the paper: GAN training process. First, D back-props a batch of real examples. Then, D back-props a batch of fake examples that come from G, and classifies them as fake. Finally, D's parameters are frozen and G back-props to make D misclassify.
Excerpt from the paper: "... coming from G, X̂, have to be classified as fake. This leads to G trying to fool D, and the way to do so is that G adapts its parameters such that D classifies G's output as real. During back-propagation, D gets better at finding realistic features in its input and, in turn, G corrects its parameters to move towards the real data manifold described by the training data (Fig. 1). This adversarial learning process is formulated as a minimax game."
Figure 2 from the paper: Encoder-decoder architecture for speech enhancement (G network). The arrows between encoder and decoder blocks denote skip connections.
Image from https://arxiv.org/pdf/1703.09452.pdf
Voice Conversion Using Cycle-GANs
Recall that a pronunciation lexicon L maps a sequence of phones to a sequence of words. In this
problem, we shall modify L in order to handle some limited forms of interruptions in speech (a.k.a.
disfluencies). We will consider a dictionary of two words: W1 with the phone sequence “a b c” and W2
with the phone sequence “x y z”.
b) We want to modify L such that it accounts for “breaks” when the speaker stops in the middle of a
word and says the word all over again. For instance, the word W1 may be pronounced as “a b ⟨break⟩
a b c,” where ⟨break⟩ is a special token produced by the acoustic model. In a valid phone sequence,
breaks are allowed to appear only within a word, and not at the end or beginning of a word. Further,
two consecutive ⟨break⟩ tokens are not allowed. But a word can be pronounced with an arbitrary
number of breaks. E.g., W1 can also be pronounced as “a b ⟨break⟩ a ⟨break⟩ a b ⟨break⟩ a b c”. Let L1
be an FST (obtained by modifying L from the previous part) that accepts all valid phone sequences
with breaks, and outputs a corresponding sequence of words. Draw the state diagram of L1.
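For reference, here is a small Python sketch of the unmodified lexicon L for this two-word dictionary, written as a list of FST arcs; the tuple representation and state names are illustrative assumptions, not the required answer to the question.

# Hedged sketch: the base lexicon L (before adding break/pause handling), as arcs
# (src_state, dst_state, input_phone, output_word). "eps" denotes an epsilon output.
# State 0 is initial and state "F" is final; to accept a sequence of words, an
# epsilon arc from "F" back to 0 (i.e. the closure of L) is typically added.
L_arcs = [
    (0, 1, "a", "W1"), (1, 2, "b", "eps"), (2, "F", "c", "eps"),   # W1: a b c
    (0, 3, "x", "W2"), (3, 4, "y", "eps"), (4, "F", "z", "eps"),   # W2: x y z
]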
Handling disfluencies in ASR
Recall that a pronunciation lexicon L maps a sequence of phones to a sequence of words. In this
problem, we shall modify L in order to handle some limited forms of interruptions in speech (a.k.a.
disfluencies). We will consider a dictionary of two words: W1 with the phone sequence “a b c” and
W2 with the phone sequence “x y z”.
c) Next, we want to modify L1 such that it can account for both “breaks” and “pauses.” A pause
corresponds to when the speaker briefly stops in the middle of a word and continues. For instance,
the word W1 may be pronounced as “a b ⟨pause⟩ c”, “a ⟨break⟩ a ⟨pause⟩ b ⟨break⟩ a b c,” etc.
where ⟨pause⟩ is another special token produced by the acoustic model. In a valid phone
sequence, these special tokens are allowed to appear only within a word, and two consecutive
special tokens are not allowed. Let L2 be an FST (obtained by modifying L1 from the previous
part) that accepts all valid phone sequences with breaks and pauses, and outputs a corresponding
sequence of words. Draw the state diagram of L2.
Mixed Bag
An HMM-based speech synthesis system can be described using the following steps:
(A) Spectral and excitation features are extracted from a speech database
(B) Context-dependent HMMs are trained on these features
(C) These HMMs are clustered using a decision tree
(D) Durations of the HMM models are explicitly modeled
At synthesis time, for a given text sequence, the decision tree yields the appropriate HMM state sequence, which in turn determines the output spectral and excitation features (that are passed through a synthesis filter to produce speech). Say we want to add expressivity to the synthesized speech: i.e., we want to make the voice sound happy or sad, friendly or stern. Pick one of the above-mentioned steps (A)-(D) that you would modify to add expressivity and briefly justify your choice.
Mixed Bag
d) Find the probability, Pr(drank|Mohan), given the following bigram counts:
Mohan drank 10
drank coffee 1
Mohan coffee 10
drank Mohan 5
Mohan ate 10
drank water 20
Pr(drank|Mohan) = ________ [1 pts]
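As a reminder of the arithmetic involved (the general maximum-likelihood estimate, not the specific answer), a bigram probability is estimated from counts as
Pr(w2 | w1) = c(w1, w2) / Σ_w c(w1, w),
i.e. the count of the bigram divided by the total count of bigrams starting with w1.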
Say you have an n-gram distribution which is smoothed using add-α smoothing for some α > 0. The entropy of the smoothed distribution is
(A) equal to (B) less than (C) greater than
the entropy of the original unsmoothed n-gram distribution. Pick one of (A), (B) or (C) and briefly justify your choice. [2 pts]
Mixed Bag
Problem 5: Mixed bag (22 points)
a) Recall neural network language models (NNLMs) as shown in the schematic diagram below. For a given context of fixed length, each word in the context (drawn from a vocabulary of size N) is projected onto a P-dimensional projection layer using a common N × P projection matrix, that is shared across the different word positions in the context. The value of the ith node in the output layer corresponds directly to the probability of a word i given its context.
[Figure: NNLM schematic. Context words w_{j-n+1}, ..., w_{j-1} feed shared projections into a projection layer of size P, followed by a hidden layer of size H and an output layer of size O with a softmax over the vocabulary of size N.]
The complexity to calculate probabilities using this NNLM is quite high. Describe one main reason why this evaluation is very costly in processing time. [3 pts]
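For intuition about where the time goes, here is a hedged back-of-the-envelope sketch in Python; all layer sizes are assumptions chosen only for illustration.

# Rough multiply-accumulate counts for one forward pass of an NNLM like the one above,
# split by layer. With a large vocabulary N, the output layer (softmax over all N words)
# dominates the cost. All sizes below are assumed values.
N, P, H, context = 100_000, 100, 500, 3     # vocabulary, projection, hidden sizes, context length

projection_ops = context * P                # copying the shared P-dim projection per context word (cheap lookups)
hidden_ops = (context * P) * H              # projection layer -> hidden layer
output_ops = H * N                          # hidden layer -> output layer, then softmax over N
print(f"hidden: {hidden_ops:,} MACs, output: {output_ops:,} MACs")
# e.g. hidden: 150,000 MACs vs. output: 50,000,000 MACs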
CTC Alignments
Recall that CTC uses a mapping B that takes an alignment a = (a_1, ..., a_T) and produces an output sequence y = (y_1, ..., y_N) by first compressing each run of an identical character to a single character, and then removing all occurrences of the special blank symbol ⟨blank⟩. Here y_j ∈ V and a_i ∈ V ∪ {⟨blank⟩}, where V is the output vocabulary. Given an input sequence x of length T and an output character sequence y of length N, the CTC objective function is given by:
Pr_CTC(y|x) = Σ_{a: B(a)=y} Pr(a|x).
a) CTC makes an independence assumption about the predicted labels at each time step. Explicitly state this assumption, and fill in the blank below by applying this independence condition. [1 pts]
Independence Assumption: P_CTC(y|x) = Σ_{a: B(a)=y} P(a|x) = ________
b) Consider a different definition of B which first removes all occurrences of the blank symbol, and then
compresses each run of an identical character to a run of length 1. Give an example of a sequence y such
that there is no a with B(a) = y, for this new B. Briefly justify your answer. [1 pts]
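For concreteness, here is a small Python sketch of the original mapping B from the setup above (merge each run of identical labels, then delete blanks); the symbol names and helper function are illustrative assumptions.

BLANK = "<blank>"

def collapse_standard(a):
    out = []
    for i, sym in enumerate(a):
        if i > 0 and sym == a[i - 1]:   # first, merge repeated labels within a run
            continue
        if sym == BLANK:                # then, drop blank symbols
            continue
        out.append(sym)
    return out

# e.g. B(h, h, <blank>, e, <blank>, l, l, <blank>, l, o) = (h, e, l, l, o)
print(collapse_standard(["h", "h", BLANK, "e", BLANK, "l", "l", BLANK, "l", "o"]))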
CTC Alignments
c) Now suppose we would like to avoid the use of the blank symbol altogether. Towards this, we define a new B which works as follows. Given a = (a_1, ..., a_T), B defines the sequence ((c_1, ℓ_1), (c_2, ℓ_2), ..., (c_M, ℓ_M)) where c_i ≠ c_{i+1} and ℓ_i > 0 for all i, and a = (c_1, ..., c_1, c_2, ..., c_2, ..., c_M, ..., c_M) with each c_i repeated ℓ_i times.
Then B calculates the average run length ℓ̄ = (1/M) Σ_{i=1}^{M} ℓ_i, and outputs
y = (c_1, ..., c_1, c_2, ..., c_2, ..., c_M, ..., c_M) with each c_i repeated k_i times,
where k_i = max{1, ⌊ℓ_i/ℓ̄⌋}. Here, k_i is an estimate of how many times c_i needs to be repeated, depending on how ℓ_i compares with the average run length ℓ̄.
For example, B(a, a, b, b, b, b, b, b, b, b, c, c) = (a, b, b, c) because ℓ_1 = 2, ℓ_2 = 8, ℓ_3 = 2 and therefore k_1 = 1, k_2 = 2, k_3 = 1.
Give an example of a sequence y such that there is no a with B(a) = y, for this new B. Briefly justify your
answer. [2 pts]
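A hedged Python sketch of the new mapping B defined in part (c), useful for sanity-checking candidate sequences against the worked example above (the function name is an illustrative choice):

from itertools import groupby

def collapse_runs_average(a):
    runs = [(c, len(list(g))) for c, g in groupby(a)]   # ((c_1, l_1), ..., (c_M, l_M))
    avg = sum(l for _, l in runs) / len(runs)           # average run length l-bar
    out = []
    for c, l in runs:
        k = max(1, int(l // avg))                       # k_i = max{1, floor(l_i / l-bar)}
        out.extend([c] * k)
    return out

# Reproduces the example from the problem: B(a,a,b,b,b,b,b,b,b,b,c,c) = (a, b, b, c)
print(collapse_runs_average(list("aabbbbbbbbcc")))      # ['a', 'b', 'b', 'c']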