Audio Word2vec: Sequence-To-Sequence Autoencoding For Unsupervised Learning of Audio Segmentation and Representation
Abstract—In text, word2vec transforms each word into a fixed-size vector used as the basic component in applications of natural language processing. Given a large collection of unannotated audio, audio word2vec can also be trained in an unsupervised way using a sequence-to-sequence autoencoder (SA). These vector representations are shown to effectively describe the sequential phonetic structures of the audio segments. Here, we further extend this research in the following two directions. First, we disentangle phonetic information and speaker information from the SA vector representations. Second, we extend audio word2vec from the word level to the utterance level by proposing a new segmental audio word2vec, in which unsupervised spoken word boundary segmentation and audio word2vec are jointly learned and mutually enhanced, and utterances are directly represented as sequences of vectors carrying phonetic information. This is achieved by means of a segmental sequence-to-sequence autoencoder (SSAE), in which a segmentation gate trained with reinforcement learning is inserted in the encoder.

Index Terms—Audio Word2Vec, Sequence-to-sequence Autoencoder

This work was supported by the Ministry of Science and Technology of Taiwan.

I. INTRODUCTION

Human infants acquire languages with little formal teaching; machines, however, must learn from a large amount of annotated data, which makes the development of speech technology for a new language challenging. For typical spoken language understanding, one can simply convert spoken content into word sequences using an off-the-shelf speech recognizer. However, to train a high-quality speech recognition system, huge quantities of annotated audio data are needed. Therefore, for low-resource languages with scarce annotated data, or languages without written forms, sufficiently accurate speech recognition is difficult to achieve. Some previous work [1], [2] focuses on speech recognition with mismatched crowdsourcing and probabilistic transcriptions.

Annotating audio data for speech recognition is expensive, but unannotated audio data is relatively easy to collect. If the machine can acquire the word patterns behind speech signals from a large collection of unannotated speech data without speech recognition, it would be able to learn a new language in a novel linguistic environment with little supervision. Imagine a Hokkien-speaking family buying an intelligent device: although at first the machine does not understand Hokkien, by hearing people speak it, it automatically learns the language. This paper is one step toward this dream [3]–[5].

In text, word2vec [6]–[8] transforms each word into a fixed-dimension vector used as the basic component of applications of natural language processing. Word2vec is useful because it is learned from a large collection of documents without supervision. It is therefore interesting to ask: given a large collection of unannotated audio, can the machine automatically represent an utterance as a sequence of vector representations, each of which corresponds to a word? The vector representations should describe the sequential phonetic structures of the audio signals. In some papers, such vector representations are called "acoustic embeddings". Thus vector representations for words that sound alike should be located in close proximity to each other in the vector space, regardless of the unique characteristics of the speakers uttering the segments. If this can be achieved, this kind of representation would serve as the basic component of downstream spoken language understanding applications, similar to Word2Vec in natural language processing. For example, consider translating the speech of a low-resource language into the text of another language [9]. With training data in the form of audio paired with text translations, representing the audio by audio Word2Vec would make training more efficient than using acoustic features like MFCCs.

In this paper, a sequence-to-sequence autoencoder (SA) is used to represent variable-length audio segments using fixed-length vectors [10], [11]. With SA, only audio segments without human annotation are needed, which makes it suitable for low-resource applications. Although autoencoding is a successful machine learning technique for extracting representations in an unsupervised way [12], [13], it requires fixed-length input vectors. This is a considerable limitation because audio segments are intrinsically expressed as sequences of arbitrary length. SA, proposed to encode sequences based on sequence-to-sequence learning, has been applied in natural language processing [14], [15] and video processing [16]. SA consists of an RNN encoder and decoder. The RNN encoder reads an audio segment represented as an acoustic feature sequence and maps it to a fixed-length vector representation z; the RNN decoder maps the vector z to another sequence. The RNN encoder and decoder are trained to minimize the reconstruction error of the input acoustic sequence. It has been shown that the representation z contains phonetic information [10], [11]. For example, the vector for "night" subtracted by the vector for "fight" is approximately equal to the vector for "name" subtracted by the vector for "fame" [10].

In this paper, we further improve on the SA vector representations.
Word segmentation of speech is critical but challenging for zero-resource speech technology, because word boundaries are usually not available for given speech utterances or corpora [3], [18]–[20]. Although there are approaches to estimating word boundaries [21]–[25], we hypothesize that the audio segmentation and SA can be integrated and jointly learned, so that they can enhance each other. This means that the machine learns to segment the utterances into a sequence of spoken words while at the same time transforming these spoken words into a sequence of vectors. We propose the segmental sequence-to-sequence autoencoder (SSAE) [26], a new model to jointly train the segmenter while extracting the representation. The SSAE contains a segmentation gate jointly learned with SA from an unlabeled corpus in a completely unsupervised way. During training, the SSAE learns to convert the utterances into sequences of embeddings, and then reconstructs the utterances with these embedding sequences. The only thing needed during training is a guideline for the proper number of vectors (or words) within an utterance of a given length, to ensure that the machine segments the utterances into word-level segments. Since the model is not completely differentiable, standard backpropagation is not applicable [27], [28]; thus we use reinforcement learning to train the SSAE.

In this paper, we employ query-by-example spoken term detection (QbE STD), a real-world application, to evaluate the phonetic structure information of the original utterances contained in these generated word vector sequences. When based on audio word2vec, QbE STD is much more efficient than conventional dynamic time warping (DTW) based approaches, because only the similarities between two single vectors are needed; this is in addition to the significantly better retrieval performance that it yields.

The audio word2vec framework is summarized in Fig. 1. Given a large collection of unannotated audio, it is first segmented into word-level segments. Then the SA model generates an embedding for each audio segment, as described in Section III-A. In Section III-B, we describe how to disentangle speaker and speech content from the embedding. In Section IV, we describe the proposed SSAE model, which jointly learns segmentation and embedding. In Section V the embedding is evaluated on QbE STD.

Fig. 1: Audio word2vec framework

II. RELATED WORK

Audio segment representation is still an open problem. It is common to use i-vectors to represent utterances in speaker identification [29]. However, i-vectors are not designed to precisely describe the sequential phonetic structure of audio segments as desired for our task. Embedding audio word segments into fixed-length vectors also has useful applications in spoken term detection (STD) [30]–[35], in which audio segments are usually represented as feature vectors to be applied to a standard classifier which determines whether the input queries are included [32]–[34]. In previous work, embedding approaches were developed primarily in heuristic ways, rather than learned from data. Graph-based embedding approaches are also used to represent audio segments as fixed-length vectors [30], [31]. Retrieval efficiency is improved by searching audio content using fixed-length vectors instead of the original acoustic features [30], [31].

Recently, deep learning has been used to encode acoustic information as vectors [36]–[44]. This transformation successfully produces vector spaces in which word audio segments with similar phonetic structures are located in close proximity. By training a recurrent neural network (RNN) with an audio segment as the input and the corresponding word as the target, the outputs of the hidden layer at the last few time steps can be taken as the representation of the input segment [37], [45]. However, this approach is supervised and therefore necessitates a large amount of labeled training data. In [38], the authors train a neural network with side information to obtain embeddings that separate same-word and different-word pairs. Since human-annotated data is still required, the scenario is weakly supervised. For non-speech audio, some approaches obtain labeled paired data based on the nature of the signals [46], but these approaches have not yet been applied to speech.

Feature disentanglement of audio features has been studied with variational autoencoders [47]–[49]. In contrast to previous work, the feature disentanglement approach here uses a classifier as discriminator. This approach is borrowed from work on domain adversarial training [17], [50]. Similar ideas have been applied to domain adaptation for speech recognition [51]; while senone labels are needed in the previous work, here the
feature disentanglement is completely unsupervised.

In unsupervised pattern discovery, segmentation followed by clustering is typical [52]–[63] (in most of these approaches, models learn to cluster audio segments instead of producing distributed representations as in this paper; however, we can consider clustering as representing audio segments with a one-hot encoding). Probabilistic Bayesian models have been developed to learn segmentation and representation (or clustering) jointly [21], [25], [64]. Although Bayesian models yield successful results, they do not scale well to large speech corpora; as such, the embedded segmental K-means model has been proposed as an approximation of the Bayesian model [65]; this, however, does not take advantage of deep learning. Another approach to learning segmentation is to use an autoencoder with a sample-based algorithm [66]. First an LSTM is used to model a proposal distribution over sequences of segment boundaries for each utterance. Then m sequences of boundaries are sampled from the distribution to split the utterance into words, and the reconstruction loss for each sequence is obtained with an autoencoder. The losses and the distribution are used to compute an importance weight for each sample and breakpoint; a breakpoint is more likely if it appeared in samples with low reconstruction loss. In comparison, the SSAE proposed in this paper is a "whole network model" using reinforcement learning, which scales well and can be trained in an end-to-end fashion.

This journal paper is an extension of previous conference papers. Both SA [10] and SSAE [26] have been proposed in previous conference papers; SA's language transfer ability has also been verified in a conference paper [11]. However, SA and SSAE have not yet been used with feature disentanglement based on adversarial training.

III. AUDIO REPRESENTATION

The goal of the audio word2vec model is to identify the phonetic patterns in sequences of acoustic features such as MFCCs. Given a word-level audio segment x = (x_1, x_2, ..., x_T), where x_t is the acoustic feature at time t and T is the length, audio word2vec transforms the features into a fixed-length vector z ∈ R^d with dimension d. In this section, we assume the word boundaries are readily available. In the next section we describe how to jointly learn segmentation and representation.

A. Sequence-to-sequence Autoencoder

Recurrent neural networks (RNNs) have shown great success in many NLP tasks with their ability to capture sequential information. The hidden neurons form a directed cycle and perform the same task for every element in a sequence. Given a sequence x = (x_1, x_2, ..., x_T), the RNN updates its hidden state h_t according to the current input x_t and the previous h_{t-1}. The hidden state h_t acts as an internal memory at time t that enables the network to capture dynamic temporal information, and also allows the network to process sequences of variable length.

The RNN encoder-decoder architecture [67], [68] consists of an RNN encoder and an RNN decoder. The encoder reads the input sequence x = (x_1, x_2, ..., x_{T_1}) sequentially, and the hidden state h_t of the RNN is updated accordingly. After the last symbol x_{T_1} is processed, the hidden state h_{T_1} is interpreted as the learned representation of the whole input sequence. Then, taking h_{T_1} as input, the RNN decoder generates the output sequence y = (y_1, y_2, ..., y_{T_2}) sequentially, where T_1 and T_2 can be different; that is, the lengths of x and y can differ. This RNN encoder-decoder framework is thus able to handle variable-length input and output.

Fig. 2: Sequence-to-sequence autoencoder (SA), consisting of an RNN encoder (ER) and an RNN decoder (DR). The encoder reads an audio segment represented as an acoustic feature sequence x = (x_1, x_2, ..., x_T) and maps it to a fixed-length vector representation z; the decoder maps the vector z to another sequence y = (y_1, y_2, ..., y_T). The RNN encoder and decoder are jointly trained such that the output sequence y is as close to the input sequence x as possible.

Fig. 2 depicts the structure of the sequence-to-sequence autoencoder (SA), which integrates the RNN encoder-decoder framework with an autoencoder for the unsupervised learning of audio segment representations. The SA consists of an RNN encoder (the left part of Fig. 2) and decoder (the right part). Given an audio segment represented as an acoustic feature sequence x = (x_1, x_2, ..., x_T) of any length T, the RNN encoder reads each acoustic feature x_t sequentially and the hidden state h_t is updated accordingly. After the last acoustic feature x_T has been read and processed, the hidden state h_T of the encoder is taken as the learned representation z of the input sequence (the vector in the middle of Fig. 2).

The RNN decoder takes h_T as the initial state and generates a sequence y. Based on autoencoder principles [12], [13], the target of the output sequence y = (y_1, y_2, ..., y_T) is the input sequence x = (x_1, x_2, ..., x_T). In other words, the RNN encoder and decoder are jointly trained by minimizing the reconstruction error L_mse,

$$L_{mse} = \sum_{x} \sum_{t=1}^{T_x} \lVert x_t - y_t \rVert^2, \qquad (1)$$

where L_mse is the sum over all the audio segments x in the data collection, and T_x represents the length of the segment x. Because the input sequence is taken as the learning target, the training process requires no labeled data. The fixed-length
vector representation z is thus a meaningful representation for the input audio segment x, because the whole input sequence x can be reconstructed from z with the RNN decoder. Although in Fig. 2 both the RNN encoder and decoder have only one hidden layer, this does not preclude the use of multiple layers.

In Fig. 2, after generating output y_1, instead of taking y_1 as the input of the next time step, a zero vector is used as input to generate y_2, and so on, in contrast to the typical encoder-decoder architecture. This use of historyless decoding is critical here. We found that if a typical decoder is used (that is, the RNN takes y_1 as input to generate y_2, and so on), despite the resultant low reconstruction error, the SA-learned vector representations do not include useful information. This is because a strong decoder focuses less on including more information in the vector representation. Historyless decoding yields a weakened decoder because the input of the decoder is removed, which forces the model to rely more on the vector representation. Historyless decoding is also used in some NLP applications [69]–[71].
B. Feature Disentanglement

Fig. 3: Feature disentanglement. (A) Adding speaker encoder for reconstruction. (B) Additional training criteria for the speaker encoder. (C) Speaker classifier learns to distinguish whether two z's are from the same speaker or not; phonetic encoder attempts to confuse the classifier with z.

Because the representation z extracted by the RNN encoder in Fig. 2 must reconstruct the input signals, it includes not only phonetic information but also speaker, environment, and channel information in various dimensions. Therefore, it is necessary to disentangle the phonetic information from other information.

As shown in Fig. 3 (A), the basic idea of disentanglement is to add an additional speaker encoder. The original encoder in Fig. 2 is the phonetic encoder, which takes the audio segment x as input and outputs the embedding vector z. The speaker encoder has the same RNN architecture as the phonetic encoder; its output embedding is denoted as e. The input of the RNN decoder is the concatenation of the vectors z and e. The phonetic encoder, speaker encoder, and RNN decoder are jointly learned to minimize the reconstruction error L_mse in (1).

Ensuring that z contains phonetic information and e contains speaker information is not as simple as merely minimizing L_mse; rather, to achieve this goal, we require additional training criteria for the speaker encoder, as shown in Fig. 3 (B). The speaker encoder learns to minimize the distance between the e of audio segments uttered by the same speaker [72], and to enlarge the distance between the e of different speakers past a threshold. Given audio segments x^i and x^j, we obtain two embeddings e^i and e^j from the speaker encoder (we use superscripts to represent the indices of whole audio segments, and subscripts for acoustic features within an audio segment). If x^i and x^j are from the same speaker, the speaker encoder learns to minimize ||e^i − e^j||^2. If x^i and x^j are from different speakers, however, the speaker encoder learns to minimize max(λ_d − ||e^i − e^j||^2, 0), such that the distance between e^i and e^j is larger than a predefined threshold λ_d. This assumes that speaker labels are available. If speaker information is not available, the speaker encoder can still be learned by assuming that segments from the same utterance are produced by the same speaker. Although we only consider the speaker information here, it is possible to use the same approach to consider other information such as the channel.

Note that the above constraint on the speaker encoder is not sufficient, because it does not prevent the phonetic encoder from putting the speaker information in z. To address this problem, we borrow the speaker classifier technique from adversarial training [17], as shown in Fig. 3 (C). The inputs of the speaker classifier are the z^i and z^j from two audio segments x^i and x^j; the classifier produces a score representing whether z^i and z^j are from the same speaker. The speaker classifier loss is the same as that for the Wasserstein generative adversarial network (WGAN) [73]: it is defined as the summation of the scores of the pairs from different speakers minus that of the pairs from the same speaker. To minimize this loss, the classifier increases scores for pairs from the same speaker and decreases scores for those from different speakers. Gradient penalty is used to learn the classifier [74]. Note that the learning targets of the speaker classifier and the phonetic encoder are opposites: the phonetic encoder updates its parameters to maximize the loss of the speaker classifier, while the speaker classifier does its best to distill the speaker information from z in order to determine whether z^i and z^j are from the same speaker. Thus, the phonetic encoder tries its utmost to generate z vectors that confuse the speaker classifier. If it successfully achieves this after training, it produces a z that contains no speaker information and an e that contains all the speaker information. The complete procedure for feature disentanglement is shown in Algorithm 1.
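The extra criteria of Fig. 3 (B) and (C) can be summarized in code as follows. This is only a sketch: the network shapes, hidden sizes, and helper names are illustrative, and the gradient penalty of [74] used to train the classifier is omitted for brevity.

```python
# A sketch (not the paper's code) of the additional disentanglement criteria.
import torch
import torch.nn as nn

def speaker_encoder_loss(e_i, e_j, same_speaker, lambda_d=1.0):
    # same_speaker: 1.0 if the two segments share a speaker, else 0.0
    dist = ((e_i - e_j) ** 2).sum(dim=-1)
    pull = dist                                      # same speaker: pull the e's together
    push = torch.clamp(lambda_d - dist, min=0.0)     # different speakers: push past lambda_d
    return (same_speaker * pull + (1.0 - same_speaker) * push).mean()

class SpeakerClassifier(nn.Module):
    # scores a pair (z_i, z_j); a higher score means "same speaker" (WGAN-style critic)
    def __init__(self, dim=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z_i, z_j):
        return self.net(torch.cat([z_i, z_j], dim=-1)).squeeze(-1)

def classifier_loss(scores, same_speaker):
    # minimized by the classifier: raise same-speaker scores, lower different-speaker scores
    return ((1.0 - same_speaker) * scores - same_speaker * scores).mean()

# The phonetic encoder is updated with the opposite objective (-classifier_loss),
# so that it learns to produce z from which speakers cannot be told apart.
```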
IV. JOINTLY LEARNING SEGMENTATION AND REPRESENTATION

A. Segmental Sequence-to-Sequence Autoencoder (SSAE)

The proposed structure for SSAE is depicted in Fig. 4, in which the segmentation gate is inserted into the SA.
Given the embedding z_n for the n-th input segment in (8) above, the RNN decoder generates {y_{t'}, y_{t'+1}, ..., y_t} to reconstruct the input acoustic features {x_{t'}, x_{t'+1}, ..., x_t}. When the RNN decoder begins decoding each audio segment, its state is also reset.

In Fig. 4, each audio segment in the boxes with dotted lines can be viewed as performing the sequence-to-sequence training from Fig. 2 individually. Note that the sequence-to-sequence training in Fig. 4 reconstructs the target sequence in reverse order, in contrast to that shown in Fig. 2. In preliminary experiments, we found that the order of reconstruction does not significantly influence the performance of the representation. We use the reverse order here because it can be implemented more efficiently.

B. SSAE Training

Although all the parameters in the SSAE model can be trained simultaneously, we actually train our model using an iterative process consisting of two phases. In the first phase, the RNN encoder and decoder parameters are updated, while in the second phase, only the segmentation gate parameters are updated. The two phases are performed iteratively.

1) First Phase - RNN Encoder and Decoder: In the first phase, we train only the RNN encoder and decoder to minimize the reconstruction error while fixing the parameters of the segmentation gate. Because the segments are already provided by the segmentation gate, the first-phase training parallels training a typical SA as in Section III-A. That is, the encoder and decoder learn to minimize the reconstruction error L_mse in (1). Each time in phase one, the encoder and decoder are learned from randomly initialized parameters, instead of starting off with the parameters learned in the previous iteration; this was found to offer better training stability.

2) Second Phase - Segmentation Gate: In the second phase, we update the parameters of the segmentation gate while fixing the parameters of the encoder and decoder. Although the encoder and decoder are not updated in this phase, they are involved in computing the reward for training the segmentation gate by reinforcement learning.

The segmentation gate is trained using reinforcement learning. After the gate performs the segmentation for each utterance, it receives a reward r and a reward baseline r_b for updating the parameters; r and r_b are defined below. We can express the expected reward for the gate under policy π as J(θ) = E_π[r], where θ is the parameter set. To maximize the expected reward J(θ), policy gradient [75] is used to update the parameters of the segmentation gate with the update formulation below:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi}\Big[(r - r_b) \sum_{t=1}^{T'} \nabla_\theta \log \pi_t^{(\theta)}(a_t)\Big], \qquad (9)$$

where π_t^{(θ)}(a_t) is the probability of the action a_t taken per (7).

The reconstruction error is an effective indicator of whether the segmentation boundaries are good, since the embeddings are generated based on the segmentation. We hypothesize that good boundaries, for example those close to word boundaries, result in smaller reconstruction errors, because the audio segments for words appear more frequently in the corpus and thus their embeddings are trained better, with lower reconstruction errors. Therefore, one proper choice for the first term in the reward function is r_mse, the negative reconstruction error, r_mse = −L_mse.

At the same time, it is important to have a guideline for the proper number of segments N in an utterance of a given length T'. Without this guideline, the segmentation gate generates as many segments as possible in order to minimize the reconstruction error. Therefore, we design the reward such that the smaller the number of segments N normalized by the utterance length T', the higher the reward:

$$r_{num} = -\frac{N}{T'}, \qquad (10)$$

where N and T' are respectively the numbers of segments and frames for the utterance, as in Fig. 4.

The total reward r is obtained by choosing the minimum between r_mse and λ r_num:

$$r = \min(r_{mse}, \lambda\, r_{num}), \qquad (11)$$

where λ is a hyperparameter, tuned so as to give a reasonable guideline for the proper number of segments in an utterance of length T'. λ is determined to make the values of r_mse roughly equivalent to λ r_num. r_mse and r_num are unknown before model training, but it is possible to estimate their average values. Here we assume the average length of spoken words is known as prior knowledge (this is the only language-specific prior knowledge we used), so the average value of r_num can be roughly estimated. r_mse is estimated by randomly segmenting the utterances first and then training a sequence-to-sequence autoencoder. Interpolating r_mse and r_num as the total reward r is also possible, but in our preliminary experiments we found that the minimum function yielded better results than interpolation.

For the reward baseline r_b, we further use an utterance-wise reward baseline to remove the bias between utterances. For each utterance, M different sets of segment boundaries are sampled by the segmentation gate, each of which is used to evaluate a reward r_m with (11). The reward baseline r_b for the utterance is then their average:

$$r_b = \frac{1}{M}\sum_{m=1}^{M} r_m. \qquad (12)$$
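A compact sketch of this second-phase update is shown below. It follows (9)–(12) with a plain REINFORCE-style gradient, whereas the actual experiments use the PPO algorithm [79]; the gate is assumed to output per-frame boundary probabilities (its parametrization in (2)–(8) is not reproduced here), and the function computing the reconstruction error of a sampled segmentation is left abstract.

```python
# A sketch (not the released implementation) of the segmentation-gate update:
# reward r = min(r_mse, lambda * r_num) as in (10)-(11), with the utterance-wise
# baseline of (12) and the policy-gradient update of (9).
import torch

def segmentation_gate_step(gate, recon_error_fn, utterance, optimizer, M=5, lam=5.0):
    probs = gate(utterance)                       # (T,) boundary probabilities (assumed interface)
    dist = torch.distributions.Bernoulli(probs)
    log_probs, rewards = [], []
    T = utterance.shape[0]
    for _ in range(M):                            # sample M boundary sets per utterance
        actions = dist.sample()                   # (T,) 0/1 boundary decisions
        r_mse = -recon_error_fn(utterance, actions)   # negative reconstruction error (fixed SA)
        r_num = -actions.sum() / T                # penalize many segments, as in (10)
        rewards.append(torch.min(r_mse, lam * r_num)) # total reward, as in (11)
        log_probs.append(dist.log_prob(actions).sum())
    rewards = torch.stack(rewards)
    baseline = rewards.mean()                     # utterance-wise baseline r_b, as in (12)
    loss = -((rewards - baseline).detach() * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```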
V. EXAMPLE APPLICATION: UNSUPERVISED QUERY-BY-EXAMPLE SPOKEN TERM DETECTION

Here we consider unsupervised query-by-example spoken term detection (QbE STD) as an example application to evaluate the quality of the embeddings. The task of unsupervised QbE STD here is to verify the existence of the input spoken query in an utterance or audio file without performing speech recognition [31]. With the SSAE in Section IV, this is achieved as illustrated in Fig. 5. Given the acoustic feature sequences of a spoken query and a spoken document, SSAE represents these sequences as embeddings, q = {q_1, q_2, ..., q_{n_q}} for the query and d = {d_1, d_2, ..., d_{n_d}} for the document.
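The matching procedure itself (Fig. 5 and the search parameter k) is described on pages not reproduced above. Purely as an illustration of how two embedding sequences might be compared, the sketch below scores a document by cosine similarity with top-k pooling; this pooling scheme is an assumption for illustration and not necessarily the authors' search algorithm.

```python
# Illustrative only: score a spoken document against a spoken query given their
# embedding sequences, using cosine similarity and top-k pooling (an assumption).
import numpy as np

def relevance_score(q, d, k=1):
    # q: (n_q, dim) query embeddings, d: (n_d, dim) document embeddings
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_q, n_d) cosine similarities
    best = np.sort(sim, axis=1)[:, -k:]  # k best document matches per query vector
    return best.mean()
```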
Fig. 4: Segmental sequence-to-sequence autoencoder (SSAE). In addition to the RNN encoder (ER blocks) and RNN decoder
(DR blocks), a segmentation gate (S blocks) is included in the model to estimate word boundaries. During transitions across
segment boundaries, the RNN encoder and decoder are reset (illustrated as a slashed arrow) to prevent information flow across
segment boundaries. Each segment (the boxes with dotted lines) can be viewed as performing the sequence-to-sequence training
from Fig. 2 individually.
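As a rough illustration of the segment-wise encoding that Fig. 4 depicts, the sketch below runs an RNN encoder over an utterance and resets its hidden state wherever the (already sampled) boundary indicators fire, yielding one embedding per segment. The module name, the use of a GRU cell, and the sizes are assumptions, not the paper's implementation.

```python
# A rough sketch of the reset behaviour in Fig. 4: the encoder state is cleared at
# every segment boundary so no information flows across segments, and the state
# reached at each boundary is kept as that segment's embedding z_n.
import torch
import torch.nn as nn

class SegmentalEncoder(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=100):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x, boundaries):
        # x: (T, feat_dim) acoustic features; boundaries: (T,) 0/1 decisions from the gate
        h = x.new_zeros(self.hidden_dim)
        embeddings = []
        for t in range(x.shape[0]):
            h = self.cell(x[t].unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            if boundaries[t] == 1 or t == x.shape[0] - 1:
                embeddings.append(h)                  # segment embedding z_n
                h = x.new_zeros(self.hidden_dim)      # reset across the boundary
        return torch.stack(embeddings)                # (N, hidden_dim), one vector per segment
```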
Fig. 6 shows the average cosine similarity and the variance (the length of the black line on each bar) of the cosine similarity for groups of pairs clustered by the PSED (PSED = 0, 1, 2, 3 and > 3) between the two words.

In Fig. 6, the cosine similarities of the RNN encoder and the phonetic encoder decrease as the edit distances increase; that is, the vector representations for words with similar pronunciations are in close proximity to each other. This means that both the RNN encoder and the phonetic encoder indeed encode the sequential phonetic structures into fixed-length vectors. Clearly, the speaker encoder output includes little phonetic information, because the speaker encoder similarities are almost independent of the PSED. We also find from the means of the similarities that the RNN encoder clearly distinguishes word segments with different phonemes even without disentangling features. For example, the similarity for word segments whose phonemes are exactly the same (PSED = 0) is 0.63, while the similarity for word segments with one different phoneme (PSED = 1) is only 0.40. However, their similarities have very large variances; for example, the variance of the group with one different phoneme is 0.24, which leads to ambiguity between different groups. This is reasonable because it is well known that even with exactly identical phoneme sequences, acoustic realizations can differ greatly for different speakers. For the phonetic encoder, the differences between the mean similarities of the groups are not as remarkable as for the RNN encoder; however, the variances in each group are much smaller (0.018–0.029). This shows that feature disentanglement more cleanly separates the similarity values of the different groups.

Fig. 6: Average cosine similarity and variance (length of black line on each bar) between vector representations for all segment pairs in the evaluation set, clustered by phoneme sequence edit distance (PSED).
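The analysis behind Fig. 6 can be reproduced with a short script like the following; it is an illustrative sketch only, with the pair format and grouping buckets chosen here for readability.

```python
# Group cosine similarities of segment-pair embeddings by the phoneme sequence
# edit distance (PSED) and report the mean and variance of each group.
import numpy as np

def edit_distance(a, b):
    # standard single-row dynamic-programming Levenshtein distance over phoneme sequences
    dp = np.arange(len(b) + 1)
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (pa != pb))
    return int(dp[-1])

def similarity_by_psed(pairs):
    # pairs: iterable of (vec1, vec2, phonemes1, phonemes2)
    groups = {0: [], 1: [], 2: [], 3: [], '>3': []}
    for v1, v2, p1, p2 in pairs:
        cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        d = edit_distance(p1, p2)
        groups[d if d <= 3 else '>3'].append(cos)
    return {k: (np.mean(v), np.var(v)) for k, v in groups.items() if v}
```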
TABLE I: Within- and across-speaker ABX scores for the learned vector representations. Here an ablation study was performed by removing some loss terms in Algorithm 1, or some parts of Fig. 3.

Loss           | Within Speaker | Across Speaker
Same as Alg. 1 | 12.5           | 19.2
No Lc          | 14.0           | 19.8
No Ls          | 13.1           | 20.2
No Lc & Ls     | 15.2           | 20.0

Fig. 7: Output of phonetic encoder and speaker encoder for six different words.

In the following experiments, we further compare the performance of the RNN encoder without feature disentanglement and the phonetic encoder on QbE STD.

C. Visualization
In Fig. 7, each point corresponds to a representation of an audio segment; different words are represented using different colors. The points in the left and right parts of Fig. 7 are the representations of the same sets of segments but with different encoders: the left part is the output of the phonetic encoder z, while the right part is the output of the speaker encoder e. The representations were reduced to two dimensions using PCA. It is clear that the phonetic representations distinguish different words, while the speaker representations from the six words are mixed together. We also find that the speaker representations are clustered into two groups corresponding to males and females. The setup of Fig. 8 is parallel to that of Fig. 7; here we show the representations of the audio segments from two speakers. The segment representations of the two speakers correspond to the red and blue points. The phonetic representations do not distinguish the audio segments of the two speakers because their utterances show no remarkable differences. The audio segments of the two speakers, however, show very different speaker representations.

For another test, we selected four sets of words that differ only in the last few phonemes. We averaged the phonetic representations z of the audio segments corresponding to the same word, and reduced the dimensionality of the averaged representations to 2 using PCA. The averaged representation of word w is denoted as V(w). From the results, shown in Fig. 9, we see that the representations z constitute very good descriptions of the sequential phonemic structures of the acoustic segments. For example, in the leftmost figure of Fig. 9, we observe that V(SIT) − V(SITTING) ≈ V(STAND) − V(STANDING). Several similar examples are found in Fig. 9.

For quantitative analysis, we conducted an experiment to evaluate whether the difference vector between the phonetic representation of a certain word w and that of the word plus a suffix, such as w−ing, would be consistent regardless of what w was. More specifically, for a certain pair (V(w1), V(w2)), we calculated V(w2) + V(w1−ing) − V(w1), and used the mean reciprocal rank (MRR), the harmonic mean of the retrieval ranks, as the retrieval evaluation measure for V(w2−ing). The retrieval results for four sets of words plus suffixes (w−ing, w−ed, w−s and w−er) are listed in Table II. Here we also compared the performance of representations with and without disentanglement. It can be observed that the difference vectors mentioned above were consistent to an extent, and again disentanglement of features improved the retrieval results.

TABLE II: Mean reciprocal rank (MRR) scores for retrieval results of vector representations of words plus four kinds of suffixes.

Loss           | +ing | +ed  | +s   | +er
Same as Alg. 1 | 0.09 | 0.26 | 0.13 | 0.09
No Lc & Ls     | 0.08 | 0.19 | 0.09 | 0.08
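The analogy-retrieval evaluation just described can be sketched as follows; this is illustrative code, not the exact evaluation script, and the dictionary-based interface is an assumption.

```python
# For word pairs (w1, w2), retrieve V(w2 + suffix) with the query
# V(w2) + V(w1 + suffix) - V(w1), then average the reciprocal ranks into an MRR score.
import numpy as np

def mrr_for_suffix(V, pairs, suffix):
    # V: dict mapping a word to its averaged phonetic embedding (numpy vector)
    # pairs: list of (w1, w2) such that w1+suffix and w2+suffix exist in V
    words = list(V.keys())
    mat = np.stack([V[w] / np.linalg.norm(V[w]) for w in words])
    reciprocal_ranks = []
    for w1, w2 in pairs:
        query = V[w2] + V[w1 + suffix] - V[w1]
        query = query / np.linalg.norm(query)
        order = np.argsort(-(mat @ query))                  # rank all words by cosine similarity
        ranked_words = [words[i] for i in order]
        rank = ranked_words.index(w2 + suffix) + 1
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))                 # mean reciprocal rank
```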
VII. EXPERIMENT: SEGMENTATION

A. Experimental Setup

We conducted segmentation experiments on TIMIT and GlobalPhone [77], specifically Czech, French, and German. For TIMIT the ground truth word boundaries were provided, while for GlobalPhone we used the forced-aligned word boundaries. In this section, both the RNN encoder and decoder of the SSAE consist of one hidden layer with 100 LSTM units. Feature disentanglement does not apply in the experiments of this section; that is, here we do not have a speaker encoder and speaker classifier. The segmentation gate consists of two 256-node LSTM layers. All parameters were trained with Adam [78]. We set M in (12) for estimating the reward baseline r_b to 5, and λ = 5 in (11). The word boundaries were initialized randomly. The proximal policy optimization algorithm [79] was used in the policy gradient. The tolerance window for word segmentation evaluation was taken as 40 ms. The acoustic features used were 39-dim MFCCs with utterance-wise CMVN.

B. Experimental Results

Fig. 10 shows the SSAE learning curves on the TIMIT validation set, and Fig. 11 shows the results for the Czech validation set; we do not show the French and German results because their trends mirror that of Fig. 11. We see that SSAE gradually learns to segment utterances into spoken words, because both the precision and recall (blue curves in Fig. 10 and Fig. 11, respectively) increase during training. The reward r_num in (10) (red curves) fluctuates initially during training and tends to converge at the end.

Table III shows the spoken word segmentation performance of the proposed SSAE in terms of precision, recall, and F1 score. We compared the SSAE results with three baselines: random segments, gate activation signals (GAS) [22], and hierarchical agglomerative clustering (HAC) [24], [80]. We observe that SSAE significantly outperforms the other baselines on all languages other than German GAS, to which it is comparable. An example of segmentation by SSAE is shown in Fig. 12.

TABLE III: Spoken word segmentation performance, compared to different methods for various corpora.

Method | TIMIT Precision | TIMIT Recall | TIMIT F1 | Czech F1 | French F1 | German F1
Random | 24.60           | 41.08        | 30.77    | 22.56    | 32.66     | 25.41
HAC    | 26.84           | 46.21        | 33.96    | 30.84    | 33.75     | 27.09
GAS    | 33.22           | 52.39        | 40.66    | 29.53    | 31.11     | 32.89
SSAE   | 37.06           | 51.55        | 43.12    | 37.78    | 48.14     | 31.69

VIII. EXPERIMENT: QBE STD

In this section, we evaluate the performance of audio word2vec on QbE STD. We use mean average precision (MAP) as the evaluation measure.

The first set of experiments was conducted on English (TIMIT), Czech, French, and German. The testing set utterances were used as spoken documents [20]. For each language, we randomly selected as query words five words containing a variety of phonemes; from the training set we used several occurrences of each of these phoneme-rich words as spoken queries. For English, Czech, French, and German, we used 29, 21, 25, and 23 spoken queries for evaluation, respectively.
The quality of the segmentation boundaries made the biggest impact on STD performance. We also note that although the GAS segmentation performance was slightly better than that of SSAE for German, SSAE clearly outperformed GAS on German QbE STD.

TABLE IV: Spoken term detection performance in mean average precision (MAP) for the proposed SSAE as compared to audio word2vec embeddings trained with spoken words segmented with other methods for different languages. The random baseline (Ran.) assigns a random score to each query-document pair. Standard frame-based DTW is the primary baseline; oracle segmentation is the upper bound. GAS, HAC, SSAE, and Oracle denote embeddings trained with the corresponding segmentations.

Lang.  | Ran. | DTW   | GAS  | HAC  | SSAE  | Oracle
TIMIT  | 0.74 | 12.02 | 8.29 | 0.91 | 23.27 | 30.28
Czech  | 0.38 | 16.59 | 0.68 | 1.13 | 19.41 | 22.56
French | 0.27 | 11.72 | 0.40 | 0.92 | 21.70 | 29.66
German | 0.18 | 6.07  | 0.27 | 0.26 | 13.82 | 21.52

Then we conducted experiments with more spoken queries based on Librispeech [76]. The audio word2vec models were trained on the 100-hour clean data set. The spoken archive to be retrieved is the clean testing data set, and the chapters are considered as the units to be retrieved. We have 361 spoken queries, also from Librispeech, but not included in the training set or the retrieved utterances. All the spoken queries correspond to a single word in these experiments.

The experimental results are shown in Table V. NoDis and PE are the results without feature disentanglement and with the phonetic encoder, respectively. Oracle means we segmented the audio using the word boundaries obtained by forced alignment with the reference transcriptions; SSAE in Table V means the segments were obtained by SSAE. Different values of k (k = 1 or 40) for the search algorithm in Section V and Fig. 5 are tested. Clearly, the output of the phonetic encoder outperformed the features without disentanglement because it reduces the speaker dependence (PE vs. NoDis).

TABLE V: MAP performance of query-by-example spoken term detection (QbE STD). NoDis and PE are the results without feature disentanglement and using the phonetic encoder, respectively. The segmentation of training and testing sets can be either oracle or by SSAE. Different k for the search algorithm in Section V and Fig. 5 are tested.

Training Set | Testing Set | NoDis (k=1) | PE (k=1) | NoDis (k=40) | PE (k=40)
Oracle       | SSAE        | 16.41       | 17.24    | 17.27        | 21.79
SSAE         | SSAE        | 15.54       | 17.09    | 17.60        | 19.60

IX. CONCLUDING REMARKS

In this paper, we extend the research on audio word2vec. We use domain adversarial training to automatically learn encoders that encode different information, and the experimental results show that this disentangles the phonetic and speaker information. We further propose an SSAE trained with reinforcement learning, in which word-level segmentation and segment representation are jointly learned.

Audio embedding has many possible applications beyond STD. For example, audio embedding can be considered as a better audio representation for speech recognition: it is possible to use a large amount of unlabeled audio to learn the embeddings to improve low-resource speech recognition. The learned embeddings can also be used in applications related to spoken language understanding of low-resource languages, such as spoken question answering, spoken content summarization, and speech translation. These spoken language understanding systems can take the learned audio embedding as input, instead of the transcriptions of the spoken content.

REFERENCES

[1] M. A. Hasegawa-Johnson, P. Jyothi, D. McCloy, M. Mirbagheri, G. M. d. Liberto, A. Das, B. Ekin, C. Liu, V. Manohar, H. Tang et al., "ASR for under-resourced languages from probabilistic transcription," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 1, pp. 50–63, 2017.
[2] N. F. Chen, B. P. Lim, M. A. Hasegawa-Johnson et al., "Multitask learning for phone recognition of underresourced languages using mismatched transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 3, pp. 501–514, 2018.
[3] A. Jansen et al., "A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition," in ICASSP, 2013.
[4] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The zero resource speech challenge 2017," in ASRU, 2017.
[5] M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, "The zero resource speech challenge 2015," in INTERSPEECH, 2015.
[6] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[8] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," arXiv preprint arXiv:1405.4053, 2014.
[9] S. Bansal, H. Kamper, A. Lopez, and S. Goldwater, "Towards speech-to-text translation without speech recognition," in EACL, 2017.
[10] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016.
[11] C.-H. Shen, J. Y. Sung, and H.-Y. Lee, "Language transfer of audio word2vec: Learning audio segment representations without target language data," in arXiv, 2017.
[12] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[13] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures," Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, p. 43, 2012.
[14] J. Li, M.-T. Luong, and D. Jurafsky, "A hierarchical neural autoencoder for paragraphs and documents," in arXiv, 2015.
[15] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-thought vectors," in arXiv, 2015.
[16] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised learning of video representations using LSTMs," in arXiv, 2015.
[17] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," in JMLR, 2015.
[18] H. Kamper, A. Jansen, and S. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," in Computer Speech and Language, 2017.
[19] C.-T. Chung, C.-A. Chan, and L.-S. Lee, "Unsupervised spoken term detection with spoken queries by multi-level acoustic patterns with varying model granularity," in ICASSP, 2014.
[20] Y. Zhang and J. R. Glass, "Unsupervised spoken keyword spotting via segmental DTW on gaussian posteriorgrams," in ASRU, 2009.
[21] H. Kamper, A. Jansen, and S. Goldwater, "A segmental framework for fully-unsupervised large-vocabulary speech recognition," Computer Speech & Language, vol. 46, pp. 154–174, 2017.
[22] Y.-H. Wang, C.-T. Chung, and H.-Y. Lee, "Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries," INTERSPEECH, 2017.
[23] O. Räsänen, "Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level," in CogSci, 2014.
[24] Y. Qiao, N. Shimomura, and N. Minematsu, "Unsupervised optimal phoneme segmentation: Objectives, algorithm and comparisons," in ICASSP, 2008.
[25] C.-y. Lee and J. Glass, "A nonparametric bayesian approach to acoustic model discovery," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012, pp. 40–49.
[26] Y.-H. Wang, H.-Y. Lee, and L.-S. Lee, "Segmental audio Word2Vec: Representing utterances as sequences of vectors with applications in spoken term detection," in ICASSP, 2017.
[27] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[28] J. Chung, S. Ahn, and Y. Bengio, "Hierarchical multiscale recurrent neural networks," International Conference on Learning Representations (ICLR), 2017.
[29] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in INTERSPEECH, 2009.
[30] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in ASRU, 2013.
[31] K. Levin, A. Jansen, and B. Van Durme, "Segmental acoustic indexing for zero resource keyword search," in ICASSP, 2015.
[32] H.-Y. Lee and L.-S. Lee, "Enhanced spoken term detection using support vector machines and weighted pseudo examples," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 6, pp. 1272–1284, 2013.
[33] I.-F. Chen and C.-H. Lee, "A hybrid HMM/DNN approach to keyword spotting of short words," in INTERSPEECH, 2013.
[34] A. Norouzian, A. Jansen, R. Rose, and S. Thomas, "Exploiting discriminative point process models for spoken term detection," in INTERSPEECH, 2012.
[35] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, "End-to-end asr-free keyword search from speech," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1351–1359, 2017.
[36] S. Bengio and G. Heigold, "Word embeddings for speech recognition," in INTERSPEECH, 2014.
[37] G. Chen, C. Parada, and T. N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in ICASSP, 2015.
[38] H. Kamper, W. Wang, and K. Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in ICASSP, 2016.
[39] W. He, W. Wang, and K. Livescu, "Multi-view recurrent neural acoustic word embeddings," arXiv preprint arXiv:1611.04496, 2016.
[40] S. Settle, K. Levin, H. Kamper, and K. Livescu, "Query-by-example search with discriminative neural acoustic word embeddings," arXiv preprint arXiv:1706.03818, 2017.
[41] A. L. Maas, S. D. Miller, T. M. O'neil, A. Y. Ng, and P. Nguyen, "Word-level acoustic modeling with convolutional vector regression," in ICML Workshop on Representation Learning, Edinburgh, Scotland, 2012.
[42] Y.-A. Chung and J. Glass, "Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech," arXiv preprint arXiv:1803.08976, 2018.
[43] N. Holzenberger, M. Du, J. Karadayi, R. Riad, and E. Dupoux, "Learning word embeddings: unsupervised methods for fixed-size representations of variable-length speech segments," in Interspeech 2018. ISCA, 2018.
[44] H. Kamper, "Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models," arXiv preprint arXiv:1811.00403, 2018.
[45] S. Settle and K. Livescu, "Discriminative acoustic word embeddings: Recurrent neural network-based approaches," in SLT, 2016.
[46] A. Jansen, M. Plakal, R. Pandya, D. Ellis, S. Hershey, J. Liu, C. Moore, and R. A. Saurous, "Towards learning semantic audio representations from unlabeled data," in NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio), 2017.
[47] W.-N. Hsu, Y. Zhang, and J. Glass, "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation," in ASRU, 2017.
[48] ——, "Learning latent representations for speech generation and transformation," in INTERSPEECH, 2017.
[49] ——, "Unsupervised learning of disentangled and interpretable representations from sequential data," in NIPS, 2017.
[50] Y.-C. Chen, S.-F. Huang, C.-H. Shen, H.-y. Lee, and L.-s. Lee, "Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 941–948.
[51] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, "Unsupervised adaptation with domain separation networks for robust speech recognition," in ASRU, 2017.
[52] C. T. Chung and L. S. Lee, "Unsupervised discovery of structured acoustic tokens with applications to spoken term detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 394–405, Feb 2018.
[53] A. Garcia and H. Gish, "Keyword spotting of arbitrary words using minimal speech resources," in ICASSP, 2006.
[54] A. Jansen and K. Church, "Towards unsupervised training of speaker independent acoustic models," in INTERSPEECH, 2011.
[55] A. Jansen, K. Church, and H. Hermansky, "Towards spoken term discovery at scale with zero resources," in INTERSPEECH, 2010.
[56] A. Park and J. Glass, "Unsupervised pattern discovery in speech," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 1, pp. 186–197, Jan 2008.
[57] V. Stouten, K. Demuynck, and H. Van hamme, "Discovering phone patterns in spoken utterances by non-negative matrix factorization," Signal Processing Letters, IEEE, vol. 15, pp. 131–134, 2008.
[58] L. Wang, E. S. Chng, and H. Li, "An iterative approach to model merging for speech pattern discovery," in APSIPA, 2011.
[59] N. Vanhainen and G. Salvi, "Word discovery with beta process factor analysis," in INTERSPEECH, 2012.
[60] J. Driesen and H. Van hamme, "Fast word acquisition in an NMF-based learning framework," in ICASSP, 2012.
[61] Y. Zhang and J. Glass, "Towards multi-speaker unsupervised speech pattern discovery," in ICASSP, 2010.
[62] C.-H. Lee, F. K. Soong, and B.-H. Juang, "A segment model based approach to speech recognition," in ICASSP, 1988.
[63] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li, "An acoustic segment modeling approach to query-by-example spoken term detection," in ICASSP, 2012.
[64] H. Kamper, A. Jansen, and S. Goldwater, "Unsupervised word segmentation and lexicon discovery using acoustic word embeddings," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 24, no. 4, pp. 669–679, Apr. 2016.
[65] H. Kamper, K. Livescu, and S. Goldwater, "An embedded segmental K-means model for unsupervised segmentation and clustering of speech," in ASRU, 2017.
[66] M. Elsner and C. Shain, "Speech segmentation with a neural encoder model of working memory," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1070–1080.
[67] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[68] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.
[69] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," CoNLL 2016, p. 10, 2016.
[70] S. Semeniuta, A. Severyn, and E. Barth, "A hybrid convolutional variational autoencoder for text generation," arXiv preprint arXiv:1702.02390, 2017.
[71] P. Nema, M. Khapra, A. Laha, and B. Ravindran, "Diversity driven attention model for query-based abstractive summarization," arXiv preprint arXiv:1704.08300, 2017.
[72] N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, "Joint learning of speaker and phonetic similarities with siamese networks," in INTERSPEECH, 2016.
[73] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," in ICML, 2017.
[74] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of wasserstein GANs," in NIPS, 2017.
[75] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in NIPS, 1999.
[76] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP, 2015.
[77] T. Schultz, "Globalphone: a multilingual speech and text database developed at karlsruhe university," in INTERSPEECH, 2002.
[78] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations (ICLR), 2015.
[79] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[80] C.-a. Chan, "Unsupervised spoken term detection with spoken queries," Ph.D. dissertation, National Taiwan University, 2012.

Chia-Hao Shen received his M.S. degree in Electrical Engineering from National Taiwan University (NTU), Taipei, Taiwan, in 2017. His research focused on audio/speech representation. He is currently an NLP data scientist in CompStak, working on natural language understanding and reinforcement learning.