
Leveraging Frame Semantics and Distributional Semantics for

Unsupervised Semantic Slot Induction in Spoken Dialogue Systems


(Extended Abstract)
Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213-3891, USA
{yvchen, yww, air}@cs.cmu.edu

Abstract

Although the spoken dialogue system community in speech and the semantic parsing community in natural language processing share many similar tasks and approaches, they have progressed independently over the years with few interactions. This paper connects the two worlds to automatically induce semantic slots for spoken dialogue systems using frame and distributional semantic theories. Given a collection of unlabeled audio, we exploit continuous-valued word embeddings to augment a probabilistic frame-semantic parser that identifies key semantic slots in an unsupervised fashion. Our experiments on a real-world spoken dialogue dataset show that distributional word representations significantly improve adaptation from FrameNet-style parses of recognized utterances to the target semantic space, that, compared to a state-of-the-art baseline, a 12% relative mean average precision improvement is achieved, and that the proposed technology can be used to reduce the cost of designing task-oriented spoken dialogue systems.

1 Introduction

Frame semantics is a linguistic theory that defines meaning as a coherent structure of related concepts (Fillmore, 1982). Although there have been some successful applications in natural language processing (Hedegaard and Simonsen, 2011; Coyne et al., 2011; Hasan and Ng, 2013), this linguistically principled theory was not explored in the speech community until recently: Chen et al. (2013b) showed that it is possible to use probabilistic frame-semantic parsing to automatically induce and adapt the semantic ontology for designing spoken dialogue systems (SDS) in an unsupervised fashion. Compared to the traditional approach, where domain experts and developers manually define the semantic ontology for the SDS, the unsupervised approach has the advantage of reducing costs and avoiding human-induced bias.

On the other hand, the distributional view of semantics hypothesizes that words occurring in the same contexts may have similar meanings (Harris, 1954). With the recent advance of deep learning techniques, the continuous representation of word embeddings has further boosted state-of-the-art results in many applications, such as frame identification (Hermann et al., 2014), sentiment analysis (Socher et al., 2013), language modeling (Mikolov, 2012), and sentence completion (Mikolov et al., 2013a).

In this paper, given a collection of unlabeled raw audio files, we investigate an unsupervised approach for semantic slot induction. To do this, we use a state-of-the-art probabilistic frame-semantic parsing approach (Das et al., 2010; Das et al., 2014) and perform an adaptation process that maps the generic FrameNet-style (Baker et al., 1998) semantic parses to a target semantic space suitable for domain-specific conversation settings. We utilize continuous word embeddings trained on very large external corpora (e.g., Google News and Freebase) for the adaptation process. To evaluate the performance of our approach, we compare the automatically induced semantic slots with the reference slots created by domain experts. Empirical experiments show that the slot creation results generated by our approach align well with those of domain experts.

2 The Proposed Approach

We build our approach on top of the recent success of an unsupervised frame-semantic parsing approach. Chen et al. (2013b) formulated the semantic mapping and adaptation problem as a ranking problem and proposed the use of unsupervised clustering methods to differentiate the generic semantic concepts from the target semantic space for task-oriented dialogue systems. However, their clustering approach operates only on the small in-domain training data, which may not be robust enough. Therefore, this paper proposes a radical extension of the previous approach: we aim at improving the semantic adaptation process by leveraging distributed word representations.
[Figure 1 shows the frame-semantic parse of the ASR output "can i have a cheap restaurant": "can" evokes the capability frame (FE filler: "i"), "cheap" evokes the expensiveness frame, and "restaurant" evokes the locale_by_use frame.]

Figure 1: An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit.

2.1 Probabilistic Semantic Parsing

FrameNet is a linguistically principled semantic resource (Baker et al., 1998), developed based on the frame semantics theory (Fillmore, 1976). In our approach, we parse all ASR-decoded utterances in our corpus using SEMAFOR, a state-of-the-art semantic parser for frame-semantic parsing (Das et al., 2010; Das et al., 2014), and extract all frames from the semantic parsing results as slot candidates, where the LUs that correspond to the frames are extracted for slot-filling. For example, Figure 1 shows the SEMAFOR parse of an ASR-decoded text output.

Since SEMAFOR was trained on FrameNet annotation, which has a more generic frame-semantic context, not all frames from the parsing results can be used as actual slots in a domain-specific dialogue system. For instance, in Figure 1, we see that the expensiveness and locale_by_use frames are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the capability frame does not convey particularly valuable information for SLU. In order to fix this issue, we compute the prominence of these slot candidates, use a slot ranking model to rerank the most frequent slots, and then generate a list of induced slots for use in domain-specific dialogue systems.
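To make this extraction step concrete, the following Python sketch shows one way frames and their lexical units could be collected as slot candidates and slot-fillers; the parse structure, field names, and helper function are simplified illustrations of our own, not SEMAFOR's actual output format.

    from collections import defaultdict

    # Hypothetical, simplified representation of one frame-semantic parse:
    # each evoked frame carries the lexical unit (LU) that triggered it.
    parsed_utterance = [
        {"frame": "capability",    "lu": "can"},
        {"frame": "expensiveness", "lu": "cheap"},
        {"frame": "locale_by_use", "lu": "restaurant"},
    ]

    def collect_slot_candidates(parses):
        """Treat every frame observed in the corpus as a slot candidate and
        keep the LUs that evoked it as that candidate's slot-fillers."""
        fillers = defaultdict(list)
        for parse in parses:
            for annotation in parse:
                fillers[annotation["frame"]].append(annotation["lu"])
        return fillers

    # With this one-utterance "corpus": expensiveness -> ["cheap"], etc.
    print(dict(collect_slot_candidates([parsed_utterance])))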
2.2 Continuous Space Word Representations

To better adapt the FrameNet-style parses to the target task-oriented SDS domain, we make use of continuous word vectors derived from a recurrent neural network architecture (Mikolov et al., 2010). Recurrent neural network language models use the context history to include long-distance information. Interestingly, the vector-space word representations learned by such language models were shown to capture syntactic and semantic regularities (Mikolov et al., 2013c; Mikolov et al., 2013b). The word relationships are characterized by vector offsets, where, in the embedded space, all pairs of words sharing a particular relation are related by the same constant offset. Considering that this distributional semantic theory may benefit our SLU task, we leverage word representations trained on large external data to differentiate semantic concepts.
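As a small illustration of the offset regularities described above, the sketch below queries pre-trained vectors; the gensim package and the vector file name are our own choices for illustration and are not tools reported in the paper.

    from gensim.models import KeyedVectors

    # Load word2vec-format vectors (the file name is an assumption; any
    # word2vec-compatible binary file can be substituted).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)

    # The classic offset regularity: vec("king") - vec("man") + vec("woman")
    # lands near vec("queen") in the embedded space.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # Cosine similarity between two tokens, the basic operation reused by
    # the similarity measures in Section 2.3.
    print(vectors.similarity("cheap", "expensive"))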
2.3 Slot Ranking Model

Our model ranks the slot candidates by integrating two scores (Chen et al., 2013b): (1) the relative frequency of each candidate slot in the corpus, since slots with higher frequency may be more important; (2) the coherence of the slot-fillers corresponding to the slot. Assuming that domain-specific concepts focus on fewer topics and are similar to each other, the coherence of the corresponding values can help measure the prominence of the slots.

    w(s_i) = (1 - \alpha) \log f(s_i) + \alpha \log h(s_i),    (1)

where w(s_i) is the ranking weight for the slot candidate s_i, f(s_i) is the frequency of s_i from semantic parsing, h(s_i) is the coherence measure of s_i, and \alpha is a weighting parameter within the interval [0, 1].

For each slot s_i, we have the set of corresponding slot-fillers, V(s_i), constructed from the utterances that include the slot s_i in the parsing results. The coherence measure h(s_i) is computed as the average pairwise similarity of the slot-fillers, to evaluate whether slot s_i corresponds to centralized or scattered topics:

    h(s_i) = \frac{\sum_{x_a, x_b \in V(s_i), x_a \neq x_b} Sim(x_a, x_b)}{|V(s_i)|^2},    (2)

where V(s_i) is the set of slot-fillers corresponding to slot s_i, |V(s_i)| is the size of the set, and Sim(x_a, x_b) is the similarity between the pair of fillers x_a and x_b. A slot s_i with higher h(s_i) usually focuses on fewer topics, which is more specific and thus more likely to occur as a slot in dialogue systems.

We involve the distributional semantics of slot-fillers x_a and x_b for deriving Sim(x_a, x_b). Here, we propose two similarity measures as Sim(x_a, x_b) in (2): the representation-derived similarity and the neighbor-derived similarity.
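As a concrete reading of Equations (1) and (2), the sketch below scores slot candidates from their frequencies and the pairwise similarities of their fillers. The similarity function is passed in (Sections 2.3.1 and 2.3.2 define the two variants), and the small floor on the coherence value is our own assumption to keep the logarithm defined.

    import math
    from itertools import permutations

    def coherence(fillers, sim):
        """h(s_i): summed pairwise similarity over ordered pairs of distinct
        fillers, normalized by |V(s_i)|^2 as in Eq. (2)."""
        if len(fillers) < 2:
            return 1e-6  # assumption: floor value for degenerate filler sets
        total = sum(sim(xa, xb) for xa, xb in permutations(fillers, 2))
        return max(total / (len(fillers) ** 2), 1e-6)

    def ranking_weight(freq, fillers, sim, alpha=0.2):
        """w(s_i) = (1 - alpha) log f(s_i) + alpha log h(s_i), Eq. (1);
        alpha = 0.2 follows the setting reported in Section 3.1."""
        return (1 - alpha) * math.log(freq) + alpha * math.log(coherence(fillers, sim))

    # Toy usage with a trivial similarity: 1.0 for identical strings, else 0.3.
    toy_sim = lambda a, b: 1.0 if a == b else 0.3
    slots = {"expensiveness": (120, ["cheap", "expensive", "cheap"]),
             "capability":    (300, ["can", "could", "may"])}
    ranked = sorted(slots, key=lambda s: ranking_weight(*slots[s], toy_sim), reverse=True)
    print(ranked)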
2.3.1 Representation-Derived Similarity

Given that distributional semantics can be captured by continuous space word representations (Mikolov et al., 2013c), we transform each token x into its embedding vector x using the pre-trained distributed word representations, and then the similarity between a pair of slot-fillers x_a and x_b can be computed as the cosine similarity of their embedding vectors, called RepSim(x_a, x_b).

We assume that words occurring in similar domains have similar word representations, so RepSim(x_a, x_b) will be larger when x_a and x_b are semantically related. The representation-derived similarity relies on the quality of the pre-trained word representations; higher dimensionality of the embeddings results in more accurate performance but greater complexity.

2.3.2 Neighbor-Derived Similarity

With the embedding vector x corresponding to token x in the continuous space, we build a vector r_x = [r_x(1), ..., r_x(t), ..., r_x(T)] for each x, where T is the vocabulary size and the t-th element of r_x is defined as

    r_x(t) = \begin{cases} \frac{x \cdot y_t}{\|x\| \|y_t\|}, & \text{if } y_t \text{ is one of the } N \text{ words whose embedding vectors have the greatest similarity to } x, \\ 0, & \text{otherwise.} \end{cases}    (3)

The t-th element of vector r_x is the cosine similarity between the embedding vector of slot-filler x and the t-th embedding vector y_t of the pre-trained word representations (the t-th token in the vocabulary of the larger external dataset), and we only keep the elements with the top N greatest values to form a sparse vector for space reduction (from T to N). r_x can be viewed as a vector indicating the N nearest neighbors of token x in the continuous word representation space. The similarity between a pair of slot-fillers x_a and x_b, Sim(x_a, x_b) in (2), can then be computed as the cosine similarity between r_{x_a} and r_{x_b}, called NeiSim(x_a, x_b).

The idea behind NeiSim(x_a, x_b) is very similar to that of RepSim(x_a, x_b): we assume that words with similar concepts should have similar representations and share similar neighbors. Hence, NeiSim(x_a, x_b) is larger when x_a and x_b have more overlapping neighbors in the continuous space.
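Both measures follow directly from the definitions above. The numpy sketch below is a minimal implementation that assumes the pre-trained embeddings are available as a plain token-to-vector dictionary; how they are loaded is left out here.

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def rep_sim(xa, xb, emb):
        """RepSim: cosine similarity of the two fillers' embedding vectors."""
        return cosine(emb[xa], emb[xb])

    def neighbor_vector(x, emb, n=100):
        """r_x from Eq. (3): cosine similarity to every vocabulary word,
        keeping only the top-N entries and zeroing out the rest."""
        vocab = sorted(emb)            # fix an index order over the vocabulary
        sims = np.array([cosine(emb[x], emb[w]) for w in vocab])
        r = np.zeros_like(sims)
        top = np.argsort(sims)[-n:]    # indices of the N nearest neighbors
        r[top] = sims[top]
        return r

    def nei_sim(xa, xb, emb, n=100):
        """NeiSim: cosine similarity between the sparse neighbor vectors."""
        return cosine(neighbor_vector(xa, emb, n), neighbor_vector(xb, emb, n))

For a vocabulary of millions of entries, the brute-force scan in neighbor_vector would in practice be replaced by the embedding library's own nearest-neighbor query; only the resulting top-N sparse structure matters for NeiSim.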
3 Experiments

We examine the slot induction accuracy by comparing the reranked list of frame-semantic parsing induced slots with the reference slots created by system developers. Furthermore, using the reranked list of induced slots and their associated slot-fillers (values), we compare against the human annotation. For the slot-filling task, we evaluate both on ASR transcripts of the raw audio and on the manual transcripts.

3.1 Experimental Setup

In this experiment, we used the Cambridge University spoken language understanding corpus, previously used in several other SLU tasks (Henderson et al., 2012; Chen et al., 2013a). The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple spoken dialogue systems in an in-car setting. The corpus contains a total of 2,166 dialogues and 15,453 utterances. The ASR system that was used to transcribe the speech has a word error rate of 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type. The parameter \alpha in (1) can be set empirically; we use \alpha = 0.2 and N = 100 for all experiments.

To include distributional semantics information, we use two lists of pre-trained distributed vectors, described below.¹

Word/Phrase Vectors from Google News: word vectors are trained on 10^9 words from Google News, using the continuous bag-of-words architecture, which predicts the current word based on its context. The resulting vectors have dimensionality 300 and a vocabulary size of 3 × 10^6; the entries contain both words and automatically derived phrases.

¹ https://code.google.com/p/word2vec/
Entity Vectors with Freebase Naming: the entity vectors are trained on 10^9 words from Google News with entity naming from Freebase.² The training was performed using the continuous skip-gram architecture, which predicts the surrounding words given the current word. The resulting vectors have dimensionality 1000 and a vocabulary size of 1.4 × 10^6, and the entries use the deprecated /en/ naming from Freebase.

² http://www.freebase.com/

The first dataset provides a larger vocabulary and better coverage; the second has more precise vectors, built with knowledge from Freebase.
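For orientation, here is a minimal sketch of how the two pre-trained vector lists might be loaded; the gensim package and both file names are assumptions on our side, chosen to match the descriptions in Section 3.1, while the paper itself only points to the word2vec toolkit link given in the footnote.

    from gensim.models import KeyedVectors

    # File names are assumptions matching the descriptions in Section 3.1.
    news_vec = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz", binary=True)    # 300-dim words/phrases
    freebase_vec = KeyedVectors.load_word2vec_format(
        "freebase-vectors-skipgram1000-en.bin.gz", binary=True)  # 1000-dim entities

    # Google News covers plain tokens and automatically derived phrases ...
    print("restaurant" in news_vec, "chinese_food" in news_vec)
    # ... while the Freebase vectors use the deprecated /en/ entity naming.
    print("/en/cambridge" in freebase_vec)

The two sets are used separately in the ranking model and only combined afterwards by voting over the ranked results, as described in Section 3.3.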
[Figure 2 diagram: induced frame slots (e.g., expensiveness, locale_by_use, food, part_orientational, direction, locale, part_inner_outer, speak_on_topic, seeking, desiring, contacting, locating, sending) mapped to the reference slots (e.g., price range, type, food, area, addr, task, phone, postcode).]

Figure 2: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

3.2 Evaluation Metrics

To evaluate the induced slots, we measure their quality as the proximity between the induced slots and the reference slots. Figure 2 shows the many-to-many mappings that indicate semantically related induced slots and reference slots (Chen et al., 2013b). Since we define the adaptation task as a ranking problem, with a ranked list of induced slots we can use the standard mean average precision (MAP) as our metric, where an induced slot is counted as correct when it has a mapping to a reference slot.

To evaluate the slot fillers, for each matched mapping between an induced slot and a reference slot, we compute an F-measure by comparing the list of extracted slot fillers corresponding to the induced slot with the slot fillers in the reference list. Since slot fillers may contain multiple words, we use hard and soft matching to define whether two slot fillers match each other: hard requires that the two slot fillers be exactly the same, while soft means that two slot fillers match if they share at least one overlapping word. We weight the MAP scores with the corresponding F-measures, as MAP-F-H (hard) and MAP-F-S (soft), to evaluate the performance of the slot induction and slot-filling tasks together (Chen et al., 2013b).
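A minimal sketch of these metrics follows, assuming the slot-to-reference mapping and the per-slot F-measures are already available; the exact weighting of MAP-F is our reading of the description above rather than released evaluation code, and it assumes every mappable slot appears somewhere in the ranked list.

    def mean_average_precision(ranked_slots, has_mapping):
        """Standard MAP over one ranked list: an induced slot counts as
        correct if it maps to at least one reference slot."""
        hits, precisions = 0, []
        for rank, slot in enumerate(ranked_slots, start=1):
            if has_mapping(slot):
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(precisions) if precisions else 0.0

    def map_f(ranked_slots, has_mapping, f_measure):
        """MAP weighted by each correct slot's slot-filling F-measure
        (hard or soft matching), our reading of MAP-F-H / MAP-F-S."""
        hits, scores = 0, []
        for rank, slot in enumerate(ranked_slots, start=1):
            if has_mapping(slot):
                hits += 1
                scores.append((hits / rank) * f_measure(slot))
        return sum(scores) / len(scores) if scores else 0.0

    # Toy usage: two of the three induced slots map to reference slots.
    ranked = ["expensiveness", "capability", "locale_by_use"]
    mapped = {"expensiveness", "locale_by_use"}
    print(mean_average_precision(ranked, mapped.__contains__))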
3.3 Evaluation Results

Table 1 shows the results. Rows (a)-(c) are the baselines without leveraging distributional word representations trained on external data: row (a) is the baseline using only frequency for ranking, and rows (b) and (c) are the results of the clustering-based ranking models from prior work (Chen et al., 2013b). Rows (d)-(j) show the performance after leveraging distributional semantics. Rows (d) and (e) are the results using representation- and neighbor-derived similarity on Google News data respectively, while rows (f) and (g) are the results on the Freebase naming data. Rows (h)-(j) show the performance of late fusion, where we use voting to combine the results of the two datasets, considering the coverage of the Google data and the precision of the Freebase data. We find that almost all results are improved by including distributed word information.

For ASR results, the performance from Google News and from Freebase is similar. Rows (h) and (i) fuse the two systems. For ASR, MAP scores are slightly improved by integrating the coverage of Google News and the accuracy of Freebase (from about 72% to 74%), but MAP-F scores do not increase. This may be because some correct slots are ranked higher after combining the two sources of evidence, but their slot-fillers do not perform well enough to increase the MAP-F scores.

Comparing the representation-derived (RepSim) and neighbor-derived (NeiSim) similarities, for both ASR and manual transcripts, the neighbor-derived similarity performs better on Google News data (row (d) vs. row (e)). The reason may be that the neighbor-derived similarity considers more semantically related words when measuring similarity (instead of only the two tokens), while the representation-derived similarity is based directly on the trained word vectors, which may degrade with recognition errors. On the Freebase data, rows (f) and (g) do not show significant improvement, probably because entities in Freebase are more precise and their word representations have higher accuracy. Hence, considering additional neighbors in the continuous vector space does not yield improvement, and the fusion of the results from the two sources (rows (h) and (i)) cannot perform better.
Table 1: The performance of induced slots and corresponding slot-fillers (%)

                                               ASR                          Manual
  Approach                           MAP   MAP-F-H  MAP-F-S       MAP   MAP-F-H  MAP-F-S
  Frame Sem
    (a) Frequency (baseline)       67.31    26.96    27.29      59.41    27.29    28.68
    (b) K-Means                    67.38    27.38    27.99      59.48    27.67    28.83
    (c) Spectral Clustering        68.06    30.52    28.40      59.77    30.85    29.22
  Frame Sem + Dist Sem
    Google News
      (d) RepSim                   72.71    31.14    31.44      66.42    32.10    33.06
      (e) NeiSim                   73.35    31.44    31.81      68.87    37.85    38.54
    Freebase
      (f) RepSim                   71.48    29.81    30.37      65.35    34.00    35.04
      (g) NeiSim                   73.02    30.89    30.72      64.87    31.05    31.86
    (h) (d) + (f)                  74.60    29.82    30.31      66.91    34.84    35.90
    (i) (e) + (g)                  74.34    31.01    31.28      68.95    33.73    34.28
    (j) (d) + (e) + (f) + (g)      76.22    30.17    30.53      66.78    32.85    33.44

However, note that the neighbor-derived similarity requires less space for the computation and sometimes produces results equal to or better than the representation-derived similarity.

Overall, we see that all combinations that leverage distributional semantics outperform using frame semantics alone; this demonstrates the effectiveness of applying distributional information to slot induction. The 76% MAP performance indicates that our proposed approach can generate good coverage for domain-specific slots in a real-world SDS. While we present results in the SLU domain, it should be possible to apply our approach to text-based NLU and slot filling tasks.

4 Conclusion

We propose the first unsupervised approach unifying frame and distributional semantics for the automatic induction and filling of slots. Our work makes use of a state-of-the-art semantic parser, and adapts the generic, linguistically principled FrameNet representation to a semantic space characteristic of a domain-specific SDS. With the incorporation of distributional word representations, we show that our automatically induced semantic slots align well with reference slots. We show the feasibility of this approach, including slot induction and slot-filling for dialogue tasks.

References

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING, pages 86–90.
Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013a. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Proceedings of ICASSP, pages 8317–8321.
Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013b. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proceedings of ASRU, pages 120–125.
Bob Coyne, Daniel Bauer, and Owen Rambow. 2011. VigNet: Grounding language in graphics using frame semantics. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, pages 28–36.
Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of NAACL-HLT, pages 948–956.
Dipanjan Das, Desai Chen, Andre F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.
Charles J Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280(1):20–32.
Charles Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111–137.
Zellig S Harris. 1954. Distributional structure. Word.
Kazi Saidul Hasan and Vincent Ng. 2013. Frame semantics for stance classification. CoNLL-2013, page 124.
Steffen Hedegaard and Jakob Grue Simonsen. 2011. Lost in translation: Authorship attribution using frame semantics. In Proceedings of ACL-HLT, pages 65–70.
Matthew Henderson, Milica Gasic, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young. 2012. Discriminative spoken language understanding using word confusion networks. In Proceedings of SLT, pages 176–181.
Karl Moritz Hermann, Dipanjan Das, Jason Weston,
and Kuzman Ganchev. 2014. Semantic frame iden-
tification with distributed word representations. In
Proceedings of ACL.
Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan
Cernocky, and Sanjeev Khudanpur. 2010. Recur-
rent neural network based language model. In Pro-
ceedings of INTERSPEECH, pages 1045–1048.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013a. Efficient estimation of word represen-
tations in vector space. In Proceedings of ICLR.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-
rado, and Jeff Dean. 2013b. Distributed representa-
tions of words and phrases and their compositional-
ity. In Proceedings of NIPS, pages 3111–3119.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
2013c. Linguistic regularities in continuous space
word representations. In Proceedings of NAACL-
HLT, pages 746–751.
Tomas Mikolov. 2012. Statistical language models
based on neural networks. Ph.D. thesis, Brno Uni-
versity of Technology.
Richard Socher, Alex Perelygin, Jean Y Wu, Jason
Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. 2013. Recursive deep mod-
els for semantic compositionality over a sentiment
treebank. In Proceedings of EMNLP, pages 1631–1642.
