Collecting Voices from the Cloud
Ian McGraw, Chia-ying Lee, Lee Hetherington, Stephanie Seneff, Jim Glass
MIT Computer Science and Artificial Intelligence Lab
32 Vassar Street Cambridge, MA, 02139, USA
imcgraw@mit.edu, chiaying@mit.edu, ilh@mit.edu, seneff@mit.edu, glass@mit.edu
Abstract
The collection and transcription of speech data is typically an expensive and time-consuming task. Voice over IP and cloud computing
are poised to eliminate this impediment to research on spoken language interfaces in many domains. This paper documents our efforts
to deploy speech-enabled web interfaces to large audiences over the Internet via Amazon Mechanical Turk, an online marketplace for
work. Using the open source WAMI Toolkit, we collected corpora in two different domains which collectively constitute over 113 hours
of speech. The first corpus contains 100,000 utterances of read speech, and was collected by asking workers to record street addresses
in the United States. For the second task, we collected conversations with Flight Browser, a multimodal spoken dialogue system. The
resulting Flight Browser corpus contains 10,651 utterances comprising 1,113 individual dialogue sessions from around 100 distinct users.
The aggregate time spent collecting the data for both corpora was just under two weeks. At times, our servers were logging audio from
workers at rates faster than real-time. We describe the process of collection and transcription of these corpora while providing an analysis
of the advantages and limitations of this data collection method.
1. Introduction
Acquiring in-domain training data for spoken language
systems is central to their development. Unfortunately, this
gives rise to a classic chicken-and-egg problem. A working
system is required to collect in-domain data; however, this
very data is needed to train the underlying models before
the system can be implemented. Over the years, we have
learned to bootstrap our acoustic and language models from
existing systems, supporting incremental parameter adaptation for particular domains. This typically slow process can
take years of data collection and iterative refinement.
Early successes in the wide deployment of sophisticated spoken language systems in the research community include MIT’s Jupiter (Zue et al., 2000) conversational
weather information system, and CMU’s Let’s Go (Raux et
al., 2006) transportation information system. Both Jupiter
and Let’s Go were telephone accessible and were publicized well enough to see a fair amount of use from outside
the laboratory. In general, however, researchers resort to
conducting user studies with high management overhead,
making the collection of large amounts of data for arbitrary
domains prohibitively expensive. This prevents data-driven
techniques from reaching their full potential.
The field of human computation is beginning to address some of these needs. Amazon Mechanical Turk
(AMT), for instance, provides a constant crowd of nonexpert workers who perform web-based tasks for micropayments of as little as $0.005. Researchers in the natural language processing community have begun to harness the potential of this cloud-based tool for both annotation (Snow et al., 2008) and data collection (Kaisser
and Lowe, 2008). The speech community, however, has
been somewhat slower to capitalize on this new paradigm.
In (McGraw et al., 2009), we employed AMT workers to
help transcribe spoken utterances. Other researchers have
created a number of transcribed corpora in a variety of domains (Novotne and Callison-Burch, 2010; Marge et al.,
2010). Despite this progress, speech researchers are only
beginning to tap into the power of the cloud.
Having explored the feasibility of the crowd transcription of speech data, it seems natural that we turn our attention to the collection of in-domain data for spoken language systems. From a technical standpoint, this endeavor
is somewhat more complex. Amazon Mechanical Turk
does not provide an interface for web-based audio collection; furthermore, while incorporating audio playback into
a website is relatively straightforward, few tools exist for
recording audio from web pages. For this work, we have
used the publicly available WAMI Toolkit, which provides
a Javascript API for speech-enabling a website (Gruenstein et al., 2008). A similar technology from AT&T is also in development (Di Fabbrizio et al., 2009). It is now feasible to integrate these speech recognition cloud services
with web pages deployed to Amazon Mechanical Turk.
We believe a web-based approach to data collection retains the advantages of telephone-based collection, while
opening up new interaction possibilities. Over the last few
years, the emergence of high-quality mobile device displays and constant network connectivity has popularized
a class of applications that make use of multiple input and
output modalities. Google and Vlingo, for example, are two
competitors in the increasingly active voice search market,
which combines spoken queries with result visualization on
a mobile display.
The Spoken Language Systems group has, for some
time, been interested in systems where points-of-interest
are spoken by the user in the form of an address (Gruenstein
et al., 2006). Over the years, we have collected data in this
domain by bringing people into the lab or through ad-hoc
web interfaces, but have never undertaken a marketing campaign to ensure heavy usage. Given the millions of possible
addresses that a user might speak, it is discouraging that we
have so little data. In a pilot experiment, we distributed
a speech-enabled web interface using Amazon Mechanical
Turk, and employed anonymous workers to read aloud addresses, eliciting a total of 103 hours of speech from 298
users. This simple task demonstrates the feasibility of large
scale speech data collection through AMT.
Since the very same API acts as the web front-end to
most of our spoken dialogue systems, we next discuss a
corpus of 1,113 dialogue sessions collected for our flight
reservation system (Seneff, 2002). Here we explore a range
of price points for web tasks deployed to AMT that ask the
worker to book a flight according to a given scenario. The
scenarios themselves were also generated by AMT workers. Finally, once the data had been collected, we posted
it back on AMT for transcription using an extension of the
methods developed in (Gruenstein et al., 2009).
2. Related Work and Background
We consider two categories of related work: other activities which utilized Amazon Mechanical Turk for tasks
related to speech and language, and previous efforts for data
collection for spoken dialogue systems.
Although there have been several recent success stories on utilizing AMT for data annotation and language-related tasks, the idea of using Amazon Mechanical Turk to
provide raw speech data is much less common. (Snow
et al., 2008) demonstrated that many natural language processing tasks fall in the category of hard-for-computers but
easy-for-humans, and thus become perfect tasks for Turkers to undertake. (Kittur et al., 2008) investigated the utility of AMT as a tool for the task of assessing the quality
of Wikipedia documents. They found that the inclusion of
explicitly verifiable tasks was crucial to avoid a tendency to
game the system. (Kaisser and Lowe, 2008) were able to
successfully utilize AMT to generate a corpus of question-answer pairs to be used in the evaluation of TREC's QA
track. The authors claimed that AMT provided a “large,
inexpensive, motivated, and immediately available” pool
of subjects. (Novotne and Callison-Burch, 2010) were
among the first to exploit Turkers for speech data transcription. (Marge et al., 2010) demonstrated how ROVER can
be used as a technique to achieve gold-standard transcription from Turkers by intelligently but automatically combining redundant contributions. This was shown to still be
much more economical than hiring expert transcribers.
Data collection has always been a major issue for researchers developing conversational spoken dialogue systems. In the early days, very costly Wizard of Oz systems
involved scheduling subjects to come to the lab and interact with a simulated spoken dialogue system. These
efforts were obviously prohibitively expensive, and limited
collected data to at most dozens of dialogues.
In the early 1990’s, researchers from MIT, CMU,
BBN, SRI, and AT&T collaborated to collect approximately 25,000 utterances in the travel domain (Hirschman
et al., 1993). In 2000 and 2001, collaborative work was
revived with the additional participation of IBM, Lucent,
Bell Labs, and the University of Colorado at Boulder as the
Communicator project. This effort led to the collection of
just over 100,000 utterances spoken to automatic or semi-automatic flight-reservation systems (Walker et al., 2001).
The collaborative data collection effort was overseen by
NIST and involved considerable effort in planning the logistics. These multi-year data collection efforts yielded
seven corpora currently sold by the Linguistic Data Consortium for hundreds of dollars apiece.
The SLS group at MIT has been developing spoken
dialogue systems since the late 1980’s. We have devoted
considerable resources to data collection as a crucial part
of the development process. Within the last few years, we
have become aware of the potential for a paradigm shift towards a model for data collection that does not require subjects to come into the laboratory. We piloted this idea first
through soliciting subjects via email and rewarding them
substantially with Amazon gift certificates. A successful
Web-based data collection effort was conducted by having subjects solve scenarios within our CityBrowser multimodal spoken dialogue system (Gruenstein and Seneff,
2007).
In the next two sections, we will first describe our efforts to collect read speech for address information through
AMT, followed by a discussion of our experiments inviting
Turkers to solve travel scenarios within Flight Browser.
3. Read Speech
This section explores the use of Amazon Mechanical
Turk to collect read speech containing spoken addresses.
3.1. Collection Procedure
Units of work on Amazon Mechanical Turk are called Human Intelligence Tasks, or HITs. Requesters build tasks and deploy them to AMT using a web interface, command-line tools, or another of the many APIs made available to
the public. Each task is assigned a price, which an individual worker will receive as payment if the requester accepts
his or her work. Requesters reserve the right to deny payment for work that is unsatisfactory.
The addresses in our reading task are taken from the
TIGER 2000 database provided by the U.S. Census Bureau.
Each address is a triple: (road, city, state). There are over
six million such triples in the TIGER database. To ensure
coverage of the 273,305 unique words contained in these
addresses, we chose a single address to correspond to each
word. 100,000 such triples formed our pilot experiment;
AMT workers were paid one U.S. cent to read each prompt.
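The prompt selection step can be viewed as a simple covering problem: each vocabulary word must appear in at least one chosen (road, city, state) triple. The sketch below is a minimal greedy illustration of this idea in Python, assuming the triples are available as strings; it is not necessarily the exact selection procedure used to build the corpus.

import re

def select_prompts(address_triples):
    """Greedily keep one address for each not-yet-covered vocabulary word.

    address_triples: iterable of (road, city, state) strings.
    Returns a list of prompts whose union covers every word that appears
    anywhere in the input.
    """
    covered, prompts = set(), []
    for triple in address_triples:
        words = set(re.findall(r"[a-z']+", " ".join(triple).lower()))
        if words - covered:            # this address contributes a new word
            prompts.append(triple)
            covered |= words
    return prompts

# Hypothetical example: the second triple is kept only for the new word "main".
demo = [("Vassar Street", "Cambridge", "Massachusetts"),
        ("Main Street", "Cambridge", "Massachusetts")]
print(select_prompts(demo))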
Figure 1 shows an example HIT.[1] After the worker has
recorded an address, they are required to listen to a playback of that utterance before moving on, to help mitigate
problems with microphones or the acoustic environment.
Still, since we are employing anonymous, non-expert
workers, there is little incentive to produce high quality utterances, and a worker may even try to game the system.
We propose two distinct ways to validate worker data. The
first is to have humans validate the data manually. Given
the success of previous AMT transcription tasks, we can
simply pay workers to listen to the cloud-collected speech
and determine whether the expected words were indeed
spoken. The second approach, which we explore in this
section, is to integrate a speech recognizer into the data collection process itself.
Since the VoIP interface we employ is used by our dialogue systems, we have the ability to incorporate the recognizer in real time. Thus, we can block workers who do not
satisfy our expectations immediately. For the pilot experiment, however, we decided not to block any workers. Running the recognizer in a second pass allows us to examine the raw data collected through AMT and experiment with different methods of blocking unsuitable work, the best of which would be deployed in future database collection efforts.

[1] Five utterances were bundled into each task for optimal pricing, given Amazon's minimum commission of $0.005.

Figure 1: A sample Human Intelligence Task (HIT) for collecting spoken addresses.

Figure 2: Data collection rate (utterances per hour) as a function of the worker's local time of day.

Figure 3: Individual workers plotted according to the recognizer-estimated quality of their work and the number of utterances they contributed to our corpus (log scale), broken down by gender and by native versus non-native status.

Figure 4: By filtering out users whose estimated quality does not meet a certain threshold, we can simulate the effect of using the recognizer to automatically block workers. The two curves show the fraction of the initial data set with quality > q, and with no anomalies and quality > q.
3.2. Corpus Overview
Our reading tasks were posted to Amazon Mechanical Turk on a Wednesday afternoon. Within 77 hours,
298 workers had collectively completed reading all 100,000
prompts, yielding a total of 103 hours of audio. Figure 2
depicts the average number of utterances collected per hour
plotted according to time of day. Workers tended to talk
with our system during their local afternoon; however, the varying time zones smoothed out the collection rate with respect to the load on our servers.
The majority of our data, 68.6%, was collected from
workers within the United States. India, the second largest
contributor to our corpus, represented 19.6% of our data.
While some non-native speakers produced high quality utterances, others had nearly unintelligible accents. This, as
well as the fact that the acoustic environment varied greatly
from speaker to speaker, makes the MIT Address corpus particularly challenging for speech recognition.
To determine the properties of our corpus without listening to all 103 hours of speech, two researchers independently sampled and annotated 10 utterances from each
worker. Speakers were marked as male or female and native or non-native. Anomalies in each utterance, such as unintelligible accents, mispronounced words, cut-off speech,
and background noise, were marked as present or absent.
We then extrapolated statistics for the overall corpus based
on the number of utterances contributed by a given worker.
From this, we have estimated that 74% of our data is cleanly
read speech.
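Because only ten utterances per worker were hand-checked, corpus-level figures such as the 74% clean-speech estimate are obtained by weighting each worker's sampled anomaly rate by the number of utterances that worker contributed. A minimal sketch of this weighting, with hypothetical numbers, is shown below.

def estimate_clean_fraction(workers):
    """Extrapolate the fraction of cleanly read speech for the full corpus.

    workers: dict mapping worker id to a pair
             (utterances_contributed, clean_fraction_in_10_utterance_sample).
    """
    total = sum(n for n, _ in workers.values())
    clean = sum(n * frac for n, frac in workers.values())
    return clean / total

# Hypothetical workers: a prolific clean reader and a noisy low-volume worker.
print(estimate_clean_fraction({"w1": (900, 0.9), "w2": (100, 0.3)}))  # 0.84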
This result raises the question of how to effectively
manage the quality of speech collected from the cloud.
Here we explore an automatic method which incorporates
our speech recognizer into the validation process. In particular, we run the recognizer that we have built for the address domain over each utterance collected. We then assign
a quality estimate, q, to each worker by computing the fraction of recognition hypotheses that contain the U.S. state
expected given the prompt. Figure 3 shows the recognizer-estimated quality of each worker, plotted against the number of utterances that worker contributed. Notice that a single worker may actually be two or more different people
using the same Amazon account.
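A sketch of this quality estimate is shown below: q for a worker is simply the fraction of that worker's recognition hypotheses that mention the U.S. state named in the corresponding prompt. The tuple-based log format and the substring match are illustrative assumptions, not the actual recognizer output format.

from collections import defaultdict

def worker_quality(utterance_log):
    """Per-worker quality estimate q.

    utterance_log: iterable of (worker_id, expected_state, hypothesis) tuples,
    where hypothesis is the recognizer's 1-best string for the utterance.
    q(worker) = fraction of that worker's hypotheses mentioning the expected state.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for worker, state, hyp in utterance_log:
        totals[worker] += 1
        if state.lower() in hyp.lower():
            hits[worker] += 1
    return {w: hits[w] / totals[w] for w in totals}

# Hypothetical log entries.
log = [("w1", "Massachusetts", "thirty two vassar street cambridge massachusetts"),
       ("w1", "Texas", "main street austin texas"),
       ("w2", "Ohio", "[noise]")]
print(worker_quality(log))  # {'w1': 1.0, 'w2': 0.0}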
AMT provides requesters with the ability to block
workers who do not perform adequate work. Using our
automatic method of quality estimation, we simulate the
effects of blocking users according to a quality threshold
q. It is clear from Figure 4 that, while the data collection
rate might have slowed, requiring a high q effectively filters out workers who contribute anomalous utterances. Figure 5 depicts how the corpus properties change when we set q = .95. While not unexpected, it is nice to see that egregiously irregular utterances are effectively filtered out.

Figure 5: Breakdown of anomalies (no anomalies; multiple anomalies, cut-off speech, or silence; very thick accent; breathing into a close-talking microphone; background noise) present in the corpus as a whole (74% with no anomalies) and in the sub-corpus of workers with a high quality estimate, q >= .95 (90% with no anomalies). The recognizer-filtered sub-corpus still retains 65% of the original speech data. Though not explicitly shown here, we found that non-native speakers were still able to contribute to this sub-corpus: 5% of the filtered data with no anomalies came from non-native speakers.
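Given per-worker quality estimates, the simulation behind Figure 4 reduces to discarding all utterances from workers whose q falls below a threshold and measuring how much of the corpus survives. A minimal sketch, assuming per-worker utterance counts and quality scores as input, follows.

def retained_fraction(workers, q_threshold):
    """Fraction of utterances kept if workers with quality below q_threshold are blocked.

    workers: dict mapping worker id to (utterance_count, quality_estimate_q).
    """
    total = sum(n for n, _ in workers.values())
    kept = sum(n for n, q in workers.values() if q >= q_threshold)
    return kept / total

# Hypothetical worker pool; sweeping the threshold traces out a curve like Figure 4.
pool = {"w1": (600, 0.97), "w2": (300, 0.80), "w3": (100, 0.40)}
for q in (0.5, 0.8, 0.95):
    print(q, retained_fraction(pool, q))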
4. Multimodal Dialogue Interactions
Our experience with the address corpus inspired us to
deploy a fully functional multimodal spoken dialogue system to AMT. This section describes the process of designing AMT tasks for a spoken dialogue system.
4.1. System Design
Flight Browser was derived from the telephone-based
Mercury system, originally developed under the DARPA
Communicator project (Seneff, 2002). Mercury’s design
was based on a mixed-initiative model for dialogue interaction. The system verbally describes the set of returned database tuples in a conversational manner. It prompts for
relevant missing information at each point in the dialogue,
but there are no constraints on what the user can say next.
Thus, the full space of the understanding system is available
at all times.
Using the WAMI Toolkit, we adapted Mercury to a
multimodal web-interface we call Flight Browser. The dialogue interaction was modified, mainly by reducing the
system’s verbosity, to reflect the newly available visual
itinerary and flight list display. A real database, provided
by ITA, is utilized. About 600 major cities are supported
worldwide, with a bias towards U.S. cities. The interface
was designed to fit the size constraints of a mobile phone,
and multimodal support, such as clicking to sort or book
flights, was added. Figure 6 shows Flight Browser in a
WAMI browser built specially for the iPhone.
Figure 6: Flight browser interface loaded in the iPhone’s
WAMI browser. The same interface can be accessed from a desktop in any modern web browser.
4.2. Scenario Creation
When designing a user study, many spoken dialogue
system researchers struggle with the question of how to
elicit interesting data from users without biasing the language that they use to produce it. Some have tried to present
scenarios in tabular form, while others prefer to introduce
extra language, hoping that the user will only pick up on
the important details of a scenario rather than the language
in which it is framed. Continuing the theme of crowdsourcing research tasks, we take an alternative approach.
To generate scenarios, we created an Amazon Mechanical Turk task which asked workers what they would expect
from a flight reservation system. They were explicitly told
that we were trying to build a conversational system that
could handle certain queries about flights, and we provided
them with a few example scenarios that our system can handle. Their job was then to construct a set of new scenarios,
each starting with the word “You...” and continuing to describe “your” desired itinerary. We paid $0.03 per scenario,
and within a few hours 72 distinct workers had given us 273
scenarios, examples of which are shown below.
1. You need to find the cheapest flight from Maryland to
Tampa, Florida. Find a cheap flight out of your choice
of Philadelphia, Dulles or Baltimore airports.
2. You won a prize to visit Disneyland but have to provide
your own airfare. You are going the week of Valentine's
Day and you need 2 tickets from Seattle. You only have
$500 to spend on tickets.
3. Destination: London, England
Departs: January 15th, 2010 anytime in the morning
Returns: March 1st, 2010 anytime after 3:00pm
Price: Anything under $1500. To make things
interesting I want as many layovers as possible!
4. You would like to take a vacation in Puerto Rico for two
weeks. The departure and arrival dates must be on a Saturday.
5. You are a cartoonish crime boss in New York City, but
Batman has caught on to you and you need to skip town
for a while. You decide to head for Memphis, a city not
normally known for costumed villainy. Set up a one-way
flight with no layovers first thing tomorrow morning; cost
is no object.
Figure 7: Sample HIT and collected interaction. (a) A sample HIT as shown to an AMT worker (screenshot). (b) A sample dialogue collected using this HIT, shown below. Parts of the system responses have been omitted for brevity. Also not shown are the updates to the visual display which show the user query results.

S: [...] How can I help you?
U: i need a flight from kansas city to providence rhode island rhode island
S: [...] What date will you be travelling?
U: monday morning
S: [...] One moment please.
S: [...] There are at least 27 flights departing between 5:30 a.m. and 12:39 a.m.
U: i will take the five thirty
S: [...] What date will you be returning on?
U: [noise] friday afternoon
S: [...] One moment please.
S: [...] There are at least 20 flights departing between 12:23 p.m. and 6:45 p.m.
U: i will take the twelve twenty three
S: Can you provide a departure or arrival time? There are at least 6 Delta flights from Providence to Kansas City on Friday February fifth.
U: arrive at five oh four
S: [...] How about a flight arriving at 5:04 p.m. connecting in Detroit? Would that work? [...]
U: yes
Not all of these 273 scenarios were suitable for a user
study. As shown in the examples, some workers did not
fully follow the directions. Other crowd-sourced scenarios
had dates that were in the past by the time we deployed our
system. For the most part, however, the scenarios generated
were far more creative and varied than anything we could
have come up with ourselves in such a short amount of time.
Although it was clear that some tasks would cause our system trouble, we did not explicitly exclude such scenarios
from our study. For example, our system does not have
Disneyland in its vocabulary, let alone a mapping from the
landmark to the nearest airport. Ultimately, we chose 100
scenarios to form the basis of the data collection procedure
described in the next section.
4.3. Data Collection
Although the interface shown in Figure 6 is optimized
for a mobile device, the WAMI Toolkit allows us to access
it from modern desktop browsers as well. The following
paragraphs describe how we were able to collect over 1,000
dialogue sessions averaging less than $0.20 apiece in under
10 days of deployment on Amazon Mechanical Turk.
4.3.1. HIT Design
The design of a HIT is of paramount importance with
respect to the quality of the data we collected using Amazon Mechanical Turk. Novice workers, unused to interacting with a spoken language interface, present a challenge
to system development in general, and the AMT-workers
are no exception. Fortunately, AMT can be used as an opportunity to iteratively improve the interface, using worker
interactions to guide design decisions.
To optimize the design of our system and the HIT, we
deployed short-lived AMT tasks and followed them up with
improvements based on the interactions collected. Since
the entire interaction is logged on our servers, we also have
the ability to replay each session from start to finish, and
can watch and listen to the sequence of dialogue turns taking place in a browser. By replaying sessions from an early
version of our interface, we discovered that many workers
were not aware that they could click on a flight to view the
details. This inspired the addition of the arrows on the left
hand side, to indicate the potential for drop-down details.
Although initially we had hoped to minimize the instructions on screen, we found that, without guidance, a
number of AMT-workers just read the scenario aloud. Even
after providing them with a short example of something
they could say, a few workers were still confused, so we
added an explicit note instructing them to avoid repeating
the scenario verbatim. After a few iterations of redeploying
and retuning the dialogue and scenario user interfaces, we
eventually converged on the HIT design shown in Figure 7.
In order to complete a HIT successfully, a worker was
required to book at least one flight (although we did not
check that it matched the scenario); otherwise they were
asked to “give up.” Whether the task was completed or not,
the worker had the option of providing written feedback
about their experience on each scenario before submitting.
4.3.2. Extended Deployment
With the design stage complete, we decided to leave
our HIT on AMT for an extended period of time to collect a large amount of data. Beginning on a Tuesday, we
deployed Flight Browser to AMT and paid workers $0.20
for each scenario. We restricted the deployment to workers who had Amazon accounts in the United States. Each
worker was limited to submitting sessions corresponding to
the 100 scenarios. By Friday at 7:30 am (i.e., after less than
three days of elapsed time) we had collected 876 dialogues
from 63 distinct users, totaling 9,372 audio files.
Curious about how price affected the rate of collection,
we deployed the same task for $0.10 a little over a month
later. This task was started on a Thursday and left running
for 6 days. Though clearly there was less interest in the
HIT, we were still able to collect 2,595 audio files over 237 dialogues from 43 distinct workers. It should be noted that we made no special effort to exclude workers from the earlier task from participating in the $0.10-HIT a month later.

Figure 8: A breakdown of the data collection efforts by worker. For each price point, the workers are sorted in ascending order of the number of dialogues they contributed to the corpus. Numbers 1-5 identify the five workers who participated in both data collection efforts.

Figure 9: Corpora Statistics

                           $0.20 HIT    $0.10 HIT
  # Sessions                     876          237
  # Distinct Workers              63           43
  # Utterances                 8,232        2,419
  Avg. # Utts. / Session         9.5         10.2
  % Sessions Gave Up            14.7         17.3
Figure 8 shows histograms, for each price point, of the sessions collected from individual workers, as well as the number of tasks they marked “finished” or “give up”. As
shown in the plots, five workers participated in the $0.10
task despite being paid more previously. In fact, three of the
top four contributors to the second round of data collection
were repeat visitors. It’s interesting to note that they were
still willing to participate despite earning half as much.
In both deployments a non-trivial number of audio files
were recognized as noise or silence. This phenomenon has
been observed previously when utterances come from more
realistic sources (Ai et al., 2007). Listening to these in context, it became apparent that some users required time to
familiarize themselves with the recording software. We decided to ignore the 1,316 files without meaningful recognition results, leaving 10,651 utterances for subsequent analysis. Figure 9 summarizes the corpus statistics from both
deployments.
4.4. Data Transcription
To transcribe our newly collected data, we once again
turn to the Amazon Mechanical Turk cloud service. Previous work has explored the use of AMT for transcription to
generate high accuracy orthographies. We explore this area
further, and show how seeding the transcription interface
with recognizer hypotheses enables an automatic detection method for “bad” transcripts.

Figure 10: Flowchart for the transcription task.
Figure 10 depicts a flowchart of our transcription procedure. We deployed our entire corpus to AMT in a $0.05
HIT, which allows workers to listen to utterances and correct recognizer hypotheses. Each HIT contains a bundle
of 10 utterances for transcription. Since the average number of words per HIT is around 45, the likelihood that none
of them need to be corrected is quite low. This allows us
to detect lazy workers by comparing the submitted transcripts with the original hypotheses. We found that 76%
of our non-expert transcribers edited at least one word in
over 90% of their HITs. We assumed that the remaining
workers were producing unreliable transcripts, and therefore discarded their transcripts from further consideration.
There are a number of different filters which could be
used to detect bad transcripts. In this work, we assume that
a transcript needs to be edited if more than two workers
have made changes, and then filter out transcripts which
match the hypothesis.
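The two filters can be sketched as follows, assuming each submitted transcript is stored alongside the recognizer hypothesis it was seeded with. For brevity the worker-level filter below operates per utterance rather than per ten-utterance HIT, but the logic is the same: drop workers who almost never edit the seed text, then, for utterances that more than two workers changed, discard any remaining transcript identical to the hypothesis.

from collections import defaultdict

def filter_transcripts(submissions, min_edit_rate=0.9):
    """Filter AMT transcripts that were seeded with recognizer hypotheses.

    submissions: list of dicts with keys "worker", "utt_id", "hypothesis", "transcript".
    Returns the submissions that survive both filters.
    """
    # Worker-level filter: drop workers who left the seed hypothesis untouched too often.
    edits = defaultdict(list)
    for s in submissions:
        edits[s["worker"]].append(s["transcript"] != s["hypothesis"])
    lazy = {w for w, flags in edits.items() if sum(flags) / len(flags) < min_edit_rate}
    kept = [s for s in submissions if s["worker"] not in lazy]

    # Transcript-level filter: if more than two workers edited an utterance,
    # treat any remaining transcript identical to the hypothesis as unreliable.
    edited_count = defaultdict(int)
    for s in kept:
        if s["transcript"] != s["hypothesis"]:
            edited_count[s["utt_id"]] += 1
    return [s for s in kept
            if not (edited_count[s["utt_id"]] > 2 and s["transcript"] == s["hypothesis"])]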
The question of how to obtain accurate transcripts from
non-expert workers has been addressed by (Marge et al.,
2010), who employ the ROVER voting scheme to combine transcripts. Indeed, a number of transcript combination techniques could be explored. In this work, we take a
simple majority vote, unless there is no agreement between
five unfiltered transcripts, at which point we begin to accept
a plurality. We found that 95.2% of our data only needed
3 good transcriptions to pass a simple majority vote. The
table below indicates the amount of data we were able to
transcribe for a given number of good transcripts.
# Good Transcripts Required (G)      2      3      5      6     7+
% Corpus Transcribed (T)          84.4   95.2   96.3   98.4   99.6
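A minimal sketch of the voting step is given below: a transcript is accepted once a simple majority of the good (unfiltered) transcripts agree on it, and after five transcripts without a majority a unique plurality is accepted instead. The normalization and tie handling are our own illustrative assumptions.

from collections import Counter

def vote(transcripts, majority_pool=5):
    """Choose a reference transcript from the good (unfiltered) AMT transcripts.

    Accept by simple majority; once `majority_pool` transcripts have been collected
    without a majority, accept a unique plurality instead.
    Returns (candidate, accepted).
    """
    norm = [" ".join(t.lower().split()) for t in transcripts]
    ranked = Counter(norm).most_common(2)
    winner, support = ranked[0]
    if support > len(norm) / 2:                       # strict majority
        return winner, True
    unique = len(ranked) == 1 or support > ranked[1][1]
    if len(norm) >= majority_pool and unique:         # fall back to a plurality
        return winner, True
    return winner, False                              # keep collecting transcripts

print(vote(["i want to fly to boston", "i want to fly to boston",
            "i want a flight to boston"]))            # accepted after 3 transcripts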
Fifty-three audio files did not have an accepted transcript even after collecting 15 transcripts. We listened to this audio and discovered anomalies such as foreign-language speech, singing, or garbled noise that caused AMT workers to start guessing at the transcription.

Figure 11: These charts indicate whether the AMT transcripts were consistent, semi-consistent, or inconsistent with the expert transcribers, for all data and for the expert-edited data. The semi-consistent case arises when the experts disagreed and the AMT transcript matched one of their transcripts.
In order to assess the quality of our AMT-transcribed
utterances, we had two expert transcribers perform the
same HIT for 1,000 utterances randomly selected from the
corpus. We compared the orthographies of our two experts
and found sentence-level exact agreement to be 93.1%. The
AMT-transcripts had 93.2% agreement with the first expert
and 93.1% agreement with the second, indicating that our
AMT-derived transcripts were of very high quality.
Figure 11 shows a detailed breakdown of agreement,
depicting the consistency of the AMT transcripts with those
of our experts. For example, of all the data edited by at least
one expert, only 6% of the AMT-transcripts were inconsistent with an expert-agreed-upon transcript. Where the
experts disagree, AMT-labels often match one of the two,
indicating that the inconsistencies in AMT transcripts are
often reasonable. For example, “I want a flight to” versus “I want to fly to” was a common disagreement.
Lastly, we also asked workers to annotate each utterance with the speaker’s gender. Again, taking a simple vote
allows us to determine that a majority of our corpus (69.6%)
consists of male speech.
4.5. Data Analysis
Using the AMT-transcribed utterances, we can deduce
that the WER of our system was 18.1%. We note, however, that, due to the monetary incentives inherent in Amazon Mechanical Turk, this error rate may be artificially low,
since workers who found the task frustrating were free to
abandon the job. Figure 12 shows the WER for each worker
plotted against the number of sessions they contributed. It’s
clear that workers with high error rates rarely contributed
more than a few sessions. To provide a fairer estimate
of system performance across users, we average WER for
each speaker in our corpus and revise our estimate of WER
to 24.4%.
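The two figures correspond to pooling errors over all utterances (18.1%) versus macro-averaging WER over speakers (24.4%). The sketch below, using a standard word-level edit distance and hypothetical data, shows why the per-speaker average can be higher when heavy users happen to have low error rates.

def word_errors(ref, hyp):
    """Minimum word-level edit distance (S + I + D) and reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
            prev, d[j] = d[j], cur
    return d[len(h)], len(r)

def pooled_and_per_speaker_wer(utterances):
    """utterances: list of (speaker, reference, hypothesis) triples."""
    per_spk = {}
    for spk, ref, hyp in utterances:
        e, n = word_errors(ref, hyp)
        err, tot = per_spk.get(spk, (0, 0))
        per_spk[spk] = (err + e, tot + n)
    pooled = sum(e for e, _ in per_spk.values()) / sum(n for _, n in per_spk.values())
    macro = sum(e / n for e, n in per_spk.values()) / len(per_spk)
    return pooled, macro

# Hypothetical data: a heavy user recognized well and a light user recognized poorly.
utts = [("A", "show me flights to boston", "show me flights to boston")] * 4 + \
       [("B", "flights to tulsa", "flights to salt lake")]
print(pooled_and_per_speaker_wer(utts))  # pooled WER is much lower than the per-speaker average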
Figure 12: The number of sessions contributed by each worker, for both the 10-cent and 20-cent deployments, plotted against the WER experienced by that worker.

Upon replaying a number of sessions, we were quite happy with the types of interactions collected. A number of sessions exposed weaknesses in our system that we intend to correct in future development. The workers were given the opportunity to provide feedback, and many gave us valuable comments, compliments and criticisms, a few of which are shown below.
1. There was no real way to go back when it misunderstood
the day I wanted to return. It should have a go back function or command.
2. Fine with cities but really needs to get dates down better.
3. The system just cannot understand me saying “Tulsa”.
4. Was very happy to be able to say two weeks later and not
have to give a return date. System was not able to search
for lowest fare during a two week window.
5. I think the HIT would be better if we had a more specific
date to use instead of making them up. Thank you, your
HITs are very interesting.
To analyze the properties of the corpus objectively, we
decided to compare the recognition hypotheses contained in
the worker interactions with those of our internal database.
From March 2009 to March 2010, Flight Browser was under active development by five members of our lab.
Every utterance spoken to Flight Browser in this time has
been logged in a database. User studies have been conducted in the laboratory and demos have been presented to
interested parties. The largest segment of this audio, however, comes from developers, who speak to the system for
development and debugging. In total, 9,023 utterances were
recorded, and these comprise our internal database. We
summarize a number of statistics common to the internal
and AMT-collected corpora in the table below.
                         Internal    AMT-Collected
  # Utts.                   9,023            8,232
  # Hyp Tokens             49,917           36,390
  # Unique Words              740              758
  # Unique Bigrams          4,157            4,171
  # Unique Trigrams         6,870            7,165
  Avg. Utt. Length            4.8              4.4
While the average recognizer hypothesis is longer and
the number of words is greater in our internal data, the overall language in the cloud-collected corpus appears to be
more complex, as illuminated by distinct n-gram counts.
This may in part be because system experts know how to
speak with the system, and can communicate in longer
phrases; however, they perhaps do not formulate new
queries as creatively, especially while debugging.
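The lexical statistics above can be recomputed directly from recognizer hypothesis strings; the sketch below shows one straightforward way to count distinct n-grams and the average hypothesis length (the list-of-strings input format is an assumption).

def corpus_stats(hypotheses, max_n=3):
    """Distinct n-gram counts and average length for a list of hypothesis strings."""
    distinct = {n: set() for n in range(1, max_n + 1)}
    total_tokens = 0
    for hyp in hypotheses:
        words = hyp.split()
        total_tokens += len(words)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                distinct[n].add(tuple(words[i:i + n]))
    return {n: len(s) for n, s in distinct.items()}, total_tokens / len(hypotheses)

# Hypothetical hypotheses.
print(corpus_stats(["i need a flight to boston", "show me flights to denver"]))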
5. Discussion and Future Work
In this paper, we have demonstrated the utility of the
Amazon Mechanical Turk cloud service in a number of spoken dialogue system development tasks. We have explored
the practicality of deploying a simple read-aloud task to
AMT, and extended this approach to spontaneous speech
solicitation within a multimodal dialogue system. We have
shown that it is possible to collect large amounts of in-domain speech data very quickly and relatively cheaply.
Central to this work has been designing tasks for non-expert workers that are easily verifiable. We have shown
how the recognizer can be used as a tool to loosely constrain
both transcription and collection tasks, allowing us to filter
out low quality data. When taken to the limit, much of
the drudge work associated with spoken-dialogue system
research can be easily outsourced to the cloud.
While the collection, transcription, and scenario generation capabilities we have explored here are powerful, we
have not yet approached the question of how best to supplement dialogue system corpora with annotations, such
as user satisfaction and dialogue acts (Hastie et al., 2002).
Perhaps this too can be incorporated into a Human Intelligence Task. As researchers begin to create large-scale, well-annotated corpora using cloud services such as Amazon
Mechanical Turk, our hope is that research on data-driven
approaches to dialogue system components, e.g. (Williams
and Young, 2007), will become feasible on realistic data
from arbitrary domains.
Another interesting line of research might be to devise
a framework for spoken dialogue system evaluation using
this service. A possibility is to construct a set of guidelines that multiple systems in a common domain would
be required to follow in order to put them through a rigorous evaluation, much like the former NIST evaluations.
If a cloud-based evaluation framework could be devised,
however, the management overhead of such an evaluation
would be greatly reduced, and a potentially unlimited number of institutions could participate.
6. Acknowledgments
This research is funded in part by the T-Party project, a joint
research program between MIT and Quanta Computer Inc., Taiwan.
7. References
Hua Ai, Antoine Raux, Dan Bohus, Maxine Eskenazi, and Diane Litman. 2007. Comparing spoken dialog corpora collected
with recruited subjects versus real users. In Proc. of SIGdial.
Giuseppe Di Fabbrizio, Thomas Okken, and Jay G. Wilpon. 2009. A speech mashup framework for multimodal mobile services. In The Eleventh International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), November.
Alexander Gruenstein and Stephanie Seneff. 2007. Releasing a
multimodal dialogue system into the wild: User support mechanisms. In Proc. of the 8th SIGdial Workshop on Discourse and
Dialogue, pages 111–119.
Alexander Gruenstein, Stephanie Seneff, and Chao Wang. 2006.
Scalable and portable web-based multimodal dialogue interaction with geographical databases. In Proc. of INTERSPEECH,
September.
Alexander Gruenstein, Ian McGraw, and Ibrahim Badr. 2008.
The WAMI toolkit for developing, deploying, and evaluating
web-accessible multimodal interfaces. In Proc. of ICMI, October.
Alexander Gruenstein, Ian McGraw, and Andrew Sutherland.
2009. A self-transcribing speech corpus: Collecting continuous speech with an online educational game. In Proc. of the
Speech and Language Technology in Education (SLaTE) Workshop, September.
Helen Wright Hastie, Rashmi Prasad, and Marilyn Walker. 2002.
Automatic evaluation: Using a date dialogue act tagger for
user satisfaction and task completion prediction. In Proc. of
the Language Resources and Evaluation (LREC), May.
L. Hirschman, M. Bates, D. Dahl, W. Fisher, J. Garofolo, D. Pallett, K. Hunicke-Smith, P. Price, A. Rudnicky, and E. Tzoukermann. 1993. Multi-site data collection and evaluation in
spoken language understanding. In Proc. of the workshop on
Human Language Technology. Association for Computational
Linguistics.
Michael Kaisser and John Lowe. 2008. Creating a research collection of question-answer sentence pairs with Amazon's Mechanical Turk. In Proc. of the Language Resources and Evaluation (LREC), May.
Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In CHI ’08: Proceedings of the twenty-sixth annual SIGCHI conference on Human
factors in computing systems, pages 453–456, New York, NY,
USA. ACM.
Matthew Marge, Satanjeev Banerjee, and Alexander Rudnicky.
2010. Using the amazon mechanical turk for transcription of
spoken language. In Proc. of ICASSP, March.
Ian McGraw, Alexander Gruenstein, and Andrew Sutherland.
2009. A self-labeling speech corpus: Collecting spoken words
with an online educational game. In Proc. of INTERSPEECH,
September.
Scott Novotne and Chris Callison-Burch. 2010. Cheap, fast and
good enough: Automatic speech recognition with non-expert
transcription. In Proceedings of ICASSP, forthcoming.
Antoine Raux, Dan Bohus, Brian Langner, Alan Black, and Maxine Eskenazi. 2006. Doing research on a deployed spoken dialogue system: One year of Let’s Go! experience. In Proceedings of INTERSPEECH-ICSLP, September.
Stephanie Seneff. 2002. Response planning and generation in the
MERCURY flight reservation system. Computer Speech and
Language, 16:283–312.
Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast — but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings
of EMNLP, October.
M. Walker, J. Aberdeen, J. Boland, E. Bratt, J. Garofolo,
L. Hirschman, A. Le, S. Lee, S. Narayanan, K. Papineni,
B. Pellom, J. Polifroni, A. Potamianos, P. Prabhu, A. Rudnicky, G. Sanders, S. Seneff, D. Stallard, and S. Whittaker.
2001. DARPA communicator dialog travel planning systems:
The june 2000 data collection. In Proc. of EUROSPEECH.
J.D. Williams and S. Young. 2007. Scaling POMDPs for spoken
dialog management. IEEE Transactions on Audio, Speech, and
Language Processing, 15(7):2116–2129, September.
Brandon Yoshimoto, Ian McGraw, and Stephanie Seneff. 2009.
Rainbow rummy: A web-based game for vocabulary acquisi-
tion using computer directed speech. In Proc. of the Speech
and Language Technology in Education (SLaTE) Workshop.
Victor Zue, Stephanie Seneff, James Glass, Joseph Polifroni,
Christine Pao, Timothy J. Hazen, and Lee Hetherington.
2000. JUPITER: A telephone-based conversational interface
for weather information. IEEE Transactions on Speech and Audio Processing, 8(1), January.