
Collecting Voices from the Cloud

2010, Language Resources and Evaluation

Ian McGraw, Chia-ying Lee, Lee Hetherington, Stephanie Seneff, Jim Glass
MIT Computer Science and Artificial Intelligence Lab
32 Vassar Street, Cambridge, MA, 02139, USA
imcgraw@mit.edu, chiaying@mit.edu, ilh@mit.edu, seneff@mit.edu, glass@mit.edu

Abstract

The collection and transcription of speech data is typically an expensive and time-consuming task. Voice over IP and cloud computing are poised to eliminate this impediment to research on spoken language interfaces in many domains. This paper documents our efforts to deploy speech-enabled web interfaces to large audiences over the Internet via Amazon Mechanical Turk, an online marketplace for work. Using the open source WAMI Toolkit, we collected corpora in two different domains which collectively constitute over 113 hours of speech. The first corpus contains 100,000 utterances of read speech, and was collected by asking workers to record street addresses in the United States. For the second task, we collected conversations with Flight Browser, a multimodal spoken dialogue system. The Flight Browser corpus obtained contains 10,651 utterances composing 1,113 individual dialogue sessions from around 100 distinct users. The aggregate time spent collecting the data for both corpora was just under two weeks. At times, our servers were logging audio from workers at rates faster than real-time. We describe the process of collection and transcription of these corpora while providing an analysis of the advantages and limitations to this data collection method.

1. Introduction

Acquiring in-domain training data for spoken language systems is central to their development. Unfortunately, this gives rise to a classic chicken-and-egg problem. A working system is required to collect in-domain data; however, this very data is needed to train the underlying models before the system can be implemented.
Over the years, we have learned to bootstrap our acoustic and language models from existing systems, supporting incremental parameter adaptation for particular domains. This typically slow process can take years of data collection and iterative refinement. Early successes in the wide deployment of sophisticated spoken language systems in the research community include MIT’s Jupiter (Zue et al., 2000) conversational weather information system, and CMU’s Let’s Go (Raux et al., 2006) transportation information system. Both Jupiter and Let’s Go were telephone accessible and were publicized well enough to see a fair amount of use from outside the laboratory. In general, however, researchers resort to conducting user studies with high management overhead, making the collection of large amounts of data for arbitrary domains prohibitively expensive. This prevents data-driven techniques from reaching their full potential.

The field of human computation is beginning to address some of these needs. Amazon Mechanical Turk (AMT), for instance, provides a constant crowd of non-expert workers who perform web-based tasks for micropayments of as little as $0.005. Researchers in the natural language processing community have begun to harness the potential of this cloud-based tool for both annotation (Snow et al., 2008) and data collection (Kaisser and Lowe, 2008). The speech community, however, has been somewhat slower to capitalize on this new paradigm. In (McGraw et al., 2009), we employed AMT workers to help transcribe spoken utterances. Other researchers have created a number of transcribed corpora in a variety of domains (Novotney and Callison-Burch, 2010; Marge et al., 2010). Despite this progress, speech researchers are only beginning to tap into the power of the cloud.

Having explored the feasibility of the crowd transcription of speech data, it seems natural that we turn our attention to the collection of in-domain data for spoken language systems.
From a technical standpoint, this endeavor is somewhat more complex. Amazon Mechanical Turk does not provide an interface for web-based audio collection; furthermore, while incorporating audio playback into a website is relatively straightforward, few tools exist for recording audio from web pages. For this work, we have used the publicly available WAMI Toolkit, which provides a Javascript API for speech-enabling a web-site (Gruenstein et al., 2008). A similar technology from AT&T is also in development (Giuseppe Di Fabbrizio, 2009). It is now feasible to integrate these speech recognition cloud services with web pages deployed to Amazon Mechanical Turk. We believe a web-based approach to data collection retains the advantages of telephone-based collection, while opening up new interaction possibilities. Over the last few years, the emergence of high-quality mobile device displays and constant network connectivity have popularized a class of applications that make use of multiple input and output modalities. Google and Vlingo, for example, are two competitors in the increasingly active voice search market, which combines spoken queries with result visualization on a mobile display. The Spoken Language Systems group has, for some time, been interested in systems where points-of-interest are spoken by the user in the form of an address (Gruenstein et al., 2006). Over the years, we have collected data in this domain by bringing people into the lab or through ad-hoc web interfaces, but have never undertaken a marketing campaign to ensure heavy usage. Given the millions of possible addresses that a user might speak, it is discouraging that we have so little data. In a pilot experiment, we distributed a speech-enabled web interface using Amazon Mechanical Turk, and employed anonymous workers to read aloud addresses, eliciting a total of 103 hours of speech from 298 users. This simple task demonstrates the feasibility of large scale speech data collection through AMT. 
Since the very same API acts as the web front-end to most of our spoken dialogue systems, we next discuss a corpus of 1113 dialogue sessions collected for our flight reservation system (Seneff, 2002). Here we explore a range of price points for web tasks deployed to AMT that ask the worker to book a flight according to a given scenario. The scenarios themselves were also generated by AMT workers. Finally, once the data had been collected, we posted it back on AMT for transcription using an extension of the methods developed in (Gruenstein et al., 2009).

2. Related Work and Background

We consider two categories of related work: other activities which utilized Amazon Mechanical Turk for tasks related to speech and language, and previous efforts for data collection for spoken dialogue systems.

Although there have been several recent success stories on utilizing AMT for data annotation and language-related tasks, the idea of using Amazon Mechanical Turk to provide raw speech data is much less common. (Snow et al., 2008) demonstrated that many natural language processing tasks fall in the category of hard-for-computers but easy-for-humans, and thus become perfect tasks for Turkers to undertake. (Kittur et al., 2008) investigated the utility of AMT as a tool for the task of assessing the quality of Wikipedia documents. They found that the inclusion of explicitly verifiable tasks was crucial to avoid a tendency to game the system. (Kaisser and Lowe, 2008) were able to successfully utilize AMT to generate a corpus of question-answer pairs to be used in the evaluation of TREC’s QA track. The authors claimed that AMT provided a “large, inexpensive, motivated, and immediately available” pool of subjects. (Novotney and Callison-Burch, 2010) were among the first to exploit Turkers for speech data transcription.
(Marge et al., 2010) demonstrated how ROVER can be used as a technique to achieve gold-standard transcription from Turkers by intelligently but automatically combining redundant contributions. This was shown to still be much more economic than hiring expert transcribers. Data collection has always been a major issue for researchers developing conversational spoken dialogue systems. In the early days, very costly Wizard of Oz systems involved scheduling subjects to come to the lab and participate with a simulated spoken dialogue system. These efforts were obviously prohibitively expensive, and limited collected data to at most dozens of dialogues. In the early 1990’s, researchers from MIT, CMU, BBN, SRI, and AT&T collaborated to collect approximately 25,000 utterances in the travel domain (Hirschman et al., 1993). In 2000 and 2001, collaborative work was revived with the additional participation of IBM, Lucent, Bell Labs, and the University of Colorado at Boulder as the Communicator project. This effort led to the collection of just over 100,000 utterances spoken to automatic or semiautomatic flight-reservations systems (Walker et al., 2001). The collaborative data collection effort was overseen by NIST and involved considerable effort in planning the logistics. These multi-year data collection efforts yielded seven corpora currently sold by the Linguistic Data Consortium for hundreds of dollars apiece. The SLS group at MIT has been developing spoken dialogue systems since the late 1980’s. We have devoted considerable resources to data collection as a crucial part of the development process. Within the last few years, we have become aware of the potential for a paradigm shift towards a model for data collection that does not require subjects to come into the laboratory. We piloted this idea first through soliciting subjects via email and rewarding them substantially with Amazon gift certificates. 
A successful Web-based data collection effort was conducted by having subjects solve scenarios within our CityBrowser multimodal spoken dialogue system (Gruenstein and Seneff, 2007). In the next two sections, we will first describe our efforts to collect read speech for address information through AMT, followed by a discussion of our experiments inviting Turkers to solve travel scenarios within Flight Browser.

3. Read Speech

This section explores the use of Amazon Mechanical Turk to collect read speech containing spoken addresses.

3.1. Collection Procedure

Units of work on Amazon Mechanical Turk are called Human Intelligence Tasks, or HITs. Requesters build tasks and deploy them to AMT using a web interface, command-line tools, or another of the many APIs made available to the public. Each task is assigned a price, which an individual worker will receive as payment if the requester accepts his or her work. Requesters reserve the right to deny payment for work that is unsatisfactory.

The addresses in our reading task are taken from the TIGER 2000 database provided by the U.S. Census Bureau. Each address is a triple: (road, city, state). There are over six million such triples in the TIGER database. To ensure coverage of the 273,305 unique words contained in these addresses, we chose a single address to correspond to each word. 100,000 such triples formed our pilot experiment; AMT workers were paid one U.S. cent to read each prompt. Figure 1 shows an example HIT (see footnote 1).

After the worker has recorded an address, they are required to listen to a playback of that utterance before moving on, to help mitigate problems with microphones or the acoustic environment. Still, since we are employing anonymous, non-expert workers, there is little incentive to produce high quality utterances, and a worker may even try to game the system. We propose two distinct ways to validate worker data. The first is to have humans validate the data manually.
Given the success of previous AMT transcription tasks, we can simply pay workers to listen to the cloud-collected speech and determine whether the expected words were indeed spoken. The second approach, which we explore in this section, is to integrate a speech recognizer into the data-collection process itself. Since the VoIP interface we employ is used by our dialogue systems, we have the ability to incorporate the recognizer in real time. Thus, we can block workers who do not satisfy our expectations immediately. For the pilot experiment, however, we decided not to block any workers. Running the recognizer in a second pass allows us to examine the raw data collected through AMT and experiment with different methods of blocking unsuitable work, the best of which would be deployed in future database collection efforts.

Footnote 1: Five utterances were bundled into each task for optimal pricing, given Amazon’s minimum commission of $0.005.

[Figure 1: A sample Human Intelligence Task (HIT) for collecting spoken addresses.]

[Figure 2: Data collection rate as a function of the worker’s local time of day. Axes: Worker’s Local Hour of Day vs. Average Collection Rate (# Utterances / Hour).]

[Figure 3: Individual workers plotted according to the recognizer estimated quality of their work, and the number of utterances they contributed to our corpus. Axes: Number of Utterances (log scale) vs. Recognizer-Estimated Quality. Legend: native and non-native speakers, each male, female, or both; no speech.]

[Figure 4: By filtering out users whose estimated quality does not meet a certain threshold, we can simulate the effect of using the recognizer to automatically block workers. Axes: Recognizer-Estimated Quality (q) vs. Fraction of Initial Data Set. Curves: Data with Quality > q; Data with No Anomalies and Quality > q.]

3.2. Corpus Overview

Our reading tasks were posted to Amazon Mechanical Turk on a Wednesday afternoon.
Within 77 hours, 298 workers had collectively completed reading all 100,000 prompts, yielding a total of 103 hours of audio. Figure 2 depicts the average number of utterances collected per hour plotted according to time of day. Workers tended to talk with our system during their afternoon; however, the varying time zones smooth out the collection rate with respect to the load on our servers.

The majority of our data, 68.6%, was collected from workers within the United States. India, the second largest contributor to our corpus, represented 19.6% of our data. While some non-native speakers produced high quality utterances, others had nearly unintelligible accents. This, as well as the fact that the acoustic environment varied greatly from speaker to speaker, makes the MIT Address corpus particularly challenging for speech recognition.

To determine the properties of our corpus without listening to all 103 hours of speech, two researchers independently sampled and annotated 10 utterances from each worker. Speakers were marked as male or female and native or non-native. Anomalies in each utterance, such as unintelligible accents, mispronounced words, cut-off speech, and background noise, were marked as present or absent. We then extrapolate statistics for the overall corpus based on the number of utterances contributed by a given worker. From this, we have estimated that 74% of our data is cleanly read speech.

This result raises the question of how to effectively manage the quality of speech collected from the cloud. Here we explore an automatic method which incorporates our speech recognizer into the validation process. In particular, we run the recognizer that we have built for the address domain over each utterance collected. We then assign a quality estimate, q, to each worker by computing the fraction of recognition hypotheses that contain the U.S. state expected given the prompt.
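The quality estimate described above can be illustrated with a minimal sketch. It is not the code used in this work: the record layout, the function names, the toy data, and the simple word-level matching of the state name are all assumptions made for illustration. The sketch computes q per worker and then the fraction of utterances that would remain if workers below a threshold were blocked.

```python
from collections import defaultdict


def contains_phrase(hypothesis, phrase):
    """Return True if the words of `phrase` occur contiguously in `hypothesis`."""
    hyp = hypothesis.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(hyp[i:i + n] == target for i in range(len(hyp) - n + 1))


def worker_quality(records):
    """Map each worker to q, the fraction of hypotheses containing the expected state.

    `records` is a list of (worker_id, expected_state, hypothesis) tuples
    (a hypothetical layout chosen for this sketch).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for worker_id, expected_state, hypothesis in records:
        totals[worker_id] += 1
        if contains_phrase(hypothesis, expected_state):
            hits[worker_id] += 1
    return {worker: hits[worker] / totals[worker] for worker in totals}


def retained_fraction(records, quality, threshold):
    """Fraction of utterances kept if workers with q below `threshold` are blocked."""
    kept = sum(1 for worker_id, _, _ in records if quality[worker_id] >= threshold)
    return kept / len(records)


# Hypothetical toy data: (worker_id, expected_state, recognizer hypothesis).
records = [
    ("w1", "massachusetts", "vassar street cambridge massachusetts"),
    ("w1", "ohio", "main street columbus ohio"),
    ("w2", "texas", ""),
    ("w2", "new york", "broadway new york new york"),
]
quality = worker_quality(records)
print(quality)
print(retained_fraction(records, quality, threshold=0.95))
```

Matching on whole words keeps the check cheap, but it is only approximate: a hypothesis containing "west virginia" would also count as containing "virginia".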
Figure 3 shows the recognizer-estimated quality of each worker, plotted against the number of utterances that worker contributed. Notice that a single worker may actually be two or more different people using the same Amazon account.

AMT provides requesters with the ability to block workers who do not perform adequate work. Using our automatic method of quality estimation, we simulate the effects of blocking users according to a quality threshold q. It is clear from Figure 4 that, while the data collection rate might have slowed, requiring a high q effectively filters out workers who contribute anomalous utterances. Figure 5 depicts how the corpus properties change when we set q = .95. While not unexpected, it is nice to see that egregiously irregular utterances are effectively filtered out.

[Figure 5: Breakdown of anomalies present in the corpus as a whole and the sub-corpus where workers have a high quality estimate, q >= .95. The recognizer-filtered sub-corpus still retains 65% of the original speech data. Though not explicitly shown here, we found that non-native speakers were still able to contribute to this sub-corpus: 5% of the filtered data with no anomalies came from non-native speakers. Panels: No Quality Filter; Quality Threshold (q = .95). Categories: no anomalies; multiple anomalies, cut-off speech, silence; very thick accent; breathing into close-talking microphone; background noise.]

4. Multimodal Dialogue Interactions

Our experience with the address corpus inspired us to deploy a fully functional multimodal spoken dialogue system to AMT. This section describes the process of designing tasks for AMT for a spoken dialogue system.

4.1. System Design

Flight Browser was derived from the telephone-based Mercury system, originally developed under the DARPA Communicator project (Seneff, 2002). Mercury’s design was based on a mixed-initiative model for dialogue interaction.
The system describes verbally the set of database tuples returned in a conversational mode. It prompts for relevant missing information at each point in the dialogue, but there are no constraints on what the user can say next. Thus, the full space of the understanding system is available at all times.

Using the WAMI Toolkit, we adapted Mercury to a multimodal web-interface we call Flight Browser. The dialogue interaction was modified, mainly by reducing the system’s verbosity, to reflect the newly available visual itinerary and flight list display. A real database, provided by ITA, is utilized. About 600 major cities are supported worldwide, with a bias towards U.S. cities. The interface was designed to fit the size constraints of a mobile phone, and multimodal support, such as clicking to sort or book flights, was added. Figure 6 shows Flight Browser in a WAMI browser built specially for the iPhone.

[Figure 6: Flight Browser interface loaded in the iPhone’s WAMI browser. The same interface can be accessed from a desktop from any modern web browser.]

4.2. Scenario Creation

When designing a user study, many spoken dialogue system researchers struggle with the question of how to elicit interesting data from users without biasing the language that they use to produce it. Some have tried to present scenarios in tabular form, while others prefer to introduce extra language, hoping that the user will only pick up on the important details of a scenario rather than the language in which it is framed. Continuing the theme of crowdsourcing research tasks, we take an alternative approach.

To generate scenarios, we created an Amazon Mechanical Turk task which asked workers what they would expect from a flight reservation system. They were explicitly told that we were trying to build a conversational system that could handle certain queries about flights, and we provided them with a few example scenarios that our system can handle.
Their job was then to construct a set of new scenarios, each starting with the word “You...” and continuing to describe “your” desired itinerary. We paid $0.03 per scenario, and within a few hours 72 distinct workers had given us 273 scenarios, examples of which are shown below.

1. You need to find the cheapest flight from Maryland to Tampa, Florida. Find a cheap flight out of your choice of Philadelphia, Dulles or Baltimore airports.
2. You won a prize to visit Disneyland but have to provide your own airfare. You are going the week of Valentine’s Day and you need 2 tickets from Seattle. You only have $500 to spend on tickets.
3. Destination: London, England Departs: January 15th, 2010 anytime in the morning Returns: March 1st, 2010 anytime after 3:00pm Price: Anything under $1500 To make things interesting I want as many layovers as possible!
4. You would like to take a vacation in Puerto Rico for two weeks. The departure and arrival dates must be on a Saturday.
5. You are a cartoonish crime boss in New York City, but Batman has caught on to you and you need to skip town for a while. You decide to head for Memphis, a city not normally known for costumed villainy. Set up a one-way flight with no layovers first thing tomorrow morning; cost is no object.

[Figure 7: Sample HIT and collected interaction. (a) The screenshot above depicts a sample HIT as shown to an AMT-worker. To the right, a sample dialogue collected using this HIT is shown. Parts of the system responses have been omitted for brevity. Also not shown are the updates to the visual display which show the user query results. (b) The sample dialogue:]

S: [...] How can I help you?
U: i need a flight from kansas city to providence rhode island rhode island
S: [...] What date will you be travelling?
U: monday morning
S: [...] One moment please.
S: [...] There are at least 27 flights departing between 5:30 a.m. and 12:39 a.m.
U: i will take the five thirty
S: [...] What date will you be returning on?
U: [noise] friday afternoon
S: [...] One moment please.
S: [...] There are at least 20 flights departing between 12:23 p.m. and 6:45 p.m.
U: i will take the twelve twenty three
S: Can you provide a departure or arrival time? There are at least 6 Delta flights from Providence to Kansas City on Friday February fifth.
U: arrive at five oh four
S: [...] How about a flight arriving at 5:04 p.m. connecting in Detroit? Would that work? [...]
U: yes

Not all of these 273 scenarios were suitable for a user study. As shown in the examples, some workers did not fully follow the directions. Other crowd-sourced scenarios had dates that were in the past by the time we deployed our system. For the most part, however, the scenarios generated were far more creative and varied than anything we could have come up with ourselves in such a short amount of time. Although it was clear that some tasks would cause our system trouble, we did not explicitly exclude such scenarios from our study. For example, our system does not have Disneyland in its vocabulary, let alone a mapping from the landmark to the nearest airport. Ultimately, we chose 100 scenarios to form the basis of the data collection procedure described in the next section.

4.3. Data Collection

Although the interface shown in Figure 6 is optimized for a mobile device, the WAMI Toolkit allows us to access it from modern desktop browsers as well. The following paragraphs describe how we were able to collect over 1,000 dialogue sessions averaging less than $0.20 apiece in under 10 days of deployment on Amazon Mechanical Turk.

4.3.1. HIT Design

The design of a HIT is of paramount importance with respect to the quality of the data we collected using Amazon Mechanical Turk. Novice workers, unused to interacting with a spoken language interface, present a challenge to system development in general, and the AMT-workers are no exception.
Fortunately, AMT can be used as an opportunity to iteratively improve the interface, using worker interactions to guide design decisions.

To optimize the design of our system and the HIT, we deployed short-lived AMT tasks and followed them up with improvements based on the interactions collected. Since the entire interaction is logged on our servers, we also have the ability to replay each session from start to finish, and can watch and listen to the sequence of dialogue turns taking place in a browser. By replaying sessions from an early version of our interface, we discovered that many workers were not aware that they could click on a flight to view the details. This inspired the addition of the arrows on the left hand side, to indicate the potential for drop-down details.

Although initially we had hoped to minimize the instructions on screen, we found that, without guidance, a number of AMT-workers just read the scenario aloud. Even after providing them with a short example of something they could say, a few workers were still confused, so we added an explicit note instructing them to avoid repeating the scenario verbatim. After a few iterations of redeploying and retuning the dialogue and scenario user interfaces, we eventually converged on the HIT design shown in Figure 7.

In order to complete a HIT successfully, a worker was required to book at least one flight (although we did not check that it matched the scenario); otherwise they were asked to “give up.” Whether the task was completed or not, the worker had the option of providing written feedback about their experience on each scenario before submitting.

4.3.2. Extended Deployment

With the design stage complete, we decided to leave our HIT on AMT for an extended period of time to collect a large amount of data. Beginning on a Tuesday, we deployed Flight Browser to AMT and paid workers $0.20 for each scenario. We restricted the deployment to workers who had Amazon accounts in the United States.
Each worker was limited to submitting sessions corresponding to the 100 scenarios. By Friday at 7:30 am (i.e., after less than three days of elapsed time) we had collected 876 dialogues from 63 distinct users, totaling 9,372 audio files.

Curious about how price affected the rate of collection, we deployed the same task for $0.10 a little over a month later. This task was started on a Thursday and left running for 6 days. Though clearly there was less interest in the HIT, we were still able to collect 2,595 audio files over 237 dialogues from 43 distinct workers. It should be noted that we made no special effort to exclude workers from the earlier task from participating in the $0.10-HIT a month later.

[Figure 8: A breakdown of the data collection efforts by worker. For each price point, the workers are sorted in ascending order of the number of dialogues they contributed to the corpus. Numbers 1-5 identify the five workers who participated in both data collection efforts. Panels: 20-Cent Collection; 10-Cent Collection. Axes: AMT-Worker vs. # of dialogue sessions. Legend: Gave Up; Finished.]

                           $0.20 HIT   $0.10 HIT
# Sessions                 876         237
# Distinct Workers         63          43
# Utterances               8,232       2,419
Avg. # Utts. / Session     9.5         10.2
% Sessions Gave Up         14.7        17.3

Figure 9: Corpora Statistics

Figure 8 shows histograms for each price point of sessions collected from individual workers, as well as the number of tasks they marked “finished” and “give up”. As shown in the plots, five workers participated in the $0.10 task despite being paid more previously. In fact, three of the top four contributors to the second round of data collection were repeat visitors. It’s interesting to note that they were still willing to participate despite earning half as much.

In both deployments a non-trivial number of audio files were recognized as noise or silence.
This phenomenon has been observed previously when utterances come from more realistic sources (Ai et al., 2007). Listening to these in context, it became apparent that some users required time to familiarize themselves with the recording software. We decided to ignore the 1,316 files without meaningful recognition results, leaving 10,651 utterances for subsequent analysis. Figure 9 summarizes the corpus statistics from both deployments.

4.4. Data Transcription

To transcribe our newly collected data, we once again turn to the Amazon Mechanical Turk cloud service. Previous work has explored the use of AMT for transcription to generate high accuracy orthographies. We explore this area further, and show how seeding the transcription interface with recognizer hypotheses enables an automatic detection method for “bad” transcripts. Figure 10 depicts a flowchart of our transcription procedure.

[Figure 10: Flowchart for the transcription task.]

We deployed our entire corpus to AMT in a $0.05 HIT, which allows workers to listen to utterances and correct recognizer hypotheses. Each HIT contains a bundle of 10 utterances for transcription. Since the average number of words per HIT is around 45, the likelihood that none of them need to be corrected is quite low. This allows us to detect lazy workers by comparing the submitted transcripts with the original hypotheses. We found that 76% of our non-expert transcribers edited at least one word in over 90% of their HITs. We assumed that the remaining workers were producing unreliable transcripts, and therefore discarded their transcripts from further consideration.

There are a number of different filters which could be used to detect bad transcripts. In this work, we assume that a transcript needs to be edited if more than two workers have made changes, and then filter out transcripts which match the hypothesis.
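The two filters just described can be sketched as follows. This is an illustrative reading of the text rather than the code used in this work: the data layout, the function names, and the toy transcripts are assumptions. A worker is kept only if they edited at least one word in over 90% of their HITs, and for a single utterance the unedited transcripts are dropped once more than two workers have changed the hypothesis.

```python
from collections import defaultdict


def hit_was_edited(hypotheses, transcripts):
    """True if at least one transcript in the HIT differs from its seed hypothesis."""
    return any(h.split() != t.split() for h, t in zip(hypotheses, transcripts))


def reliable_workers(submissions, min_edited_fraction=0.9):
    """Return workers who edited something in more than `min_edited_fraction` of HITs.

    `submissions` is a list of (worker_id, hypotheses, transcripts) tuples,
    one per submitted HIT (a hypothetical layout chosen for this sketch).
    """
    edited = defaultdict(int)
    total = defaultdict(int)
    for worker_id, hypotheses, transcripts in submissions:
        total[worker_id] += 1
        if hit_was_edited(hypotheses, transcripts):
            edited[worker_id] += 1
    return {w for w in total if edited[w] / total[w] > min_edited_fraction}


def filter_unedited(hypothesis, transcripts, min_editors=2):
    """Drop transcripts equal to the hypothesis once more than `min_editors` changed it."""
    changed = [t for t in transcripts if t.split() != hypothesis.split()]
    if len(changed) > min_editors:
        return changed
    return list(transcripts)


# Hypothetical toy data: each HIT here holds a single utterance for brevity.
submissions = [
    ("a", ["i need a flight to boston"], ["i need a flight to austin"]),
    ("a", ["monday morning"], ["monday morning please"]),
    ("b", ["i need a flight to boston"], ["i need a flight to boston"]),
    ("b", ["monday morning"], ["sunday morning"]),
]
print(reliable_workers(submissions))
print(filter_unedited("fly to austin",
                      ["fly to boston", "fly to boston", "fly to austin", "fly to boston"]))
```

Comparing word sequences rather than raw strings means that differences in spacing alone do not count as an edit.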
The question of how to obtain accurate transcripts from non-expert workers has been addressed by (Marge et al., 2010), who employ the ROVER voting scheme to combine transcripts. Indeed, a number of transcript combination techniques could be explored. In this work, we take a simple majority vote, unless there is no agreement between five unfiltered transcripts, at which point we begin to accept a plurality. We found that 95.2% of our data only needed 3 good transcriptions to pass a simple majority vote. The table below indicates the amount of data we were able to transcribe for a given number of good transcripts.

# Good Transcripts Required (G)   2      3      5      6      7+
% Corpus Transcribed (T)          84.4   95.2   96.3   98.4   99.6

Fifty-three audio files did not have an accepted transcript even after collecting 15 transcripts. We listened to this audio and discovered anomalies such as foreign language speech, singing, or garbled noise that caused AMT workers to start guessing at the transcription.

[Figure 11: These charts indicate whether the AMT transcripts were consistent, semi-consistent, or inconsistent with the expert transcribers. The semi-consistent case arises when the experts disagreed, and the AMT transcript matched one of their transcripts. Surviving panel title: Expert-edited Data. Categories: Consistent; Semi-consistent; Inconsistent; No agreement.]

In order to assess the quality of our AMT-transcribed utterances, we had two expert-transcribers perform the same HIT for 1,000 utterances randomly selected from the corpus. We compared the orthographies of our two experts and found sentence-level exact agreement to be 93.1%. The AMT-transcripts had 93.2% agreement with the first expert and 93.1% agreement with the second, indicating that our AMT-derived transcripts were of very high quality. Figure 11 shows a detailed breakdown of agreement, depicting the consistency of the AMT transcripts with those of our experts.
For example, of all the data edited by at least one expert, only 6% of the AMT-transcripts were inconsistent with an expert-agreed-upon transcript. Where the experts disagree, AMT-labels often match one of the two, indicating that the inconsistencies in AMT transcripts are often reasonable. For example, “I want a flight to” and “I want to fly to” was a common disagreement.

Lastly, we also asked workers to annotate each utterance with the speaker’s gender. Again, taking a simple vote allows us to determine that a majority of our corpus (69.6%) consists of male speech.

4.5. Data Analysis

Using the AMT-transcribed utterances, we can deduce that the WER of our system was 18.1%. We note, however, that, due to the monetary incentives inherent in Amazon Mechanical Turk, this error rate may be artificially low, since workers who found the task frustrating were free to abandon the job. Figure 12 shows the WER for each worker plotted against the number of sessions they contributed. It’s clear that workers with high error rates rarely contributed more than a few sessions. To provide a fairer estimate of system performance across users, we average WER for each speaker in our corpus and revise our estimate of WER to 24.4%.

[Figure 12: The number of sessions contributed by each worker is plotted against the WER experienced by that worker. Axes: Word Error Rate vs. # Utterances Spoken. Legend: 10-cent worker; 20-cent worker.]

Upon replaying a number of sessions, we were quite happy with the types of interactions collected. A number of sessions exposed weaknesses in our system that we intend to correct in future development. The workers were given the opportunity to provide feedback, and many gave us valuable comments, compliments and criticisms, a few of which are shown below.

1. There was no real way to go back when it misunderstood the day I wanted to return. It should have a go back function or command.
2. Fine with cities but really needs to get dates down better.
3. The system just cannot understand me saying “Tulsa”.
4. Was very happy to be able to say two weeks later and not have to give a return date. System was not able to search for lowest fare during a two week window.
5. I think the HIT would be better if we had a more specific date to use instead of making them up. Thank you, your HITs are very interesting.

To analyze the properties of the corpus objectively, we decided to compare the recognition hypotheses contained in the worker interactions with those of our internal database. From March of 2009 to March of 2010, Flight Browser has been under active development by 5 members of our lab. Every utterance spoken to Flight Browser in this time has been logged in a database. User studies have been conducted in the laboratory and demos have been presented to interested parties. The largest segment of this audio, however, comes from developers, who speak to the system for development and debugging. In total, 9,023 utterances were recorded, and these comprise our internal database. We summarize a number of statistics common to the internal and AMT-collected corpora in the table below.

                      Internal   AMT-Collected
# Utts.               9,023      8,232
# Hyp Tokens          49,917     36,390
# Unique Words        740        758
# Unique Bigrams      4,157      4,171
# Unique Trigrams     6,870      7,165
Avg. Utt. Length      4.8        4.4

While the average recognizer hypothesis is longer and the number of words is greater in our internal data, the overall language in the cloud-collected corpus appears to be more complex, as illuminated by distinct n-gram counts. This may in part be because system experts know how to speak with the system, and can communicate in longer phrases; however, they perhaps do not formulate new queries as creatively, especially while debugging.

5. Discussion and Future Work

In this paper, we have demonstrated the utility of the Amazon Mechanical Turk cloud service in a number of spoken dialogue system development tasks.
We have explored the practicality of deploying a simple read-aloud task to AMT, and extended this approach to spontaneous speech solicitation within a multimodal dialogue system. We have shown that it is possible to collect large amounts of in-domain speech data very quickly and relatively cheaply.

Central to this work has been designing tasks for non-expert workers that are easily verifiable. We have shown how the recognizer can be used as a tool to loosely constrain both transcription and collection tasks, allowing us to filter out low quality data. When taken to the limit, much of the drudge work associated with spoken-dialogue system research can be easily outsourced to the cloud.

While the collection, transcription, and scenario generation capabilities we have explored here are powerful, we have not yet approached the question of how best to supplement dialogue system corpora with annotations, such as user satisfaction and dialogue acts (Hastie et al., 2002). Perhaps this too can be incorporated into a Human Intelligence Task. As researchers begin to create large-scale well-annotated corpora using cloud services such as Amazon Mechanical Turk, our hope is that research on data-driven approaches to dialogue system components, e.g. (Williams and Young, 2007), will become feasible on realistic data from arbitrary domains.

Another interesting line of research might be to devise a framework for spoken dialogue system evaluation using this service. One possibility is to construct a set of guidelines that multiple systems in a common domain would be required to follow in order to put them through a rigorous evaluation, much like the former NIST evaluations. If a cloud-based evaluation framework could be devised, however, the management overhead of such an evaluation would be greatly reduced, and a potentially unlimited number of institutions could participate.

6. Acknowledgments

This research is funded in part by the T-Party project, a joint research program between MIT and Quanta Computer Inc., Taiwan.

7. References

Hua Ai, Antoine Raux, Dan Bohus, Maxine Eskenazi, and Diane Litman. 2007. Comparing spoken dialog corpora collected with recruited subjects versus real users. In Proc. of SIGdial.
Giuseppe Di Fabbrizio, Thomas Okken, and Jay G. Wilpon. 2009. A speech mashup framework for multimodal mobile services. In The Eleventh International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), November.
Alexander Gruenstein and Stephanie Seneff. 2007. Releasing a multimodal dialogue system into the wild: User support mechanisms. In Proc. of the 8th SIGdial Workshop on Discourse and Dialogue, pages 111–119.
Alexander Gruenstein, Stephanie Seneff, and Chao Wang. 2006. Scalable and portable web-based multimodal dialogue interaction with geographical databases. In Proc. of INTERSPEECH, September.
Alexander Gruenstein, Ian McGraw, and Ibrahim Badr. 2008. The WAMI toolkit for developing, deploying, and evaluating web-accessible multimodal interfaces. In Proc. of ICMI, October.
Alexander Gruenstein, Ian McGraw, and Andrew Sutherland. 2009. A self-transcribing speech corpus: Collecting continuous speech with an online educational game. In Proc. of the Speech and Language Technology in Education (SLaTE) Workshop, September.
Helen Wright Hastie, Rashmi Prasad, and Marilyn Walker. 2002. Automatic evaluation: Using a DATE dialogue act tagger for user satisfaction and task completion prediction. In Proc. of the Language Resources and Evaluation (LREC), May.
L. Hirschman, M. Bates, D. Dahl, W. Fisher, J. Garofolo, D. Pallett, K. Hunicke-Smith, P. Price, A. Rudnicky, and E. Tzoukermann. 1993. Multi-site data collection and evaluation in spoken language understanding. In Proc. of the workshop on Human Language Technology. Association for Computational Linguistics.
Michael Kaisser and John Lowe. 2008. Creating a research collection of question answer sentence pairs with Amazon’s Mechanical Turk. In Proc. of the Language Resources and Evaluation (LREC), May.
Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pages 453–456, New York, NY, USA. ACM.
Matthew Marge, Satanjeev Banerjee, and Alexander Rudnicky. 2010. Using the Amazon Mechanical Turk for transcription of spoken language. In Proc. of ICASSP, March.
Ian McGraw, Alexander Gruenstein, and Andrew Sutherland. 2009. A self-labeling speech corpus: Collecting spoken words with an online educational game. In Proc. of INTERSPEECH, September.
Scott Novotney and Chris Callison-Burch. 2010. Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proceedings of ICASSP, forthcoming.
Antoine Raux, Dan Bohus, Brian Langner, Alan Black, and Maxine Eskenazi. 2006. Doing research on a deployed spoken dialogue system: One year of Let’s Go! experience. In Proceedings of INTERSPEECH-ICSLP, September.
Stephanie Seneff. 2002. Response planning and generation in the MERCURY flight reservation system. Computer Speech and Language, 16:283–312.
Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast — but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP, October.
M. Walker, J. Aberdeen, J. Boland, E. Bratt, J. Garofolo, L. Hirschman, A. Le, S. Lee, S. Narayanan, K. Papineni, B. Pellom, J. Polifroni, A. Potamianos, P. Prabhu, A. Rudnicky, G. Sanders, S. Seneff, D. Stallard, and S. Whittaker. 2001. DARPA Communicator dialog travel planning systems: The June 2000 data collection. In Proc. of EUROSPEECH.
J.D. Williams and S. Young. 2007. Scaling POMDPs for spoken dialog management. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2116–2129, September.
Brandon Yoshimoto, Ian McGraw, and Stephanie Seneff. 2009. Rainbow rummy: A web-based game for vocabulary acquisition using computer directed speech. In Proc. of the Speech and Language Technology in Education (SLaTE) Workshop.
Victor Zue, Stephanie Seneff, James Glass, Joseph Polifroni, Christine Pao, Timothy J. Hazen, and Lee Hetherington. 2000. JUPITER: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8(1), January.