The document discusses applications of computational linguistics including machine translation, document retrieval and clustering, knowledge extraction and summarization, and sentiment analysis. Computational linguistics aims to develop computer models for natural language generation and understanding.
Computational linguistics is the scientific study of human language from a computational point of view. Computational linguists provide computer models of different types of linguistic phenomena. Computer-oriented studies have evolved into a hybrid type called computational linguistics. As an interdisciplinary field, computational linguistics has a history of nearly half a century. The ultimate goal of computational linguistics is to explain the basic techniques used to create computer models for the generation and understanding of natural language.
Analysis of application of computational linguistics
10. Applications
As indicated at the outset, applications of computational linguistics techniques range from those minimally dependent on linguistic structure and meaning, such as document retrieval and clustering, to those that attain some level of competence in comprehending and using language, such as dialogue agents that provide help and information in limited domains like personal scheduling, flight booking, or help desks, and intelligent tutoring systems. In the following we enumerate some of these applications. In several cases (especially machine translation) we have already provided considerable detail, but the intent here is to provide a bird's eye view of the state of the art, rather than technical elucidations.
With the advent of ubiquitous computing, it has become increasingly difficult to provide a systematic categorization of NLP applications: keyword-based retrieval of documents (or snippets) and database access are integrated into some dialogue agents and many voice-based services; animated dialogue agents interact with users both in tutoring systems and games; chatbot techniques are incorporated into various useful or entertaining agents as backends; and language-enabled robots, though distinctive in combining vision and action with language, are gradually being equipped with web access, QA abilities, tutorial functions, and no doubt eventually with collaborative problem-solving abilities. Thus the application categories in the subsections that follow, rather than being mutually exclusive, are ever more intertwined in practice.
10.1 Machine translation (again)
One of the oldest MT systems is SYSTRAN, which was developed as a rule-based system beginning in the 1960s, and has been extensively used by US and European government agencies, and also in Yahoo! Babel Fish and (until 2007) in Google Translate. In 2010, it was hybridized with statistical MT techniques. As mentioned, Google Translate currently uses phrase-based MT, with English serving as an interlingua for the majority of language pairs. Microsoft's Bing Translator employs dependency structure analysis together with statistical MT. Other very comprehensive translation systems include Asia Online and WorldLingo. Many systems for small language groups exist as well, for instance for translating between Punjabi and Hindi (the Direct MT system), or between a few European languages (e.g., OpenLogos, IdiomaX, and GramTrans). Translations remain error-prone, but their quality is usually sufficient for readers to grasp the general drift of the source contents. No more than that may be required in many cases, such as international web browsing (an application scarcely anticipated in decades of MT research). Also, MT applications on hand-held devices, designed to aid international travellers, can be sufficiently accurate for limited purposes such as asking directions or emergency help, interacting with transportation personnel, or making purchases or reservations. When high-quality translations are required, automatic methods can be used as an aid to human translators, but subtle issues may still absorb a large portion of a translator's time.
10.2 Document retrieval and clustering applications
Information retrieval has long been a central theme of information science, covering retrieval of both structured data, such as are found in relational databases, and unstructured text documents (e.g., Salton 1989). Retrieval criteria for the two types of data are not unrelated, since both structured and unstructured data often require content-directed retrieval. For example, while users of an employee database may wish at times to retrieve employee records by the unique name or ID of employees, at other times they may wish to retrieve all employees in a certain employment category, perhaps with further restrictions such as falling into a certain salary bracket. This is accomplished with the use of “inverted files” that essentially index entities under their attributes and values rather than their identifiers. In the same way, text documents might be retrieved via some unique label, or they might instead be retrieved in accord with their relevance to a certain query or topic header. The simplest notion of relevance is that the documents should contain the terms (words or short phrases) of the query. However, terms that are distinctive for a document should be given more weight. Therefore a standard measure of relevance, given a particular query term, is the tf-idf (term frequency–inverse document frequency) for the term, which increases (e.g., logarithmically) with the frequency of occurrences of the term in the document but is discounted to the extent that the term occurs frequently in the set of documents as a whole. Summing the tf-idf values of the query terms yields a simple measure of document relevance.
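As a rough illustration of the tf-idf relevance measure just described, the following Python sketch scores a handful of documents against a query. The whitespace tokenizer, the particular logarithmic weighting (1 + log tf, log N/df), the function names, and the sample documents are all assumptions made for the example; many other weighting variants are in use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Return one relevance score per document: the sum of tf-idf over query terms."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent from this document
            tf = 1.0 + math.log(counts[term])   # grows logarithmically with frequency
            idf = math.log(n_docs / df[term])   # zero when the term is in every document
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical mini-collection for illustration only.
docs = [
    "the retina contains rods and cones",
    "the curtain rods are in the store",
    "the eye has a retina",
]
scores = tf_idf_scores("rods cones", docs)
```

In this toy collection the first document scores highest because it contains both query terms, one of which is rare; a term such as "the", which occurs in every document, contributes nothing.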
Shortcomings of this method are first, that it underrates term co-occurrences if each term occurs commonly in the document collection (for instance, for the query “rods and cones of the eye”, co-occurrences of rods, cones, and eye may well characterize relevant documents, even though all three terms occur quite commonly in non-physiological contexts), and second, that relevant documents might have few occurrences of the query terms, while containing many semantically related terms. Some of the vector methods mentioned in connection with document clustering can be used to alleviate these shortcomings. We may reduce the dimensionality of the term-based vector space using LSA, obtaining a much smaller “concept space” in which many terms that tend to co-occur in documents will have been merged into the same dimension (concept). Thus sharing of concepts, rather than sharing of specific terms, becomes the basis for measuring relevance.
Document clustering is useful when large numbers of documents need to be organized for easy access to topically related items, for instance in collections of patent descriptions, medical histories or abstracts, legal precedents, or captioned images, often in hierarchical fashion. Clustering is also useful in exploratory data analysis (e.g., in exploring token occurrences in an unknown language), and indirectly supports various NLP applications because of its utility in improving language models, for instance in providing word clusters to be used for backing off from specific words in cases of data sparsity. Clustering is widely used in other areas, such as biological and medical research and epidemiology, market research and grouping and recommendation of shopping items, educational research, social network analysis, geological analysis, and many others.
Document retrieval and clustering often serve as preliminary steps in information extraction (IE) or text mining, two overlapping areas concerned with extracting useful knowledge from documents, such as the main features of named entities (category, roles in relation to other entities, location, dates, etc.) or of particular types of events, or inferring rule-like correlations between relational terms (e.g., that purchasing of one type of product correlates with purchasing another). We will not attempt to survey IE/text mining applications comprehensively, but the next two subsections, on summarization and sentiment analysis, are subareas of particular interest here because of their emphasis on the semantic content of texts.
10.3 Knowledge extraction and summarization
Extracting knowledge or producing summaries from unstructured text are ever more important applications, in view of the deluge of documents issuing forth from news media, organizations of every sort, and individuals. This unceasing stream of information makes it difficult to gain an overview of the items relevant to some particular purpose, such as basic data about individuals, organizations and consumer products, or the particulars of accidents, earthquakes, crimes, company take-overs, product maintenance and repair activities, medical research results, and so on. One commonly used method in both knowledge extraction and certain types of “rote” summarization relies on the use of extraction patterns; these are designed to match the kinds of conventional linguistic patterns typically used by authors to express the information of interest.
For example, text corpora or newswire might be mined for information about companies, by keying in on known company names and terms such as “Corp.”, “.com”, “headquartered at”, and “annual revenue of”, as well as parts of speech and dependency relations, and matching regular-expression patterns against local text segments containing key phrases or positioned close to them. As another example, summarization of earthquake reports might extract expected information such as the epicenter of the quake, its magnitude on the Richter scale, the time and duration of the event, affected population centers, extent of death tolls, injuries, and property damage, consequences such as fires and tsunamis, etc. Extraction patterns can usually be thought of as targeting particular attributes in predetermined attribute-value frames (e.g., a frame for company information or a frame for facts about an earthquake), and the filled-in frames may themselves be regarded as summaries, or may be used to generate natural-language summaries. Early systems of this type were FRUMP (DeJong 1982) and JASPER (Andersen et al. 1992). Among the hundreds of more modern extraction systems, a particularly successful one in competitions has been SRI's “Fastus” (Hobbs et al. 1997). Note that whether a pattern-based system is viewed as a knowledge extraction system or summarization system depends on the text it is applied to. If all the information of interest is bundled together in a single, extended text segment (as in the case of earthquake reports), then the knowledge extracted can be viewed as a summary of the segment. If instead the information is selectively extracted from miscellaneous sentences scattered through large text collections, with most of the material being ignored as irrelevant to the purposes of extraction, then we would view the activity of the system as information extraction rather than summarization. 
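The idea of extraction patterns filling an attribute-value frame can be sketched in a few lines of Python. This is only a toy illustration, not how FRUMP, JASPER, or Fastus work: the three regular expressions, the frame attributes, and the sample sentence are invented for the example, and real systems also draw on parts of speech and dependency relations.

```python
import re

# Hypothetical extraction patterns, one per attribute of a "company" frame.
PATTERNS = {
    "company": re.compile(
        r"\b([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) (?:Corp\.|Inc\.)"),
    "headquarters": re.compile(
        r"headquartered (?:at|in) ([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"),
    "revenue": re.compile(
        r"annual revenue of (\$[\d.,]+ (?:million|billion))"),
}


def fill_frame(text):
    """Match each pattern against the text; unmatched attributes stay None."""
    frame = {}
    for attribute, pattern in PATTERNS.items():
        match = pattern.search(text)
        frame[attribute] = match.group(1) if match else None
    return frame


# Invented sample text for illustration.
text = ("Acme Widgets Corp., headquartered in Springfield, reported "
        "annual revenue of $12.5 million last year.")
frame = fill_frame(text)
```

The filled-in frame can be read as a crude summary of the sentence, or passed to a generator that produces a natural-language one.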
When a document to be summarized cannot be assumed to fall into some predictable category, with the content structured and expressed in a stereotyped way, summarization is usually performed by selecting and combining “central sentences” from the document. A sentence is central to the extent that many other sentences in the document are similar to it, in terms of shared word content or some more sophisticated similarity measure such as one based on the tf-idf metric for terms, or a cosine metric in a dimensionality-reduced vector space (thus it is as if we were treating individual sentences as documents, and finding a few sentences whose “relevance” to the remaining sentences is maximal). However, simply returning a sequence of central sentences will not in general yield an adequate summary. For example, such sentences may contain unresolved pronouns or other referring expressions, whose referents may need to be sought in non-central sentences. Also, central “sentences” may actually be clauses embedded in lengthier sentences that contain unimportant supplementary information. Heuristic techniques need to be applied to identify and excise the extra material, and extracted clauses need to be fluently and coherently combined. In other cases, complex descriptions should be more simply and abstractly paraphrased. For example, an appropriate condensation of a sentence such as “The tornado carried off the roof of a local farmhouse, and reduced its walls and contents to rubble” might be “The tornado destroyed a local farmhouse.” But while some of these issues are partially addressed in current systems, human-like summarization will require much deeper understanding than is currently attainable. Another difficulty in this area (even more so than in machine translation) is the evaluation of summaries. Even human judgments differ greatly, depending, for instance, on the sensitivity of the evaluator to grammatical flaws, versus inadequacies in content.
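The notion of a “central sentence” can be made concrete with a small Python sketch that ranks sentences by their summed cosine similarity to all other sentences. It uses raw word counts as an assumption; the tf-idf or dimensionality-reduced measures mentioned above would be natural refinements, since with raw counts frequent function words weigh heavily. The function names and sample sentences are invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Represent a sentence as a multiset of lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_centrality(sentences):
    """Order sentences from most to least central (ties keep document order)."""
    vectors = [bag_of_words(s) for s in sentences]
    scored = []
    for i, v in enumerate(vectors):
        total = sum(cosine(v, other)
                    for j, other in enumerate(vectors) if j != i)
        scored.append((total, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[i] for _, i in scored]


# Invented sample document for illustration.
sentences = [
    "The tornado destroyed a farmhouse.",
    "The tornado damaged a barn near the farmhouse.",
    "Residents heard the tornado approach the farmhouse.",
    "Officials will meet on Tuesday.",
]
ranked = rank_by_centrality(sentences)
```

The sentence about the meeting shares no words with the others and so ranks last. Nothing in this sketch addresses the pronoun resolution, clause excision, or paraphrasing problems discussed above.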
10.4 Sentiment analysis
Sentiment analysis refers to the detection of positive or negative attitudes (or more specific attitudes such as belief or contempt) on the part of authors of articles or blogs towards commercial products, films, organizations, persons, ideologies, etc. This has become a very active area of applied computational linguistics, because of its potential importance for product marketing and ranking, social network analysis, political and intelligence analysis, classification of personality types or disorders based on writing samples, and other areas. The techniques used are typically based on sentiment lexicons that classify the affective polarity of vocabulary items, and on supervised machine learning applied to texts from which word and phrasal features have been extracted and that have been hand-labeled as expressing positive or negative attitudes towards some theme. Instead of manual labeling, existing data can sometimes be used to provide a priori classification information. For example, average numerical ratings of consumer products or movies produced by bloggers may be used to learn to classify unrated materials belonging to the same or similar genres. In fact, affective lexical categories and contrast relations may be learnable from such data; for example, frequent occurrences of phrases such as “great movie” or “pretty good movie” or “terrible movie” in blogs concerning movies with high, medium, and low average ratings may well suggest that “great”, “pretty good”, and “terrible” belong to a contrast spectrum ranging from a very positive to a very negative polarity. Such terminological knowledge can in turn boost the coverage of generic sentiment lexicons.
However, sentiment analysis based on lexical and phrasal features has obvious limitations, such as obliviousness to sarcasm and irony (“This is the most subtle and sensitive movie since The Texas Chainsaw Massacre”), quotation of opinions contrasting with the author's (“According to the ads, Siri is the greatest app since iTunes, but in fact …”), and lack of understanding of entailments (“You'll be much better off buying a pair of woolen undies for the winter than purchasing this item”). Thus researchers are attempting to integrate knowledge-based and semantic analysis with superficial word- and phrase-based sentiment analysis.
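A minimal Python sketch of the lexicon-based approach described in this subsection, together with the sarcasm limitation, might look as follows. The lexicon entries and their weights are invented for illustration; real sentiment lexicons are far larger and are usually combined with supervised learning.

```python
import re

# Hypothetical toy sentiment lexicon: word -> polarity weight.
LEXICON = {
    "great": 2, "good": 1, "subtle": 1, "sensitive": 1,
    "terrible": -2, "boring": -1,
}


def polarity(text):
    """Sum the lexicon weights of the words in the text (unknown words count 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def classify(text):
    """Map the summed polarity to a coarse label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


literal = classify("A terrible, boring movie.")
# The sarcastic example from the text is scored on its surface vocabulary only.
sarcastic = classify("This is the most subtle and sensitive movie "
                     "since The Texas Chainsaw Massacre")
```

The literal review is labeled negative, while the sarcastic one comes out positive because the classifier sees only the words “subtle” and “sensitive”, which is exactly the kind of error the text attributes to purely lexical methods.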
10.5 Chatbots and companionable dialogue agents
Current chatbots are the descendants of Weizenbaum's ELIZA (see section 1.2), and are typically used (often with an animated “talking head” character) for entertainment, or to engage the interest of visitors to the websites of certain “dotcoms”. They may be equipped with large hand-crafted scripts (keyword-indexed input-response schemas) that enable them to answer simple inquiries about the company and its products, with some ability to respond to miscellaneous topics and to exchange greetings and pleasantries. A less benign application is the use of chatbots posing as visitors to social network sites, or interactive game sites, with the aim of soliciting private information from unwitting human participants, or recommending websites or products to them. As a result, many social networking sites have joined other bot-targeted sites in using CAPTCHAs to foil bot entry. Companionable dialogue agents (also called relational agents) have so far relied rather heavily on chatbot techniques, i.e., authored input patterns and corresponding outputs. But the goal is to transcend these techniques, creating agents (often with talking heads or other animated characters) with personality traits and capable of showing emotion and empathy; they should have semantic and episodic memory, learning about the user over the long term and providing services to the user. Those services might include, besides companionship and support: advice in some areas of life, health and fitness, schedule maintenance, reminders, question answering, tutoring (e.g., in languages), game playing, and internet services. Yorick Wilks has suggested that ideally such characters would resemble “Victorian companions”, with such characteristics as politeness, discretion, modesty, cheerfulness, and well-informedness (Wilks 2010). However, such goals are far from being achieved, as speech recognition, language understanding, reasoning and learning are not nearly far enough advanced.
As a noteworthy example of the state of the art, we might mention the HWYD (“How Was Your Day”) system of Pulman et al. (2010), which won a best demonstration prize at an autonomous agents conference. The natural language processing in this system is relatively sophisticated. Shallow syntactic and semantic processing is used to find instantiations of some 30 “event templates”, such as ones for “argument at work between X and Y,” or “meeting with X about Y”. The interpretation process includes reference and ellipsis resolution, relying on an information state representation maintained by the dialogue manager. Goals generated by the dialogue manager lead to responses via planning, which involves instantiation and sequencing of response paradigms. The authors report the system's ability to maintain consistent dialogues extending over 20 minutes. Systems of a rather different sort, aimed at clinically well-founded health counseling, have been under development as well. For example, the systems described in (Bickmore et al. 2011) rely on an extensive, carefully engineered formalization of clinically proven counseling strategies and knowledge, expressed within a description logic (OWL) and a goal-directed task description language. Such systems have proved to perform in a way comparable to human counselors. However, though dialogues are plan-driven, they ultimately consist of scripted system utterances paired with multiple-choice lists of responses offered to the client. Thus companionable systems remain very constrained in the dialogue themes they can handle, their understanding of language, and their ability to bring extensive general knowledge to a conversation, let alone to use such knowledge inferentially.
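To make the chatbot technique mentioned earlier in this subsection concrete, here is a minimal Python sketch of keyword-indexed input-response schemas in the spirit of ELIZA. The script entries, the default reply, and the function name are invented for the example; it is not a reconstruction of ELIZA or of any deployed system.

```python
import re

# Hypothetical script: (keyword pattern, response template) pairs, tried in order.
SCRIPT = [
    (re.compile(r"\b(?:hello|hi)\b", re.I),
     "Hello! How can I help you today?"),
    (re.compile(r"\bopening hours\b", re.I),
     "We are open from 9 to 5 on weekdays."),
    # The captured text is echoed back inside the response.
    (re.compile(r"\bi feel (.+?)[.!?]*$", re.I),
     "Why do you feel {0}?"),
]
DEFAULT = "Please tell me more."


def respond(utterance):
    """Return the response of the first schema whose keyword pattern matches."""
    for pattern, template in SCRIPT:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return DEFAULT  # fallback for unanticipated input
```

For instance, `respond("I feel tired.")` yields "Why do you feel tired?", while any input that matches no keyword falls through to the stock reply. The sketch has no memory, no model of the user, and no understanding, which is the gap the companionable agents discussed above aim to close.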
10.6 Virtual worlds, games, and interactive fiction
Text-based adventure (quest) games, such as Dungeons and Dragons, Hunt the Wumpus (in its original version), and Advent, began to be developed in the early and middle 1970s, and typically featured textual descriptions of the setting and challenges confronting the player, and allowed for simple command-line input from the player to select available actions (such as “open box”, “take sword” or “read note”). While the descriptions of the settings (often accompanied by pictures) could be quite elaborate, much as in adventure fiction, the input options available to the player were, and have largely remained, restricted to simple utterances of the sort that can be anticipated or collected in pre-release testing by the game programmers, and for which responses can be manually prepared. Certainly more flexible use of NL (“fend off the gremlin with the sword!”, “If I give you the gold, will you open the gate for me?”) would enliven the interaction between player and the game world and the characters in it. In the 1980s and 1990s text-based games declined in favor of games based primarily on graphics and animation, though an online interactive fiction community grew over the years that drove the evolution of effective interactive fiction development software. A highly touted program (in the year 2000) was Emily Short's ‘Galatea’, which enabled dialogue with an animated sculpture. However, this is still an elaborately scripted program, allowing only for inputs that can be heuristically mapped to one of various preprogrammed responses. Many games in this genre also make use of chatbot-like input-output response patterns in order to gain a measure of robustness for unanticipated user inputs. The most popular PC video games in the 1990s and beyond were Robyn and Rand Miller's Myst, a first-person adventure game, and Maxis Software's The Sims, a life-simulation game.
Myst, though relying on messages in books and journals, was largely nonverbal, and The Sims' chief developer, Will Wright, finessed the problem of natural language dialogue by having the inhabitants of SimCity babble in Simlish, a nonsense language incorporating elements of Ukrainian, French and Tagalog. Commercial adventure games and visual novels continue to rely on scripted dialogue trees, essentially branching alternative directions in which the dialogue can be expected to turn, with ELIZA-like technology supporting the alternatives. More sophisticated approaches to interaction between users and virtual characters are under development in various research laboratories, for example at the Center for Human Modeling and Simulation at the University of Pennsylvania, and the USC-affiliated Institute for Creative Technologies. While the dialogues in these scenarios are still based on carefully designed scripts, the interpretation of the user's spoken utterances exploits an array of well-founded techniques in speech recognition, dialogue management, and reasoning. Ongoing research can be tracked at venues such as IVA (Intelligent Virtual Agents), AIIDE (AI and Interactive Digital Entertainment), and AAMAS (Autonomous Agents and Multiagent Systems).
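The scripted dialogue trees on which commercial games rely can be pictured as a small graph of authored lines and anticipated player choices. The Python sketch below is a hypothetical illustration; the node names, lines, and choices are invented, and real game engines use much richer formats.

```python
# Hypothetical dialogue tree: node -> (character line, {player choice: next node}).
TREE = {
    "gate": ("Guard: Halt! What do you want?",
             {"offer gold": "bribe", "draw sword": "fight"}),
    "bribe": ("Guard: Very well, the gate is open.", {}),
    "fight": ("Guard: You will regret that!", {}),
}


def play(tree, start, choices):
    """Walk the tree along the given player choices and return the transcript."""
    node = start
    transcript = [tree[node][0]]
    for choice in choices:
        options = tree[node][1]
        if choice not in options:
            # Input the authors did not anticipate cannot be handled.
            transcript.append("(The game does not understand that.)")
            continue
        node = options[choice]
        transcript.append(tree[node][0])
    return transcript


transcript = play(TREE, "gate", ["sing a song", "draw sword"])
```

Only the choices written into the tree advance the dialogue; anything else, such as the free-form requests quoted earlier in this subsection, is rejected, which is why such games feel constrained.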
10.7 Natural language user interfaces
The topic of NL user interfaces subsumes a considerable variety of NL applications, ranging from text-based systems minimally dependent on understanding to systems with significant comprehension and inference capabilities in text- or speech-based interactions. The following subsections briefly survey a range of traditional and current application areas.
Text-based question answering
Text-based QA is practical to the extent that the types of questions being asked can be expected to have ready-made answers tucked away somewhere in the text corpora being accessed by the QA system. This has become much more feasible in this age of burgeoning internet content than a few decades ago, though questions still need to be straightforward, factual ones (e.g., “Who killed President Lincoln?”) rather than ones requiring inference (e.g., “In what century did Catherine the Great live?”, let alone “Approximately how many 8-foot 2-by-4s do I need to build a 4-foot-high, 15-foot-long picket fence?”). Text-based QA begins with question classification (e.g., yes-no questions, who-questions, what-questions, when-questions, etc.), followed by information retrieval for the identified type of question, followed by narrowing of the search to paragraphs and finally sentences that may contain the answer to the question. The successive narrowing typically employs word and other feature matching, and ultimately dependency and role matching, and perhaps limited textual inference to verify answer candidates. Textual inference may, for instance, use WordNet hypernym knowledge to try to establish that a given candidate answer sentence supports the truth of the declarative version of the question. Since the chosen sentence(s) may contain irrelevant material and anaphors, it remains to extract the relevant material (which may also include supporting context) and generate a well-formed, appropriate answer. Many early text-based QA systems up to 1976 are discussed in Bourne & Hahn 2003.
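The first and last stages of the pipeline just outlined, question classification and narrowing to a candidate sentence by word matching, can be sketched in Python as follows. The classification rules, the stopword list, and the sample sentences are assumptions made for the example; dependency and role matching, textual inference, and answer generation are left out.

```python
import re

# Hypothetical surface rules for question classification, tried in order.
RULES = [
    ("who", re.compile(r"^who\b", re.I)),
    ("when", re.compile(r"^(?:when|in what (?:year|century))\b", re.I)),
    ("where", re.compile(r"^where\b", re.I)),
    ("what", re.compile(r"^what\b", re.I)),
    ("yes-no", re.compile(r"^(?:is|are|was|were|do|does|did|can|will)\b", re.I)),
]

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"who", "what", "when", "where", "the", "a", "an",
             "did", "was", "is", "of", "in"}


def classify_question(question):
    """Return the label of the first rule matching the start of the question."""
    for label, pattern in RULES:
        if pattern.search(question.strip()):
            return label
    return "other"


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def best_sentence(question, sentences):
    """Pick the sentence sharing the most content words with the question."""
    q_words = content_words(question)
    return max(sentences, key=lambda s: len(q_words & content_words(s)))


# Invented candidate sentences standing in for retrieved text.
candidates = [
    "Lincoln was born in Kentucky.",
    "John Wilkes Booth killed President Lincoln in 1865.",
]
chosen = best_sentence("Who killed President Lincoln?", candidates)
```

Here the who-question is matched to the second sentence by simple overlap. The century question is classified as a when-question, but answering it would require inference from dates, which this kind of matching cannot supply.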
Later surveys (e.g., Maybury 2004) have tended to include the full spectrum of QA methods, but TREC conference proceedings (https://trec.nist.gov/) feature numerous papers on implemented systems for text-based QA. In open-domain QA, many questions are concerned with properties of named entities, such as birth date, birth place, occupation, and other personal attributes of well-known present and historical individuals, locations, ownership, and products of various companies, facts about consumer products, geographical facts, and so on. For answering such questions, it makes sense to pre-assemble the relevant factoids into a large knowledge base, using knowledge acquisition methods like those in section 8. Examples of systems containing an abundance of factoids about named entities are several developed at the University of Washington, storing factoids as text fragments, and various systems that map harvested factoids into RDF (Resource Description Framework) triples (see references in Other Internet Resources). Some of these systems obtain their knowledge not only from open information extraction and targeted relation extraction, but also from such sources as Wikipedia “infoboxes” and (controlled) crowdsourcing. Here we are also stretching the notion of question answering, since several of the mentioned systems require the use of key words or query patterns for retrieval of factoids. From a general user perspective, it is unclear how much added benefit can be derived from such constructed KBs, given the remarkable ability of Google and other search engines to provide rapid answers even to such questions as “Which European countries are landlocked?” (typed without quotes—with quotes, Google finds the top answer using True Knowledge), or “How many Supreme Court justices did Kennedy appoint?” Nonetheless, both Google and Microsoft have recently launched vast “knowledge graphs” featuring thousands of relations among hundreds of millions of entities. 
The purpose is to provide direct answers (rather than merely retrieved web page snippets) to query terms and natural language questions, and to make inferences about the likely intent of users, such as purchasing some type of item or service.
Database front-ends
Natural-language front ends for databases have long been considered an attractive application of NLP technology, beginning with such systems as LUNAR (Woods et al. 1972) and REL (Thompson et al. 1969; Thompson & Thompson 1975). The attractiveness lies in the fact that retrieval and manipulation of information from a relational (or other uniformly structured) database can be assumed to be handled by an existing db query language and process. This feature sharply limits the kinds of natural language questions to be expected from a user, such as questions aimed at retrieving objects or tuples of objects satisfying given relational constraints, or providing summary or extremal properties (longest rivers, lowest costs, and the like) about them. It also greatly simplifies the interpretive process and question-answering, since the target logical forms (formal db queries) have a known, precise syntax and are executed automatically by the db management system, leaving only the work of displaying the computed results in some appropriate linguistic, tabular or graphical form. Numerous systems have been built since then, aimed at applications such as navy data on ships and their deployment (LADDER: Hendrix et al. 1978), land-use planning (Damerau 1981), geographic QA (CHAT-80: Pereira & Warren 1982), retrieval of company records and product records for insurance companies, oil companies, manufacturers, retailers, banks, etc. (INTELLECT: Harris 1984), compilation of statistical data concerning customers, services, assets, etc., of a company (Cercone et al. 1993), and many more (e.g., see Androutsopoulos & Ritchie 2000).
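The division of labor described above, in which the front end only maps a question to a formal query and the database system does the rest, can be illustrated with a toy Python sketch using the standard `sqlite3` module. The table, its approximate figures, the two question templates, and the function name are all invented for the example; systems such as LUNAR or CHAT-80 used genuine parsing rather than fixed templates.

```python
import re
import sqlite3

# In-memory database with illustrative, approximate river lengths.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rivers (name TEXT, length_km REAL)")
conn.executemany("INSERT INTO rivers VALUES (?, ?)",
                 [("Nile", 6650), ("Amazon", 6400), ("Danube", 2850)])

# Hypothetical question templates, each paired with a parameterized SQL query.
TEMPLATES = [
    (re.compile(r"^what is the longest river\??$", re.I),
     "SELECT name FROM rivers ORDER BY length_km DESC LIMIT 1"),
    (re.compile(r"^which rivers are longer than (\d+) km\??$", re.I),
     "SELECT name FROM rivers WHERE length_km > ? ORDER BY name"),
]


def answer(question):
    """Translate a recognized question into SQL, run it, and return the names."""
    for pattern, sql in TEMPLATES:
        match = pattern.match(question.strip())
        if match:
            params = tuple(float(g) for g in match.groups())
            return [row[0] for row in conn.execute(sql, params)]
    return None  # question outside the covered patterns
```

With this setup, `answer("What is the longest river?")` returns the single name stored with the greatest length, and a question outside the two templates returns `None`, a small-scale picture of the brittleness that limits such front ends.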
However, the commercial impact of such systems has remained scant, because they have generally lacked the reliability and some of the functionalities of traditional db access.
Inferential (knowledge-based) question answering
We have noted certain limited inferential capabilities in text-based QA systems and NL front ends for databases, such as the ability to confirm entailment relations between candidate answers and questions, using simple sorts of semantic relations among the terms involved, and the ability to sort or categorize data sets from databases and compute averages or even create statistical charts. However, such limited, specialized inference methods fall far short of the kind of general reasoning based on symbolic knowledge that has long been the goal in AI question answering. One of the earliest efforts to create a truly inferential QA system was the ENGLAW project of L. Stephen Coles (Coles 1972). ENGLAW was intended as a prototype of a kind of system that might be used by scientists and engineers to obtain information about physical laws. It featured a KB of axioms (in first-order logic) for 128 important physical laws, manually coded with the aid of a reference text. Questions (such as “In the Peltier Effect, does the heat developed depend on the direction of the electric current?”) were rendered into logic via a transformational grammar parser, and productions (aided by various Lisp functions) that map phrase patterns to logical expressions. The system was not developed to the point of practical usefulness, but its integration of reasoning and NLP technologies and its methods of selectively retrieving axioms for inferential QA were noteworthy contributions. An example of a later larger-scale system aimed at practical goals was BBN's JANUS system (Ayuso et al. 1990).
This was intended for naval battle management applications, and could answer questions about the locations, readiness, speed and other attributes of ships, allowing for change with the passage of time. It mapped English queries to a very expressive initial representation language with an “intension” operator to relate formulas to times and possible worlds, and this was in turn mapped into the NIKL description logic, which proved adequate for the majority of inferences needed for the targeted kinds of QA. Jumping forward in time, we take note of the web-based Wolfram|Alpha (or WolframAlpha) answer engine, developed by Wolfram Research and consisting of 15 million lines of Mathematica code grounded in curated data bases, models, and algorithms for thousands of different domains. (Mathematica is a mathematically oriented high-level programming language developed by the British scientist Stephen Wolfram.) The system is tilted primarily towards quantitative questions (e.g., “What is the GDP of France?”, or “What is the surface area of the Moon?”) and often provides charts and graphics along with more direct answers. The interpretation of English queries into functions applied to various known objects is accomplished with the pattern matching and symbol manipulation capabilities of Mathematica. However, the comprehension of English is not particularly robust at the time of writing. For example “How old was Lincoln when he died?”, “At what age did Lincoln die?” and other variants were not understood, though in many cases of misunderstanding, Wolfram|Alpha displays enough retrieved information to allow inference of an answer. A related shortcoming is that Wolfram|Alpha's quantitative skills are not supplemented with significant qualitative reasoning skills. For example, “Was Socrates a man?” (again, at the time of writing) prompts display of summary information about Socrates, including an image, but no direct answer to the question. 
Still, Wolfram|Alpha's quantitative abilities are not only interesting in stand-alone mode, but also useful as augmentations of search engines (such as Microsoft Bing) and of voice-based personal assistants such as Apple's Siri (see below). Another QA system enjoying wide recognition because of its televised victory in the Jeopardy! quiz show is IBM's “Watson” (Ferrucci 2012; Ferrucci et al. 2010; Baker 2011). Like Wolfram|Alpha, this is in a sense a brute force program, consisting of about a million lines of code in Java, C++, Prolog and other languages, created by a core team of 20 researchers and software engineers over the course of three years. The program runs 3000 processes in parallel on ninety IBM Power 750 servers, and has access to 200 million pages of content from sources such as WordNet, Wikipedia (and its structured derivatives YAGO and DBpedia), thesauri, newswire articles, and literary texts, amounting to several terabytes of human knowledge. (This translates into roughly 10^10 clausal chunks, a number likely to be around 2 orders of magnitude greater than the number of basic facts over which any one human being disposes.) Rather than relying on any single method of linguistic or semantic analysis, or method of judging relevance of retrieved passages and textual “nuggets” therein, Watson applies multiple methods to the questions and candidate answers, including methods of question classification, focal entity detection, parsing, chunking, lexical analysis, logical form computation, referent determination, relation detection, temporal analysis, and special methods for question-answer pairs involving puns, anagrams, and other twists common in Jeopardy!. Different question analyses are used separately to retrieve relevant documents, and to derive, analyze and score potential answers from passages and sentences in those documents.
In general, numerous candidate answers to a question are produced, and their analyses provide hundreds of features whose weights for obtaining ranked answers with corresponding confidence levels are learned by ML methods applied to a corpus of past Jeopardy! questions and answers (or officially, answers and questions, according to the peculiar conceit of the Jeopardy! protocol). Watson's wagers are based on the confidence levels of its potential answers and a complex regression model. How well does Watson fit under our heading of inferential, knowledge-based QA? Does it actually understand the questions and the answers it produces? Despite its impressive performance against Jeopardy! champions, Watson reasons and understands English only in very restricted senses. The program exploits the fact that the target of a Jeopardy! question is usually a named entity, such as Jimmy Carter, Islamabad, or Black Hole of Calcutta, though other types of phrases are occasionally targeted. Watson is likely to find multiple sentences that mention a particular entity of the desired type, and whose syntactic and semantic features are close to the features of the question, thereby making the named entity a plausible answer without real understanding of the question. For example, a “recent history” question asking for the president under whom the US gave full recognition to Communist China (Ferrucci 2012) might well zero in on such sentences as “Although he was the president who restored full diplomatic relations with China in 1978, Jimmy Carter has never visited that country …” (New York Times, June 27, 1981) or “Exchanges between the two countries' nuclear scientists had begun soon after President Jimmy Carter officially recognized China in 1978.” (New York Times, Feb.
2, 2001)
While the links between such sentences and the correct answer are indirect (e.g., dependent on resolving he and who to Jimmy Carter, and associating restored diplomatic relations with recognized, and Communist China with China), correct analysis of those links is not a requirement for success: it is sufficient for the cluster of sentences favoring the answer Jimmy Carter (in virtue of their word and phrasal content and numerous other features) to provide a larger net weight to that answer than any competing clusters. This type of statistical evidence combination based on stored texts seems unlikely to provide a path to the kind of understanding that even first-graders betray in answering simple commonsense questions, such as “How do people keep from getting wet when it rains?”, or “If you eat a cookie, what happens to the cookie?” At the same time, vast data banks utilized in the manner of Watson can make up for inferential weakness in various applications, and IBM is actively redeveloping Watson as a resource for physicians, one that should be able to provide diagnostic and treatment possibilities that even specialists may not have at their fingertips. In sum, however, the goal of open-domain QA based on genuine understanding and knowledge-based reasoning remains largely unrealized.
Voice-based web services and assistants
Voice-based services, especially on mobile devices, are a rapidly expanding application area. Services range from organizers (for grocery lists, meeting schedules, reminders, contact lists, etc.), to in-car “infotainment” (routing, traffic conditions, hazard warnings, iTunes selection, finding nearby restaurants and other venues, etc.), to enabling use of other miscellaneous apps such as email dictation, dialing contacts, financial transactions, reservations and placement of orders, Wikipedia access, help-desk services, health advising, and general question answering.
Some of these services (such as dialing and iTunes selection) fall into the category of hands-free controls, and such controls are becoming increasingly important in transport (including driverless or pilotless vehicles), logistics (deployment of resources), and manufacturing. Also, chatbot technology and companionable dialogue agents (as discussed in section 10.5) are serving as general backends to more specific voice-based services. The key technology in these services is of course speech recognition, whose accuracy and adaptability have been gradually increasing. The least expensive, narrowly targeted systems (e.g., simple organizers) exploit strong expectations about user inputs to recognize, interpret and respond to those inputs; as such they resemble menu-driven systems. More versatile systems, such as car talkers that can handle routing, musical requests, searches for venues, etc., rely on more advanced dialogue management capabilities. These allow for topic switches and potentially for the attentional state of the user (e.g., delaying answering a driver's question if the driver needs to attend to a turn). The greatest current “buzz” surrounds advanced voice-based assistants, notably iPhone's Siri (followed by Android's Iris, True Knowledge's Evi, Google Now, and others). While previous voice control and dictation systems, like Android's Vlingo, featured many of the same functionalities, Siri adds personality and improved dialogue handling and service integration: users feel that they are interacting with a lively synthetic character rather than an app. Besides Nuance SR technology, Siri incorporates complex techniques that were to some extent pushed forward by the CALO (Cognitive Assistant that Learns and Organizes) project carried out by SRI International and multiple universities from 2003 to 2008 (Ambite et al. 2006; CALO [see Other Internet Resources]).
These techniques include aspects of NLU, ML, goal-directed and uncertain inference, ontologies, planning, and service delegation. But while delegation to web services, including Wolfram|Alpha QA, or chatbot technology provides considerable robustness, and there is significant reasoning about schedules, purchasing and other targeted services, general understanding is still very shallow, as users soon discover. An anecdotal example of a serious misunderstanding is “Call me an ambulance” eliciting the response “From now on I will call you ‘an ambulance’”. However, the strong interest and demand in the user community generated by these early (somewhat) intelligent, quite versatile assistants is likely to intensify and accelerate research towards ever more life-like virtual agents, with ever more understanding and common sense.
10.8 Collaborative problem solvers and intelligent tutors
We discuss collaborative problem solving systems (also referred to as “mixed-initiative” or “task-oriented” dialogue systems) and tutorial dialogue systems (i.e., tutorial systems in which dialogue plays a pivotal role) under a common heading because both depend on rather deep representations or models of the domains they are aimed at as well as the mental state of the users they interact with. However, we should immediately note that collaborative problem solving systems typically deal with much less predictable domain situations and user inputs than tutorial systems, and accordingly the former place much greater emphasis on flexible dialogue handling than the latter. For example, collaborators in emergency evacuation (Ferguson and Allen 1998, 2007) need to deal with a dynamically changing domain, at the same time handling the many dialogue states that may occur, depending on the participants' shared and private beliefs, goals, plans and intentions at any given point. By contrast, in a domain such as physics tutoring (e.g., Jordan et al. 2006; Litman and Silliman 2004), the learner can be guided through a network of learning goals with authored instructions, and corresponding to those goals, finite-state dialogue models can be designed that classify student inputs at each point in a dialogue and generate a prepared response likely to be appropriate for that input. It is therefore not surprising that tutorial dialogue systems are closer to commercial practicality, with demonstrated learning benefits relative to conventional instruction in various evaluations, than collaborative problem solving systems for realistic applications. 
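The finite-state style of tutorial dialogue described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the physics question, the keyword lists, the input classes, and the names `State`, `TUTOR`, `classify` and `step` are all hypothetical, and keyword matching merely stands in for whatever input classifier a real tutor would use.

```python
import re
from dataclasses import dataclass

@dataclass
class State:
    prompt: str      # what the tutor says on entering this state
    keywords: dict   # input class -> set of trigger words (hypothetical)
    responses: dict  # input class -> (prepared response, next state name)

# A one-state toy model; "done" marks the end of the dialogue.
# The misconception class is listed first so that it is checked first.
TUTOR = {
    "ask_force": State(
        prompt="A ball is thrown straight up. Ignoring air resistance, "
               "which forces act on it after release?",
        keywords={
            "impetus": {"throw", "hand", "push"},
            "correct": {"gravity", "weight"},
        },
        responses={
            "impetus": ("The hand no longer touches the ball. "
                        "What still pulls on it?", "ask_force"),
            "correct": ("Right, only gravity acts on it.", "done"),
            "other": ("Think about what pulls objects toward the Earth.",
                      "ask_force"),
        },
    ),
}

def classify(text, keywords):
    """Return the first class with a trigger word in the text, else 'other'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for label, triggers in keywords.items():
        if words & triggers:
            return label
    return "other"

def step(state_name, student_input):
    """Classify the input in the current state; return (class, response, next state)."""
    state = TUTOR[state_name]
    label = classify(student_input, state.keywords)
    response, next_state = state.responses[label]
    return label, response, next_state
```

Calling `step("ask_force", "The force of the throw")` selects the prepared remediation for the "impetus" class and stays in the same state, while an answer mentioning gravity alone moves the dialogue to "done". Because every reachable state and response is authored in advance, behavior is predictable, which is the property that makes such tutors practical, at the cost of the flexibility that collaborative problem solving requires.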
Tutorial dialogue systems have been built for numerous domains and potential clienteles, ranging from K-12 subjects to computer literacy and novice programming, qualitative and quantitative physics, circuit analysis, operation of machinery, cardiovascular physiology, fire damage control on ships, negotiation skills, and more (e.g., see Boyer et al. 2009; Pon-Barry et al. 2006). Among the most successful tutorial systems are reading tutors (e.g., Mostow and Beck 2007; Cole et al. 2007), since the materials presented to the learner (in a “scaffolded” manner) are relatively straightforward to design in this case, and the responses of the learner, especially when they consist primarily of reading presented text aloud, are relatively easy to evaluate. For the more ambitious goal of fostering reading comprehension, the central problem is to design dialogues so as to make the learner's contributions predictable, while also making the interaction educationally effective (e.g., Aist and Mostow 2009). Some tutoring systems, especially ones aimed at children, use animated characters to heighten the learner's sense of engagement. Such enhancements are in fact essential for systems aimed at learners with disabilities like deafness (where mouth and tongue movements of the virtual agent observed by the learner can help with articulation), autism, or aphasia (Massaro et al. 2012; Cole et al. 2007). As well, if tutoring is aimed specifically at training interpersonal skills, implementation of life-like characters (virtual humans) becomes an indispensable part of system development (e.g., Core et al. 2006; Campbell et al. 2011). Modeling the user's state of mind in tutoring systems is primarily a matter of determining which of the targeted concepts and skills have, or have not yet, been acquired by the user, and diagnosing misunderstandings that are likely to have occurred, given the session transcript so far. 
Some recent experimental systems can also adapt their strategies to the user's apparent mood, such as frustration or boredom, as might be revealed by the user's inputs, tone of voice, or even facial expressions or gestures analyzed via computer vision. Other prototype systems can be viewed as striving towards more general mental modeling, by incorporating ideas and techniques from task-oriented dialogue systems concerning dialogue states, dialogue acts, and deeper language understanding (e.g., Callaway et al. 2007). In task-oriented dialogue systems, as already noted, dialogue modeling is much more challenging, since such systems are expected not only to contribute to solving the domain problem at hand, but to understand the user's utterances, beliefs, and intentions, and to hold their own in a human-like, mixed-initiative dialogue. This requires domain models, general incremental collaborative planning methods, dialogue management that models rational communicative interaction, and thorough language understanding (especially intention recognition) in the chosen domain. Prototype systems have been successfully built for domains such as route planning, air travel planning, driver and pedestrian guidance, control and operation of external devices, emergency evacuation, and medication advising (e.g., Allen et al. 2006; Rich and Sidner 1998; Bühler and Minker 2011; Ferguson and Allen 1998, 2007), and these hold very significant practical promise. However, systems that can deal with a variety of reasonably complex problems, especially ones requiring broad commonsense knowledge about human cognition and behavior, still seem out of reach at this time.
10.9 Language-enabled robots
As noted at the beginning of section 10, robots are beginning to be equipped with web services, question answering abilities, chatbot techniques (for fall-back and entertainment), tutoring functions, and so on. The transfer of such technologies to robots has been slow, primarily because of the very difficult challenges involved in just equipping a robot with the hardware and software needed for basic visual perception, speech recognition, exploratory and goal-directed navigation (in the case of mobile robots), and object manipulation. However, the keen public interest in intelligent robots and their enormous economic potential (for household help, eldercare, medicine, education, entertainment, agriculture, industry, search and rescue, military missions, space exploration, and so on) will surely continue to energize the drive towards greater robotic intelligence and linguistic competence. A good sense of the state of the art and difficulties in human-robot dialogue can be gained from Scheutz et al. (2011). Some of the dialogue examples presented there, concerning boxes and blocks, are reminiscent of Winograd's SHRDLU, but they also exhibit the challenges involved in real interaction, such as the changing scenery as the robot moves, speech recognition errors, disfluent and complex multi-clause utterances, perspective-dependent utterances (“Is the red box to the left of the blue box?”), and deixis (“Go down there”). In addition, all of this must be integrated with physical action planned so as to fulfill the instructions as understood by the robot.
While the ability of recent robots to handle these difficulties to some degree is encouraging, many open problems remain, such as speech recognition in the presence of noise; better, broader linguistic coverage, parsing, and dialogue handling; adaptation to novel problems; mental modeling of the interlocutor and other humans in the environment; and greater general knowledge about the world and the ability to use it for inference and planning (both at the domain level and the dialogue level). While task-oriented robot dialogues involve all these challenges, we should note that some potentially useful interactions with “talking” robots require little in the way of linguistic skills. For example, the RUBI robot described in Movellan et al. (2009) displayed objects on its screen-equipped “chest” to toddlers, asking them to touch and name the objects. This resulted in improved word learning by the toddlers, despite the simplicity of the interaction. Another example of a very successful talking robot with no real linguistic skills was the “museum tour guide” RHINO (Burgard et al. 1999). Unlike RUBI, it was able to navigate among unpredictably moving humans, and kept its audience engaged with its prerecorded messages and with a display of its current goals on a screen. In the same way, numerous humanoid robots (for example, Honda's Asimo) under past and present development across the world still understand very little language and rely mostly on scripted output. No doubt their utility and appeal will continue to grow, thanks to technologies like those mentioned above (games, companionable agent systems, voice-based apps, tutors, and so on), and these developments will also fuel progress on the deeper aspects of perception, motion, manipulation, and meaningful dialogue.