NLP Lecture Notes 2022 Edited
LECTURE NOTES
Prepared By
J.SOWJANYA
Asst. Prof., ITD
Department of Information Technology
M2. To provide adequate fundamental knowledge of sciences and Information Technology with a
positive attitude.
M3. To create an environment that enhances the skills and technologies required by industry.
M4. To encourage creativity and innovation for solving real-world problems.
M5. To cultivate professional ethics in students and inculcate a sense of responsibility towards
society.
PEO2: Apply knowledge of mathematics and Information Technology to analyze, design and
implement solutions for real world problems in core or in multidisciplinary areas.
PEO3: Communicate effectively, work in a team, practice professional ethics and apply
knowledge of computing technologies for societal development.
PSO 2 - Software design: An ability to analyze a problem, design algorithms, identify and define
the computing requirements appropriate to its solution, and implement the same.
PROGRAM OUTCOMES (POs)
TIME TABLE
Section-B
1. Wednesday : 10:30 am to 11:30 am
2. Tuesday : 12:40 pm to 1:40 pm
Section-C
1. Monday : 10:30 am to 11:30 am
2. Wednesday : 12:40 pm to 1:40 pm
UNIT – II
Introduction to Semantics and Knowledge Representation: some applications like
machine translation and database interfaces; Semantic Interpretation: word senses and
ambiguity, Basic logical form language, Encoding ambiguity in logical form, Thematic roles,
Linking syntax and semantics, Recent trends in NLP.
UNIT – III
Grammars and Parsing: Grammars and sentence Structure, Top-Down and Bottom-Up
Parsers, Transition Network Grammars, Top-Down Chart Parsing. Feature Systems and
Augmented Grammars: Basic Feature system for English, Morphological Analysis and the
Lexicon, Parsing with Features, Augmented Transition Networks.
UNIT – IV
Semantic Interpretation: word senses and ambiguity, Basic logical form language,
Encoding ambiguity in logical form, Thematic roles, Linking syntax and semantics, Recent
trends in NLP.
UNIT – V
Ambiguity Resolution: Statistical Methods, Probabilistic Language Processing, Estimating
Probabilities, Part- of- Speech tagging, Obtaining Lexical Probabilities, Probabilistic
Context- Free Grammars, Best First Parsing. Semantics and Logical Form, Word senses and
Ambiguity, Encoding Ambiguity in Logical Form.
Suggested Readings:
1. James Allen, "Natural Language Understanding", Pearson Education.
2. Christopher D. Manning and Hinrich Schutze, "Foundations of Statistical Natural
Language Processing", MIT Press, 1999.
3. Akshar Bharti, Vineet Chaitanya and Rajeev Sangal, "NLP: A Paninian Perspective",
Prentice Hall, New Delhi.
4. D. Jurafsky, J. H. Martin, "Speech and Language Processing", Pearson.
Objectives and Outcomes
Code: PC 503 IT
Objectives:
1. To represent and analyse natural language, both spoken and written, using statistical
and finite-state methods for modelling and classification, and to use grammars for
natural language processing.
2. To study knowledge representation from its semantics view point with emphasis on
applications. To study basic logical form language to encode ambiguity.
3. To study augmented grammars and parsers for feature systems.
4. To resolve and encode ambiguity using statistical methods to estimate lexical
probabilities along with critical study of probabilistic context free grammars and
parsing.
5. To interpret semantics covering ambiguity and link syntax to semantics
Outcomes:
1. Use statistical and finite state methods for modelling and classification for
representation and analysis of natural languages, and use grammars for natural
language processing.
2. Apply knowledge representation and semantics to machine translation and
database semantic interpretation.
3. Perform top-down and bottom-up parsing, and parsing with features.
4. Estimate lexical probabilities, resolve ambiguity, and use probabilistic context-free
grammar.
5. Encode ambiguity in logical form language, deal with word senses and ambiguity, and
link syntax to semantics.
LECTURE PLAN
Course: Natural Language Processing Faculty: J. Sowjanya.
Code: PE 633 IT Designation: Asst Prof.
[Lecture plan table omitted: columns were Lecture No., Broad Area, No. of Periods, Planned Lecture Date, and Delivery Date.]
Applications of NLP
The following are some of the main applications of NLP:
1. Speech Recognition - Speech recognition is used for converting spoken words into text. It is used in
applications such as mobile phones, home automation, video retrieval, dictating to Microsoft Word, voice
biometrics, voice user interfaces, and so on.
2. Spam Detection -Spam detection is used to detect unwanted e-mails getting to a user's inbox.
3. Sentiment Analysis - Sentiment analysis is also known as opinion mining. It is used on the web to
analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through
a combination of NLP (Natural Language Processing) and statistics: values are assigned to the text
(positive, negative, or neutral), and the mood of the context (happy, sad, angry, etc.) is identified.
4. Machine Translation - Machine translation is used to translate text or speech from one natural language
to another natural language. Example: Google Translator
5. Spelling correction (spell check) - Word processors such as Microsoft Word and PowerPoint
provide spelling correction.
6. Chatbot (Virtual Assistant) - Chatbots are used by many companies to provide chat-based customer services.
7. Information extraction - used for extracting structured information from unstructured or semi-
structured machine-readable documents.
8. Question Answering - focuses on building systems that automatically answer questions asked by
humans in a natural language.
9. Auto correction
10. Recommendations.
Components of NLP
NLP pipeline
The following steps are used to build an NLP pipeline:
Step1: Sentence Segmentation - Sentence Segment is the first step for building the NLP pipeline. It breaks
the paragraph into separate sentences.
Example: Consider the following paragraph -
Independence Day is one of the important festivals for every Indian citizen. It is celebrated on the
15th of August each year ever since India got independence from the British rule. The day celebrates
independence in the true sense.
Sentence Segment produces the following result:
1. "Independence Day is one of the important festivals for every Indian citizen."
2. "It is celebrated on the 15th of August each year ever since India got independence from the British
rule."
3. "This day celebrates independence in the true sense."
Step 2: Word Tokenization
Word tokenization breaks each sentence into its individual words, called tokens.
Step 3: Stemming
Stemming is used to normalize words into their base or root form. For example, celebrates, celebrated,
and celebrating all originate from the single root word "celebrate." The big problem with
stemming is that it sometimes produces a root word that has no meaning.
For example, intelligence, intelligent, and intelligently all reduce to the single root
"intelligen." In English, the word "intelligen" does not have any meaning.
Step 4: Lemmatization
Lemmatization is quite similar to stemming. It is used to group different inflected forms of a word
into its base form, called the lemma. The main difference between stemming and lemmatization is that
lemmatization produces a root word that has a meaning.
For example: in lemmatization, the words intelligence, intelligent, and intelligently have the root word
intelligent, which has a meaning.
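The contrast between the two steps can be seen in a short sketch using NLTK's PorterStemmer and WordNetLemmatizer (assuming NLTK and its wordnet data are installed; exact outputs depend on the library version):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # data required by the lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["celebrates", "celebrated", "celebrating"]:
        # The stemmer can return a truncated, meaningless stem (such as "celebr"),
        # while the lemmatizer returns the dictionary form "celebrate".
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))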
Phases of NLP
There are the following five phases of NLP:
1. Lexical Analysis - This phase scans the text and divides it into paragraphs, sentences, and
words, and analyses the structure of each word.
2. Syntactic Analysis (Parsing) - This phase checks the grammar of sentences and determines the
arrangement of words and the relationships among them.
3. Semantic Analysis
Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning of
words, phrases, and sentences.
4. Discourse Integration
Discourse Integration depends upon the sentences that precede it and also invokes the meaning of the
sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by applying a set of
rules that characterize cooperative dialogues.
For Example: "Open the door" is interpreted as a request instead of an order.
Language is one of the fundamental aspects of human behavior and is a crucial component
of our lives. In written form it serves as a long-term record of knowledge from one generation to
the next. In spoken form it serves as our primary means of coordinating our day-to-day behavior
with others. This book describes research about how language comprehension and production
work. The goal of this research is to create computational models of language in enough detail that
you could write computer programs to perform various tasks involving natural language. The
ultimate goal is to be able to specify models that approach human performance in the linguistic
tasks of reading, writing, hearing, and speaking.
Language is studied in several different academic disciplines. Each discipline defines its own
set of problems and has its own methods for addressing them.
1. The linguist - studies the structure of language itself, considering questions such as why certain
combinations of words form sentences but others do not, and why a sentence can have some
meanings but not others.
2. The psycholinguist - studies the processes of human language production and comprehension,
considering questions such as how people identify the appropriate structure of a sentence and
when they decide on the appropriate meaning for words.
3. The philosopher- considers how words can mean anything at all and how they identify objects
in the world. Philosophers also consider what it means to have beliefs, goals, and intentions,
and how these cognitive capabilities relate to language.
4. The computational linguist - aims to develop a computational theory of language, using the
notions of algorithms and data structures from computer science. Of course, to build a
computational model, you must take advantage of what is known from all the other disciplines.
Figure 1.1 summarizes these different approaches to studying language.
Computational models are useful both for scientific purposes — for exploring the nature of
linguistic communication — and for practical purposes — for enabling effective human-machine
communication.
The scientific motivation is to obtain a better understanding of how language works. It
recognizes that any one of the other traditional disciplines does not have the tools to completely
address the problem of how language comprehension and production work. Even if you combine
all the approaches, a comprehensive theory would be too complex to be studied using traditional
methods. But we may be able to realize such complex theories as computer programs and then test
them by observing how well they perform. By seeing where they fail, we can incrementally improve
them. Computational models may provide very specific predictions about human behavior that can
then be explored by the psycholinguist. By continuing in this process, we may eventually acquire a
deep understanding of how human language processing occurs. To realize such a dream will take
the combined efforts of linguists, psycholinguists, philosophers, and computer scientists. This
common goal has motivated a new area of interdisciplinary research often called cognitive science.
This book takes a middle ground between the scientific and technological goals. On the one
hand, this reflects a belief that natural language is so complex that an ad hoc approach without a
well-specified underlying theory will not be successful. Thus the technological goal cannot be
realized without using sophisticated underlying theories on the level of those being developed by
linguists, psycholinguists, and philosophers. On the other hand, the present state of knowledge
about natural language processing is so preliminary that attempting to build a cognitively correct
model is not feasible. Rather, we are still attempting to construct any model that appears to work.
A good way to define natural language research is to consider the different applications that
researchers work on. As you consider these examples, it will also be a good opportunity to
consider what it would mean to say that a computer system understands natural language. The
applications can be divided into two major classes: text-based applications and dialogue-based
applications.
Text-based applications involve the processing of written text, such as books, newspapers,
reports, manuals, e-mail messages, and so on. These are all reading-based tasks. Text-based natural
language research is ongoing in applications such as
1. finding appropriate documents on certain topics from a data-base of texts (for example,
finding relevant books in a library)
2. extracting information from messages or articles on certain topics (for example, building
a database of all stock transactions described in the news on a given day)
3. translating documents from one language to another (for example, producing automobile
repair manuals in many different languages)
4. summarizing texts for certain purposes (for example, producing a 3-page summary of a
1000-page government report)
Not all systems that perform such tasks must be using natural language understanding
techniques in the way we mean in this book. For example, consider the task of finding newspaper
articles on a certain topic in a large database. Many techniques have been developed that classify
documents by the presence of certain keywords in the text. You can then retrieve articles on a
certain topic by looking for articles that contain the keywords associated with that topic. Articles
on law, for instance, might contain the words "lawyer", "court", "sue", "affidavit", and so on, while
articles on stock transactions might contain words such as "stocks", "takeover", "leveraged buyout",
"options", and so on. Such a system could retrieve articles on any topic that has been predefined by
a set of keywords. Clearly, we would not say that this system is understanding the text; rather, it is
using a simple matching technique. While such techniques may produce useful applications, they
are inherently limited. It is very unlikely, for example, that they could be extended to handle
complex retrieval tasks that are easily expressed in natural language, such as the query "Find me
all articles on leveraged buyouts involving more than 100 million dollars that were attempted but
failed during 1986 and 1990". To handle such queries, the system would have to be able to extract
enough information from each article in the database to determine whether the article meets the
criteria defined by the query; that is, it would have to build a representation of the information in
the articles and then use the representation to do the retrievals. This identifies a crucial
characteristic of an understanding system: it must compute some representation of the information
that can be used for later inference.
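A toy version of such a keyword classifier makes the point concrete; the topics and keyword sets below are illustrative only, taken from the examples above:

    import re

    # Keyword sets per topic, as described in the text.
    TOPIC_KEYWORDS = {
        "law": {"lawyer", "court", "sue", "affidavit"},
        "stocks": {"stocks", "takeover", "buyout", "options"},
    }

    def topics_for(article):
        """Return every topic whose keyword set overlaps the article's words."""
        words = set(re.findall(r"[a-z]+", article.lower()))
        return [topic for topic, keywords in TOPIC_KEYWORDS.items() if words & keywords]

    print(topics_for("The court ruled that the lawyer could sue."))  # ['law']

No representation of the articles' content is ever built, which is exactly why such a matcher cannot answer the leveraged-buyout query above.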
Consider another example. Some machine translation systems have been built that are
based on pattern matching; that is, a sequence of words in one language is associated with a
sequence of words in another language. The translation is accomplished by finding the best set of
patterns that match the input and producing the associated output in the other language. This
technique can produce reasonable results in some cases but sometimes produces completely
wrong translations because of its inability to use an understanding of content to disambiguate
word senses and sentence meanings appropriately. In contrast, other machine translation systems
operate by producing a representation of the meaning of each sentence in one language, and then
producing a sentence in the other language that realizes the same meaning. This latter approach,
because it involves the computation of a representation of meaning, is using natural language
understanding techniques.
One very attractive domain for text-based research is story understanding. In this task the
system processes a story and then must answer questions about it. This is similar to the type of
reading comprehension tests used in schools and provides a very rich method for evaluating the
depth of understanding the system is able to achieve.
As you can see, what counts as understanding might vary from application to application. If
this is so, how can you tell if a system works? One obvious way to evaluate a system is to run the
program and see how well it performs the task it was designed to do. If the program is meant to
answer questions about a database of facts, you might ask it questions to see how good it is at
producing the correct answers. If the system is designed to participate in simple conversations on a
certain topic, you might try conversing with it. This is called black box evaluation because it
evaluates system performance without looking inside to see how it works. While ultimately this
method of evaluation may be the best test of a system’s capabilities, it is problematic in the early
stages of research because early evaluation results can be misleading. Sometimes the techniques
that produce the best results in the short term will not lead to the best results in the long term. For
instance, if the overall performance of all known systems in a given application is uniformly low,
few conclusions can be drawn. The fact that one system was correct 50 percent of the time while
another was correct only 40 percent of the time says nothing about the long-term viability of either
approach. Only when the success rates become high, making a practical application feasible, can
much significance be given to overall system performance measures.
Despite the difficulties, evaluations are of crucial importance to the field, as very little can be
concluded from examining a few examples that a particular system can handle. If you cannot
interact freely with the program yourself but must depend on transcripts of others’ interactions,
you can easily be misled into overestimating a program’s abilities. It is very hard for people not to
attribute much more intelligence to the program than is actually present.
Consider a concrete example — the ELIZA program that was developed in the mid-1960s at
MIT. This program must be one of the most popular AI programs of all time, and versions of it exist
for most machines, including most personal computers. ELIZA was never claimed to embody a
theory of language comprehension and production, but it serves as an excellent example as its
behavior initially seems impressive. The system plays the role of a therapist and, to obtain the best
results, the user should correspondingly play the role of a patient. Figure 1.2 presents a transcript
of this system in operation. Given this transcript, or even playing with the system yourself for a few
minutes, ELIZA's performance certainly seems impressive.
USER: Men are all alike.
ELIZA: In what way?
USER: They're always bugging us about something or other.
ELIZA: Can you think of a specific example?
USER: Well, my boyfriend made me come here.
ELIZA: Your boyfriend made you come here.
USER: He says I'm depressed much of the time.
ELIZA: I am sorry to hear you are depressed.
USER: It's true, I am unhappy.
ELIZA: Do you think coming here will help you not to be unhappy?
USER: I need some help; that much seems certain.
ELIZA: What would it mean to you if you got some help?
USER: Perhaps I could learn to get along with my mother.
ELIZA: Tell me more about your family.
USER: My mother takes care of me.
ELIZA: Who else in your family takes care of you?
USER: My father.
ELIZA: Your father.
USER: You are like my father in some ways.
ELIZA: What resemblance do you see?
Figure 1.2 A dialogue with ELIZA
Here is a simple description of how ELIZA works. There is a database of particular words
that are called keywords. For each keyword, the system stores an integer, a pattern to match
against the input, and a specification of the output. The algorithm is as follows: Given a sentence S,
find a keyword in S whose pattern matches S. If there is more than one keyword, pick the one with
the highest integer value. Use the output specification that is associated with this keyword to
generate the next sentence. If there are no keywords, generate an innocuous continuation
statement, such as "Tell me more" or "Go on".
Word    Rank   Pattern          Outputs
alike   10     ?X               In what way?
                                What resemblance do you see?
are     3      ?X are you ?Y    Would you prefer it if I weren't ?Y?
are     3      ?X are ?Y        What if they were not ?Y?
always  5      ?X               Can you think of a specific example?
                                When?
                                Really, always?
what    2      ?X               Why do you ask?
                                Does that interest you?
Figure 1.3 Sample data from ELIZA
Figure 1.3 shows a fragment of a database of keywords. In this database a pattern consists of
words and variables. The prefix ? before a letter indicates a variable, which can match any sequence
of words. For example, the pattern
?X are you ?Y
would match the sentence "Why are you looking at me?", where the variable ?X matches "Why" and
"?Y" matches "looking at me". The output specification may also use the same variables. In this case,
ELIZA inserts the words that match the variables in the input into the output after making some
minor changes in the pronouns (for example, replacing "me" with "you"). Thus, for the pattern
above, if the output specification is
Would you prefer it if I weren't ?Y?
the rule would generate the response "Would you prefer it if I weren’t looking at you?" When the
database lists multiple output specifications for a given pattern, ELIZA selects a different one each
time a keyword rule is used, thereby preventing unnatural repetition in the conversation. Using
these rules, you can see how ELIZA produced the first two exchanges in the conversation in Figure
1.2. ELIZA generated the first response from the first output of the keyword "alike" and the second
response from the first output of the keyword "always". This description covers all of the essential
points of the program. You will probably agree that the program does not understand the
conversation it is participating in. Rather, it is a collection of tricks. Why then does ELIZA appear to
function so well? There are several reasons. Perhaps the most important reason is that, when
people hear or read a sequence of words that they understand as a sentence, they attribute
meaning to the sentence and assume that the person (or machine) that produced the sentence
actually intended that meaning. People are extremely good at distinguishing word meanings and
interpreting sentences to fit the context. Thus ELIZA appears to be intelligent because you use your
own intelligence to make sense of what it says.
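The following sketch is a hypothetical reconstruction for illustration (not Weizenbaum's original code), showing how little machinery the algorithm described above requires:

    import re

    # Keyword rules in the spirit of Figure 1.3: (rank, pattern, output).
    # ?Y in an output stands for the tail of the input matched by the pattern.
    RULES = [
        (10, r".*\balike\b.*",        "In what way?"),
        (3,  r".* are you (?P<y>.*)", "Would you prefer it if I weren't ?Y?"),
        (5,  r".*\balways\b.*",       "Can you think of a specific example?"),
        (2,  r".*\bwhat\b.*",         "Why do you ask?"),
    ]

    # Minimal pronoun swapping ("me" becomes "you", and so on).
    SWAP = {"me": "you", "my": "your", "i": "you", "you": "I"}

    def respond(sentence):
        s = sentence.rstrip(" .!?")
        best = None
        for rank, pattern, output in RULES:
            m = re.fullmatch(pattern, s, re.IGNORECASE)
            if m and (best is None or rank > best[0]):
                best = (rank, m, output)    # keep the highest-ranked matching keyword
        if best is None:
            return "Tell me more."          # innocuous continuation statement
        rank, m, output = best
        if "?Y" in output:
            tail = " ".join(SWAP.get(w.lower(), w) for w in m.group("y").split())
            output = output.replace("?Y", tail)
        return output

    print(respond("Why are you looking at me?"))
    # -> Would you prefer it if I weren't looking at you?

Running respond("Men are all alike.") likewise produces "In what way?", the first exchange in Figure 1.2. A fuller version would cycle through several outputs per keyword, as described above.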
Other crucial characteristics of the conversational setting also aid in sustaining the illusion
of intelligence. For instance, the system does not need any world knowledge because it never has to
make a claim, support an argument, or answer a question. Rather, it simply asks a series of
questions. Except in a patient-therapist situation, this would be unacceptable. ELIZA evades all
direct questions by responding with another question, such as "Why do you ask?" There is no way to
force the program to say something concrete about any topic.
Even in such a restricted situation, however, it is relatively easy to demonstrate that the
program does not understand. It sometimes produces completely off-the-wall responses. For
instance, if you say "Necessity is the mother of invention", it might respond with "Tell me more about
your family", based on its pattern for the word "mother". In addition, since ELIZA has no knowledge
about the structure of language, it accepts gibberish just as readily as valid sentences. If you enter
"Green the adzabak are the a ran four", ELIZA will respond with something like "What if they were
not the a ran four?" Also, as a conversation progresses, it becomes obvious that the program does
not retain any of the content in the conversation. It begins to ask questions that are inappropriate
in light of earlier exchanges, and its responses in general begin to show a lack of focus. Of course, if
you are not able to play with the program and must depend only on transcripts of conversations by
others, you would have no way of detecting these flaws, unless they are explicitly mentioned.
Suppose you need to build a natural language program for a certain application in only six
months. If you start to construct a general model of language understanding, it will not be
completed in that time frame and so will perform miserably on the tests. An ELIZA-like system,
however, could easily produce behavior like that previously discussed with less than a few months
of programming and will appear to far outperform the other system in testing. The differences will
be especially marked if the test data only includes typical domain interactions that are not designed
to test the limits of the system. Thus, if we take short-term performance as our only criterion of
progress, everyone will build and fine-tune ELIZA-style systems, and the field will not progress past
the limitations of the simple approach.
To avoid this problem, either we have to accept certain theoretical assumptions about the
architecture of natural language systems and develop specific evaluation measures for different
components, or we have to discount overall evaluation results until some reasonably high level of
performance is obtained. Only then will cross-system comparisons begin to reflect the potential for
long-term success in the field.
A natural language system must use considerable knowledge about the structure of the language
itself, including what the words are, how words combine to form sentences, what the words mean,
how word meanings contribute to sentence meanings, and so on. However, we cannot completely
account for linguistic behavior without also taking into account another aspect of what makes
humans intelligent — their general world knowledge and their reasoning abilities. For example, to
answer questions or to participate in a conversation, a person not only must know a lot about the
structure of the language being used, but also must know about the world in general and the
conversational setting in particular.
The following are some of the different forms of knowledge relevant for natural language
understanding:
1. Phonetic and phonological knowledge - concerns how words are related to the sounds
that realize them. Such knowledge is crucial for speech-based systems.
2. Morphological knowledge - concerns how words are constructed from more basic
meaning units called morphemes. A morpheme is the primitive unit of meaning in a
language (for example, the meaning of the word "friendly" is derivable from the meaning of
the noun "friend" and the suffix "-ly", which transforms a noun into an adjective).
3. Syntactic knowledge - concerns how words can be put together to form correct sentences
and determines what structural role each word plays in the sentence and what phrases are
subparts of what other phrases.
4. Semantic knowledge - concerns what words mean and how these meanings combine in
sentences to form sentence meanings. This is the study of context-independent meaning -
the meaning a sentence has regardless of the context in which it is used.
5. Pragmatic knowledge - concerns how sentences are used in different situations and how
use affects the interpretation of the sentence.
6. Discourse knowledge - concerns how the immediately preceding sentences affect the
interpretation of the next sentence. This information is especially important for interpreting
pronouns and for interpreting the temporal aspects of the information conveyed.
7. World knowledge - includes the general knowledge about the structure of the world that
language users must have in order to, for example, maintain a conversation. It includes what
each language user must know about the other user’s beliefs and goals.
BOX 1.2 Syntax, Semantics, and Pragmatics
The following examples may help you understand the distinction between syntax, semantics, and
pragmatics. Consider each example as a candidate for the initial sentence of this book, which you
know discusses natural language processing:
1. Language is one of the fundamental aspects of human behavior and is a crucial component
of our lives.
2. Green frogs have large noses.
3. Green ideas have large noses.
4. Large have green ideas nose.
Sentence 1 appears to be a reasonable start (I hope!). It agrees with all that is known about
syntax, semantics, and pragmatics. Each of the other sentences violates one or more of these
levels. Sentence 2 is well-formed syntactically and semantically, but not pragmatically. It fares
poorly as the first sentence of the book because the reader would find no reason for using it. But
however bad sentence 2 would be as a start, sentence 3 is much worse. Not only is it obviously
pragmatically ill-formed, it is also semantically ill-formed. To see this, consider that you and I
could argue about whether sentence 2 is true or not, but we cannot do so with sentence 3. I
cannot affirm or deny sentence 3 in coherent conversation. However, the sentence does have
some structure, for we can discuss what is wrong with it: Ideas cannot be green and, even if they
could, they certainly cannot have large noses. Sentence 4 is even worse. In fact, it is unintelligible,
even though it contains the same words as sentence 3. It does not even have enough structure to
allow you to say what is wrong with it. Thus it is syntactically ill-formed. Incidentally, there are
cases in which a sentence may be pragmatically well-formed but not syntactically well-formed.
For example, if I ask you where you are going and you reply "I go store", the response would be
understandable even though it is syntactically ill-formed. Thus it is at least pragmatically well-
formed and may even be semantically well-formed.
Several different representations will be used that correspond to some of the levels of analysis
discussed in the last section. In particular, we will develop formal languages for expressing
syntactic structure, for context-independent word and sentence meanings, and for expressing
general world knowledge.
The syntactic structure of a sentence indicates the way that words in the sentence are related to
each other. This structure indicates how the words are grouped together into phrases, what words
modify what other words, and what words are of central importance in the sentence. In addition,
this structure may identify the types of relationships that exist between phrases and can store
other information about the particular sentence structure that may be needed for later processing.
For example, consider the following sentences:
1. John sold the book to Mary.
2. The book was sold to Mary by John.
These sentences share certain structural properties. In each, the noun phrases are "John", "Mary",
and "the book", and the act described is some selling action. In other respects, these sentences are
significantly different. For instance, even though both sentences are always either true or false in
the exact same situations, you could only give sentence 1 as an answer to the question "What did
John do for Mary?"
Sentence 2 is a much better continuation of a sentence beginning with the phrase "After it fell in
the river", as sentences 3 and 4 show. Following the standard convention in linguistics, this book
will use an asterisk (*) before any example of an ill-formed or questionable sentence.
3. *After it fell in the river, John sold the book to Mary.
4. After it fell in the river, the book was sold to Mary by John.
Many other structural properties can be revealed by considering sentences that are not well-
formed:
5. *John are in the corner.
6. *John put the book.
Sentence 5 is ill-formed because the subject and the verb do not agree in number (the subject is
singular and the verb is plural), while 6 is ill-formed because the verb put requires some modifier
that describes where John put the object.
The structure of a sentence does not fully determine its meaning, however. For example, the NP "the
catch" can have different meanings depending on whether the speaker is talking about a baseball
game or a fishing expedition. Both these interpretations have the same syntactic structure, and the
different meanings arise from an ambiguity concerning the sense of the word "catch". Once the
correct sense is identified, say the fishing sense, there still is a problem in determining what fish
are being referred to. The intended meaning of a sentence depends on the situation in which the
sentence is produced. Rather than combining all these problems, this book will consider each one
separately. The division is between context-independent meaning and context-dependent meaning.
The fact that "catch" may refer to a baseball move or the results of a fishing expedition is
knowledge about English and is independent of the situation in which the word is used. On the
other hand, the fact that a particular noun phrase "the catch" refers to what Jack caught when
fishing yesterday is contextually dependent. The representation of the context-independent
meaning of a sentence is called its logical form.
The logical form encodes possible word senses and identifies the semantic relationships
between the words and phrases. Many of these relationships are often captured using an abstract
set of semantic relationships between the verb and its NPs. In particular, in both sentences 1 and 2
previously given, the action described is a selling event, where "John" is the seller, "the book" is the
object being sold, and "Mary" is the buyer. These roles are instances of the abstract semantic roles
AGENT, THEME, and TO-POSS (for final possessor), respectively.
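One way to picture such a logical form is as a simple role-value structure. The sketch below is a hypothetical encoding for illustration, not a notation defined in the text:

    # Both the active and the passive sentence share this role assignment,
    # even though their syntactic structures differ.
    logical_form = {
        "predicate": "SELL",
        "AGENT": "John",       # the seller
        "THEME": "the book",   # the object being sold
        "TO-POSS": "Mary",     # the final possessor (the buyer)
    }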
Once the semantic relationships are determined, some word senses may be impossible and thus
eliminated from consideration. Consider the sentence
9. Jack invited Mary to the ball.
The word "ball", which by itself is ambiguous between the plaything that bounces and the formal
dance event, can only take the latter sense in sentence 9, because the verb "invite" only makes
sense with this interpretation. One of the key tasks in semantic interpretation is to consider what
combinations of the individual word meanings can combine to create coherent sentence meanings.
Exploiting such interconnections between word meanings can greatly reduce the number of
possible word senses for each word in a given sentence.
The final representation needed is a general knowledge representation (KR), which the
system uses to represent and reason about its application domain. This is the language in which all
the specific knowledge based on the application is represented. The goal of contextual
interpretation is to take a representation of the structure of a sentence and its logical form, and to
map this into some expression in the KR that allows the system to perform the appropriate task in
the domain. In a question-answering application, a question might map to a database query; in a
story-understanding application, a sentence might map into a set of expressions that represent the
situation that the sentence describes.
For the most part, we will assume that the first-order predicate calculus (FOPC) is the final
representation language because it is relatively well known, well studied, and precisely defined.
This book is organized around the three levels of representation just discussed: syntactic structure,
logical form, and the final meaning representation. Separating the problems in this way will allow
you to study each problem in depth without worrying about other complications. Actual systems
are usually organized slightly differently, however. In particular, Figure 1.5 shows the organization
that this book assumes.
Figure 1.5 The flow of information
As you can see, there are interpretation processes that map from one representation to the
other. For instance, the process that maps a sentence to its syntactic structure and logical form is
called the parser. It uses knowledge about words and word meanings (the lexicon) and a set of rules
defining the legal structures (the grammar) in order to assign a syntactic structure and a logical
form to an input sentence. An alternative organization could perform syntactic processing first and
then perform semantic interpretation on the resulting structures. Combining the two, however, has
considerable advantages because it leads to a reduction in the number of possible interpretations,
since every proposed interpretation must simultaneously be syntactically and semantically well
formed. For example, consider the following two sentences:
10. Visiting relatives can be trying.
11. Visiting museums can be trying.
These two sentences have identical syntactic structure, so both are syntactically ambiguous. In
sentence 10, the subject might be relatives who are visiting you or the event of you visiting
relatives. Both of these alternatives are semantically valid, and you would need to determine the
appropriate sense by using the contextual mechanism. However, sentence 11 has only one possible
semantic interpretation, since museums are not objects that can visit other people; rather they
must be visited. In a system with separate syntactic and semantic processing, there would be two
syntactic interpretations of sentence 11, one of which the semantic interpreter would eliminate
later. If syntactic and semantic processing are combined, however, the system will be able to detect
the semantic anomaly as soon as it interprets the phrase "visiting museums", and thus will never
build the incorrect syntactic structure in the first place. While the savings here seem small, in a
realistic application a reasonable sentence may have hundreds of possible syntactic structures,
many of which are semantically anomalous.
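The ambiguity of sentence 10 can be demonstrated with a toy context-free grammar in NLTK. The grammar below is a made-up illustration (assuming NLTK is installed) in which "visiting" can be either a gerund verb or an adjectival modifier:

    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> VG NP | ADJ NOM | NOM
    NOM -> 'relatives'
    VG -> 'visiting'
    ADJ -> 'visiting'
    VP -> MD VB VBG
    MD -> 'can'
    VB -> 'be'
    VBG -> 'trying'
    """)

    parser = nltk.ChartParser(grammar)
    # Two trees are printed: one where "visiting" heads a gerund phrase,
    # and one where it modifies "relatives" as an adjective.
    for tree in parser.parse("visiting relatives can be trying".split()):
        print(tree)

A purely syntactic parser of this kind would produce both trees for sentence 11 as well; only semantic knowledge (museums cannot visit) eliminates one of them.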
Continuing through Figure 1.5, the process that transforms the syntactic structure and
logical form into a final meaning representation is called contextual processing. This process
includes issues such as identifying the objects referred to by noun phrases such as definite
descriptions (for example, "the man") and pronouns, the analysis of the temporal aspects of the
new information conveyed by the sentence, the identification of the speaker’s intention (for
example, whether "Can you lift that rock" is a yes/no question or a request), as well as all the
inferential processing required to interpret the sentence appropriately within the application
domain. It uses knowledge of the discourse context (determined by the sentences that preceded the
current one) and knowledge of the application to produce a final representation.
The system would then perform whatever reasoning tasks are appropriate for the application.
When this requires a response to the user, the meaning that must be expressed is passed to the
generation component of the system. It uses knowledge of the discourse context, plus information
on the grammar and lexicon, to plan the form of an utterance, which then is mapped into words by
a realization process. Of course, if this were a spoken language application, the words would not be
the final input and output, but rather would be the output of a speech recognizer and the input to a
speech synthesizer, as appropriate. While this text focuses primarily on language understanding,
notice that the same levels of knowledge are used for the generation task as well. For instance,
knowledge of syntactic structure is encoded in the grammar. This grammar can be used either to
identify the structure of a given sentence or to realize a structure as a sequence of words. A
grammar that supports both processes is called a bidirectional grammar. While most researchers
agree that bidirectional grammars are the preferred model, in actual practice grammars are often
tailored for the understanding task or the generation task. This occurs because different issues are
important for each task, and generally any given researcher focuses just on the problems related to
their specific task. But even when the actual grammars differ between understanding and
generation, the grammatical formalisms used remain the same.
2.1 Words
At first glance the most basic unit of linguistic structure appears to be the word. The word,
though, is far from the fundamental element of study in linguistics; it is already the result of a
complex set of more primitive parts. The study of morphology concerns the construction of words
from more basic components corresponding roughly to meaning units. There are two basic ways
that new words are formed, traditionally classified as inflectional forms and derivational forms.
Inflectional forms use a root form of a word and typically add a suffix so that the word
appears in the appropriate form given the sentence. Verbs are the best examples of this in English.
Each verb has a basic form that then is typically changed depending on the subject and the tense of
the sentence. For example, the verb sigh will take suffixes such as -s, -ing, and -ed to create the verb
forms sighs, sighing, and sighed, respectively. These new words are all verbs and share the same
basic meaning.
Derivational morphology involves the derivation of new words from other forms. The new
words may be in completely different categories from their subparts. For example, the noun friend
is made into the adjective friendly by adding the suffix - ly. A more complex derivation would allow
you to derive the noun friendliness from the adjective form. There are many interesting issues
concerned with how words are derived and how the choice of word form is affected by the
syntactic structure of the sentence that constrains it.
Traditionally, linguists classify words into different categories based on their uses. Two
related areas of evidence are used to divide words into categories. The first area concerns the
word’s contribution to the meaning of the phrase that contains it, and the second area concerns the
actual syntactic structures in which the word may play a role. For example, you might posit the
class noun as those words that can be used to identify the basic type of object, concept, or place
being discussed, and adjective as those words that further qualify the object, concept, or place.
green would be an adjective and book a noun, as shown in the phrases the green book and green
books. But things are not so simple: green might play the role of a noun, as in That green is lighter
than the other, and book might play the role of a modifier, as in the book worm. In fact, most nouns
seem to be able to be used as a modifier in some situations. Perhaps the classes should be
combined, since they overlap a great deal. But other forms of evidence exist. Consider what words
could complete the sentence it’s so... You might say it’s so green, it’s so hot, it’s so true, and so on.
Note that although book can be a modifier in the book worm, you cannot say *it’s so book about
anything. Thus there are two classes of modifiers: adjective modifiers and noun modifiers.
Consider again the case where adjectives can be used as nouns, as in the green. Not all
adjectives can be used in such a way. For example, the noun phrase the hot can be used, given a
context where there are hot and cold plates, in a sentence such as The hot are on the table. But this
refers to the hot plates; it cannot refer to hotness in the way the phrase the green refers to green.
With this evidence you could subdivide adjectives into two subclasses—those that can also be used
to describe a concept or quality directly, and those that cannot. Alternatively, however, you can
simply say that green is ambiguous between being an adjective or a noun and, therefore, falls in
both classes. Since green can behave like any other noun, the second solution seems the most
direct.
Using similar arguments, we can identify four main classes of words in English that
contribute to the meaning of sentences. These classes are nouns, adjectives, verbs, and adverbs.
Sentences are built out of phrases centered on these four word classes. Of course, there are many
other classes of words that are necessary to form sentences, such as articles, pronouns,
prepositions, particles, quantifiers, conjunctions, and so on. But these classes are fixed in the sense
that new words in these classes are rarely introduced into the language. New nouns, verbs,
adjectives and adverbs, on the other hand, are regularly introduced into the language as it evolves.
As a result, these classes are called the open class words, and the others are called the closed class
words.
A word in any of the four open classes may be used to form the basis of a phrase. This word
is called the head of the phrase and indicates the type of thing, activity, or quality that the phrase
describes. For example, with noun phrases, the head word indicates the general classes of objects
being described. The phrases
the dog
the mangy dog
the mangy dog at the pound
are all noun phrases that describe an object in the class of dogs. The first describes a member from
the class of all dogs, the second an object from the class of mangy dogs, and the third an object from
the class of mangy dogs that are at the pound. The word dog is the head of each of these phrases.
In some cases a phrase may consist of a single head. For example, the word sand can be a
noun phrase, hungry can be an adjective phrase, and walked can be a verb phrase. In many other
cases the head requires additional phrases following it to express the desired meaning. For
example, the verb put cannot form a verb phrase in isolation; thus the following words do not form
a meaningful sentence:
*Jack put.
To be meaningful, the verb put must be followed by a noun phrase and a phrase describing a
location, as in the verb phrase put the dog in the house. The phrase or set of phrases needed to
complete the meaning of such a head is called the complement of the head. In the preceding phrase
put is the head and the dog in the house is the complement. Heads of all the major classes may
require complements. Figure 2.1 gives some examples of phrases, with the head indicated by
boldface and the complements by italics.
Count nouns acquired their name because they can be counted. There may be one dog or
many dogs, one book or several books, one crowd or several crowds. If a single count noun is used to
describe a whole class of objects, it must be in its plural form. Thus you can say Dogs are friendly
but not *Dog is friendly.
Mass nouns cannot be counted. There may be some water, some wheat, or some sand. If you
try to count with a mass noun, you change the meaning. For example, some wheat refers to a
portion of some quantity of wheat, whereas one wheat is a single type of wheat rather than a single
grain of wheat. A mass noun can be used to describe a whole class of material without using a
plural form. Thus you say Water is necessary for life, not * Waters are necessary for life.
In addition to a head, a noun phrase may contain specifiers and qualifiers preceding the
head. The qualifiers further describe the general class of objects identified by the head, while the
specifiers indicate how many such objects are being described, as well as how the objects being
described relate to the speaker and hearer. Specifiers are constructed out of ordinals (such as first
and second), cardinals (such as one and two), and determiners. Determiners can be sub-divided
into the following general classes:
articles — the words the, a, and an.
demonstratives — words such as this, that, these, and those.
possessives — noun phrases followed by the suffix ‘s, such as
John’s and the fat man’s, as well as possessive pronouns, such as her, my, and whose.
wh-determiners — words used in questions, such as which and what.
quantifying determiners — words such as some, every, most, no, any, both, and half
A simple noun phrase may have at most one determiner, one ordinal, and one cardinal. It is
possible to have all three, as in the first three contestants. An exception to this rule exists with a few
quantifying determiners such as many, few, several, and little. These words can be preceded by an
article, yielding noun phrases such as the few songs we knew. Using this evidence, you could
subcategorize the quantifying determiners into those that allow this and those that don’t, but the
present coarse categorization is fine for our purposes at this time.
The qualifiers in a noun phrase occur after the specifiers (if any) and before the head. They
consist of adjectives and nouns being used as modifiers. The following are more precise definitions:
adjectives - words that attribute qualities to objects yet do not refer to the qualities
themselves (for example, angry is an adjective that attributes the quality of anger to
something).
noun modifiers - mass or count nouns used to modify another noun,
as in the cook book or the ceiling paint can.
Before moving on to other structures, consider the different inflectional forms that nouns
take and how they are realized in English. Two forms of nouns - the singular and plural forms -
have already been mentioned. Pronouns take forms based on person (first, second, and third) and
gender (masculine, feminine, and neuter). Each of these distinctions reflects a systematic analysis
that is almost wholly explicit in some languages, such as Latin, while implicit in others. In French,
for example, nouns are classified by their gender. In English many of these distinctions are not
explicitly marked except in a few cases. The pronouns provide the best example of this. They
distinguish number, person, gender, and case (that is, whether they are used as possessive, subject,
or object), as shown in Figures 2.2 through 2.4.
Number     First Person   Second Person   Third Person
Singular   me             you             her / him / it
Plural     us             you             them
Figure 2.4 Pronoun system (as object)
While an NP is used to refer to things, a sentence (S) is used to assert, query, or command.
You may assert that some sentence is true, ask whether a sentence is true, or command someone to
do something described in the sentence. The way a sentence is used is called its mood. Figure 2.5
shows four basic sentence moods.
Mood                         Example
declarative (or assertion)   The cat is sleeping.
yes/no question              Is the cat sleeping?
wh-question                  What is sleeping? or Which cat is sleeping?
imperative (or command)      Shoot the cat!
Figure 2.5 Basic moods of sentences
A simple declarative sentence consists of an NP, the subject, followed by a verb phrase (VP),
the predicate. A simple VP may consist of some adverbial modifiers followed by the head verb and
its complements. Every verb must appear in one of the five possible forms shown in Figure 2.6.
Form                 Examples                       Example Uses
base                 hit, cry, go, be               Hit the ball! I want to go.
simple present       hit, cries, go, am             The dog cries every day. I am thirsty.
simple past          hit, cried, went, was          I was thirsty. I went to the store.
present participle   hitting, crying, going, being  I'm going to the store. Being the last in line aggravates me.
past participle      hit, cried, gone, been         I’ve been there before. The cake was gone.
Figure 2.6 The five verb forms
The last verb in a verb sequence is called the main verb, and is drawn from the open class of verbs.
Depending on the verb, a wide variety of complement structures are allowed. For example, certain
verbs may stand alone with no complement. These are called intransitive verbs and include
examples such as laugh (for example, Jack laughed) and run (for example, He will have been
running). Another common complement form requires a noun phrase to follow the verb. These are
called transitive verbs and include verbs such as find (for example, Jack found a key). Notice that
find cannot be intransitive (for example, *Jack found is not a reasonable sentence), whereas laugh
cannot be transitive (for example, *Jack laughed a key is not a reasonable sentence). A verb like run,
on the other hand, can be transitive or intransitive, but the meaning of the verb is different in each
case (for example, Jack ran vs. Jack ran the machine).
Transitive verbs allow another form of verb group called the passive form, which is
constructed using a be auxiliary followed by the past participle. In the passive form the noun
phrase that would usually be in the object position is used in the subject position, as can be seen by
the examples in Figure 2.10. Note that tense is still carried by the initial verb in the verb group.
Also, even though the first noun phrase semantically seems to be the object of the verb in passive
sentences, it is syntactically the subject. This can be seen by checking the pronoun forms. For
example, I was hit is correct, not *Me was hit. Furthermore, the tense and number agreement is
between the verb and the syntactic subject. Thus you say I was hit by them, not *I were hit by them.
Some verbs allow two noun phrases to follow them in a sentence; for example, Jack gave Sue a book
or Jack found me a key. In such sentences the second NP corresponds to the object NP outlined
earlier and is sometimes called the direct object. The other NP is called the indirect object.
Generally, such sentences have an equivalent sentence where the indirect object appears with a
preposition, as in Jack gave a book to Sue or Jack found a key for me.
Particles
Some verb forms are constructed from a verb and an additional word called a particle.
Particles generally overlap with the class of prepositions considered in the next section. Some
examples are up, out, over, and in. With verbs such as look, take, or put, you can construct many
different verbs by combining the verb with a particle (for example, look up, look out, look over, and
so on). In some sentences the difference between a particle and a preposition results in two
different readings for the same sentence. For example, look over the paper would mean reading the
paper, if you consider over a particle (the verb is look over). In contrast, the same sentence would
mean looking at something else behind or above the paper, if you consider over a preposition (the
verb is look).
You can make a sharp distinction between particles and prepositions when the object of the
verb is a pronoun. With a verb-particle sentence, the pronoun must precede the particle, as in I
looked it up. With the prepositional reading, the pronoun follows the preposition, as in I looked up
it. Particles also may follow the object NP. Thus you can say I gave up the game to Mary or I gave the
game up to Mary. This is not allowed with prepositions; for example, you cannot say *I climbed the
ladder up.
Clausal Complements
Many verbs allow clauses as complements. Clauses share most of the same properties of
sentences and may have a subject, indicate tense, and occur in passivized forms. One common
clause form consists of a sentence form preceded by the complementizer that, as in that Jack ate the
pizza. This clause will be identified by the expression S[that], indicating a special subclass of S
structures. This clause may appear as the complement of the verb know, as in Sam knows that Jack
ate the pizza. The passive is possible, as in Sam knows that the pizza was eaten by Jack.
Another clause type involves the infinitive form of the verb. The VP[inf] clause is simply a VP
starting in the infinitive form, as in the complement of the verb wish in Jack wishes to eat the pizza.
An infinitive sentence S[inf] form is also possible where the subject is indicated by a for phrase, as
in Jack wishes for Sam to eat the pizza.
Another important class of clauses are sentences with complementizers that are wh-words,
such as who, what, where, why, whether, and how many. These question clauses, S[WH], can be used
as a complement of verbs such as know, as in Sam knows whether we went to the party and The
police know who committed the crime.
Many verbs require complements that involve a specific prepositional phrase (PP). The verb give
takes a complement consisting of an NP and a PP with the preposition to, as in Jack gave the book to
the library. No other preposition can be used. Consider
*Jack gave the book from the library. (OK only if from the library modifies book.)
In contrast, a verb like put can take any PP that describes a location, as in
Jack put the book in the box.
Jack put the book inside the box.
Jack put the book by the door.
To account for this, we allow complement specifications that indicate prepositional phrases with
particular prepositions. Thus the verb give would have a complement of the form NP+PP[to].
Similarly the verb decide would have a complement form NP+PP[about], and the verb blame would
have a complement form NP+PP[on], as in Jack blamed the accident on the police.
Verbs such as put, which take any phrase that can describe a location (complement
NP+Location), are also common in English. While locations are typically prepositional phrases, they
also can be noun phrases, such as home, or particles, such as back or here. A distinction can be
made between phrases that describe locations and phrases that describe a path of motion, although
many location phrases can be interpreted either way. The distinction can be made in some cases,
though. For instance, prepositional phrases beginning with "to" generally indicate a path of motion.
Thus they cannot be used with a verb such as "put" that requires a location (for example, *I put the
ball to the box). This distinction will be explored further in Chapter 4.
Figure 2.11 summarizes many of the verb complement structures found in English. A full list would
contain over 40 different forms. Note that while the examples typically use a different verb for each
form, most verbs will allow several different complement structures.
Section 2.2 introduced simple noun phrases. This section considers more complex forms in
which NPs contain sentences or verb phrases as subcomponents.
All the examples in Section 2.2 had heads that took the null complement. Many nouns,
however, may take complements. Many of these fall into the class of complements that require a
specific prepositional phrase. For example, the noun love has a complement form PP[of], as in their
love of France, the noun reliance has the complement form PP[on], as in his reliance on handouts,
and the noun familiarity has the complement form PP[with], as in a familiarity with computers.
Many nouns, such as desire, reluctance, and research, take an infinitive VP form as a
complement, as in the noun phrases his desire to release the guinea pig, a reluctance to open the case
again, and the doctor’s research to find a cure for cancer. These nouns, in fact, can also take the S[inf]
form, as in my hope for John to open the case again.
Noun phrases can also be built out of clauses, which were introduced in the last section as
the complements for verbs. For example, a that clause (S[that]) can be used as the subject of a
sentence, as in the sentence That George had the ring was surprising. Infinitive forms of verb
phrases (VP[inf]) and sentences (S[inf]) can also function as noun phrases, as in the sentences "To
own a car would be delightful" and "For us to complete a project on time would be unprecedented". In
addition, the gerundive forms (VP[ing] and S[ing]) can also function as noun phrases, as in the
sentences Giving up the game was unfortunate and John's giving up the game caused a riot.
Relative clauses involve sentence forms used as modifiers in noun phrases. These clauses
are often introduced by relative pronouns such as who, which, that, and so on, as in
The man who gave Bill the money...
The rug that George gave to Ernest...
The man whom George gave the money to ...
In each of these relative clauses, the embedded sentence is the same structure as a regular
sentence except that one noun phrase is missing. If this missing NP is filled in with the NP that the
sentence modifies, the result is a complete sentence that captures the same meaning as what was
conveyed by the relative clause. The missing NPs in the preceding three sentences occur in the
subject position, in the object position, and as object to a preposition, respectively. Deleting the
relative pronoun and filling in the missing NP in each produces the following:
The man gave Bill the money.
George gave the rug to Ernest.
George gave the money to the man.
As was true earlier, relative clauses can be modified in the same ways as regular sentences.
In particular, passive forms of the preceding sentences would be as follows:
Bill was given the money by the man.
The rug was given to Ernest by George.
The money was given to the man by George.
Correspondingly, these sentences could have relative clauses in the passive form as follows:
The man Bill was given the money by...
The rug that was given to Ernest by George...
The man whom the money was given to by George...
Notice that some relative clauses need not be introduced by a relative pronoun. Often the relative
pronoun can be omitted, producing what is called a base relative clause, as in the NP the man
George gave the money to. Yet another form deletes the relative pronoun and an auxiliary be form,
creating a reduced relative clause, as in the NP the man given the money, which means the same as
the NP the man who was given the money.
Adjectives may also take complements of their own, as in angry at the committee or fond of
Mary. These more complex adjective phrases are most commonly found as the complements of
verbs such as be or seem, or following the head in a noun phrase. They generally cannot be used as
modifiers preceding the heads of noun phrases (for example, consider *the angry at the committee
man vs. the angry man vs. the man angry at the committee).
Adjective phrases may also take a degree modifier preceding the head, as in the adjective
phrase very angry or somewhat fond of Mary. More complex degree modifications are possible, as in
far too heavy and much more desperate. Finally, certain constructs have degree modifiers that
involve their own complement forms, as in too stupid to come in out of the rain, so boring that
everyone fell asleep, and as slow as a dead horse.
You have already seen adverbs in use in several constructs, such as indicators of degree (for
example, very, rather, too), and in location phrases (for example, here, everywhere). Other forms of
adverbs indicate the manner in which something is done (for example, slowly, hesitantly), the time
of something (for example, now, yesterday), or the frequency of something (for example, frequently,
rarely, never).
Adverbs may occur in several different positions in sentences: in the sentence initial
position (for example, Then, Jack will open the drawer), in the verb sequence (for example, Jack
then will open the drawer, Jack will then open the drawer), and in the sentence final position (for
example, Jack opened the drawer then). The exact restrictions on which adverbs can go where,
however, are quite idiosyncratic to the particular adverb.
In addition to these adverbs, adverbial modifiers can be constructed out of a wide range of
constructs, such as prepositional phrases indicating, among other things, location (for example, in
the box) or manner (for example, in great haste); noun phrases indicating, among other things,
frequency (for example, every day); or clauses indicating, among other things, the time (for
example, when the bomb exploded). Such adverbial phrases, however, usually cannot occur except
in the sentence initial or sentence final position. For example, we can say Every day John opens his
drawer or John opens his drawer every day, but not *John every day opens his drawer.
Because of the wide range of forms, it generally is more useful to consider adverbial phrases
(ADVPs) by function rather than syntactic form. Thus we can consider manner, temporal, duration,
location, degree, and frequency adverbial phrases each as its own form. We considered the location
and degree forms earlier, so here we will consider some of the others.
Temporal adverbials occur in a wide range of forms: adverbial particles (for example, now),
noun phrases (for example, today, yesterday), prepositional phrases (for example, at noon, during
the fight), and clauses (for example, when the clock struck noon, before the fight started).
Frequency adverbials also can occur in a wide range of forms: particles (for example, often),
noun phrases (for example, every day), prepositional phrases (for example, at every party), and
clauses (for example, every time that John comes for a visit).
Duration adverbials appear most commonly as prepositional phrases (for example, for three
hours, about 20 feet) and clauses (for example, until the moon turns blue).
Manner adverbials occur in a wide range of forms, including particles (for example, slowly),
noun phrases (for example, this way), prepositional phrases (for example, in great haste), and
clauses (for example, by holding the embers at the end of a stick).
In the analyses that follow, adverbials will most commonly occur as modifiers of the action
or state described in a sentence. As such, an issue arises as to how to distinguish verb complements
from adverbials. One distinction is that adverbial phrases are always optional. Thus you should be
able to delete the adverbial and still have a sentence with approximately the same meaning
(missing, obviously, the contribution of the adverbial). Consider the sentences
Jack put the box by the door.
Jack ate the pizza by the door.
In the first sentence the prepositional phrase is clearly a complement, since deleting it to produce
*Jack put the box results in a nonsensical utterance. On the other hand, deleting the phrase from the
second sentence has only a minor effect:
Jack ate the pizza is just a less general assertion about the same situation described by Jack ate the
pizza by the door.
Spoken language interfaces to computers are a topic that has lured and fascinated engineers
and speech scientists alike for over five decades. For many, the ability to converse freely with a
machine represents the ultimate challenge to our understanding of the production and perception
processes involved in human speech communication. In the near future, interactive networks will
provide easy access to a wealth of information and services that will fundamentally affect how
people work, play and conduct their daily affairs. Advances in human language technology are
needed to enable the average citizen to communicate with networks using natural communication
skills and everyday devices, such as telephones and televisions. A speech interface, in a user’s own
language, is ideal because it is the most natural, flexible, efficient, and economical form of human
communication. Spoken input to computers embodies many different technologies and
applications, as illustrated in Figure 1.1. In some cases, as shown at the bottom of the figure, one is
interested not in the underlying linguistic content but in the identity of the speaker or the language
being spoken. Speaker recognition can involve identifying a specific speaker out of a known
population, which has forensic implications, or verifying the claimed identity of a user, thus
enabling controlled access to locales (e.g., a computer room) and services (e.g., voice banking).
Language identification also has important applications, and similar techniques are applied in this area.
Figure 1.1: Technologies for spoken language interfaces.
When one thinks about speaking to computers, the first image is usually speech recognition,
the conversion of an acoustic signal to a stream of words. After many years of research, speech
recognition technology is beginning to pass the threshold of practicality. The last decade has
witnessed dramatic improvement in speech recognition technology, to the extent that high
performance algorithms and systems are becoming available. Speech input capabilities are
emerging that can provide functions like voice dialing (e.g., Call home), call routing (e.g., I would like
to make a collect call), simple data entry (e.g., entering a credit card number), and preparation of
structured documents (e.g., a radiology report). Speech recognition
involves several component technologies. First, the digitized signal must be transformed into a set
of measurements, i.e., a signal representation. Next, the various speech sounds must be modeled
appropriately. The most widespread technique for acoustic modeling is called hidden Markov
modeling (HMM).
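As a rough sketch of how an HMM assigns a likelihood to a sequence of acoustic observations, the following Python fragment implements the standard forward recursion for a tiny two-state discrete HMM. All probabilities here are invented for illustration; real acoustic models use continuous densities over feature vectors:

```python
import numpy as np

# Forward-algorithm likelihood for a 2-state discrete HMM with made-up
# parameters (illustration only).

A = np.array([[0.7, 0.3],    # state-transition probabilities a[i][j]
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],    # emission probabilities b[state][symbol]
              [0.3, 0.7]])
pi = np.array([0.6, 0.4])    # initial state distribution

def forward_likelihood(obs):
    """P(observation sequence | model) via the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 1, 1, 0]))
```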
Speech recognition is a very challenging problem in its own right, with a well defined set of
applications. However, many tasks that lend themselves to spoken input— making travel
arrangements or selecting a movie—are in fact exercises in interactive problem solving. The
solution is often built up incrementally, with both the user and the computer playing active roles in
the “conversation.” Therefore, several language based input and output technologies must be
developed and integrated to reach this goal.
Figure 1.1 shows the major components of a typical conversational system. The spoken
input is first processed through the speech recognition component. The natural language
component, working in concert with the recognizer, produces a meaning representation. The
spoken language understanding technology discusses the integration of speech recognition and
natural language processing techniques. For information retrieval applications illustrated in this
figure, the meaning representation can be used to retrieve the appropriate information in the form
of text, tables and graphics. If the information in the utterance is insufficient or ambiguous, the
system may choose to query the user for clarification. Natural language generation and speech
synthesis can be used to produce spoken responses that may serve to clarify the tabular
information. Throughout the process, discourse information is maintained and fed back to the
speech recognition and language understanding components, so that sentences can be properly
understood in context.
Parameter                      Range
Speaking mode                  Isolated words to continuous speech
Speaking style                 Read speech to spontaneous speech
Enrollment                     Speaker-dependent to speaker-independent
Vocabulary                     Small (< 20 words) to large (> 20,000 words)
Language model                 Finite-state to context-sensitive
Perplexity                     Small (< 10) to large (> 100)
SNR (signal-to-noise ratio)    High (> 30 dB) to low (< 10 dB)
Transducer                     Voice-cancelling microphone to telephone
Table 1.1: Typical parameters used to characterize the capability of speech recognition systems
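The perplexity entry in Table 1.1 is the standard information-theoretic measure of the difficulty a language model leaves to the recognizer. As a reminder (this is the textbook definition, not given in the table itself), for a test word sequence w_1, ..., w_N:

```latex
% Perplexity of a language model on a test sequence w_1 ... w_N:
\[
  PP \;=\; P(w_1, \ldots, w_N)^{-1/N}
      \;=\; 2^{\,-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1})}
\]
```

Roughly, a perplexity of 10 means that at each point the model is, on average, as uncertain as if it had to choose uniformly among 10 equally likely next words.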
Figure 1.2 shows the major components of a typical speech recognition system. The
digitized speech signal is first transformed into a set of useful measurements or features at a fixed
rate, typically once every 10–20 msec. These measurements are then used to search for the most
likely word candidate, making use of constraints imposed by the acoustic, lexical, and language
models. Throughout this process, training data are used to determine the values of the model
parameters.
3.1.3 Signal Representation
In statistically based automatic speech recognition, the speech waveform is sampled at a rate
between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of
vectors containing values that are generally called parameters. The vectors typically comprise
between 10 and 20 parameters, and are usually computed every 10 or 20 msec. These parameter
values are then used in succeeding stages in the estimation of the probability that the portion of
waveform just analyzed corresponds to a particular phonetic event in the phone-sized or whole-
word reference unit being hypothesized.
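A minimal sketch of this parameterization step follows: the waveform is cut into overlapping frames every 10 msec and each frame is reduced to a short vector of values. The two features computed here (log energy and zero-crossing rate) are simple stand-ins; real recognizers use richer representations such as mel-scale spectra or cepstral coefficients:

```python
import numpy as np

# Frame the waveform every 10 msec and compute a small feature vector
# per frame (illustrative features only).

def parameterize(waveform, sample_rate=16000, frame_ms=25, step_ms=10):
    frame = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    vectors = []
    for start in range(0, len(waveform) - frame + 1, step):
        x = waveform[start:start + frame]
        log_energy = np.log(np.sum(x ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2
        vectors.append([log_energy, zcr])
    return np.array(vectors)  # one vector per 10 msec of speech

one_second = np.random.randn(16000)    # stand-in for real speech samples
print(parameterize(one_second).shape)  # roughly (98, 2)
```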
Representations used in current speech recognizers (see Figure 1.3), concentrate primarily on
properties of the speech signal attributable to the shape of the vocal tract rather than to the
excitation, whether generated by a vocal-tract constriction or by the larynx. Representations are
sensitive to whether the vocal folds are vibrating or not (the voiced/unvoiced distinction), but try
to ignore effects due to variations in their frequency of vibration.
Figure 1.3: Examples of representations used in current speech recognizers: (a) Time varying
waveform of the word speech, showing changes in amplitude (y axis) over time (x axis); (b) Speech
spectrogram of (a), in terms of frequency (y axis), time (x axis) and amplitude (darkness of the
pattern); (c) Expanded waveform of the vowel ee (underlined in b); (d) Spectrum of the vowel ee, in
terms of amplitude (y axis) and frequency (x axis); (e) Mel-scale spectrogram.
Robustness in speech recognition refers to the need to maintain good recognition accuracy
even when the quality of the input speech is degraded, or when the acoustical, articulatory, or
phonetic characteristics of speech in the training and testing environments differ. Obstacles to
robust recognition include acoustical degradations produced by additive noise, the effects of linear
filtering, nonlinearities in transduction or transmission, as well as impulsive interfering sources,
and diminished accuracy caused by changes in articulation produced by the presence of high-
intensity noise sources. Some of these sources of variability are illustrated in Figure 1.4. Speaker-
to-speaker differences impose a different type of variability, producing variations in speech rate,
co-articulation, context, and dialect. Even systems that are designed to be speaker-independent
exhibit dramatic degradations in recognition accuracy when training and testing conditions differ.
Speech recognition systems have become much more robust in recent years with respect to
both speaker variability and acoustical variability. In addition to achieving speaker-independence,
many current systems can also automatically compensate for modest amounts of acoustical
degradation caused by the effects of unknown noise and unknown linear filtering.
Figure 1.4: Schematic representation of some of the sources of variability that can degrade speech
recognition accuracy, along with compensation procedures that improve environmental
robustness.
Modern architectures for Automatic Speech Recognition (ASR) are mostly software architectures
which generate a sequence of word hypotheses from an acoustic signal. The most popular
algorithms implemented in these architectures are based on statistical methods. A vector y_t of
acoustic features is computed every 10 to 30 msec.
Note: the notation y_1^T stands for the sequence {y_1, y_2, …, y_T}.
Argmax is an operation that finds the argument that gives the maximum value from a target
function.
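The quantity being maximized is given by the standard statistical formulation (consistent with the source-channel model described below): the recognizer chooses the word sequence W with the highest posterior probability given the acoustic vectors:

```latex
% MAP decoding rule for statistical speech recognition:
\[
  \hat{W} \;=\; \arg\max_{W} P(W \mid y_1^T)
          \;=\; \arg\max_{W}\; p(y_1^T \mid W)\, P(W)
\]
```

Here p(y_1^T | W) is the acoustic model and P(W) the language model; the denominator p(y_1^T) does not depend on W and can be dropped from the maximization.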
For large vocabularies, search is performed in two steps. The first generates a word lattice of the n-
best word sequences with simple models to compute approximate likelihoods in real-time. In the
second step, more accurate likelihoods are computed for this limited number of hypotheses. Some
systems generate a single word sequence hypothesis in a single step. The search produces a
hypothesized word sequence if the task is dictation. If the task is understanding, then a conceptual
structure is obtained with a process that may involve more than two steps.
For example, dollfins swim fast sounds the same as dolphins swim fast; only the recognizer's vocabulary of valid spellings distinguishes them.
A speech recognizer converts the observed acoustic signal into the corresponding orthographic
representation of the spoken sentence. The recognizer chooses its guess from a finite vocabulary of
words that can be recognized. For simplicity, we assume that a word is uniquely identified by its
spelling. Dramatic progress has been demonstrated in solving the speech recognition problem via
the use of a statistical model of the joint distribution of the sequence of spoken words and the
corresponding observed sequence of acoustic information. This approach, pioneered by the IBM
Continuous Speech Recognition group, is called the source-channel model. In this approach, the
speech recognizer determines an estimate of the identity of the spoken word sequence from the
observed acoustic evidence by using the a posteriori distribution.
Speaker recognition, which can be classified into identification and verification, is the process of
automatically recognizing who is speaking on the basis of individual information included in
speech waves. This technique makes it possible to use the speaker’s voice to verify their identity
and control access to services such as voice dialing, banking by telephone, telephone shopping,
database access services, information services, voice mail, security control for confidential
information areas, and remote access to computers. AT&T and TI (with Sprint) have started field
tests and actual application of speaker recognition technology; Sprint’s Voice Phone Card is already
being used by many customers. In this way, speaker recognition technology is expected to create
new services that will make our daily lives more convenient. Another important application of
speaker recognition technology is for forensic purposes.
Figure 1.8 shows the basic structures of speaker identification and verification systems.
Speaker identification is the process of determining which registered speaker provides a given
utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the
identity claim of a speaker. Most applications in which a voice is used as the key to confirm the
identity of a speaker are classified as speaker verification. There is also the case called open set
identification, in which a reference model for an unknown speaker may not exist. This is usually the
case in forensic applications. In this situation, an additional decision alternative, the unknown does
not match any of the models, is required. In both verification and identification processes, an
additional threshold test can be used to determine if the match is close enough to accept the
decision or if more speech data are needed.
Speaker recognition methods can also be divided into text-dependent and text independent
methods. The former require the speaker to say key words or sentences having the same text for
both training and recognition trials, whereas the latter do not rely on a specific text being spoken.
Both text-dependent and text-independent methods share a problem, however. These systems can be
easily deceived because someone who plays back the recorded voice of a registered speaker saying
the key words or sentences can be accepted as the registered speaker. To cope with this problem,
there are methods in which a small set of words, such as digits, are used as key words and each
user is prompted to utter a given sequence of key words that is randomly chosen every time the
system is used. Yet even this method is not completely reliable, since it can be deceived with
advanced electronic recording equipment that can reproduce key words in a requested order.
Therefore, a text-prompted (machine-driven-text-dependent) speaker recognition method has
recently been proposed.
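The accept/reject decision with a threshold, mentioned above, can be sketched as follows. This is a toy fragment: a real system scores the claimed identity against a statistical model trained on many enrollment utterances, not against a single stored vector, and the threshold here is arbitrary:

```python
import numpy as np

# Toy verification decision: compare a voice "feature vector" for the
# claimed identity against an enrolled reference and accept only if the
# match score clears a threshold.

def cosine_score(x, ref):
    return float(np.dot(x, ref) / (np.linalg.norm(x) * np.linalg.norm(ref)))

def verify(claimed_features, enrolled_model, threshold=0.85):
    score = cosine_score(claimed_features, enrolled_model)
    return score >= threshold   # accept or reject the identity claim

enrolled = np.array([0.2, 0.9, 0.4])
attempt  = np.array([0.25, 0.85, 0.45])
print(verify(attempt, enrolled))  # True for a close match
```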
Figure 1.8: Basic structures of speaker recognition systems.
Spoken language understanding involves two primary component technologies (each covered
elsewhere in this volume): speech recognition (SR), and natural language (NL) understanding. The
integration of speech and natural language has great advantages: To NL, SR can bring prosodic
information (information important for syntax and semantics but not well represented in text); NL
can bring to SR additional knowledge sources (e.g., syntax and semantics). For both, integration
affords the possibility of many more applications than could otherwise be envisioned, and the
acquisition of new techniques and knowledge bases not previously represented. The integration of
these technologies presents technical challenges, and challenges related to the quite different
cultures, techniques and beliefs of the people representing the component technologies.
In large part, NL research has grown from symbolic systems approaches in computer science and
linguistics departments. The desire to model language understanding is often motivated by a desire
to understand cognitive processes and, therefore, the underlying theories tend to be from
linguistics and psychology. Practical applications have been less important than increasing
intuitions about human processes. Therefore, coverage of phenomena of theoretical interest—
usually the more rare phenomena—has traditionally been more important than broad coverage.
Speech recognition research, on the other hand, has largely been practiced in engineering
departments. The desire to model speech is often motivated by a desire to produce practical
applications. Techniques motivated by knowledge of human processes have therefore been less
important than techniques that can be automatically developed or tuned, and broad coverage of a
representative sample is more important than coverage of any particular phenomenon.
There are certainly technical challenges to the integration of SR and NL. However, progress toward
meeting these challenges has been slowed by the differences outlined above. Collaboration can be
inhibited by differences in motivation, interests, theoretical underpinnings, techniques, tools, and
criteria for success. However, both groups have much to gain from collaboration. For the SR
engineers, human language understanding provides an existence proof, and needs to be taken into
account, since most applications involve interaction with at least one human. For the AI NL
researchers, statistical and other engineering techniques can be important tools for their inquiries.
Speech synthesis research predates other forms of speech technology by many years. In the early
days of synthesis, research efforts were devoted mainly to simulating human speech production
mechanisms, using basic articulatory models based on electroacoustic theories. Though this
modeling is still one of the ultimate goals of synthesis research, advances in computer science have
widened the research field to include Text-to-Speech (TtS) processing, in which not only human
speech generation but also text processing is modeled. As this modeling is generally done by a set
of rules derived, e.g., from phonetic theories and acoustic analyses, the technology is typically
referred to as speech synthesis by rule.
Figure 5.1 shows the configuration of a standard TtS system. In such systems, rule-based synthesis
has attained highly intelligible speech quality and can already serve in many practical uses.
Ceaseless efforts have improved the quality of rule-based synthetic speech, step by step, by
alternating speech characteristics analysis with the development of control rules. However, most of
this progress has been system dependent, and remains deeply embedded within system
architectures in impenetrable meshes of detailed rules and finely tuned control parameters. As a
consequence, the expert knowledge that has been incorporated is not readily available for sharing
and can be very hard to replicate in equivalent systems by other researchers.
In contrast to this traditional rule-based approach, a corpus-based approach has also been pursued.
In the corpus-based work, well-defined speech data sets have been annotated at various levels with
information, such as acoustic-phonetic labels and syntactic bracketing, serving as the foundation
for statistical modeling. Spectral and prosodic feature parameters of the speech data are analyzed
in relation to the labeled information, and their control characteristics are quantitatively described.
Based on the results of these analyses, a computational model is created and trained using the
corpus. By subsequently applying the resulting model to unseen test data, its validity and any
defects can be quantitatively shown. By feeding back results from such tests into the original model
with extended training, further improvements can be attained in a cyclical process.
Corpus-based methods provide for a specification of the speech segments required for
concatenative synthesis in three factors:
1. the procedures of the unit selection algorithm;
2. the objective measures used in the selection criteria; and
3. the design of the speech corpus from which the units are extracted.
This modularization of system building is useful not only in reducing construction effort, but
also in allowing precise mathematical specification of the problems and in defining ways to cope
with them systematically, by improving the selection algorithms, criteria and data.
Speech generation is the process which allows the transformation of a string of phonetic and
prosodic symbols into a synthetic speech signal. The quality of the result is a function of the quality
of the string, as well as of the quality of the generation process itself. Today, two quality criteria
are proposed for a text-to-speech (TtS) system. The first one is intelligibility, which can be
measured by taking into account several kinds of units (phonemes, syllables, words, phrases). The
second one, more difficult to define, is often labeled as pleasantness or naturalness. Actually the
concept of naturalness may be related to the concept of realism in the field of image synthesis: the
goal is not to reproduce reality but to suggest it. Thus, listening to a synthetic voice must allow
the listener to attribute this voice to some pseudo-speaker and to perceive some kind of
expressivity as well as providing some indices characterizing the speaking style and the particular
situation of elocution.
Most of the present TtS systems produce an acceptable level of intelligibility, but the
naturalness dimension, the ability to control expressivity, speech style and pseudo speaker identity
are still poorly mastered. Let us mention, however, that users' demands vary to a large extent
according to the field of application: general public applications such as telephonic information
retrieval need maximal realism and naturalness, whereas some applications involving
professionals (process or vehicle control) or highly motivated persons (visually impaired,
applications in hostile environments) demand intelligibility with the highest priority.
One of the first things that an English TtS system would need to do is tokenize the input into
words: for English this is not generally difficult though it is more complicated for some other
languages. A pronunciation then needs to be computed for each word; in English, given the
irregularity of the orthography, this process involves a fair amount of lexical lookup though other
processes are involved too. Some of the words in the sentence should be accented; for a sentence
such as Give me a ticket to Dallas or give me back my money, a reasonable accentuation would
involve accenting content words like give, ticket, Dallas, back and money, and leaving the other
words unaccented. Then we might consider breaking the input
into prosodic phrases: in this case, it would be reasonable to intone the sentence as if there were a
comma between Dallas and or. Thus, various kinds of linguistic information need to be extracted
from the text, but only in the case of word boundaries can this linguistic information be said to be
represented directly in the orthography.
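The text-analysis steps just described (tokenization, pronunciation lookup, accenting of content words) can be sketched as a small pipeline. The tiny lexicon, phone symbols, and function-word list below are invented for illustration:

```python
# Toy TtS front end: tokenize, look up pronunciations, accent content
# words. Lexicon and function-word list are illustrative only.

LEXICON = {"give": "G IH V", "me": "M IY", "a": "AH",
           "ticket": "T IH K AH T", "to": "T UW", "dallas": "D AE L AH S"}
FUNCTION_WORDS = {"a", "an", "the", "to", "me", "or", "my"}

def tts_front_end(text):
    tokens = text.lower().replace(",", "").split()   # 1. tokenize
    out = []
    for tok in tokens:
        phones = LEXICON.get(tok, "<oov>")           # 2. pronunciation lookup
        accented = tok not in FUNCTION_WORDS         # 3. accent content words
        out.append((tok, phones, accented))
    return out

for entry in tts_front_end("Give me a ticket to Dallas"):
    print(entry)
```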
Interactive natural language capabilities are needed for a wide range of today’s intelligent
systems: expert systems must explain their results and reasoning, intelligent assistants must
collaborate with users to perform tasks, tutoring systems must teach domain concepts and critique
students’ problem-solving strategies, and information delivery systems must help users find and
make sense of the information they need. These applications require that a system be capable of
generating coherent multi-sentential responses, and interpreting and responding to users’
subsequent utterances in the context of the ongoing interaction.
Spoken language generation allows for provision of responses as part of an interactive
human-machine dialogue, where speech is one medium for the response. This research topic draws
from the fields of both natural language generation and speech synthesis. It differs from synthesis
in that speech is generated from an abstract representation of concepts rather than from text.
While a relatively under-emphasized research problem, the ability to generate spoken responses is
clearly crucial for interactive situations, in particular when:
1. the user’s hands and/or eyes are busy;
2. screen real estate is at a premium;
3. time is critical; or
4. system and user are communicating via a primarily audio channel such as the telephone.
Like written language generation, spoken language generation requires determining what concepts
to include and how to realize them in words, but critically also requires determining intonational
form. Several problems are particularly pertinent to the spoken context: 1) The need to model and
use knowledge about hearer goals, hearer background, and past discourse in determining content
and form of a response. While the written context can include a general audience (e.g., for report
generation), responses in an interactive dialogue are intended for a particular person and, to be
useful, must take that person into account. 2) What kind of language should be generated given a
spoken context? Given the lack of visual memory that a text provides, the form required for speech
is likely to be quite different from that found in text.
The written form of language is contained in printed documents, such as newspapers, magazines
and books, and in handwritten matter, such as found in notebooks and personal letters. Given the
importance of written language in human transactions, its automatic recognition has practical
significance.
3.3.1.1 Written Language
Fundamental characteristics of writing are:
1. it consists of artificial graphical marks on a surface;
2. its purpose is to communicate something;
3. this purpose is achieved by virtue of the marks' conventional relation to language.
Although speech is a sign system that is more natural than writing to humans, writing is considered
to have made possible much of culture and civilization. Different writing systems, or scripts,
represent linguistic units, words, syllables and phonemes, at different structural levels. In
alphabetic writing systems, principal examples of which are the Latin, Greek and Russian scripts,
alphabets are the primitive elements, or characters, which are used to represent words. Several
languages such as English, Dutch, French, etc, share the Latin script. The Devanagari script, which
represents syllables as well as alphabets, is used by several Indian languages, including Hindi. The
Chinese script, which consists of ideograms, is an alternative to alphabets. The Japanese script
consists of the Chinese ideograms (Kanji) and syllables (Kana).
There are roughly two dozen different scripts in use today (ignoring minor differences in
orthography, as between English and French). Each script has its own set of icons, known as
characters or letters, that have certain basic shapes. Each script has its rules for combining the
letters to represent the shapes of higher level linguistic units. For example, there are rules for
combining the shapes of individual letters so as to form cursively written words in the Latin
alphabet.
In addition to linguistic symbols, each script has a representation for numerals, such as the Arabic-
Indic digits used in conjunction with the Latin alphabet. In addition, there are icons for special
symbols found on keyboards.
3.3.1.2 Transducers
Most archived written language has been in the form of printed paper documents. In such
documents, text is presented as a visual image on a high contrast background, where the shapes of
characters belong to families of type fonts. Paper documents, which are an inherently analog
medium, can be converted into digital form by a process of scanning and digitization.
More recently, it has become possible to store and view electronically prepared documents
as formatted pages on a computer graphics screen, where the scanning and recognition process is
eliminated.
Written language is also encountered in the form of handwriting inscribed on paper or
registered on an electronically sensitive surface. Handwriting data is converted to digital form
either by scanning the writing on paper or by writing with a special pen on an electronic surface
such as a Liquid Crystal Display (LCD). The two approaches are distinguished as off-line and on-line
handwriting.
3.3.1.3 Recognition
Written language recognition is the task of transforming language represented in its spatial form of
graphical marks into its symbolic representation. For English orthography, this symbolic
representation is typically the ASCII representation of text. The characters of most written
languages of the world are representable today in the form of the Unicode. The central tasks are
character recognition and word recognition. A necessary preprocessing step for recognizing
written language is the spatial issue of locating and registering the appropriate text when there are
complex two-dimensional spatial layouts employed. The latter task is referred to as document
image analysis.
Character Recognition
The basic problem is to assign the digitized character into its symbolic class. In the case of a
print image, this is referred to as Optical Character Recognition (OCR) . In the case of handprint, it
is referred to as Intelligent Character Recognition (ICR).
The typical classes are the upper- and lower-case characters, the ten digits, and special
symbols such as the period, exclamation mark, brackets, dollar and pound signs, etc. A pattern
recognition algorithm is used to extract shape features and assign the observed character into the
appropriate class. Artificial neural networks have emerged as fast methods for implementing
classifiers for OCR. Algorithms based on nearest neighbor methods have higher accuracy, but are
slower.
Recognition of characters from a single font family on a well-printed paper document can be
done very accurately. Difficulties arise when there are decorative fonts, many fonts to be handled,
or when the document is of poor quality. Some examples of poor quality machine-printed and
handwritten characters are shown in Figure 2.1. In the difficult cases, it becomes necessary to use
models to constrain the choices at the character and word levels. Such models are essential in
handwriting recognition due to the wide variability of hand printing and cursive script. A word
recognition algorithm attempts to associate the word image to choices in a lexicon. Typically, a
ranking is produced. This is done either by the analytical approach of recognizing the individual
characters or by the holistic approach of dealing with the entire word image. The latter approach is
useful in the case of touching printed characters and handwriting. A higher level of performance is
observed by combining the results of both approaches. In the off-line unconstrained handwritten
word recognition problem, recognition rates of 95%, 85% and 78% have been reported for the top
choice for lexicon sizes of 10, 100 and 1,000, respectively. In the on-line case, larger lexicons are
possible for the same accuracy; a top choice
recognition rate of 80% with pure cursive words and a 21,000 word lexicon has been reported.
Language Models
Language models are useful in recovering strings of words after they have been passed
through a noisy channel, such as handwriting or print degradation. The most important model for
written language recognition is the lexicon of words. The lexicon, in turn, is determined by
linguistic constraints, e.g., in recognizing running text, the lexicon for each word is constrained by
the syntax, semantics and pragmatics of the sentence. The performance of a recognition system can
be improved by incorporating statistical information at the word sequence level. The performance
improvement derives from selection of lower-rank words from the word recognition output when
the surrounding context indicates such selection makes the entire sentence more probable. Lexical
techniques such as collocational analysis can be used to modify word neighborhoods generated by
a word recognizer. Modification includes re-ranking, deleting or proposing new word candidates.
Collocations are word patterns that occur frequently in language; intuitively, if word A is present,
there is a high probability that word B is also present.
Methods to apply linguistic knowledge include: n-gram word models, n-gram class (e.g.,
part-of-speech) models, context-free grammars, and stochastic context-free grammars. An example
of a handwritten sentence together with recognition choices produced by a word recognizer and
grammatically determined correct paths are shown in Figure 2.2. An increase in top choice word
recognition rate from 80% to 95% is possible with the use of language models.
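A minimal sketch of such language-model re-ranking follows: a bigram table is used to choose among the word recognizer's alternatives at each position. The candidate lists and probabilities are invented, and a real decoder would search all paths and combine recognition scores with language-model scores rather than choosing greedily:

```python
# Re-rank word recognizer candidates with a bigram language model
# (invented counts; greedy choice for brevity, not a full lattice search).

BIGRAM = {("he", "went"): 0.1, ("he", "want"): 0.01,
          ("went", "home"): 0.2, ("want", "home"): 0.001}

def best_path(candidate_lists, start="he"):
    """Pick, left to right, the candidate most probable given the
    previous word."""
    prev, path = start, [start]
    for candidates in candidate_lists:
        prev = max(candidates, key=lambda w: BIGRAM.get((prev, w), 1e-9))
        path.append(prev)
    return path

# Ranked alternatives from the recognizer for the two positions after "he":
print(best_path([["want", "went"], ["home", "hone"]]))  # ['he', 'went', 'home']
```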
The tasks of document image analysis are to:
1. determine the spatial extent of document segments and to associate appropriate labels with them,
e.g., half-tone photographs, text, graphics, separating lines, etc.,
2. group image parts into meaningful units, e.g., figure and caption, heading, subheading, etc.,
3. determine reading order of blocks of text.
Document image analysis applies traditional image processing operations to printed text,
such as enhancement, gray-scale image binarization, texture analysis, and segmentation. Additional
difficult problems in the case of handwriting are: separation of lines of text, separation of words
within a line and the separation of touching characters.
Figure 2.2: Handwritten Sentence Recognition. The path through top word choices is determined
using part-of-speech tags.
Pen computers offer an interesting alternative to paper. One can write directly on a Liquid
Crystal Display (LCD) screen with a stylus or pen. The screen has an invisible sensitive matrix
which records the position of the pen on the surface. The trajectory of the pen appears almost
instantaneously on the screen giving the illusion of ink (electronic ink). Handwriting recognition
allows text and computer commands to be entered.
While nothing opposes the idea of a computer that would use multiple input modalities,
including speech, keyboard and pen, some applications call for a pen-only computer interface: in a
social environment, speech does not provide enough privacy; for small hand-held devices and for
large alphabets (e.g., Chinese), the keyboard is cumbersome. Applications are numerous: personal
organizer, personal communicator, notebook, data acquisition device for order entries, inspections,
inventories, surveys, etc.
The dream is to have a computer that looks like paper, feels like paper but is better than
paper. Currently, paper is the most popular medium for sketching, note taking and form filling,
because it offers a unique combination of features: light, cheap, reliable, available almost
everywhere any time, easy to use, flexible, foldable, pleasing to the eye and to the touch, silent. But
paper also has its drawbacks: in large quantities it is no longer light and cheap, it is hard to reuse
and recycle, difficult to edit, expensive to copy and to mail, and inefficient to transform into
computer files. With rapid technology progress, electronic ink could become cheaper and more
convenient than paper, if only handwriting recognition worked.
Human subjects make 4–8% error for isolated letters read in the absence of context, and fewer
errors when the context of the neighboring letters is available. Therefore, the task of designing usable
handwriting recognizers for pen computing applications is tremendously hard. Human recognition
rates must be reached and even outperformed.
On-line Handwriting Recognition Techniques
Considerably more effort has been put into developing algorithms for Optical Character
Recognition (OCR) and speech recognition than for on-line handwriting recognition. Consequently,
on-line handwriting recognition, which bears similarity to both, has been borrowing a lot of
techniques from these.
There is a natural temptation to convert pen trajectory data to pixel images and process them
with an OCR recognizer. But the on-line handwriting recognition problem has a number of
distinguishing features which must be exploited to get best results:
1. Preprocessing operations (such as smoothing, deslanting, deskewing, and dehooking) and
feature extraction operations (such as the detection of line orientations, corners, loops and
cusps) are easier and faster with the pen trajectory data than on pixel images (see the sketch
after this list).
2. Discrimination between optically ambiguous characters (for example, “j” and “;”) may be
facilitated with the pen trajectory information.
3. Segmentation operations are facilitated by using the pen-lift information, particularly for
handprinted characters.
4. Immediate feedback is given by the writer, whose corrections can be used to further train
the recognizer.
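The following toy sketch illustrates two of the trajectory-based operations above: moving-average smoothing of the pen trace and stroke segmentation at pen-lifts. The (x, y, pen_down) sample format is an assumption made for illustration:

```python
import numpy as np

# Toy trajectory preprocessing: smoothing and pen-lift segmentation.

def smooth(points, window=3):
    """Moving-average smoothing of an (N, 2) array of pen coordinates."""
    kernel = np.ones(window) / window
    x = np.convolve(points[:, 0], kernel, mode="same")
    y = np.convolve(points[:, 1], kernel, mode="same")
    return np.stack([x, y], axis=1)

def split_strokes(samples):
    """Split a sequence of (x, y, pen_down) samples at pen-lifts."""
    strokes, current = [], []
    for x, y, pen_down in samples:
        if pen_down:
            current.append((x, y))
        elif current:
            strokes.append(current)
            current = []
    if current:
        strokes.append(current)
    return strokes

samples = [(0, 0, 1), (1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 1)]
print(len(split_strokes(samples)))  # 2 strokes separated by a pen-lift
print(smooth(np.array([[0.0, 0.0], [1.0, 1.2], [2.0, 0.9], [3.0, 2.1]])))
```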
Another temptation is to use the pen trajectory as a temporal signal and process it with a speech
recognizer. Other problems arise:
1. Stroke reordering is usually necessary, to get rid of stroke order variability and of the
problem of delayed strokes.
2. Data unfolding in a purely one-dimensional representation may result in losing direct
reference to the two-dimensional structure of the data.
As in many well-mastered tasks, human subjects generally work at the highest and most
efficient level of abstraction possible when reading a handwritten document. When difficulties are
encountered in deciphering a part of the message using one level of interpretation, they often
switch to a lower level of representation to resolve ambiguities.
In this perspective, the lower levels of knowledge, although generally used in the
background, constitute a cornerstone on which a large part of the higher and more abstract process
levels relies. For example, according to motor theories of perception, it is assumed that motor
processes enter into genesis of percepts and that handwriting generation and perception tasks
interact and share sensorimotor information. Cursive script recognition or signature verification
tasks therefore require, directly or indirectly, an understanding of the handwriting generation
processes.
3.4 Mathematical Methods
3.4.1 Overview
Processing of written and spoken human language is computation. However, language
processing is not a single type of computation. The different levels of information encoding in our
language as well as the spectrum of applications for language technology pose a range of very
distinct computational tasks. Therefore a wide variety of mathematical methods have been
employed in human language technology. Some of them have led to specialized tools for a very
restricted task, while others are part of the mathematical foundations of the technology. In this
overview a very general map is drawn that groups the different approaches. Some particularly
relevant classes of methods are highlighted in the remaining sections of the chapter.
In the early days of language processing, most if not all researchers underestimated the
complexity of the problem. Many of them tried to bypass a mathematical characterization of their
tasks and solve the problem simply by looking at the envisaged inputs and outputs of their systems.
Purely procedural early approaches to machine translation fall in this category. These attempts
failed very badly. However, there is one major difference between language processing and most
other areas with highly complex calculation tasks, e.g., computational meteorology. One system
exists that can handle human language quite decently, i.e., the human cognitive system. Moreover,
there is a scientific discipline that strives for a formal description of the human language faculty.
Very soon the great majority of researchers became convinced that one needed to utilize
insights from linguistics—including phonetics and psycholinguistics—in order to make progress in
modeling the human language user. Modern linguistics tries to characterize the mapping between a
spoken or written utterance and its meaning. Linguists do this in roughly the following way. They
break up the complex mapping into suitable levels of representation, specify representations in
dedicated formalisms, and they employ the same formalisms for specifying the implicit linguistic
knowledge of a human language user. Traditionally their main data are collected or invented
example sentences, judged and interpreted by introspection.
Almost exclusively, discrete symbolic methods have been employed for representing
different types of information in linguistics. (The only exception was phonetic signal processing,
where Fourier transformations converted the two-dimensional acoustic signal into a three-
dimensional spectrogram.)
3.4.1.1 High-level Linguistic Methods
The mathematical methods of syntax, morphology and phonology are suited for describing
sets of strings, especially hierarchically structured strings. Most of these methods came from formal
language theory. Most notable is the formal theory of languages and grammars emerging from the
Chomsky hierarchy, an inclusion hierarchy of classes of languages. Each class of languages
corresponds to a certain level of computational complexity. For these classes of languages, classes
of grammars were found that generate the languages. Each class of languages also corresponds to a
type of automaton that accepts strings of a language. Typical for this research was the close
interaction between theoretical computer science and formal syntax with strong influences in both
directions. Much investigation has gone into the question of the proper characterization
of human language with respect to the Chomsky hierarchy.
The grammars and automata of formal language theory can rarely be applied to natural
language processing without certain modifications. The grammar models developed in linguistics
do not directly correspond to the ones from formal language theory. A variety of grammar models
have been designed in linguistics and language technology.
The grammars of formal language theory are rewrite systems with atomic non-terminal
symbols that stand for lexical and syntactic categories. However, in human language such
categories have complex properties that influence their syntactic distribution. Therefore
mathematical tools were developed for expressing linguistic entities as sets of complex features. A
new class of logic, so-called feature logic, evolved. This branch of research, often subsumed under
the term unification grammar, had close links with similar developments in knowledge
representation and programming languages.
The specialized processing methods that were developed for unification grammars had
strong connections with constraint logic programming. The situation is somewhat different in
semantics. Here representation languages are needed in which we can represent the meaning—or
better informational content—of an utterance. In order to provide unambiguous representations of
meaning that can serve as the basis for inferences, logic is employed. Many varieties of higher order
predicate logic have been developed for this purpose. Special representation languages such as
frame and script languages came from artificial intelligence (AI). In the last few years, many of
them have received a logical foundation. General purpose and specialized inference techniques
have been employed for interpreting the meaning representation in connection with knowledge
about the linguistic context, situational context, and the world. Logical deduction is the inference
technique mainly used, but there are also approaches that utilize abduction methods.
The last two decades witnessed a convergence of theoretical linguistics and language
processing with respect to their mathematical methods. On the one hand, this movement proved
very fruitful in many areas of language processing. On the other hand, it also led to some
disillusionment concerning the potentials of formal linguistic tools among practitioners of language
technology. Although the specification of linguistic knowledge improved quite a bit through the use
of advanced representation techniques, the resulting systems still lacked coverage, robustness, and
efficiency, the properties required for realistic applications. It even seemed that every increase in
linguistic coverage was accompanied by a loss of efficiency since efficient processing methods for
linguistic representation formalisms are still missing.
3.4.2.2 Quantization
Since we assume that speech is literate, the signal could also be represented as a sequence of
symbols where each symbol corresponds to a phonetic unit of varying duration. Each phonetic unit
corresponds to a region of the twelve-dimensional acoustic feature space. The regions are defined
statistically by estimating the probability of each class conditioned on the vectors belonging to that
class and then computing the pairwise decision boundaries between the classes as the locus of
points in the feature space where both classes are equally probable.
If the decision boundaries are explicitly computed as described above, then an arbitrary feature
vector can be classified as resulting from the utterance of one phonetic class simply by finding
which region of the space it lies in. As a matter of practice, however, this computation is not
performed explicitly. Rather, a statistical decision rule is used. The rule simply states that a feature
vector belongs to that class whose probability is largest conditioned on the vector. The effect of this
decision rule is to quantize the entire twelve-dimensional feature space into a small number of
regions corresponding to the phonetic classes.
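A minimal sketch of this decision rule follows. The phonetic classes are modeled here as equal-prior, equal-variance Gaussians with invented two-dimensional means (rather than the twelve-dimensional space described above), in which case the maximum-probability rule reduces to choosing the nearest class mean:

```python
import numpy as np

# Assign a feature vector to the phonetic class with the highest
# conditional probability; with equal priors and equal spherical
# variances this reduces to nearest class mean (made-up means).

CLASS_MEANS = {"aa": np.array([1.0, 0.0]),
               "iy": np.array([-1.0, 0.5]),
               "s":  np.array([0.0, 2.0])}

def classify(vector):
    """Pick the class whose Gaussian gives the largest likelihood."""
    return min(CLASS_MEANS,
               key=lambda c: np.sum((vector - CLASS_MEANS[c]) ** 2))

print(classify(np.array([0.8, 0.1])))  # "aa": nearest class mean
```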
3.4.2.7 Semantics
The ultimate goal of speech recognition is to enable computers to understand the meaning of
ordinary spoken discourse. Semantics is that aspect of linguistic structure relating words to
meaning. Thus, the ultimate speech recognition machine will necessarily include a semantic
analyzer. At present, there exists no general theory of the semantics of natural language. There are
many proposed theories some of which are abstract and others of which are worked out for specific
limited domains of discourse. All such theories rest on the idea that formal logical operations acting
on lexical tokens and syntactic structures yield a formal symbolic representation of meaning. These
theories have not yet been made statistical in any coherent way, although a new approach based
on the HMM seems promising. There is, however, a statistical methodology which captures useful
semantic information. It is called word sense disambiguation. A simple example is found in the
word bank which has two meanings or senses. One sense is that of a financial institution and
another refers to the shores of a river. Clearly, the words commonly associated with the two senses
are quite different. If we know the words that are appropriate to each word sense, then we can use
search techniques to maximize the joint probability of word sequences and word sense. This will
result in higher lexical transcription accuracy. The key to this line of reasoning is the precise
measurement of the closeness of word associations. Church and Hanks (1990) proposed using the
information theoretic measure of mutual information and analyzed large text corpora to show
that words clustered by large mutual information contents are indicative of a single word sense. It
is thus possible to compile word sense statistics for large lexicons and apply them in statistical
parsing techniques as described earlier.
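The mutual information measure of word association can be sketched as below; the corpus size and co-occurrence counts are invented for illustration:

```python
import math

# Pointwise mutual information (Church & Hanks, 1990) from invented
# co-occurrence counts: I(x, y) = log2( P(x, y) / (P(x) * P(y)) ).

N = 1_000_000          # corpus size (hypothetical)
count = {"bank": 500, "river": 300, "loan": 200,
         ("bank", "river"): 40, ("bank", "loan"): 60}

def pmi(x, y):
    p_xy = count[(x, y)] / N
    p_x, p_y = count[x] / N, count[y] / N
    return math.log2(p_xy / (p_x * p_y))

print(pmi("bank", "river"))  # strong association -> one sense of "bank"
print(pmi("bank", "loan"))   # strong association -> the other sense
```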
3.4.2.8 Performance
It would be impossible in a short paper such as this to completely and quantitatively
characterize the performance of statistical speech recognizers. Instead, I will briefly mention three
benchmarks established by systems based on the methodologies described above. They are
moderate vocabulary speech understanding, large vocabulary speech recognition and small
vocabulary recognition of telephony. Detailed summaries may be found in ARPA (1994); Wilpon
(1994). The most ambitious speech understanding experiment is presently ongoing under the
sponsorship of ARPA. Several laboratories have built Air Travel Information System (ATIS) systems
that provide airline travel information from spoken input. With a nominal vocabulary of 2000
words and spontaneous
discourse from undesignated but cooperative speakers, approximately 95% of all queries are
correctly answered.
Another highly ambitious project in speech recognition is also sponsored by ARPA. In this
large vocabulary recognition task, the goal is lexical transcription only; so unlike the ATIS task, no
semantic processing is used. The material is text read from North American business publications
by undesignated speakers. The nominal vocabulary is 60,000 words. For this task, several
laboratories have achieved word error rates of 11% or less. Unfortunately, such results are
obtained by computer programs requiring hundreds of times real time.
Finally, the largest commercial use of speech recognition is in the AT&T telephone network
for the placement of calls. In this case, customers are allowed to ask for one of five categories of
service using any words they like so long as their utterance contains one of five key words. This
system is currently processing about 1 billion calls per year. Calls are correctly processed more
than 95% of the time without operator intervention.
Because of their mathematical and computational simplicity, regular languages and finite-
state machines have been applied in many information processing tasks. Regular expressions are
often used to specify global search patterns in word-processors and in operating-system utilities
such as Unix’ Grep. The lexical analysis component of most modern programming language
compilers is defined as a finite-state machine that recognizes identifier classes, punctuation,
numbers, etc. (Aho & Ullman, 1978). But until relatively recently, finite-state techniques have not
been widely used in natural language processing. This is in large measure the result of Chomsky’s
argument (Chomsky, 1957) that natural language sentences are so rich and complicated that they
cannot be accurately characterized by means of regular or finite-state descriptions. One
consequence of this was that a generation of linguists and computational linguists came to believe
that finite-state techniques were of little or no interest.
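The compiler-style use of finite-state methods mentioned above is easy to make concrete. The sketch below defines a small lexical analyzer with regular expressions; the token classes are invented for the example, and the alternation of named patterns behaves as a single finite-state machine.

    import re

    TOKEN_SPEC = [
        ("NUMBER", r"\d+(\.\d+)?"),              # integer or decimal literals
        ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),   # identifiers
        ("PUNCT",  r"[(),;=+*/-]"),              # punctuation and operators
        ("SKIP",   r"\s+"),                      # whitespace, discarded below
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(text):
        # Scan left to right; each match is labeled by the token class it satisfies.
        for m in MASTER.finditer(text):
            if m.lastgroup != "SKIP":
                yield (m.lastgroup, m.group())

    print(list(tokenize("rate = 3.14 * (count + 2);")))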
It may well be true that complete and accurate natural language syntactic and semantic
dependencies lie beyond the power of finite-state description, but work in the last twenty years
(and particularly in the last five years) has identified a number of important problems for which
very efficient and effective finite-state solutions can be found.
One set of solutions relies on the observation that finite-state descriptions can provide an
approximation to the proper grammatical description of a language, an approximation that is often
good enough for practical purposes. The information extraction problem, for example, requires that
documents and passages be identified that are likely to contain information relevant to a given
user’s needs. Full-blown syntactic and semantic analyses of documents would certainly help to
solve this problem. But such analyses may provide much more information than this task actually
requires. Indeed, Appelt, Hobbs, et al. (1993) have constructed a finite-state solution that is
extremely efficient compared to more powerful but more complex extraction systems and also has
very favorable recall and precision scores. Their finite-state pattern specifications can only
approximate a complete analysis of a text, but the approximation seems close enough for this
particular purpose.
There has also been growing interest in using finite-state machines for storing and accessing
natural language dictionaries. Appel and Jacobson (1988) observed that the words in a large
lexicon can be very compactly represented in a state-transition graph. This is because the graph can
be transformed using determinization and minimization techniques that are well-known from
finite-state theory, with the result that prefixes and suffixes common to many words are collapsed
into a single set of transitions. Lucchesi and Kowaltowski (1993) discuss access methods for finite-
state dictionary representations that permit efficient retrieval of translation and synonymy
information associated with particular words.
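The prefix sharing that makes such dictionaries compact can be sketched in a few lines. The trie below collapses only shared prefixes; the determinization and minimization step that also merges shared suffixes (giving the very compact automata described above) is omitted for brevity, and the toy lexicon is invented for the example.

    def build_trie(words):
        # A state-transition graph keyed by character; shared prefixes such as
        # walk/walks/walked reuse the same transitions.
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["<final>"] = True  # mark a final (accepting) state
        return root

    def lookup(trie, word):
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "<final>" in node

    lexicon = build_trie(["walk", "walks", "walked", "walking", "talk", "talks"])
    print(lookup(lexicon, "walked"), lookup(lexicon, "walkings"))  # True False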
A third set of problems require that strings be systematically transformed into other strings.
For example, the negative prefix in in the abstract phonological representation such as in+practical
must be realized with an assimilated nasal as in impractical, or the inflected form stopping must be
mapped either to its stem stop (for use in information retrieval indexing) or to its morphological
analysis stop+PresentParticiple as an initial step in further processing. One formalism for
describing these kinds of string transformations is context-sensitive rewriting rules, as discussed
for example by Chomsky and Halle (1968). Johnson (1972) and later Kaplan and Kay (1981);
Kaplan and Kay (1994) showed that for each rule of this type there is a finite-state
transducer that maps input strings to exactly the same outputs that the rule prescribes.
A transducer is a simple extension of a basic finite-state machine: as it reads its input tape and
makes a transition from one state to another, it can also write a corresponding symbol on a second
tape. When it reaches a final state and the end of the input, the string produced on the second tape
is taken to be the output that the input string is mapped to.
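To make the transducer idea concrete, here is a toy three-state transducer for the nasal-assimilation rule mentioned above, written out as an explicit state loop. The state names and the restriction to the labials p, b, m are simplifying assumptions for the sketch, not a description of any published system.

    def assimilate(form):
        # Realize an abstract form such as "in+practical" on the surface.
        # One rewrite rule as a transducer:  n -> m / _ + [p b m]
        # The morpheme boundary "+" is read but never written (deleted).
        LABIALS = set("pbm")
        out, state = [], "COPY"          # states: COPY, SEEN_N, SEEN_N_PLUS
        for ch in form:
            if state == "COPY":
                if ch == "n":
                    state = "SEEN_N"     # hold the nasal; its output is delayed
                elif ch == "+":
                    pass                 # boundary symbol deleted on the surface
                else:
                    out.append(ch)
            elif state == "SEEN_N":
                if ch == "+":
                    state = "SEEN_N_PLUS"
                elif ch == "n":
                    out.append("n")      # emit the held n, hold the new one
                else:
                    out.extend(["n", ch])
                    state = "COPY"
            else:                        # SEEN_N_PLUS: n followed by a boundary
                out.extend(["m" if ch in LABIALS else "n", ch])
                state = "COPY"
        if state != "COPY":
            out.append("n")              # input ended while holding a nasal
        return "".join(out)

    print(assimilate("in+practical"))    # impractical
    print(assimilate("in+tolerable"))    # intolerable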
Mathematically, the collection of input-output string-pairs for a given transducer
(corresponding perhaps to a particular rule) constitutes a regular relation. Regular relations share
many (but not all) of the formal properties of the regular languages (for example, closure under
union and concatenation) and also enjoy certain other properties. In particular, the regular
relations are closed under composition, and Kaplan and Kay use this fact to show that the effect of
an entire collection of rewriting rules making up a phonological or morphological grammar can be
implemented as a single finite-state transducer. This device is much more efficient than any scheme
using the rules directly for performing the string transformations that the grammar describes.
Koskenniemi (1983) proposed a different rule notation, called two-level rules, for characterizing
phonological and morphological variations. These rules also denote only regular relations and can
be transformed to equivalent transducers.
UNIT – II
Introduction to Semantics and Knowledge Representation: some applications like Machine
translation, database interface Semantic Interpretation, word senses and ambiguity, Basic logical
form language, Encoding ambiguity in logical form, Thematic roles, Linking syntax and semantics,
Recent trends in NLP.
Machine Translation

People working in the fascinating research area called Machine Translation (MT) dream of a day when language barriers melt away: when the riches of world literature are open to everyone and the world is intellectually and culturally united. Although this goal is far from realization, the first signs of limited success are apparent. In fact, the first faltering steps in MT were taken in the 1950s and 1960s, but at that time formal linguistics and artificial intelligence had barely been born, and computer science was in its infancy. As a result, those early efforts "failed" to achieve success.

The early systems attempted essentially word-for-word translation: for Hindi to English, for example, the Hindi words are replaced by English equivalents and reordered to follow the English noun-verb-noun sequence instead of the Hindi noun-noun-verb order. Unfortunately, such substitution immediately runs into the problem of word sense. A single source word can map to several target words: the Hindi verb "jotaa" can be replaced by "plough", "prepare" or "harness" depending on the sense implied in the sentence. The correct sense must first be identified before the word can be replaced for a correct translation.

Word substitution is not enough for another reason: the relationships among the words of a sentence (the sentence structure) must also be identified correctly. For example, in a sentence mentioning a telescope, it matters whether the telescope attaches to the verb or to a noun. Such identification may require more than grammar; frequently anaphora plays an important part, and finding the referent of a pronoun may be necessary before it can be translated. All of this means formulating not only a grammar for the language but also using background knowledge, including common sense knowledge. All this must be used in complex and as yet unknown ways to process a given text.

What MT Can Do

The difficulty of the task makes it clear that literature and poetry are beyond MT in the foreseeable future. Legal texts, which are carefully drafted to avoid ambiguity and unwanted implications, might also be beyond the ambit of MT. Jokes and other material which rely on conveying double meaning are also not good candidates for MT at present.

Are there any task domains where MT is applicable? While definitive statements cannot be made, it seems that the task domains are rather narrow. Work has nevertheless been undertaken on an ambitious project called Eurotra, covering all the languages of the European Community, and by groups in France, Germany, Switzerland, the US and India.

Approaches to Machine Translation

In the most direct approach, a separate translation system is built for every ordered pair of languages. In the second, interlingua approach, the parser analyzes a source sentence into a language-independent intermediate representation; a generator takes the intermediate representation and generates a sentence in the target language. The major advantage of this approach is that the analyzer or parser for the source language is independent of the generator for the target language. As a result, if one wants to build a system with translation capability among, say, 10 languages, only 10 parsers and generators need to be constructed. Contrast it with the 90 (10 x 9) systems that need to be constructed in the first approach.

The difficulty with the second approach, however, is that it is difficult to define the interlingua. Also, it is not possible to take advantage of similarities between languages. As a result, many research groups use a third approach called the transfer approach. In this approach, the parser produces the representation for a source language, which is then transferred to the representation for the target language. The generator takes over from there. This approach is intermediate between the first two. For 10 languages, in this approach only 10 parsers and generators are needed; however, the number of transfer components needed is 90.

In the Indian context, where there are a large number of officially recognized languages which are also very close, the interlingua approach would seem preferable, but the nature of the interlingua is open.

MT in the Indian Context

There is a long tradition of applying Sanskrit vyakarana (grammar) to modern Indian languages, and the University of Hyderabad has been an important centre starting work on computational linguistics and MT. The MT systems will have to fit into an organization: instead of replacing people, they will complement them. There can be other fallouts from the MT effort. At an intellectual level, it can give a big boost to the traditional Sanskrit vyakarana studies, linguistics and artificial intelligence disciplines; these will grow to new heights borrowing from one another. At the practical level, one can imagine a number of computer-based products: for example, a language assistant (a system that corrects grammatical mistakes), computer-assisted language instruction (which includes various lessons, including comprehension tests, for language teaching), and natural language interfaces to computers, so that one can communicate with computers using language. The marriage of speech processing work with natural language processing has the potential of producing computer systems that can listen and talk. If optical character recognition technology is also combined, it can lead to systems that can read out books (to the blind, for example).

Work for Indian languages will have to be done mainly by Indians. We are also the best equipped to do it. The problems are challenging and the potential immense. What is needed is a national will and a national effort.

Anusaaraka

It is argued in this section that there are only three major differences between the south Indian languages and Hindi. All of these can be bridged by simple additional notation in Hindi. The resulting language can be viewed as a southern dialect of Hindi. Anusaaraka uses this dialect to make source text in southern languages accessible to Hindi readers.

Although the concept of anusaaraka is general, we will discuss it in the context of the Kannada-Hindi anusaaraka. The first stage of processing is morphological analysis (morph). Morph considers a word at a time, and for each word it checks whether the word is in the dictionary of indeclinable words. If found, it returns its grammatical features. It also uses the word paradigms to see whether the input word can be derived from a root and its paradigm. If the derivation is possible, it returns the grammatical features associated with the word form (obtained from the root and the paradigm). In case the input word cannot be derived, it is possibly a compound word and is given to the sandhi package to split it into two or more words, which are then again analyzed by morph.

(Figure: resources used by the anusaaraka stages, including word paradigms with default choices, a bilingual dictionary, a vibhakti dictionary, a TAM dictionary, and a morphological synthesizer.)

The output of morph is given as input to the local word grouper (LWG). Its main task is to group function words with the content words based on local information, such as postposition markers that follow a noun, or auxiliary verbs following a main verb. This grouping (or case endings, in the case of inflectional languages) identifies the vibhakti of nouns and verbs. The vibhakti of verbs is also called the TAM (tense-aspect-modality) label.

After the above stage, sentential analysis can be done. Current anusaaraka does not do this analysis because it requires a large amount of linguistic data to be prepared. Also, since the Indian languages are close, direct use of vibhakti produces "correct" output once the lexical databases are ready.

The next stage of processing is that of the mapping block. This stage uses a noun vibhakti dictionary, a TAM dictionary, and a bilingual dictionary. For each word group, it finds a suitable root and vibhakti in the target language. Thus, it generates a local word group in the target language.

The local word groups in the target language are passed on to a local word splitter (LWS) followed by a morphological synthesizer (MS). LWS splits the local word groups into elements consisting of root and features. Finally, MS generates the words from a root and the associated grammatical features.

Anusaaraka Output

Anusaaraka output is usually not the target language, but close to it. Thus, the Kannada-Hindi anusaaraka produces a dialect of Hindi that does not have agreement etc. It can be called a sort of dakshini (southern) Hindi. Some additional notation may also be used in the output. A certain amount of training is needed for a user to get used to the anusaaraka output language.

The role of the anusaaraka interface (or on-line help) is to facilitate the reading of the output by the reader. It should keep track of what the reader is doing. Depending on the nature of the interface, one can have an ordinary user interface, an intelligent user interface, HAMT (human-aided machine translation), and machine translation.

(Figure: Different Interfaces for Anusaaraka, showing the core anusaaraka connected to a user interface, an intelligent user interface, a human-aided automatic corrector, and full MT, producing various levels of anusaaraka output in the target language.)

Agreement

When the gender-number-person (gnp) of the candidate kartas (or karmas) is different, an ambiguity will appear in the gnp of the verb. Similarly, possessive modifiers of nouns need information about the gnp of the related nouns. There are a number of cases where this is not easy to compute; sometimes the relevant information may not even be available. Kannada has three genders (masculine, feminine and neuter), whereas Hindi has only two (masculine and feminine); consequently, some information is necessarily lost in the mapping.

Loss of agreement is not expected to be too jarring to Hindi speakers in view of their exposure to various varieties of non-native Hindi heard everyday as propagated by television, radio and films.

Syntactic Differences

Apart from agreement, there are only three major syntactic differences between Hindi and Kannada. Surprisingly, all of these can be taken care of by enriching Hindi with a few additional functional particles or suffixes. Note that the analysis itself does not contain information about semantic relations: it does not say, for example, what relation "eat" has to "spoon" or "make" has to "spoon". It is the background knowledge of the reader that provides the relation between them; general world knowledge may be used to say that a spoon is the instrument of eating and is not itself eaten. Where Kannada has constructions with no direct Hindi equivalent (a syntactic hole in Hindi), anusaaraka can use the corresponding Hindi constructions only when the tense, aspect and modality information in Kannada is appropriate; elsewhere the vibhakti markers (that is, the function words or word groups) are replaced by Hindi ones together with additional notation. Unlike simple word substitution, this notation takes some time and effort for the Hindi reader to get used to.

Conclusion

We have argued that it is possible to overcome the language barrier in India using anusaaraka. Anusaaraka tries to take advantage of the relative strengths of the computer and the human reader: the computer takes the language load and leaves the world knowledge load on the reader. It is particularly effective when the languages are close, as is the case with the Indian languages.
Natural Language Processing (NLP) has many applications that require either
Natural Language understanding or Natural Language generation or even both - for
example Machine Translation, which focuses on translating text from one human
language to another automatically. Another good example of NLP is Information
Extraction, which is concerned with extracting “factual information in free text”. Natural
Language Interface to Database (NLIDB) is one of the traditional applications of the
NLP domain. It is a powerful example of Question answering systems (QAS) for
querying structured and unstructured databases. NLIDB has been an active area
of research that aims to provide accurate answers to user questions expressed in
Natural Language and generalize access to databases for different types of users
regardless of their technical skills.
The figure above shows the components of a Natural Language Interface to Database. The
idea of NLIDB systems came from the method of questioning the database using
Natural Language queries, in English for example, instead of a database query
language.
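A toy sketch of the idea, heavily simplified: each template below pairs a question pattern with a SQL skeleton. The employees table and its columns are invented for the example; real NLIDB systems use far richer linguistic analysis.

    import re

    TEMPLATES = [
        (re.compile(r"how many (\w+) are there", re.I),
         "SELECT COUNT(*) FROM {0};"),
        (re.compile(r"list (?:all )?(\w+) in (\w+)", re.I),
         "SELECT name FROM {0} WHERE department = '{1}';"),
        (re.compile(r"who earns more than (\d+)", re.I),
         "SELECT name FROM employees WHERE salary > {0};"),
    ]

    def to_sql(question):
        # Return the first matching SQL skeleton, filled with the captured words.
        for pattern, skeleton in TEMPLATES:
            m = pattern.search(question)
            if m:
                return skeleton.format(*m.groups())
        return None  # no template matched: one source of the coverage gaps noted below

    print(to_sql("How many employees are there?"))
    print(to_sql("Who earns more than 50000?"))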
Graphical NLIDBs: present users with an interface where they can make
selections for query formulation; the NL-Menu system is an example.
Textual NLIDBs: allow users to express their queries in NL by writing
them directly; LUNAR and PRECISE are systems that use a Textual NLIDB.
NLIDBs dependent on the database domain: it is necessary to know
beforehand all the particularities of the field of application, as in the LUNAR, ASK,
and GILNDB systems.
NLIDBs independent of the database domain: PRECISE is an example of this
type of NLIDB system, where knowledge of the field of application is not
necessary.
Classification of NLIDB by language: most existing NLIDBs answer
English requests, since English is the main language of several countries.
However, this does not prevent having other NLIDBs that allow access to the
information stored in a database through the formulation of user queries in
other languages: Arabic NLIDBs, Indian NLIDBs, French NLIDBs, Chinese
NLIDBs, the Bengali Language Query Processing System, and multilingual NLIDBs
(the EDITE system, which supports French, English, and Spanish).
Despite the development of many NLIDB systems, their use is not widespread due to
a set of constraints, some of which are listed below:
1. Lack of obvious language coverage: some NLIDB systems cannot answer all
questions expressed in Natural Language.
2. Absence of explanations in case of failure: when they fail, some NLIDBs
do not provide any explanation at all.
3. Disappointment of user expectations: individuals can be misled about the ability
of an NLIDB system to process all their queries in Natural Language.
4. Suitability of NLIDB for any user: it is not obvious that a Natural Language
interface is the best choice for every type of user.
UNIT – III
Grammars and Parsing: Grammars and sentence Structure, Top-Down and Bottom-Up Parsers,
Transition Network Grammars, Top-Down Chart Parsing. Feature Systems and Augmented
Grammars: Basic Feature system for English, Morphological Analysis and the Lexicon, Parsing with
Features, Augmented Transition Networks.
To examine how the syntactic structure of a sentence can be computed, you must consider
two things: the grammar, which is a formal specification of the structures allowable in the
language, and the parsing technique, which is the method of analyzing a sentence to determine its
structure according to the grammar. This chapter examines different ways to specify simple
grammars and considers some fundamental parsing techniques. The next chapter then describes the
methods for constructing syntactic representations that are useful for later semantic
interpretation.
This section considers methods of describing the structure of sentences and explores ways of
characterizing all the legal structures in a language. The most common way of representing how a
sentence is broken into its major subparts, and how those subparts are broken up in turn, is as a
tree. The tree representation for the sentence John ate the cat is shown in Figure 3.1.
This illustration can be read as follows: The sentence (S) consists of an initial noun phrase (NP) and
a verb phrase (VP). The initial noun phrase is made of the simple NAME John. The verb phrase is
composed of a verb (V) ate and an NP, which consists of an article (ART) the and a common noun
(N) cat. In list notation this same structure could be represented as
(S (NP (NAME John))
   (VP (V ate)
       (NP (ART the)
           (N cat))))
The grammar that licenses this structure, Grammar 3.2, consists of the following rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. NP -> NAME
4. NP -> ART N
5. NAME -> John
6. V -> ate
7. ART -> the
8. N -> cat
Rule 1 says that an S may consist of an NP followed by a VP. Rule 2 says that a VP may consist of a V
followed by an NP. Rules 3 and 4 say that an NP may consist of a NAME or may consist of an ART
followed by an N. Rules 5 - 8 define possible words for the categories. Grammars consisting entirely
of rules with a single symbol on the left-hand side, called the mother, are called context-free
grammars (CFGs). CFGs are a very important class of grammars for two reasons: The formalism is
powerful enough to describe most of the structure in natural languages, yet it is restricted enough
so that efficient parsers can be built to analyze sentences. Symbols that cannot be further
decomposed in a grammar, namely the words in the preceding example, are called terminal
symbols. The other symbols, such as NP, VP, and S, are called non-terminal symbols. The
grammatical symbols such as N and V that describe word categories are called lexical symbols. Of
course, many words will be listed under multiple categories. For example, can would be listed
under V and N.
Grammars have a special symbol called the start symbol. In this book, the start symbol will always
be S. A grammar is said to derive a sentence if there is a sequence of rules that allow you to rewrite
the start symbol into the sentence. For instance, Grammar 3.2 derives the sentence John ate the cat.
This can be seen by showing the sequence of rewrites starting from the S symbol, as follows:
S => NP VP (rewriting S)
=> NAME VP (rewriting NP)
=> John VP (rewriting NAME)
=> John V NP (rewriting VP)
=> John ate NP (rewriting V)
=> John ate ART N (rewriting NP)
=> John ate the N (rewriting ART)
=> John ate the cat (rewriting N)
Two important processes are based on derivations. The first is sentence generation, which uses
derivations to construct legal sentences. A simple generator could be implemented by randomly
choosing rewrite rules, starting from the S symbol, until you have a sequence of words. The
preceding example shows that the sentence John ate the cat can be generated from the grammar.
The second process based on derivations is parsing, which identifies the structure of sentences
given a grammar. There are two basic methods of searching. A top-down strategy starts with the S
symbol and then searches through different ways to rewrite the symbols until the input sentence is
generated, or until all possibilities have been explored. The preceding example demonstrates that
John ate the cat is a legal sentence by showing the derivation that could be found by this process.
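A minimal random generator over Grammar 3.2 (the rules listed above) makes the generation process concrete; this is only a sketch of the idea, not an algorithm from the text.

    import random

    GRAMMAR = {
        "S":    [["NP", "VP"]],
        "VP":   [["V", "NP"]],
        "NP":   [["NAME"], ["ART", "N"]],
        "NAME": [["John"]],
        "V":    [["ate"]],
        "ART":  [["the"]],
        "N":    [["cat"]],
    }

    def generate(symbol="S"):
        # Rewrite symbols top-down, choosing rules at random, until only
        # terminal words remain.
        if symbol not in GRAMMAR:          # a terminal word
            return [symbol]
        rule = random.choice(GRAMMAR[symbol])
        return [word for sym in rule for word in generate(sym)]

    print(" ".join(generate()))            # e.g. John ate the cat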
In a bottom-up strategy, you start with the words in the sentence and use the rewrite rules
backward to reduce the sequence of symbols until it consists solely of S. The left-hand side of each
rule is used to rewrite the symbols matching its right-hand side. A possible bottom-up parse of the
sentence is
John ate the cat
=> NAME ate the cat (rewriting John)
=> NAME V the cat (rewriting ate)
=> NAME V ART cat (rewriting the)
=> NAME V ART N (rewriting cat)
=> NP V ART N (rewriting NAME)
=> NP V NP (rewriting ART N)
=> NP VP (rewriting V NP)
=> S (rewriting NP VP)
A tree representation, such as Figure 3.1, can be viewed as a record of the CFG rules that account
for the structure of the sentence. In other words, if you kept a record of the parsing process,
working either top-down or bottom-up, it would be something similar to the parse tree
representation.
In constructing a grammar for a language, you are interested in generality, the range of sentences
the grammar analyzes correctly; selectivity, the range of non-sentences it identifies as problematic;
and understandability, the simplicity of the grammar itself.
In small grammars, such as those that describe only a few types of sentences, one structural
analysis of a sentence may appear as understandable as another, and little can be said as to why
one is superior to the other. As you attempt to extend a grammar to cover a wide range of
sentences, however, you often find that one analysis is easily extendable while the other requires
complex modification. The analysis that retains its simplicity and generality as it is extended is
more desirable.
Unfortunately, here you will be working mostly with small grammars and so will have only a few
opportunities to evaluate an analysis as it is extended. You can attempt to make your solutions
generalizable, however, by keeping in mind certain properties that any solution should have. In
particular, pay close attention to the way the sentence is divided into its subparts, called
constituents. Besides using your intuition, you can apply a few specific tests, discussed here.
Anytime you decide that a group of words forms a particular constituent, try to construct a new
sentence that involves that group of words in a conjunction with another group of words classified
as the same type of constituent. This is a good test because for the most part only constituents of
the same type can be conjoined. The sentences in Figure 3.3, for example, are acceptable, but the
following sentences are not:
Another test involves inserting the proposed constituent into other sentences that take the same
category of constituent. For example, if you say that John’s hitting of Mary is an NP in John’s hitting
of Mary alarmed Sue, then it should be usable as an NP in other sentences as well. In fact this is
true: the NP can be the object of a verb, as in I cannot explain John’s hitting of Mary as well as in the
passive form of the initial sentence Sue was alarmed by John’s hitting of Mary. Given this evidence,
you can conclude that the proposed constituent appears
to behave just like other NPs.
As another example of applying these principles, consider the two sentences I looked up John’s
phone number and I looked up John’s chimney. Should these sentences have the identical structure?
If so, you would presumably analyze both as subject-verb-complement sentences with the
complement in both cases being a PP. That is, up John’s phone number would be a PP.
When you try the conjunction test, you should become suspicious of this analysis. Conjoining up
John’s phone number with another PP, as in *I looked up John’s phone number and in his cupboards, is
certainly bizarre. Note that I looked up John’s chimney and in his cupboards is perfectly acceptable.
Thus apparently the analysis of up John’s phone number as a PP is incorrect.
Further evidence against the PP analysis is that up John’s phone number does not seem usable as a
PP in any sentences other than ones involving a few verbs such as look or thought. Even with the
verb look, an alternative sentence such as *Up John’s phone number, I looked is quite implausible
compared to Up John’s chimney, I looked.
This type of test can be taken further by considering changing the PP in a manner that usually is
allowed. In particular, you should be able to replace the NP John’s phone number by the pronoun it.
But the resulting sentence, I looked up it, could not be used with the same meaning as I looked up
John’s phone number. In fact, the only way to use a pronoun and retain the original meaning is to
use I looked it up, corresponding to the form I looked John’s phone number up.
Thus a different analysis is needed for each of the two sentences. If up John’s phone number is not a
PP, then two remaining analyses may be possible. The VP could be the complex verb looked up
followed by an NP, or it could consist of three components: the V looked, a particle up, and an NP.
Either of these is a better solution. What types of tests might you do to decide between them?
As you develop a grammar, each constituent is used in more and more different ways. As a result,
you have a growing number of tests that can be performed to see if a new analysis is reasonable or
not. Sometimes the analysis of a new form might force you to back up and modify the existing
grammar. This backward step is unavoidable given the current state of linguistic knowledge. The
important point to remember, though, is that when a new rule is proposed for a grammar, you must
carefully consider its interaction with existing rules.
A parsing algorithm can be described as a procedure that searches through various ways of
combining grammatical rules to find a combination that generates a tree that could be the structure
of the input sentence. To keep this initial formulation simple, we will not explicitly construct the
tree. Rather, the algorithm will simply return a yes or no answer as to whether such a tree could be
built. In other words, the algorithm will say whether a certain sentence is accepted by the grammar
or not. This section considers a simple top-down parsing method in some detail and then relates
this to work in artificial intelligence (AI) on search procedures.
A top-down parser starts with the S symbol and attempts to rewrite it into a sequence of
terminal symbols that matches the classes of the words in the input sentence. The state of the parse
at any given time can be represented as a list of symbols that are the result of operations applied so
far, called the symbol list. For example, the parser starts in the state (S) and after applying the rule
S -> NP VP the symbol list will be (NP VP). If it then applies the rule NP ->ART N, the symbol list will
be (ART N VP), and so on.
The parser could continue in this fashion until the state consisted entirely of terminal
symbols, and then it could check the input sentence to see if it matched. But this would be quite
wasteful, for a mistake made early on (say, in choosing the rule that rewrites S) is not discovered
until much later. A better algorithm checks the input as soon as it can. In addition, rather than
having a separate rule to indicate the possible syntactic categories for each word, a structure called
the lexicon is used to efficiently store the possible categories for each word. For now the lexicon
will be very simple. A very small lexicon for use in the examples is
cried: V
dogs: N, V
the: ART
With a lexicon specified, a grammar, such as that shown as Grammar 3.4, need not contain any
lexical rules.
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP
Grammar 3.4
Given these changes, a state of the parse is now defined by a pair: a symbol list similar to before
and a number indicating the current position in the sentence. Positions fall between the words,
with 1 being the position before the first word. For example, here is a sentence with its positions
indicated:
1 The 2 dogs 3 cried 4
A typical parse state would be
((N VP) 2)
indicating that the parser needs to find an N followed by a VP, starting at position two. New states
are generated from old states depending on whether the first symbol is a lexical symbol or not. If it
is a lexical symbol, like N in the preceding example, and if the next word can belong to that lexical
category, then you can update the state by removing the first symbol and updating the position
counter. In this case, since the word dogs is listed as an N in the lexicon, the next parser state would
be
((VP) 3)
which means it needs to find a VP starting at position 3. If the first symbol is a non-terminal, like
VP, then it is rewritten using a rule from the grammar. For example, using rule 4 in Grammar 3.4,
the new state would be
((V) 3)
which means it needs to find a V starting at position 3. On the other hand, using rule 5, the new
state would be
((V NP) 3)
A parsing algorithm that is guaranteed to find a parse if there is one must systematically explore
every possible new state. One simple technique for this is called backtracking. Using this approach,
rather than generating a single new state from the state ((VP) 3), you generate all possible new
states. One of these is picked to be the next state and the rest are saved as backup states. If you ever
reach a situation where the current state cannot lead to a solution, you simply pick a new current
state from the list of backup states. Here is the algorithm in a little more detail.
The algorithm starts with the initial state ((S) 1) and no backup states.
1. Select the current state: Take the first state off the possibilities list and call it C. If the
possibilities list is empty, then the algorithm fails (that is, no successful parse is possible).
2. If C consists of an empty symbol list and the word position is at the end of the sentence, then
the algorithm succeeds.
3. Otherwise, generate the next possible states.
3.1. If the first symbol on the symbol list of C is a lexical symbol, and the next word in the
sentence can be in that class, then create a new state by removing the first symbol from
the symbol list and updating the word position, and add it to the possibilities list.
3.2. Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new
state for each rule in the grammar that can rewrite that nonterminal symbol and add
them all to the possibilities list.
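The algorithm can be rendered almost line for line in code. The sketch below uses Grammar 3.4 and the small lexicon above; the possibilities list is kept as a stack (the depth-first discipline discussed later), and positions are 0-indexed rather than 1-indexed.

    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
        "VP": [["V"], ["V", "NP"]],
    }
    LEXICON = {"the": {"ART"}, "dogs": {"N", "V"}, "cried": {"V"}}

    def parse(words):
        possibilities = [(("S",), 0)]            # the initial state ((S) 1)
        while possibilities:
            symbols, pos = possibilities.pop(0)  # step 1: select the current state C
            if not symbols:
                if pos == len(words):
                    return True                  # step 2: success
                continue                         # dead end; fall back to a backup state
            first, rest = symbols[0], tuple(symbols[1:])
            if first in GRAMMAR:                 # step 3.2: rewrite a non-terminal
                new_states = [(tuple(rhs) + rest, pos) for rhs in GRAMMAR[first]]
                possibilities = new_states + possibilities
            elif pos < len(words) and first in LEXICON.get(words[pos], set()):
                possibilities.insert(0, (rest, pos + 1))  # step 3.1: match a word
        return False                             # possibilities list empty: no parse

    print(parse("the dogs cried".split()))       # True
    print(parse("dogs the cried".split()))       # False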
Consider an example. Using Grammar 3.4, Figure 3.5 shows a trace of the algorithm on the sentence
The dogs cried. First, the initial S symbol is rewritten using rule 1 to produce a new current state of
((NP VP) 1) in step 2. The NP is then rewritten in turn, but since there are two possible rules for NP
in the grammar, two possible states are generated: The new current state involves (ART N VP) at
position 1, whereas the backup state involves (ART ADJ N VP) at position 1. In step 4 a word in
category ART is found at position 1 of the sentence, and the new current state becomes (N VP). The
backup state generated in step 3 remains untouched. The parse continues in this fashion to step 5,
where two different rules can rewrite VP. The first rule generates the new current state, while the
other rule is pushed onto the stack of backup states. The parse completes successfully in step 7,
since the current state is empty and all the words in the input sentence have been accounted for.
Consider the same algorithm and grammar operating on the sentence
1 The 2 old 3 man 4 cried 5
In this case assume that the word old is ambiguous between an ADJ and an N and that the word
man is ambiguous between an N and a V (as in the sentence The sailors man the boats). Specifically,
the lexicon is
the: ART
old: ADJ, N
man: N, V
cried: V
The parse proceeds as follows (see Figure 3.6). The initial S symbol is rewritten by rule 1 to
produce the new current state of ((NP VP) 1). The NP is rewritten in turn, giving the new state of
((ART N VP) 1) with a backup state of ((ART ADJ N VP) 1). The parse continues, finding the as an
ART to produce the state ((N VP) 2) and then old as an N to obtain the state ((VP) 3). There are now
two ways to rewrite the VP, giving us a current state of ((V) 3) and the backup states of ((V NP) 3)
and ((ART ADJ N VP) 1) from before. The word man can be parsed as a V, giving the state (() 4).
Unfortunately, while the symbol list is empty, the word position is not at the end of the sentence, so
no new state can be generated and a backup state must be used. In the next cycle, step 8, ((V NP) 3)
is attempted. Again man is taken as a V and the new state ((NP) 4) generated. None of the rewrites
of NP yield a successful parse. Finally, in step 12, the last backup state, ((ART ADJ N VP) 1), is tried
and leads to a successful parse.
Figure 3.6 A top-down parse of 1 The 2 old 3 man 4 cried 5
For a depth-first strategy, the possibilities list is a stack. In other words, step 1 always takes the
first element off the list, and step 3 always puts the new states on the front of the list, yielding a
last-in first-out (LIFO) strategy.
In contrast, in a breadth-first strategy the possibilities list is manipulated as a queue. Step 3 adds
the new positions onto the end of the list, rather than the beginning, yielding a first-in first-out
(FIFO) strategy.
Figure 3.7 Search tree for two parse strategies (depth-first strategy on left; breadth-first on right)
We can compare these search strategies using a tree format, as in Figure 3.7, which shows
the entire space of parser states for the last example. Each node in the tree represents a parser
state, and the sons of a node are the possible moves from that state. The number beside each node
records when the node was selected to be processed by the algorithm. On the left side is the order
produced by the depth-first strategy, and on the right side is the order produced by the breadth-
first strategy. Remember, the sentence being parsed is 1 The 2 old 3 man 4 cried 5
The main difference between depth-first and breadth-first searches in this simple example is
the order in which the two possible interpretations of the first NP are examined. With the depth-
first strategy, one interpretation is considered and expanded until it fails; only then is the second
one considered. With the breadth-first strategy, both interpretations are considered alternately,
each being expanded one step at a time. In this example, both depth-first and breadth-first searches
found the solution but searched the space in a different order. A depth-first search often moves
quickly to a solution but in other cases may spend considerable time pursuing futile paths. The
breadth-first strategy explores each possible solution to a certain depth before moving on. In this
particular example, the depth-first strategy found the solution in one less step than the breadth-
first. (The state in the bottom right hand side of Figure 3.7 was not explored by the depth-first
parse.)
In certain cases it is possible to put these simple search strategies into an infinite loop. For
example, consider a left-recursive rule that could be a first account of the possessive in English (as
in the NP the man ‘s coat):
NP -> NP 's N
With a naive depth-first strategy, a state starting with the non-terminal NP would be rewritten to a
new state beginning with NP 's N. But this state also begins with an NP that could be rewritten in the
same way. Unless an explicit check were incorporated into the parser, it would rewrite NPs forever!
The breadth-first strategy does better with left-recursive rules, as it tries all other ways to rewrite
the original NP before coming to the newly generated state with the new NP. But with an
ungrammatical sentence it would not terminate because it would rewrite the NP forever while
searching for a solution. For this reason, many systems prohibit left-recursive rules from the
grammar.
Many parsers built today use the depth-first strategy because it tends to minimize the
number of backup states needed and thus uses less memory and requires less bookkeeping.
The main difference between top-down and bottom-up parsers is the way the grammar rules are
used. For example, consider the rule
NP -> ART ADJ N
In a top-down system you use the rule to find an NP by looking for the sequence ART ADJ N. In a
bottom-up parser you use the rule to take a sequence ART ADJ N that you have found and identify
it as an NP. The basic operation in bottom-up parsing then is to take a sequence of symbols and
match it to the right-hand side of the rules. You could build a bottom-up parser simply by
formulating this matching process as a search process. The state would simply consist of a symbol
list, starting with the words in the sentence. Successor states could be generated by exploring all
possible ways to:
- rewrite a word by its possible lexical categories
- replace a sequence of symbols that matches the right-hand side of a grammar rule by its left-hand side symbol
1. S -> NP VP
2. NP -> ART ADJ N
3. NP -> ART N
4. NP -> ADJ N
5. VP -> AUX VP
6. VP -> V NP
Grammar 3.8 A simple context-free grammar
Assume you are parsing a sentence that starts with an ART. With this ART as the key, rules 2 and 3
are matched because they start with ART. To record this for analyzing the next key, you need to
record that rules 2 and 3 could be continued at the point after the ART. You denote this fact by
writing the rule with a dot (●), indicating what has been seen so far. Thus you record
2'. NP -> ART ●ADJ N
3'. NP -> ART ● N
If the next input key is an ADJ, then rule 4 may be started, and the modified rule 2 may be extended
to give
2''. NP -> ART ADJ ● N
The chart maintains the record of all the constituents derived from the sentence so far in the
parse. It also maintains the record of rules that have matched partially but are not complete. These
are called the active arcs. For example, after seeing an initial ART followed by an ADJ in the
preceding example, you would have the chart shown in Figure 3.9. You should interpret this figure
as follows. There are two completed constituents on the chart: ART1 from position 1 to 2 and ADJ1
from position 2 to 3. There are four active arcs indicating possible constituents. These are indicated
by the arrows and are interpreted as follows (from top to bottom). There is a potential NP starting
at position 1, which needs an ADJ starting at position 2. There is another potential NP starting at
position 1, which needs an N starting at position 2. There is a potential NP starting at position 2
with an ADJ, which needs an N starting at position 3. Finally, there is a potential NP starting at
position 1 with an ART and then an ADJ, which needs an N starting at position 3.
The basic operation of a chart-based parser involves combining an active arc with a
completed constituent. The result is either a new completed constituent or a new active arc that is
an extension of the original active arc. New completed constituents are maintained on a list called
the agenda until they themselves are added to the chart. This process is defined more precisely by
the arc extension algorithm shown in Figure 3.10.
As with the top-down parsers, you may use a depth-first or breadth-first search strategy,
depending on whether the agenda is implemented as a stack or a queue. Also, for a full breadth-first
strategy, you would need to read in the entire input and add the interpretations of the words onto
the agenda before starting the algorithm. Let us assume a depth-first search strategy for the
following example.
Consider using the algorithm on the sentence The large can can hold the water using
Grammar 3.8 with the following lexicon:
the: ART
large: ADJ
can: N,AUX,V
hold: N,V
water: N, V
To best understand the example, draw the chart as it is extended at each step of the algorithm. The
agenda is initially empty, so the word the is read and a constituent ART1 is placed on the agenda.
Entering ART1: (the from 1 to 2)
Adds active arc NP -> ART ● ADJ N from 1 to 2
Adds active arc NP -> ART ● N from 1 to 2
Both these active arcs were added by step 3 of the parsing algorithm and were derived from rules 2
and 3 in the grammar, respectively. Next the word large is read and a constituent ADJ1 is created.
Entering ADJ1: (large from 2 to 3)
Adds arc NP -> ADJ ● N from 2 to 3
Adds arc NP -> ART ADJ ● N from 1 to 3
The first arc was added in step 3 of the algorithm. The second arc added here is an extension
of the first active arc that was added when ART1 was added to the chart using the arc extension
algorithm (step 4).
The chart at this point has already been shown in Figure 3.9. Notice that active arcs are
never removed from the chart. For example, when the arc NP -> ART ● ADJ N from 1 to 2 was
extended, producing the arc from 1 to 3, both arcs remained on the chart. This is necessary because
the arcs could be used again in a different way by another interpretation.
For the next word, can, three constituents, N1, AUX1, and V1 are created for its three
interpretations.
Entering N1: (can from 3 to 4)
No active arcs are added in step 2, but two are completed in step 4 by the arc extension algorithm,
producing two NPs that are added to the agenda: The first, an NP from 1 to 4, is constructed from
rule 2, while the second, an NP from 2 to 4, is constructed from rule 4. These NPs are now at the top
of the agenda.
Entering NP1: an NP (the large can from 1 to 4)
Adding active arc S -> NP ● VP from 1 to 4
Entering NP2: an NP (large can from 2 to 4)
Adding arc S -> NP ● VP from 2 to 4
Entering AUX1: (can from 3 to 4)
Adding arc VP -> AUX ● VP from 3 to 4
Entering V1: (can from 3 to 4)
Adding arc VP -> V ● NP from 3 to 4
The chart is shown in Figure 3.12, which illustrates all the completed constituents (NP2, NP1,
ART1, ADJ1, N1, AUX1, V1) and all the uncompleted active arcs entered so far. The next word is can
again, and N2, AUX2, and V2 are created.
Entering N2: (can from 4 to 5, the second can)
Adds no active arcs
Entering AUX2: (can from 4 to 5)
Adds arc VP -> AUX ● VP from 4 to 5
Entering V2: (can from 4 to 5)
Adds arc VP -> V ● NP from 4 to 5
The next word is hold, and N3 and V3 are created.
Entering N3: (hold from 5 to 6)
Adds no active arcs
Entering V3: (hold from 5 to 6)
Adds arc VP -> V ● NP from 5 to 6
The chart in Figure 3.13 shows all the completed constituents built so far, together with all the
active arcs, except for those used in the first NP.
Entering ART2: (the from 6 to 7)
Adding arc NP -> ART ● ADJ N from 6 to 7
Adding arc NP -> ART ● N from 6 to 7
Entering N4: (water from 7 to 8)
No active arcs added in step 3
An NP, NP3, from 6 to 8 is pushed onto the agenda, by completing arc
NP -> ART ● N from 6 to 7
Entering NP3: (the water from 6 to 8)
A VP, VP1, from 5 to 8 is pushed onto the agenda, by completing
VP -> V ● NP from 5 to 6
Adds arc S -> NP ● VP from 6 to 8
The chart at this stage is shown in Figure 3.14, but only the active arcs to be used in the remainder
of the parse are shown.
Entering VP1: (hold the water from 5 to 8)
A VP, VP2, from 4 to 8 is pushed onto the agenda, by completing
VP -> AUX ● VP from 4 to 5
Entering VP2: (can hold the water from 4 to 8)
An S, S1, is added from 1 to 8, by completing arc
S -> NP ● VP from 1 to 4
A VP, VP3, is added from 3 to 8, by completing arc
VP -> AUX ● VP from 3 to 4
An S, S2, is added from 2 to 8, by completing arc
S -> NP ● VP from 2 to 4
Since you have derived an S covering the entire sentence, you can stop successfully. If you wanted
to find all possible interpretations for the sentence, you would continue parsing until the agenda
became empty. The chart would then contain as many S structures covering the entire set of
positions as there were different structural interpretations. In addition, this representation of the
entire set of structures would be more efficient than a list of interpretations, because the different S
structures might share common subparts represented in the chart only once. Figure 3.15 shows the
final chart.
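A compressed sketch of this chart procedure is given below, using Grammar 3.8 and the lexicon above. It simplifies the bookkeeping in one respect: an active arc is extended only at the moment a constituent enters the chart, so the agenda is processed as a queue (left to right) rather than as the depth-first stack used in the trace; it still derives an S covering the whole sentence.

    GRAMMAR = [
        ("S",  ["NP", "VP"]),
        ("NP", ["ART", "ADJ", "N"]),
        ("NP", ["ART", "N"]),
        ("NP", ["ADJ", "N"]),
        ("VP", ["AUX", "VP"]),
        ("VP", ["V", "NP"]),
    ]
    LEXICON = {"the": ["ART"], "large": ["ADJ"], "can": ["N", "AUX", "V"],
               "hold": ["N", "V"], "water": ["N", "V"]}

    def chart_parse(words):
        chart = set()     # completed constituents: (category, start, end)
        arcs = []         # active arcs: (lhs, remaining rhs, start, end)
        agenda = [(cat, i, i + 1) for i, w in enumerate(words) for cat in LEXICON[w]]
        while agenda:
            cat, start, end = agenda.pop(0)          # next completed constituent
            if (cat, start, end) in chart:
                continue
            chart.add((cat, start, end))
            for lhs, rhs in GRAMMAR:                 # start arcs whose rhs begins here
                if rhs[0] == cat:
                    arcs.append((lhs, rhs[1:], start, end))
            extended = []                            # the arc extension step
            for lhs, rest, a_start, a_end in arcs:
                if a_end == start and rest[0] == cat:
                    if len(rest) == 1:
                        agenda.append((lhs, a_start, end))      # arc completed
                    else:
                        extended.append((lhs, rest[1:], a_start, end))
            arcs.extend(extended)
        return sorted(c for c in chart if c[0] == "S")

    print(chart_parse("the large can can hold the water".split()))
    # includes ('S', 0, 7): an S spanning the whole seven-word sentence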
Efficiency Considerations
Chart-based parsers can be considerably more efficient than parsers that rely only on a search
because the same constituent is never constructed more than once. For instance, a pure top-down
or bottom-up search strategy could require up to C^n operations to parse a sentence of length n,
where C is a constant that depends on the specific algorithm you use. Even if C is very small, this
exponential complexity rapidly makes the algorithm unusable. A chart-based parser, on the other
hand, in the worst case would build every possible constituent between every possible pair of
positions. This allows us to show that it has a worst-case complexity of K*n^3, where n is the length
of the sentence and K is a constant depending on the algorithm. Of course, a chart parser involves
more work in each step, so K will be larger than C. To contrast the two approaches, assume that C is
10 and that K is a hundred times worse, 1000. Given a sentence of 12 words, the brute force search
might take 10^12 operations (that is, 1,000,000,000,000), whereas the chart parser would take 1000
* 12^3 (that is, 1,728,000). Under these assumptions, the chart parser would be up to 500,000 times
faster than the brute force search on some examples!
So far we have examined only one formalism for representing grammars, namely context-free
rewrite rules. Here we consider another formalism that is useful in a wide range of applications. It
is based on the notion of a transition network consisting of nodes and labeled arcs. One of the
nodes is specified as the initial state, or start state. Consider the network named NP in Grammar
3.16, with the initial state labeled NP and each arc labeled with a word category. Starting at the
initial state, you can traverse an arc if the current word in the sentence is in the category on the arc.
If the arc is followed, the current word is updated to the next word. A phrase is a legal NP if there is
a path from the node NP to a pop arc (an arc labeled pop) that accounts for every word in the
phrase. This network recognizes the same set of sentences as the following context-free grammar:
NP -> ART NP1
NP1 -> ADJ NP1
NP1 -> N
Consider parsing the NP a purple cow with this network. Starting at the node NP, you can follow the
arc labeled art, since the current word is an article— namely, a. From node NP1 you can follow the
arc labeled adj using the adjective purple, and finally, again from NP1, you can follow the arc labeled
noun using the noun cow. Since you have reached a pop arc, a purple cow is a legal NP.
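To make the traversal concrete, here is a minimal Python sketch of an NP network of this shape. The network table and the toy lexicon are illustrative assumptions (Grammar 3.16 itself is a figure not reproduced here), and the sketch simply commits to the first arc whose label matches:

# A toy lexicon mapping each word to a single category (an assumption).
LEXICON = {"a": "art", "the": "art", "purple": "adj", "cow": "noun"}

# Arcs: node -> list of (label, destination); "pop" marks an exit arc.
NP_NETWORK = {
    "NP":  [("art", "NP1")],
    "NP1": [("adj", "NP1"), ("noun", "NP2")],
    "NP2": [("pop", None)],
}

def accepts_np(words):
    node, pos = "NP", 0
    while True:
        for label, dest in NP_NETWORK[node]:
            if label == "pop":
                return pos == len(words)   # legal NP only if all words used
            if pos < len(words) and LEXICON.get(words[pos]) == label:
                node, pos = dest, pos + 1  # follow the arc, consume the word
                break
        else:
            return False                   # no arc can be followed

print(accepts_np("a purple cow".split()))  # True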
Simple transition networks are often called finite state machines (FSMs). Finite state
machines are equivalent in expressive power to regular grammars, and thus are not powerful
enough to describe all languages that can be described by a CFG. To get the descriptive power of
CFGs, you need a notion of recursion in the network grammar. A recursive transition network
(RTN) is like a simple transition network, except that it allows arc labels to refer to other networks
as well as word categories. Thus, given the NP network in Grammar 3.16, a network for simple
English sentences can be expressed as shown in Grammar 3.17. Uppercase labels refer to networks.
The arc from S to Si can be followed only if the NP network can be successfully traversed to a pop
arc. Although not shown in this example, RTNs allow true recursion—that is, a network might have
an arc labeled with its own name.
Consider finding a path through the S network for the sentence The purple cow ate the grass.
Starting at node S, to follow the arc labeled NP, you need to traverse the NP network. Starting at
node NP, traverse the network as before for the input the purple cow. Following the pop arc in the
NP network, return to the S network and traverse the arc to node S1. From node S1 you follow the
arc labeled verb using the word ate. Finally, the arc labeled NP can be followed if you can traverse
the NP network again. This time the remaining input consists of the words the grass. You follow the
arc labeled art and then the arc labeled noun in the NP network; then take the pop arc from node
NP2 and then another pop from node S3. Since you have traversed the network and used all the
words in the sentence, The purple cow ate the grass is accepted as a legal sentence.
In practice, RTN systems incorporate some additional arc types that are useful but not formally
necessary; Figure 3.18 summarizes the arc types, together with the notation that will be used in
this book to indicate these arc types. According to this terminology, arcs that are labeled with
networks are called push arcs, and arcs labeled with word categories are called cat arcs. In
addition, an arc that can always be followed is called a jump arc.
An algorithm for parsing with RTNs can be developed along the same lines as the algorithms for
parsing CFGs. The state of the parse at any moment can be represented by the following:
current position - a pointer to the next word to be parsed.
current node - the node at which you are located in the network.
return points - a stack of nodes in other networks where you will continue if you pop from the current network.
Grammar 3.19
First, consider an algorithm for searching an RTN that assumes that if you can follow an arc, it will
be the
correct one in the final parse. Say you are in the middle of a parse and know the three pieces of
information just cited. You can leave the current node and traverse an arc in the following cases:
Case 1: If arc names word category and next word in sentence is in that category,
Then (1) update current position to start at the next word;
(2) update current node to the destination of the arc.
Case 2: If arc is a push arc to a network N,
Then (1) add the destination of the arc onto return points;
(2) update current node to the starting node in network N.
Case 3: If arc is a pop arc and return points list is not empty,
Then (1) remove first return point and make it current node.
Case 4: If arc is a pop arc, return points list is empty and there are no words left,
Then (1) parse completes successfully.
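These four cases translate almost line for line into code. The following Python sketch is illustrative only: the state representation, the convention that a network's start node shares the network's name, and the use of uppercase labels for push arcs are all assumptions. The network and category tables are loosely reconstructed to be consistent with the traces in Figures 3.20 and 3.21, since Grammar 3.19 itself is not reproduced here:

# Reconstructed, illustrative networks and word categories (assumptions).
NETWORKS = {
    "S":  {"S": [("NP", "S1")], "S1": [("verb", "S2")],
           "S2": [("pop", None), ("NP", "S2")]},
    "NP": {"NP":  [("art", "NP1"), ("number", "NP1"), ("pronoun", "NP2")],
           "NP1": [("adj", "NP1"), ("noun", "NP2")],
           "NP2": [("pop", None)]},
}
CATEGORIES = {"the": {"art"}, "old": {"adj"}, "man": {"noun"},
              "cried": {"verb"}, "one": {"number", "pronoun"},
              "saw": {"verb", "noun"}}

def step(networks, categories, words, state):
    # One parse step over a state (position, current node, return points);
    # this naive version commits to the first arc that can be followed.
    pos, (net, node), returns = state
    for label, dest in networks[net][node]:
        if label == "pop":
            if returns:                                   # Case 3
                return (pos, returns[0], returns[1:])
            if pos == len(words):                         # Case 4
                return "success"
        elif label.isupper():                             # Case 2: push arc
            return (pos, (label, label), ((net, dest),) + returns)
        elif pos < len(words) and label in categories[words[pos]]:
            return (pos + 1, (net, dest), returns)        # Case 1: cat arc
    return "failure"

def accepts(networks, categories, words):
    state = (0, ("S", "S"), ())
    while state not in ("success", "failure"):
        state = step(networks, categories, words, state)
    return state == "success"

print(accepts(NETWORKS, CATEGORIES, "the old man cried".split()))  # True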
Grammar 3.19 shows a network grammar. The numbers on the arcs simply indicate the order in
which arcs will be tried when more than one arc leaves a node.
Figure 3.20 demonstrates that the grammar accepts the sentence
1 The 2 old 3 man 4 cried 5
by showing the sequence of parse states that can be generated by the algorithm. In the trace, each
arc is identified by the name of the node that it leaves plus the number identifier. Thus arc S/1 is
the arc labeled 1 leaving the S node. If you start at node S, the only possible arc to follow is the push
arc NP. As specified in case 2 of the algorithm, the new parse state is computed by setting the
current node to NP and putting node S1 on the return points list. From node NP, arc NP/1 is
followed and, as specified in case 1 of the algorithm, the input is checked for a word in category art.
Since this check succeeds, the arc is followed and the current position is updated (step 3). The
parse continues in this manner to step 5, when a pop arc is followed, causing the current node to be
reset to S1 (that is, the NP arc succeeded). The parse succeeds after finding a verb in step 6 and
following the pop arc from the S network in step 7.
In this example the parse succeeded because the first arc that succeeded was ultimately the
correct one in every case. However, with a sentence like The green faded, where green can be an
adjective or a noun, this algorithm would fail because it would initially classify green as an adjective
and then not find a noun following. To be able to recover from such failures, we save all possible
backup states as we go along, just as we did with the CFG top-down parsing algorithm.
Consider this technique in operation on the following sentence:
1 One 2 saw 3 the 4 man 5
The parser initially attempts to parse the sentence as beginning with the NP one saw, but after
failing to find a verb, it backtracks and finds a successful parse starting with the NP one. The trace
of the parse is shown in Figure 3.21, where at each stage the current parse state is shown in the
form of a triple (current node, current position, return points), together with possible states for
backtracking. The figure also shows the arcs used to generate the new state and backup states.
This trace behaves identically to the previous example except in two places. In step 2, two arcs
leaving node NP could accept the word one. Arc NP/2 classifies one as a number and produces the
next current state. Arc NP/3 classifies it as a pronoun and produces a backup state. This backup
state is actually used later in step 6 when it is found that none of the arcs leaving node S1 can
accept the input word the.
Of course, in general, many more backup states are generated than in this simple example.
In these cases there will be a list of possible backup states. Depending on how this list is organized,
you can produce different orderings on when the states are examined.
Step Current State Arc to be Followed Backup States
1. (S, 1, NIL) S/1 NIL
2. (NP, 1, (S1)) NP/2 (& NP/3 for backup) NIL
3. (NP1, 2, (S1)) NP1/2 (NP2, 2, (S1))
4. (NP2, 3, (S1)) NP2/1 (NP2, 2, (S1))
5. (S1, 3, NIL) no arc can be followed (NP2, 2, (S1))
6. (NP2, 2, (S1)) NP2/1 NIL
7. (S1, 2, NIL) S1/1 NIL
8. (S2, 3, NIL) S2/2 NIL
9. (NP, 3, (S2)) NP/1 NIL
10. (NP1, 4, (S2)) NP1/2 NIL
11. (NP2, 5, (S2)) NP2/1 NIL
12. (S2, 5, NIL) S2/1 NIL
13. Parse succeeds NIL
Figure 3.21 A top-down RTN parse with backtracking
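The same tables can drive a backtracking version: instead of committing to the first usable arc, collect every state reachable by one arc and keep the alternatives as backup states. Again a hedged sketch, reusing the NETWORKS and CATEGORIES tables assumed in the previous fragment:

def successors(networks, categories, words, state):
    # All states reachable from this state by one arc (cases 1-4 above).
    pos, (net, node), returns = state
    out = []
    for label, dest in networks[net][node]:
        if label == "pop":
            if returns:
                out.append((pos, returns[0], returns[1:]))
            elif pos == len(words):
                out.append("success")
        elif label.isupper():
            out.append((pos, (label, label), ((net, dest),) + returns))
        elif pos < len(words) and label in categories[words[pos]]:
            out.append((pos + 1, (net, dest), returns))
    return out

def accepts_bt(networks, categories, words):
    states = [(0, ("S", "S"), ())]          # current state plus backups
    while states:
        state = states.pop(0)
        if state == "success":
            return True
        # Depth-first: explore new states before the saved backup states.
        states = successors(networks, categories, words, state) + states
    return False

print(accepts_bt(NETWORKS, CATEGORIES, "one saw the man".split()))  # True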
An RTN parser can be constructed to use a chart-like structure to gain the advantages of chart
parsing. In RTN systems, the chart is often called the well-formed substring table (WFST). Each
time a pop is followed, the constituent is placed on the WFST, and every time a push is found, the
WFST is checked before the sub-network is invoked. If the chart contains constituent(s) of the type
being pushed for, these are used and the sub-network is not reinvoked. An RTN using a WFST has
the same complexity as the chart parser described in the last section: K*n^3, where n is the length
the sentence.
So far, you have seen a simple top-down method and a bottom-up chart-based method for parsing
context-free grammars. Each of the approaches has its advantages and disadvantages. In this
section a new parsing method is presented that actually captures the advantages of both. But first,
consider the pluses and minuses of the approaches.
Top-down methods have the advantage of being highly predictive. A word might be
ambiguous in isolation, but if some of those possible categories cannot be used in a legal sentence,
then these categories may never even be considered. For example, consider Grammar 3.8 in a top-
down parse of the sentence The can holds the water, where can may be an AUX, V, or N, as before.
The top-down parser would rewrite (S) to (NP VP) and then rewrite the NP to produce three
possibilities, (ART ADJ N VP), (ART N VP), and (ADJ N VP). Taking the first, the parser checks if the
first word, the, can be an ART, and then if the next word, can, can be an ADJ, which fails. Trying the
next possibility, the parser checks the again, and then checks if can can be an N, which succeeds.
The interpretations of can as an auxiliary and a main verb are never considered because no
syntactic tree generated by the grammar would ever predict an AUX or V in this position. In
contrast, the bottom-up parser would have considered all three interpretations of can from the
start—that is, all three would be added to the chart and would combine with active arcs. Given this
argument, the top-down approach seems more efficient.
On the other hand, consider that the top-down parser in the example above needed to check that
the word the was an ART twice, once for each rule. This duplication of effort is very common in
pure top-down approaches and becomes a serious problem, as large constituents may be rebuilt
again and again as they are used in different rules. In contrast, the bottom-up parser only checks
the input once, and only builds each constituent exactly once. So by this argument, the bottom-up
approach appears more efficient.
You can gain the advantages of both by combining the methods. A small variation in the
bottom-up chart algorithm yields a technique that is predictive like the top-down approaches yet
avoids any reduplication of work as in the bottom-up approaches.
As before, the algorithm is driven by an agenda of completed constituents and the arc
extension algorithm, which combines active arcs with constituents when they are added to the
chart. While both use the technique of extending arcs with constituents, the difference is in how
new arcs are generated from the grammar. In the bottom-up approach, new active arcs are
generated whenever a completed constituent is added that could be the first constituent of the
right-hand side of a rule. With the top-down approach, new active arcs are generated whenever a
new active arc is added to the chart, as described in the top-down arc introduction algorithm
shown in Figure 3.22. The parsing algorithm is then easily stated, as is also shown in Figure 3.22.
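Since Figure 3.22 is not reproduced here, the following Python sketch suggests one way the top-down arc introduction rule might look; the arc representation (left-hand side, matched part, remaining part, start, end) and the grammar table are assumptions made for illustration. Note the recursion: each newly added empty arc may itself trigger further arc introductions, exactly as described above.

def introduce_arcs(grammar, chart, arc):
    lhs, matched, remaining, start, end = arc
    if not remaining:
        return
    needed = remaining[0]               # the category right after the dot
    for rhs in grammar.get(needed, []):
        new_arc = (needed, (), tuple(rhs), end, end)
        if new_arc not in chart:        # add each empty arc only once
            chart.add(new_arc)
            introduce_arcs(grammar, chart, new_arc)

GRAMMAR = {"S": [["NP", "VP"]], "NP": [["ART", "N"], ["ART", "ADJ", "N"]]}
chart = set()
s_arc = ("S", (), ("NP", "VP"), 1, 1)
chart.add(s_arc)
introduce_arcs(GRAMMAR, chart, s_arc)
print(sorted(chart))   # the S arc plus the two empty NP arcs at position 1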
Consider this new algorithm operating with the same grammar on the same sentence as in Section
3.4, namely The large can can hold the water. In the initialization stage, an arc labeled S -> o NP VP
is added. Then, active arcs for each rule that can derive an NP are added: NP -> o ART ADJ N,
NP -> o ART N, and NP -> o ADJ N are all added from position 1 to 1. Thus the initialized chart is as
shown in Figure 3.23. The trace of the parse is as follows:
Entering ART1 (the) from 1 to 2
Two arcs can be extended by the arc extension algorithm
NP -> ART o N from 1 to 2
NP -> ART o ADJ N from 1 to 2
Entering ADJ1 (large) from 2 to 3
One arc can be extended
NP -> ART ADJ o N from 1 to 3
Entering AUX1 (can) from 3 to 4
No activity, constituent is ignored
Entering V1 (can) from 3 to 4
No activity, constituent is ignored
Entering N1 (can) from 3 to 4
One arc extended and completed yielding
NP1 from 1 to 4 (the large can)
Entering NP1 from 1 to 4
One arc can be extended
S -> NP o VP from 1 to 4
Using the top-down rule (step 4), new active arcs are added for VP
VP -> o AUX VP from 4 to 4
VP -> o V NP from 4 to 4
At this stage, the chart is as shown in Figure 3.24. Compare this with Figure 3.10. It contains
fewer completed constituents since only those that are allowed by the top-down filtering have been
constructed.
The algorithm continues, adding the three interpretations of can as an AUX, V, and N. The
AUX reading extends the VP -> o AUX VP arc at position 4 and adds active arcs for a new VP starting
at position 5. The V reading extends the VP -> o V NP arc and adds active arcs for an NP starting at
position 5. The N reading does not extend any arc and so is ignored. After the two readings of hold
(as an N and V) are added, the chart is as shown in Figure 3.25. Again, compare with the
corresponding chart for the bottom-up parser in Figure 3.13. The rest of the sentence is parsed
similarly, and the final chart is shown in Figure 3.26. In comparing this to the final chart produced
by the bottom-up parser (Figure 3.15), you see that the number of constituents generated has
dropped from 21 to 13. While it is not a big difference here with such a simple grammar, the
difference can be dramatic with a sizable
grammar.
It turns out in the worst-case analysis that the top-down chart parser is not more efficient than the
pure bottom-up chart parser. Both have a worst-case complexity of K*n^3 for a sentence of length n.
In practice, however, the top-down method is considerably more efficient for any reasonable
grammar.
Context-free grammars provide the basis for most of the computational parsing mechanisms
developed to date, but as they have been described so far, they would be very inconvenient for
capturing natural languages. This chapter describes an extension to the basic context-free
mechanism that defines constituents by a set of features. This extension allows aspects of natural
language such as agreement and sub-categorization to be handled in an intuitive and concise way.
In natural languages there are often agreement restrictions between words and phrases. For
example, the NP "a men" is not correct English because the article a indicates a single object while
the noun "men" indicates a plural object; the noun phrase does not satisfy the number agreement
restriction of English. There are many other forms of agreement, including subject-verb agreement,
gender agreement for pronouns, restrictions between the head of a phrase and the form of its
complement, and so on. To handle such phenomena conveniently, the grammatical formalism is
extended to allow constituents to have features. For example, we might define a feature NUMBER
that may take a value of either s (for singular) or p (for plural), and we then might write an
augmented CFG rule such as
NP -> ART N only when NUMBER1 agrees with NUMBER2
This rule says that a legal noun phrase consists of an article followed by a noun, but only when the
number feature of the first word agrees with the number feature of the second. This one rule is
equivalent to two CFG rules that would use different terminal symbols for encoding singular and
plural forms of all noun phrases, such as
NP-SING -> ART-SING N-SING
NP-PLURAL -> ART-PLURAL N-PLURAL
While the two approaches seem similar in ease-of-use in this one example, consider that all rules in
the grammar that use an NP on the right-hand side would now need to be duplicated to include a
rule for NP-SING and a rule for NP-PLURAL, effectively doubling the size of the grammar. And
handling additional features, such as person agreement, would double the size of the grammar
again and again. Using features, the size of the augmented grammar remains the same as the
original one yet accounts for agreement constraints.
To accomplish this, a constituent is defined as a feature structure - a mapping from features
to values that defines the relevant properties of the constituent. In the examples in this book,
feature names in formulas will be written in boldface. For example, a feature structure for a
constituent ART1 that represents a particular use of the word a might be written as follows:
ART1: (CAT ART
ROOT a
NUMBER s)
This says it is a constituent in the category ART that has as its root the word a and is singular.
Usually an abbreviation is used that gives the CAT value more prominence and provides an
intuitive tie back to simple context-free grammars. In this abbreviated form, constituent ART1
would be written as
ART1: (ART ROOT a NUMBER s)
Feature structures can be used to represent larger constituents as well. To do this, feature
structures themselves can occur as values. Special features based on the integers - 1, 2, 3, and so on
- will stand for the first sub-constituent, second sub-constituent, and so on, as needed. With this, the
representation of the NP constituent for the phrase "a fish" could be
NP1: (NP NUMBER s
1 (ART ROOT a
NUMBER s)
2 (N ROOT fish
NUMBER s))
Note that this can also be viewed as a representation of a parse tree shown in Figure 3.27, where
the subconstituent features 1 and 2 correspond to the subconstituent links in the tree.
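A feature structure of this kind maps naturally onto a nested dictionary. The following Python sketch shows one possible encoding; the dict representation is an assumption made for illustration, not a standard:

# Feature structures as nested dicts (an illustrative encoding).
ART1 = {"CAT": "ART", "ROOT": "a", "NUMBER": "s"}
N1 = {"CAT": "N", "ROOT": "fish", "NUMBER": "s"}

NP1 = {
    "CAT": "NP",
    "NUMBER": "s",
    1: ART1,   # first subconstituent: the article "a"
    2: N1,     # second subconstituent: the noun "fish"
}

# Following the links 1 and 2 recovers the parse tree of Figure 3.27.
print(NP1[1]["ROOT"], NP1[2]["ROOT"])   # a fish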
The rules in an augmented grammar are stated in terms of feature structures rather than
simple categories. Variables are allowed as feature values so that a rule can apply to a wide range of
situations. For example, a rule for simple noun phrases would be as follows:
(NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)
This says that an NP constituent can consist of two subconstituents, the first being an ART and the
second being an N, in which the NUMBER feature in all three constituents is identical. According to
this rule, constituent NP1 given previously is a legal constituent. On the other hand, a constituent whose ART and N subconstituents carry different NUMBER values would not be accepted by this rule.
Since number and person features always co-occur, it is convenient to combine the two into a
single feature, AGR, that has six possible values: first person singular (1s), second person singular
(2s), third person singular (3s), and first, second and third person plural (1p, 2p, and 3p,
respectively). For example, an instance of the word is can agree only with a third person singular
subject, so its AGR feature would be 3s. An instance of the word are, however, may agree with
second person singular or any of the plural forms, so its AGR feature would be a variable ranging
over the values {2s 1p 2p 3p}.
To handle the interactions between words and their complements, an additional feature, SUBCAT,
is used. Chapter 2 described some common verb sub-categorization possibilities. Each one will
correspond to a different value of the SUBCAT feature. Figure 3.28 shows some SUBCAT values for
complements consisting of combinations of NPs and VPs. To help you remember the meaning of the
feature values, they are formed by listing the main category of each part of the complement. If the
category is restricted by a feature value, then the feature value follows the constituent separated by
a colon. Thus the value _np_vp:inf will be used to indicate a complement that consists of an NP
followed by a VP with VFORM value inf. Of course, this naming is just a convention to help the
reader; you could give these values any arbitrary name, since their significance is determined solely
by the grammar rules that involve the feature. For instance, the rule for verbs with a SUBCAT value
of _np_vp:inf would be
(VP) -> (V SUBCAT _np_vp:inf)
(NP)
(VP VFORM inf)
This says that a VP can consist of a V with SUBCAT value _np_vp:inf, followed by an NP, followed by
a VP with VFORM value inf. Clearly, this rule could be rewritten using any other unique symbol
instead of _np_vp:inf, as long as the lexicon is changed to use this new value.
Many verbs have complement structures that require a prepositional phrase with a
particular preposition, or one that plays a particular role. For example, the verb give allows a
complement consisting of an NP followed by a PP using the preposition to, as in "Jack gave the
money to the bank". Other verbs, such as "put", require a prepositional phrase that describes a
location, using prepositions such as "in", "inside", "on", and "by". To express this within the feature
system, we introduce a feature PFORM on prepositional phrases. A prepositional phrase with a
PFORM value such as TO must have the preposition to as its head, and so on. A prepositional phrase
with a PFORM value LOC must describe a location. Another useful PFORM value is MOT, used with
verbs such as walk, which may take a prepositional phrase that describes some aspect of a path, as
in We walked to the store. Prepositions that can create such phrases include to, from, and along.
The LOC and MOT values might seem hard to distinguish, as certain prepositions might describe
either a location or a path, but they are distinct. For example, while Jack put the box {in on by} the
corner is fine, *Jack put the box {to from along} the corner is ill-formed. Figure 3.29 summarizes the
PFORM feature.
Figure 3.29 Some values of the PFORM feature for prepositional phrases
This feature can be used to restrict the complement forms for various verbs. Using the
naming convention discussed previously, the SUBCAT value of a verb such as put would be
_np_pp:loc, and the appropriate rule in the grammar would be
(VP) -> (V SUBCAT _np_pp:loc)
(NP)
(PP PFORM LOC)
For embedded sentences, a complementizer is often needed and must be subcategorized for. Thus a
COMP feature with possible values for, that, and no-comp will be useful. For example, the verb tell
can subcategorize for an S that has the complementizer that. Thus one SUBCAT value of tell will be
_s:that. Similarly, the verb wish subcategorizes for an S with the complementizer for, as in We
wished for the rain to stop. Thus one value of the SUBCAT feature for wish is _s:for. Figure 3.30 lists
some of these additional SUBCAT values and examples for a variety of verbs. In this section, all the
examples with the SUBCAT feature have involved verbs, but nouns, prepositions, and adjectives
may also use the SUBCAT feature and subcategorize for their complements in the same way.
Binary Features
Certain features are binary in that a constituent either has or doesn’t have the feature. In our
formalization a binary feature is simply a feature whose value is restricted to be either + or -. For
example, the INV feature is a binary feature that indicates whether or not an S structure has an
inverted subject (as in a yes/no question). The S structure for the sentence Jack laughed will have
an INV value -, whereas the S structure for the sentence Did Jack laugh? will have the INV value +.
Often, the value is used as a prefix, and we would say that a structure has the feature +INV or -INV.
Other binary features will be introduced as necessary throughout the development of the
grammars.
Before you can specify a grammar, you must define the lexicon. This section explores some
issues in lexicon design and the need for a morphological analysis component .
The lexicon must contain information about all the different words that can be used,
including all the relevant feature value restrictions. When a word is ambiguous, it may be described
by multiple entries in the lexicon, one for each different use.
Because words tend to follow regular morphological patterns, however, many forms of
words need not be explicitly included in the lexicon. Most English verbs, for example, use the same
set of suffixes to indicate different forms: -s is added for third person singular present tense, -ed for
past tense, -ing for the present participle, and so on. Without any morphological analysis, the
lexicon would have to contain every one of these forms. For the verb want this would require six
entries, for want (both in base and present form), wants, wanting, and wanted (both in past and
past participle form).
In contrast, by using the methods described in previous Section to strip suffixes there needs
to be only one entry for want. The idea is to store the base form of the verb in the lexicon and use
context-free rules to combine verbs with suffixes to derive the other entries. Consider the following
rule for present tense verbs:
(V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) ->
(V ROOT ?r SUBCAT ?s VFORM base) (+S)
where +S is a new lexical category that contains only the suffix morpheme -s. This rule, coupled
with the
lexicon entry
want:(V ROOT want
SUBCAT {_np_vp:inf _vp:inf}
VFORM base)
would produce the following constituent given the input string want +s
want:(V ROOT want
SUBCAT {_np_vp:inf _vp:inf}
VFORM pres
AGR 3s)
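The derivation just shown can be mimicked with a few lines of code. In this illustrative sketch the entry and rule representations are assumptions; the rule simply copies the base entry and overrides VFORM and AGR, leaving ROOT and SUBCAT untouched, which is the essential point:

LEXICON = {
    "want": {"CAT": "V", "ROOT": "want",
             "SUBCAT": {"_np_vp:inf", "_vp:inf"}, "VFORM": "base"},
}

def present_3s(entry):
    # (V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) ->
    #     (V ROOT ?r SUBCAT ?s VFORM base) (+S)
    if entry["CAT"] == "V" and entry["VFORM"] == "base":
        derived = dict(entry)      # ROOT and SUBCAT carry over via ?r, ?s
        derived.update(VFORM="pres", AGR="3s")
        return derived
    return None                    # rule does not apply

# Applying the rule to the sequence want +s:
print(present_3s(LEXICON["want"]))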
Another rule would generate the constituents for the present tense form not in third person
singular, which for most verbs is identical to the root form:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p})
-> (V ROOT ?r SUBCAT ?s VFORM base)
But this rule needs to be modified in order to avoid generating erroneous interpretations.
Currently, it can transform any base form verb into a present tense form, which is clearly wrong for
some irregular verbs. For instance, the base form be cannot be used as a present form (for example,
*We be at the store). To cover these cases, a feature is introduced to identify irregular forms.
Specifically, verbs with the binary feature +IRREG-PRES have irregular present tense forms. Now
the rule above can be stated correctly:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p})
-> (V ROOT ?r SUBCAT ?s VFORM base IRREG-PRES -)
Because of the default mechanism, the IRREG-PRES feature need only be specified on the
irregular verbs. The regular verbs default to -, as desired. Similar binary features would be needed
to flag irregular past forms (IRREG-PAST, such as saw), and to distinguish -en past participles from
-ed past participles (EN-PASTPRT). These features restrict the application of the standard lexical
rules, and the irregular forms are added explicitly to the lexicon. Grammar 4.5 gives a set of rules
for deriving different verb and noun forms using these features.
Grammar 4.5 Some lexical rules for common suffixes on verbs and nouns
Given a large set of features, the task of writing lexical entries appears very difficult. Most
frameworks allow some mechanisms that help alleviate these problems. The first technique -
allowing default values for features - has already been mentioned. With this capability, if an entry
takes a default value for a given feature, then it need not be explicitly stated. Another commonly
used technique is to allow the lexicon writer to define clusters of features, and then indicate a
cluster with a single symbol rather than listing them all. Later, additional techniques will be
discussed that allow the inheritance of features in a feature hierarchy.
Figure 3.31 contains a small lexicon. It contains many of the words to be used in the
examples that follow. It contains three entries for the word "saw" - as a noun, as a regular verb, and
as the irregular past tense form of the verb "see" - as illustrated in the sentences
The saw was broken.
Jack wanted me to saw the board in half.
I saw Jack eat the pizza.
With an algorithm for stripping the suffixes and regularizing the spelling, as described in previous
Section, the derived entries can be generated using any of the basic parsing algorithms on Grammar
4.5. With the lexicon in Figure 3.31 and Grammar 4.5, correct constituents for the following words
can be derived: been, being, cries, cried, crying, dogs, saws (two interpretations), sawed, sawing,
seen, seeing, seeds, wants, wanting, and wanted. For example, the word cries would be transformed
into the sequence cry +s, and then rule 1 would produce the present tense entry from the base form
in the lexicon.
Figure 3.31 A lexicon
Often a word will have multiple interpretations that use different entries and different
lexical rules. The word saws, for instance, transformed into the sequence saw +s, can be a plural
noun (via rule 7 and the first entry for saw), or the third person present form of the verb saw (via
rule 1 and the second entry for saw). Note that rule 1 cannot apply to the third entry, as its VFORM
is not base.
The success of this approach depends on being able to prohibit erroneous derivations, such
as analyzing seed as the past tense of the verb "see". This analysis will never be considered if the
FST that strips suffixes is correctly designed. Specifically, the word see will not allow a transition to
the states that allow the -ed suffix. But even if this were produced for some reason, the IRREG-PAST
value + in the entry for see would prohibit rule 3 from applying.
3.10 Parsing with Features
The parsing algorithms developed in Chapter 3 for context-free grammars can be extended to
handle augmented context-free grammars. This involves generalizing the algorithm for matching
rules to constituents. For instance, the chart-parsing algorithms developed in Chapter 3 all used an
operation for extending active arcs with a new constituent. A constituent X could extend an arc of
the form
C -> C1 ... Ci o X ... Cn
to produce a new arc of the form
C -> C1 ... Ci X o ... Cn
A similar operation can be used for grammars with features, but the parser may have to instantiate
variables in the original arc before it can be extended by X. The key to defining this matching
operation precisely is to remember the definition of grammar rules with features. A rule such as
1. (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a)
says that an NP can be constructed out of an ART and an N if all three agree on the AGR feature. It
does not place any restrictions on any other features that the NP, ART, or N may have. Thus, when
matching constituents against this rule, the only thing that matters is the AGR feature. All other
features in the constituent can be ignored. For instance, consider extending arc 1 with the
constituent
2. (ART ROOT A AGR 3s)
To make arc 1 applicable, the variable ?a must be instantiated to 3s, producing
3. (NP AGR 3s) -> o (ART AGR 3s) (N AGR 3s)
This arc can now be extended because every feature in the rule is in constituent 2:
4. (NP AGR 3s) -> (ART AGR 3s) o (N AGR 3s)
Now, consider extending this arc with the constituent for the word dog:
5. (N ROOT DOG1 AGR 3s)
This can be done because the AGR features agree. This completes the arc
6. (NP AGR 3s) -> (ART AGR 3s) (N AGR 3s)
This means the parser has found a constituent of the form (NP AGR 3s).
This algorithm can be specified more precisely as follows: Given an arc A, where the
constituent following the dot is called NEXT, and a new constituent X, which is being used to extend
the arc,
a. Find an instantiation of the variables such that all the features specified in NEXT are found
in X.
b. Create a new arc A', which is a copy of A except for the instantiations of the variables
determined in step (a).
c. Update A' as usual in a chart parser.
For instance, let A be arc 1, and X be the ART constituent 2. Then NEXT will be (ART AGR ?a). In
step a, NEXT is matched against X, and you find that ?a must be instantiated to 3s. In step b, a new
copy of A is made, which is shown as arc 3. In step c, the arc is updated to produce the new arc
shown as arc 4.
When constrained variables, such as ?a{3s 3p}, are involved, the matching proceeds in the
same manner, but the variable binding must be one of the listed values. If a variable is used in a
constituent, then one of its possible values must match the requirement in the rule. If both the rule
and the constituent contain variables, the result is a variable ranging over the intersection of their
allowed values.
For instance, consider extending arc 1 with the constituent (ART ROOT the AGR ?v{3s 3p}),
that is, the word "the". To apply, the variable ?a would have to be instantiated to ?v{3s 3p},
producing the rule
(NP AGR ?v{3s 3p}) -> (ART AGR ?v{3s 3p}) o (N AGR ?v{3s 3p})
This arc could be extended by (N ROOT dog AGR 3s), because ?v{3s 3p} could be instantiated by the
value 3s. The resulting arc would be identical to arc 6. The entry in the chart for the is not changed
by this operation. It still has the value ?v{3s 3p}. The AGR feature is restricted to 3s only in the arc.
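The matching operation in steps (a)-(c), including the treatment of constrained variables by intersecting their allowed values, can be sketched as follows. The dict-based feature structures, the "?" naming convention for variables, and the use of Python sets for constrained values are all assumptions made for illustration:

def unify_values(a, b):
    # Atomic values must be equal; sets act as constrained variables.
    sa = a if isinstance(a, set) else {a}
    sb = b if isinstance(b, set) else {b}
    common = sa & sb
    if not common:
        return None                     # incompatible values
    return next(iter(common)) if len(common) == 1 else common

def match(next_pat, constituent):
    # Step (a): every feature specified in NEXT must be found in X.
    bindings = {}
    for feat, want in next_pat.items():
        if feat not in constituent:
            return None
        have = constituent[feat]
        if isinstance(want, str) and want.startswith("?"):
            merged = unify_values(bindings.get(want, have), have)
            if merged is None:
                return None
            bindings[want] = merged     # instantiate the variable
        elif unify_values(want, have) is None:
            return None
    return bindings

def extend_arc(arc, constituent):
    # Steps (b) and (c): copy the arc with instantiated variables and
    # move the dot past the newly matched constituent.
    lhs, done, rest = arc
    bindings = match(rest[0], constituent)
    if bindings is None:
        return None
    def subst(fs):
        return {f: (bindings.get(v, v) if isinstance(v, str) else v)
                for f, v in fs.items()}
    return (subst(lhs), done + [subst(rest[0])],
            [subst(fs) for fs in rest[1:]])

arc1 = ({"CAT": "NP", "AGR": "?a"}, [],
        [{"CAT": "ART", "AGR": "?a"}, {"CAT": "N", "AGR": "?a"}])
art = {"CAT": "ART", "ROOT": "a", "AGR": "3s"}
print(extend_arc(arc1, art))   # the NP arc with ?a instantiated to 3s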
Another extension is useful for recording the structure of the parse. Subconstituent features (1, 2,
and so on, depending on which subconstituent is being added) are automatically inserted by the
parser each time an arc is extended. The values of these features name subconstituents already in
the chart.
With this treatment, and assuming that the chart already contains two constituents, ART1
and N1, for the words the and dog, the constituent added to the chart for the phrase the dog would
be
(NP AGR 3s
1 ART1
2 N1)
where ART1 = (ART ROOT the AGR {3s 3p}) and N1 = (N ROOT dog AGR {3s}). Note that the AGR
feature of ART1 was not changed. Thus it could be used with other interpretations that require the
value 3p if they are possible. Any of the chart-parsing algorithms described in Chapter 3 can now be
used with an augmented grammar by using these extensions to extend arcs and build constituents.
Consider an example. Figure 3.32 contains the final chart produced from parsing the sentence He
wants to cry using Grammar 4.8. The rest of this section considers how some of the non-terminal
symbols were constructed for the chart.
Grammar for parsing with features
Figure 3.32 The chart for "He wants to cry".
Features can also be added to a recursive transition network to produce an augmented transition
network (ATN). Features in an ATN are traditionally called registers. Constituent structures are
created by allowing each network to have a set of registers. Each time a new network is pushed, a
new set of registers is created. As the network is traversed, these registers are set to values by
actions associated with each arc. When the network is popped, the registers are assembled to form
a constituent structure, with the CAT slot being the network name.
Grammar 4.11 is a simple NP network. The actions are listed in the table below the network.
ATNs use a special mechanism to extract the result of following an arc. When a lexical arc, such as
arc 1, is followed, the constituent built from the word in the input is put into a special variable
named "*". The action
DET := *
then assigns this constituent to the DET register. The second action on this arc,
AGR := AGR*
assigns the AGR register of the network to the value of the AGR register of the new word (the
constituent in "*").
Agreement checks are specified in the tests. A test is an expression that succeeds if it returns a
nonempty value and fails if it returns the empty set or nil.
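The register machinery can be suggested with a small sketch. Here an arc's tests and actions are modeled as Python callables and "*" as an explicit parameter; all of these representational choices are assumptions for illustration:

def follow_lexical_arc(registers, star, tests, actions):
    # Run the arc's tests first; a test fails if it returns nil/empty.
    for test in tests:
        if not test(registers, star):
            return None
    # Then perform the register-setting actions, e.g. DET := *.
    for reg, fn in actions:
        registers[reg] = fn(registers, star)
    return registers

# Arc 1 of Grammar 4.11 as described above: DET := *, AGR := AGR*
actions = [("DET", lambda regs, star: star),
           ("AGR", lambda regs, star: star["AGR"])]
the = {"CAT": "ART", "ROOT": "the", "AGR": {"3s", "3p"}}
regs = follow_lexical_arc({}, the, [], actions)
print(regs["AGR"])   # {'3s', '3p'} (set print order may vary)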
Figure 3.33 Trace tests and actions used with "1 The 2 dog 3 saw 4 Jack 5"
With the lexicon in Section 4.3, the ATN accepts the following sentences:
The dog cried.
The dogs saw Jack.
Jack saw the dogs.
Consider an example. A trace of a parse of the sentence "The dog saw Jack" is shown in Figure 3.33.
It indicates the current node in the network, the current word position, the arc that is followed
from the node, and the register manipulations that are performed for the successful parse. It starts
in the S network but moves immediately to the NP network from the call on arc 4. The NP network
checks for number agreement as it accepts the word sequence The dog. It constructs a noun phrase
with the AGR feature 3s. When the pop arc is followed, it completes arc 4 in the S network. The
NP is assigned to the SUBJ register and then checked for agreement with the verb when arc 3 is
followed. The NP
"Jack" is accepted in another call to the NP network.
Here is a more comprehensive example of the use of an ATN to describe some declarative
sentences. The allowed sentence structure is an initial NP followed by a main verb, which may then
be followed by a maximum of two NPs and many PPs, depending on the verb. Using the feature
system extensively, you can create a grammar that accepts any of the preceding complement forms,
leaving the actual verb-complement agreement to the feature restrictions. Grammar 4.14 shows the
S network. Arcs are numbered using the conventions described previously. For instance, the arc S3/1 is the
arc labeled 1 leaving node S3. The NP network in Grammar 4.15 allows simple names, bare plural
nouns, pronouns, and a simple sequence of a determiner followed by an adjective and a head noun.
Allowable noun complements include an optional number of prepositional phrases. The
prepositional phrase network in Grammar 4.16 is straightforward. Examples of parsing sentences
with this grammar are left for the exercises.
Presetting Registers
One further extension to the feature-manipulation facilities in ATNs involves the ability to
preset registers in a network as that network is being called, much like parameter passing in a
programming language. This facility, called the SENDR action in the original ATN systems, is useful
to pass information to the network that aids in analyzing the new constituent.
Consider the class of verbs, including want and pray, that accept complements using the
infinitive forms of verbs, which are introduced by the word to. According to the classification in
Section 4.2, this includes the following:
Grammar 4.16 The PP network
In the context-free grammar developed earlier, such complements were treated as VPs with the
VFORM value inf. To capture this same analysis in an ATN, you would need to be able to call a
network corresponding to VPs but preset the VFORM register in that network to inf. Another
common analysis of these constructs is to view the complements as a special form of sentence with
an understood subject. In the first case it is Mary who would be the understood subject (that is, the
host), while in the other case it is John. To capture this analysis, many ATN grammars preset the
SUBJ register in the new S network when it is called.
UNIT – IV
Semantic Interpretation: word senses and ambiguity, Basic logical form language, Encoding
ambiguity in logical from, Thematic roles, Linking syntax and semantics, Recent trends in NLP.
This chapter introduces the basic ideas underlying theories of meaning, or semantics. It introduces
a level of context-independent meaning called the logical form, which can be produced directly
from the syntactic structure of a sentence. Because it must be context independent, the logical form
does not contain the results of any analysis that requires interpretation of the sentence in context.
Can we define a notion of sentence meaning that is independent of context? In other words,
is there a level at which the sentence "Do you know what gate you are going to?" has a single
meaning, but may be used for different purposes? This is a complex issue, but there are many
advantages to trying to make such an approach work. The primary argument is modularity. If such
a division can be made, then we can study sentence meaning in detail without all the complications
of sentence usage. In particular, if sentences have no context-independent meaning, then we may
not be able to separate the study of language from the study of general human reasoning and
context. As you will see in the next few chapters, there are many examples of constraints based on
the meaning of words that appear to be independent of context. So from now on, we will use the
term "meaning" in this context-independent sense, and we will use the term "usage" for the
context-dependent aspects. The representation of context-independent meaning is called the
logical form. The process of mapping a sentence to its logical form is called semantic interpretation,
and the process of mapping the logical form to the final knowledge representation (KR) language is
called contextual interpretation. Figure 4.1 shows a simple version of the stages of interpretation.
The exact meaning of the notations used will be defined later.
For the moment let us assume the knowledge representation language is the first-order
predicate calculus (FOPC). Given that assumption, what is the status of the logical form? In some
approaches the logical form is defined as the literal meaning of the utterance, and the logical form
language is the same as the final knowledge representation language. If this is to be a viable
approach in the long run, however, it would mean that the knowledge representation must be
considerably more complex than representations in present use in AI systems. For instance, the
logical form language must allow indexical terms, that is, terms that are defined by context. The
pronouns "I" and "you" are indexical because their interpretation depends on the context of who is
speaking and listening. In fact most definite descriptions (such as "the red ball") are indexical, as
the object referred to can only be identified with respect to a context. Many other aspects of
language, including the interpretation of tense and determining the scope of quantifiers, depend on
context as well and thus cannot be uniquely determined at the logical form level. Of course, all of
this could be treated as ambiguity at the logical form level, but this would be impractical, as every
sentence would have large numbers of possible logical forms (as in the sentence "The red ball
dropped", which would have a different logical form for every possible object that could be
described as a ball that is red).
But if the logical form language is not part of the knowledge representation language, what
is its formal status? A promising approach has been developed in linguistics over the last decade
that suggests an answer that uses the notion of a situation, which is a particular set of
circumstances in the world. This corresponds reasonably well to the intuitive notion of the
meaning of "situation" in English. For instance, when attending a class, you are in a situation where
there are fellow students and an instructor, where certain utterances are made by the lecturer,
questions asked, and so on. Also, there will be objects in the lecture hall, say a blackboard and
chairs, and so on. More formally, you might think of a situation as a set of objects and relations
between those objects. A very simple situation might consist of two objects, a ball B0005 and a
person P86, and include the relationship that the person owns the ball. Let us encode this situation
as the set {(BALL B0005), (PERSON P86), (OWNS P86 B0005)}.
Language creates special types of situations based on what information is conveyed. These
issues will be explored in detail later, but for now consider the following to help your intuition. In
any conversation or text, assume there is a discourse situation that records the information
conveyed so far. A new sentence is interpreted with respect to this situation and produces a new
situation that includes the information conveyed by the new sentence. Given this view, the logical
form is a function that maps the discourse situation in which the utterance was made to a new
discourse situation that results from the occurrence of the utterance. For example, assume that the
situation we just encoded has been created by some preceding sentences describing the ball and
who owns it. The utterance "The ball is red" might produce a new situation that consists of the old
situation plus the new fact that B0005 has the property RED: {(BALL B0005), (PERSON P86),
(OWNS P86 B0005), (RED B0005)}. Figure 4.2 shows this view of the interpretation process,
treating the logical form as a function between situations. The two organizations presented in
Figures 4.1 and 4.2 differ in that the latter might not include a single identifiable expression in the
knowledge representation that fully captures the "meaning" of the sentence. Rather, the logical
form might make a variety of changes to produce the updated situation. This allows other
implications to be derived from an utterance that are not directly captured in the semantic content
of the sentence.
Such issues will become important later when we discuss contextual interpretation.
Even though much of language is highly context dependent, there is still considerable
semantic structure to language that is context independent and that can be used in the semantic
interpretation process to produce the logical form. Much of this semantic knowledge consists of the
type of information you can find in dictionaries - the basic semantic properties of words (that is,
whether they refer to relations, objects, and so on), what different senses are possible for each
word, what senses may combine to form larger semantic structures, and so on. Identifying these
forms of information and using this information to compute a logical form are the focus of Part II of
the book.
Note that while the logical forms are presented in a predicate-argument form here, the same
distinctions are made in most other meaning representations. For instance, a network
representation would have nodes that correspond to the word senses and arcs that indicate the
predicate-argument structure. The meaning of the sentence Sue loves Jack in a semantic network
like representation might appear in one of the two forms shown in Figure 4.3. For most purposes
all of these representation formalisms are equivalent.
More complex propositions are constructed using a new class of constants called logical
operators. For example, the operator NOT allows you to construct a proposition that says that some
proposition is not true. The proposition corresponding to the sentence Sue does not love Jack would
be
(NOT (LOVES1 SUE1 JACK1))
English also contains operators that combine two or more propositions to form a complex
proposition. FOPC contains operators such as disjunction (v), conjunction (&), what is often called
implication (⊃), and other forms (there are 16 possible truth functional binary operators in FOPC).
English contains many similar operators, including or, and, if, and only if, and so on. Natural language
connectives often involve more complex relationships between sentences. For instance, the
conjunction "and" might correspond to the logical operator "&" but often also involves temporal
sequencing, as in "I went home and had a drink", in which going home preceded having the drink.
The connective "but", on the other hand, is like "and" except that the second argument is something
that the hearer might not expect to be true given the first argument. The general form for such a
proposition is (connective proposition proposition). For example, the logical form of the sentence
"Jack loves Sue or Jack loves Mary" would be (OR1 (LOVES1 JACK1 SUE1) (LOVES1 JACK1
MARY1)). The logical form language will allow both operators corresponding to word senses and
operators like "&" directly from FOPC. The logic based operators will be used to connect
propositions not explicitly conjoined in the sentence.
With the simple propositions we just defined, we can define the meaning of only a very
limited subset of English, namely sentences consisting of simple verb forms and proper names. To
account for more complex sentences, we must define additional semantic constructs. One
important construct is the quantifier. In first-order predicate calculus, there are only two
quantifiers: ∀ and ∃. English contains a much larger range of quantifiers, including all, some, most,
many, a few, the, and so on. To allow quantifiers, variables are introduced as in first-order logic but
with an important difference. In first-order logic a variable only retains its significance within the
scope of the quantifier. Thus two instances of the same variable x occurring in two different
formulas - say in the formulas ∃x. P(x) and ∃x. Q(x) - are treated as completely different variables
with no relation to each other. Natural languages display a different behavior. For instance,
consider the two sentences "A man entered the room. He walked over to the table." The first
sentence introduces a new object to the discussion, namely some man. You might think to treat the
meaning of this sentence along the lines of the existential quantifier in logic. But the problem is that
the man introduced existentially in the first sentence is referred to by the pronoun "He" in the
second sentence. So variables appear to continue their existence after being introduced. To allow
this, each time a discourse variable is introduced, it is given a unique name not used before. Under
the right circumstances, a subsequent sentence can then refer back to this term.
Natural language quantifiers have restricted ranges and thus are more complex than those
found in FOPC. In FOPC a formula of the form ∀x. P(x) is true if and only if P(x) is true for every
possible object in the domain (that is, x may be any term in the language). Such statements are rare
in natural language. Rather, you would say "all dogs bark" or "most people laughed", which require
constructs that are often called generalized quantifiers. These quantifiers are used in statements of
the general form
(quantifier variable : restriction-proposition body-proposition)
For instance, the sentence "Most dogs bark" would have the logical form
(MOST1 d1 : (DOG1 d1) (BARKS1 d1))
This means that most of the objects d1 that satisfy (DOG1 d1) also satisfy (BARKS1 d1). Note that
this has a very different meaning from the formula
(MOST1 d2 : (BARKS1 d2) (DOG1 d2))
which roughly captures the meaning of the sentence "Most barking things are dogs".
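The difference between the two formulas is easy to see if the quantifier is evaluated over a tiny finite model. The sketch below assumes MOST1 means "more than half of the restriction set", which is one common reading; the model itself is invented for illustration:

DOGS = {"d1", "d2", "d3"}
BARKS = {"d1", "d2", "c1", "c2", "c3"}   # c1-c3: barking non-dogs
DOMAIN = DOGS | BARKS

def most1(domain, restriction, body):
    # (MOST1 x : restriction body): more than half of the objects that
    # satisfy the restriction must also satisfy the body.
    restricted = {x for x in domain if restriction(x)}
    return 2 * len({x for x in restricted if body(x)}) > len(restricted)

print(most1(DOMAIN, lambda x: x in DOGS, lambda x: x in BARKS))   # True
print(most1(DOMAIN, lambda x: x in BARKS, lambda x: x in DOGS))   # False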
A very important class of generalized quantifiers corresponds to the articles "the" and "a". The
sentence "The dog barks" would have a logical form
(THE x: (DOG1 x) (BARKS1 x))
which would be true only if there is a uniquely determined dog in context and that dog barks.
Clearly, in any natural setting there will be many dogs in the world, so the use of context to identify
the correct one is crucial for understanding the sentence. Since this identification process requires
context, however, discussion of it is delayed until Part III. Here it suffices to have a way to write the
logical form. The special set of quantifiers corresponding to the articles (or the absence of articles
in bare noun phrases) is shown in Figure 4.4.
More complex noun phrases will result in more complex restrictions. For instance, the
sentence "The happy dog barks" will involve a restriction that is a conjunction, namely (THE x (&
(DOG1 x) (HAPPY x)) (BARKS1 x))
This will be true only if there is a contextually unique x such that (& (DOG1 x) (HAPPY x)) is true,
and this x barks.
Another construct needs to be introduced to handle plural forms, as in a phrase such as the
dogs bark. A new type of constant called a predicate operator is introduced that takes a predicate as
an argument and produces a new predicate. For plurals the predicate operator PLUR will be used. If
DOG1 is a predicate that is true of any dog, then (PLUR DOG1) is a predicate that is true of any set
of dogs. Thus the representation of the meaning of the sentence "The dogs bark" would be
(THE x: ((PLUR DOG1) x) (BARKS1 x))
Plural noun phrases introduce the possibility of a new form of ambiguity. Note that the natural
reading of The dogs bark is that there is a specific set of dogs, and each one of them barks. This is
called the distributive reading, since the predicate BARKS1 is distributed over each element of the
set. In contrast, consider the sentence The dogs met at the corner. In this case, it makes no sense to
say that each individual dog met; rather the meeting is true of the entire set of dogs. This is called
the collective reading. Some sentences allow both interpretations and hence are ambiguous. For
instance, the sentence Two men bought a stereo can mean that two men each bought a stereo (the
distributive reading), or that two men bought a stereo together (the collective reading).
The final constructs to be introduced are the modal operators. These are needed to
represent the meaning of verbs such as believe and want, for representing tense, and for many
other constructs. Modal operators look similar to logical operators but have some important
differences. Specifically, terms within the scope of a modal operator may have an interpretation
that differs from the normal one. This affects what conclusions you can draw from a proposition.
For example, assume that Jack is also known as John to some people. There are two word senses
that are equal; that is, JACK1 = JOHN22. With a simple proposition, it doesn't matter which of these
two constants is used: if (HAPPY JOHN22) is true then (HAPPY JACK1) is true, and vice versa. This
is true even in complex propositions formed from the logical operators. If (OR (HAPPY JOHN22)
(SAD JOHN22)) is true, then so is (OR (HAPPY JACK1) (SAD JACK1)), and vice versa. The same
propositions within the scope of a modal operator such as BELIEVE1, however, are not
interchangeable. For instance, if Sue believes that Jack is happy, that is,
(BELIEVE SUE1 (HAPPY JACK1))
then it does not necessarily follow that Sue believes John is happy, that is,
(BELIEVE SUE1 (HAPPY JOHN22))
because Sue might not know that JACK1 and JOHN22 are the same person. Thus you cannot freely
substitute equal terms when they occur within the scope of a modal operator. This is often referred
to as the failure of substitutivity in modal contexts.
An important class of modal operators for natural language are the tense operators, PAST,
PRES, and FUT. So far all examples have ignored the effect of tense. With these new operators,
however, you can represent the difference
in meaning between John sees Fido, John saw Fido, and John will see Fido, namely as the propositions
(PRES (SEES1 JOHN1 FIDO1))
(PAST (SEES1 JOHN1 FIDO1))
(FUT (SEES1 JOHN1 FIDO1))
You can see that these are modal operators because they exhibit the failure of substitutivity. For
example, consider the operator PAST, and assume two constants, say JOHN1 and PRESIDENT1, that
are equal now, indicating that John is currently the president. But in the past, John was not the
president, so JOHN1 did not equal
PRESIDENT1. Given this and the fact that John saw Fido in the past, (PAST (SEES1 JOHN1 FIDO1)),
you cannot conclude that the president saw Fido in the past, that is, (PAST (SEES1 PRESIDENT1
FIDO1)), since John was not the president at that time. Note also that a proposition and its negation
can both be true in the past (but at different times). Thus it is possible for both the sentences John
was happy and John was not happy to be true; that is, (PAST (HAPPY JOHN1)) and (PAST (NOT
(HAPPY JOHN1))) are both true.
This completes the specification of the basic logical form language. The next sections discuss
various extensions that make the language more convenient for expressing ambiguity and
capturing semantic regularities.
The previous sections defined many of the constructs needed to specify the logical form of a
sentence, and if you were interested solely in the nature of logical form, you could be finished. But
representations for computational use have another important constraint on them, namely the
handling of ambiguity. A typical sentence will have multiple possible syntactic structures, each of
which might have multiple possible logical forms. In addition, the words in the sentence will have
multiple senses. Simply enumerating the possible logical forms will not be practical. Rather, we will
take an approach where certain common ambiguities can be collapsed and locally represented
within the logical form, and we will develop techniques to incrementally resolve these ambiguities
as additional constraints from the rest of the sentence and from context are brought into play.
Many researchers view this ambiguity encoding as a separate level of representation from the
logical form, and it is often referred to as the quasi-logical form.
Perhaps the greatest source of ambiguity in the logical form comes from the fact that most
words have multiple senses. Some of these senses have different structural properties, so they can
be eliminated given the context of the surrounding sentence. But often words have different senses
that have identical structural constraints. At present, the only way to encode these would be to
build a separate logical form for each possible combination of senses for the words in the sentence.
To reduce this explosion of logical forms, we can use the same technique used to handle multiple
feature values in the syntactic structure. Namely, anywhere an atomic sense is allowed, a set of
possible atomic senses can be used. For example, the noun "ball" has at least two senses: BALL1,
the object used in games, and BALL2, the social event involving dancing. Thus the sentence "Sue
watched the ball" is ambiguous out of context. A single logical form can represent these two
possibilities, however:
1. (THE b1 : ({BALL1 BALL2} b1) (PAST (WATCH1 SUE1 b1)))
This abbreviates two possible logical forms, namely
2. (THE b1 : (BALL1 b1) (PAST (WATCH1 SUE1 b1)))
and
3. (THE b1 : (BALL2 b1) (PAST (WATCH1 SUE1 b1)))
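To make the collapsing idea concrete, here is a minimal sketch in Python (our own illustration, not part of the original notation): an unresolved sense choice is represented as a set, and a quasi-logical form can be expanded on demand into its fully disambiguated readings.

from itertools import product

# Logical forms as nested tuples; a Python set marks an unresolved sense choice.
qlf = ("THE", "b1", ({"BALL1", "BALL2"}, "b1"),
       ("PAST", ("WATCH1", "SUE1", "b1")))

def expand(form):
    """Enumerate every fully disambiguated logical form abbreviated by form."""
    if isinstance(form, set):          # an ambiguous sense set
        return sorted(form)
    if isinstance(form, tuple):        # expand each argument and take the product
        return [tuple(choice) for choice in product(*(expand(part) for part in form))]
    return [form]                      # an atomic sense or variable

for reading in expand(qlf):
    print(reading)                     # one reading with BALL1, one with BALL2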
One of the most complex forms of ambiguity in logical forms arises from the relative scoping
of the quantifiers and operators. You saw in Section 4.1 that a sentence such as "Every boy loves a
dog" is ambiguous between two readings depending on the scope of the quantifiers. There is no
context-independent method for resolving such issues, so the ambiguity must be represented in the
final logical forms for the sentence. Rather than enumerating all possible scopings, which would
lead to an exponentially growing number of interpretations based on the number of scoping
constructs, we introduce another abbreviation into the logical form language that collapses
interpretations together. Specifically, the abbreviated logical form does not contain scoping
information at all. Rather, constructs such as generalized quantifiers are treated syntactically like
terms and appear in the position indicated by the syntactic structure of the sentence. They are
marked using angle brackets to indicate the scoping abbreviation. For example, the logical forms
for the sentence "Every boy loves a dog" are captured by a single ambiguous form
(LOVES1 <EVERY b1 (BOY1 b1)> <A d1 (DOG1 d1)>)
This abbreviates an ambiguity between the logical form
(EVERY b1 : (BOY1 b1) (A d1 : (DOG1 d1) (LOVES1 b1 d1)))
and
(A d1: (DOG1 d1) (EVERY b1 : (BOY1 b1) (LOVES1 b1 d1)))
While the savings don't amount to much in this example, consider that a sentence with four
scoping constructs in it would have 24 (4 factorial) possible orderings, and one with five scoping
constructs would have 120 orderings. The abbreviation convention allows all these forms to be
collapsed to a single representation. Chapter 12 will explore some heuristic techniques for
determining the scope of operators. For the time being, however, it is reasonable to assume that no
context-independent scoping constraints need be represented.
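The combinatorics can be checked with a small sketch (the tuple encoding and helper names are ours): pulling each unscoped quantifier term out of the proposition in every possible order produces n! scoped readings, which is exactly what the angle-bracket abbreviation avoids enumerating.

from itertools import permutations

quants = [("EVERY", "b1", "BOY1"), ("A", "d1", "DOG1")]   # unscoped terms
body = ("LOVES1", "b1", "d1")

def scope(order, body):
    """Wrap body with the quantifiers in order, outermost first."""
    form = body
    for q, var, restr in reversed(order):
        form = (q, var, (restr, var), form)
    return form

for order in permutations(quants):
    print(scope(list(order), body))
# Two orderings here; four scoping constructs would already yield 24.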
If the restriction in a generalized quantifier is a proposition involving a simple unary
predicate, a further abbreviation will be used that drops the variable. Thus the form <EVERY b1
(BOY1 b1)> will often be abbreviated as <EVERY b1 BOY1>.
A large number of constructs in natural language are sensitive to scoping. In particular, all
the generalized quantifiers, including "the", are subject to scoping. For example, in "At every hotel,
the receptionist was friendly", the preferred reading in almost any context has the definite reference
"the receptionist" fall within the scope of "every hotel"; that is, there is a different receptionist at
each hotel.
In addition, operators such as negation and tense are also scope sensitive. For example, the
sentence "Every boy didn't run" is ambiguous between the reading in which some boys didn't run
and some did, that is,
(NOT (EVERY b1 : (BOY1 b1) (RUN1 b1 )))
and the reading where no boys ran, that is,
(EVERY b1 : (BOY1 b1) (NOT (RUN1 b1)))
These two readings are captured by the single logical form
(<NOT RUN1> <EVERY b1 BOY1>)
where unscoped unary operators (for example, NOT, PAST, PRES, and so on) are wrapped around
the predicate.
Finally, let us return to two constructs that need to be examined further: proper names and
pronouns. So far we have assumed that each proper name identifies a sense that denotes an object
in the domain. While this was a useful abstraction to introduce the basic ideas of logical form, it is
not a general enough treatment of the phenomena. In fact, proper names must be interpreted in
context, and the name John will refer to different people in different situations. Our revised
treatment resolves these problems by using a discourse variable that has the property of having the
specified name. We will introduce this construct as a special function, namely
(NAME <variable> <name>)
which produces the appropriate object with the name in the current context. Thus, the logical form
of "John ran" would be (<PAST RUN1> (NAME j1 "John")).
Arguments similar to those previously given for proper names can be made for pronouns
and other indexical words, such as "here" and "yesterday", and we will treat them using a special
function of the form (PRO <variable> <proposition>). For example, the quasi-logical form for "Every
man liked him" would be
(<PAST LIKE1> <EVERY m1 MAN1> (PRO m2 (HE1 m2)))
HE1 is the sense for "he" and "him", and formally is a predicate true of objects that satisfy the
restrictions on any antecedent, that is, being animate-male in this case. As with generalized
quantifiers, when the restriction is a simple unary predicate, the pro forms are often abbreviated.
For example, the logical form for "he" will often be
written as (PRO m2 HE1).
The constructs described in this chapter dramatically reduce the number of logical forms
that must be initially computed for a given sentence. Not all ambiguities can be captured by the
abbreviations, however, so there will still be sentences that will require a list of possible logical
forms, even for a single syntactic structure.
This section examines theories based on the notion of thematic roles, or cases. One motivating
example from the last section included the sentences
John broke the window with the hammer.
The hammer broke the window.
The window broke.
"John", " the hammer", and "the window" play the same semantic roles in each of these sentences.
"John" is the actor, "the window" is the object, and "the hammer" is the instrument used in the act of
breaking of the window. We introduced relations such as AGENT, THEME, and INSTR to capture
these intuitions. But can we define these relations more precisely, and what other thematic roles
have proved useful in natural language systems? These issues are explored in this section.
Perhaps the easiest thematic role to define is the AGENT role. A noun phrase fills the AGENT
role if it describes the instigator of the action described by the sentence. Further, this role may
attribute intention, volition, or responsibility for the action to the agent described. One test for
AGENT-hood involves adding phrases like "intentionally" or "in order to" to active voice sentences.
If the resulting sentence is well formed, the subject NP can fill the AGENT role. The following
sentences are acceptable:
John intentionally broke the window.
John broke the window in order to let in some air.
But these sentences are not acceptable:
* The hammer intentionally broke the window.
* The window broke in order to let in some air.
Thus the NP "John" fills the AGENT role only in the first two sentences.
Not all animate NPs, even in the subject position, fill the AGENT role. For instance, you cannot
normally say
* John intentionally died.
* Mary remembered her birthday in order to get some presents.
Of course, by adding the phrase intentionally to the first sentence, you may construct some
plausible reading of the sentence ("John killed himself"), but this is a result of modifying the initial
meaning of the sentence "John died".
NPs that describe something undergoing some change or being acted upon will fill a role
called THEME. This usually corresponds to the syntactic OBJECT and, for any transitive verb X, is
the answer to the question "What was Xed?". For example, given the sentence "The gray eagle saw
the mouse", the NP "the mouse" is the THEME and is the answer to the question "What was seen?".
For intransitive verbs, the THEME role is used for the subject NPs that are not AGENTs. Thus in
"The clouds appeared over the horizon", the NP "the clouds" fills the THEME
role. More examples follow, with the THEME NP marked:
The rock broke. (THEME: the rock)
John broke the rock. (THEME: the rock)
I gave John the book. (THEME: the book)
A range of roles has to do with locations, or abstract locations. First, we must make the
distinction mentioned earlier between relations that indicate a location or place and those that
indicate motion or paths. The AT-LOC relation indicates where an object is or where an event takes
place, as in
Harry walked on the road.
The chair is by the door.
On the road describes where the walking took place, while by the door describes where the chair is
located.
Other phrases describe changes in location, direction of motion, or paths:
I walked from here to school yesterday.
It fell to the ground.
The birds flew from the lake along the river gorge.
There are at least three different types of phrases here: those that describe where something came
from (the FROM-LOC role), such as "from here"; those that describe the destination (the TO-LOC
role), such as "to the ground"; and those that describe the trajectory or path (the PATH-LOC role),
such as "along the gorge".
These location roles can be generalized into roles over arbitrary state values, called the AT
role, and roles for arbitrary state change (the FROM, TO, and PATH roles). Thus AT-LOC is a
specialization of the AT role, and so on. You can see other specializations of these roles when you
consider the abstract relation of possession:
I threw the ball to John. (the TO-LOC role)
I gave a book to John. (the TO-POSS role)
I caught the ball from John. (the FROM-LOC role)
I borrowed a book from John. (the FROM-POSS role)
The box contains a ball. (the AT-LOC role)
John owns a book. (the AT-POSS role)
Similarly, you might define AT-TIME, TO-TIME, and FROM-TIME roles, as in
I saw the car at 3 o'clock. (the AT-TIME role)
I worked from one until three. (the FROM-TIME and TO-TIME role)
The roles apply to general state change as well, as with temperature in
The temperature remains at zero. (AT-VALUE)
The temperature rose from zero. (FROM-VALUE)
Thus the notion of general value and change of value along many dimensions seems to be
supported by the similarity of the ways of realizing these roles in sentences.
Another role is motivated by the problem that, given the present taxonomy, you cannot easily
classify the role of the NP in a sentence such as
John believed that it was raining.
The THEME role is filled with the clause "that it was raining", since this is what is believed. "John"
cannot be an AGENT because there is no intentionality in believing something. Thus you must
introduce a new role, called EXPERIENCER, which is filled by animate objects that are in a
described psychological state, or that undergo some psychological process, such as perception, as
in the preceding sentence and as in
John saw the unicorn.
Another role is the BENEFICIARY role, which is filled by the animate person for whom a certain
event is performed, as in
I rolled on the floor for Lucy.
Find me the papers!
I gave the book to Jack for Susan.
The last example demonstrates the need to distinguish the TO-POSS role (that is, to Jack) from the
BENEFICIARY role.
The INSTR role describes a tool, material, or force used to perform some event, as in
Harry broke the glass with the telescope.
The telescope broke the glass.
I used some flour to make a cake.
I made a cake with some flour.
Depending on the verb, the INSTR role sometimes can be used as the surface subject when the
AGENT role is not specified. Natural forces are also included in the INSTR category here, although
you could argue for a different analysis. Thus the following are also examples of the INSTR role:
The sun dried the apples.
Jack used the sun to dry the apples.
The AGENT and INSTR roles could be combined into a more general role named CAUSAL-
AGENT.
Other roles need to be identified before certain sentences can be analyzed. For example,
some sentences describe situations where two people perform an act together:
Henry lifted the piano with Jack.
To handle this, you must introduce a role CO-AGENT to account for the PP with Jack.
A more complicated case occurs in sentences involving exchanges or other complex
interactions. For example, consider the sentences
Jack paid $1 to the man for the book.
Jack bought the book from the man for $1.
These sentences both describe a situation where Jack gives the man $1 and receives the book in
exchange. In the first sentence, however, the $1 is the THEME and there is no role to account for the
book. In the second sentence the situation is reversed: the book is the THEME and $1 is
unaccounted for. To handle these cases you must add a role CO-THEME for the second object in an
exchange.
A more general solution to this problem would be to analyze such sentences as describing
two events. The primary event is the one you have been considering so far, but a secondary event
may be present. In this analysis you might analyze the first sentence as follows (the primary event
being Jack paying the dollar, and the secondary being Jack receiving the book):
Jack: AGENT of both PRIMARY and SECONDARY event
$1: THEME of PRIMARY event
the man: TO-POSS of PRIMARY, FROM-POSS of SECONDARY
the book: THEME of SECONDARY event
Role (and subroles)   Other common names   Definition
CAUSAL-AGENT                               the object that caused the event
  AGENT                                    intentional causation
  INSTR                                    force/tool used in causing the event
THEME                 PATIENT              the thing affected by the event
EXPERIENCER                                the person involved in perception or a
                                           physical/psychological state
BENEFICIARY                                the person for whom an act is done
AT                                         the state/value on some dimension
  AT-LOC              LOCATION             current location
  AT-POSS             POSSESSOR            current possessor
  AT-VALUE                                 current value
  AT-TIME                                  current time
TO                                         final value in a state change
  TO-LOC              DESTINATION          final location
  TO-POSS             RECIPIENT            final possessor
  TO-VALUE                                 final value
FROM                                       original value in a state change
  FROM-LOC            SOURCE               original location
  FROM-POSS                                original possessor
  FROM-VALUE                               original value
PATH                                       path over which something travels
CO-AGENT                                   secondary agent in an action
CO-THEME                                   secondary theme in an exchange
Figure 4.5 Some possible semantic roles
This possibility will not be pursued further, however, since it leads into many issues not
relevant to the remainder of this chapter.
Figure 4.5 provides a summary of most of the roles distinguished thus far and the
hierarchical relationships between them.
As you've seen, verbs can be classified by the thematic roles that they require. To classify
them precisely, however, you must make a distinction between roles that are "intimately" related
to the verb and those that are not. For example, almost any past tense verb allows an AT-TIME role
realized by the adverb "yesterday". Thus this role is apparently more a property of verb phrases in
general than a property of any individual verb. However, other roles, namely those realized by
constituents for which the verb subcategorizes, seem to be properties of the verb. For example, the
verb "put" subcategorizes for a PP, and furthermore, this PP must realize the TO-LOC role. In verb
classification this latter type of role is important, and these roles are called the inner roles of the
verb.
The preceding examples suggest one test for determining whether a given role is an inner
role for a given verb: if the role is obligatory, it is an inner role. Other inner roles, however, appear
to be optional, so other tests are also needed. Another test is based on the observation that all verbs
may take at most one NP in any given inner role. If multiple NPs are needed, they must be related
by a conjunction. Thus you can say
John and I ran to the store.
but not
* John I ran to the store.
Similarly, you can say
I ran to the store and to the bank.
but not
* I ran to the store to the bank.
Thus the AGENT and TO-LOC roles for the verb run are inner roles.
Verbs typically specify up to three inner roles, at least one of which must always be realized
in any sentence using the verb. Sometimes a particular role must always be present (for example,
TO-LOC with put). Typically, the THEME role is also obligatory, whereas the AGENT role is always
optional for any verb that allows the passive form.
There are also syntactic restrictions on how various roles can be realized. Figure 4.6 shows a
sample of ways that roles can be realized in different sentences.
The following are some sample sentences with each verb and its arguments, whether NP, PP, or
embedded S, classified by role in order of occurrence:
Jack ran. AGENT only
Jack ran with a crutch. AGENT + INSTR
Jack ran with a crutch for Susan. AGENT + INSTR + BENEFICIARY
Jack destroyed the car. AGENT + THEME
Jack put the car through the wall. AGENT + THEME + PATH
Jack sold Henry the car. AGENT + TO-POSS + THEME
Henry pushed the car from Jack's
house to the junkyard. AGENT + THEME + FROM-LOC + TO-LOC
Jack is tall. THEME
Henry believes that Jack is tall. EXPERIENCER + THEME
Susan owns a car. AT-POSS + THEME
I am in the closet. THEME + AT-LOC
The ice melted. THEME
Jack enjoyed the play. EXPERIENCER + THEME
The ball rolled down the hill
to the water. THEME + PATH + TO-LOC
Role           Realization
AGENT          as subject in active sentences; preposition "by" in passive sentences
THEME          as object of transitive verbs; as subject of nonaction verbs
INSTR          as subject in active sentences with no agent; preposition "with"
EXPERIENCER    as animate subject in active sentences with no agent
BENEFICIARY    as indirect object with transitive verbs; preposition "for"
AT-LOC         prepositions "in", "on", "beyond", etc.
AT-POSS        possessive NP; as subject of sentence if no agent
TO-LOC         prepositions "to", "into"
TO-POSS        preposition "to"; indirect object with certain verbs
FROM-LOC       prepositions "from", "out of", etc.
FROM-POSS      preposition "from"
Figure 4.6 Some ways thematic roles are realized in sentences
A similar technique is used to insert unscoped tense operators for the past and present tenses. The
revised morphological rules are shown in Grammar 4.4.
These are the same as the original rules given as Grammar 3.5 except for the addition of the SEM
feature.
Rule 1 was previously discussed. Rules 2 and 3 handle transitive and intransitive verbs and
build the appropriate VP interpretation. Each takes the SEM of the verb, ?semv, and constructs a
unary predicate that will apply to the subject. The arguments to the verb sense include an event
variable, which is stored in the VAR feature, the subject, and then any additional arguments for
subcategorized constituents. Rule 4 constructs the appropriate SEM structure for pronouns given a
pronoun sense ?sempro, and rule 5 does the same for proper names. Rule 6 defines an expression
that involves an unscoped quantified expression, which consists of the quantifier ?semart, the
discourse variable ?v, and a proposition restricting the quantifier, which is constructed by applying
the unary predicate ?semcnp to the discourse variable. For example, assuming that the discourse
variable ?v is m1, the NP "the man" would combine the SEM of "the", namely the operator THE, with the
SEM of man, namely MAN1, to form the expression <THE m1 (MAN1 m1)>. Rule 7 builds a simple
CNP out of a single N. Since the SEM of a common noun is a unary predicate already, this value
simply is used as the SEM of the CNP.
Only two simple modifications need to be made to the standard chart parser to handle
semantic interpretation:
1. When a lexical rule is instantiated for use, the VAR feature is set to a new discourse variable.
2. Whenever a constituent is built, the SEM is simplified by performing any lambda reductions
that are possible.
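The lambda-reduction step can be sketched as follows. This is a toy illustration under our own encoding, not the book's implementation: logical forms are nested tuples, a lambda expression is ("LAMBDA", var, body), and the unscoped operator <PAST SEES1> is collapsed to a single atom for simplicity.

def substitute(body, var, value):
    """Replace every occurrence of var in body with value."""
    if body == var:
        return value
    if isinstance(body, tuple):
        return tuple(substitute(part, var, value) for part in body)
    return body

def reduce_once(form):
    """One lambda reduction: ((LAMBDA x body) arg) becomes body with x := arg."""
    if isinstance(form, tuple) and isinstance(form[0], tuple) and form[0][0] == "LAMBDA":
        (_, var, body), arg = form[0], form[1]
        return substitute(body, var, arg)
    return form

vp = ("LAMBDA", "x", ("PAST-SEES1", "ev1", "x", ("THE", "d1", ("DOG1", "d1"))))
print(reduce_once((vp, ("NAME", "j1", "Jill"))))
# -> ('PAST-SEES1', 'ev1', ('NAME', 'j1', 'Jill'), ('THE', 'd1', ('DOG1', 'd1')))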
Figure 4.9 The parse of "Jill saw the dog" showing the SEM and VAR features
With these two changes, the existing parser now will compute the logical form as it parses.
Consider a trace of the parser on "Jill saw the dog", whose parse tree is shown in Figure 4.9. The word
Jill is parsed as a name using the lexical lookup. A new discourse variable, j1, is generated and set to
the VAR feature. This constituent is used by rule 5 to build an NP. Since VAR is a head feature, the
VAR feature of the NP will be j1 as well, and the SEM is constructed from the equation in the
obvious manner. The lexical entry for the word saw generates a V constituent with the SEM <PAST
SEES1> and a new VAR ev1. Rule 6 combines the SEMs of the two entries for the and dog with the
VAR from the noun to build an NP with the SEM <THE d1 (DOG d1)>. This is then combined with
the SEM of the verb and its VAR by rule 3 to form a VP with the SEM
(λ x (<PAST SEES1> ev1 x <THE d1 (DOG1 d1)>))
This is then combined with the subject NP to form the final logical form for the sentence.
This completes the description of the basic semantic interpretation process. With the two
new features and two minor extensions to the parser described here, a grammar can be specified
that builds the logical form as it parses.
While the last section introduced everything needed to semantically interpret sentences, only the
simplest interpretation techniques were introduced. This section describes some additional
examples of grammatical rules that handle slightly more complicated phenomena. Specifically, it
addresses the inter pretation of verb phrases and prepositional phrases in more detail. First,
consider the rule that is needed for handling auxiliary verbs:
(VP SEM(λ a1 (?semaux (?semvp a1))))
(AUX SUBCAT ?v SEM ?semaux)
(VP VFORM ?v SEM ?semvp)
This rule inserts a modal operator in the appropriate place for the new VP. The SEM equation for
the new VP is a little complicated, so consider it in more detail. If ?semaux is a modal operator such
as CAN1, and ?semvp is a lambda expression such as (λ x (LAUGHS1 e3 x)), then, according to the
auxiliary rule, the SEM of the VP can laugh will be
(λ a1 (CAN1 ((λ x (LAUGHS1 e3 x)) a1)))
This can be simplified to
(λ a1 (CAN1 (LAUGHS1 e3 a1)))
It might help to view this type of SEM equation as "lifting" the variable for the subject over
the CAN1 operator. Starting with the VP interpretation (λ x (LAUGHS1 e3 x)), the equation builds a
new formula containing the CAN1 operator yet still retaining the lambda variable for the subject on
the outside of the formula. Note that, like all VPs, the new SEM is a unary predicate that applies to
the subject, so the auxiliary rule could be used recursively to analyze more complex sequences of
auxiliary verbs.
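The same toy substitution machinery used earlier makes the "lifting" visible (again our own encoding, not the book's):

def substitute(body, var, value):
    """Replace every occurrence of var in body with value."""
    if body == var:
        return value
    if isinstance(body, tuple):
        return tuple(substitute(p, var, value) for p in body)
    return body

inner = ("LAMBDA", "x", ("LAUGHS1", "e3", "x"))      # SEM of "laugh"
reduced = substitute(inner[2], inner[1], "a1")       # (LAUGHS1 e3 a1)
print(("LAMBDA", "a1", ("CAN1", reduced)))
# -> ('LAMBDA', 'a1', ('CAN1', ('LAUGHS1', 'e3', 'a1'))), the SEM of "can laugh"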
To analyze prepositional phrases, it is important to realize that they play two different
semantic roles in sentences. In one analysis the PP is a modifier to a noun phrase or verb phrase. In
the other use the PP is subcategorized for by a head word, and the preposition acts more as a flag
for an argument position than as an independent predicate.
Consider the modification case first. In these cases the SEM of the PP is a unary predicate to
be applied to whatever it eventually modifies. Thus the following rule would be appropriate for
building a PP modifier:
(PP SEM (λ y (?semp y ?semnp))) (P SEM ?semp) (NP SEM ?semnp)
Given the PP in the corner, if the SEM of the P is IN-LOC1, and the SEM of the NP is <THE c1
(CORNER1 c1)>, then the SEM of the PP would be the unary predicate
(λ y (IN-LOC1 y <THE c1 CORNER1>))
Now you can consider the interpretation of the noun phrase "the man in the corner". A reasonable
rule that incorporates the PP modifier would be
(CNP SEM (λ n1 (& (?semcnp n1) (?sempp n1)))) (CNP SEM ?semcnp) (PP SEM ?sempp)
Given that the SEM of the CNP man is the unary predicate MAN1 and the SEM of the PP in the
corner is (λ y (IN-LOC1 y <THE c1 CORNER1>)), the new SEM of the CNP, before simplification, would
be
(λ n1 (& (MAN1 n1) ((λ y (IN-LOC1 y <THE c1 CORNER1>)) n1)))
which simplifies to (λ n1 (& (MAN1 n1) (IN-LOC1 n1 <THE c1 CORNER1>))).
Now consider the VP "decide on a couch", which is ambiguous between a reading in which the
deciding takes place on a couch and one in which a couch is chosen. The first case results from
treating "on a couch" as an adverbial PP, as just discussed. What is the semantic interpretation
equation for the second case? The appropriate syntactic rule is
VP -> V[_pp:on] NP PP[on]
and the desired logical form of the final VP is
(λ s (DECIDES-ON1 d1 s <A c1 (COUCH c1)>))
Note that there appears to be no semantic contribution from the word on in this case.
Subcategorized PPs are handled in many systems by distinguishing between two different types of
prepositional phrases. A new binary feature PRED is introduced, where + indicates that the
prepositional phrase should be interpreted as a predicate, and - means it should be interpreted as
an argument, that is, a term. This binary distinction allows us to specify two rules for prepositional
phrases. These are rules 8 and 9 in Grammar 4.7. Rule 8 is restricted to prepositional phrases with
a +PRED value, and the PP acts as a modifier. Rule 9 handles all the subcategorized prepositional
phrases, with the -PRED feature, in which case the SEM of the PP is simply the SEM of the object NP.
Grammar 4.7 also summarizes all the other rules developed in this section.
8. (PP PRED + SEM (λ x (?semp x ?semnp))) (P SEM ?semp) (NP SEM ?semnp)
9. (PP PRED - PFORM ?pf SEM ?semnp) (P ROOT ?pf) (NP SEM ?semnp)
10. (VP VAR ?v SEM (λ ag1 (& (?semvp ag1) (?sempp ?v))))
(VP SEM ?semvp) (PP PRED + SEM ?sempp)
11. (VP VAR ?v SEM (λ ag2 (?semv ?v ag2 ?sempp)))
(V[_np_pp:on] SEM ?semv) (PP PRED - PFORM on SEM ?sempp)
12. (VP SEM (λ a1 (?semaux (?semvp a1))))
(AUX SUBCAT ?v SEM ?semaux) (VP VFORM ?v SEM ?semvp)
13. (CNP SEM (λ n1 (& (?semcnp n1) (?sempp n1))))
(CNP SEM ?semcnp) (PP PRED + SEM ?sempp)
Head features for PP: PFORM
Head features for VP, CNP: VAR
Figure 4.11 shows the two readings of the VP decide on a couch given the grammar defined
by Grammars 4.3 and 4.7. The case in which the decision is about a couch is shown in the upper half
of the figure: the PP has the feature -PRED, as previously discussed, and its SEM is <A c1 COUCH1>.
The case in which a decision is made on a couch is shown in the lower half of the figure: the PP has
the feature +PRED and its SEM is (λ x (ON-LOC1 x <A c1 COUCH1>)).
Figure 4.11 Two possible parse trees for the VP "decide on a couch"
4.7.4 Lexicalized Semantic Interpretation and Semantic Roles
So far, the semantic forms of the lexical entries have only consisted of the possible senses for each
word, and all the complexity of the semantic interpretation is encoded in the grammatical rules.
While this is a reasonable strategy, many researchers use a different approach, in which the lexical
entries encode the complexities and the grammatical rules are simpler. Consider the verb decide,
say in its intransitive sense, DECIDES1. The SEM of the verb is simply the sense DECIDES1, and rule
2 builds the lambda expression (λ y (DECIDES1 e1 y)). An alternative approach would be to define
the SEM of the lexical entry to be (λ y (DECIDES1 e1 y)), and then the SEM equation for rule 2
would just use the SEM of the verb as its SEM. Likewise, the SEM for the lexical entry for the
transitive sense could be the expression (λ o (λ y (DECIDES-ON1 e1 y o))). The SEM equation for
rule 3 would then apply this predicate to the SEM of the object to obtain the appropriate SEM, as
before.
There is a tradeoff here between the complexity of the grammatical rules and the complexity
of the lexical entries. With the grammar we have developed so far, it seems better to stay with the
simple lexicon. But if the logical forms become more complex, the alternative approach begins to
look attractive. As an example, consider how to specify a grammar that produces a logical form
based on thematic roles. Consider first what happens if the lexicon can only store atomic word
senses. Whereas the earlier grammar had only one rule that could cover all transitive verbs, the
new grammar would have to classify transitive verbs by the thematic roles they use, and a separate
rule must be used for each. The verbs see and eat, for example, both have transitive forms where
the subject fills the AGENT role and the object fills the THEME role. The verb break, on the other
hand, also has a sense where the subject fills the INSTR role and the object fills the THEME role, as
in The hammer broke the window. To handle these two cases, a new feature, say ROLES, would
have to be added to the lexical entries that identifies the appropriate forms, and then a separate
grammar rule added for each. These might look as follows:
(VP VAR ?v SEM (λ a (?semv ?v [AGENT a] [THEME ?semnp])))
(V ROLES AG-THEME SEM ?semv) (NP SEM ?semnp)
(VP VAR ?v SEM (λ a (?semv ?v [INSTR a] [THEME ?semnp])))
(V ROLES INSTR-THEME SEM ?semv) (NP SEM ?semnp)
Additional rules would be added for all the other possible combinations of roles that can be used by
the verbs.
Clearly, this approach could be cumbersome. Since it requires adding thematic role
information to the lexicon anyway (using the ROLES feature), it might be simpler just to encode the
appropriate forms in the lexicon. For instance, if the lexical entries are
see: (V VAR ?v SEM (λ o (λ a (SEES1 ?v [AGENT a] [THEME o]))))
break: (V VAR ?v SEM (λ o (λ a (BREAKS1 ?v [INSTR a] [THEME o]))))
then a single grammar rule, namely,
(VP SEM (?semv ?semnp)) (V SEM ?semv) (NP SEM ?semnp)
will cover all the cases. Consider the VP "see the book", where the SEM of "see" is as above and "the
book" is <THE b1 (BOOK1 b1)>. If the verb's VAR is instantiated to v1, the SEM for the VP would be
((λ o (λ a (SEES1 v1 [AGENT a] [THEME o]))) <THE b1 (BOOK1 b1)>)
which can be simplified using one lambda reduction to
(λ a (SEES1 v1 [AGENT a] [THEME <THE b1 (BOOK1 b1)>]))
The same rule given the VP "break the book" would apply the SEM of "break" above to the SEM for
"the book" to produce the reduced logical form
(λ a (BREAKS1 v1 [INSTR a] [THEME <THE b1 (BOOK1 b1)>]))
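In programming terms, the lexicalized approach amounts to storing curried functions in the lexicon so that a single grammar rule, plain function application, suffices. A sketch, with all function and variable names ours:

the_book = ("THE", "b1", ("BOOK1", "b1"))

# Lexical entries as curried functions: apply to the object, then the subject.
def sees(event):
    return lambda obj: lambda agent: ("SEES1", event, ("AGENT", agent), ("THEME", obj))

def breaks(event):
    return lambda obj: lambda agent: ("BREAKS1", event, ("INSTR", agent), ("THEME", obj))

# The single rule (VP SEM (?semv ?semnp)) is just function application:
vp_sem = sees("v1")(the_book)          # still a unary predicate over the subject
print(vp_sem(("NAME", "j1", "Jill")))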
Hierarchical Lexicons
The problem with making the lexicon more complex is that there are many words, and specifying a
lexicon is difficult even when the entries are simple. Just specifying the semantic interpretation
rules for the most common sense is tedious, as there is a different semantic interpretation rule for
each complement structure the verb allows. For example, the verb give allows the forms
I gave the money.
I gave John the money.
I gave the money to John.
The lexical entries for give would include the following:
(V SUBCAT _np
SEM λ o λ a (GIVE1 * [AGENT a] [THEME o]))
(V SUBCAT _np_np
SEM λ r λ o λ a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r]))
(V SUBCAT _np_pp:to
SEM λ o λ r λ a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r]))
This is quite a burden if it must be repeated for every verb. Luckily, we can do much better
by exploiting some general regularities across verbs in English. For example, there is a very large
class of verbs, including most transitive verbs, that all use the same semantic interpretation rule for
the _np SUBCAT form. This class includes verbs such as give, take, see, find, paint, and so on:
virtually all verbs that describe actions. The idea of a hierarchical lexicon is to organize verb senses
in such a way that their shared properties can be captured concisely. This depends on a technique
called inheritance, where word senses inherit or acquire the properties of the abstract classes
above them in the hierarchy. For example, a very useful lexical hierarchy can be based on the
SUBCAT and SEM properties of verbs. Near the top of the hierarchy are abstract verb senses that
define common verb classes. For example, the abstract class INTRANS-ACT defines a class of verbs
that allow a SUBCAT _none and have the semantic interpretation rule
λ s (?PREDN * [AGENT s])
where ?PREDN is a predicate name determined by the verb. This fully specifies the semantic
interpretation of intransitive verbs such as run, laugh, sit, and so on, except for the actual predicate
name that still needs to be specified in the lexical entry for the word. Another common form is the
simple transitive verb that describes an action, including the verbs listed above. This form, TRANS-
ACT, has a SUBCAT _np and a SEM λ o λ a (?PREDN * [AGENT a] [THEME o]).
We can define similar classes for all the common verb forms and then build an inheritance
structure that relates verb senses to the forms they allow. Figure 4.12 shows a lexical hierarchy
that encodes the definitions of four different verb senses. It is equivalent to the following entries
specified without a hierarchy:
run (in the intransitive exercise sense, RUN1):
(SUBCAT _none
SEM λ a (RUN1 * [AGENT a]))
run (in the transitive "operate" sense, OP1):
(SUBCAT _np
SEM λ o λ a (OP1 * [AGENT a] [THEME o]))
donate (in both its forms, DONATE1):
(SUBCAT _np
SEM λ o λ a (DONATE1 * [AGENT a] [THEME o]))
(SUBCAT _np_pp:to
SEM λ o λ r λ a (DONATE1 * [AGENT a] [THEME o] [TO-POSS r]))
and, of course,
give (in all its forms as previously discussed):
(SUBCAT _np
SEM λ o λ a (GIVE1 * [AGENT a] [THEME o]))
(SUBCAT _np_pp:to
SEM λ o λ r λ a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r]))
(SUBCAT _np_np
SEM λ r λ o λ a (GIVE1 * [AGENT a] [THEME o] [TO-POSS r]))
Figure 4.12 Part of a lexical hierarchy for verbs based on SUBCAT and SEM features
You could implement a lexical hierarchy by adding another feature to each lexical entry called SUP
(for superclass), which has as its value a list of abstract categories from which the constituent
inherits properties. It is then relatively simple to write a program that searches up this hierarchy to
find all the relevant feature values whenever the lexicon is accessed. The entry for the verb give
might now look like this:
give: (V ROOT give
PREDN GIVE1
SUP (BITRANS-TO-ACT TRANS-ACT))
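A minimal sketch of how such a SUP-based lookup might work (the dictionaries and names here are hypothetical, not a real lexicon format): the program walks the superclass list and instantiates each inherited SUBCAT/SEM template with the word's own predicate name.

# Abstract verb classes pairing a SUBCAT form with a SEM template (as text).
HIERARCHY = {
    "TRANS-ACT":      {"SUBCAT": "_np",
                       "SEM": "lambda o, a: (PREDN * [AGENT a] [THEME o])"},
    "BITRANS-TO-ACT": {"SUBCAT": "_np_pp:to",
                       "SEM": "lambda o, r, a: (PREDN * [AGENT a] [THEME o] [TO-POSS r])"},
}
LEXICON = {"give": {"PREDN": "GIVE1", "SUP": ["BITRANS-TO-ACT", "TRANS-ACT"]}}

def entries(word):
    """Collect the SUBCAT/SEM forms a word inherits from its superclasses."""
    entry = LEXICON[word]
    return [{"PREDN": entry["PREDN"], **HIERARCHY[sup]} for sup in entry["SUP"]]

for form in entries("give"):
    print(form)     # two inherited forms, each instantiated with GIVE1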
14. (S INV - SEM (WH-query ?sems)) (NP WH Q AGR ?a SEM ?semnp)
(S INV + SEM ?sems GAP (NP AGR ?a SEM ?semnp))
15. (S INV + GAP ?g SEM (?semaux (?semvp ?semnp))) (AUX AGR ?a SUBCAT ?s SEM ?semaux)
(NP AGR ?a GAP - SEM ?semnp)
(VP VFORM ?s GAP ?g SEM ?semvp)
16. (NP WH Q VAR ?v SEM <WH ?v (?sempro ?v)>) (PRO WH Q SEM ?sempro)
Grammar 4.10 Rules to handle simple wh-questions
4.7.5 Handling Simple Questions
The grammar developed so far deals only with simple declarative sentences. To extend the
grammar to handle other sentence types requires adding rules to interpret wh-terms, inverted
sentences, and the gap propagation required to handle wh-questions. This can be done quite
directly with the mechanisms developed so far and requires no additional extensions to the parser.
All you need to do is augment the S rules first developed for questions in Chapter 5 with
appropriate SEM feature equations.
The part that may need some explanation is how the SEM feature interacts with the GAP
feature. As you recall, many wh-questions use the GAP feature to build the appropriate structures.
Examples are
Who did Jill see?
Who did Jill want to see?
Who did Jill want to win the prize?
The rule introduced in Chapter 5 to account for these questions was
(S INV -) -> (NP WH Q AGR ?a) (S INV + GAP (NP AGR ?a))
That is, a wh-question consists of an NP with the WH feature Q followed by an inverted sentence
with an NP missing that agrees with the first NP. To make the semantic interpretation work, we add
the SEM feature to the features of the gap. This way it is passed into the S structure and can be used
at the appropriate place when the gap is found. The revised rule is
(S INV - SEM (WH-query ?sems)) (NP WH Q AGR ?a SEM ?semnp)
(S INV + SEM ?sems GAP (NP AGR ?a SEM ?semnp))
Grammar 4.10 gives the new rules required to handle this type of question. Rule 14 was just
discussed. Rule 15 handles inverted sentences and could be used for yes/no questions as well as for
the wh-questions discussed here. Rule 16 allows noun phrases to be built from wh-pronouns, such
as who and what.
The lexical entry for the wh-words would have to be extended with a SEM feature as well.
For instance, the word who would have the entry
(PRO WH {Q R} SEM WHO1 AGR {3s 3p})
The predicate WHO1 would be true of any objects that are reasonable answers to such questions,
including people and possibly other animate agents.
To see how the SEM feature and GAP features interact, consider a derivation of the parse tree for
the question Who did Jill see?, as shown in Figure 4.13. Using rule 16, the word who would be
parsed as the noun phrase
(NP WH Q AGR 3s SEM <WH p1 (WHO1 p1)>)
Figure 4.13 The parse tree for "Who did Jill see?"
This constituent can be used to start rule 14, and thus we need the following constituent to
complete the rule:
(S INV +
GAP (NP AGR 3s SEM <WH p1 (WHO1 p1)>)
SEM ?sems)
Rule 15 can be used to rewrite this, and the words did and Jill provide the first two subconstituents.
The third subconstituent needed is:
(VP VFORM base
GAP (NP AGR 3s SEM <WH p1 (WHO1 p1)>)
SEM ?semvp)
This is a VP with an NP gap. Rule 3 from Grammar 4.3 applies to transitive verbs such as see. The
GAP feature is used to fill a gap for the object of the verb, instantiating ?semnp to <WH p1 (WHO1
p1)>. Thus the SEM of the new VP is
(λ a3 (SEES1 s1 a3 <WH p1 (WHO1 p1)>))
The SEM of the initial word who has found its way to the appropriate place in the sentence. The
parse can now be completed as usual. This technique of passing in the SEM in the GAP is
completely general and can be used to handle all of the question types discussed in Chapter 5.
Figure 4.14 The parse tree for "Where did Jill go?"
Note that handling questions starting with +PRED prepositional phrases depends on having
a solution to the problem concerning gap propagation first mentioned in Chapter 5. Specifically, the
rule VP -> VP PP, treated in the normal way, would only pass the GAP into the VP subconstituent,
namely the nonlexical head. Thus there seems no way to create a PP gap that modifies a verb
phrase. But this was a problem with the syntactic grammar as well, and any solution at that level
should carry over to the semantic level.
So far we have used lambda expressions and lambda reduction to drive the semantic
interpretation. This provides a good framework for explaining and comparing techniques for
semantic interpretation. However, many systems do not explicitly use lambda expressions and
perform semantic interpretation directly using feature values and variables. The basic idea is to
introduce new features for the argument positions that earlier would have been filled using lambda
reduction. For instance, instead of using rule 1 in Grammar 4.3, namely
(S SEM (?semvp ?semnp)) -> (NP SEM ?semnp) (VP SEM ?semvp)
a new feature SUBJ is introduced, and the rule becomes
(S SEM ?semvp) -> (NP SEM ?semnp) (VP SUBJ ?semnp SEM ?semvp)
The SEM of the subject is passed into the VP constituent as the SUBJ feature and the SEM equations
for the VP insert the subject in the correct position. The new version of rule 3 in Grammar 4.3 that
does this is
(VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj ?semnp)) ->
(V[_none] SEM ?semv) (NP SEM ?semnp)
Figure 4.15 shows how this rule builds the SEM of the sentence Jill saw the dog. Compare this to the
analysis built using Grammar 4.3 shown in Figure 4.9. The differences appear in the treatment of
the VP. Here the SEM is the full proposition with the subject inserted, whereas before the SEM was
a lambda expression that would be applied to the subject later in the rule that builds the S.
Figure 4.15 The parse tree for Jill saw the dog using the SUBJ feature
1. (S SEM ?semvp) (NP SEM ?semsubj) (VP SUBJ ?semsubj SEM ?semvp)
2. (VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj)) (V[_none] SEM ?semv)
3. (VP VAR ?v SUBJ ?semsubj SEM (?semv ?v ?semsubj ?semnp))
(V[_np] SEM ?semv) (NP SEM ?semnp)
4. (NP VAR ?v SEM (PRO ?v ?sempro)) (PRO SEM ?sempro)
5. (NP VAR ?v SEM (NAME ?v ?semname)) (NAME SEM ?semname)
6. (NP VAR ?v SEM <?semart ?v ?semcnp>) (ART SEM ?semart) (CNP SEM ?semcnp)
7. (CNP VAR ?v SEM (?semn ?v)) (N SEM ?semn)
Head features for S, VP, NP, CNP: VAR
Grammar 4.14 is a version of Grammar 4.3 reformulated using this technique. Besides the
changes to rules 1, 2, and 3, rules 6 and 7 are also modified. Rule 7 uses the VAR value to build a full
proposition (as opposed to a unary predicate in the old grammar), and rule 6 is changed
appropriately to account for the change to rule 7.
One advantage of this approach is that no special mechanism need be introduced to handle
semantic interpretation. In particular, there is no need to have a lambda reduction step. Everything
is accomplished by feature unification. Another significant advantage is that a grammar specified in
this form is reversible and hence can be used to generate sentences as well as parse them, as
discussed in the next section. Not all lambda expressions can be eliminated using these techniques,
however. For instance, to handle conjoined subject phrases as in Sue and Sam saw Jack, the
meaning of the verb phrase must still be a lambda expression. If the subject were inserted into the
VP using a variable for the SUBJ feature, then this variable would have to unify with the SEMs
of both Sue and Sam, which it cannot do.
But the SEM of the NP is unconstrained. The realizer could only proceed by randomly
generating noun phrases and then attempting to realize the VP with each one and backtracking
until an appropriate one is found. Worse than that, with a more general grammar, the algorithm
may fall into an infinite loop. Clearly, this is an unworkable strategy.
One method for avoiding these problems is to expand the constituents in a different order.
For instance, choosing what term will be the subject depends on decisions made about the verb and
the structure of the verb phrase, such as whether the active or passive voice is used, and so on. So it
makes sense to expand the verb phrase first and then generate the appropriate subject. In fact, a
good general strategy is to expand the head constituents first and then fill in the others. Figure 4.16
gives a simple algorithm to do this. The realization algorithm operates on a list of constituents
much like the basic top-down parser described in Chapter 3. It continues to rewrite constituents in
this list until the list consists only of lexical constituents, at which point the words can be
generated.
Initialization: Set L to a list containing the constituent that you wish to generate.
Do until L contains no nonlexical constituents:
1. If L contains a constituent C that is marked as a nonlexical head, then use a rule in the
grammar to rewrite C. Any variables in C that are bound in the rewrite should be instantiated
throughout the entire list.
2. Otherwise, choose a nonlexical constituent C, giving preference to one whose SEM feature is
bound, if one exists. Use a rule in the grammar to rewrite C. Any variables in C that are bound
in the rewrite should be instantiated throughout the entire list.
Figure 4.16 A head-driven realization algorithm
This algorithm can easily be generalized so that it searches all possible realizations based on
the grammar using a backtracking technique. The reason it works well is that, by expanding the
head constituents first, the algorithm moves quickly down to the lexical level and chooses the
words that have the most influence on the structure of the overall sentence. Once the lexical head is
chosen, most of the structure of the rest of the constituent is determined.
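The loop itself can be sketched abstractly in Python (the callback names are ours; a real implementation would plug in the grammar's rule matcher for rewrite):

def realize(constituents, rewrite, is_lexical, is_head, sem_bound):
    """Rewrite constituents until only lexical ones remain, heads first."""
    while any(not is_lexical(c) for c in constituents):
        nonlex = [c for c in constituents if not is_lexical(c)]
        heads = [c for c in nonlex if is_head(c)]
        bound = [c for c in nonlex if sem_bound(c)]
        # Prefer a nonlexical head; otherwise a constituent with a bound SEM.
        target = heads[0] if heads else (bound[0] if bound else nonlex[0])
        constituents = rewrite(constituents, target)  # instantiates bindings list-wide
    return constituents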
Consider this algorithm operating with Grammar 4.14 and the initial input
(S SEM (<PAST SEES1> s1 (NAME j1 "Jill") <THE d1 (DOG1 d1)>))
The S constituent is rewritten based on rule 1 in the grammar to produce the following constituent
list:
(NP SEM ?semsubj)
(VP SUBJ ?semsubj
SEM (<PAST SEES1> s1 (NAME j1 "Jill") <THE d1 (DOG1 d1)>))
The VP is the nonlexical head constituent, so it is expanded next. Only rule 3
will match the SEM structure. As a result of the match, the following variables are bound:
?semv <- <PAST SEES1>
?v <- s1
?semsubj <- (NAME j1 "Jill")
?semnp <- <THE d1 (DOG1 d1)>
Thus you obtain the following list of constituents after rewriting the VP and instantiating the
variables throughout the list:
(NP SEM (NAME j1 "Jill"))
(V[_np] SEM <PAST SEES1>)
(NP SEM <THE d1 (DOG1 d1)>)
Since there is no nonlexical head, the algorithm now picks any nonlexical constituent with a bound
SEM, say the first NP. Only rule 5 will match, yielding the lexical constituent (NAME SEM "Jill") and
producing the constituent list
(NAME SEM "Jill")
(V[_np] SEM <PAST SEES1>)
(NP SEM <THE d1 (DOG1 d1)>)
The remaining NP is selected next. Rule 6 matches and the subconstituents (ART SEM THE) and
(CNP SEM DOG1) are generated, producing the constituent list
(NAME SEM "Jill")
(V[_np] SEM <PAST SEES1>)
(ART SEM THE)
(CNP SEM DOG1)
The CNP constituent is selected next and rewritten as a common noun with SEM DOG1, and the
algorithm is complete. The constituent list is now a sequence of lexical categories:
(NAME SEM "Jill")
(V[_np] SEM <PAST SEES1>)
(ART SEM THE)
(N SEM DOG1)
It is simple to produce the sentence Jill saw the dog from the lexical constituents.
The grammar used in this example was very small, so there is only one possible realization
of this sentence. With a larger grammar, a wider range of forms would be possible. For instance, if
only the SEM feature is specified, the realization program would randomly pick between active and
passive sentences when allowed by the verb. In other cases it might pick between different
subcategorization structures. For instance, the same logical form might be realized as Jill gave
the dog to Jack or Jill gave Jack the dog. Depending on the number of word senses used and whether
different words have senses in common, the realizer may also randomly pick between various
lexical realizations of a logical form. For instance, there could be a logical form that could be
realized by the sentence Jill gave the money to the Humane Society or Jack donated the money to
the Humane Society.
Each of these variations may have different effects in context, but these distinctions are not
captured in the logical form. To force particular realizations, you would have to specify other
features in addition to the SEM feature. For instance, you might set the VOICE feature to active to
guarantee an active voice sentence.
Transfer Learning
Transfer learning is a machine learning approach that involves training a model for one job and then
repurposing it for a related activity. Instead of developing and training a model from scratch, which is
costly, time-consuming, and requires a large quantity of data, you may simply fine-tune one that has
already been trained. As a result, organizations can accomplish NLP jobs faster and with less labeled
data. Transfer learning, which first gained popularity in the field of computer vision, is now being
utilized in NLP applications such as intent classification, sentiment analysis, and named entity
recognition.
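As a concrete illustration, a few lines suffice to reuse a pretrained sentiment model. This sketch assumes the Hugging Face transformers library is installed; the underlying model is a default chosen by the library, not specified by these notes.

from transformers import pipeline

# Reuse a pretrained, fine-tuned model: no task-specific training to get started.
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP applications much cheaper to build."))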
Automation In NLP
The success of automated machine learning or autoML in effectively dealing with real-world problems
has prompted researchers to develop more automation and no-code tools and platforms.
One such area is automation in natural language processing. With AutoNLP, users can build models like
sentiment analysis with just a few basic lines of code. This encourages wider participation in the
machine learning community, earlier thought to be restricted to just developers and engineers.
AutoNLP helps in automating processes such as stemming, tokenisation, and lemmatisation, among
others. It can also help in choosing the best model for a given dataset.
Chapter 6 suggested employing heuristics to choose between alternate syntactic analyses. Creating
such heuristics, however, is difficult and time-consuming. Furthermore, there is no systematic
method for evaluating how well the heuristic rules work in practice. This section explores some
techniques for solving these problems based on probability theory. Such approaches have become
very popular in the past few years because large databases, or corpora, of natural language data
have become available. This data allows you to use statistically based techniques for automatically
deriving the probabilities needed. The most commonly available corpus, the Brown corpus, consists
of about a million words, all labeled with their parts of speech. More recently, much larger
databases have become available in formats ranging from raw text format (as in databases of
articles from the AP news wire and the "Wall Street Journal") to corpora with full syntactic
annotation (such as the Penn Treebank). The availability of all this data suggests new analysis
techniques that were previously not possible.
Intuitively, the probability of some event is the likelihood that it will occur. A probability of 1
indicates that the event is certain to occur, while a probability of 0 indicates that the event
definitely will not occur. Any number between 0 and 1 indicates some degree of uncertainty. This
uncertainty can be illuminated by considering the odds of the event occurring, as you would if you
were going to bet on whether the event will occur or not. A probability of .5 would indicate that the
event is equally likely to occur as not to occur, that is, a "50/50" bet. An event with
probability .5 would occur exactly half of the time. An event with probability .1 would occur once in
every 10 opportunities (1/10 odds), whereas an event with probability .75 would occur 75 times
out of 100 (3/4 odds).
More formally, probability can be defined in terms of a random variable, which may range
over a predefined set of values. While random variables may range over infinite sets and
continuous values, here we will use only random variables that range over a finite set of values. For
instance, consider tossing a coin. The random variable TOSS representing the result of a coin toss
would range over two possible values: heads (h) or tails (t). One possible event is that the coin
comes up heads - TOSS=h; the other is that the coin comes up tails - TOSS=t. No other value for
TOSS is possible, reflecting the fact that a tossed coin always comes up either heads or tails. Usually,
we will not mention the random variable and will talk of the probability of TOSS=h simply as the
probability of the event h.
A probability function, PROB, assigns a probability to every possible value of a random
variable. Every probability function must have the following properties, where e1,...,en are the
possible distinct values of a random variable E:
1. PROB(ei) ≥ 0, for all i
2. PROB(ei) ≤ 1, for all i
3. ∑i=1,n PROB(ei)= 1
Consider a specific example. A particular horse, Harry, ran 100 races in his career. The result of a
race is represented as a random variable R that has one of two values, Win or Lose. Say that Harry
won 20 times overall. Thus the probability of Harry winning the race is PROB(R=Win) = .2 and the
probability of him losing is
PROB(R=Lose) = .8. Note that these values satisfy the three constraints just defined.
Of course, there may be many different random variables, and we are often interested in
how they relate to each other. Continuing the racing example, consider another random variable W
representing the state of the weather and ranging over the values Rain and Shine. Let us say that it
was raining 30 times out of the 100 times a race was run. Of these 30 races, Harry won 15 times.
Intuitively, if you were given the fact that it was raining - that is, that W=Rain - the probability that
R=Win would be .5 (15 out of 30). This intuition is captured by the concept of conditional
probability and is written as PROB(Win | Rain), which is often described as the probability of the
event Win given the event Rain. Conditional probability is defined by the formula
PROB(e | e') = PROB(e & e') / PROB(e')
where PROB(e & e') is the probability of the two events e and e' occurring simultaneously.
You know that PROB( Rain) = .3 and PROB(Win & Rain) = .15, and using the definition of
conditional probability you can compute PROB(Win | Rain) and see that it agrees with your
intuition:
PROB(Win | Rain) = PROB(Win & Rain) / PROB(Rain)
= .15 / .30
= .5
An important theorem relating conditional probabilities is called Bayes' rule. This rule relates the
conditional probability of an event A given B to the conditional probability of B given A:
PROB(A | B) = (PROB(B | A) * PROB(A)) / PROB(B)
We can illustrate this rule using the race horse example. Using Bayes’ rule we can compute the
probability that it rained on a day that Harry won a race:
PROB(Rain | Win) = ( PROB(Win | Rain)*PROB(Rain)) / PROB(Win)
= (.5 * .3) / .2
= .75
which, of course, is the same value as if we calculated the conditional probability directly from its
definition:
PROB(Rain | Win) = PROB(Rain & Win) / PROB(Win)
= .15 / .20
= .75
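The whole running example fits in a few lines of arithmetic (a sketch using the counts given above; variable names are ours):

races, wins, rain_days, win_and_rain = 100, 20, 30, 15

p_win = wins / races                                   # PROB(Win) = .2
p_rain = rain_days / races                             # PROB(Rain) = .3
p_win_given_rain = win_and_rain / rain_days            # PROB(Win | Rain) = .5
p_rain_given_win = p_win_given_rain * p_rain / p_win   # Bayes' rule: .75

# Agrees with the direct definition PROB(Rain & Win) / PROB(Win):
assert abs(p_rain_given_win - win_and_rain / wins) < 1e-9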
The reason that Bayes' rule is useful is that we usually do not have complete information
about a situation and so do not know all the required probabilities. We can, however, often
estimate some probabilities reasonably and then use Bayes’ rule to calculate the others.
Another important concept in probability theory is the notion of independence. Two events
are said to be independent of each other if the occurrence of one does not affect the probability of
the occurrence of the other. For instance, in the race horse example, consider another random
variable, L, that indicates whether I took my lucky rabbit's foot to the race (value Foot) or not
(Empty). Say I took my rabbit's foot 60 times, and Harry won 12 races. This means that the
probability that Harry won the race on a day that I took the rabbit's foot is PROB(Win | Foot) =
12/60 = .2. Since this is the same as the usual probability of Harry winning, you can conclude that
winning the race is independent of taking the rabbit’s foot.
More formally, two events A and B are independent of each other if and only if
PROB(A | B) = PROB(A)
which, using the definition of conditional probability, is equivalent to saying
PROB(A & B) = PROB(A) * PROB(B)
Note that the events of winning and raining are not independent, given that PROB(Win & Rain) =
.15 while PROB(Win) * PROB(Rain) = .2 * .3 = .06. In other words, winning and raining occur
together at a rate much greater than random chance.
Consider an application of probability theory related to language, namely part-of-speech
identification: Given a sentence with ambiguous words, determine the most likely lexical category
for each word. This problem will be examined in detail in Section 7.3. Here we consider a trivial
case to illustrate the ideas underlying probability theory. Say you need to identify the correct
syntactic category for words that can be either nouns or verbs. This can be formalized using two
random variables: one, C, that ranges over the parts of speech (N and V), and another, W, that
ranges over all the possible words. Consider an example when W = "flies". The problem can be
stated as determining whether PROB(C=N | W="flies") or PROB(C=V | W="flies") is greater. Note
that, as mentioned earlier, the random variables will usually be omitted from the formulas. Thus
PROB(C=N | W="flies") will usually be abbreviated as PROB(N | "flies"). Given the definition of
conditional probability, the probabilities for the categories of the word flies are calculated as
follows:
PROB( N | flies ) = PROB(flies & N) / PROB(flies)
PROB( V | flies ) = PROB(flies & V) / PROB(flies)
So the problem reduces to finding which of PROB (flies & N) and PROB(flies & V) is greater, since
the denominator is the same in each formula.
How might you obtain these probabilities? You clearly cannot determine the true
probabilities since you don't have a record of all text ever written, let alone the text that will be
processed in the future. But if you have a large enough sample of data, you can estimate the
probabilities.
Let’s say we have a corpus of simple sentences containing 1,273,000 words. Say we find
1000 uses of the word "flies", 400 of them in the N sense, and 600 in the V sense. We can
approximate the probabilities by looking at the ratios of the frequencies of the words in the corpus.
For example, the probability of a randomly selected word being the word "flies" is
PROB(flies) ≈ 1000 / 1,273,000 = 0.0008
The joint probabilities for "flies" as a noun and "flies" as a verb are
PROB(flies & N) ≈ 400 / 1,273,000 = 0.0003
PROB(flies & V) ≈ 600 / 1,273,000 = 0.0005
Thus our best guess is that each use of "flies" is a verb. We can compute the probability that "flies" is
a verb using the formula
PROB(V | flies) = PROB(V & flies) / PROB(flies)
= .0005 / .0008 = .625
Using this method, an algorithm that always asserts "flies" to be a verb will be correct about 60
percent of the time. This is clearly a poor strategy, but better than guessing that it is a noun all the
time! To get a better method, you must consider more context, such as the sentence in which "flies"
occurs.
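The same estimates in code (counts as given above; note that the unrounded ratio is 600/1000 = .6, so the .625 in the text is an artifact of using the rounded joint probabilities):

total_words, flies, flies_n, flies_v = 1_273_000, 1000, 400, 600

p_flies = flies / total_words          # about 0.0008
p_n_given_flies = flies_n / flies      # PROB(N | flies) = .4
p_v_given_flies = flies_v / flies      # PROB(V | flies) = .6
print(f"Always guessing V is right {p_v_given_flies:.0%} of the time.")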
If you have all the data that would ever be relevant to a problem, you can compute exact
probabilities for that data. For instance, say Harry, the horse in the last section, ran only 100 races
and then was put out to pasture. Now you can compute the exact probability of Harry winning any
particular race someone might choose of the 100 possibilities. But, of course, this is not how
probability is generally used. Typically, you want to use probability to predict future behavior; that
is, you’d use information on Harry’s past performance to predict how likely he is to win his 101st
race (which you are going to bet on). This is a real-life application of probability theory. You are
in the same position when using probabilities to help resolve ambiguity, since you are interested in
parsing sentences that have never been seen before. Thus you need to use data on previously
occurring sentences to predict the interpretation of the next sentence. We will always be working
with estimates of probability rather than the actual probabilities.
As seen in the last section, one method of estimation is to use the ratios from the corpus as
the probability to predict the interpretation of the new sentence. Thus, if we have seen the word
"flies" 1000 times before, and 600 of them were as a verb, then we assume that PROB(V | "flies") is
.6 and use that to guide our guess with the 1001st case. This simple ratio estimate is called the
maximum likelihood estimator (MLE). If you have many examples of the events you are
estimating, this can be a very reliable estimate of the true probability.
In general, the accuracy of an estimate increases as the amount of data used expands, and
there is a theorem of statistics - the law of large numbers - that states that estimates can be made as
accurate as desired if you have unlimited data. Estimates can be particularly unreliable, however,
when only a small number of samples are involved. Consider trying to estimate the true probability
of a fair coin coming up heads. For the sake of discussion, since we know the
actual answer is .5, let us say the estimate is accurate enough if it falls between .25 and .75. This
range will be called the acceptable margin of error. If you only do two trials, there are four possible
outcomes, as shown in Figure 5.1: two heads, heads then tails, tails then heads, or two tails. This
means that, if you flip a coin twice, half the time you will obtain an estimate of 1/2 or .5. The other
half of the time, however, you will estimate that the probability of coming up heads is 1 (the two-
heads case) or 0 (the two tails case). So you have only a 50 percent chance of obtaining an estimate
within the desired margin of error. With three flips, the chance of getting a reliable enough estimate
jumps to 75 percent, as shown in Figure 5.2. With four flips, there is an 87.5 percent chance of the
estimate being accurate enough, with eight flips a 93 percent chance, with 12 flips a 95 percent
chance, and so on. No matter how long you flip, there is always a chance that the estimate found is
inaccurate, but you can reduce the probability of this occurring to as small a number as you desire
if you can do enough trials.
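These percentages can be checked with a short computation: for n flips of a fair coin, sum the binomial probabilities of every outcome whose estimate k/n lands inside the margin. A minimal Python sketch, assuming the margin is inclusive:

from math import comb

def chance_within_margin(n, lo=0.25, hi=0.75):
    # Probability that the estimate k/n from n fair-coin flips lies in [lo, hi].
    return sum(comb(n, k) for k in range(n + 1) if lo <= k / n <= hi) / 2**n

for n in (2, 3, 4, 8, 12):
    print(n, chance_within_margin(n))
# 2 -> 0.5, 3 -> 0.75, 4 -> 0.875, 8 -> ~0.93, 12 -> ~0.96

The exact value for 12 flips depends on whether estimates falling exactly on the margin count as acceptable.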
Almost any method of estimation works well when there is a lot of data. Unfortunately, there
are a vast number of estimates needed for natural language applications, and a large proportion of
these events are quite rare. This is the problem of sparse data. For instance, the Brown corpus
contains about a million words, but due to duplication there are only 49,000 different words. Given
this, you might expect each word to occur about 20 times on average. But over 40,000 of the words
occur five times or less. With so few occurrences, our estimates of the part of speech for such words
may be highly inaccurate. The worst case occurs if a low-frequency word does not occur at all in
one of its possible categories. Its probability in this category would then be estimated as 0; thus no
interpretation using the word in any category would be possible, because the probability of the
overall sentence containing the word would be 0 as well. Unfortunately, rare events are common
enough in natural language applications that reliable estimates for these low-frequency words are
essential for the algorithms.
There are other estimation techniques that attempt to address the problem of estimating
probabilities of low-frequency events. To examine them, let's introduce a framework in which they
can be compared. For a particular random variable X, all techniques start with a set of values, Vi,
computed from the count of the number of times X = xi. The maximum likelihood estimation
technique uses Vi = |xi|; that is, Vi is exactly the count of the number of times X = xi. Once Vi is
determined for each xi, the probability estimates are obtained by the formula
PROB (X = xi) ≡ Vi / ∑i Vi
Note that the denominator guarantees that the estimates obey the three properties of a probability
function defined in Section 5.1.
One technique to solve the zero probability problem is to ensure that no Vi has the value 0.
We might, for instance, add a small amount, say .5, to every count, such as in Vi = |xi| + .5. This
guarantees no zero probabilities yet retains the relative likelihoods for the frequently occurring
values. This estimation technique is called the expected likelihood estimator (ELE). To see the
difference between this and the MLE, consider a word w that happens not to occur in the corpus,
and consider estimating the probability that w occurs in one of 40 word classes L1,...,L40.
Thus we have a random variable X, where X = xi only if w appears in word category Li. The MLE for
PROB(X = xi) will not be defined because the formula has a zero denominator. The ELE, however,
gives an equally likely probability to each possible word class. With 40 word classes, for instance,
each Vi will be .5, and thus PROB(L1 | w) ≡ .5/20 = .025. This estimate better reflects the fact that
we have no information about the word. On the other hand, the ELE is very conservative. If w
appears in the corpus five times, once as a verb and four times as a noun, then the MLE estimate of
PROB(N | w) would be .8, while the ELE estimate would be 4.5/25 = .18, a much smaller value
than intuition would suggest.
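The contrast is easy to reproduce. Below is a minimal sketch of both estimators for the example just given (a word seen four times as a noun, once as a verb, and never in the other 38 of the 40 classes); the class labels are illustrative.

def mle(counts):
    total = sum(counts.values())      # zero, hence undefined, for an unseen word
    return {c: v / total for c, v in counts.items()}

def ele(counts, add=0.5):
    # Add a small amount to every count so no estimate is ever zero.
    smoothed = {c: v + add for c, v in counts.items()}
    total = sum(smoothed.values())
    return {c: v / total for c, v in smoothed.items()}

classes = ["N", "V"] + ["L%d" % i for i in range(3, 41)]   # 40 word classes
counts = {c: 0 for c in classes}
counts["N"], counts["V"] = 4, 1

print(mle(counts)["N"])   # 0.8
print(ele(counts)["N"])   # 4.5 / 25 = 0.18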
Evaluation
Once you have a set of estimated probabilities and an algorithm for some particular
application, you would like to be able to tell how well your new technique performs compared with
other algorithms or variants of your algorithm. The general method for doing this is to divide the
corpus into two parts: the training set and the test set. Typically, the test set consists of 10 - 20
percent of the total data. The training set is then used to estimate the probabilities, and the
algorithm is run on the test set to see how well it does on new data. Running the algorithm on the
training set is not considered a reliable method of evaluation because it does not measure the
generality of your technique. For instance, you could do well on the training set simply by
remembering all the answers and repeating them back in the test! A more thorough method of
testing is called cross-validation, which involves repeatedly removing different parts of the corpus
as the test set, training on the remainder of the corpus, and then evaluating on the new test set.
This technique reduces the chance that the test set selected was somehow easier than you might
expect.
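A minimal sketch of this train/test discipline, assuming the corpus is simply a list of tagged items and that the training and evaluation procedures are supplied by the application:

def cross_validate(corpus, train_fn, eval_fn, folds=10):
    # Hold out each fold in turn as the test set; train on the remainder.
    n = len(corpus)
    scores = []
    for f in range(folds):
        lo, hi = f * n // folds, (f + 1) * n // folds
        test, train = corpus[lo:hi], corpus[:lo] + corpus[hi:]
        model = train_fn(train)              # estimate probabilities on training data
        scores.append(eval_fn(model, test))  # evaluate on unseen data only
    return sum(scores) / folds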
5.3 Part-of-Speech Tagging
Part-of-speech tagging involves selecting the most likely sequence of syntactic categories for the
words in a sentence. A typical set of tags, used in the Penn Treebank project, is shown in Figure 5.3.
In Section 5.1 you saw the simplest algorithm for this task: Always choose the interpretation that
occurs most frequently in the training set. Surprisingly, this technique often obtains about a 90
percent success rate, primarily because over half the words appearing in most corpora are not
ambiguous. So this measure is the starting point from which to evaluate algorithms that use more
sophisticated techniques. Unless a method does significantly better than 90 percent, it is not
working very well.
The general method to improve reliability is to use some of the local context of the sentence in
which the word appears. For example, in Section 5.1 you saw that choosing the verb sense of "flies"
in the sample corpus was the best choice and would be right about 60 percent of the time. If the
word is preceded by the word "the", on the other hand, it is much more likely to be a noun. The
technique developed in this section is able to exploit such information.
Consider the problem in its full generality. Let w1,...,wT be a sequence of words. We want to find
the sequence of lexical categories C1,...,CT that maximizes
1. PROB (C1,...,CT | W1,...,WT )
Unfortunately, it would take far too much data to generate reasonable estimates for such
sequences, so direct methods cannot be applied. There are, however, reasonable approximation
techniques that produce good results.To develop them, you must restate the problem using Bayes’
rule, which says that this conditional probability equals
2. (PROB (C1,...,CT) * PROB (W1,...,WT | C1,...,CT)) / PROB (W1,...,WT)
As before, since we are interested in finding the C1,...,CT that gives the maximum value, the
common denominator in all these cases will not affect the answer. Thus the problem reduces to
finding the sequence C1,...,CT that maximizes the formula
3. PROB (C1,...,CT) * PROB (W1,...,WT | C1,...,CT)
There are still no effective methods for calculating the probability of these long sequences
accurately, as it would require far too much data. But the probabilities can be approximated by
probabilities that are simpler to collect by making some independence assumptions. While these
independence assumptions are not really valid, the estimates appear to work reasonably well in
practice. Each of the two expressions in formula 3 will be approximated. The first expression, the
probability of the sequence of categories, can be approximated by a series of probabilities based on
a limited number of previous categories. The most common assumptions use either one or two
previous categories. The bigram model looks at pairs of categories (or words) and uses the
conditional probability that a category Ci will follow a category Ci-1, written as PROB(Ci | Ci-1).
The trigram model uses the conditional probability of one category (or word) given the two
preceding categories (or words), that is, PROB(Ci | Ci-2 Ci-1). These models are called n-gram
models, in which n represents the number of words used in the pattern. While the trigram model
will produce better results in practice, we will consider the bigram model here for simplicity. Using
bigrams, the following approximation can be used:
PROB (C1,...,CT) ≡ ∏i=1,T PROB(Ci| Ci-1)
To account for the beginning of a sentence, we posit a pseudocategory Ø at position 0 as the value
of C0. Thus the first bigram for a sentence beginning with an ART would be PROB(ART | Ø). Given
this, the approximation of the probability of the sequence ART N V N using bigrams would be
PROB (ART N V N) ≡ PROB (ART | Ø) * PROB (N | ART)* PROB (V | N) * PROB (N | V)
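In code, the bigram approximation is a single product over adjacent category pairs. The sketch below uses the bigram values quoted later for this same ART N V N example (taken from Figure 5.4), with the string "0" standing in for the pseudocategory Ø:

# A few bigram probabilities PROB(Ci | Ci-1) taken from Figure 5.4.
bigram = {("0", "ART"): 0.71, ("ART", "N"): 1.0, ("N", "V"): 0.43, ("V", "N"): 0.35}

def category_sequence_prob(cats, bigram):
    # PROB(C1,...,CT) ~ product over i of PROB(Ci | Ci-1), with C0 = Ø.
    prob, prev = 1.0, "0"
    for c in cats:
        prob *= bigram.get((prev, c), 0.0)
        prev = c
    return prob

print(category_sequence_prob(["ART", "N", "V", "N"], bigram))  # ~0.107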
The second probability in formula 3,
PROB (W1,...,WT | C1,...,CT)
can be approximated by assuming that a word appears in a category independent of the words in
the preceding or succeeding categories. It is approximated by the product of the probability that
each word occurs in the indicated part of speech, that is, by
PROB (W1,...,WT | C1,...,CT) ≡ ∏i=1,T PROB(wi| Ci)
With these two approximations, the problem has changed into finding the sequence C1,...,CT that
maximizes the value of
∏i=1,T PROB(Ci | Ci-1) * PROB(wi | Ci)
The advantage of this new formula is that the probabilities involved can be readily estimated from
a corpus of text labeled with parts of speech. In particular, given a database of text, the bigram
probabilities can be estimated simply by counting the number of times each pair of categories
occurs compared to the individual category counts. The probability that a V follows an N would be
estimated as follows:
PROB(Ci = V | Ci-1 = N) ≡ Count(N at position i-1 and V at position i) / Count(N at position i-1)
            N     V     ART   P     TOTAL
flies       21    23    0     0     44
fruit       49    5     1     0     55
like        10    30    0     21    61
a           1     0     201   0     202
the         1     0     300   2     303
flower      53    15    0     0     68
flowers     42    16    0     0     58
birds       64    1     0     0     65
others      592   210   56    284   1142
TOTAL       833   300   558   307   1998
Figure 5.5 A summary of some of the word counts in the corpus
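From these counts the lexical-generation probabilities PROB(w | C) follow by dividing each cell by its column total, a sketch:

# Word/category counts from Figure 5.5.
counts = {
    "flies":   {"N": 21, "V": 23, "ART": 0,   "P": 0},
    "fruit":   {"N": 49, "V": 5,  "ART": 1,   "P": 0},
    "like":    {"N": 10, "V": 30, "ART": 0,   "P": 21},
    "a":       {"N": 1,  "V": 0,  "ART": 201, "P": 0},
    "the":     {"N": 1,  "V": 0,  "ART": 300, "P": 2},
    "flower":  {"N": 53, "V": 15, "ART": 0,   "P": 0},
    "flowers": {"N": 42, "V": 16, "ART": 0,   "P": 0},
    "birds":   {"N": 64, "V": 1,  "ART": 0,   "P": 0},
}
category_totals = {"N": 833, "V": 300, "ART": 558, "P": 307}  # the TOTAL row

def lexical_generation_prob(word, cat):
    # PROB(word | cat) ~ Count(word appearing as cat) / Count(cat)
    return counts[word][cat] / category_totals[cat]

print(round(lexical_generation_prob("flies", "N"), 3))  # 21/833 ~ 0.025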
Since we are only dealing with bigram probabilities, the probability that the i'th word is in a
category Ci depends only on the category of the (i-1)th word, Ci-1. Thus the process can be
modeled by a special form of probabilistic finite state machine, as shown in Figure 5.7. Each node
represents a possible lexical category and the transition probabilities (the bigram probabilities in
Figure 5.4) indicate the probability of one category following another.
With such a network you can compute the probability of any sequence of categories simply
by finding the path through the network indicated by the sequence and multiplying the transition
probabilities together. For instance, the sequence ART N V N would have the probability .71 * 1 *
.43 * .35 = .107. This representation, of course, is only accurate if the probability of a category
occurring depends only on the one category before it. In probability theory this is often called the
Markov assumption, and networks like that in Figure 5.7 are called Markov chains.
Now we can resume the discussion of how to find the most likely sequence of categories for
a sequence of words. The key insight is that because of the Markov assumption, you do not have to
enumerate all the possible sequences. In fact, sequences that end in the same category can be
collapsed together since the next category only depends on the one previous category in the
sequence. So if you just keep track of the most likely sequence found so far for each possible ending
category, you can ignore all the other less likely sequences. For example, consider the problem of
finding the most likely categories for the sentence "Flies like a flower", with the lexical-generation
probabilities and bigram probabilities discussed so far. Given that there are four possible
categories, there are 4^4 = 256 different sequences of length four. The brute force algorithm would
have to generate all 256 sequences and compare their probabilities in order to find this one.
Exploiting the Markov assumption, however, this set of sequences can be collapsed into a
representation that considers only the four possibilities for each word. This representation, shown
as a transition diagram in Figure 5.8, represents all 256 sequences. To find the most likely
sequence, you sweep forward through the words one at a time finding the most likely sequence for
each ending category. In other words, you first find the four best sequences for the two words "flies
like": the best ending with "like" as a V, the best as an N, the best as a P, and the best as an ART. You
then use this information to find the four best sequences for the three words "flies like a", each one
ending in a different category. This process is repeated until all the words are accounted for. This
algorithm is usually called the Viterbi algorithm. For a problem involving T words and N lexical
categories, it is guaranteed to find the most likely sequence using k*T*N^2 steps, for some constant
k, significantly better than the N^T steps required by the brute force search! The
rest of this section develops the Viterbi algorithm in detail.
Figure 5.9 The Viterbi algorithm
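The figure is not reproduced here, but the same dynamic programming can be written down directly: for each word position and each category, keep only the best score of any sequence ending in that category, together with a backpointer. A sketch, using flat dictionaries keyed by pairs as in the earlier sketches; all names are ours.

def viterbi(words, cats, lexical, bigram, start="0"):
    # lexical[(w, c)] = PROB(w | c); bigram[(p, c)] = PROB(c | p);
    # "0" is the pseudocategory for the start of the sentence.
    score = {c: bigram.get((start, c), 0.0) * lexical.get((words[0], c), 0.0)
             for c in cats}
    back = [{c: None for c in cats}]
    for w in words[1:]:
        new_score, pointers = {}, {}
        for c in cats:
            # Best previous category for a sequence ending in c at this word.
            best_prev = max(cats, key=lambda p: score[p] * bigram.get((p, c), 0.0))
            new_score[c] = (score[best_prev] * bigram.get((best_prev, c), 0.0)
                            * lexical.get((w, c), 0.0))
            pointers[c] = best_prev
        score = new_score
        back.append(pointers)
    # Trace the backpointers from the best final category.
    best = max(cats, key=lambda c: score[c])
    seq = [best]
    for pointers in reversed(back[1:]):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq)), score[best]

Each word position considers every pair of categories once, giving the k*T*N^2 behavior claimed above.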
Corpus-based methods suggest some new ways to control parsers. If we had some large
corpora of parsed sentences available, we could use statistical methods to identify the common
structures in English and favor these in the parsing algorithm. This might allow us to choose the
most likely interpretation when a sentence is ambiguous, and might lead to considerably more
efficient parsers that are nearly deterministic. Such corpora of parsed sentences are now becoming
available.
The first issue is what the input would be to such a parser. One simple approach would be
to use a part-of-speech tagging algorithm from the last section to select a single category for each
word and then start the parse with these categories. If the part-of-speech tagging is accurate, this
will be an excellent approach, because a considerable amount of lexical ambiguity will be
eliminated before the parser even starts. But if the tagging is wrong, it will prevent the parser from
ever finding the correct interpretation. Worse, the parser may find a valid but implausible
interpretation based on the wrongly tagged word and never realize the error. Consider that even at
95 percent accuracy, the chance that every word is correct in a sentence consisting of only 8 words
is about .66 (.95^8), and with 15 words it is .46 - less than half. Thus the chances of this approach working in
general look slim.
The sequences in which "flies" is a verb, on the other hand, have a total sum of 1.13 * 10^-5. Since these are the only sequences that have nonzero scores when the
second word is "flies", the sum of all these sequences will be the probability of the sequence "The
flies", namely 9.591 * 10^-3. We can now compute the probability that "flies" is a noun as follows:
PROB(flies/N | The flies) = PROB(flies/N & The flies) / PROB(The flies)
= 9.58 * 10^-3 / 9.591 * 10^-3
= .9988
Likewise, the probability that "flies" is a verb would be .0012.
Figure 5.14 The forward algorithm for computing the lexical probabilities
Figure 5.16 Context-dependent estimates for lexical categories in the sentence The flies like flowers
Consider deriving the lexical probabilities for the sentence "The flies like flowers" using the
probability estimates in Figures 5.4 and 5.6. The algorithm in Figure 5.14 would produce the sums
shown in Figure 5.15 for each category in each position, resulting in the probability estimates
shown in Figure 5.16.
Note that while the context-independent approximation in Figure 5.13 slightly favors the
verb interpretation of "flies", the context-dependent approximation virtually eliminates it because
the training corpus had no sentences with a verb immediately following an article. These
probabilities are significantly different from the context-independent ones and much more in line
with intuition.
Note that you could also consider the backward probability, βi(t), the probability of
producing the sequence wt,...,wT beginning from state wt/Li. These values can be computed by
an algorithm similar to the forward probability algorithm but starting at the end of the sentence
and sweeping backward through the states. Thus a better method of estimating the lexical
probabilities for word wt would be to consider the entire sentence rather than just the words up to
t. In this case, the estimate would be
PROB(wt / Li) = (αi(t) * βi(t)) / ∑j=1,N (αj(t) * βj(t))
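Below is a sketch of the forward sweep (the backward sweep is the mirror image, starting from the end of the sentence); here alpha[t][C] approximates the probability of producing w1,...,wt with wt in category C, using the same dictionary formats as the earlier sketches.

def forward(words, cats, lexical, bigram, start="0"):
    alpha = [{c: bigram.get((start, c), 0.0) * lexical.get((words[0], c), 0.0)
              for c in cats}]
    for w in words[1:]:
        prev = alpha[-1]
        # Sum over all categories of the previous word, then generate w.
        alpha.append({c: sum(prev[p] * bigram.get((p, c), 0.0) for p in cats)
                         * lexical.get((w, c), 0.0)
                      for c in cats})
    return alpha

# Forward-only estimate for word t: alpha[t][Li] / sum over j of alpha[t][Lj];
# the estimate above additionally multiplies in the backward values.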
Just as finite state machines could be generalized to the probabilistic case, context-free grammars
can also be generalized. To do this, we must have some statistics on rule use. The simplest
approach is to count the number of times each rule is used in a corpus containing parsed sentences
and use this to estimate the probability of each rule being used. For instance, consider a category C,
where the grammar contains m rules, R1,...,Rm, with the left-hand side C. You could estimate the
probability of using rule Rj to derive C by the formula
PROB(Rj | C) ≡ Count(# times Rj used) / ∑i=1,m Count(# times Ri used)
Grammar 5.17 shows a probabilistic CFG with the probabilities derived from analyzing a parsed
version of the demonstration corpus.
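Estimating these rule probabilities from a parsed corpus is simple counting. A minimal sketch, assuming the treebank can be flattened into a stream of (left-hand side, right-hand side) rule uses:

from collections import Counter

def estimate_rule_probs(rule_uses):
    # rule_uses: iterable of rules such as ("NP", ("ART", "N")).
    rule_counts, lhs_counts = Counter(), Counter()
    for lhs, rhs in rule_uses:
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
    # PROB(rule | lhs) = count(rule) / count(any rule with this lhs)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}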
You can then develop algorithms similar in function to the Viterbi algorithm that, given a
sentence, will find the most likely parse tree that could have generated that sentence. The
technique involves making certain independence assumptions about rule use. In particular, you
must assume that the probability of a constituent being derived by a rule Rj is independent of how
the constituent is used as a subconstituent. For example, this assumption would imply that the
probabilities of NP rules are the same whether the NP is the subject, the object of a verb, or the
object of a preposition. We know that this assumption is not valid in most cases. For instance, noun
phrases in the subject position are much more likely to be pronouns than noun phrases not in the
subject position. But, as before, it might be that useful predictive power can be obtained using these
techniques in practice.
With this assumption, a formalism can be developed based on the probability that a
constituent C generates a sequence of words wi, wi+1,...,wj, written as wi,j. This type of probability
is called the inside probability because it assigns a probability to the word sequence inside the
constituent. It is written as
PROB(wi,j | C)
Consider how to derive inside probabilities. The case for lexical categories is simple. In fact,
these are exactly the lexical-generation probabilities derived in Section 5.3. For example,
PROB(flower | N) is the inside probability that the constituent N is realized as the word "flower",
which for our hypothetical corpus was .06, given in Figure 5.6.
Using such lexical-generation probabilities, we can then derive the probability that the
constituent NP generates the sequence "a flower" as follows: There are only two NP rules in
Grammar 5.17 that could generate a sequence of two words. The parse trees generated by these
two rules are shown in Figure 5.18. You know the likelihood of each rule, estimated from the
corpus as shown in Grammar 5.17, so the probability that the constituent NP generates the words
"a flower" will be the sum of the probabilities of the two ways it can be derived.
This probability can then be used to compute the probability of larger constituents. For
instance, the probability of generating the words "A flower wilted" from constituent S could be
computed by summing the probabilities generated from each of the possible trees shown in Figure
5.19. Although there are three possible interpretations, the first two differ only in the derivation of
"a flower" as an NP. Both these interpretations are already included in the preceding computation
of PROB(a flower | NP), so the probability of "A flower wilted" follows directly from it.
Using this method, the probability that a given sentence will be generated by the grammar
can be computed efficiently. The method only requires recording the value of each constituent
between each two possible positions.
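For a grammar restricted to binary and lexical rules, this bottom-up computation over spans records one table entry per constituent between each two positions, as just described. Below is a sketch of that computation under the stated assumption, not the exact procedure used with Grammar 5.17; the rule-table format is ours.

def inside_probs(words, lexical_rules, binary_rules):
    # lexical_rules[(C, w)] = PROB(C -> w); binary_rules[(C, A, B)] = PROB(C -> A B).
    # Returns inside[(i, j)][C] = PROB(words[i:j] | C).
    n = len(words)
    inside = {}
    for i, w in enumerate(words):   # length-1 spans: lexical-generation probabilities
        inside[(i, i + 1)] = {C: p for (C, word), p in lexical_rules.items()
                              if word == w}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j, cell = i + length, {}
            for k in range(i + 1, j):            # every split point
                for (C, A, B), p in binary_rules.items():
                    pa = inside[(i, k)].get(A, 0.0)
                    pb = inside[(k, j)].get(B, 0.0)
                    if pa and pb:                # sum over all derivations of C
                        cell[C] = cell.get(C, 0.0) + p * pa * pb
            inside[(i, j)] = cell
    return inside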
We will not pursue this development further, however, because in parsing, you are
interested in finding the most likely parse rather than the overall probability of a given sentence. In
particular, it matters which of the two NP interpretations in Figure 5.19 was used. The probabilities
of specific parse trees can be found using a standard chart parsing algorithm, where the probability
of each constituent is computed from the probability of its subconstituents and the probability of
the rule used. Specifically, when entering an entry E of category C using a rule i with n
subconstituents corresponding to entries E1,...,En, then
PROB (E) = PROB (Rule i | C) * PROB (E1) * ... * PROB (En)
For lexical categories it is better to use the forward probability than the lexical-generation
probability. This will produce better estimates, because it accounts for some of the context of the
sentence. You can use the standard chart parsing algorithm and add a step that computes the
probability of each entry when it is added to the chart. Using the bottom-up algorithm and the
probabilities derived from the demonstration corpus, Figure 5.20 shows the complete chart for the
input "a flower". Note that the most intuitive reading, that it is an NP, has the highest
probability by far, .54. But there are many other possible interpretations because of the low
probability readings of the word "a" as a noun and the reading of "flower" as a verb. Note that the
context-independent probabilities for "flower" would favor the verb interpretation, but the forward
algorithm strongly favors the noun interpretation in the context where it immediately follows the
word "a". The probabilities for the lexical constituents ART416, N417, V419, and N422 were
computed using the forward algorithm. The context-independent lexical probabilities would be
much less in line with intuition.
Unfortunately, a parser built using these techniques turns out not to work as well as you
might expect, although it does help. Some researchers have found that these techniques identify the
correct parse about 50 percent of the time. It doesn't do better because the independence
assumptions that need to be made are too radical. Problems arise in many guises, but one critical
issue is the handling of lexical items. For instance, the context-free model assumes that the
probability of a particular verb being used in a VP rule is independent of which rule is being
considered. This means lexical preferences for certain rules cannot be handled within the basic
framework. This problem then influences many other issues, such as attachment decisions.
For example, Grammar 5.17 indicates that rule 3 is used 39 percent of the time, rule 4, 22
percent of the time, and rule 5, 24 percent of the time. This means that, independent of whatever
words are used, an input sequence of form V NP PP will always be interpreted with the PP attached
to the verb. The tree fragments are shown in Figure 5.21: Any structure that attaches the PP to the
verb will have a probability of .22 from this fragment, whereas the structure that attaches the PP to
the NP will have a probability of .39 * .24, namely .093. So the parser will always attach a PP to the
verb phrase independent of what words are used. In fact, there are 23 cases of this situation in the
corpus where the PP attaches to an NP, and the probabilistic parser gets every one of them wrong.
Figure 5.21 The two structures affecting attachment decisions
This is true even with particular verbs that rarely take an NP PP complement. Because of the
context-free assumption made about probabilities, the particular verb used has no effect on the
probability of a particular rule being used. Because of this type of problem, a parser based on this
approach with a realistic grammar is only a slight improvement over the nonprobabilistic methods.
For example, 84 sentences were selected from the corpus, each involving a PP attachment
problem. With the standard bottom-up nonprobabilistic algorithm, using Grammar 5.17, the first
complete S structure found was the correct answer on one-third of the sentences. By using
probabilistic parsing, the highest probability S structure was correct half of the time. Thus, pure
probabilistic parsing is better than guessing but leaves much to be desired. Note also that this test
was on the same sentences that were used to compute the probabilities in the first place, so
performance on new sentences not in the training corpus would tend to be even worse! Papers in
the literature that use real corpora and extensive training report accuracy results similar to those
described here.
While the pure probabilistic context-free grammars have their faults, the techniques
developed here are important and can be reused in generalizations that attempt to develop more
context-dependent probabilistic parsing schemes.
So far, probabilistic context-free grammars have done nothing to improve the efficiency of
the parser. Algorithms can be developed that attempt to explore the high-probability constituents
first. These are called best-first parsing algorithms. The hope is that the best parse can be found
quickly and much of the search space, containing lower-rated possibilities, is never explored.
It turns out that all the chart parsing algorithms in Chapter 3 can be modified fairly easily to
consider the most likely constituents first. The central idea is to make the agenda a priority queue -
a structure where the highest-rated elements are always first in the queue. The parser then
operates by always removing the highest-ranked constituent from the agenda and adding it to the
chart.
It might seem that this one change in search strategy is all that is needed to modify the
algorithms, but there is a complication. The previous chart parsing algorithms all depended on the
fact that the parser systematically worked from left to right, completely processing constituents
occurring earlier in the sentence before considering later ones. With the modified algorithm, this is
not the case. If the last word in the sentence has the highest score, it will be added to the chart first.
The problem this causes is that you cannot simply add active arcs to the chart (and depend on later
steps in the algorithm to extend them). In fact, the constituent needed to extend a particular active
arc may already be on the chart. Thus, whenever an active arc is added to the chart, you must check
to see if it can be extended immediately, given the current chart. Thus we need to modify the arc
extension algorithm. The algorithm is as before (in Section 3.4), except that step 2 is modified to
check for constituents that already exist on the chart. The complete algorithm is shown in Figure
5.22.
Figure 5.22 The new arc extension algorithm
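The full arc extension algorithm is in the figure; the agenda change itself is small. Below is a sketch of an agenda as a priority queue (Python's heapq is a min-heap, so scores are negated to pop the highest-probability constituent first); the class is illustrative, not the parser's actual data structure.

import heapq

class Agenda:
    # Priority queue of constituents, highest probability first.
    def __init__(self):
        self._heap = []
        self._n = 0          # insertion counter breaks ties between equal scores
    def add(self, constituent, prob):
        heapq.heappush(self._heap, (-prob, self._n, constituent))
        self._n += 1
    def pop_best(self):
        neg, _, constituent = heapq.heappop(self._heap)
        return constituent, -neg
    def __bool__(self):
        return bool(self._heap)

The parser loop then repeatedly pops the best constituent, adds it to the chart, and checks, as in Figure 5.22, whether any active arc already on the chart can be extended by constituents already present.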
Even though it does not consider every possible constituent, the best-first parser is
guaranteed to find the highest probability interpretation. To see this, assume that the parser finds
an interpretation S1 with probability p1. The important property of the probabilistic scoring is that
the probability of a constituent must always be lower (or equal) to the probability of any of its
subconstituents. So if there were an interpretation S2 that had a score p2, higher than p1, then it
would have to be constructed out of subconstituents all with a probability of p2 or higher.
This means that all the subconstituents would have been added to the chart before S1 was. But this
means that the arc that constructs S2 would be completed, and hence S2 would be on the agenda.
Since S2 has a higher score than S1, it would be considered first.
While the idea of a best-first parser is conceptually simple, there are a few problems that
arise when trying to apply this technique in practice. One problem is that if you use a multiplicative
method to combine the scores, the scores for constituents tend to fall quickly as they cover more
and more input. This might not seem problematic, but in practice, with large grammars, the
probabilities drop off so quickly that the search closely resembles a breadth-first search: First build
all constituents of length 1, then all constituents of length 2, and so on. Thus the promise of quickly
finding the most preferred solution is not realized. To deal with this problem, some systems use a
different function to compute the score for constituents. For instance, you could use the minimum
score of any subconstituent and the rule used, that is,
PROB(E) = MIN(PROB(Rule i | C), PROB(E1), ..., PROB(En))
This gives a higher (or equal) score than the first approach, but a single poorly-rated
subconstituent can essentially eliminate any constituent that contains it, no matter how well all the
other constituents are rated. Unfortunately, testing the same 84 sentences using the MIN function
leads to a significant decrease in accuracy to only 39 percent, not much better than a brute force
search. But other researchers have suggested that the technique performs better than this in actual
practice. You might also try other means of combining the scores, such as taking the average score
of all subconstituents.
The best-first algorithm leads to improvement in the efficiency of the parser but does not
affect the accuracy problem. This section explores a simple alternative method of
computing rule probabilities that uses more context-dependent lexical information. The
idea exploits the observation that the first word in a constituent is often the head and thus
has a dramatic effect on the probabilities of rules that account for its complement. This
suggests a new probability measure for rules that is relative to the first word, PROB(R | C,
w). This is estimated as follows:
PROB(R | C, w) ≡ Count(# times rule R used for cat C starting with w) / Count(# times cat C starts with w)
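Estimating these first-word-conditioned probabilities is again a counting exercise. A minimal sketch, assuming the treebank records, for each constituent, the rule used and the first word the constituent covers:

from collections import Counter

def estimate_contextual_rule_probs(constituents):
    # constituents: iterable of ((lhs, rhs), first_word) pairs.
    rule_word, cat_word = Counter(), Counter()
    for (lhs, rhs), w in constituents:
        rule_word[((lhs, rhs), w)] += 1
        cat_word[(lhs, w)] += 1
    # PROB(rule | lhs, w) = count(rule starting with w) / count(lhs starting with w)
    return {(rule, w): n / cat_word[(rule[0], w)]
            for (rule, w), n in rule_word.items()}

For example, the entry for (("VP", ("V", "NP", "PP")), "put") would come out high (around .93 in this corpus), while the same rule conditioned on "like" would be low (around .1).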
The effect of this modification is that probabilities are sensitive to the particular words used. For
instance, in the corpus, singular nouns rarely occur alone as a noun phrase (that is, starting rule
NP -> N), whereas plural nouns are rarely used as a noun modifier (that is, starting rule
NP -> N N). This can be seen in the difference between the probabilities for these two rules given
the words "house" and "peaches", shown in Figure 5.23. With the context-free probabilities, the
rule NP -> N had a probability of .14. This underestimates the rule when the input has a plural
noun, and overestimates it when the input has a singular noun.
More importantly, the context-sensitive rules encode verb preferences for different
subcategorizations. In particular, Figure 5.23 shows that the rule VP -> V NP PP is used 93 percent of
the time with the verb "put" but only 10 percent of the time with "like". This difference allows a
parser based on these probabilities to do significantly better than the context-free probabilistic
parser. In particular, on the same 84 test sentences on which the context-free probabilistic parser
had 49 percent accuracy, the context-dependent probabilistic parser has 66 percent accuracy,
getting the attachment correct on 14 sentences on which the context-free parser failed. The
context-dependent parser is also more efficient, finding the answer on the sentence "The man put
the bird in the house" while generating only 36 constituents. The results of the three parsing
strategies are summarized in Figure 5.24.
To see why the context-dependent parser does better, consider the attachment decision that has to
be made between the trees shown in Figure 5.21 for the verbs "like" and "put", say in the sentences
"The man put the bird in the house" and "The man likes the bird in the house". The context-free
probabilistic parser would assign the same structure to both, getting the example with "put" right
and the example with "like" wrong. The relevant parts of the charts are shown in Figures 5.25 and
5.26. In Figure 5.25 the probability of the rule VP -> V NP PP,
starting with "put", is .93, yielding a probability of .54 (.93 * .99 * .76 * .76) for constituent VP6840.
This is far greater than the probability of .0038 given the alternative VP6828. Changing the verb to
"like", as shown in Figure 5.26, affects the probabilities enough to override the initial bias towards
the V-NP-PP interpretation. In this case, the probability of the rule VP -> V NP PP, starting with
"like", is only .1, whereas the probability of rule VP -> V NP is .9. This gives a probability of .1 to
VP6883, beating out the alternative.
The question arises as to how accurate a parser can become by using the appropriate contextual
information. While 66 percent accuracy was good compared to the alternative strategies, it still
leaves a 33 percent error rate on sentences involving a PP attachment decision. What additional
information could be added to improve the performance? Clearly, you could make the rule
probabilities relative to a larger fragment of the input, using a bigram or trigram at the beginning of
the rule. This may help the selectivity if there is enough training data. Attachment decisions depend
not only on the verb but also on the preposition in the PP. You could devise a more
complex estimate based on the previous category, the head verb, and the preposition for rule VP ->
V NP PP.
This would require more data to obtain reliable statistics but could lead to a significant increase in
reliability. But a complication arises in that you cannot compare across rules as easily now: The VP
-> V NP PP rule is evaluated using the verb and the preposition, but there is no corresponding
preposition with the rule VP -> V NP. Thus a more complicated measure would have to be devised.
In general, the more selective the lexical categories, the more predictive the estimates can be,
assuming there is enough data. Earlier we claimed that basing the statistics on words would
require too much data. But is there a subset of words that could profitably be used individually?
Certainly function words such as prepositions seem ideally suited for individual treatment. There is
a fixed number of them, and they have a significant impact on the structure of the sentence.
Similarly, other closed class words - articles, quantifiers, conjunctions, and so on – could all be
treated individually rather than grouped together as a class. It is reasonable to assume we will be
able to obtain enough data to obtain reliable estimates with these words.
The complications arise with the open class words. Verbs and nouns, for instance, play a major role
in constraining the data, but there are far too many of them to consider individually. One approach
is to model the common ones individually, as we did here. Another would be to try to cluster words
together into classes based on their similarities. This could be done by hand, based on semantic
properties. For instance, all verbs describing motion might have essentially the same behavior and
be collapsed into a single class. The other option is to use some automatic technique for learning
useful classes by analyzing corpora of sentences with attachment ambiguities.
Code No. 3386
FACULTY OF INFORMATICS
B.E. 4/4 (IT) I-Semester (Makeup) Examination, May/June 201
Subject: Natural Language Processing (Elective-IV)
Time: 3 hours                                Max. Marks: 75
Note: Answer all questions.
8. Define Natural Language Processing. List any two applications of natural language processing. (3)
9. What is logical form? How is it helpful in sentence interpretation? (3)
10. Draw and explain the derived tree for the sentence "the boy kicked the bucket" using Context-Free grammars. (3)
PART B (50 Marks)
11 a) What is Natural Language understanding? Explain with an example. (5)
b) What is a phrase? Explain prepositional phrase and complements with example. (5)
12 a) Explain Shift Reduce Parser. (5)
b) What is Look ahead parser? (5)
13 Explain semantic networks with example. What is the advantage in removing the inheritance property in a semantic network? (10)
14 Explain the different types of logical forms. Discuss the features of logical forms. (10)
15 Discuss rule-by-rule semantic interpretation based on lambda calculus. (10)
16 Write short notes on: a) Word senses ambiguity b) Encoding ambiguity in logical forms. (10)
7 State why each of the following sentences is ambiguous or not. Specifically, state
whether they are ambiguous because of their possible syntactic structures, their word
senses, their semantic structures, or a combination of these factors. (3)
S1: A man stopped at every truck stop
S2: Several people ate the pizza
S3: We saw her duck
8 The process of mapping the logical form to the final knowledge representation (KR)
language is called ______. (2)
9 Say we have an acceptable margin of error of between 0.4 and 0.6 for estimating the
probability of a fair coin coming up heads. What is the chance of obtaining a reliable
estimate for the probability with five flips? (3)
10 Write the advantages of best-first parser. (2)
PART B (50 Marks)
11 (a) Draw a diagram to show the flow of information in Natural Language Understanding
Systems and explain with the following two example sentences: (7)
S1: Visiting relatives can be trying
S2: Visiting museums can be trying
(b) Write short notes on different approaches to studying language. (3)
12 (a) Explain verb phrases and prepositional phrases with an example.
(b) Discuss different strategies for interpretation in Machine Translation.
13 (a) Consider the following CFG:
1. S -> NP V
2. S -> NP AUX V
3. NP -> ART N
with the lexicon entries: The: ART, man: N, is: AUX, laughing: V.
Trace one of the chart parsers in processing the sentence "The man is laughing",
showing every step of the parse and drawing the chart, giving the time every
non-terminal constituent is added to the chart.
(b) Write the Top-Down Chart Parsing Algorithm with an example.
14 (a) Explain about Semantics and Logical Form.
(b) Identify the word senses of the words used in the following sentences, and state
whether the sense is a term, property (unary predicate), n-ary predicate, logical
operator, generalized quantifier, predicate operator, or modal operator. Justify your
classification and discuss alternate interpretations that could work just as well. Give
an example logical form that uses each sense.
S1: The yellow can fell to the ground
S2: He can see it
S3: He wants to can the tomatoes
FACULTY OF ENGINEERING
B.E. 4/4 (I.T.) II-Semester (Main & Backlog) Examination, 2019
Subject: Natural Language Processing (Elective - )
Time: 3 hours                                Max. Marks: 75
12. a) ... Contextual-Interpretation. (5)
b) What is Ambiguity? How do you encode it? (5)
13. a) Illustrate Depth-First Top-Down Parsing. (5)
b) Discuss issues in the design of the Lexicon. (5)
14. a) Discuss co-occurrence constraints that arise between word senses. (5)
b) Discuss linking Syntax and Semantics. (5)
15. a) Present a Bigram and explain some typical computations involved therein. (5)
c) Explain the use of the following assumption: the probability of a category occurring