
Statistical Natural Language Processing

Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas

Honors AI

First offering in Spring 2010
MW 7:00-8:15pm
Whether it will be offered again depends on enrollment
Isn't more difficult than the regular section, but covers more topics at a faster pace
Web search technologies
Small-scale project to be done in teams of 1-2 people

Undergraduate AI Courses

COGS/CS 4314: Intelligent Systems Analysis (not offered every year)
COGS/CS 4315: Intelligent Systems Design (not offered every year)
CS 4365: Artificial Intelligence (Spring, and probably Fall)
CS 4365 (Honors): Artificial Intelligence (Spring only)
CS 4375: Introduction to Machine Learning (Fall only)
CS 4391: Introduction to Computer Vision (not offered every year; may be offered Fall or Spring)

Graduate AI Courses

CS 6320: Natural Language Processing
CS 6322: Information Retrieval
CS 6364: Artificial Intelligence
CS 6373: Intelligent Systems
CS 6375: Machine Learning
CS 6384: Computer Vision
CS 6395: Speech Recognition, Synthesis & Understanding
CS 6V81: Statistical Natural Language Processing

The Intelligent Systems Group

Dr. Sanda Harabagiu: information retrieval, natural language processing
Dr. Vasileios Hatzivassiloglu: natural language processing, bioinformatics
Dr. Yang Liu: speech and language processing
Dr. Dan Moldovan: natural language processing, knowledge representation
Dr. Vincent Ng: natural language processing
Dr. Haim Schweitzer: computer vision

Statistical Natural Language Processing



Where are the Flying Cars?

According to science fiction, the future has talking machines.
  Metropolis (1926): false Maria
  Star Wars: Episode IV: C-3PO (Maria's influence?)
  2001: A Space Odyssey (1968): HAL (the HAL-9000)

Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.

Requires both understanding and generation

Natural Language Processing (NLP)

Natural language: languages that people use to communicate with one another

Ultimate goal: to build computer systems that perform as well at using natural languages as humans do

Immediate goal: to build computer systems that can process text and speech more intelligently

[Diagram: humans and computers linked by language understanding and language generation]

Why NLP?

Lots of information is in natural language format
  Documents
  News broadcasts
  User utterances

Lots of users want to communicate in natural language
  "Do what I mean!"

NLP is Useful
Application: Text Summarization

"Summarize the public commentary regarding the prohibition of potassium hydroxide for peeling peaches."

[Diagram: e-mail, letters, editorials, technical reports, and newswires are condensed into a multi-document summary]

NLP is Useful
Application: Information Retrieval

Topic: Advantages of using potassium hydroxide in any aspect of organic farming, especially...

[Diagram: an information need is run against a text collection, returning relevant documents ranked by score]

NLP is Useful
Application: Question Answering

Retrieve not just relevant documents, but return the answer

Query: Which country has the largest part of the Amazon forest?
Answer: Brazil

[Diagram: the query is run against a text collection to produce the answer]

NLP is Useful
Application: Information Extraction

AFGHANISTAN MAY BE PREPARING FOR ANOTHER TEST

"Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake."

Extracted template:
  Disaster Type: earthquake
    location: Afghanistan
    date: today
    magnitude: 6.9
    magnitude-confidence: high
  damage:
    human-effect:
      victim: Thousands of people
      number: Thousands
      outcome: dead
    physical-effect:
      object: entire villages
      outcome: damaged

NLP is Useful
Application: Machine Translation

How do you say "octopus" in Japanese?
[Diagram: a Japanese-to-English translator]

Bill Gates, 1997: "now we're betting the company on these natural interface technologies"

NLP is Interdisciplinary

Linguistics: models of language; emphasizes 100% accuracy
Psychology: models of cognitive processes; emphasizes biological and/or cognitive plausibility
Mathematics and statistics: properties of models; emphasizes formal aspects

vs.

NLP: the computational study of language use
  Definite engineering aspect in addition to a scientific one
  Scientific goal: to explore the nature of linguistic communication
  Engineering goal: to enable effective human-machine communication
  Emphasis on computational, not cognitive, plausibility
  Models of language: 95% correct is OK

Why study NLP?

Challenging
  AI-complete
    Borrows from the notion of NP-completeness
    To solve NLP, you'd need to solve all of the problems in AI
  Turing test
    Turing (1950): "Computing Machinery and Intelligence"
    Posits that engaging effectively in linguistic behavior is a sufficient condition for having achieved intelligence

The Turing Test

Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes.

But little kids can do NLP

Why is NLP hard?

Why is NLP hard?

Ambiguity!!! At all levels of analysis

Phonetics and phonology
Concerns how words are related to the sounds that realize them
Important for speech-based systems
  "I scream" vs. "ice cream"
  "It's very hard to recognize speech" vs. "It's very hard to wreck a nice beach"

Why is NLP hard?

Ambiguity!!! At all levels of analysis

Morphology
Concerns how words are constructed from sub-word units
  "Unionized"
    Union-ized?
    Un-ionized, in chemistry?

Why is NLP hard?

Ambiguity!!! At all levels of analysis

Syntax
Concerns sentence structure
Different syntactic structure implies different interpretation
  "Squad helps dog bite victim."
    [NP squad] [VP helps [NP dog bite victim]]
    [NP squad] [VP helps [NP dog] [INF-CLAUSE bite victim]]

Why is NLP hard?

Ambiguity!!! At all levels of analysis

Semantics
Concerns what words mean and how these meanings combine to form sentence meanings
  "Jack invited Mary to the Halloween ball."
    A dance, or some big sphere with Halloween decorations?

Why is NLP hard?

Ambiguity!!! At all levels of analysis

Discourse
Concerns how the immediately preceding sentences affect the interpretation of the next sentence
  "The city council refused to give the women a permit because they feared violence."
  "The city council refused to give the women a permit because they advocated violence."

I'm Afraid I Can't Do That

The task seems so difficult! What resources do we need?
  Knowledge about language
  Knowledge about the world

An Idea

Have computers learn models of language
Statistical NLP:
  Learns statistical models that capture language properties from a corpus (text samples)
  Helps ease the knowledge acquisition bottleneck

Why is statistical language learning possible?
  Usage of words exhibits statistical regularities.

Probabilities are Realistic

"It's hard to recognize speech"
vs.
"It's hard to wreck a nice beach"
Which is more likely? (Both are grammatical.)
Applications: speech recognition, handwriting recognition, spelling correction, ...

General problem in statistical NLP: density estimation
  P("It's hard to recognize speech")
  P("It's hard to wreck a nice beach")

No, Really, It's a Crazy Idea

Late 50s-80s: statistical NLP in disfavor

"It is fair to assume that neither sentence
(1) Colorless green ideas sleep furiously
nor
(2) Furiously sleep ideas green colorless
... has ever occurred ... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." [Chomsky 1957]

Who Are You Calling Crazy?

"I don't believe in this statistics stuff"
"That's not learning, that's statistics"

"Knowledge-intensive NLP is going nowhere fast"
"Every time I fire a linguist, my performance goes up"

Which sentence is more likely?

"It's hard to recognize speech"
vs.
"It's hard to wreck a nice beach"

Statistical approach: density estimation
  P("It's hard to recognize speech")
  P("It's hard to wreck a nice beach")
Estimate these probabilities from a corpus (text sample)
  Count the number of times each sentence appears in the corpus
  Divide the count by the total number of sentences in the corpus
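A minimal Python sketch of this naive estimator, assuming the corpus is given as a plain list of sentence strings (all names here are illustrative):

```python
from collections import Counter

def naive_sentence_probability(sentence, corpus_sentences):
    """Relative frequency of the exact sentence in the corpus."""
    counts = Counter(corpus_sentences)
    return counts[sentence] / len(corpus_sentences)
```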

Is there any problem with this approach?


Solution 1: Use a larger corpus

Many sentences may still not appear even in a larger corpus.
Their probability will be zero!

Problems

We may not be able to find these sentences even in a very large corpus
Even if we do, each of them may appear only once or twice

Solution 2: Use a language model

A language model assigns a probability to a sentence
How?

A Simple Two-Step Approach

Goal: assign a probability to a sentence
  "Lets take a talk."
Step 1: Compute the probability of each word in the sentence using the previous N-1 words.
Assume N=3. Compute
  P(lets | 2nd-prev-word=null, prev-word=null)
  P(take | 2nd-prev-word=null, prev-word=lets)
  P(a | 2nd-prev-word=lets, prev-word=take)
  P(talk | 2nd-prev-word=take, prev-word=a)

A Simple Two-Step Approach

How to compute P(talk | 2nd-prev-word=take, prev-word=a)?
Collect statistics from the corpus!
  Count the number of times we see "take a talk" in the corpus
  Count the number of times we see "take a" in the corpus
  Divide these two numbers
Now we know how to compute the probability of each word
Step 2: Multiply the probabilities of the words to get the probability of the sentence
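The two steps fit in a few lines of Python. A sketch, assuming the corpus arrives as a list of token lists (function and variable names are illustrative):

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count each trigram and its two-word context; None pads the start
    so the first words of a sentence still have N-1 previous 'words'."""
    tri, bi = Counter(), Counter()
    for tokens in sentences:
        padded = [None, None] + tokens
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi

def sentence_probability(tokens, tri, bi):
    """Step 1: P(word | two previous words) as a ratio of counts.
    Step 2: multiply the per-word probabilities together."""
    p = 1.0
    padded = [None, None] + tokens
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        if bi[context] == 0:
            return 0.0  # context never seen: the estimate collapses to zero
        p *= tri[tuple(padded[i - 2:i + 1])] / bi[context]
    return p
```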

An Example

P(Lets take a talk)
  = P(Lets | null, null) * P(take | null, Lets)
  * P(a | Lets, take) * P(talk | take, a)

Does language modeling solve the problems of
(1) not seeing a sentence in a corpus at all?
(2) not seeing a sentence frequently enough?

Problems Solved???

To some extent
  More likely to be able to find short word sequences than long word sequences in a corpus
Still, there is no guarantee that we will be able to find "Lets take a"
If we cannot, the probability of the sentence will be zero, even if the sentence is sensible

Solution 3: Use a Language Model with Small N

Use N=2:
  P(Lets take a talk)
  = P(Lets | null) * P(take | Lets) * P(a | take) * P(talk | a)

Use N=1:
  P(Lets take a talk)
  = P(Lets) * P(take) * P(a) * P(talk)

Problems Solved???

To a larger extent
  It is less likely, though not impossible, for word sequences of length one or two not to appear in a corpus

Other problems?

Comparing Language Models

Is a language model with N=3 better than one with N=2?
If yes, how to compare?
Generate a sentence using the language model
  Generate each word from left to right
  At each point, we are in a different state
  Roll a die to determine which word to output

Example

To generate "Lets go outside and take a talk" with N=3:
  Current state: <null, null>. Roll a die that generates "Lets" with probability P(Lets | null, null)
  Current state: <null, Lets>. Roll a die that generates "go" with probability P(go | null, Lets)
  Current state: <Lets, go>. Roll a die that generates "outside" with probability P(outside | Lets, go)
  ...
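A sketch of this left-to-right sampler, reusing the tri counter from the earlier sketch (the max_len cutoff is an assumption to keep generation finite, since the slides' model has no end-of-sentence token):

```python
import random

def generate_sentence(tri, max_len=20):
    """Sample words left to right; the 'die' is weighted by trigram counts."""
    state = (None, None)
    words = []
    for _ in range(max_len):
        candidates = {t[2]: c for t, c in tri.items() if t[:2] == state}
        if not candidates:
            break
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        words.append(word)
        state = (state[1], word)
    return " ".join(words)
```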

Experimental Results

Corpus: complete works of Shakespeare
N=1: Will rash been and by I the me loves gentle me not slavish page, the and hour; ill let
N=2: What means, sir. I confess she? Then all sorts, he is trim, captain.
N=3: Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
N=4: They say all lovers swear more performance than they are wont to keep obliged faith unforfeited!

Solution 4: Smoothing

Goal: make sure no N-gram (i.e., word sequence of length N) has zero probability

Currently:
  P(take a talk) = (number of times "take a talk" is seen in corpus) / (number of 3-grams in corpus)

After smoothing:
  P(take a talk) = (number of times "take a talk" is seen in corpus + 1) / (number of 3-grams in corpus + number of distinct 3-grams)
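This is add-one (Laplace) smoothing over whole trigrams. A direct transcription of the slide's formula, again assuming the tri counter from the earlier sketch:

```python
def smoothed_trigram_probability(trigram, tri, num_distinct):
    """Add-one smoothing: every trigram, seen or not, gets a nonzero share.
    num_distinct is the number of distinct 3-grams (passed in as an assumed
    input, since unseen trigrams are not present in the counter)."""
    total = sum(tri.values())  # number of 3-grams in the corpus
    return (tri[trigram] + 1) / (total + num_distinct)
```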

Linguistics vs. Statistics

"It is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular."
--- The Mathematics of Statistical Machine Translation [Brown, Della Pietra, Della Pietra, and Mercer, 1993]

Summary

Different NLP tasks require the collection of different statistics from a text corpus. Just need to be creative.
Language learning is in general a difficult task due to ambiguities at different levels, and is complicated by the fact that there are almost always exceptions to a general rule.
It is relatively easy to start doing research in NLP: only basic knowledge of statistics and linguistics is needed.

Some Famous Imitation Games

1960s: ELIZA
1990s: ALICE
Loebner Prize: win $100,000 if you pass the Turing test

ELIZA: Impersonating a Psychotherapist

1960s: ELIZA (Joseph Weizenbaum)
http://www-ai.ijs.si/eliza/eliza.html

Eliza: Hello. I am ELIZA. How can I help you?
You: Well, I feel sad.
Eliza: Do you often feel sad?
You: not very often, but it's becoming more common
Eliza: Please go on.

ALICE

Human: hi my name is Carla
ALICE: How are you doing, Carla.
Human: what is your name?
ALICE: My name is ALICE.
Human: what is your full name?
ALICE: ALICE = Artificial Linguistic Internet Computer Entity.
Human: when were you born?
ALICE: I was connected to the net on November 23, 1995.
Human: Have you passed the Loebner Prize?
ALICE: The Loebner Prize is an annual Turing Test sponsored
by Hugh Loebner. Dr. Wallace and ALICE won it in 2000. The
ALICE program was ranked "most human computer" by the
panel of judges.
Human: who is Dr. Wallace?
ALICE: He is a famous computer scientist, the author of ALICE,
and a graduate of Carnegie Mellon.
Human: Where does Dr. Wallace work?
ALICE: he works from his home in San Francisco.

The Loebner Contest

A modern version of the Turing Test, held annually, with a $100,000 cash prize.
Restricted topic and limited time.
Participants include a set of humans, a set of computers, and a set of judges.
Scoring: rank from least human to most human.
If a computer ranks better than a human, it wins the $100,000. (Nobody has won yet.)

Morphological Segmentation

Segmentation of words into prefixes, suffixes, and roots.
  unfriendly = un + friend + ly
Some words do not have a prefix
  validate = valid + ate
Some words do not have a suffix
  devalue = de + value
How to automatically segment a word by computing statistics from a corpus?

Morphological Segmentation

Input: text corpus (word frequencies). Output: segmented words.

Word            Frequency    Segmentation
aback           157          aback
abacus          6            abacus
abacuses        3            abacus+es
abalone         77           abalone
abandon         2781         abandon
abandoned       4696         abandon+ed
abandoning      1082         abandon+ing
abandonment     378          abandon+ment
abandonments    23           abandon+ment+s
abandons        117          abandon+s
...

A Word Segmentation Algorithm

Basic idea:
1. Learn prefixes, suffixes, and roots from the corpus
2. Segment the words using the learned prefixes and suffixes

A Word Segmentation Algorithm

Let V be the vocabulary (i.e., the set of words in the corpus).
Let A and B be two character sequences.
Let AB be the concatenation of A and B.

Prefix and suffix learning algorithm:
  AB and A in V => B is a suffix
    ("singing" and "sing" => "ing" is a suffix)
  AB and B in V => A is a prefix
    ("preset" and "set" => "pre" is a prefix)
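A minimal sketch of this learning rule (the scoring and thresholding discussed on the next slide are deliberately omitted; names are illustrative):

```python
def learn_affixes(vocab):
    """Propose affixes by splitting each word at every position:
    if AB and A are in V, B is a candidate suffix;
    if AB and B are in V, A is a candidate prefix."""
    prefixes, suffixes = set(), set()
    for word in vocab:
        for i in range(1, len(word)):
            a, b = word[:i], word[i:]
            if a in vocab:
                suffixes.add(b)   # e.g., "singing" and "sing" -> "ing"
            if b in vocab:
                prefixes.add(a)   # e.g., "preset" and "set" -> "pre"
    return prefixes, suffixes
```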

A Word Segmentation Algorithm

Problem: the assumption does not always hold
  "diverge" and "diver" are in V => "ge" is a suffix. Wrong!
  Many of the learned prefixes and suffixes are erroneous
Solution: score each learned prefix and suffix and retain only those whose scores are above a pre-defined threshold
After learning, we can try to use the affixes to segment words, as sketched below.
Suppose we learn that "ate" is a suffix. Then:
  candidate = candid + ate
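One simple way to apply the retained affixes, as a sketch (greedy longest-match is my choice here; the slides do not commit to a particular segmentation strategy):

```python
def segment(word, prefixes, suffixes):
    """Strip at most one learned prefix and one learned suffix, longest first."""
    parts = [word]
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            parts = [p, word[len(p):]]
            break
    stem = parts[-1]
    for s in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(s) and len(stem) > len(s):
            parts[-1:] = [stem[:-len(s)], s]
            break
    return "+".join(parts)  # segment("candidate", set(), {"ate"}) -> "candid+ate"
```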

Determining the Most Frequent Part-of-Speech

Task: determine the most frequent POS of a word
  a: DET
  buy: VERB
  mother: NOUN
  beautiful: ADJECTIVE
  beautifully: ADVERB
  carry: VERB
Useful for part-of-speech tagging
Too time-consuming to do this by hand, so let's learn

Determining the Most Frequent Part-of-Speech

Approach:
  Group words that are likely to have the same POS together (to form, e.g., 100 groups)
  Hand-label each group with a POS tag
How to generate groups of words with similar POS?
Idea: use contextual information
  "The boy is going to the library."
  "The lady went to the market."
  In both, the word after "The" is a noun.
The left word and/or the right word are useful indicators.

Determining the Most Frequent Part-of-Speech

Create a profile for each word w in the vocabulary that tells us whether a word has ever appeared to the left/right of w.
Example profile for "boy":
  the-left: yes, a-left: yes, happy-left: no, cry-left: no
  the-right: no, a-right: no, happy-right: no, cry-right: no
Compare profiles
  Words with similar profiles tend to have the same POS?
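A sketch of profile construction and comparison (Jaccard overlap is my choice of similarity measure; the slides do not name one):

```python
def build_profile(word, sentences, indicators):
    """Record which indicator words ever appear immediately left/right of word.
    'sentences' is a list of token lists; 'indicators' the context vocabulary."""
    profile = set()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0 and tokens[i - 1] in indicators:
                profile.add((tokens[i - 1], "left"))
            if i + 1 < len(tokens) and tokens[i + 1] in indicators:
                profile.add((tokens[i + 1], "right"))
    return profile

def profile_similarity(p1, p2):
    """Jaccard overlap: 1.0 for identical profiles, 0.0 for disjoint ones."""
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 0.0
```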

Determining the Most Frequent Part-of-Speech

Profiles are too big
  Use only the most frequent N left-words and N right-words
  Determiners are more likely to remain in the profile than verbs, for instance

Identifying Semantically Similar Nouns

Task: identify/group nouns that are semantically similar
How do we know that "boy" is more similar to words like "girl", "man", "woman", and "individual" than to "car", "ship", "aeroplane", etc.?
Idea: use contextual information
  Similar words tend to occur in similar contexts
What kind of context is useful to capture?

Identifying Semantically Similar Nouns

For each noun, collect all verbs for which the noun can serve as the subject
  boy: speak, play, cry, laugh, jump, ...
  Capture context using the governing verbs
The profile for each noun consists of these verbs
Compare profiles
  Words with similar profiles tend to be semantically similar?
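A sketch of these verb profiles, assuming (subject, verb) pairs have already been extracted from the corpus (e.g., by a parser); profile_similarity from the earlier sketch can compare them:

```python
from collections import defaultdict

def verb_profiles(subject_verb_pairs):
    """Map each noun to the set of verbs it has been seen governing as subject."""
    profiles = defaultdict(set)
    for noun, verb in subject_verb_pairs:
        profiles[noun].add(verb)
    return profiles
```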

Pronoun Resolution

Task: find the noun phrase to which "it" refers
  "They know full well that companies held tax money aside for collection later on the basis that the government said it1 was going to collect it2."
Given a corpus, what kind of statistics can we collect that can help us resolve occurrences of "it" correctly?
  it1 is the subject of "collect"
  it2 is the object of "collect"

Pronoun Resolution

Using the corpus, compute the number of times each noun phrase in the paragraph serves as the subject of "collect"
  The ones that have high counts are likely to be the referent of it1
Similarly, compute the number of times each noun phrase in the paragraph serves as the object of "collect"
  The ones that have high counts are likely to be the referent of it2
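A sketch of this counting heuristic, assuming the corpus has been reduced to (noun phrase, verb, role) triples beforehand (an assumption; the slides do not specify the preprocessing):

```python
def rank_referents(candidates, triples, verb, role):
    """Order candidate noun phrases by how often each fills 'role'
    ('subject' or 'object') of 'verb' in the corpus triples."""
    counts = {np: 0 for np in candidates}
    for np, v, r in triples:
        if v == verb and r == role and np in counts:
            counts[np] += 1
    return sorted(candidates, key=lambda np: counts[np], reverse=True)
```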

Supervised Learning

Learning from an annotated text corpus
  E.g., a corpus annotated with part-of-speech tags
Human knowledge is encoded in the form of annotations
Machine learning algorithms can be used to learn from annotated corpora
Supervised methods typically outperform unsupervised methods

Learning for a Resource-Scarce Language

Project annotations from a resource-rich language to a resource-scarce language

[Diagram: NP brackets projected across an aligned sentence pair: "[NP That] perhaps was [NP [NP the happiest moment] of [NP his life]]."]

Learning

"I like candy"
"I candy like"

F-measure

MUC scoring program

F-measure := harmonic mean of recall and precision; want a high F-measure
  Recall: % of coref links correctly found by the system
    A measure of coverage; want high recall
  Precision: % of coref links found by the system that are correct
    A measure of accuracy; want high precision
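Written out (the standard definition, with P for precision and R for recall):

```latex
F = \frac{2PR}{P + R}
```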

NLP is Challenging

It is often said that NLP is AI-complete: all the difficult problems in artificial intelligence manifest themselves in NLP problems.

This idea dates back at least to the Turing Test:
"The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include" [Turing, Computing Machinery and Intelligence, 1950]

NLP is Cross-Disciplinary

Excellent opportunities for interdisciplinary work
  Linguistics: models of language; emphasizes 100% accuracy
  Psychology: models of cognitive processes; emphasizes biological/cognitive plausibility
  Mathematics and statistics: properties of models; emphasizes formal aspects

On the whole, NLP tends to be applications-oriented
  95% is OK
  Models need be neither biologically plausible nor mathematically satisfying

Statistical NLP
Statistical NLP: Infer language properties from text samples
Helps ease the knowledge acquisition bottleneck


The Turing Test

Three rooms contain a person, a computer, and an interrogator.
The interrogator can communicate with the other two by teleprinter.
The interrogator tries to determine which is the person and which is the machine.
The machine tries to fool the interrogator into believing that it is the person.
If the machine succeeds, then we conclude that the machine has exhibited intelligence.
Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes.
