Statistical NLP
Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
September 1999
Honors AI
First offering in Spring 2010
MW
7:00-8:15pm
search technologies
Undergraduate AI Courses
COGS/CS 4314: Intelligent Systems Analysis
COGS/CS 4315: Intelligent Systems Design
CS 4365: Artificial Intelligence
CS 4365 (Honors): Artificial Intelligence
Undergraduate AI Courses
COGS/CS 4314: Intelligent Systems Analysis
Not
only
only
offered every year, may be offered Fall or Spring
Graduate AI Courses
CS 6320: Natural Language Processing
Dr. Vincent Ng
Natural
language processing
vision
Ultimate goal
To
Immediate goal
To
Understanding
computer
language
Generation
Why NLP?
Lots of information is in natural language format
Documents
News broadcasts
User utterances
what I mean!
NLP is Useful
Application: Text Summarization
E-mail, letters,
editorials, technical
reports, newswires
multi-document
summary
NLP is Useful
Application: Information Retrieval
information need
Topic: Advantages of using potassium hydroxide in any aspect of organic farming, especially
text collection
relevant documents (ranked)
doc 1: score
doc 2: score
doc 3: score
doc n: score
NLP is Useful
Application: Question Answering
Retrieve not just relevant documents, but return the answer
Answer
Query
Which country has the
largest part of the
Amazon forest?
text collection
Brazil
NLP is Useful
Application: Information Extraction
AFGHANISTAN MAY BE
PREPARING FOR ANOTHER
TEST
Disaster Type:
  location:
  date:
  magnitude:
  magnitude-confidence:
  damage:
    human-effect:
      victim:
      number:
      outcome:
    physical-effect:
      object:
      outcome:
NLP is Useful
Application: Machine Translation
?
Japanese-to-English
Translator
NLP is Interdisciplinary
Linguistics:
models of language
Psychology:
Mathematics
vs.
NLP
Computational study of language use
Definite engineering aspect in addition to a scientific one
Turing test
Union-ized?
Un-ionized in chemistry?
sentence structure
Different syntactic structure implies different interpretation
about language
Knowledge about the world
An Idea
Have computers learn models of language
Statistical NLP:
learns statistical models that capture language properties from a corpus (text samples)
helps ease the knowledge acquisition bottleneck
Why
will be zero!
Problems
We may not be able to find these sentences even in a very
large corpus
Even if we do, each of them may appear only once or twice
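The sparsity problem can be seen in a minimal sketch. It assumes a tiny made-up corpus and a hypothetical helper, `sentence_relative_frequency`, that estimates a sentence's probability by counting whole sentences; neither comes from the slides.

```python
from collections import Counter

# Hypothetical toy corpus: each item is one sentence.
corpus = [
    "let's take a walk",
    "let's go outside",
    "take a walk outside",
]

def sentence_relative_frequency(sentence, sentences):
    """Estimate P(sentence) as count(sentence) / number of sentences."""
    counts = Counter(sentences)
    return counts[sentence] / len(sentences)

# A sentence that happens to be in the corpus gets a nonzero estimate.
seen = sentence_relative_frequency("let's take a walk", corpus)

# A perfectly reasonable sentence that was never observed gets zero.
unseen = sentence_relative_frequency("let's go outside and take a walk", corpus)
```

Counting whole sentences gives a zero estimate to almost every new sentence, however large the corpus.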
N=3.
N=3. Compute
of sentence
An Example
P(Let's take a talk)
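One way to compute such a probability with N=3 is sketched below. It assumes unsmoothed relative-frequency estimates, a made-up three-sentence corpus, `<null>` padding for the first two positions, and an end marker `</s>`; the function names are hypothetical and the slides' own computation may differ.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word histories, padding with <null>."""
    tri, hist = Counter(), Counter()
    for s in sentences:
        words = ["<null>", "<null>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            h = (words[i - 2], words[i - 1])
            tri[h + (words[i],)] += 1
            hist[h] += 1
    return tri, hist

def sentence_probability(sentence, tri, hist):
    """Multiply unsmoothed estimates of P(word | two previous words)."""
    words = ["<null>", "<null>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        h = (words[i - 2], words[i - 1])
        if hist[h] == 0:
            return 0.0  # history never seen: no estimate available
        p *= tri[h + (words[i],)] / hist[h]
    return p

# Hypothetical toy corpus, chosen so that "take a talk" never occurs.
corpus = ["let's take a walk", "let's take a break", "give a talk"]
tri, hist = train_trigrams(corpus)
p_walk = sentence_probability("let's take a walk", tri, hist)
p_talk = sentence_probability("let's take a talk", tri, hist)
```

With this corpus `p_walk` is 1/3, while `p_talk` is zero because the trigram "take a talk" was never observed.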
Problems Solved???
To some extent
More
Problems Solved???
To a larger extent
It is less likely, though not impossible, to see word
Example
To generate Let's go outside and take a talk with N=3:
Current state: <null, null>. Throw a die that generates
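The generation procedure can be sketched as repeated weighted sampling. The code assumes the state is the previous two words, starts from `(<null>, <null>)`, and stops at an end marker `</s>`; the end marker, the function names, and the one-sentence corpus are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def train(sentences):
    """Map each two-word state to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for s in sentences:
        words = ["<null>", "<null>"] + s.split() + ["</s>"]
        for i in range(2, len(words)):
            followers[(words[i - 2], words[i - 1])][words[i]] += 1
    return followers

def generate(followers, rng, max_words=20):
    """Start in state (<null>, <null>) and sample one word at a time."""
    state = ("<null>", "<null>")
    output = []
    while len(output) < max_words:
        options = followers[state]
        # The "die": each candidate word is weighted by its trigram count.
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "</s>":
            break
        output.append(word)
        state = (state[1], word)  # slide the two-word window forward
    return " ".join(output)

# With a single training sentence every state has exactly one continuation,
# so generation reproduces that sentence.
followers = train(["let's go outside and take a walk"])
sentence = generate(followers, random.Random(0))
```

With a larger corpus each state has several possible continuations, and different random draws produce different sentences.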
Experimental Results
Corpus: Complete works of Shakespeare
N=1: Will rash been and by I the me loves gentle me not
trim, captain.
N=3: Fly, and will rid me these news of price. Therefore
Solution 4: Smoothing
Goal: make sure no N-gram (i.e., word sequence of length
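The slides do not say which smoothing method is used. As one simple illustration, the sketch below uses add-one (Laplace) smoothing on trigram counts; the corpus, the helper name `add_one_probability`, and the choice of add-one itself are assumptions.

```python
from collections import Counter

def add_one_probability(ngram_count, history_count, vocab_size):
    """Add-one estimate: (count(history, w) + 1) / (count(history) + V)."""
    return (ngram_count + 1) / (history_count + vocab_size)

# Hypothetical toy corpus.
sentences = ["let's take a walk", "give a talk"]
trigrams, histories, vocab = Counter(), Counter(), set()
for s in sentences:
    tokens = ["<null>", "<null>"] + s.split()
    vocab.update(s.split())
    for t in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[t] += 1
        histories[t[:2]] += 1

V = len(vocab)
history = ("take", "a")
# Seen trigram: "take a walk".
p_seen = add_one_probability(trigrams[history + ("walk",)], histories[history], V)
# Unseen trigram: "take a talk" now gets a small nonzero probability.
p_unseen = add_one_probability(trigrams[history + ("talk",)], histories[history], V)
# The smoothed estimates for one history still sum to 1 over the vocabulary.
total = sum(
    add_one_probability(trigrams[history + (w,)], histories[history], V)
    for w in vocab
)
```

Add-one is the easiest smoothing method to state, but it moves a lot of probability mass to unseen N-grams when the vocabulary is large.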
Summary
Different NLP tasks require the collection of different
1990s
ELIZA
ALICE
Loebner prize
win $100,000 if you pass the Turing test
ALICE
Human: hi my name is Carla
Rank
Morphological Segmentation
Segmentation of words into prefixes, suffixes and roots.
unfriendly
= un + friend + ly
validate
= valid + ate
devalue
= de + value
Morphological Segmentation
Input: Text corpus
Word          Frequency
aback         157
abacus        6
abacuses      3
abalone       77
abandon       2781
abandoned     4696
abandoning    1082
abandonment   378
abandonments  23
abandons      117
....          ....

Word          Segmentation
aback         aback
abacus        abacus
abacuses      abacus+es
abalone       abalone
abandon       abandon
abandoned     abandon+ed
abandoning    abandon+ing
abandonment   abandon+ment
abandonments  abandon+ment+s
abandons      abandon+s
....          ....
AB and A in V => B is a suffix
singing and sing => ing is a suffix
AB and B in V => A is a prefix
preset and set => pre is a prefix
AB
Solution: score each learned prefix and suffix and retain only
those whose scores are above a pre-defined threshold
After learning, we can try to use them to segment words.
Suppose we learn that ate is a suffix. Then:
candidate = candid + ate
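The affix-learning rules and the thresholding step can be sketched as follows. The sketch assumes V is the set of words in the corpus and scores an affix by the number of words that support it; the slides do not define the score, so that choice, the toy vocabulary, and all function names are hypothetical.

```python
from collections import Counter

def learn_affixes(vocab):
    """Apply both rules to every split AB of every word in the vocabulary."""
    prefixes, suffixes = Counter(), Counter()
    for word in vocab:
        for i in range(1, len(word)):
            a, b = word[:i], word[i:]
            if a in vocab:   # AB and A in V => B is a suffix
                suffixes[b] += 1
            if b in vocab:   # AB and B in V => A is a prefix
                prefixes[a] += 1
    return prefixes, suffixes

def keep_above(scores, threshold):
    """Retain only affixes whose score is above the threshold."""
    return {affix for affix, score in scores.items() if score > threshold}

def split_suffix(word, suffixes):
    """Mechanically split off the longest learned suffix, if any."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + " + " + suffix
    return word

# Hypothetical toy vocabulary.
vocab = {"sing", "singing", "walk", "walking", "talk", "talking",
         "set", "preset", "view", "preview"}
prefixes, suffixes = learn_affixes(vocab)
good_prefixes = keep_above(prefixes, 1)
good_suffixes = keep_above(suffixes, 1)

jumping = split_suffix("jumping", good_suffixes)
candidate = split_suffix("candidate", {"ate"})  # the slide's example
```

`split_suffix` applies a learned suffix to any word that ends in it, which is how a learned suffix such as ate yields candidate = candid + ate.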
DET
buy: VERB
mother: NOUN
beautiful: ADJECTIVE
beautifully: ADVERB
carry: VERB
Useful for part-of-speech tagging
Too time-consuming to do this by hand, so let's learn
boy
The
The left word and/or the right word are useful indicators.
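As a rough sketch of how left and right words could be collected as indicators, the code below counts each word's immediate neighbors. The example sentences and the function name `context_counts` are made up, and the slides may use these counts differently.

```python
from collections import Counter, defaultdict

def context_counts(sentences):
    """For each word, count the words seen immediately to its left and right."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            if i > 0:
                left[w][words[i - 1]] += 1
            if i + 1 < len(words):
                right[w][words[i + 1]] += 1
    return left, right

# Hypothetical toy sentences.
left, right = context_counts(["The boy ran", "The girl ran", "A boy slept"])

# "boy" and "girl" share the left neighbor "the" and the right neighbor "ran",
# which hints that they belong to the same part of speech.
shared_left = set(left["boy"]) & set(left["girl"])
shared_right = set(right["boy"]) & set(right["girl"])
```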
serve as subjects
boy:
Pronoun Resolution
Task: find the noun phrase to which it refers
They know full well that companies held tax money aside for
collection later on the basis that the government said it1 was
going to collect it2.
Pronoun Resolution
Using the corpus, compute the number of times each noun
of it1
of it2
Supervised Learning
Learning from an annotated text corpus
Corpus
resource-scarce language
NP
NP
[That] perhaps
[
NP
was
NP
]
NP
]
NP
Learning
I like candy
I candy like
F-measure
MUC scoring program
F-measure :=
Want high F-measure
Recall: measure of coverage
Want high recall
Precision: measure of accuracy
Want high precision
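The definition of the F-measure is missing from the slide text. A common choice is the balanced F-measure, the harmonic mean of precision and recall, sketched below over plain sets of items. This is an assumption: the MUC scoring program computes precision and recall in its own way, and the function name and example data are hypothetical.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, balanced F-measure) for two collections."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Precision: what fraction of the system's output is correct (accuracy).
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold answers the system found (coverage).
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical example: the system proposes four items, two of which are
# among the three gold items.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here precision is 1/2, recall is 2/3, and the balanced F-measure is 4/7.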
NLP is Challenging
It is often said that NLP is AI-complete:
All the difficult problems in artificial intelligence manifest
themselves in NLP problems.
NLP is Cross-Disciplinary
Excellent opportunities for interdisciplinary work
Linguistics:
models of language
Psychology:
Mathematics
is OK
Models need be neither biologically plausible nor
mathematically satisfying
Statistical NLP
Statistical NLP: Infer language properties from text samples
Helps ease the knowledge acquisition bottleneck
interrogator.
The interrogator can communicate with the other two by
teleprinter.
The interrogator tries to determine which is the person and
which is the machine.
The machine tries to fool the interrogator into believing that
it is the person.
If the machine succeeds, then we conclude that the
machine has exhibited intelligence.
Turing predicted that by 2000, a machine might have a
30% chance of fooling a lay person for 5 minutes.