Extracting Noun Phrases From Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation
1. Introduction
From the cognitive point of view, human beings must recognize, learn, and understand the entities or concepts (concrete or abstract) in texts for natural language comprehension. These entities or concepts are usually described by noun phrases. Evidence from children's language learning also supports this belief (Snow and Ferguson, 1977). Therefore, if we can grasp the noun phrases of a text, we will understand the text to some extent. This consideration is also captured by theories of discourse analysis, such as Discourse Representation Theory (Kamp, 1981).
Traditionally, identifying the noun phrases in a text means parsing the text and resolving the attachment relations among the constituents. However, parsing a text completely is very difficult, since various ambiguities cannot be resolved by syntactic or semantic information alone. Do we really need to fully parse the texts in every application? Some researchers apply shallow or partial parsers (Smadja, 1991; Hindle, 1990) to acquire specific patterns from texts. This suggests that complete parsing is not necessary for some applications.
This paper proposes a probabilistic partial parser and incorporates linguistic knowledge to extract noun phrases.
2. Previous Works
Church (1988) proposes a part-of-speech tagger and a simple noun phrase extractor. His noun phrase extractor brackets the noun phrases of input tagged texts according to two probability matrices: one for starting a noun phrase, the other for ending one. The methodology is a simple version of Garside and Leech's probabilistic parser (1985). Church lists a sample text in the appendix of his paper to show the performance of his work; only 5 out of 248 noun phrases are omitted. Because the tested text is too small to assess the results, an experiment on a large volume of texts is needed.
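The bracketing idea can be sketched as follows. This is a minimal illustration of our reading of the two-matrix approach, not Church's actual implementation; the probability values, the tag inventory, and the 0.5 threshold are invented for the example.

# A minimal sketch (our reading, not Church's implementation) of bracketing
# with two probability matrices. START and END give the probability that a
# noun phrase opens or closes between two adjacent POS tags; all numbers
# and the 0.5 threshold are invented for illustration.

START = {("VB", "AT"): 0.9, ("AT", "JJ"): 0.1, ("JJ", "NN"): 0.05}
END = {("NN", "VB"): 0.9, ("AT", "JJ"): 0.01, ("NN", "IN"): 0.8}

def bracket_nps(tags, threshold=0.5):
    """Insert '[' where an NP likely starts and ']' where it likely ends."""
    out = []
    for i, tag in enumerate(tags):
        prev = tags[i - 1] if i > 0 else "<s>"
        if START.get((prev, tag), 0.0) >= threshold:
            out.append("[")
        out.append(tag)
        nxt = tags[i + 1] if i + 1 < len(tags) else "</s>"
        if END.get((tag, nxt), 0.0) >= threshold:
            out.append("]")
    return " ".join(out)

print(bracket_nps(["VB", "AT", "JJ", "NN", "VB"]))  # VB [ AT JJ NN ] VB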
3. Language Model
Parsing can be viewed as optimization. Given an n-word sentence w1, w2, ..., wn (including punctuation marks), the parsing task is to find a parsing tree T such that P(T | w1, w2, ..., wn) is maximal. We define T here to be a sequence of chunks c1, c2, ..., cm, where each ci (0 < i ≤ m) contains one or more words wj (0 < j ≤ n). For example, the sentence "parsing can be viewed as optimization." consists of 7 words. One possible parsing result under this definition is:
(2) [parsing]c1 [can be viewed]c2 [as optimization]c3 [.]c4
Given the tag sequence $t_1^n$ produced by the tagger, the probability of a chunk sequence $C_i$ is estimated as

$$P(C_i \mid t_1^n) = P_i(c_1^{m_i} \mid t_1^n) \approx \prod_{k=1}^{m_i} P_i(c_k \mid c_{k-1}, t_1^n) \times P_i(c_k \mid t_1^n) \approx \prod_{k=1}^{m_i} P_i(c_k \mid c_{k-1}) \times P_i(c_k)$$

The best chunk sequence $C^*$ is therefore

$$C^* = \operatorname*{argmax}_{C_i} P(C_i \mid w_1^n) \approx \operatorname*{argmax}_{C_i} \prod_{k=1}^{m_i} P_i(c_k \mid c_{k-1}) \times P_i(c_k)$$
$$= \operatorname*{argmax}_{C_i} \sum_{k=1}^{m_i} [\log(P_i(c_k \mid c_{k-1})) + \log(P_i(c_k))] = \operatorname*{argmax}_{C_i} \sum_{k=1}^{m_i} [S(P_i(c_k \mid c_{k-1})) + S(P_i(c_k))]$$

where $S(\cdot)$ denotes the score of a probability, i.e., its logarithm.
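Since the logarithm is monotonic, maximizing the product of probabilities and maximizing the sum of their logarithms rank candidate chunk sequences identically. A quick sanity check in Python, with invented toy probabilities for three chunks:

import math

# Toy numbers (invented) for three chunks: contextual bigram probabilities
# P(c_k | c_{k-1}) and unigram probabilities P(c_k).
contextual = [0.4, 0.5, 0.3]
unigram = [0.2, 0.1, 0.3]

product = math.prod(p * q for p, q in zip(contextual, unigram))
log_sum = sum(math.log(p) + math.log(q) for p, q in zip(contextual, unigram))

# Both criteria rank candidate chunk sequences identically.
assert math.isclose(math.log(product), log_sum)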
3. j* = argmax_{0 ≤ j < i} (score[pre[j]] + dscore(c_j) + cscore(c_j | c_{j-1}));
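Step 3 suggests a Viterbi-style dynamic program over chunk boundaries. The following is a minimal sketch under that reading; dscore and cscore stand for the scores S(P_i(c_k)) and S(P_i(c_k | c_{k-1})), whose estimation from a training corpus is not shown, and the max_len bound on chunk length is our own simplification.

def best_chunks(words, dscore, cscore, max_len=4):
    """Return the highest-scoring segmentation of `words` into chunks."""
    n = len(words)
    score = [float("-inf")] * (n + 1)   # best score of each prefix
    back = [0] * (n + 1)                # start index of the last chunk
    score[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            chunk = tuple(words[j:i])
            prev = tuple(words[back[j]:j]) if j > 0 else ("<s>",)
            s = score[j] + dscore(chunk) + cscore(chunk, prev)
            if s > score[i]:
                score[i], back[i] = s, j
    chunks, i = [], n                   # follow backpointers to recover
    while i > 0:
        chunks.append(words[back[i]:i])
        i = back[i]
    return list(reversed(chunks))

As in the derivation above, the scores are log probabilities, so the sums here correspond to products of contextual and unigram chunk probabilities; like step 3 itself, the previous chunk is read off the best path through the backpointers, a first-order approximation.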
4. Linguistic Knowledge
In order to assign a head to each chunk, we first define
priorities of POSes. X'-theory (Sells, 1985) has defined
the X'-equivalences shown as Table 1.
Both the syntactic head and the semantic head are useful
in extracting noun phrases. For example, if the semantic
head of a chunk is the noun and the syntactic one is the
preposition, it would be a prepositional phrase.
Therefore, it can be connected to the previous noun
chunk to form a new noun phrase. In some case, we will
find some chunks contain only one word, called oneword chunks. They maybe contain a conjunction, e.g.,
that. Therefore, the syntactic head and the semantic
head of one-word chunks are the word itself.
Following these definitions, we extract the noun
phrases by procedure (14):
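Procedure (14) itself is not reproduced in this extraction; the following minimal sketch shows only the connection idea described above and in the concluding remarks. Each chunk is represented here as (words, syntactic_head_POS, semantic_head_POS), our own simplification.

# Adjacent noun chunks are joined, and a prepositional chunk whose semantic
# head is a noun extends the phrase only when its preposition is "of"
# (the rule stated in the concluding remarks).

def connect_noun_chunks(chunks):
    nps, current = [], []
    for words, syn, sem in chunks:
        if syn == "N":                  # noun chunk: start or extend an NP
            current += words
        elif syn == "P" and sem == "N" and current and words[0] == "of":
            current += words            # "of"-PP extends the noun phrase
        else:                           # any other chunk closes the NP
            if current:
                nps.append(current)
            current = []
    if current:
        nps.append(current)
    return nps

chunks = [(["the", "books"], "N", "N"), (["of", "my", "uncle"], "P", "N")]
print(connect_noun_chunks(chunks))  # [['the', 'books', 'of', 'my', 'uncle']]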
Table 1. X'-Equivalences

X      X'      X"
N      N'      NP
V      V'      VP
A      A'      AP
P      P'      PP
INFL   S (I')  S' (IP)
(Footnote: We do not consider INFL, since our model will not touch on this structure.)

[Figure: system flow. Raw texts pass through the TAGGER (trained on the LOB CORPUS) to produce tagged texts, through the CHUNKER to produce chunked texts, and through the NP-TRACTOR to produce an NP set; the SUSANNE CORPUS, via the TAG-MAPPER, supplies the reference NP set against which EVALUATION computes recall and precision.]
[Figure: the finite-state machine for extracting noun phrases. Numbered states (0, 2, 3, 4, 6, 7, ...) are connected by arcs labeled with POS tags and tag classes such as CD*, OD*, A*, J*, P*, P*$, NR*, N*, IN, and VBN or VBG.]
A sample of the SUSANNE Corpus annotation:

G01:0010a -        <minbrk>      -            [Oh.Oh]
G01:0010b - JJ     NORTHERN      northern     [O[S[Np:s.
G01:0010c - NN2    liberals      liberal      .Np:s]
G01:0010d - VBR    are           be           [Vab.Vab]
G01:0010e - AT     the           the          [Np:e.
G01:0010f - JB     chief         chief
G01:0010g - NN2    supporters    supporter
G01:0010h - IO     of            of           [Po.
G01:0010i - JJ     civil         civil        [Np.
G01:0010j - NN2    rights        right        .Np]
G01:0020a - CC     and           and          [Po+.
G01:0020b - IO     of            of
G01:0020c -        integration   integration  .Po+]Po]Np:e]S]
G01:0020d - YF     +.
Statistics of the test texts from the SUSANNE Corpus (categories A, G, J, N):

             A       G       J       N       Total
Files        16      16      16      16      64
Paragraphs   767     280     197     723     1967
Sentences    1445    1554    1353    2568    6920
Words        37180   37583   36554   38736   150053
Processing time in seconds per word (T/W), per chunk (T/C), and per sentence (T/S):

        A         G         J         N         Av.
T/W     0.00295   0.00283   0.00275   0.00309   0.00291
T/C     0.0071    0.0069    0.0073    0.0066    0.0070
T/S     0.0758    0.0685    0.0743    0.0467    0.0663
This criterion is based on the observation that each non-terminal node has a chance to dominate a chunk. Table 4 shows the experimental results of testing on the SUSANNE Corpus according to the specified criterion. As usual, the symbol C denotes chunks and S denotes sentences.
Table 4. Experimental results on the SUSANNE Corpus

                        OUTSIDE TEST        INSIDE TEST
Cat.                    C        S          C        S
A    # of correct       4866     380        10480    1022
     # of incorrect     40       14         84       29
     total #            4906     394        10564    1051
     correct rate       0.99     0.96       0.99     0.97
G    # of correct       4748     355        10293    1130
     # of incorrect     153      32         133      37
     total #            4901     387        10426    1167
     correct rate       0.97     0.92       0.99     0.97
J    # of correct       4335     283        9193     1032
     # of incorrect     170      15         88       23
     total #            4505     298        9281     1055
     correct rate       0.96     0.95       0.99     0.98
N    # of correct       5163     536        12717    1906
     # of incorrect     79       42         172      84
     total #            5242     578        12889    1990
     correct rate       0.98     0.93       0.99     0.96
Av.  # of correct       19112    1554       42683    5090
     # of incorrect     442      103        477      173
     total #            19554    1657       43160    5263
     correct rate       0.98     0.94       0.99     0.97
Processing time for each category:

              A         G         J         N         Total
Words         37180     37583     36554     38736     150053
Time (sec.)   112.32    108.80    103.04    122.72    446.88
Time/Word     0.00302   0.00289   0.00282   0.00317   0.00298
              NP       non-NP
NP            a        b
non-NP        c        -
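Under the usual reading of this table (a: noun phrases both extracted by the system and present in the reference corpus; b: extracted but not in the corpus; c: in the corpus but missed), the measures reported below are

$$\text{Precision} = \frac{a}{a+b}, \qquad \text{Recall} = \frac{a}{a+c}$$

which matches Precision = CNP/ENP and Recall = CNP/ANP in the tables below.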
[Figure: correct rates (from 0.88 to 1.00) for chunks and sentences under the outside and inside tests, across categories A, G, J, N, and the average.]
Numbers of noun phrases in the test corpus (MNP: maximal noun phrases; mNP: minimal noun phrases; MmNP: noun phrases that are both maximal and minimal):

           A       G       J       N       Total
NP         10063   9221    8696    9851    37831
MNP        5614    5451    4568    7895    23528
mNP        6503    6143    5200    7908    25754
MmNP       3207    3226    2241    5993    14667
NP/MNP     1.79    1.69    1.90    1.25    1.61
mNP/MNP    1.16    1.13    1.14    1.00    1.09
Precision of noun phrase extraction (ENP: extracted noun phrases; CNP: correctly extracted noun phrases; CMNP, CmNP, CMmNP: correctly extracted maximal, minimal, and maximal-and-minimal noun phrases):

           A       G       J       N       Total
ENP        8011    7431    6457    8861    30760
CNP        7660    6943    5958    8559    29120
CMNP       3709    3626    2701    6319    16355
CmNP       4348    4366    3134    6637    18485
CMmNP      3047    3028    2005    5808    13888
Precision  0.96    0.93    0.92    0.97    0.95
Recall of noun phrase extraction (ANP: noun phrases in the answer set; CNP: correctly extracted noun phrases):

        A       G       J       N       Av.
ANP     7873    7199    6278    8793    30143
CNP     7660    6943    5958    8559    29120
Recall  0.97    0.96    0.95    0.97    0.96
7. Applications
Identification of noun phrases in texts is useful for many applications. Anaphora resolution (Hirst, 1981) resolves the relationships among noun phrases, namely, what the antecedent of a noun phrase is. The extracted noun phrases can form the set of possible candidates (or the universe, in the terminology of Discourse Representation Theory). For the acquisition of verb subcategorization frames, bracketing the noun phrases in texts is indispensable: it helps us find the boundaries of the subject, the object, and the prepositional phrase. We could also use the acquired noun phrases for adjective grouping. The extracted noun phrases may contain adjectives which pre-modify the head noun; we then utilize the similarity of head nouns to group the adjectives, as sketched below. In addition, we may give the head noun a semantic tag, such as Roget's Thesaurus provides, and then analyze the adjectives. Automatically producing the index of a book could likewise draw on the extracted noun phrases as candidate index terms.
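A rough sketch of the adjective-grouping step follows: collect the adjectives that pre-modify each head noun, then treat adjectives as candidates for the same group when they share head nouns. The pair representation and the example data are illustrative only, not the paper's format.

from collections import defaultdict

def group_adjectives(noun_phrases):
    """noun_phrases: iterable of (adjectives, head_noun) pairs."""
    by_head = defaultdict(set)
    for adjectives, head in noun_phrases:
        by_head[head].update(adjectives)
    return dict(by_head)

nps = [(["old", "French"], "Quarter"), (["higher"], "education"),
       (["old"], "professor")]
print(group_adjectives(nps))
# e.g. {'Quarter': {'old', 'French'}, 'education': {'higher'}, 'professor': {'old'}}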
8. Concluding Remarks
The difficulty of this work is how to extract the real maximal noun phrases. If we cannot decide whether the prepositional phrase "over a husband eyes" is licensed by the verb "pull", we will not know whether "the wool" and "a husband eyes" are two noun phrases or form one noun phrase combined by the preposition "over".
(18) to pull the wool over a husband eyes
to sell the books of my uncle
In contrast, the noun phrase "the books of my uncle" is a so-called maximal noun phrase in the current context. As a result, we conclude that if we do not resolve the PP-attachment problem (Hindle and Rooth, 1993) to the expected extent, we will not extract the true maximal noun phrases. In our work, the probabilistic chunker decides the implicit boundaries between words, and the NP-TRACTOR connects the adjacent noun chunks. When a noun chunk is followed by a preposition chunk, we do not connect the two chunks unless the preposition chunk is headed by the preposition "of".
Compared with other works, our results are evaluated automatically against a parsed corpus and show high precision. Although we do not report the exact recall, we provide estimated values, and the testing scale is large enough (about 150,000 words). In contrast, Church (1988) tests a single text and extracts only simple noun phrases. Bourigault's work (1992) is evaluated manually and does not report precision; hence, its real performance is not known. The work of Voutilainen (1993) is more complex than ours: the input text is first analyzed morphologically, then parsed with a constraint grammar, analyzed by two different noun phrase grammars, and finally the noun phrases are extracted according to their occurrences. Like the other works, Voutilainen's is also evaluated manually.
In this paper, we propose a language model to chunk texts. The simple but effective chunker could be seen as a linear structure parser, and could be applied to many applications.
Acknowledgements
We are grateful to Dr. Geoffrey Sampson for kindly providing the SUSANNE Corpus and the details of its tag set.
References
Abney, Steven (1991), "Parsing by Chunks," in Principle-Based Parsing, Berwick, Abney and Tenny (Eds.), Kluwer Academic Publishers, pp. 257-278.
Bourigault, Didier (1992), "Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases," Proceedings of the 15th International Conference on Computational Linguistics, COLING-92, Vol. III, Nantes, France, pp. 977-981.
Church, Kenneth (1988), "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text," Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143.
Francis, N. and Kucera, H. (1979), Manual of Information to Accompany a Standard Sample of Present-day Edited American English, for Use with Digital Computers, Department of Linguistics, Brown University, Providence, R.I., U.S.A., original ed. 1964, revised 1971, revised and augmented 1979.
Garside, Roger and Leech, Geoffrey (1985), "A Probabilistic Parser," Proceedings of the Second Conference of the European Chapter of the ACL, pp. 166-170.
Hindle, Donald (1990), "Noun Classification from Predicate-Argument Structures," Proceedings of the 28th Annual Meeting of the ACL, pp. 268-275.
Hindle, Donald and Rooth, Mats (1993), "Structural Ambiguity and Lexical Relations," Computational Linguistics, 19(1), pp. 103-120.
Hirst, G. (1981), Anaphora in Natural Language Understanding: A Survey, Lecture Notes 119, Springer-Verlag.
Johansson, Stig (1986), The Tagged LOB Corpus: Users' Manual, Bergen: Norwegian Computing Centre for the Humanities.
Appendix
For demonstration, we list a sample text quoted from N18:0010a-N18:0250e of the SUSANNE Corpus. The extracted noun phrases are bracketed. Taking this text as a reference, we can compute the precision and the recall and compare them with the experimental results itemized in Section 6. In fact, the result shows that the system achieves high precision and recall on this text.
[ Too_QL many_AP people_NNS ] think_VB that_CS [ the_ATI
primary_JJ purpose_NN of_IN a_AT higher_JJR education_NN ] is_BEZ
to_TO help_VB [ you_PP2 ] make_VB [ a_AT living_NN ] +;_; this_DT
is_BEZ not_XNOT so_RB +,_, for_CS [ education_NN ] offers_VBZ
[ all_ABN kinds_NNS of_IN dividends_NNS ] +,_, including_IN
how_WRB to_TO pull_VB [ the_ATI wool_NN ] over_IN [ a_AT
husband_NN eyes_NNS ] while_CS [ you_PP2 ] are_BER having_HVG
[ an_AT affair_NN ] with_IN [ his_PP$ wife_NN ] ._. If_CS [ it_PP3 ]
were_BED not_XNOT for_IN [ an_AT old_JJ professor_NPT ]
who_WPR made_VBD [ me_PP1O ] read_VB [ the_ATI classics_NN ]
[ I_PP1A ] would_MD have_HV been_BEN stymied_VBN on_IN
what_WDT to_TO do_DO +,_, and_CC now_RN [ I_PP1A ]
understand_VB why_WRB [ they_PP3AS ] are_BER [ classics_NN ] +;_;
those_DTS who_WPR wrote_VBD [ them_PP3OS ] knew_VBD
[ people_NNS ] and_CC what_WDT made_VBD [ people_NNS ]
tick_VB ._. [ I_PP1A ] worked_VBD for_IN [ my_PP$ Uncle_NPT ]
(_( [ +an_AT Uncle_NPT by_IN marriage_NN ] so_RB [ you_PP2 ]
will_MD not_XNOT think_VB this_DT has_HVZ [ a_AT mild_JJ
undercurrent_NN of_IN incest_NN ] +)_) who_WPR ran_VBD
[ one_CD1 of_IN those_DTS antique_JJ shops_NNS ] in_IN [ New_JJ
Orleans_NP ] Vieux_&FW Carre_&FW +,_, [ the_ATI old_JJ French_JJ
Quarter_NPL ] ._. [ The_ATI arrangement_NN ] [ I_PP1A ] had_HVD
with_IN [ him_PP3O ] was_BEDZ to_TO work_VB [ four_CD
hours_NRS ] [ a_AT day_NR ] ._. [ The_ATI rest_NN of_IN the_ATI
time_NR ] [ I_PP1A ] devoted_VBD to_IN painting_VBG or_CC to_IN
those_DTS [ other_JJB activities_NNS ] [ a_AT young_JJ and_CC
healthy_JJ man_NN ] just_RB out_IN of_IN [ college_NN ] finds_VBZ