J01-2004
Probabilistic Top-Down Parsing and Language Modeling
Brian Roark*
Brown University

* Department of Cognitive and Linguistic Sciences, Box 1978, Brown University, Providence, RI 02912
This paper describes the functioning of a broad-coverage probabilistic top-down parser, and its
application to the problem of language modeling for speech recognition. The paper first introduces
key notions in language modeling and probabilistic parsing, and briefly reviews some previous
approaches to using syntactic structure for language modeling. A lexicalized probabilistic top-
down parser is then presented, which performs very well, in terms of both the accuracy of returned
parses and the efficiency with which they are found, relative to the best broad-coverage statistical
parsers. A new language model that utilizes probabilistic top-down parsing is then outlined, and
empirical results show that it improves upon previous work in test corpus perplexity. Interpolation
with a trigram model yields an exceptional improvement relative to the improvement observed
by other models, demonstrating the degree to which the information captured by our parsing
model is orthogonal to that captured by a trigram model. A small recognition experiment also
demonstrates the utility of the model.
1. Introduction
With certain exceptions, computational linguists have in the past generally formed
a separate research community from speech recognition researchers, despite some
obvious overlap of interest. Perhaps one reason for this is that, until relatively re-
cently, few methods have come out of the natural language processing community
that were shown to improve upon the very simple language models still standardly
in use in speech recognition systems. In the past few years, however, some improve-
ments have been made over these language models through the use of statistical meth-
ods of natural language processing, and the development of innovative, linguistically
well-motivated techniques for improving language models for speech recognition is
generating more interest among computational linguists. While language models built
around shallow local dependencies are still the standard in state-of-the-art speech
recognition systems, there is reason to hope that better language models can and will
be developed by computational linguists for this task.
This paper will examine language modeling for speech recognition from a nat-
ural language processing point of view. Some of the recent literature investigating
approaches that use syntactic structure in an attempt to capture long-distance depen-
dencies for language modeling will be reviewed. A new language model, based on
probabilistic top-down parsing, will be outlined and compared with the previous liter-
ature, and extensive empirical results will be presented which demonstrate its utility.
Two features of our top-down parsing approach will emerge as key to its success.
First, the top-down parsing algorithm builds a set of rooted candidate parse trees from
left to right over the string, which allows it to calculate a generative probability for
each prefix string from the probabilistic grammar, and hence a conditional probability
for each word given the previous words and the probabilistic grammar. A left-to-
right parser whose derivations are not rooted, i.e., with derivations that can consist of
disconnected tree fragments, such as an LR or shift-reduce parser, cannot incrementally
calculate the probability of each prefix string being generated by the probabilistic
grammar, because their derivations include probability mass from unrooted structures.
Only at the point when their derivations become rooted (at the end of the string) can
generative string probabilities be calculated from the grammar. These parsers can
calculate word probabilities based upon the parser state--as in Chelba and Jelinek
(1998a)--but such a distribution is not generative from the probabilistic grammar.
A parser that is not left to right, but which has rooted derivations, e.g., a head-
first parser, will be able to calculate generative joint probabilities for entire strings;
however, it will not be able to calculate probabilities for each word conditioned on
previously generated words, unless each derivation generates the words in the string
in exactly the same order. For example, suppose that there are two possible verbs that
could be the head of a sentence. For a head-first parser, some derivations will have the
first verb as the head of the sentence, and the second verb will be generated after the
first; hence the second verb's probability will be conditioned on the first verb. Other
derivations will have the second verb as the head of the sentence, and the first verb's
probability will be conditioned on the second verb. In such a scenario, there is no
way to decompose the joint probability calculated from the set of derivations into the
product of conditional probabilities using the chain rule. Of course, the joint probability
can be used as a language model, but it cannot be interpolated on a word-by-word
basis with, say, a trigram model, which we will demonstrate is a useful thing to do.
Thus, our top-down parser allows for the incremental calculation of generative
conditional word probabilities, a property it shares with other left-to-right parsers
with rooted derivations such as Earley parsers (Earley 1970) or left-corner parsers
(Rosenkrantz and Lewis II 1970).
A second key feature of our approach is that top-down guidance improves the
efficiency of the search as more and more conditioning events are extracted from the
derivation for use in the probabilistic model. Because the rooted partial derivation is
fully connected, all of the conditioning information that might be extracted from the
top-down left context has already been specified, and a conditional probability model
built on this information will not impose any additional burden on the search. In
contrast, an Earley or left-corner parser will underspecify certain connections between
constituents in the left context, and if some of the underspecified information is used
in the conditional probability model, it will have to become specified. Of course, this
can be done, but at the expense of search efficiency; the more that this is done, the
less benefit there is from the underspecification. A top-down parser will, in contrast,
derive an efficiency benefit from precisely the information that is underspecified in
these other approaches.
Thus, our top-down parser makes it very easy to condition the probabilistic gram-
mar on an arbitrary number of values extracted from the rooted, fully specified deriva-
tion. This has led us to a formulation of the conditional probability model in terms
of values returned from tree-walking functions that themselves are contextually sen-
sitive. The top-down guidance that is provided makes this approach quite efficient in
practice.
The following section will provide some background in probabilistic context-free
grammars and language modeling for speech recognition. There will also be a brief
review of previous work using syntactic information for language modeling, before
we introduce our model in Section 4.
Figure 1
Three parse trees: (a) a complete parse tree; (b) a complete parse tree with an explicit stop
symbol; and (c) a partial parse tree.
2. Background
1 For a detailed introduction to PCFGs, see Manning and Schütze (1999), for example.
2 For ease of exposition, we will ignore epsilon productions for now. An epsilon production has the
empty string (ε) on the right-hand side, and can be written A → ε. Everything that is said here can be
straightforwardly extended to include such productions.
For a CFG G and a string of terminals w_0^n, let T_{w_0^n} be the set of all complete trees rooted at
the start symbol, with the string of terminals w_0^n as leaves. We call T_{w_0^n} the set of complete
parses of w_0^n.
A PCFG is a CFG with a probability assigned to each rule; specifically, each right-
hand side has a probability given the left-hand side of the rule. The probability of a
parse tree is the product of the probabilities of each rule in the tree. Provided a PCFG
is consistent (or tight), which it always will be in the approach we will be advocating,
this defines a proper probability distribution over completed trees. 3
A PCFG also defines a probability distribution over strings of words (terminals)
in the following way:
P(w_0^n) = \sum_{t \in T_{w_0^n}} P(t)    (1)
The intuition behind Equation 1 is that, if a string is generated by the PCFG, then it
will be produced if and only if one of the trees in the set T_{w_0^n} generated it. Thus the
probability of the string is the probability of the set T_{w_0^n}, i.e., the sum of its members'
probabilities.
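As a concrete illustration of Equation 1, the following is a minimal sketch (not the parser described in this paper; the toy grammar, tree encoding, and function names are assumptions made for exposition) that computes a tree's probability as the product of its rule probabilities, and a string's probability as the sum over its complete parses.

```python
# Toy PCFG: left-hand side -> list of (right-hand side, probability) pairs.
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Spot",), 0.5), (("DT", "NN"), 0.5)],
    "VP":  [(("VBD", "NP"), 1.0)],
    "DT":  [(("the",), 1.0)],
    "NN":  [(("ball",), 1.0)],
    "VBD": [(("chased",), 1.0)],
}

def tree_probability(tree):
    """Product of the rule probabilities used in the tree.
    A tree is (label, [children]) for nonterminals, or a plain string for words."""
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_probs = dict(PCFG[label])          # rhs tuple -> probability
    p = rule_probs[rhs]
    for child in children:
        p *= tree_probability(child)
    return p

def string_probability(parses):
    """Equation 1: the probability of a string is the sum of the probabilities
    of all of its complete parses."""
    return sum(tree_probability(t) for t in parses)

# "Spot chased the ball" under this toy grammar (single parse): 0.25
tree = ("S", [("NP", ["Spot"]),
              ("VP", [("VBD", ["chased"]),
                      ("NP", [("DT", ["the"]), ("NN", ["ball"])])])])
print(tree_probability(tree))
```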
Up to this point, we have been discussing strings of words without specifying
whether they are "complete" strings or not. We will adopt the convention that an
explicit beginning-of-string symbol, ⟨s⟩, and an explicit end symbol, ⟨/s⟩, are part of
the vocabulary, and a string w_0^n is a complete string if and only if w_0 is ⟨s⟩ and w_n
is ⟨/s⟩. Since the beginning-of-string symbol is not predicted by language models,
but rather is axiomatic in the same way that S† is for a parser, we can safely omit it
from the current discussion, and simply assume that it is there. See Figure 1(b) for the
explicit representation.
While a complete string of words must contain the end symbol as its final word,
a string prefix does not have this restriction. For example, "Spot chased the ball
⟨/s⟩" is a complete string, and the following is the set of prefix strings of this com-
plete string: "Spot"; "Spot chased"; "Spot chased the"; "Spot chased the ball";
and "Spot chased the ball ⟨/s⟩". A PCFG also defines a probability distribution
over string prefixes, and we will present this in terms of partial derivations. A partial
derivation (or parse) d is defined with respect to a prefix string w_0^j as follows: it is the
leftmost derivation of the string, with w_j on the right-hand side of the last expansion
in the derivation.4 Let D_{w_0^j} be the set of all partial derivations for a prefix string w_0^j.
Then

P(w_0^j) = \sum_{d \in D_{w_0^j}} P(d)    (2)
We left-factor the PCFG, so that all productions are binary, except those with a
single terminal on the right-hand side and epsilon productions.5 We do this because
it delays predictions about what nonterminals we expect later in the string until we
have seen more of the string. In effect, this is an underspecification of some of the
predictions that our top-down parser is making about the rest of the string. The left-
factorization transform that we use is identical to what is called right binarization
in Roark and Johnson (1999). See that paper for more discussion of the benefits of this
transform.
3 A PCFG is consistent or tight if there is no probability mass reserved for infinite trees. Chi and Geman
(1998) proved that any PCFG estimated from a treebank with the relative frequency estimator is tight.
All of the PCFGs that are used in this paper are estimated using the relative frequency estimator.
4 A leftmost derivation is a derivation in which the leftmost nonterminal is always expanded.
5 The only ε-productions that we will use in this paper are those introduced by left-factorization.
Figure 2
Two parse trees: (a) a complete left-factored parse tree with epsilon productions and an
explicit stop symbol; and (b) a partial left-factored parse tree.
We can see the effect of this transform on our example parse trees in Figure 2. This
underspecification of the nonterminal predictions (e.g., VP-VBD in the example in
Figure 2, as opposed to NP), allows lexical items to become part of the left context,
and so be used to condition production probabilities, even the production probabil-
ities of constituents that dominate them in the unfactored tree. It also brings words
further downstream into the look-ahead at the point of specification. Note that partial
trees are defined in exactly the same way (Figure 2b), but that the nonterminal yields
are made up exclusively of the composite nonterminals introduced by the grammar
transform.
This transform has a couple of very nice properties. First, it is easily reversible,
i.e., every parse tree built with G_f corresponds to a unique parse tree built with G.
Second, if we use the relative frequency estimator for our production probabilities, the
probability of a tree built with G_f is identical to the probability of the corresponding
tree built with G.
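The transform itself is easy to state in code. The following is a minimal sketch of the left-factorization (right-binarization) of a single n-ary production, using the composite-label convention visible in Figure 2; the function name and rule encoding are illustrative assumptions, and unary POS-to-word productions are left untouched, as in the text.

```python
def left_factor(label, children):
    """Factor one n-ary rule label -> children into a chain of binary rules
    over composite nonterminals, terminated by an epsilon production.
    Returns a list of (lhs, rhs) pairs, with rhs = () for epsilon."""
    rules = []
    lhs = label
    seen = []
    for child in children:
        seen.append(child)
        new_lhs = label + "-" + ",".join(seen)   # composite nonterminal
        rules.append((lhs, (child, new_lhs)))
        lhs = new_lhs
    rules.append((lhs, ()))                      # composite rewrites to epsilon
    return rules

# Example: VP -> VBD NP yields
#   VP        -> VBD VP-VBD
#   VP-VBD    -> NP  VP-VBD,NP
#   VP-VBD,NP -> epsilon
print(left_factor("VP", ["VBD", "NP"]))
```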
Finally, let us introduce the term c-command. We will use this notion in our condi-
tional probability model, and it is also useful for understanding some of the previous
work in this area. The simple definition of c-command that we will be using in this
paper is the following: a node A c-commands a node B if and only if (i) A does not
dominate B; and (ii) the lowest branching node (i.e., non-unary node) that dominates
A also dominates B.6 Thus in Figure 1(a), the subject NP and the VP each c-command
the other, because neither dominates the other and the lowest branching node above
both (the S) dominates the other. Notice that the subject NP c-commands the object
NP, but not vice versa, since the lowest branching node that dominates the object NP
is the VP, which does not dominate the subject NP.
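The c-command test can be made concrete as follows; this is a sketch under assumed conventions (trees as (label, [children]) tuples, node identity taken to be object identity), not the representation actually used by the parser.

```python
def dominates(a, b):
    """True if node a properly dominates node b."""
    if isinstance(a, str):
        return False
    for child in a[1]:
        if child is b or dominates(child, b):
            return True
    return False

def lowest_branching_ancestor(root, target):
    """Lowest non-unary node in the tree rooted at `root` that properly
    dominates `target`, or None if there is none."""
    if isinstance(root, str) or not dominates(root, target):
        return None
    for child in root[1]:
        found = lowest_branching_ancestor(child, target)
        if found is not None:
            return found
    return root if len(root[1]) > 1 else None

def c_commands(root, a, b):
    """A c-commands B iff A does not dominate B and the lowest branching
    node that dominates A also dominates B."""
    if dominates(a, b):
        return False
    lba = lowest_branching_ancestor(root, a)
    return lba is not None and dominates(lba, b)

# The example from the text, on the Figure 1(a) tree:
subj = ("NP", ["Spot"])
obj  = ("NP", [("DT", ["the"]), ("NN", ["ball"])])
vp   = ("VP", [("VBD", ["chased"]), obj])
root = ("S", [subj, vp])
print(c_commands(root, subj, obj))   # True: lowest branching node above subj is S
print(c_commands(root, obj, subj))   # False: lowest branching node above obj is VP
```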
P(w_0^k) = P(w_0) P(w_1 | w_0) \cdots P(w_{n-1} | w_0^{n-2}) \prod_{i=n}^{k} P(w_i | w_{i-n}^{i-1})    (4)
These models are commonly called n-gram models. The standard language model
used in many speech recognition systems is the trigram model, i.e., a Markov model
of order 2, which can be characterized by the following equation:
P(w_0^{n-1}) = P(w_0) P(w_1 | w_0) \prod_{i=2}^{n-1} P(w_i | w_{i-2}^{i-1})    (5)
To smooth the trigram models that are used in this paper, we interpolate the
probability estimates of higher-order Markov models with lower-order Markov models
(Jelinek and Mercer 1980). The idea behind interpolation is simple, and it has been
shown to be very effective. For an interpolated (n + 1)-gram:

\hat{P}(w_i | w_{i-n}^{i-1}) = \lambda_n(w_{i-n}^{i-1}) \tilde{P}(w_i | w_{i-n}^{i-1}) + (1 - \lambda_n(w_{i-n}^{i-1})) \hat{P}(w_i | w_{i-n+1}^{i-1})    (6)

Here \tilde{P} is the empirically observed relative frequency, and \lambda_n is a function from V^n to
[0,1]. This interpolation is recursively applied to the smaller-order n-grams until the
bigram is finally interpolated with the unigram, i.e., \lambda_0 = 1.
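The recursion can be sketched as follows, with constant mixing weights standing in for the history-dependent \lambda_n(·) of the text and with hypothetical helper names; it is meant to illustrate the interpolation scheme, not the exact estimator used in the experiments.

```python
from collections import Counter

def make_counts(tokens, max_order):
    """counts[n] holds counts of (n+1)-word tuples (n-word history + word)."""
    counts = [Counter() for _ in range(max_order + 1)]
    for i in range(len(tokens)):
        for n in range(max_order + 1):
            if i - n >= 0:
                counts[n][tuple(tokens[i - n:i + 1])] += 1
    return counts

def interpolated_prob(word, history, counts, lambdas):
    """P_hat(word | history): empirical relative frequency mixed with the
    interpolated lower-order estimate, bottoming out at the unigram."""
    n = len(history)
    if n == 0:
        total = sum(counts[0].values())
        return counts[0][(word,)] / total if total else 0.0
    ngram = counts[n][history + (word,)]
    context = counts[n - 1][history]
    relfreq = ngram / context if context else 0.0
    lam = lambdas[n] if context else 0.0     # back off fully on unseen histories
    return lam * relfreq + (1 - lam) * interpolated_prob(word, history[1:], counts, lambdas)

# Tiny usage example (counts here run across sentence boundaries, for brevity):
tokens = "<s> Spot chased the ball </s> <s> Spot saw the ball </s>".split()
counts = make_counts(tokens, 2)
print(interpolated_prob("ball", ("chased", "the"), counts, {1: 0.7, 2: 0.8}))
```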
3. Previous Work
There have been attempts to jump over adjacent words to words farther back in the
left context, without the use of dependency links or syntactic structure, for example
Saul and Pereira (1997) and Rosenfeld (1996, 1997). We will focus our very brief review,
however, on those that use grammars or parsing for their language models. These can
be divided into two rough groups: those that use the grammar as a language model,
and those that use a parser to uncover phrasal heads standing in an important relation
(c-command) to the current word. The approach that we will subsequently present uses
the probabilistic grammar as its language model, but only includes probability mass
from those parses that are found, that is, it uses the parser to find a subset of the total
set of parses (hopefully most of the high-probability parses) and uses the sum of their
probabilities as an estimate of the true probability given the grammar.
Stolcke and Segal (1994) and Jurafsky et al. (1995) used these basic ideas to es-
timate bigram probabilities from hand-written PCFGs, which were then used in lan-
guage models. Interpolating the observed bigram probabilities with these calculated
bigrams led, in both cases, to improvements in word error rate over using the observed
bigrams alone, demonstrating that there is some benefit to using these syntactic lan-
guage models to generalize beyond observed n-grams.
Figure 3
Tree representation of a derivation state.
The structured language model (SLM) used in Chelba and Jelinek (1998a, 1998b,
1999), Jelinek and Chelba (1999), and Chelba (2000) is similar to that of Goddeau,
except that (i) their shift-reduce parser follows a nondeterministic beam search, and
(ii) each stack entry contains, in addition to the nonterminal node label, the headword
of the constituent. The SLM is like a trigram, except that the conditioning words are
taken from the tops of the stacks of candidate parses in the beam, rather than from
the linear order of the string.
Their parser functions in three stages. The first stage assigns a probability to the
word given the left context (represented by the stack state). The second stage predicts
the POS given the word and the left context. The last stage performs all possible parser
operations (reducing stack entries and shifting the new word). When there is no more
parser work to be done (or, in their case, when the beam is full), the following word
is predicted. And so on until the end of the string.
Each different POS assignment or parser operation is a step in a derivation. Each
distinct derivation path within the beam has a probability and a stack state associated
with it. Every stack entry has a nonterminal node label and a designated headword of
the constituent. When all of the parser operations have finished at a particular point in
the string, the next word is predicted as follows: For each derivation in the beam, the
headwords of the two topmost stack entries form a trigram with the conditioned word.
This interpolated trigram probability is then multiplied by the normalized probability
of the derivation, to provide that derivation's contribution to the probability of the
word. More precisely, for a beam of derivations Di
10 Johnson et al. (1999), Henderson and Brill (1999), and Collins (2000) demonstrate methods for choosing
the best complete parse tree from among a set of complete parse trees, and the latter two show
accuracy improvements over some of the parsers cited above, from which they generated their
candidate sets. Here we will be comparing our work with parsing algorithms, i.e., algorithms that
build parses for strings of words.
constituents on the right-hand side with their lexical heads, given the left-hand side
constituent and its lexical head. The same procedure works if we annotate parent infor-
mation onto constituents. This is how Johnson (1998) conditioned the probabilities of
productions: the left-hand side is no longer, for example, S, but rather S^SBAR, i.e., an
S with SBAR as parent. Notice, however, that in this case the annotations on the right-
hand side are predictable from the annotation on the left-hand side (unlike, for ex-
ample, bilexical grammars), so that the relative frequency estimator yields conditional
probability distributions of the original rules, given the parent of the left-hand side.
All of the conditioning information that we will be considering will be of this
latter sort: the only novel predictions being made by rule expansions are the node
labels of the constituents on the right-hand side. Everything else is already specified
by the left context. We use the relative frequency estimator, and smooth our production
probabilities by interpolating the relative frequency estimates with those obtained by
"annotating" less contextual information.
This perspective on conditioning production probabilities makes it easy to see that,
in essence, by conditioning these probabilities, we are growing the state space. That
is, the number of distinct nonterminals grows to include the composite labels; so does
the number of distinct productions in the grammar. In a top-down parser, each rule
expansion is made for a particular candidate parse, which carries with it the entire
rooted derivation to that point; in a sense, the left-hand side of the rule is annotated
with the entire left context, and the rule probabilities can be conditioned on any aspect
of this derivation.
We do not use the entire left context to condition the rule probabilities, but rather
"pick-and-choose" which events in the left context we would like to condition on.
One can think of the conditioning events as functions, which take the partial tree
structure as an argument and return a value, upon which the rule probability can be
conditioned. Each of these functions is an algorithm for walking the provided tree and
returning a value. For example, suppose that we want to condition the probability of
the rule A → α. We might write a function that takes the partial tree, finds the parent
of the left-hand side of the rule and returns its node label. If the left-hand side has no
parent (i.e., it is at the root of the tree), the function returns the null value (NULL). We
might write another function that returns the nonterminal label of the closest sibling to
the left of A, and NULL if no such node exists. We can then condition the probability
of the production on the values that were returned by the set of functions.
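For example, two such tree-walking functions might be sketched as follows; the node representation (dicts with label, parent, and children fields) and the names are assumptions made for illustration, not the parser's actual data structures.

```python
def parent_label(node):
    """Label of the parent of the left-hand side, or None (NULL) at the root."""
    return node["parent"]["label"] if node["parent"] is not None else None

def left_sibling_label(node):
    """Label of the closest sibling to the left of the node, or None (NULL)."""
    if node["parent"] is None:
        return None
    siblings = node["parent"]["children"]
    i = next(k for k, s in enumerate(siblings) if s is node)
    return siblings[i - 1]["label"] if i > 0 else None

# The rule probability is then conditioned on the tuple of returned values, e.g.
#   condition = (node["label"], parent_label(node), left_sibling_label(node))
#   P(A -> alpha | condition)
```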
Recall that we are working with a factored grammar, so some of the nodes in the
factored tree have nonterminal labels that were created by the factorization, and may
not be precisely what we want for conditioning purposes. In order to avoid any con-
fusions in identifying the nonterminal label of a particular rule production in either its
factored or nonfactored version, we introduce the function constituent(A) for every
nonterminal in the factored grammar G_f, which is simply the label of the constituent
whose factorization results in A. For example, in Figure 2, constituent(NP-DT,NN)
is simply NP.
Note that a function can return different values depending upon the location in
the tree of the nonterminal that is being expanded. For example, suppose that we have
a function that returns the label of the closest sibling to the left of constituent(A),
or NULL if no such node exists. Then a subsequent function could be defined as
follows: return the parent of the parent (the grandparent) of constituent(A) only if
constituent(A) has no sibling to the left--in other words, if the previous function
returns NULL; otherwise return the second closest sibling to the left of constituent(A),
or, as always, NULL if no such node exists. If the function returns, for example, NP,
this could either mean that the grandparent is NP or the second closest sibling is
Figure 4
Conditional probability model represented as a decision tree, identifying the location in the
partial parse tree of the conditioning information. (The levels shown include, for all rules
A → α: the left-hand side A itself; the parent, Y_p, of constituent(A) in the derivation; and the
closest sibling, Y_s, to the left of constituent(A) in the derivation.)
NP; yet there is no ambiguity in the meaning of the function, since the result of the
previous function disambiguates between the two possibilities.
The functions that were used for the present study to condition the probability
of the rule, A → α, are presented in Figure 4, in a tree structure. This is a sort of
decision tree for a tree-walking algorithm to decide what value to return, for a given
partial tree and a given depth. For example, if the algorithm is asked for the value
at level 0, it will return A, the left-hand side of the rule being expanded.11 Suppose
the algorithm is asked for the value at level 4. After level 2 there is a branch in the
decision tree. If the left-hand side of the rule is a POS, and there is no sibling to the left
of constituent(A) in the derivation, then the algorithm takes the right branch of the
decision tree to decide what value to return; otherwise the left branch. Suppose it takes
the left branch. Then after level 3, there is another branch in the decision tree. If the
left-hand side of the production is a POS, then the algorithm takes the right branch of
the decision tree, and returns (at level 4) the POS of the closest c-commanding lexical
head to A, which it finds by walking the parse tree; if the left-hand side of the rule is
not a POS, then the algorithm returns (at level 4) the closest sibling to the left of the
parent of constituent(A).
The functions that we have chosen for this paper follow from the intuition (and
experience) that what helps parsing is different depending on the constituent that is
being expanded. POS nodes have lexical items on the right-hand side, and hence can
bring into the model some of the head-head dependencies that have been shown to
be so effective. If the POS is leftmost within its constituent, then very often the lexical

11 Recall that A can be a composite nonterminal introduced by grammar factorization. When the function
is defined in terms of constituent(A), the values returned are obtained by moving through the
nonfactored tree.
Table 1
Levels of conditioning information, mnemonic labels, and a brief description of the
information level for empirical results.
Conditioning Mnemonic Label Information Level
0,0,0 none Simple PCFG
2,2,2 par+sib Small amount of structural context
5,2,2 NT struct All structural (non-lexical) context for non-POS
6,2,2 NT head Everything for non-POS expansions
6,3,2 POS struct More structural info for leftmost POS expansions
6,5,2 attach All attachment info for leftmost POS expansions
6,6,4 all Everything
item is sensitive to the governing category to which it is attaching. For example, if the
POS is a preposition, then its probability of expanding to a particular word is very
different if it is attaching to a noun phrase than if it is attaching to a verb phrase,
and perhaps quite different depending on the head of the constituent to which it is
attaching. Subsequent POSs within a constituent are likely to be open-class words, and
less dependent on these sorts of attachment preferences.
Conditioning on parents and siblings of the left-hand side has proven to be very
useful. To understand why this is the case, one need merely think of VP expansions.
If the parent of a VP is another VP (i.e., if an auxiliary or modal verb is used), then the
distribution over productions is different than if the parent is an S. Conditioning on
head information, both POS of the head and the lexical item itself, has proven useful
as well, although given our parser's left-to-right orientation, in many cases the head
has not been encountered within the particular constituent. In such a case, the head
of the last child within the constituent is used as a proxy for the constituent head. All
of our conditioning functions, with one exception, return either parent or sibling node
labels at some specific distance from the left-hand side, or head information from c-
commanding constituents. The exception is the function at level 5 along the left branch
of the tree in Figure 4. Suppose that the node being expanded is being conjoined with
another node, which we can tell by the presence or absence of a CC node. In that case,
we want to condition the expansion on how the conjoining constituent expanded. In
other words, this attempts to capture a certain amount of parallelism between the
expansions of conjoined categories.
In presenting the parsing results, we will systematically vary the amount of con-
ditioning information, so as to get an idea of the behavior of the parser. We will refer
to the amount of conditioning by specifying the deepest level from which a value
is returned for each branching path in the decision tree, from left to right in Fig-
ure 4: the first number is for left contexts where the left branch of the decision tree
is always followed (non-POS nonterminals on the left-hand side); the second number
is for a left branch followed by a right branch (POS nodes that are leftmost within
their constituent); and the third number is for the contexts where the right branch is
always followed (POS nodes that are not leftmost within their constituent). For exam-
ple, (4,3,2) would represent a conditional probability model that (i) returns NULL for
all functions below level 4 in all contexts; (ii) returns NULL for all functions below
level 3 if the left-hand side is a POS; and (iii) returns NULL for all functions below
level 2 for nonleftmost POS expansions.
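A small sketch of how such a specification truncates the conditioning values, under an assumed list-of-values-per-level representation (this is illustrative only, not the parser's code):

```python
def truncate(values, spec, is_pos, leftmost):
    """values: the value returned at each level of the decision tree, with
    level 0 being the left-hand side A itself.
    spec: (non_pos, leftmost_pos, other_pos) cutoff levels, e.g. (4, 3, 2).
    Values from levels deeper than the cutoff for the relevant context are
    replaced by None (NULL)."""
    if not is_pos:
        cutoff = spec[0]          # non-POS left-hand side
    elif leftmost:
        cutoff = spec[1]          # POS that is leftmost within its constituent
    else:
        cutoff = spec[2]          # POS that is not leftmost
    return [v if level <= cutoff else None for level, v in enumerate(values)]
```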
Table 1 gives a breakdown of the different levels of conditioning information used
in the empirical trials, with a mnemonic label that will be used when presenting results.
These different levels were chosen as somewhat natural points at which to observe
A candidate analysis C = (D, S, P_D, F, w_i^n) derives a new candidate analysis
C' = (D', S', P_{D'}, F', w_j^n), which we write C ⇒ C', if and only if:12

i. D' = D + A → X_0 ... X_k;
ii. S = Aα$;
iii. either S' = X_0 ... X_k α$ and j = i,
     or k = 0, X_0 = w_i, j = i + 1, and S' = α$;
iv. P_{D'} = P_D · P(A → X_0 ... X_k); and
v. F' = P_{D'} · LAP(S', w_j)

The parse begins with a single candidate analysis on the priority queue: (⟨⟩, S†$,
1, 1, w_0^n). Next, the top-ranked candidate analysis, C = (D, S, P_D, F, w_i^n), is popped from
the priority queue. If S = $ and w_i = ⟨/s⟩, then the analysis is complete. Otherwise,
all C' such that C ⇒ C' are pushed onto the priority queue.
12 Again, for ease of exposition, we will ignore ε-productions. Everything presented here can be
straightforwardly extended to include them. The + in (i) denotes concatenation. To avoid confusion
between sets and sequences, ∅ will not be used for empty strings or sequences; rather, the symbol ⟨⟩
will be used. Note that the script S is used to denote stacks, while S† is the start symbol.
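Putting the pieces together, the expansion step (i)-(v) can be sketched as a best-first search. This is not the author's implementation: the per-word priority queues and beam pruning described below are omitted for brevity, and rules, nonterminals, and lap are assumed to be supplied elsewhere.

```python
import heapq

def parse(words, rules, nonterminals, lap, start="S+"):
    """words: the input string, ending with the </s> symbol.
    rules[A]: list of (rhs, prob) pairs, rhs a tuple (empty for epsilon).
    lap(stack, i): the look-ahead probability at word position i (assumed given).
    `start` stands in for the augmented start symbol S-dagger."""
    counter = 0                                     # tie-breaker for the heap
    queue = [(-1.0, counter, (), (start, "$"), 1.0, 0)]
    while queue:
        neg_f, _, deriv, stack, p_d, i = heapq.heappop(queue)
        if stack[0] == "$":                         # stack exhausted
            if i == len(words):
                return deriv, p_d                   # complete analysis
            continue
        a, rest = stack[0], stack[1:]
        for rhs, p in rules.get(a, []):
            new_deriv = deriv + ((a, rhs),)         # (i)   D' = D + A -> rhs
            new_p = p_d * p                         # (iv)  P_D' = P_D * P(rule)
            if len(rhs) == 1 and rhs[0] not in nonterminals:
                if i < len(words) and rhs[0] == words[i]:
                    new_stack, j = rest, i + 1      # terminal matches look-ahead
                else:
                    continue                        # mismatch: not a valid expansion
            else:
                new_stack, j = rhs + rest, i        # (iii) push RHS; epsilon pushes nothing
            f = new_p * lap(new_stack, j)           # (v)   figure of merit
            counter += 1
            heapq.heappush(queue, (-f, counter, new_deriv, new_stack, new_p, j))
    return None, 0.0                                # no complete analysis found
```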
We implement this as a beam search. For each word position i, we have a separate
priority queue H_i of analyses with look-ahead w_i. When there are "enough" analyses
by some criteria (which we will discuss below) on priority queue H_{i+1}, all candidate
analyses remaining on H_i are discarded. Since w_n = ⟨/s⟩, all parses that are pushed
onto H_{n+1} are complete. The parse on H_{n+1} with the highest probability is returned
for evaluation. In the case that no complete parse is found, a partial parse is returned
and evaluated.
The LAP is the probability of a particular terminal being the next left corner of a
particular analysis. The terminal may be the left corner of the topmost nonterminal
on the stack of the analysis or it might be the left corner of the nth nonterminal, after
the top n - 1 nonterminals have rewritten to ε. Of course, we cannot expect to have
adequate statistics for each nonterminal/word pair that we encounter, so we smooth
to the POS. Since we do not know the POS for the word, we must sum the LAP for
all POS labels.13
For a PCFG G, a stack S = A_0 ... A_n$ (which we will write A_0^n$) and a look-ahead
terminal item w_i, we define the look-ahead probability as follows:

P_G(A_j \Rightarrow^* w_i \alpha) \approx \lambda_{A_j} \tilde{P}(A_j \Rightarrow^* w_i \alpha) + (1 - \lambda_{A_j}) \sum_{X \in V} \tilde{P}(A_j \Rightarrow^* X \alpha) \tilde{P}(X \to w_i)    (11)

The lambdas are a function of the frequency of the nonterminal A_j, in the standard
way (Jelinek and Mercer 1980).
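Equation 11 amounts to the following computation for a single stack nonterminal; the table and parameter names below are hypothetical, and the full LAP additionally sums over the possibility that nonterminals above A_j rewrite to ε.

```python
def lap_nonterminal(a, word, p_word_lc, p_pos_lc, p_tag, lam, pos_labels):
    """Smoothed estimate of P_G(a =>* word ...) for one stack nonterminal `a`.
    p_word_lc[a][w]: empirical P(a =>* w ...) with w as left corner of a.
    p_pos_lc[a][X]:  empirical P(a =>* X ...) for POS label X.
    p_tag[X][w]:     empirical P(X -> w).
    lam[a]:          nonterminal-specific mixing weight."""
    direct = p_word_lc.get(a, {}).get(word, 0.0)
    smoothed = sum(p_pos_lc.get(a, {}).get(x, 0.0) * p_tag.get(x, {}).get(word, 0.0)
                   for x in pos_labels)
    la = lam.get(a, 0.0)
    return la * direct + (1.0 - la) * smoothed
```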
The beam threshold at word w_i is a function of the probability of the top-ranked
candidate analysis on priority queue H_{i+1} and the number of candidates on H_{i+1}. The
basic idea is that we want the beam to be very wide if there are few analyses that have
been advanced, but relatively narrow if many analyses have been advanced. If \tilde{p} is the
probability of the highest-ranked analysis on H_{i+1}, then another analysis is discarded
if its probability falls below \tilde{p} f(\gamma, |H_{i+1}|), where \gamma is an initial parameter, which we
call the base beam factor. For the current study, \gamma was 10^-11, unless otherwise noted,
and f(\gamma, |H_{i+1}|) = \gamma |H_{i+1}|^3. Thus, if 100 analyses have already been pushed onto H_{i+1},
then a candidate analysis must have a probability above 10^-5 \tilde{p} to avoid being pruned.
After 1,000 candidates, the beam has narrowed to 10^-2 \tilde{p}. There is also a maximum
number of allowed analyses on H_i, in case the parse fails to advance an analysis to
H_{i+1}. This was typically 10,000.
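The pruning condition can be stated compactly; this is a sketch of the rule just described, not the actual code.

```python
def within_beam(prob, best_prob, num_advanced, gamma=1e-11):
    """Keep an analysis only while its probability stays above
    best_prob * gamma * |H_{i+1}|^3."""
    return prob >= best_prob * gamma * (num_advanced ** 3)

# With gamma = 1e-11: after 100 analyses the threshold is 1e-5 * best_prob,
# and after 1,000 analyses it is 1e-2 * best_prob, as in the text.
```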
As mentioned in Section 2.1, we left-factor the grammar, so that all productions are
binary, except those with a single terminal on the right-hand side and epsilon produc-
tions. The only ε-productions are those introduced by left-factorization. Our factored

13 Equivalently, we can split the analyses at this point, so that there is one POS per analysis.
grammar was produced by factoring the trees in the training corpus before grammar
induction, which proceeded in the standard way, by counting rule frequencies.
5. Empirical Results
The empirical results will be presented in three stages: (i) trials to examine the accuracy
and efficiency of the parser; (ii) trials to examine its effect on test corpus perplexity
and recognition performance; and (iii) trials to examine the effect of beam variation
on these performance measures. Before presenting the results, we will introduce the
methods of evaluation.
5.1 Evaluation
Perplexity is a standard measure within the speech recognition community for com-
paring language models. In principle, if two models are tested on the same test corpus,
the model that assigns the lower perplexity to the test corpus is the model closest to
the true distribution of the language, and thus better as a prior model for speech
recognition. Perplexity is the exponential of the cross entropy, which we will define
next.
Given a random variable X with distribution p and a probability model q, the cross
entropy, H(p, q), is defined as follows:

H(p, q) = -\sum_{x} p(x) \log_2 q(x)    (12)

Let p be the true distribution of the language. Then, under certain assumptions, given
a large enough sample, the sample mean of the negative log probability of a model
will converge to its cross entropy with the true model.14 That is

H(p, q) = -\lim_{n \to \infty} \frac{1}{n} \log_2 q(w_0^n)    (13)

where w_0^n is a string of the language L. In practice, one takes a large sample of the
language, and calculates the negative log probability of the sample, normalized by its
language, and calculates the negative log probability of the sample, normalized by its
size. 15 The lower the cross entropy (i.e., the higher the probability the model assigns
to the sample), the better the model. Usually this is reported in terms of perplexity,
which we will do as well. 16
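For concreteness, a minimal sketch of the computation, taking the model's per-word conditional probabilities (including the end symbol for each sentence, per footnote 15) as given:

```python
import math

def cross_entropy(word_probs):
    """Negative log (base 2) probability per word over the test sample."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Perplexity is the exponential (base 2) of the cross entropy."""
    return 2.0 ** cross_entropy(word_probs)
```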
Some of the trials discussed below will report results in terms of word and/or
sentence error rate, which are obtained when the language model is embedded in a
speech recognition system. Word error rate is the number of deletion, insertion, or
substitution errors per 100 words. Sentence error rate is the number of sentences with
one or more errors per 100 sentences.
Statistical parsers are typically evaluated for accuracy at the constituent level,
rather than simply on whether or not the parse that the parser found is completely
correct. A constituent for evaluation purposes consists of a label (e.g., NP) and a span
(beginning and ending word positions). For example, in Figure l(a), there is a VP that
spans the words "chased the ball". Evaluation is carried out on a hand-parsed test
corpus, and the manual parses are treated as correct. We will call the manual parse
14 See Cover and Thomas (1991) for a discussion of the Shannon-McMillan-Breiman theorem, under the
assumptions of which this convergence holds.
15 It is important to remember to include the end marker in the strings of the sample.
16 When assessing the magnitude of a perplexity improvement, it is often better to look at the reduction
in cross entropy, by taking the log of the perplexity. It will be left to the reader to do so.
GOLD and the parse that the parser returns TEST. Precision is the number of common
constituents in GOLD and TEST divided by the number of constituents in TEST. Recall
is the number of common constituents in GOLD and TEST divided by the number of
constituents in GOLD. Following standard practice, we will be reporting scores only
for non-part-of-speech constituents, which are called labeled recall (LR) and labeled
precision (LP). Sometimes in figures we will plot their average, and also what can be
termed the parse error, which is one minus their average.
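A minimal sketch of these measures, with constituents encoded as (label, start, end) triples; unlike the full PARSEVAL scorer, duplicate constituents are ignored here for simplicity.

```python
def labeled_precision_recall(gold, test):
    """gold, test: iterables of (label, start, end) constituents (non-POS only)."""
    gold_set, test_set = set(gold), set(test)
    common = gold_set & test_set
    lp = len(common) / len(test_set) if test_set else 0.0
    lr = len(common) / len(gold_set) if gold_set else 0.0
    return lp, lr

# The "parse error" plotted in the figures is 1 - (LP + LR) / 2.
```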
LR and LP are part of the standard set of PARSEVAL measures of parser qual-
ity (Black et al. 1991). From this set of measures, we will also include the crossing
bracket scores: average crossing brackets (CB), percentage of sentences with no cross-
ing brackets (0 CB), and the percentage of sentences with two crossing brackets or
fewer (≤ 2 CB). In addition, we show the average number of rule expansions con-
sidered per word, that is, the number of rule expansions for which a probability was
calculated (see Roark and Charniak [2000]), and the average number of analyses ad-
vanced to the next priority queue per word.
This is an incremental parser with a pruning strategy and no backtracking. In such
a model, it is possible to commit to a set of partial analyses at a particular point that
cannot be completed given the rest of the input string (i.e., the parser can "garden
path"). In such a case, the parser fails to return a complete parse. In the event that
no complete parse is found, the highest initially ranked parse on the last nonempty
priority queue is returned. All unattached words are then attached at the highest
level in the tree. In such a way we predict no new constituents and all incomplete
constituents are closed. This structure is evaluated for precision and recall, which is
entirely appropriate for these incomplete as well as complete parses. If we fail to
identify nodes later in the parse, recall will suffer, and if our early predictions were
bad, both precision and recall will suffer. Of course, the percentage of these failures
is reported as well.
Table 2
Results conditioning on various contextual events, standard training and testing corpora.

Conditioning   LR    LP    CB    0 CB   ≤2 CB   Percent   Average Rule    Average
                                                Failed    Expansions      Analyses
                                                          Considered†     Advanced†

Section 23: 2,245 sentences of length ≤ 40
none           71.1  75.3  2.48  37.3   62.9    0.9       14,369          516.5
par+sib        82.8  83.6  1.55  54.3   76.2    1.1        9,615          324.4
NT struct      84.3  84.9  1.38  56.7   79.5    1.0        8,617          284.9
NT head        85.6  85.7  1.27  59.2   81.3    0.9        7,600          251.6
POS struct     86.1  86.2  1.23  60.9   82.0    1.0        7,327          237.9
attach         86.7  86.6  1.17  61.7   83.2    1.2        6,834          216.8
all            86.6  86.5  1.19  62.0   82.7    1.3        6,379          198.4

Section 23: 2,416 sentences of length ≤ 100
attach         85.8  85.8  1.40  58.9   80.3    1.5        7,210          227.9
all            85.7  85.7  1.41  59.0   79.9    1.7        6,709          207.6

†per word
Figure 5
Reduction in average precision/recall error and in number of rule expansions per word as
conditioning increases, for sentences of length ≤ 40.
Figure 6
Observed running time on Section 23 of the Penn Treebank, with the full conditional
probability model and beam of 10^-11, using one 300 MHz UltraSPARC processor and 256 MB
of RAM of a Sun Enterprise 450. (Plotted: individual parse times by sentence length, alongside
the observed linear time reported in Ratnaparkhi 1997.)
accuracies cited above.17 Of the 2,416 sentences in the section, 728 had the totally cor-
rect parse, 30.1 percent tree accuracy. Also, the parser returns a set of candidate parses,
from which we have been choosing the top ranked; if we use an oracle to choose the
parse with the highest accuracy from among the candidates (which averaged 70.0 in
number per sentence), we find an average labeled precision/recall of 94.1, for sentences
of length ≤ 100. The parser, thus, could be used as a front end to some other model,
with the hopes of selecting a more accurate parse from among the final candidates.
While we have shown that the conditioning information improves the efficiency in
terms of rule expansions considered and analyses advanced, what does the efficiency
of such a parser look like in practice? Figure 6 shows the observed time at our standard
base beam of 10^-11 with the full conditioning regimen, alongside an approximation
of the reported observed (linear) time in Ratnaparkhi (1997). Our observed times look
polynomial, which is to be expected given our pruning strategy: the denser the com-
petitors within a narrow probability range of the best analysis, the more time will be
spent working on these competitors; and the farther along in the sentence, the more
chance for ambiguities that can lead to such a situation. While our observed times are
not linear, and are clearly slower than his times (even with a faster machine), they are
quite respectably fast. The differences between a k-best and a beam-search parser (not
to mention the use of dynamic programming) make a running time difference unsur-
17 Our score of 85.8 average labeled precision and recall for sentences less than or equal to 100 on
Section 23 compares to: 86.7 in Charniak (1997), 86.9 in Ratnaparkhi (1997), 88.2 in Collins (1999), 89.6
in Charniak (2000), and 89.75 in Collins (2000).
prising. What is perhaps surprising is that the difference is not greater. Furthermore,
this is quite a large beam (see discussion below), so that very large improvements in
efficiency can be had at the expense of the number of analyses that are retained.
5.3 Perplexity Results
The next set of results will highlight what recommends this approach most: the ease
with which one can estimate string probabilities in a single pass from left to right across
the string. By definition, a PCFG's estimate of a string's probability is the sum of the
probabilities of all trees that produce the string as terminal leaves (see Equation 1).
In the beam search approach outlined above, we can estimate the string's probability
in the same manner, by summing the probabilities of the parses that the algorithm
finds. Since this is not an exhaustive search, the parses that are returned will be a
subset of the total set of trees that would be used in the exact PCFG estimate; hence
the estimate thus arrived at will be bounded above by the probability that would be
generated from an exhaustive search. The hope is that a large amount of the probability
mass will be accounted for by the parses in the beam. The method cannot overestimate
the probability of the string.
Recall the discussion of the grammar models above, and our definition of the set
of partial derivations D_{w_0^j} with respect to a prefix string w_0^j (see Equations 2 and 7).
By definition,

P(w_{j+1} | w_0^j) = \frac{P(w_0^{j+1})}{P(w_0^j)} = \frac{\sum_{d \in D_{w_0^{j+1}}} P(d)}{\sum_{d \in D_{w_0^j}} P(d)}    (14)
Note that the numerator at word w_j is the denominator at word w_{j+1}, so that the
product of all of the word probabilities is the numerator at the final word, namely, the
string prefix probability.
We can make a consistent estimate of the string probability by similarly summing
over all of the trees within our beam. Let H_i^{init} be the priority queue H_i before any
processing has begun with word w_i in the look-ahead. This is a subset of the possi-
ble leftmost partial derivations with respect to the prefix string w_0^{i-1}. Since H_{i+1}^{init} is
produced by expanding only analyses on priority queue H_i^{init}, the set of complete
trees consistent with the partial derivations on priority queue H_{i+1}^{init} is a subset of the
set of complete trees consistent with the partial derivations on priority queue H_i^{init},
that is, the total probability mass represented by the priority queues is monotoni-
cally decreasing. Thus conditional word probabilities defined in a way consistent with
Equation 14 will always be between zero and one. Our conditional word probabilities
are calculated as follows:

P(w_i | w_0^{i-1}) \approx \frac{\sum_{d \in H_{i+1}^{init}} P(d)}{\sum_{d \in H_i^{init}} P(d)}    (15)
we would need to renormalize by dividing by some factor. Note, however, that this
renormalization factor is necessarily less than one, and thus would uniformly increase
each word's probability under the model, that is, any perplexity results reported below
will be higher than the "true" perplexity that would be assigned with a properly
normalized distribution. In other words, renormalizing would make our perplexity
measure lower still. The hope, however, is that the improved parsing model provided
by our conditional probability model will cause the distribution over structures to
be more peaked, thus enabling us to capture more of the total probability mass, and
making this a fairly snug upper bound on the perplexity.
One final note on assigning probabilities to strings: because this parser does gar-
den path on a small percentage of sentences, this must be interpolated with another
estimate, to ensure that every word receives a probability estimate. In our trials, we
used the unigram, with a very small mixing coefficient:
P(w_i | w_0^{i-1}) = \lambda(w_0^{i-1}) \frac{\sum_{d \in H_{i+1}^{init}} P(d)}{\sum_{d \in H_i^{init}} P(d)} + (1 - \lambda(w_0^{i-1})) P(w_i)    (16)
If \sum_{d \in H_i^{init}} P(d) = 0 in our model, then our model provides no distribution over
following words, since the denominator is zero. Thus, \lambda(w_0^{i-1}) is set to zero in that case,
and the unigram alone provides the distribution.
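The word-probability computation of Equations 14-16 can be sketched as follows; beam_mass[i] stands for the summed analysis probabilities on H_i^{init}, and the constant lam is an illustrative stand-in for the \lambda(w_0^{i-1}) of the text.

```python
def word_probability(i, beam_mass, unigram_prob, lam=0.999):
    """Conditional probability of word w_i from the beam, interpolated with a
    unigram estimate so that every word receives non-zero probability."""
    prefix_mass = beam_mass[i]          # denominator of Equation 16
    next_mass = beam_mass[i + 1]        # numerator of Equation 16
    if prefix_mass == 0.0:
        return unigram_prob             # parser garden-pathed: unigram alone
    return lam * (next_mass / prefix_mass) + (1.0 - lam) * unigram_prob
```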
Table 3
Results conditioning on various contextual events, Sections 23-24, modifications following
Chelba and Jelinek.

Corpora      Conditioning   LR    LP    Percent   Perplexity   Average Rule   Average
                                        Failed                 Expansions     Analyses
                                                               Considered†    Advanced†

Sections 23-24: 3,761 sentences ≤ 120
unmodified   all            85.2  85.1  1.7                     7,206         213.5
no punct     all            82.4  82.9  0.2                     9,717         251.8
C&J corpus   par+sib        75.2  77.4  0.1       310.04       17,418         457.2
C&J corpus   NT struct      77.3  79.2  0.1       290.29       15,948         408.8
C&J corpus   NT head        79.2  80.4  0.1       255.85       14,239         363.2
C&J corpus   POS struct     80.5  81.6  0.1       240.37       13,591         341.3
C&J corpus   all            81.7  82.1  0.2       152.26       11,667         279.7

†per word
Figure 7
Reduction in average precision/recall error, number of rule expansions, and perplexity as
conditioning increases.
18 Our optimal mixture level was closer to 40 percent, but the difference was negligible.
Table 4
Comparison with previous perplexity results.

                             Perplexity
Paper                        Trigram Baseline   Model    Interpolation, λ = .36
Chelba and Jelinek (1998a)   167.14             158.28   148.90
Chelba and Jelinek (1998b)   167.14             153.76   147.70
Current results              167.02             152.26   137.26
19 Recall that our perplexity measure should, ideally, be even lower still.
20 Thanks to Ciprian Chelba for this suggestion.
21 See Chelba (2000) for details on the decoder.
Table 5
Word and sentence error rate results for various models, with differing training and
vocabulary sizes, for the best language model factor for that particular model.

Model                     Training   Vocabulary   LM       Percentage    Percentage
                          Size       Size         Weight   Word Error    Sentence
                                                           Rate          Error Rate
Lattice trigram           40M        20K          16       13.7          69.0
Chelba (2000) (λ = .4)    20M        20K          16       13.0
Current model             1M         10K          15       15.1          73.2
Treebank trigram          1M         10K          5        16.5          79.8
No language model                                 0        16.8          84.0
the idealized circumstances of the production (text read in a lab), the lattices are rel-
atively sparse, and in many cases 50 distinct string hypotheses were not found in
a lattice. We reranked an average of 22.9 hypotheses with our language model per
utterance.
One complicating issue has to do with the tokenization in the Penn Treebank
versus that in the HUB1 lattices. In particular, contractions (e.g., he's) are split in the
Penn Treebank (he 's) but not in the HUB1 lattices. Splitting of the contractions is
critical for parsing, since the two parts oftentimes (as in the previous example) fall
in different constituents. We follow Chelba (2000) in dealing with this problem: for
parsing purposes, we use the Penn Treebank tokenization; for interpolation with the
provided trigram model, and for evaluation, the lattice tokenization is used. If we are to
interpolate our model with the lattice trigram, we must wait until we have our model's
estimate for the probability of both parts of the contraction; their product can then be
interpolated with the trigram estimate. In fact, interpolation in these trials made no
improvement over the better of the uninterpolated models, but simply resulted in
performance somewhere between the better and the worse of the two models, so we
will not present interpolated trials here.
Table 5 reports the word and sentence error rates for five different models: (i) the
trigram model that comes with the lattices, trained on approximately 40M words, with
a vocabulary of 20,000; (ii) the best-performing model from Chelba (2000), which was
interpolated with the lattice trigram at ,~ = 0.4; (iii) our parsing model, with the same
training and vocabulary as the perplexity trials above; (iv) a trigram model with the
same training and vocabulary as the parsing model; and (v) no language model at all.
This last model shows the performance from the acoustic model alone, without the
influence of the language model. The log of the language model score is multiplied
by the language model (LM) weight when summing the logs of the language and
acoustic scores, as a way of increasing the relative contribution of the language model
to the composite score. We followed Chelba (2000) in using an LM weight of 16 for
the lattice trigram. For our model and the Treebank trigram model, the LM weight
that resulted in the lowest error rates is given.
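The rescoring computation is simply a weighted combination of log scores; the following sketch (hypothetical data layout, not the decoder actually used in these experiments) illustrates it.

```python
import math

def rescore(hypotheses, lm_weight):
    """hypotheses: list of (words, acoustic_logprob, lm_prob) triples, where
    lm_prob > 0 is the language model's probability for the word string.
    Returns the hypothesis with the best combined score."""
    def combined(h):
        _, acoustic_logprob, lm_prob = h
        return acoustic_logprob + lm_weight * math.log(lm_prob)
    return max(hypotheses, key=combined)
```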
The small size of our training data, as well as the fact that we are rescoring n-best
lists, rather than working directly on lattices, makes comparison with the other models
not particularly informative. What is more informative is the difference between our
model and the trigram trained on the same amount of data. We achieved an 8.5 percent
relative improvement in word error rate, and an 8.3 percent relative improvement
in sentence error rate over the Treebank trigram. Interestingly, as mentioned above,
interpolating two models together gave no improvement over the better of the two,
whether our model was interpolated with the lattice or the Treebank trigram. This
Table 6
Results with full conditioning on the C&J corpus at various base beam factors.

Base Beam   LR    LP    Percentage   Perplexity   Perplexity   Average Rule   Words Per
Factor                  Failed       λ = 0        λ = .36      Expansions     Second
                                                               Considered†

Sections 23-24: 3,761 sentences ≤ 120
10^-11      81.7  82.1  0.2          152.26       137.26       11,667          3.1
10^-10      81.5  81.9  0.3          154.25       137.88        6,982          5.2
10^-9       80.9  81.3  0.4          156.83       138.69        4,154          8.9
10^-8       80.2  80.6  0.6          160.63       139.80        2,372          15.3
10^-7       78.8  79.2  1.2          166.91       141.30        1,468          25.5
10^-6       77.4  77.9  1.5          174.44       143.05          871          43.8
10^-5       75.8  76.3  2.6          187.11       145.76          517          71.6
10^-4       72.9  73.9  4.5          210.28       148.41          306          115.5
10^-3       68.4  70.6  8.0          253.77       152.33          182          179.6

†per word
contrasts with our perplexity results reported above, as well as with the recognition
experiments in Chelba (2000), where the best results resulted from interpolated models.
The point of this small experiment was to see if our parsing model could provide
useful information even in the case that recognition errors occur, as opposed to the
(generally) fully grammatical strings upon which the perplexity results were obtained.
As one reviewer pointed out, given that our model relies so heavily on context, it may
have difficulty recovering from even one recognition error, perhaps more difficulty
than a more locally oriented trigram. While the improvements over the trigram model
in these trials are modest, they do indicate that our model is robust enough to provide
good information even in the face of noisy input. Future work will include more
substantial word recognition experiments.
The empirical results presented above are quite encouraging, and the potential of this
kind of approach both for parsing and language modeling seems v e r y promising.
Figure 8
Increase in average precision/recall error, model perplexity, interpolated perplexity, and
efficiency (i.e., decrease in rule expansions per word) as base beam factor decreases.
With a simple conditional probability model, and simple statistical search heuristics,
we were able to find very accurate parses efficiently, and, as a side effect, were able to
assign word probabilities that yield a perplexity improvement over previous results.
These perplexity improvements are particularly promising, because the parser is pro-
viding information that is, in some sense, orthogonal to the information provided by
a trigram model, as evidenced by the robust improvements to the baseline trigram
when the two models are interpolated.
There are several important future directions that will be taken in this area. First,
there is reason to believe that some of the conditioning information is not uniformly
useful, and we would benefit from finer distinctions. For example, the probability
of a preposition is presumably more dependent on a c-commanding head than the
probability of a determiner is. Yet in the current model they are both conditioned
on that head, as leftmost constituents of their respective phrases. Second, there are
advantages to top-down parsing that have not been examined to date, e.g., empty
categories. A top-down parser, in contrast to a standard bottom-up chart parser, has
enough information to predict empty categories only where they are likely to occur.
By including these nodes (which are in the original annotation of the Penn Treebank),
we may be able to bring certain long-distance dependencies into a local focus. In
addition, as mentioned above, we would like to further test our language model in
speech recognition tasks, to see if the perplexity improvement that we have seen can
lead to significant reductions in word error rate.
Other parsing approaches might also be used in the way that we have used a top-
down parser. Earley and left-corner parsers, as mentioned in the introduction, also
have rooted derivations that can be used to calculate generative string prefix probabilities.