ultimately interested in predicting the correct target string, regardless of its structure, we do not assign probabilities to these steps. The nonterminals on the target side are ignored entirely, and while the alignment algorithm considers possible pairs of nodes as elementary trees on the target side during training, the generative probability model should be thought of as only generating single nodes on the target side. Thus, the alignment algorithm is constrained by the bracketing on the target side, but does not generate the entire target tree structure.

While the probability model for tree transformation operates from the top of the tree down, probability estimation for aligning two trees takes place by iterating through pairs of nodes from each tree in bottom-up order, as sketched below:
for all nodes ε_a in source tree T_a in bottom-up order do
    for all elementary trees t_a rooted in ε_a do
        for all nodes ε_b in target tree T_b in bottom-up order do
            for all elementary trees t_b rooted in ε_b do
                for all alignments α of the children of t_a and t_b do
                    β(ε_a, ε_b) += P_elem(t_a|ε_a) P_align(α|ε_a) ∏_{(i,j)∈α} β(ε_i, ε_j)
                end for
            end for
        end for
    end for
end for
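This bottom-up computation maps directly onto code. Below is a minimal Python sketch of the inside-probability loop; elementary_trees, child_alignments, p_elem, and p_align are hypothetical helpers of our own naming, standing in for the model components described above rather than anything defined in the paper.

    from collections import defaultdict

    def inside_probabilities(source_nodes, target_nodes, elementary_trees,
                             child_alignments, p_elem, p_align):
        # beta[a, b]: probability that the subtree rooted at source node a
        # transforms into the subtree rooted at target node b
        beta = defaultdict(float)
        for node_a in source_nodes:               # bottom-up over the source tree
            for t_a in elementary_trees(node_a):  # O(m^2) elementary trees per node
                for node_b in target_nodes:       # bottom-up over the target tree
                    for t_b in elementary_trees(node_b):
                        # each alpha pairs up children of t_a and t_b; children
                        # left unpaired are deletions or insertions, whose cost
                        # is folded into p_align
                        for alpha in child_alignments(t_a, t_b):
                            prob = p_elem(t_a, node_a) * p_align(alpha, node_a)
                            for child_a, child_b in alpha:
                                prob *= beta[child_a, child_b]
                            beta[node_a, node_b] += prob
        return beta

Because the node lists are visited bottom-up, every β value on the right-hand side has already been computed when it is read.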
The outer two loops, iterating over nodes in each tree, require O(|T|²). Because we restrict our elementary trees to include at most one child of the root node on either side, choosing elementary trees for a node pair is O(m²), where m refers to the maximum number of children of a node. Computing the alignment between the 2m children of the elementary tree on either side requires choosing which subset of source nodes to delete, O(2^{2m}), which subset of target nodes to insert (or clone), O(2^{2m}), and how to reorder the remaining nodes from source to target tree, O((2m)!). Thus the overall complexity of the algorithm is O(|T|² m² 4^{2m} (2m)!): quadratic in the size of the input sentences, but exponential in the fan-out of the grammar.

2.1 Clone Operation

Both our constituent and dependency models make use of the "clone" operation introduced by Gildea (2003), which allows words to be aligned even in cases of radically mismatched trees, at a cost in the probability of the alignment. Allowing m-to-n matching of up to two nodes on either side of the parallel treebank allows for limited non-isomorphism between the trees. However, even given this flexibility, requiring alignments to match two input trees rather than one often makes tree-to-tree alignment more constrained than tree-to-string alignment. For example, even alignments with no change in word order may not be possible if the structures of the two trees are radically mismatched. Thus, it is helpful to allow departures from the constraints of the parallel bracketing, if this can be done without dramatically increasing computational complexity.

The clone operation allows a copy of a node from the source tree to be made anywhere in the target tree. After the clone operation takes place, the transformation of source into target tree proceeds using the tree decomposition and subtree alignment operations as before. The basic algorithm of the previous section remains unchanged, with the exception that the alignments α between children of two elementary trees can now include cloned, as well as inserted, nodes on the target side. Given that α specifies a new cloned node as a child of ε_j, the choice of which node to clone is made as in the tree-to-string model:

    P_clone(ε_i | clone ∈ α) = P_makeclone(ε_i) / Σ_k P_makeclone(ε_k)

Because a node from the source tree is cloned with equal probability regardless of whether it has already been "used" or not, the probability of a clone operation can be computed under the same dynamic programming assumptions as the basic tree-to-tree model. As with the tree-to-string cloning operation, this independence assumption is essential to keep the complexity polynomial in the size of the input sentences.
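As a concrete rendering of this normalization, the following Python fragment computes the clone distribution; p_makeclone is a hypothetical weight function (the name is ours):

    def p_clone(node_i, source_nodes, p_makeclone):
        # Choice of which source node to copy, given that the child
        # alignment alpha calls for a clone.  The weight p_makeclone is
        # deliberately independent of whether the node has been cloned
        # before; this independence is what keeps the dynamic program
        # polynomial.
        total = sum(p_makeclone(node_k) for node_k in source_nodes)
        return p_makeclone(node_i) / total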
3 Dependency Tree-to-Tree Alignments

Dependencies were found to be more consistent than constituent structure between French and English by Fox (2002), though this study used a tree representation on the English side only. We wish to investigate whether dependency trees are also more suited to tree-to-tree alignment.

Figure 1 shows a typical Xinhua newswire sentence with the Chinese parser output, and the sentence's English translation with its parse tree. The conversion to dependency representation is shown below the original parse trees.

Examination of the trees shows both cases where the dependency representation is more similar across the two languages, as well as its potential pitfalls. The initial noun phrase, "14 Chinese open border cities", has two subphrases with a level of constituent structure (the QP and the lower NP) not found in the English parse. In this case, the difference in constituent structure derives primarily from differences in the annotation style between the original English and Chinese treebanks (Marcus et al., 1993; Xue and Xia, 2000; Levy and Manning, 2003). These differences disappear in the dependency representation. In general, the number of levels of constituent structure in a tree can be relatively arbitrary, while it is easier for people (whether professional syntacticians or not) to agree on the word-to-word dependencies.

In some cases, differences in the number of levels may be handled by the tree-to-tree model, for example by grouping the subject NP and its base NP child together as a single elementary tree. However, this introduces unnecessary variability into the alignment process. In cases with large differences in the depths of the two trees, the aligner may not be able to align the corresponding terminal nodes because only one merge is possible at each level. In this case the aligner will clone the subtree, at an even greater cost in probability.

The rest of our example sentence, however, shows cases where the conversion to dependency structure can in some cases exacerbate differences in constituent structure. For example, jingji and jianshe are sisters in the original constituent structure, as are their English translations, economic and construction. In the conversion to Chinese dependency structure, they remain sisters, both dependent on the noun chengjiu (achievements), while in English, economic is a child of construction. The correspondence of a three-noun compound in Chinese to a noun modified by a prepositional phrase and an adjective-noun relation in English means that the conversion rules select different heads even for pieces of tree that are locally similar.

3.1 The Dependency Alignment Model

While the basic tree-to-tree alignment algorithm is the same for dependency trees, a few modifications to the probability model are necessary.

First, the lexical translation operation takes place at each node in the tree, rather than only at the leaves. Lexical translation probabilities are maintained for each word pair as before, and the lexical translation probabilities are included in the alignment cost for each elementary tree. When both elementary trees contain two words, either alignment is possible between the two. The direct alignment between nodes within the elementary tree has probability 1 − P_swap. A new parameter P_swap gives the probability of the upper node in the elementary tree in English corresponding to the lower node in Chinese, and vice versa. Thus, the probability for the following transformation:

      A                B′
      |                |
      B       ⇒       A′
     / \              / \
    X   Y            X   Y

is factored as P_elem(AB|A) P_swap P_t(A′|A) P_t(B′|B) P_align({(1,1),(2,2)} | A ⇒ XY).

Our model does not represent the position of the head among its children. While this choice would have to be made in generating MT output, for the purposes of alignment we simply score how many tree nodes are correctly aligned, without flattening our trees into a string.

We further extended the tree-to-tree alignment algorithm by conditioning the reordering of a node's children on the node's lexical item as well as its syntactic category and the categories of its children. The lexicalized reordering probabilities were smoothed with the nonlexicalized probabilities (which are themselves smoothed with a uniform distribution). We smooth using a linear interpolation of lexicalized and unlexicalized probabilities, with weights proportional to the number of observations for each type of event.
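Concretely, a two-word elementary tree pair mixes the direct and the swapped analyses. The Python sketch below illustrates the lexical part of that factorization under assumed helpers p_t (a word-translation probability) and p_swap (the scalar parameter above); both names are ours. The two analyses are summed here, as in inside-probability training; a Viterbi aligner would take the max instead.

    def two_word_tree_score(src_upper, src_lower, tgt_upper, tgt_lower,
                            p_t, p_swap):
        # Direct analysis: upper node translates to upper, lower to lower.
        direct = (1.0 - p_swap) * p_t(tgt_upper, src_upper) * p_t(tgt_lower, src_lower)
        # Swapped analysis: the upper source node surfaces as the lower
        # target node and vice versa, as in the A/B' transformation above.
        swapped = p_swap * p_t(tgt_upper, src_lower) * p_t(tgt_lower, src_upper)
        return direct + swapped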
[Figure 1: A Xinhua newswire sentence with its Chinese parse tree and the parse tree of its English translation, together with the dependency trees derived from each; trees not reproduced here.]
[Table 2: Alignment results on the Chinese-English corpus. Higher precision and recall correspond to lower alignment error rate; table body not reproduced here.]
4 Experiments

We trained our translation models on a parallel corpus of Chinese-English newswire text. We restricted ourselves to sentences of no more than 25 words in either language, resulting in a training corpus of 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words. The Chinese data were automatically segmented into tokens, and English capitalization was retained. We replaced words occurring only once with an unknown word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words. The Chinese data were parsed using the parser of Bikel (2002), and the English data were parsed using Collins (1999). Our hand-aligned test data were those used in Hwa et al. (2002), and consisted of 48 sentence pairs, also with fewer than 25 words in either language, for a total of 788 English words and 580 Chinese words. The hand-aligned data consisted of 745 individual aligned word pairs. Words could be aligned one-to-many in either direction. This limits the performance achievable by our models; the IBM models allow one-to-many alignments in one direction only, while the tree-based models allow only one-to-one alignment unless the cloning operation is used. A separate set of 49 hand-aligned sentence pairs was used to control overfitting in training our models.

We evaluate our translation models in terms of agreement with human-annotated word-level alignments between the sentence pairs. For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words:¹

    AER = 1 − 2|A ∩ G| / (|A| + |G|)

where A is the set of word pairs aligned by the automatic system, and G the set aligned in the gold standard. For a better understanding of how the models differ, we break this figure down into precision:

    P = |A ∩ G| / |A|

and recall:

    R = |A ∩ G| / |G|

¹ While Och and Ney (2000) differentiate between sure and possible hand-annotated alignments, our gold standard alignments come in only one variety.

Since none of the systems presented in this comparison make use of hand-aligned data, they may differ in the overall proportion of words that are aligned, rather than inserted or deleted. This affects the precision/recall tradeoff; better results with respect to human alignments may be possible by adjusting an overall insertion probability in order to optimize AER.

Table 2 provides a comparison of results using the tree-based models with the word-level IBM models. IBM Models 1 and 4 refer to Brown et al. (1993). We used the GIZA++ package, including the HMM model of Och and Ney (2000). We trained each model until AER began to increase on our held-out cross-validation data, resulting in running Model 1 for three iterations, then the HMM model for three iterations, and finally Model 4 for two iterations (the optimal number of iterations for Models 2 and 3 was zero). "Constituent Tree-to-Tree" indicates the model of Section 2 trained and tested directly on the trees output by the parser, while "Dependency Tree-to-Tree" makes the modifications to the model described in Section 3. For reasons of computational efficiency, our constituent-based training procedure skipped sentences for which either tree had a node with more than five children, and the dependency-based training skipped trees with more than six children. Thus, the tree-based models were effectively trained on less data than IBM Model 4: 11,422 out of 18,773 sentence pairs for the constituent model and 10,662 sentence pairs for the dependency model. Our tree-based models were initialized with lexical translation probabilities trained using IBM Model 1, and uniform probabilities for the tree reordering operations. The models were trained until AER began to rise on our held-out cross-validation data, though in practice AER was nearly constant for both tree-based models after the first iteration.
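With a single gold annotation type, the three evaluation figures above reduce to set arithmetic over aligned word pairs, as in this illustrative sketch (the names are ours):

    def alignment_scores(auto_pairs, gold_pairs):
        # auto_pairs, gold_pairs: sets of (source_index, target_index) links
        overlap = len(auto_pairs & gold_pairs)
        precision = overlap / len(auto_pairs)
        recall = overlap / len(gold_pairs)
        aer = 1.0 - 2.0 * overlap / (len(auto_pairs) + len(gold_pairs))
        return precision, recall, aer

    # e.g. alignment_scores({(0, 0), (1, 2)}, {(0, 0), (1, 1)}) == (0.5, 0.5, 0.5)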
5 Discussion

The constituent-based version of the alignment model significantly outperforms the dependency-based model. The IBM models outperform the constituent tree-to-tree model to a lesser degree, with tree-to-tree achieving higher recall, and IBM higher precision. It is particularly significant that the tree-based model gets higher recall than the other models, since it is limited to one-to-one alignments unless the clone operation is used, bounding the recall it can achieve.

In order to better understand the differences between the constituent and dependency representations of our data, we analyzed how well the two representations match our hand-annotated alignment data. We looked for consistently aligned pairs of constituents in the two parse trees. By consistently aligned, we mean that all words within the English constituent are aligned to words inside the Chinese constituent (if they are aligned to anything), and vice versa. In our example in Figure 1, the NP "14 Chinese border cities" and the Chinese subject NP "Zhongguo shisi ge bianjing kaifang chengshi" are consistently aligned, but the PP "in economic construction" has no consistently aligned constituent in the Chinese sentence. We found that of the 2623 constituents in our English parse trees (not counting unary constituents, which have the same boundaries as their children), 1044, or 40%, have some constituent in the Chinese parse tree that is consistently aligned. This confirms the results of Fox (2002) and Galley et al. (2004) that many translation operations must span more than one parse tree node. For each of our consistently aligned pairs, we then found the head word of both the Chinese and English constituents according to our head rules. The two head words correspond in the annotated alignments 67% of the time (700 out of 1044 consistently aligned constituent pairs). While the head-swapping operation of our translation model will be able to handle some cases of differing heads, it can only do so if the two heads are adjacent in both tree structures.
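The consistency test itself is mechanical; a minimal sketch of one possible rendering, with the gold alignments given as a set of (English index, Chinese index) links:

    def consistently_aligned(english_span, chinese_span, links):
        # english_span, chinese_span: sets of word indices covered by the
        # two constituents.  Consistent means no link crosses a constituent
        # boundary; words aligned to nothing are allowed on either side.
        for e, c in links:
            if (e in english_span) != (c in chinese_span):
                return False
        return True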
Our system is trained and tested on automatically generated parse trees, which may contribute to the mismatches in the tree structures. As our test data were taken from the Chinese Treebank, hand-annotated parse trees were available for the Chinese, but not the English, sentences. Running the analysis on hand-annotated Chinese trees found slightly better English/Chinese agreement overall, but there were still disagreements in the head word choices for a third of all consistently aligned constituent pairs. Running our alignment system on gold-standard trees did not improve results. The comparison between parser output and gold-standard trees is summarized in Table 3.

Table 3: Comparison of automatically generated (parser output) and gold-standard (Treebank) Chinese parse trees.

                                                              Chinese Parse Trees
                                                              Automatic   Treebank
Proportion of English constituents with a consistently
    aligned Chinese constituent                                  .40         .42
Proportion of the above with head words aligned                  .67         .66
Constituent-based AER                                            .50         .51
Dependency-based AER                                             .60         .62

We used head rules developed for statistical parsers in both languages, but other rules may be better suited to the alignment task. For example, the tensed auxiliary verb is considered the head of English progressive and perfect verb phrases, rather than the present or past participle of the main verb. Such auxiliaries carry agreement information relevant to parsing, but generally have no counterpart in Chinese. A semantically oriented dependency structure, such as Tree Adjoining Grammar derivation trees, may be more appropriate for alignment.

6 Conclusion

We present a comparison of constituent and dependency models for tree-to-tree alignment. Despite equalizing some mismatches in tree structure, the dependency representation does not perform as well, likely because it is less robust to large differences between the tree structures.

Acknowledgments We are very grateful to Rebecca Hwa, Hao Zhang, everyone at the 2003 Johns Hopkins speech and language summer research workshop, and EMNLP's reviewers for their assistance, criticism, and data. This work was partially supported by NSF ITR IIS-09325646, NSF research infrastructure grant EIA-0080124, and NSF grant 0121285 to the summer workshop.
References

Daniel M. Bikel. 2002. Design of a multi-lingual, parallel-processing statistical parsing engine. In Proceedings of the ARPA Workshop on Human Language Technology.

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Michael John Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia.

Yuan Ding and Martha Palmer. 2004. Automatic learning of parallel dependency treelet pairs. In The First International Joint Conference on Natural Language Processing (IJCNLP).

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, companion volume, Sapporo, Japan.

Heidi J. Fox. 2002. Phrasal cohesion and statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 304–311.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Daniel Gildea. 2003. Loosely tree-based alignment for machine translation. In Proceedings of the 41st Annual Conference of the Association for Computational Linguistics (ACL-03), pages 80–87, Sapporo, Japan.

Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluating translational correspondence using annotation projection. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL-02).

Roger Levy and Christopher Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? In Proceedings of the 41st Annual Conference of the Association for Computational Linguistics (ACL-03), Sapporo, Japan.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, June.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Conference of the Association for Computational Linguistics (ACL-00), pages 440–447, Hong Kong, October.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Nianwen Xue and Fei Xia. 2000. The bracketing guidelines for the Penn Chinese Treebank. Technical Report IRCS-00-08, IRCS, University of Pennsylvania.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Conference of the Association for Computational Linguistics (ACL-01), Toulouse, France.