Word Alignment Modeling With Context Dependent Deep Neural Network
Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics
Word Alignment Modeling with Context Dependent Deep Neural Network
Nan Yang (1), Shujie Liu (2), Mu Li (2), Ming Zhou (2), Nenghai Yu (1)
(1) University of Science and Technology of China, Hefei, China
(2) Microsoft Research Asia, Beijing, China
{v-nayang,shujliu,muli,mingzhou}@microsoft.com
ynh@ustc.edu.cn
Abstract
In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable of modeling the rich bilingual correspondence, our method generates a very compact model with far fewer parameters. Experiments on a large-scale English-Chinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
1 Introduction
In recent years, research communities have seen a strong resurgence of interest in modeling with deep (multi-layer) neural networks. This trending topic, usually referred to as Deep Learning, was started by ground-breaking papers such as (Hinton et al., 2006), in which innovative training procedures for deep structures are proposed. Unlike shallow learning methods, such as Support Vector Machines, Conditional Random Fields, and Maximum Entropy, which need hand-crafted features as input, DNN can learn suitable features (representations) automatically from raw input data, given a training objective.
DNN did not achieve the expected success until 2006, when researchers discovered a proper way to initialize and train deep architectures, which contains two phases: layer-wise unsupervised pre-training and supervised fine-tuning. For pre-training, Restricted Boltzmann Machine (RBM) (Hinton et al., 2006), auto-encoding (Bengio et al., 2007) and sparse coding (Lee et al., 2007) have been proposed and are widely used. Unsupervised pre-training trains the network one layer at a time, and helps to guide the parameters of each layer towards better regions in parameter space (Bengio, 2009). Followed by fine-tuning in this region, DNN has been shown to match or even exceed state-of-the-art performance in various areas (Dahl et al., 2012; Kavukcuoglu et al., 2010). DNN also achieved breakthrough results on the ImageNet dataset for object recognition (Krizhevsky et al., 2012). For speech recognition, (Dahl et al., 2012) proposed a context-dependent neural network with large vocabulary, which achieved a 16.0% relative error reduction.
DNN has also been applied in the Natural Language Processing (NLP) field. Most works convert atomic lexical entries into a dense, low-dimensional, real-valued representation, called word embedding; each dimension represents a latent aspect of a word, capturing its semantic and syntactic properties (Bengio et al., 2006). Word embedding is usually first learned from a huge amount of monolingual text, and then fine-tuned with task-specific objectives. (Collobert et al., 2011) and (Socher et al., 2011) further apply Recursive Neural Networks to address structural prediction tasks such as tagging and parsing, and (Socher et al., 2012) explores the compositional aspect of word representations.
Inspired by successful previous works, we propose a new DNN-based word alignment method, which exploits contextual and semantic similarities between words. As shown in example (a) of Figure 1, in word pair {"juda" ⇒ "mammoth"}, the Chinese word "juda" is a common word, but the English word "mammoth" is not, so it is very hard to align them correctly. If we know that "mammoth" has a similar meaning to "big" or "huge", it would be easier to find the corresponding word in the Chinese sentence. As mentioned in the previous paragraph, word embedding (trained with huge monolingual texts) has the ability to map a word into a vector space in which similar words are near each other.

Figure 1: Two examples of word alignment. (a) English "will be a mammoth job", Chinese "jiang shi yixiang juda gongcheng"; (b) English "A : farmer Yibula said", Chinese "nongmin yibula shuo :".
For example (b) in Figure 1, for the word pair {"yibula" ⇒ "Yibula"}, both the Chinese word "yibula" and the English word "Yibula" are rare named entities, but the words around them are very common: {"nongmin", "shuo"} on the Chinese side and {"farmer", "said"} on the English side. The context pattern {"nongmin X shuo" ⇒ "farmer X said"} may help to align the word pair that fills the variable X; likewise, the pattern {"yixiang X gongcheng" ⇒ "a X job"} helps to align the word pair {"juda" ⇒ "mammoth"} in example (a).
Based on the above analysis, in this paper, words on both the source and target sides are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multi-layer neural network which takes rich contexts (surrounding words on both source and target sides) into consideration; an HMM-like distortion model is applied on top of the neural network to characterize the structural aspect of bilingual sentences.
In the rest of this paper, related work on DNN and word alignment is first reviewed in Section 2, followed by a brief introduction to DNN in Section 3. We then introduce the details of leveraging DNN for word alignment, including our network structure in Section 4 and the training method in Section 5. The merits of our approach are illustrated with the experiments described in Section 6, and we conclude our paper in Section 7.
2 Related Work
DNN with unsupervised pre-training was first introduced by (Hinton et al., 2006) for the MNIST digit image classification problem, in which RBM was introduced as the layer-wise pre-trainer. The layer-wise pre-training phase found a better local maximum for the multi-layer network, and thus led to improved performance. (Krizhevsky et al., 2012) proposed to apply DNN to the object recognition task (ImageNet dataset), which brought down the state-of-the-art error rate from 26.1% to 15.3%. (Seide et al., 2011) and (Dahl et al., 2012) apply Context-Dependent Deep Neural Network with HMM (CD-DNN-HMM) to the speech recognition task, which significantly outperforms traditional models.
Most methods using DNN in NLP start with a word embedding phase, which maps words into fixed-length, real-valued vectors. (Bengio et al., 2006) proposed to use a multi-layer neural network for the language modeling task. (Collobert et al., 2011) applied DNN to several NLP tasks, such as part-of-speech tagging, chunking, named entity recognition, semantic labeling and syntactic parsing, where they got similar or even better results than the state of the art. (Niehues and Waibel, 2012) shows that machine translation results can be improved by combining a neural language model with a traditional n-gram language model. (Son et al., 2012) improves the translation quality of an n-gram translation model by using a bilingual neural language model. (Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
As for related work on word alignment, the most popular methods are based on generative models such as the IBM Models (Brown et al., 1993) and HMM (Vogel et al., 1996). Discriminative approaches have also been proposed, which use hand-crafted features to improve word alignment. Among them, (Liu et al., 2010) proposed to use phrase and rule pairs to model context information in a log-linear framework. Unlike previous discriminative methods, in this work we do not resort to any hand-crafted features, but use DNN to induce features from raw words.
3 DNN structures for NLP
The most important and prevalent features available in NLP are the words themselves. To apply DNN to an NLP task, the first step is to transform a discrete word into its word embedding, a low-dimensional, dense, real-valued vector (Bengio et al., 2006). Word embeddings often implicitly encode syntactic or semantic knowledge of the words. Assuming a finite-sized vocabulary $V$, word embeddings form an $(L \times |V|)$-dimension embedding matrix $W_V$, where $L$ is a pre-determined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix $W_V$. The lookup process is called a lookup layer $LT$, which is usually the first layer after the input layer in a neural network.
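As a minimal sketch of the lookup layer described above, the following Python stores one embedding per word (the role of one column of $W_V$) and concatenates the embeddings of a word sequence. The function names, the `<unk>` spelling, the toy vocabulary and the embedding length of 4 are illustrative assumptions, not taken from the paper.

```python
import random

def make_embedding_table(vocab, length, seed=0):
    """Return {word: vector}; each vector plays the role of one column of W_V."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(length)] for w in vocab}

def lookup(table, words, unk="<unk>"):
    """Lookup layer LT: map each word to its embedding and concatenate the results."""
    out = []
    for w in words:
        # Words outside the vocabulary fall back to the unknown-word embedding.
        out.extend(table.get(w, table[unk]))
    return out

vocab = ["<unk>", "farmer", "said"]          # toy vocabulary (hypothetical)
table = make_embedding_table(vocab, length=4)
z0 = lookup(table, ["farmer", "yibula", "said"])
```

Here `z0` has 3 x 4 = 12 entries, and the middle four are the `<unk>` embedding because "yibula" is not in the toy vocabulary.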
After words have been transformed to their embeddings, they can be fed into subsequent classical network layers to model highly non-linear relations:

$$z^l = f_l(M^l z^{l-1} + b^l) \qquad (1)$$

where $z^l$ is the output of the $l$-th layer, $M^l$ is a $|z^l| \times |z^{l-1}|$ matrix, $b^l$ is a $|z^l|$-length vector, and $f_l$ is an activation function. Except for the last layer, $f_l$ must be non-linear. Common choices for $f_l$ include the sigmoid function, the hyperbolic tangent function, and the hard hyperbolic tangent function. Following (Collobert et al., 2011), we choose the hard hyperbolic tangent function as our activation function in this work:

$$\mathrm{htanh}(x) = \begin{cases} 1 & \text{if } x > 1 \\ -1 & \text{if } x < -1 \\ x & \text{otherwise} \end{cases} \qquad (2)$$
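The layer of Eq. 1 with the htanh activation of Eq. 2 can be sketched in a few lines of plain Python. The matrix and bias values below are made-up toy numbers chosen only to exercise all three branches of htanh.

```python
def htanh(x):
    """Hard hyperbolic tangent (Eq. 2): clip x to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))

def layer(M, b, z_prev, activation=None):
    """One layer z = f(M z_prev + b) as in Eq. 1; activation=None means linear."""
    out = []
    for row, bias in zip(M, b):
        v = sum(m * z for m, z in zip(row, z_prev)) + bias
        out.append(activation(v) if activation else v)
    return out

M = [[2.0, 0.0], [0.0, -3.0], [0.25, 0.25]]  # toy 3 x 2 matrix
b = [0.0, 0.0, 0.1]
z1 = layer(M, b, [1.0, 1.0], activation=htanh)
```

The three pre-activation values are 2.0, -3.0 and 0.6, so the first two are clipped to 1 and -1 while the third passes through unchanged.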
If a probabilistic interpretation is desired, a softmax layer (Bridle, 1990) can be used to do normalization:

$$z^l_i = \frac{e^{z^{l-1}_i}}{\sum_{j=1}^{|z^l|} e^{z^{l-1}_j}} \qquad (3)$$
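A small Python sketch of the softmax normalization in Eq. 3 follows. Subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the paper's formula; it leaves the result unchanged.

```python
import math

def softmax(z):
    """Softmax of Eq. 3: exponentiate each entry and normalize to sum to 1."""
    m = max(z)                                # stability shift (an added detail)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
```

The cost of this layer grows with the length of its input, which is why normalizing over a whole vocabulary is expensive.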
The above layers can only handle fixed-size input and output. If the input must be of variable length, a convolution layer and a max layer can be used (Collobert et al., 2011), which transform variable-length input into a fixed-length vector for further processing.
Multi-layer neural networks are trained with the standard back-propagation algorithm (LeCun, 1985). As the networks are non-linear and the task-specific objectives usually contain many local maxima, special care must be taken in the optimization process to obtain good parameters. Techniques such as layer-wise pre-training (Bengio et al., 2007) and many tricks (LeCun et al., 1998) have been developed to train better neural networks. Besides that, neural network training also involves some hyperparameters, such as the learning rate and the number of hidden layers. We will address these issues in Section 4.
4 DNN for word alignment
Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al., 1996). Given a sentence pair $(e, f)$, HMM word alignment takes the following form:

$$P(a, e \mid f) = \prod_{i=1}^{|e|} P_{lex}(e_i \mid f_{a_i}) \, P_d(a_i - a_{i-1}) \qquad (4)$$

where $P_{lex}$ is the lexical translation probability and $P_d$ is the jump distance distortion probability.
One straightforward way to integrate DNN into HMM is to use a neural network to compute the emission (lexical translation) probability $P_{lex}$. Such an approach requires a softmax layer in the neural network to normalize over all words in the source vocabulary. As the vocabulary of a natural language is usually very large, it is prohibitively expensive to do the normalization. Hence we give up the probabilistic interpretation and resort to a non-probabilistic, discriminative view:

$$s_{NN}(a \mid e, f) = \prod_{i=1}^{|e|} t_{lex}(e_i, f_{a_i} \mid e, f) \, t_d(a_i, a_{i-1} \mid e, f) \qquad (5)$$

where $t_{lex}$ is a lexical translation score computed by the neural network, and $t_d$ is a distortion score.
In the classic HMM word alignment model, context is not considered in the lexical translation probability. Although we can rewrite $P_{lex}(e_i \mid f_{a_i})$ as $P_{lex}(e_i \mid \text{context of } f_{a_i})$ to model context, this introduces too many additional parameters and leads to a serious over-fitting problem due to data sparseness. As a matter of fact, even without any context, the lexical translation table in HMM already contains $O(|V_e| \times |V_f|)$ parameters, where $|V_e|$ and $|V_f|$ denote the source and target vocabulary sizes. In contrast, our model does not maintain separate translation score parameters for every source-target word pair, but computes $t_{lex}$ through a multi-layer network, which naturally handles contexts on both sides without explosive growth in the number of parameters.
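[Figure 2 diagram: the input is a source window around $e_i$ (positions $i-1$, $i$, $i+1$; example words "farmer yibula said") and a target window around $f_j$ (positions $j-1$, $j$, $j+1$). The lookup layer $LT$ produces $z^0$; layer $f_1$ computes $z^1 = \mathrm{htanh}(M^1 z^0 + b^1)$; layer $f_2$ computes $z^2 = \mathrm{htanh}(M^2 z^1 + b^2)$; layer $f_3$ computes $M^3 z^2 + b^3$, which is the output $t_{lex}(e_i, f_j \mid e, f)$.]

Figure 2: Network structure for computing context dependent lexical translation scores. The example computes translation score for word pair (yibula, yibulayin) given its surrounding context.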
Figure 2 shows the neural network we used to compute the context dependent lexical translation score $t_{lex}$. For word pair $(e_i, f_j)$, we take fixed-length windows surrounding both $e_i$ and $f_j$ as input: $(e_{i-\frac{sw}{2}}, \ldots, e_{i+\frac{sw}{2}}, f_{j-\frac{tw}{2}}, \ldots, f_{j+\frac{tw}{2}})$, where $sw$ and $tw$ stand for the window sizes on the source and target side respectively. Words are converted to embeddings using the lookup table $LT$, the concatenation of the embeddings is fed to a classic neural network with two hidden layers, and the output of the network is our lexical translation score:

$$t_{lex}(e_i, f_j \mid e, f) = f_3 \circ f_2 \circ f_1 \circ LT(\mathrm{window}(e_i), \mathrm{window}(f_j)) \qquad (6)$$

The $f_1$ and $f_2$ layers use htanh as activation functions, while $f_3$ is only a linear transformation with no activation function.
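The composition in Eq. 6 can be sketched as a forward pass in plain Python. The layer sizes use the values reported in Section 5 (embedding length 20, windows of 11 words per side, hidden layers of 120 and 10 units); the random input vector stands in for the concatenated window embeddings, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import random

def htanh(x):
    """Hard hyperbolic tangent: clip to [-1, 1]."""
    return max(-1.0, min(1.0, x))

def affine(M, b, z):
    """Linear transformation M z + b."""
    return [sum(m * v for m, v in zip(row, z)) + bias for row, bias in zip(M, b)]

def init_layer(n_out, n_in, rng):
    """Random weights and biases drawn uniformly from [-0.1, 0.1]."""
    M = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(n_out)]
    return M, b

def t_lex(z0, params):
    """Eq. 6 after the lookup: two htanh layers, then a linear output layer."""
    (M1, b1), (M2, b2), (M3, b3) = params
    z1 = [htanh(v) for v in affine(M1, b1, z0)]
    z2 = [htanh(v) for v in affine(M2, b2, z1)]
    return affine(M3, b3, z2)[0]              # single scalar score

rng = random.Random(0)
L, sw, tw = 20, 11, 11
sizes = [L * (sw + tw), 120, 10, 1]           # 440 -> 120 -> 10 -> 1
params = [init_layer(o, i, rng) for i, o in zip(sizes, sizes[1:])]
z0 = [rng.uniform(-0.1, 0.1) for _ in range(sizes[0])]  # stand-in for LT output
score = t_lex(z0, params)
```

Because the last layer is linear and produces one number, the score is unbounded and carries no probabilistic interpretation.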
For the distortion $t_d$, we could use a lexicalized distortion model:

$$t_d(a_i, a_{i-1} \mid e, f) = t_d(a_i - a_{i-1} \mid \mathrm{window}(f_{a_{i-1}})) \qquad (7)$$

which can be computed by a neural network similar to the one used to compute lexical translation scores. If we map the jump distance $(a_i - a_{i-1})$ to $B$ buckets, we can change the length of the output layer to $B$, where each dimension in the output stands for a different bucket of jump distances. However, we found in our initial experiments on small-scale data that lexicalized distortion does not produce better alignment than the simple jump-distance based model. So we drop the lexicalized distortion and revert to the simple version:

$$t_d(a_i, a_{i-1} \mid e, f) = t_d(a_i - a_{i-1}) \qquad (8)$$
The vocabulary $V$ of our alignment model consists of a source vocabulary $V_e$ and a target vocabulary $V_f$. As in (Collobert et al., 2011), in addition to real words, each vocabulary contains a special unknown word symbol <unk> to handle unseen words, and two sentence boundary symbols <s> and </s>, which are filled into the surrounding window when necessary; furthermore, to handle null alignment, we must also include a special null symbol <null>. When $f_j$ is the null word, we simply fill the surrounding window with identical null symbols.
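The window construction with special symbols might look like the following sketch. The spellings `<unk>`, `<s>`, `</s>` and `<null>`, the function name `window`, and the assumption of an odd window size centered on the word are illustrative choices.

```python
def window(sentence, index, size, vocab, is_null=False):
    """Return the `size` tokens centered on sentence[index], padded as needed."""
    if is_null:
        # A null word has no position, so its window is filled with null symbols.
        return ["<null>"] * size
    half = size // 2
    out = []
    for k in range(index - half, index + half + 1):
        if k < 0:
            out.append("<s>")                 # before the sentence start
        elif k >= len(sentence):
            out.append("</s>")                # past the sentence end
        else:
            w = sentence[k]
            out.append(w if w in vocab else "<unk>")
    return out

sentence = ["farmer", "yibula", "said"]
vocab = {"farmer", "said"}                    # toy vocabulary (hypothetical)
w = window(sentence, 1, 5, vocab)
null_w = window(sentence, 0, 5, vocab, is_null=True)
```

With this toy input, `w` is `["<s>", "farmer", "<unk>", "said", "</s>"]`.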
To decode our model, the lexical translation scores are computed for each source-target word pair in the sentence pair, which requires going through the neural network $(|e| \times |f|)$ times; after that, the forward-backward algorithm can be used to find the Viterbi path as in the classic HMM model.
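To illustrate the search step, here is a sketch of Viterbi-style dynamic programming over a precomputed score matrix. It combines scores by addition, which matches Eq. 5 only if the $t$ values are read as log-domain scores; that reading, the omission of the null word, and the absence of a distortion term for the first word are all simplifying assumptions of this sketch.

```python
def viterbi(lex, dist):
    """Find the best alignment path.

    lex[i][j] is the lexical score for aligning word e_i to position j of f;
    dist(jump) is the distortion score for jump = a_i - a_{i-1}.
    Returns a list a with a[i] = aligned position of e_i.
    """
    n, m = len(lex), len(lex[0])
    best = [lex[0][:]]                        # best[i][j]: best score ending at (i, j)
    back = []                                 # back-pointers for i = 1..n-1
    for i in range(1, n):
        row, ptr = [], []
        for j in range(m):
            k = max(range(m), key=lambda p: best[-1][p] + dist(j - p))
            row.append(best[-1][k] + dist(j - k) + lex[i][j])
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    j = max(range(m), key=lambda p: best[-1][p])
    path = [j]
    for ptr in reversed(back):                # follow back-pointers to the start
        j = ptr[j]
        path.append(j)
    return path[::-1]

lex = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # toy scores
path = viterbi(lex, lambda jump: 0.0 if jump == 1 else -1.0)
```

The search costs O(|e| x |f|^2) once the score matrix is available, so the |e| x |f| network evaluations typically dominate.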
The majority of tunable parameters in our model reside in the lookup table $LT$, which is an $(L \times (|V_e| + |V_f|))$-dimension matrix. For a reasonably large vocabulary, this number is much smaller than the number of parameters in the classic HMM model, which is on the order of $(|V_e| \times |V_f|)$.¹
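As a quick worked check of the two parameter counts, the snippet below plugs in the settings reported in Section 5.1 (embedding length 20 and 100,000 words per vocabulary). The variable names are illustrative.

```python
L = 20                    # embedding length
V_e = V_f = 100_000       # vocabulary size on each side

lookup_params = L * (V_e + V_f)   # size of the lookup table LT
dense_table = V_e * V_f           # upper bound for a full lexical translation table
```

This gives 4,000,000 lookup-table parameters against an upper bound of 10,000,000,000 table entries; the dense figure is only a bound, since most word pairs never co-occur.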
The ability to model context is not unique to our model. In fact, discriminative word alignment can model contexts by deploying arbitrary features (Moore, 2005). Unlike previous discriminative word alignment, our model does not use manually engineered features, but learns features automatically from raw words by the neural network. (Berger et al., 1996) use a maximum entropy model to model the bag-of-words context for word alignment, but their model treats each word as a distinct feature, and so cannot leverage the similarity between words as our model does.
5 Training
Although unsupervised training techniques such as Contrastive Estimation (Smith and Eisner, 2005; Dyer et al., 2011) can be adapted to train our model from raw sentence pairs, they are too computationally demanding, as the lexical translation probabilities must be computed from neural networks. Hence, we opt for a simpler supervised approach, which learns the model from sentence pairs with word alignment. As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as HMM and IBM model 4 to generate word alignment on a large parallel corpus. We obtain bi-directional alignment by running the usual grow-diag-final heuristics (Koehn et al., 2003) on uni-directional results from both directions, and use the results as our training data. A similar approach has been taken in the speech recognition task (Dahl et al., 2012), where training data for the neural network model is generated by forced decoding with traditional Gaussian mixture models.

¹ In practice, the number of non-zero parameters in the classic HMM model would be much smaller, as many words do not co-occur in bilingual sentence pairs. In our experiments, the number of non-zero parameters in the classic HMM model is about 328 million, while the NN model only has about 4 million.
Tunable parameters in the neural network alignment model include: word embeddings in the lookup table $LT$, parameters $W^l$, $b^l$ for linear transformations in the hidden layers of the neural network (the weight matrices are written $M^l$ in Eq. 1), and distortion parameters $s_d$ of jump distance. We take the following ranking loss with margin as our training criterion:

$$loss(\theta) = \sum_{\text{every } (e,f)} \max\{0,\; 1 - s_\theta(a^+ \mid e, f) + s_\theta(a^- \mid e, f)\} \qquad (9)$$

where $\theta$ denotes all tunable parameters, $a^+$ is the gold alignment path, $a^-$ is the highest-scoring incorrect alignment path under $\theta$, and $s_\theta$ is the model score for an alignment path defined in Eq. 5. One nuance here is that the gold alignment after grow-diag-final contains many-to-many links, which cannot be generated by any path. Our solution is that, for each source word aligned to multiple target words, we randomly choose one link among all candidates as the golden link.
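The margin loss of Eq. 9 for one sentence pair, and the random choice of a single gold link per source word, can be sketched as follows. The function names and the dictionary layout for candidate links are hypothetical; the path scores are passed in as plain numbers.

```python
import random

def hinge(gold_score, wrong_score, margin=1.0):
    """One term of Eq. 9: zero once the gold path wins by at least the margin."""
    return max(0.0, margin - gold_score + wrong_score)

def pick_gold_links(candidates, rng):
    """candidates maps a source index to its aligned target indices;
    keep one randomly chosen target per source word."""
    return {i: rng.choice(sorted(js)) for i, js in candidates.items()}

rng = random.Random(0)
candidates = {0: {0}, 1: {1, 2}, 2: {3}}      # toy many-to-many gold links
gold = pick_gold_links(candidates, rng)

no_loss = hinge(3.0, 1.0)                      # gold wins by 2, no loss
some_loss = hinge(1.0, 1.5)                    # gold loses, loss is 1.5
```

Only sentence pairs whose gold path fails to beat the competing path by the margin contribute to the gradient.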
Because our multi-layer neural network is inherently non-linear and non-convex, directly training against the above criterion is unlikely to yield good results. Instead, we take the following steps to train our model.
5.1 Pre-training initial word embedding with monolingual data
Most parameters reside in the word embeddings. To get a good initial value, the usual approach is to pre-train the embeddings on a large monolingual corpus. We replicate the work in (Collobert et al., 2011) and train word embeddings for the source and target languages from their respective monolingual corpora. Our vocabularies $V_e$ and $V_f$ contain the most frequent 100,000 words from each side of the parallel corpus, and all other words are treated as unknown words. We set the word embedding length to 20, the window size to 5, and the length of the only hidden layer to 40. Following (Turian et al., 2010), we randomly initialize all parameters in [-0.1, 0.1], and use stochastic gradient descent to minimize the ranking loss with a fixed learning rate of 0.01. Note that the embedding for the null word in either $V_e$ or $V_f$ cannot be trained from a monolingual corpus, so we simply leave it at its initial value.
Word embeddings from a monolingual corpus learn strong syntactic knowledge of each word, which is not always desirable for word alignment between some language pairs like English and Chinese. For example, many Chinese words can act as a verb, noun and adjective without any change, while their English counterparts are distinct words with quite different word embeddings due to their different syntactic roles. Thus we have to modify the word embeddings in subsequent steps according to bilingual data.
5.2 Training neural network based on local criteria
Training the network against the sentence-level criterion (Eq. 5) directly is not efficient. Instead, we first ignore the distortion parameters and train neural networks for lexical translation scores against the following local pairwise loss:

$$\max\{0,\; 1 - t_\theta((e, f)^+ \mid e, f) + t_\theta((e, f)^- \mid e, f)\} \qquad (10)$$

where $(e, f)^+$ is a correct word pair, $(e, f)^-$ is a wrong word pair in the same sentence, and $t_\theta$ is as defined in Eq. 6. This training criterion essentially means our model suffers loss unless it gives correct word pairs a higher score than random pairs from the same sentence pair by some margin.
We initialize the lookup table with embeddings obtained from monolingual training, and randomly initialize all $W^l$ and $b^l$ in the linear layers in [-0.1, 0.1]. We minimize the loss using stochastic gradient descent as follows. We randomly cycle through all sentence pairs in the training data; for each correct word pair (including null alignment), we generate a positive example, and generate two negative examples by randomly corrupting either side of the pair with another word in the sentence pair. We set the learning rate to 0.01. As there is no clear stopping criterion, we simply run the stochastic optimizer through the parallel corpus for N iterations. In this work, N is set to 50.
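The example generation described above could be sketched like this. The sketch reads "two negative examples" as one corruption of the source side and one of the target side, sums the loss of Eq. 10 over both, and does not check whether a corrupted pair happens to be a correct link; all three points are assumptions, as are the function names and the toy scoring function.

```python
import random

def make_examples(src_len, tgt_len, i, j, rng):
    """Return the positive pair (i, j) and two corrupted pairs."""
    other_src = [k for k in range(src_len) if k != i]
    other_tgt = [k for k in range(tgt_len) if k != j]
    negatives = [(rng.choice(other_src), j),   # corrupt the source side
                 (i, rng.choice(other_tgt))]   # corrupt the target side
    return (i, j), negatives

def pairwise_loss(score, positive, negatives, margin=1.0):
    """Sum of Eq. 10 over the negatives; score(i, j) stands in for t_theta."""
    pos = score(*positive)
    return sum(max(0.0, margin - pos + score(*neg)) for neg in negatives)

rng = random.Random(0)
positive, negatives = make_examples(3, 3, 1, 1, rng)
toy_score = lambda i, j: 2.0 if i == j else 0.0  # hypothetical scorer
loss = pairwise_loss(toy_score, positive, negatives)
```

In real training, `score` would be the network of Eq. 6 and the loss would be back-propagated into the layer weights and the embeddings.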
To make our model concrete, there are still hyper-parameters to be determined: the window sizes $sw$ and $tw$, and the length $L_l$ of each hidden layer. We empirically set $sw$ and $tw$ to 11, $L_1$ to 120, and $L_2$ to 10, which achieved minimal loss on a small held-out data set among the several settings we tested.
5.3 Training distortion parameters
We fix the neural network parameters obtained from the last step, and tune the distortion parameters $s_d$ with respect to the sentence-level loss using standard stochastic gradient descent. We use a separate parameter for each jump distance from -7 to 7, and another two parameters for longer forward/backward jumps. We initialize all parameters in $s_d$ to 0, and set the learning rate for the stochastic optimizer to 0.001. As there are only 17 parameters in $s_d$, we only need to run the optimizer over a small portion of the parallel corpus.
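One way to index the 17 distortion parameters is sketched below: 15 slots for jumps from -7 to 7 plus one slot each for longer backward and longer forward jumps. The particular index layout and the function name are illustrative assumptions.

```python
def jump_bucket(jump, max_jump=7):
    """Map a jump distance a_i - a_{i-1} to an index into s_d."""
    if jump < -max_jump:
        return 0                               # any longer backward jump
    if jump > max_jump:
        return 2 * max_jump + 2                # any longer forward jump
    return jump + max_jump + 1                 # -7..7 map to 1..15

s_d = [0.0] * 17                               # all parameters start at 0

def t_d(jump):
    """Simple jump-distance distortion score of Eq. 8."""
    return s_d[jump_bucket(jump)]
```

With so few parameters, each one is updated many times even on a small sample of sentence pairs.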
5.4 Tuning neural network based on sentence level criteria
Up to now, parameters in the lexical translation neural network have not been trained against the sentence-level criterion (Eq. 5). We could achieve this by re-using the same online training method used to train the distortion parameters, except that we now fix the distortion parameters and let the loss back-propagate through the neural networks. Sentence-level training does not bring larger context into the modeling of word translations; it only optimizes the parameters with respect to the sentence-level loss. This tuning is quite slow, and it did not improve alignment in an initial small-scale experiment, so we skip this step in all subsequent experiments in this work.
6 Experiments and Results
We conduct our experiment on the Chinese-to-English word alignment task. We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009), which contains 491 sentence pairs, as the test set. We adapt the segmentation on the Chinese side to fit our word segmentation standard.
6.1 Data
Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.

The monolingual corpora used to pre-train word embeddings are also crawled from the web, and amount to about 1.1 billion unique sentences for English and about 300 million unique sentences for Chinese. As pre-processing, we lowercase all English words and map all numbers to one special token; we also map all email addresses and URLs to another special token.
6.2 Settings
We use classic HMM and IBM model 4 as our baselines, both generated by Giza++ (Och and Ney, 2000). We train our proposed model from the results of classic HMM and IBM model 4 separately. Since classic HMM, IBM model 4 and our model are all uni-directional, we use the standard grow-diag-final heuristic to generate bi-directional results for all models.

Models are evaluated on the manually aligned test set using standard metrics: precision, recall and F1-score.
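For reference, the three metrics can be computed from sets of alignment links as in this sketch. It assumes a single set of gold links with no distinction between sure and possible links, which may differ from the exact protocol used for the test set; the link sets are toy data.

```python
def prf(predicted, gold):
    """Precision, recall and F1 over sets of (source, target) links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

predicted = {(0, 0), (1, 1), (2, 3)}           # toy system output
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}        # toy reference
p, r, f = prf(predicted, gold)
```

Here two of three predicted links are correct and two of four gold links are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.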
6.3 Alignment Result
As can be seen from Table 1, the proposed model consistently outperforms its corresponding baseline, whether it is trained from the alignment of classic HMM or IBM model 4. It is also clear that the results of our model depend on the quality of the baseline results, which are used as the training data of our model. In future we would like to explore whether our method can improve other word alignment models.

setting     prec.   recall  F-1
HMM         0.768   0.786   0.777
HMM+NN      0.810   0.790   0.798
IBM4        0.839   0.805   0.822
IBM4+NN     0.885   0.812   0.847

Table 1: Word alignment results. The first and third rows show baseline results obtained by the classic HMM and IBM4 models. The second and fourth rows show results of the proposed model trained from HMM and IBM4 respectively.
We also conduct an experiment to see the effect on end-to-end SMT performance. We train a hierarchical phrase model (Chiang, 2007) from different word alignments. Despite the different alignment scores, we do not obtain a significant difference in translation performance. In our C-E experiment, we tuned on NIST-03 and tested on NIST-08. Case-insensitive BLEU-4 scores on the NIST-08 test set are 0.305 and 0.307 for models trained from IBM-4 and NN alignment results respectively. The result is not surprising considering that our parallel corpus is quite large, and similar observations have been made in previous work such as (DeNero and Macherey, 2011): better alignment quality does not necessarily lead to better end-to-end results.
6.4 Result Analysis
6.4.1 Error Analysis
From Table 1 we can see that the higher F-1 score of our model mainly comes from higher precision, with recall similar to the baseline. By analyzing the results, we found that for both the baseline and our model, a large part of the missing alignment links involves stop words, such as the English words "the", "a", "it" and the Chinese word "de". Stop words are inherently hard to align, which often requires grammatical judgment unavailable to our models; as they are also extremely frequent, our model fully learns the alignment patterns of the baseline models for them, including errors. On the other hand, our model performs better on low-frequency words, especially proper nouns. Take person names for example. Most names are low-frequency words, on which the baseline HMM and IBM4 models show the "garbage collector" phenomenon. In our model, different person names have very similar word embeddings on both the English side and the Chinese side, due to monolingual pre-training; what is more, different person names often appear in similar contexts. As our model considers both word embeddings and contexts, it learns that English person names should be aligned to Chinese person names, which corrects errors of the baseline models and leads to better precision.
6.4.2 Effect of context
To examine how context contributes to alignment quality, we re-train our model with different window sizes, all from the result of IBM model 4. From Figure 3, we can see that introducing context increases the quality of the learned alignment, but the benefit diminishes for window sizes over 5. On the other hand, the results are quite stable even with a large window size of 13, without a noticeable over-fitting problem. This is not surprising considering that a larger window size only requires slightly more parameters in the linear layers. Lastly, it is worth noting that our model with no context (window size 1) performs much worse than settings with larger window sizes and than the baseline IBM4. Our explanation is as follows. Our model uses the simple jump-distance based distortion, which is weaker than the more sophisticated distortions in IBM model 4; thus, without context, it does not perform well compared to IBM model 4. With a larger window size, our model is able to produce more accurate translation scores based on more context, which leads to better alignment despite the simpler distortions.

Figure 3: Effect of different window sizes on word alignment F-score. (x-axis: window size, from 1 to 13; y-axis: F-score, from 0.74 to 0.86.)
IBM4+NN          F-1
1-hidden-layer   0.834
2-hidden-layer   0.847
3-hidden-layer   0.843

Table 3: Effect of different number of hidden layers. Two hidden layers outperform one hidden layer, while three hidden layers do not bring further improvement.
6.4.3 Effect of number of hidden layers
Our neural network contains two hidden layers besides the lookup layer. It is natural to ask whether adding more layers would be beneficial. To answer this question, we train models with 1, 2 and 3 hidden layers respectively, all from the result of IBM model 4. For the 1-hidden-layer setting, we set the hidden layer length to 120; for the 3-hidden-layer setting, we set the hidden layer lengths to 120, 100 and 10 respectively. As can be seen from Table 3, the 2-hidden-layer setting outperforms the 1-hidden-layer setting, while another hidden layer does not bring improvement. Due to time constraints, we have not tuned hyper-parameters such as the length of hidden layers in the 1- and 3-hidden-layer settings, nor have we tested settings with more hidden layers. It would be wise to test more settings to verify whether more layers would help.

word  good       history     british   served      labs          zetian     laggards
LM    bad        tradition   russian   worked      networks      hongzhang  underperformers
      great      culture     japanese  lived       technologies  yaobang    transferees
      strong     practice    dutch     offered     innovations   keming     megabanks
      true       style       german    delivered   systems       xingzhi    mutuals
      easy       literature  canadian  produced    industries    ruihua     non-starters
WA    nice       historical  uk        offering    lab           hongzhang  underperformers
      great      historic    britain   serving     laboratories  qichao     illiterates
      best       developed   english   serve       laboratory    xueqin     transferees
      pretty     record      classic   delivering  exam          fuhuan     matriculants
      excellent  recording   england   worked      experiments   bingkun    megabanks

Table 2: Nearest neighbors of several words according to their embedding distance. LM shows neighbors of word embeddings trained by the monolingual language model method; WA shows neighbors of word embeddings trained by our word alignment model.
6.4.4 Word Embedding
Following (Collobert et al., 2011), we show some words together with their nearest neighbors using the Euclidean distance between their embeddings. As we can see from Table 2, after bilingual training, "bad" is no longer in the nearest neighborhood of "good", as they hold opposite semantic meanings; the nearest neighbor of "history" is now changed to its related adjective "historical". Neighbors of proper nouns such as person names are relatively unchanged. For example, neighbors of the word "zetian" are all Chinese names in both settings. As the Chinese language lacks morphology, the singular and plural forms of an English noun often correspond to the same Chinese word, so it is desirable that the two English words have similar word embeddings. While this is true for relatively frequent nouns such as "lab" and "labs", rarer nouns still remain near their monolingual embeddings, as they are only modified a few times during the bilingual training. As shown in the last column, the neighborhood of "laggards" still consists of other plural forms even after bilingual training.
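A nearest-neighbor query of the kind used for this analysis can be sketched with the standard library alone. The two-dimensional vectors below are invented purely for illustration and are not the paper's embeddings; the function name is hypothetical.

```python
import math

def nearest(word, table, k=5):
    """Return the k words whose embeddings are closest to `word` in Euclidean distance."""
    target = table[word]
    others = [(math.dist(target, vec), w) for w, vec in table.items() if w != word]
    return [w for _, w in sorted(others)[:k]]

table = {                                      # toy embeddings (made up)
    "good": [0.0, 0.0],
    "great": [0.1, 0.0],
    "nice": [0.0, 0.2],
    "bad": [1.0, 1.0],
}
neighbors = nearest("good", table, k=2)
```

With these toy vectors the two nearest neighbors of "good" are "great" and "nice", in that order.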
7 Conclusion
In this paper, we explore applying deep neural networks to the word alignment task. Our model integrates a multi-layer neural network into an HMM-like framework, where the context dependent lexical translation score is computed by the neural network, and distortion is modeled by a simple jump-distance scheme. Our model is discriminatively trained on a bilingual corpus, while huge monolingual data is used to pre-train word embeddings. Experiments on a large-scale Chinese-to-English task show that the proposed method produces better word alignment results, compared with both the classic HMM model and IBM model 4.
For future work, we will first investigate more settings of different hyper-parameters in our model. Secondly, we want to explore the possibility of unsupervised training of our neural word alignment model, without reliance on the alignment results of other models. Furthermore, our current model uses rather simple distortions; it might be helpful to use a more sophisticated model such as ITG (Wu, 1997), which can be modeled by Recursive Neural Networks (Socher et al., 2011).
Acknowledgments
We thank the anonymous reviewers for their insightful
comments. We also thank Dongdong Zhang, Lei
Cui, Chunyang Wu and Zhenyan He for fruitful
discussions.
References
Yoshua Bengio, Holger Schwenk, Jean-Sébastien
Senécal, Fréderic Morin, and Jean-Luc Gauvain.
2006. Neural probabilistic language models.
Innovations in Machine Learning, pages 137-186.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and
Hugo Larochelle. 2007. Greedy layer-wise training
of deep networks. Advances in neural information
processing systems, 19:153.
Yoshua Bengio. 2009. Learning deep architectures for
AI. Foundations and Trends® in Machine Learning,
2(1):1-127.
Adam L. Berger, Vincent J. Della Pietra, and Stephen
A. Della Pietra. 1996. A maximum entropy approach
to natural language processing. Comput.
Linguist., 22(1):39-71, March.
JS Bridle. 1990. Neurocomputing: Algorithms,
architectures and applications, chapter Probabilistic
interpretation of feedforward classification network
outputs, with relationships to statistical pattern
recognition.
Peter F Brown, Vincent J Della Pietra, Stephen A Della
Pietra, and Robert L Mercer. 1993. The mathematics
of statistical machine translation: Parameter
estimation. Computational Linguistics, 19(2):263-311.
David Chiang. 2007. Hierarchical phrase-based
translation. Computational Linguistics, 33(2):201-228.
Ronan Collobert, Jason Weston, Léon Bottou, Michael
Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
2011. Natural language processing (almost) from
scratch. The Journal of Machine Learning
Research, 12:2493-2537.
George E Dahl, Dong Yu, Li Deng, and Alex Acero.
2012. Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition.
IEEE Transactions on Audio, Speech, and Language
Processing, 20(1):30-42.
John DeNero and Klaus Macherey. 2011. Model-
based aligner combination using dual decomposi-
tion. In Proc. ACL.
Chris Dyer, Jonathan Clark, Alon Lavie, and Noah A
Smith. 2011. Unsupervised word alignment with
arbitrary features. In Proceedings of the 49th Annual
Meeting of the Association for Computational
Linguistics: Human Language Technologies-Volume 1,
pages 409-419. Association for Computational
Linguistics.
Aria Haghighi, John Blitzer, John DeNero, and Dan
Klein. 2009. Better word alignments with
supervised ITG models. In Proceedings of the Joint
Conference of the 47th Annual Meeting of the ACL
and the 4th International Joint Conference on
Natural Language Processing of the AFNLP: Volume 2-
Volume 2, pages 923-931. Association for
Computational Linguistics.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye
Teh. 2006. A fast learning algorithm for deep
belief nets. Neural Computation, 18(7):1527-1554.
Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau,
Karol Gregor, Michaël Mathieu, and Yann LeCun.
2010. Learning convolutional feature hierarchies for
visual recognition. Advances in Neural Information
Processing Systems, pages 1090-1098.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In
Proceedings of the 2003 Conference of the North
American Chapter of the Association for
Computational Linguistics on Human Language Technology-
Volume 1, pages 48-54. Association for
Computational Linguistics.
Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton.
2012. ImageNet classification with deep
convolutional neural networks. In Advances in Neural
Information Processing Systems 25, pages 1106-1114.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick
Haffner. 1998. Gradient-based learning applied to
document recognition. Proceedings of the IEEE,
86(11):2278-2324.
Yann LeCun. 1985. A learning scheme for asymmetric
threshold networks. Proceedings of Cognitiva,
85:599-604.
Honglak Lee, Alexis Battle, Rajat Raina, and Andrew
Y Ng. 2007. Efficient sparse coding algorithms.
Advances in Neural Information Processing
Systems, 19:801.
Shujie Liu, Chi-Ho Li, and Ming Zhou. 2010.
Discriminative pruning for discriminative ITG alignment.
In Proceedings of the 48th Annual Meeting of the
Association for Computational Linguistics, ACL,
volume 10, pages 316-324.
Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann
LeCun. 2007. Sparse feature learning for deep belief
networks. Advances in Neural Information
Processing Systems, 20:1185-1192.
Robert C Moore. 2005. A discriminative framework
for bilingual word alignment. In Proceedings of
the Conference on Human Language Technology and
Empirical Methods in Natural Language
Processing, pages 81-88. Association for Computational
Linguistics.
Jan Niehues and Alex Waibel. 2012. Continuous
space language models using restricted Boltzmann
machines. In Proceedings of the Ninth International
Workshop on Spoken Language Translation
(IWSLT).
Franz Josef Och and Hermann Ney. 2000. GIZA++:
Training of statistical translation models.
Frank Seide, Gang Li, and Dong Yu. 2011.
Conversational speech transcription using context-dependent
deep neural networks. In Proc. Interspeech, pages
437-440.
Noah A Smith and Jason Eisner. 2005. Contrastive
estimation: Training log-linear models on unlabeled
data. In Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics, pages
354-362. Association for Computational
Linguistics.
Richard Socher, Cliff C Lin, Andrew Y Ng, and
Christopher D Manning. 2011. Parsing natural
scenes and natural language with recursive neu-
ral networks. In Proceedings of the 26th Inter-
national Conference on Machine Learning (ICML),
volume 2, page 7.
Richard Socher, Brody Huval, Christopher D Manning,
and Andrew Y Ng. 2012. Semantic compositionality
through recursive matrix-vector spaces. In
Proceedings of the 2012 Joint Conference on Empirical
Methods in Natural Language Processing and
Computational Natural Language Learning, pages
1201-1211. Association for Computational
Linguistics.
Le Hai Son, Alexandre Allauzen, and François Yvon.
2012. Continuous space translation models with
neural networks. In Proceedings of the 2012
Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, pages 39-48. Association for
Computational Linguistics.
Ivan Titov, Alexandre Klementiev, and Binod Bhat-
tarai. 2012. Inducing crosslingual distributed rep-
resentations of words.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Word representations: a simple and general method
for semi-supervised learning. Urbana, 51:61801.
Stephan Vogel, Hermann Ney, and Christoph Tillmann.
1996. HMM-based word alignment in statistical
translation. In Proceedings of the 16th Conference
on Computational Linguistics-Volume 2, pages
836-841. Association for Computational Linguistics.
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23(3):377-403.