POS Tagging HMM
POS Tagging HMM
POS Tagging HMM
245
Transactions of the Association for Computational Linguistics, vol. 4, pp. 245–257, 2016. Action Editor: Hinrich Schütze.
Submission batch: 1/2016; Revision batch: 3/2016; Published 6/2016.
c
2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
model of Berg-Kirkpatrick et al. (2010). • A(h) := {x ∈ [n] : o(x|h) > 0 ∧ o(x|h0 ) =
One characteristic of the approach is the imme- 0 ∀h0 6= h} is non-empty for each h ∈ [m].
diate interpretability of inferred hidden states. Be-
cause each hidden state is associated with an obser- In other words, an A-HMM is an HMM in which
vation, we can examine the set of such anchor obser- each hidden state h is associated with at least one
vations to qualitatively evaluate the learned model. “anchor” observation state that can be generated by,
In our experiments on POS tagging, we find that an- and only by, h. Note that the anchor condition im-
chor observations correspond to possible POS tags plies n ≥ m.
across different languages (Table 3). This property An equivalent definition of an A-HMM is given
can be useful when we wish to develop a tagger for a by organizing the parameters in matrix form. Under
new language that has no labeled data; we can label this definition, an A-HMM has parameters (π, T, O)
only the anchor words to achieve a complete label- where π ∈ Rm is a vector and T ∈ Rm×m , O ∈
ing of the data. Rn×m are matrices whose entries are set to:
This paper is structured as follows. In Section 2, • πh = π(h) for h ∈ [m]
we establish the notation we use throughout. In Sec-
• Th0 ,h = t(h0 |h) for h, h0 ∈ [m]
tion 3, we define the model family of anchor HMMs.
In Section 4, we derive a matrix decomposition al- • Ox,h = o(x|h) for h ∈ [m], x ∈ [n]
gorithm for estimating the parameters of an anchor
The anchor condition implies that rank(O) = m.
HMM. In Section 5, we present our experiments on
To see this, consider the rows Oa1 . . . Oam where
unsupervised POS tagging. In Section 6, we discuss
ah ∈ A(h). Since each Oah has a single non-zero
related works.
entry at the h-th index, the columns of O are linearly
2 Notation independent. We assume rank(T ) = m.
An important special case of A-HMM introduced
We use [n] to denote the set of integers {1, . . . , n}. by Brown et al. (1992) is defined below. Under
We use E[X] to denote the expected value of a ran- this A-HMM, every observation state is an anchor
dom variable P X. We define ∆
m−1 := {v ∈ Rm :
of some hidden state.
vi ≥ 0 ∀i, i vi = 1}, i.e., the (m−1)-dimensional
Definition 3.2. A Brown model is an A-HMM in
probability simplex. Given a vector v ∈ Rm , we
which A(1) . . . A(m) partition [n].
use diag(v) ∈ Rm×m to denote the diagonal matrix
with v1 . . . vm on the main diagonal. Given a matrix 4 Parameter Estimation for A-HMMs
M ∈ Rn×m , we write Mi ∈ Rm to denote the i-th
row of M (as a column vector). We now derive an algorithm for learning A-HMMs.
The algorithm reduces the learning problem to an
3 The Anchor Hidden Markov Model instance of NMF from which the model parameters
can be computed in closed-form.
Definition 3.1. An anchor HMM (A-HMM) is a 6-
tuple (n, m, π, t, o, A) for positive integers n, m and 4.1 NMF
functions π, t, o, A where
We give a brief review of the NMF method of Arora
• [n] is a set of observation states. et al. (2013). (Exact) NMF is the following problem:
given an n × d matrix A = BC where B ∈ Rn×m
• [m] is a set of hidden states.
and C ∈ Rm×d have non-negativity constraints, re-
• π(h) is the probability of generating h ∈ [m] cover B and C. This problem is NP-hard in general
in the first position of a sequence. (Vavasis, 2009), but Arora et al. (2013) provide an
exact and efficient method when A has the follow-
• t(h0 |h) is the probability of generating h0 ∈ ing special structure:
[m] given h ∈ [m].
Condition 4.1. A matrix A ∈ Rn×d satisfies this
• o(x|h) is the probability of generating x ∈ [n] condition if A = BC for B ∈ Rn×m and C ∈
given h ∈ [m]. Rm×d where
246
Anchor-NMF the vertices selected so far. Then it recovers each
Input: A ∈ Rn×d satisfying Condition 4.1 with A = Bx as the convex coefficients required to combine
BC for some B ∈ Rn×m and C ∈ Rm×d , value m Aa1 . . . Aam to yield Ax . The latter computation (1)
• For h = 1 . . . m, find a vertex ah as can be achieved with any constrained optimization
method; we use the Frank-Wolfe algorithm (Frank
U ← Gram-Schmidt({Aal }h−1
l=1 ) and Wolfe, 1956). See Arora et al. (2013) for a proof
ah ← arg max Ax − U U Ax
>
2 of the correctness of this algorithm.
x∈[n]
Cρ(h) for some permutation ρ : [m] → [m] P (YI |HI ). We will provide how to define such a
variable in Section 4.4.1: YI will be a function of
Figure 1: Non-negative matrix factorization algorithm of (X1 , . . . , XN ) serving as a “context” representation
Arora et al. (2012). of XI revealing the hidden state HI .
Define unigram probabilities u∞ , u1 ∈ Rn and
bigram probabilities B ∈ Rn×n where
1. For each x ∈ [n], Bx ∈ ∆m−1 . I.e., each row
of B defines a probability distribution over [m]. u∞x := P (XI = x) ∀x ∈ [n]
u1x := P (XI = x|I = 1) ∀x ∈ [n]
2. For each h ∈ [m], there is some ah ∈ [n] such Bx,x0 := P (XI = x, XI+1 = x ) ∀x, x0 ∈ [n]
0
247
where Oex,h = P (HI = h|XI = x) and Θh = Learn-Anchor-HMM
E[YI |HI = h]. Input: Ω in Proposition 4.1, number of hidden states m,
bigram probabilities B, unigram probabilities u∞ , u1
Proof.
e Θ) ← Anchor-NMF(Ω, m).
• Compute (O,
E[YI |XI = x]
• Recover the parameters:
Xm
= P (HI = h|XI = x) × E[YI |HI = h, XI = x] e > u∞
π̄ ← O (4)
h=1
m e
O ← diag(π̄)−1 diag(u∞ )O (5)
X
= P (HI = h|XI = x) × E[YI |HI = h] π=O u + 1
(6)
h=1
−1 + > + >
T ← (diag(π̄) O B(O ) ) (7)
The last equality follows by the conditional inde-
pendence of YI . This shows Ω = OΘ. e By the an- Output: A-HMM parameters (π, T, O)
chor assumption of the A-HMM, each h ∈ [m] has
at least one x ∈ A(h) such that P (HI = h|XI = Figure 2: NMF-based learning algorithm for A-HMMs.
x) = 1, thus Ω satisfies Condition 4.1. The algorithm Anchor-NMF is given in Figure 1.
248
4.3.1 Constrained Optimization for π and T Under the rank conditions in Theorem 4.1, (11) has
Note that (6) and (7) require computing the pseu- rank m.
doinverse of the estimated O, which can be expen- More generally, we can let YI be an observation
sive and vulnerable to sampling errors in practice. (encoded as an indicator vector as in (10)) randomly
To make our parameter estimation more robust, we drawn from a window of L ∈ N nearby observa-
can explicitly impose probability constraints. We re- tions. We can either only use the identity of the cho-
cover π by solving: sen observation (in which case YI ∈ Rn ) or addi-
tionally indicate the relative position in the window
π = arg min u1 − Oπ 0 2 (8) (in which case YI ∈ RnL ). It is straightforward to
π 0 ∈∆m−1
verify that the above two conditions are satisfied un-
which can again be done with algorithms such as der these definitions. Clearly, (11) is a special case
Frank-Wolfe. We recover T by maximizing the log with L = 1.
likelihood of observation bigrams
4.4.2 Reducing the Dimension of Ωx
X X With the definition of Ω in the previous section,
Bx,x0 log π̄h Ox,h Th0 ,h Ox0 ,h0 (9)
the dimension of Ωx is d = O(n) which can be dif-
x,x0 h,h0 ∈[m]
ficult to work with when n m. Proposition 4.1
subject to the constraint (T > )h ∈ ∆m−1 . Since (9) allows us to reduce the dimension as long as the fi-
is concave in T with other parameters O and π̄ fixed, nal matrix retains the form in (3) and has rank m. In
we can use EM to find the global optimum. particular, we can multiply Ω by any rank-m projec-
tion matrix Π ∈ Rd×m on the right side: if Ω sat-
4.4 Construction of the Convex Hull Ω isfies the properties in Proposition 4.1, then so does
In this section, we provide several ways to construct ΩΠ with m-dimensional rows
a convex hull Ω satisfying Proposition 4.1.
(ΩΠ)x = E[YI Π|XI = x]
4.4.1 Choice of the Context YI
In order to satisfy Proposition 4.1, we need to de- Since rank(Ω) = m, a natural choice of Π is the
fine the context variable YI ∈ Rd with two proper- projection onto the best-fit m-dimensional subspace
ties: of the row space of Ω.
We mention that previous works on the NMF-
• P (YI |HI , XI ) = P (YI |HI )
learning framework have employed various projec-
• The matrix Ω with rows tion methods, but they do not examine relative mer-
its of their choices. For instance, Arora et al. (2013)
Ωx = E[YI |XI = x] ∀x ∈ [n] simply use random projection, which is convenient
for theoretical analysis. Cohen and Collins (2014)
has rank m.
use a projection based on canonical correlation anal-
A simple construction (Arora et al., 2013) is given ysis (CCA) without further exploration. In con-
by defining YI ∈ Rn to be an indicator vector for trast, we give a full comparison of valid construc-
the next observation: tion methods and find that the choice of Ω is crucial
in practice.
1 if XI+1 = x0
[YI ]x0 = (10)
0 otherwise 4.4.3 Construction of Ω for the Brown Model
The first condition is satisfied since XI+1 does not We can formulate an alternative way to construct
depend on XI given HI . For the second condition, a valid Ω when the model is further restricted to be a
observe that Ωx,x0 = P (XI+1 = x0 |XI = x), or in Brown model. Since every observation is an anchor,
matrix form Ox ∈ Rm has a single nonzero entry for every x.
Thus the rows defined by Ωx = Ox / ||Ox || (an in-
Ω = diag (u∞ )−1 B (11) dicator vector for the unique hidden state of x) form
249
Input: bigram probabilities B, unigram probabilities u∞ , where O h1/4i is an element-wise exponentiation of
number of hidden states
√ m, construction method τ O by 1/4. Since the rows of f (O) are simply some
Scaled Matrices: ( · is element-wise) scaling and rotation of the rows of O, using Ωx =
−1/2 −1/2 f (O)x / ||f (O)x || yields a valid Ω.
B := diag (u∞ ) Bdiag (u∞ )
√ −1/2 √ √ −1/2 While we need to impose an additional assump-
Be := diag u∞ Bdiag u∞ tion (the Brown model restriction) in order to justify
this choice of Ω, we find in our experiments that it
Singular Vectors: U (M ) (V (M )) is an n × m matrix of performs better than other alternatives. We specu-
the left (right) singular vectors of M corresponding to the late that this is because a Brown model is rather ap-
largest m singular values propriate for the POS tagging task; many words are
• If τ 6= brown: set indeed unambiguous with respect to POS tags (Ta-
−1 ble 5). Also, the general effectiveness of f (O) for
Ω ← diag (u∞ ) BΠ
representational purposes has been demostrated in
where the projection matrix Π ∈ Rn×m is given by previous works (Stratos et al., 2014; Stratos et al.,
2015). By restricting the A-HMM to be a Brown
Πi,j ∼ N (0, 1/m) if τ = random model, we can piggyback on the proven effective-
Π = V (diag (u )∞ −1
B) if τ = best-fit ness of f (O).
Π = diag (u )∞ −1/2
V (B) if τ = cca
Figure 3 shows an algorithm for constructing Ω
with these different construction methods. For sim-
• If τ = brown: compute the transformed emission plicity, we only show the bigram construction (con-
e and set
matrix as f (O) = U (B) text size L = 1), but an extension for larger context
(L > 1) is straightforward as discussed earlier. The
Ω ← diag(v)−1 f (O) construction methods random (random projection),
where vx := ||f (O)x ||2 is the length of the x-th best-fit (projection to the best-fit subspace), and cca
row of f (O). (CCA projection) all compute (11) and differ only
in how the dimension is reduced. The construction
Output: Ω ∈ Rn×m in Proposition 4.1
method brown computes the transformed Brown pa-
rameters f (O) as the left singular vectors of a scaled
Figure 3: Algorithm for constructing a valid Ω with dif- covariance matrix and then normalizes its rows. We
ferent construction methods. For simplicity, we only
direct the reader to Stratos et al. (2015) for a deriva-
show the bigram construction (context size L = 1), but an
extension for larger context (L > 1) is straightforward. tion of this calculation.
250
sions indicating the spelling features of x. For in- (Merialdo, 1994; Smith and Eisner, 2005b; Liang
stance, the (n + 1)-th dimension may be defined as: and Klein, 2008).
Consequently, many works focus on using more
[Ωx ]n+1 = E [1 (x ends in “ing” ) |XI = x]
expressive models such as log-linear models (Smith
This value will be generally large for verbs and and Eisner, 2005a; Berg-Kirkpatrick et al., 2010)
small for non-verbs, nudging verbs closer together and Markov random fields (MRF) (Haghighi and
and away from non-verbs. The modified (n + s)- Klein, 2006). These models are shown to deliver
dimensional representation is followed by the usual good performance even though learning is approxi-
dimension reduction. Note that the spelling features mate. Thus one may question the value of having a
are a deterministic function of a word, and we are consistent estimator for A-HMMs and Brown mod-
implicitly assuming that they are independent of the els in this work: if the model is wrong, what is the
word given its tag. While this is of course not true in point of learning it accurately?
practice, we find that these features can significantly However, there is also ample evidence that
boost the tagging performance. HMMs are competitive for unsupervised POS induc-
tion when they incorporate domain-specific struc-
5 Experiments tures. Johnson (2007) is able to outperform the
We evaluate our A-HMM learning algorithm on the sophisticated MRF model of Haghighi and Klein
task of unsupervised POS tagging. The goal of this (2006) on one-to-one accuracy by using a sparse
task is to induce the correct sequence of POS tags prior in HMM estimation. The clustering method
(hidden states) given a sequence of words (observa- of Brown et al. (1992) which is based on optimizing
tion states). The anchor condition corresponds to as- the likelihood under the Brown model (a special case
suming that each POS tag has at least one word that of HMM) remains a baseline difficult to outperform
occurs only under that tag. (Christodoulopoulos et al., 2010).
We add to this evidence by demonstrating the ef-
5.1 Background on Unsupervised POS Tagging fectiveness of A-HMMs on this task. We also check
Unsupervised POS tagging has long been an active the anchor assumption on data and show that the A-
area of research (Smith and Eisner, 2005a; John- HMM model structure is in fact appropriate for the
son, 2007; Toutanova and Johnson, 2007; Haghighi problem (Table 5).
and Klein, 2006; Berg-Kirkpatrick et al., 2010),
5.2 Experimental Setting
but results on this task are complicated by vary-
ing assumptions and unclear evaluation metrics We use the universal treebank dataset (version 2.0)
(Christodoulopoulos et al., 2010). Rather than ad- which contains sentences annotated with 12 POS tag
dressing multiple alternatives for evaluating unsu- types for 10 languages (McDonald et al., 2013). All
pervised POS tagging, we focus on a simple and models are trained with 12 hidden states. We use
widely used metric: many-to-one accuracy (i.e., we the English portion to experiment with different hy-
map each hidden state to the most frequently coin- perparameter configurations. At test time, we fix a
ciding POS tag in the labeled data and compute the configuration (based on the English portion) and ap-
resulting accuracy). ply it across all languages.
The list of compared methods is given below:
5.1.1 Better Model v.s. Better Learning
Vanilla HMMs are notorious for their mediocre BW The Baum-Welch algorithm, an EM algorithm
performance on this task, and it is well known that for HMMs (Baum and Petrie, 1966).
they perform poorly largely because of model mis-
CLUSTER A parameter estimation scheme for
specification, not because of suboptimal parameter
HMMs based on Brown clustering (Brown et al.,
estimation (e.g., because EM gets stuck in local op-
1992). We run the Brown clustering algorithm1 to
tima). More generally, a large body of work points
obtain 12 word clusters C1 . . . C12 . Then we set
to the inappropriateness of simple generative mod-
els for unsupervised induction of linguistic structure 1
We use the implementation of Liang (2005).
251
the emission parameters o(x|h), transition param- Choice of Ω Accuracy
eters t(h0 |h), and prior π(h) to be the maximum- Random 48.2
likelihood estimates under the fixed clusters. Best-Fit 53.4
CCA 57.0
ANCHOR Our algorithm Learn-Anchor-HMM in
Brown 66.1
Figure 2 but with the constrained optimization (8)
and (9) for estimating π and T .2 Table 1: Many-to-one accuracy on the English data with
different choices of the convex hull Ω (Figure 3). These
ANCHOR - FEATURES Same as ANCHOR but em- results do not use spelling features.
ploys the feature augmentation scheme described in
Section 4.4.4.
construction (τ = brown in Figure 3) clearly per-
LOG - LINEAR The unsupervised log-linear model
forms the best: essentially, the anchor algorithm is
described in Berg-Kirkpatrick et al. (2010). Instead
used to extract the HMM parameters from the CCA-
of emission parameters o(x|h), the model maintains
based word embeddings of Stratos et al. (2015).
a miniature log-linear model with a weight vector w
We also explore feature augmentation discussed
and a feature function φ. The probability of a word
in Section 4.4.4. For comparison, we employ the
x given tag h is computed as
same word features used by Berg-Kirkpatrick et al.
exp(w> φ(x, h)) (2010):
p(x|h) = P >
x∈[n] exp(w φ(x, h)) • Indicators for whether a word is capitalized,
contains a hyphen, or contains a digit
The model can be trained by maximizing the like-
lihood of observed sequences. We use L-BFGS to • Suffixes of length 1, 2, and 3
directly optimize this objective.3 This approach ob-
tains the current state-of-the-art accuracy on fine- We weigh the l2 norm of these extra dimensions
grained (45 tags) English WSJ dataset. in relation to the original dimensions: we find a
small weight (e.g., 0.1 of the norm of the original
We use maximum marginal decoding for HMM dimensions) works well. We also find that these fea-
predictions: i.e., at each position, we predict the tures can sometimes significantly improve the per-
most likely tag given the entire sentence. formance. For instance, the accuracy on the English
portion can be improved from 66.1% to 71.4% with
5.3 Practical Issues with the Anchor Algorithm feature augmentation.
In our experiments, we find that Anchor-NMF (Fig- Another natural experiment is to refine the HMM
ure 1) tends to propose extremely rare words as an- parameters obtained from the anchor algorithm (or
chors. A simple fix is to search for anchors only Brown clusters) with a few iterations of the Baum-
among relatively frequent words. We find that any Welch algorithm. In our experiments, however, it
reasonable frequency threshold works well; we use did not significantly improve the tagging perfor-
the 300 most frequent words. Note that this is not mance, so we omit this result.
a problem if these 300 words include anchor words
corresponding to all the 12 tags. 5.4 Tagging Accuracy
We must define the context for constructing Ω. Table 2 shows the many-to-one accuracy on all lan-
We use the previous and next words (i.e., context guages in the dataset. For the Baum-Welch algo-
size L = 2) marked with relative positions. Thus Ω rithm and the unsupervised log-linear models, we
has 2n columns before dimension reduction. Table 1 report the mean and the standard deviation (in paren-
shows the performance on the English portion with theses) of 10 random restarts run for 1,000 itera-
different construction methods for Ω. The Brown tions.
2
https://github.com/karlstratos/anchor
Both ANCHOR and ANCHOR - FEATURES compete
3
We use the implementation of Berg-Kirkpatrick et al. favorably. On 5 out of 10 languages, ANCHOR -
(2010) (personal communication). FEATURES achieves the highest accuracy, often
252
Model de en es fr id it ja ko pt-br sv
(4.8) (3.4) (2.2) (3.6) (3.1) (2.6) (2.1) (0.6) (3.7) (3.0)
BW
45.5 59.8 60.6 60.1 49.6 51.5 59.5 51.7 59.5 42.4
CLUSTER 60.0 62.9 67.4 66.4 59.3 66.1 60.3 47.5 67.4 61.9
ANCHOR 61.1 66.1 69.0 68.2 63.7 60.4 65.3 53.8 64.9 51.1
ANCHOR - FEATURES 63.4 71.4 74.3 71.9 67.3 60.2 69.4 61.8 65.8 61.0
(1.8) (3.5) (3.1) (4.5) (3.9) (2.9) (2.9) (3.6) (2.2) (2.5)
LOG - LINEAR
67.5 62.4 67.1 62.1 61.3 52.9 78.2 60.5 63.2 56.7
Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated
with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), the anchor algorithm without
(ANCHOR) and with (ANCHOR - FEATURES) feature augmentation. LOG - LINEAR is the model of Berg-Kirkpatrick et
al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG - LINEAR, we report the mean and
the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.
closely followed by ANCHOR. The Brown clustering that an anchor is representative of a certain group
estimation is also competitive and has the highest of words. For instance, the state “loss” rep-
accuracy on 3 languages. Not surprisingly, vanilla resents noun-like words, “1” represents numbers,
HMMs trained with BW perform the worst (see Sec- “on” represents preposition-like words, “one” rep-
tion 5.1.1 for a discussion). resents determiner-like words, and “closed” repre-
LOG - LINEAR is a robust baseline and performs sents verb-like words. The conditional distribution
the best on the remaining 2 languages. It per- is peaked for anchors that represent function tags
forms especially strongly on Japanese and Korean (e.g., determiners, punctuation) and flat for anchors
datasets in which poorly segmented strings such as that represent content tags (e.g., nouns). Occasion-
“1950年11月5日には” (on November 5, 1950) and ally, an anchor assigns high probabilities to words
“40.3%ᄅ ᅩ” (by 40.3%) abound. In these datasets, that do not seem to belong to the corresponding
it is crucial to make effective use of morphological POS tag. But this is to be expected since o(x|h) ∝
features. P (XI = x) is generally larger for frequent words.
253
de en es fr id it ja ko pt-br sv
empfehlen loss y avait bulan radar お世話に ᅪ
ᄋᄌ
ᆫ ᆫ
ᅥ E och
wie 1 hizo commune tetapi però ないと ᆼᄋ
ᄌ
ᅮ ᅦ de bör
; on - Le wilayah sulle ことにより ᆼᄋ
ᄀ
ᅧ ᅮ partida grund
Sein one especie de - - されている。 ᆯ
ᄌ
ᅮ fazer mellan
Berlin closed Además président Bagaimana Stati ものを ᇀᄋ
ᅡ
가ᄋ ᅭ meses i
und are el qui , Lo , ᆭᄋ
ᅡ
ᄆ ᆫ
ᅳ os sociala
, take paı́ses ( sama legge した , : .
- , la à . al それは ᆯ
ᄇ
ᅩ diretor bli
der vice España États dan far- 、 ᆫᄋ
ᅡᄉ
지 ᅴ 2010 den
im to en Unis Utara di 幸福の ᆮᄀ
ᅡ
ᄇ ᅩ , ,
des York de Cette pada la ことが ᅡ
ᄆᆻᄂ
ᆺᄋ
ᅵ ᆫ
ᅳ uma tid
Region Japan municipio quelques yang art. 通常の ᅱᅡ
ᄋ ᆫ
ᄒ O Detta
loss year (.02) market (.01) share (.01) company (.01) stock (.01) quarter (.01) shares (.01) price (.01)
1 1 (.03) 10 (.02) 30 (.02) 15 (.02) 8 (.02) 2 (.01) 20 (.01) 50 (.01)
on of (.14) in (.12) . (.08) for (.06) on (.04) by (.04) from (.04) and (.03)
one the (.23) a (.12) “ (.03) an (.03) $ (.02) its (.02) that (.02) this (.02)
closed said (.05) ’s (.02) is (.02) says (.02) was (.01) has (.01) had (.01) expected (.01)
are and (.08) is (.08) are (.05) was (.04) ’s (.04) “ (.04) has (.03) of (.03)
take be (.04) % (.02) have (.02) million (.02) But (.02) do (.01) The (.01) make (.01)
, , (.53) . (.25) and (.05) ” (.04) % (.01) million (.01) – (.01) that (.01)
vice ’s (.03) The (.02) “ (.02) New (.01) and (.01) new (.01) first (.01) chief (.01)
to to (.39) . (.11) a (.06) will (.04) $ (.03) n’t (.03) would (.02) % (.02)
York the (.15) a (.05) The (.04) of (.04) ’s (.04) million (.01) % (.01) its (.01)
Japan Mr. (.03) it (.02) ” (.02) $ (.02) he (.02) that (.02) which (.01) company (.01)
Table 4: Most likely words under each anchor word (English model ANCHOR - FEATURES). Emission probabilities
o(x|h) are given in parentheses.
and Brown clustering directly optimize likelihood. rect learning algorithm for topic models with a cer-
In contrast, the anchor algorithm is based on the tain parameter structure.
method of moments and does not (at least directly) The anchor-based framework has been originally
optimize likelihood. Note that high likelihood does formulated for learning topic models (Arora et al.,
not imply high accuracy under HMMs. 2013). It has been subsequently adopted to learn
other models such as latent-variable probabilistic
6 Related Work context-free grammars (Cohen and Collins, 2014).
6.1 Latent-Variable Models In our work, we have extended this framework to
address unsupervised sequence labeling.
There has recently been great progress in estima-
Zhou et al. (2014) also extend Arora et al.
tion of models with latent variables. Despite the
(2013)’s framework to learn various models includ-
NP-hardness in general cases (Terwijn, 2002; Arora
ing HMMs, but they address a more general prob-
et al., 2012), many algorithms with strong theoreti-
lem. Consequently, their algorithm draws from
cal guarantees have emerged under natural assump-
Anandkumar et al. (2012) and is substantially dif-
tions. For example, for HMMs with full-rank condi-
ferent from ours.
tions, Hsu et al. (2012) derive a consistent estimator
of the marginal distribution of observed sequences.
6.2 Unsupervised POS Tagging
Anandkumar et al. (2014) propose an exact tensor
decomposition method for learning a wide class of Unsupervised POS tagging is a classic problem in
latent variable models with similar non-degeneracy unsupervised learning that has been tackled with
conditions. Arora et al. (2013) derive a provably cor- various approaches. Johnson (2007) observes that
254
de en es fr id it ja ko pt-br sv
% anchored tags 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
% unambig words 96.6 91.5 94.0 94.2 94.8 94.8 99.5 98.4 94.8 97.4
. VERB PRON ADP NOUN ADV CONJ DET NUM ADJ X PRT
, said it from Mr. n’t or which billion new bono na
53928 6339 5147 4856 4436 3582 2748 2458 1935 1542 8 5
Table 5: Verifying model assumptions on the universal treebank. The anchor assumption is satisfied in every language.
The Brown assumption (each word has exactly one possible tag) is violated but not by a large margin. The lower table
shows the most frequent anchor word and its count under each tag on the English portion.
255
Acknowledgments Dipanjan Das and Slav Petrov. 2011. Unsupervised
part-of-speech tagging with bilingual graph-based pro-
We thank Taylor Berg-Kirkpatrick for providing the jections. In Proceedings of the 49th Annual Meet-
implementation of Berg-Kirkpatrick et al. (2010). ing of the Association for Computational Linguistics:
We also thank anonymous reviewers for their con- Human Language Technologies-Volume 1, pages 600–
structive comments. 609. Association for Computational Linguistics.
Marguerite Frank and Philip Wolfe. 1956. An algorithm
for quadratic programming. Naval Research Logistics
References Quarterly, 3(1-2):95–110.
Aria Haghighi and Dan Klein. 2006. Prototype-driven
Animashree Anandkumar, Daniel Hsu, and Sham M. learning for sequence models. In Proceedings of
Kakade. 2012. A method of moments for mixture the main conference on Human Language Technol-
models and hidden Markov models. In Twenty-Fifth ogy Conference of the North American Chapter of the
Annual Conference on Learning Theory. Association of Computational Linguistics, pages 320–
Animashree Anandkumar, Rong Ge, Daniel Hsu, 327. Association for Computational Linguistics.
Sham M. Kakade, and Matus Telgarsky. 2014. Ten- Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A
sor decompositions for learning latent variable models. spectral algorithm for learning hidden Markov mod-
Journal of Machine Learning Research, 15(1):2773– els. Journal of Computer and System Sciences,
2832. 78(5):1460–1480.
Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Mark Johnson. 2007. Why doesn’t EM find good HMM
Learning topic models–going beyond SVD. In Foun- POS-taggers? In EMNLP-CoNLL, pages 296–305.
dations of Computer Science (FOCS), 2012 IEEE 53rd Shen Li, Joao V. Graça, and Ben Taskar. 2012. Wiki-
Annual Symposium on, pages 1–10. IEEE. ly supervised part-of-speech tagging. In Proceedings
Sanjeev Arora, Rong Ge, Yonatan Halpern, David of the 2012 Joint Conference on Empirical Methods
Mimno, Ankur Moitra, David Sontag, Yichen Wu, and in Natural Language Processing and Computational
Michael Zhu. 2013. A practical algorithm for topic Natural Language Learning, pages 1389–1398. Asso-
modeling with provable guarantees. In Proceedings of ciation for Computational Linguistics.
the 30th International Conference on Machine Learn- Percy Liang and Dan Klein. 2008. Analyzing the errors
ing (ICML-13), pages 280–288. of unsupervised learning. In ACL, pages 879–887.
Percy Liang. 2005. Semi-supervised learning for natural
Leonard E. Baum and Ted Petrie. 1966. Statisti-
language. Master’s thesis, Massachusetts Institute of
cal inference for probabilistic functions of finite state
Technology.
Markov chains. The Annals of Mathematical Statis-
Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-
tics, 37(6):1554–1563.
Brundage, Yoav Goldberg, Dipanjan Das, Kuzman
Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Os-
John DeNero, and Dan Klein. 2010. Painless unsu- car Täckström, Claudia Bedini, Núria B. Castelló, and
pervised learning with features. In Human Language Jungmee Lee. 2013. Universal dependency annota-
Technologies: The 2010 Annual Conference of the tion for multilingual parsing. In ACL, pages 92–97.
North American Chapter of the Association for Com- Bernard Merialdo. 1994. Tagging English text with
putational Linguistics, pages 582–590. Association for a probabilistic model. Computational Linguistics,
Computational Linguistics. 20(2):155–171.
Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vin- Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012.
cent J. Della Pietra, and Jenifer C. Lai. 1992. Class- A universal part-of-speech tagset. In Proceedings of
based n-gram models of natural language. Computa- the Eighth International Conference on Language Re-
tional Linguistics, 18(4):467–479. sources and Evaluation (LREC’12).
Christos Christodoulopoulos, Sharon Goldwater, and Noah A. Smith and Jason Eisner. 2005a. Contrastive
Mark Steedman. 2010. Two decades of unsupervised estimation: Training log-linear models on unlabeled
POS induction: How far have we come? In Proceed- data. In Proceedings of the 43rd Annual Meeting
ings of the 2010 Conference on Empirical Methods in on Association for Computational Linguistics, pages
Natural Language Processing, pages 575–584. Asso- 354–362. Association for Computational Linguistics.
ciation for Computational Linguistics. Noah A. Smith and Jason Eisner. 2005b. Guiding un-
Shay B. Cohen and Michael Collins. 2014. A provably supervised grammar induction using contrastive esti-
correct learning algorithm for latent-variable PCFGs. mation. In Proc. of IJCAI Workshop on Grammatical
In Proceedings of ACL. Inference Applications, pages 73–82.
256
Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel
Hsu. 2014. A spectral algorithm for learning class-
based n-gram models of natural language. In Proceed-
ings of the thirtieth conference on Uncertainty in Arti-
ficial Intelligence.
Karl Stratos, Michael Collins, and Daniel Hsu. 2015.
Model-based word embeddings from decompositions
of count matrices. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguis-
tics and the 7th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers),
pages 1282–1291, Beijing, China, July. Association
for Computational Linguistics.
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan Mc-
Donald, and Joakim Nivre. 2013. Token and type
constraints for cross-lingual part-of-speech tagging.
Transactions of the Association for Computational
Linguistics, 1:1–12.
Sebastiaan A. Terwijn. 2002. On the learnability of hid-
den Markov models. In Grammatical Inference: Al-
gorithms and Applications, pages 261–268. Springer.
Kristina Toutanova and Mark Johnson. 2007. A
Bayesian LDA-based model for semi-supervised part-
of-speech tagging. In Advances in Neural Information
Processing Systems, pages 1521–1528.
Stephen A. Vavasis. 2009. On the complexity of nonneg-
ative matrix factorization. SIAM Journal on Optimiza-
tion, 20(3):1364–1377.
Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. 2014.
Divide-and-conquer learning by anchoring a conical
hull. In Advances in Neural Information Processing
Systems, pages 1242–1250.
257
258