POS Tagging HMM

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

Karl Stratos, Michael Collins∗ and Daniel Hsu


Department of Computer Science, Columbia University
{stratos, mcollins, djhsu}@cs.columbia.edu

Abstract with standard unsupervised learning methods such


as the expectation-maximization (EM) algorithm,
We tackle unsupervised part-of-speech (POS) but it quickly became clear that they performed very
tagging by learning hidden Markov models poorly in inducing POS tags (Merialdo, 1994). Later
(HMMs) that are particularly well-suited for works improved upon vanilla HMMs by incorporat-
the problem. These HMMs, which we call an-
ing specific structures that are well-suited for the
chor HMMs, assume that each tag is associ-
ated with at least one word that can have no task, such as a sparse prior (Johnson, 2007) or a
other tag, which is a relatively benign con- hard-clustering assumption (Brown et al., 1992).
dition for POS tagging (e.g., “the” is a word In this work, we tackle unsupervised POS tagging
that appears only under the determiner tag). with HMMs whose structure is deliberately suitable
We exploit this assumption and extend the
for POS tagging. These HMMs impose an assump-
non-negative matrix factorization framework
of Arora et al. (2013) to design a consistent
tion that each hidden state is associated with an ob-
estimator for anchor HMMs. In experiments, servation state (“anchor word”) that can appear un-
our algorithm is competitive with strong base- der no other state. For this reason, we denote this
lines such as the clustering method of Brown class of restricted HMMs by anchor HMMs. Such
et al. (1992) and the log-linear model of Berg- an assumption is relatively benign for POS tagging;
Kirkpatrick et al. (2010). Furthermore, it pro- it is reasonable to assume that each POS tag has at
duces an interpretable model in which hidden least one word that occurs only under that tag. For
states are automatically lexicalized by words.
example, in English, “the” is an anchor word for the
determiner tag; “laughed” is an anchor word for the
1 Introduction verb tag.
We build on the non-negative matrix factoriza-
Part-of-speech (POS) tagging without supervision is
tion (NMF) framework of Arora et al. (2013) to de-
a quintessential problem in unsupervised learning
rive a consistent estimator for anchor HMMs. We
for natural language processing (NLP). A major ap-
make several new contributions in the process. First,
plication of this task is reducing annotation cost: for
to our knowledge, there is no previous work di-
instance, it can be used to produce rough syntactic
rectly building on this framework to address unsu-
annotations for a new language that has no labeled
pervised sequence labeling. Second, we generalize
data, which can be subsequently refined by human
the NMF-based learning algorithm to obtain exten-
annotators.
sions that are important for empirical performance
Hidden Markov models (HMMs) are a natural
(Table 1). Third, we perform extensive experiments
choice of model and have been a workhorse for
on unsupervised POS tagging and report competitive
this problem. Early works estimated vanilla HMMs
results against strong baselines such as the cluster-

Currently on leave at Google Inc. New York. ing method of Brown et al. (1992) and the log-linear

245

Transactions of the Association for Computational Linguistics, vol. 4, pp. 245–257, 2016. Action Editor: Hinrich Schütze.
Submission batch: 1/2016; Revision batch: 3/2016; Published 6/2016.
c
2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
model of Berg-Kirkpatrick et al. (2010). • A(h) := {x ∈ [n] : o(x|h) > 0 ∧ o(x|h0 ) =
One characteristic of the approach is the imme- 0 ∀h0 6= h} is non-empty for each h ∈ [m].
diate interpretability of inferred hidden states. Be-
cause each hidden state is associated with an obser- In other words, an A-HMM is an HMM in which
vation, we can examine the set of such anchor obser- each hidden state h is associated with at least one
vations to qualitatively evaluate the learned model. “anchor” observation state that can be generated by,
In our experiments on POS tagging, we find that an- and only by, h. Note that the anchor condition im-
chor observations correspond to possible POS tags plies n ≥ m.
across different languages (Table 3). This property An equivalent definition of an A-HMM is given
can be useful when we wish to develop a tagger for a by organizing the parameters in matrix form. Under
new language that has no labeled data; we can label this definition, an A-HMM has parameters (π, T, O)
only the anchor words to achieve a complete label- where π ∈ Rm is a vector and T ∈ Rm×m , O ∈
ing of the data. Rn×m are matrices whose entries are set to:
This paper is structured as follows. In Section 2, • πh = π(h) for h ∈ [m]
we establish the notation we use throughout. In Sec-
• Th0 ,h = t(h0 |h) for h, h0 ∈ [m]
tion 3, we define the model family of anchor HMMs.
In Section 4, we derive a matrix decomposition al- • Ox,h = o(x|h) for h ∈ [m], x ∈ [n]
gorithm for estimating the parameters of an anchor
The anchor condition implies that rank(O) = m.
HMM. In Section 5, we present our experiments on
To see this, consider the rows Oa1 . . . Oam where
unsupervised POS tagging. In Section 6, we discuss
ah ∈ A(h). Since each Oah has a single non-zero
related works.
entry at the h-th index, the columns of O are linearly
2 Notation independent. We assume rank(T ) = m.
An important special case of A-HMM introduced
We use [n] to denote the set of integers {1, . . . , n}. by Brown et al. (1992) is defined below. Under
We use E[X] to denote the expected value of a ran- this A-HMM, every observation state is an anchor
dom variable P X. We define ∆
m−1 := {v ∈ Rm :
of some hidden state.
vi ≥ 0 ∀i, i vi = 1}, i.e., the (m−1)-dimensional
Definition 3.2. A Brown model is an A-HMM in
probability simplex. Given a vector v ∈ Rm , we
which A(1) . . . A(m) partition [n].
use diag(v) ∈ Rm×m to denote the diagonal matrix
with v1 . . . vm on the main diagonal. Given a matrix 4 Parameter Estimation for A-HMMs
M ∈ Rn×m , we write Mi ∈ Rm to denote the i-th
row of M (as a column vector). We now derive an algorithm for learning A-HMMs.
The algorithm reduces the learning problem to an
3 The Anchor Hidden Markov Model instance of NMF from which the model parameters
can be computed in closed-form.
Definition 3.1. An anchor HMM (A-HMM) is a 6-
tuple (n, m, π, t, o, A) for positive integers n, m and 4.1 NMF
functions π, t, o, A where
We give a brief review of the NMF method of Arora
• [n] is a set of observation states. et al. (2013). (Exact) NMF is the following problem:
given an n × d matrix A = BC where B ∈ Rn×m
• [m] is a set of hidden states.
and C ∈ Rm×d have non-negativity constraints, re-
• π(h) is the probability of generating h ∈ [m] cover B and C. This problem is NP-hard in general
in the first position of a sequence. (Vavasis, 2009), but Arora et al. (2013) provide an
exact and efficient method when A has the follow-
• t(h0 |h) is the probability of generating h0 ∈ ing special structure:
[m] given h ∈ [m].
Condition 4.1. A matrix A ∈ Rn×d satisfies this
• o(x|h) is the probability of generating x ∈ [n] condition if A = BC for B ∈ Rn×m and C ∈
given h ∈ [m]. Rm×d where

246
Anchor-NMF the vertices selected so far. Then it recovers each
Input: A ∈ Rn×d satisfying Condition 4.1 with A = Bx as the convex coefficients required to combine
BC for some B ∈ Rn×m and C ∈ Rm×d , value m Aa1 . . . Aam to yield Ax . The latter computation (1)
• For h = 1 . . . m, find a vertex ah as can be achieved with any constrained optimization
method; we use the Frank-Wolfe algorithm (Frank
U ← Gram-Schmidt({Aal }h−1
l=1 ) and Wolfe, 1956). See Arora et al. (2013) for a proof

ah ← arg max Ax − U U Ax
>
2 of the correctness of this algorithm.
x∈[n]

4.2 Random Variables


where Gram-Schmidt({Aal }h−1 l=1 ) is the Gram-
Schmidt process that orthonormalizes {Aal }h−1 To derive our algorithm, we make use of cer-
l=1 .
tain random variables under the A-HMM. Let
• For x = 1 . . . n, recover the x-th row of B as (X1 , . . . , XN ) ∈ [n]N be a random sequence

Xm of N ≥ 2 observations drawn from the model,

B x ← arg min Ax − uh Aah (1) along with the corresponding hidden state sequence
u∈∆m−1
h=1 2 (H1 , . . . , HN ) ∈ [m]N ; independently, pick a posi-
tion I ∈ [N − 1] uniformly at random. Let YI ∈ Rd
• Set C = [Aa1 . . . Aam ]> .
be a d-dimensional vector which is conditionally in-
Output: B and C such that B > h = Bρ(h) and C h = dependent of XI given HI , i.e., P (YI |HI , XI ) =
>

Cρ(h) for some permutation ρ : [m] → [m] P (YI |HI ). We will provide how to define such a
variable in Section 4.4.1: YI will be a function of
Figure 1: Non-negative matrix factorization algorithm of (X1 , . . . , XN ) serving as a “context” representation
Arora et al. (2012). of XI revealing the hidden state HI .
Define unigram probabilities u∞ , u1 ∈ Rn and
bigram probabilities B ∈ Rn×n where
1. For each x ∈ [n], Bx ∈ ∆m−1 . I.e., each row
of B defines a probability distribution over [m]. u∞x := P (XI = x) ∀x ∈ [n]
u1x := P (XI = x|I = 1) ∀x ∈ [n]
2. For each h ∈ [m], there is some ah ∈ [n] such Bx,x0 := P (XI = x, XI+1 = x ) ∀x, x0 ∈ [n]
0

that Bah ,h = 1 and Bah ,h0 = 0 for all h0 6= h.


Additionally, define π̄ ∈ Rm where
3. rank(C) = m.
π̄h = P (HI = h) ∀h ∈ [m] (2)
Since rank(B) = rank(C) = m (by property 2 and We assume π̄h > 0 for all h ∈ [m].
3), the matrix A must have rank m. Note that by
property 1, each row of A is given by a convex com- 4.3 Derivation of a Learning Algorithm
bination of the rows of C: for x ∈ [n], The following proposition provides a way to use the
m
X NMF algorithm in Figure 1 to recover the emission
Ax = Bx,h × Ch parameters O up to scaling.
h=1 Proposition 4.1. Let XI ∈ [n] and YI ∈ Rd be
respectively an observation and a context vector
Furthermore, by property 2 each h ∈ [m] has an as-
drawn from the random process described in Sec-
sociated row ah ∈ [n] such that Aah = Cah . These
tion 4.2. Define a matrix Ω ∈ Rn×d with rows
properties can be exploited to recover B and C.
A concrete algorithm for factorizing a matrix sat- Ωx = E[YI |XI = x] ∀x ∈ [n] (3)
isfying Condition 4.1 is given in Figure 1 (Arora
et al., 2013). It first identifies a1 . . . am (up to If rank(Ω) = m, then Ω satisfies Condition 4.1:
some permutation) by greedily locating the row
e
Ω = OΘ
of A furthest away from the subspace spanned by

247
where Oex,h = P (HI = h|XI = x) and Θh = Learn-Anchor-HMM
E[YI |HI = h]. Input: Ω in Proposition 4.1, number of hidden states m,
bigram probabilities B, unigram probabilities u∞ , u1
Proof.
e Θ) ← Anchor-NMF(Ω, m).
• Compute (O,
E[YI |XI = x]
• Recover the parameters:
Xm
= P (HI = h|XI = x) × E[YI |HI = h, XI = x] e > u∞
π̄ ← O (4)
h=1
m e
O ← diag(π̄)−1 diag(u∞ )O (5)
X
= P (HI = h|XI = x) × E[YI |HI = h] π=O u + 1
(6)
h=1
−1 + > + >
T ← (diag(π̄) O B(O ) ) (7)
The last equality follows by the conditional inde-
pendence of YI . This shows Ω = OΘ. e By the an- Output: A-HMM parameters (π, T, O)
chor assumption of the A-HMM, each h ∈ [m] has
at least one x ∈ A(h) such that P (HI = h|XI = Figure 2: NMF-based learning algorithm for A-HMMs.
x) = 1, thus Ω satisfies Condition 4.1. The algorithm Anchor-NMF is given in Figure 1.

A useful interpretation of Ω in Proposition 4.1


is that its rows Ω1 . . . Ωn are d-dimensional vec- Thus π can be recovered as π = O+ u1 where
tor representations of observation states forming a O+ is the Moore–Penrose pseudoinverse of O. Fi-
convex hull in Rd . This convex hull has m ver- nally, it can be algebraically verified that B =
tices Ωa1 . . . Ωam corresponding to anchors ah ∈ Odiag(π̄)T > O> (Hsu et al., 2012). Since all the in-
A(h) which can be convexly combined to realize all volved matrices have rank m, we can directly solve
Ω1 . . . Ωn . for T as
Given O,e we can recover the A-HMM parameters
as follows. First, we recover the stationary state dis- T = (diag(π̄)−1 O+ B(O> )+ )>
tribution π̄ as: Figure 2 shows the complete algorithm. As input,
X it receives a matrix Ω satisfying Proposition 4.1, the
π̄h = P (HI = h|XI = x) × P (XI = x)
number of hidden states, and the probabilities of ob-
x∈[n]
X served unigrams and bigrams. It first decomposes
= ex,h × u∞
O x Ω using the NMF algorithm in Figure 1. Then it
x∈[n] computes the A-HMM parameters whose solution is
given analytically.
The emission parameters O are given by Bayes’ the-
The following theorem guarantees the consis-
orem:
tency of the algorithm.
P (HI = h|XI = x) × P (XI = x) Theorem 4.1. Let (π, T, O) be an A-HMM such that
Ox,h = P
x∈[n] P (HI = h|XI = x) × P (XI = x) rank(T ) = m and π̄ defined in (2) has strictly pos-
ex,h × u∞
O itive entries π̄h > 0. Given random variables Ω
x
= satisfying Proposition 4.1 and B, u∞ , u1 under this
π̄h
model, the algorithm Learn-Anchor-HMM in Fig-
Using the fact that the emission probabilities are ure 2 outputs (π, T, O) up to a permutation on hid-
position-independent, we see that the initial state den states.
distribution π satisfies u1 = Oπ:
Proof. By Proposition 4.1, Ω satisfies Condition 4.1
u1x = P (XI = x|I = 1) with Ω = OΘ,e thus Oe can be recovered up to a
X
= P (XI = x|HI = h, I = 1) × P (HI = h|I = 1) permutation on columns with the algorithm Anchor-
h∈[m] NMF. The consistency of the recovered parameters
X
= Ox,h × πh follows from the correctness of (4–7) under the rank
h∈[m] conditions.

248
4.3.1 Constrained Optimization for π and T Under the rank conditions in Theorem 4.1, (11) has
Note that (6) and (7) require computing the pseu- rank m.
doinverse of the estimated O, which can be expen- More generally, we can let YI be an observation
sive and vulnerable to sampling errors in practice. (encoded as an indicator vector as in (10)) randomly
To make our parameter estimation more robust, we drawn from a window of L ∈ N nearby observa-
can explicitly impose probability constraints. We re- tions. We can either only use the identity of the cho-
cover π by solving: sen observation (in which case YI ∈ Rn ) or addi-
tionally indicate the relative position in the window
π = arg min u1 − Oπ 0 2 (8) (in which case YI ∈ RnL ). It is straightforward to
π 0 ∈∆m−1
verify that the above two conditions are satisfied un-
which can again be done with algorithms such as der these definitions. Clearly, (11) is a special case
Frank-Wolfe. We recover T by maximizing the log with L = 1.
likelihood of observation bigrams
  4.4.2 Reducing the Dimension of Ωx
X X With the definition of Ω in the previous section,
Bx,x0 log  π̄h Ox,h Th0 ,h Ox0 ,h0  (9)
the dimension of Ωx is d = O(n) which can be dif-
x,x0 h,h0 ∈[m]
ficult to work with when n  m. Proposition 4.1
subject to the constraint (T > )h ∈ ∆m−1 . Since (9) allows us to reduce the dimension as long as the fi-
is concave in T with other parameters O and π̄ fixed, nal matrix retains the form in (3) and has rank m. In
we can use EM to find the global optimum. particular, we can multiply Ω by any rank-m projec-
tion matrix Π ∈ Rd×m on the right side: if Ω sat-
4.4 Construction of the Convex Hull Ω isfies the properties in Proposition 4.1, then so does
In this section, we provide several ways to construct ΩΠ with m-dimensional rows
a convex hull Ω satisfying Proposition 4.1.
(ΩΠ)x = E[YI Π|XI = x]
4.4.1 Choice of the Context YI
In order to satisfy Proposition 4.1, we need to de- Since rank(Ω) = m, a natural choice of Π is the
fine the context variable YI ∈ Rd with two proper- projection onto the best-fit m-dimensional subspace
ties: of the row space of Ω.
We mention that previous works on the NMF-
• P (YI |HI , XI ) = P (YI |HI )
learning framework have employed various projec-
• The matrix Ω with rows tion methods, but they do not examine relative mer-
its of their choices. For instance, Arora et al. (2013)
Ωx = E[YI |XI = x] ∀x ∈ [n] simply use random projection, which is convenient
for theoretical analysis. Cohen and Collins (2014)
has rank m.
use a projection based on canonical correlation anal-
A simple construction (Arora et al., 2013) is given ysis (CCA) without further exploration. In con-
by defining YI ∈ Rn to be an indicator vector for trast, we give a full comparison of valid construc-
the next observation: tion methods and find that the choice of Ω is crucial
 in practice.
1 if XI+1 = x0
[YI ]x0 = (10)
0 otherwise 4.4.3 Construction of Ω for the Brown Model
The first condition is satisfied since XI+1 does not We can formulate an alternative way to construct
depend on XI given HI . For the second condition, a valid Ω when the model is further restricted to be a
observe that Ωx,x0 = P (XI+1 = x0 |XI = x), or in Brown model. Since every observation is an anchor,
matrix form Ox ∈ Rm has a single nonzero entry for every x.
Thus the rows defined by Ωx = Ox / ||Ox || (an in-
Ω = diag (u∞ )−1 B (11) dicator vector for the unique hidden state of x) form

249
Input: bigram probabilities B, unigram probabilities u∞ , where O h1/4i is an element-wise exponentiation of
number of hidden states
√ m, construction method τ O by 1/4. Since the rows of f (O) are simply some
Scaled Matrices: ( · is element-wise) scaling and rotation of the rows of O, using Ωx =
−1/2 −1/2 f (O)x / ||f (O)x || yields a valid Ω.
B := diag (u∞ ) Bdiag (u∞ )
√ −1/2 √ √ −1/2 While we need to impose an additional assump-
Be := diag u∞ Bdiag u∞ tion (the Brown model restriction) in order to justify
this choice of Ω, we find in our experiments that it
Singular Vectors: U (M ) (V (M )) is an n × m matrix of performs better than other alternatives. We specu-
the left (right) singular vectors of M corresponding to the late that this is because a Brown model is rather ap-
largest m singular values propriate for the POS tagging task; many words are
• If τ 6= brown: set indeed unambiguous with respect to POS tags (Ta-
−1 ble 5). Also, the general effectiveness of f (O) for
Ω ← diag (u∞ ) BΠ
representational purposes has been demostrated in
where the projection matrix Π ∈ Rn×m is given by previous works (Stratos et al., 2014; Stratos et al.,
2015). By restricting the A-HMM to be a Brown
Πi,j ∼ N (0, 1/m) if τ = random model, we can piggyback on the proven effective-
Π = V (diag (u )∞ −1
B) if τ = best-fit ness of f (O).
Π = diag (u )∞ −1/2
V (B) if τ = cca
Figure 3 shows an algorithm for constructing Ω
with these different construction methods. For sim-
• If τ = brown: compute the transformed emission plicity, we only show the bigram construction (con-
e and set
matrix as f (O) = U (B) text size L = 1), but an extension for larger context
(L > 1) is straightforward as discussed earlier. The
Ω ← diag(v)−1 f (O) construction methods random (random projection),
where vx := ||f (O)x ||2 is the length of the x-th best-fit (projection to the best-fit subspace), and cca
row of f (O). (CCA projection) all compute (11) and differ only
in how the dimension is reduced. The construction
Output: Ω ∈ Rn×m in Proposition 4.1
method brown computes the transformed Brown pa-
rameters f (O) as the left singular vectors of a scaled
Figure 3: Algorithm for constructing a valid Ω with dif- covariance matrix and then normalizes its rows. We
ferent construction methods. For simplicity, we only
direct the reader to Stratos et al. (2015) for a deriva-
show the bigram construction (context size L = 1), but an
extension for larger context (L > 1) is straightforward. tion of this calculation.

4.4.4 Ω with Feature Augmentation


a trivial convex hull in which every point is a ver- The x-th row of Ω is a d-dimensional vector repre-
tex. This corresponds to choosing an oracle context sentation of x lying in a convex set with m vertices.
YI ∈ Rm where This suggests a natural way to incorporate domain-
 specific features: we can add additional dimensions
1 if HI = h
[YI ]h = that provide information about hidden states from
0 otherwise
the surface form of x.
It is possible to recover the Brown model param- For instance, consider the the POS tagging task.
eters O up to element-wise scaling and rotation of In the simple construction (11), the representation
rows using the algorithm of Stratos et al. (2015). of word x is defined in terms of neighboring words
More specifically, let f (O) ∈ Rn×m denote the out- x0 :
put of their algorithm. Then they show that for some   
vector s ∈ Rm with strictly positive entries and an [Ωx ]x0 = E 1 XI+1 = x0 |XI = x
orthogonal matrix Q ∈ Rm×m :
where 1(·) ∈ {0, 1} is the indicator function. We
f (O) = Oh1/4i diag(s)Q> can augment this vector with s additional dimen-

250
sions indicating the spelling features of x. For in- (Merialdo, 1994; Smith and Eisner, 2005b; Liang
stance, the (n + 1)-th dimension may be defined as: and Klein, 2008).
Consequently, many works focus on using more
[Ωx ]n+1 = E [1 (x ends in “ing” ) |XI = x]
expressive models such as log-linear models (Smith
This value will be generally large for verbs and and Eisner, 2005a; Berg-Kirkpatrick et al., 2010)
small for non-verbs, nudging verbs closer together and Markov random fields (MRF) (Haghighi and
and away from non-verbs. The modified (n + s)- Klein, 2006). These models are shown to deliver
dimensional representation is followed by the usual good performance even though learning is approxi-
dimension reduction. Note that the spelling features mate. Thus one may question the value of having a
are a deterministic function of a word, and we are consistent estimator for A-HMMs and Brown mod-
implicitly assuming that they are independent of the els in this work: if the model is wrong, what is the
word given its tag. While this is of course not true in point of learning it accurately?
practice, we find that these features can significantly However, there is also ample evidence that
boost the tagging performance. HMMs are competitive for unsupervised POS induc-
tion when they incorporate domain-specific struc-
5 Experiments tures. Johnson (2007) is able to outperform the
We evaluate our A-HMM learning algorithm on the sophisticated MRF model of Haghighi and Klein
task of unsupervised POS tagging. The goal of this (2006) on one-to-one accuracy by using a sparse
task is to induce the correct sequence of POS tags prior in HMM estimation. The clustering method
(hidden states) given a sequence of words (observa- of Brown et al. (1992) which is based on optimizing
tion states). The anchor condition corresponds to as- the likelihood under the Brown model (a special case
suming that each POS tag has at least one word that of HMM) remains a baseline difficult to outperform
occurs only under that tag. (Christodoulopoulos et al., 2010).
We add to this evidence by demonstrating the ef-
5.1 Background on Unsupervised POS Tagging fectiveness of A-HMMs on this task. We also check
Unsupervised POS tagging has long been an active the anchor assumption on data and show that the A-
area of research (Smith and Eisner, 2005a; John- HMM model structure is in fact appropriate for the
son, 2007; Toutanova and Johnson, 2007; Haghighi problem (Table 5).
and Klein, 2006; Berg-Kirkpatrick et al., 2010),
5.2 Experimental Setting
but results on this task are complicated by vary-
ing assumptions and unclear evaluation metrics We use the universal treebank dataset (version 2.0)
(Christodoulopoulos et al., 2010). Rather than ad- which contains sentences annotated with 12 POS tag
dressing multiple alternatives for evaluating unsu- types for 10 languages (McDonald et al., 2013). All
pervised POS tagging, we focus on a simple and models are trained with 12 hidden states. We use
widely used metric: many-to-one accuracy (i.e., we the English portion to experiment with different hy-
map each hidden state to the most frequently coin- perparameter configurations. At test time, we fix a
ciding POS tag in the labeled data and compute the configuration (based on the English portion) and ap-
resulting accuracy). ply it across all languages.
The list of compared methods is given below:
5.1.1 Better Model v.s. Better Learning
Vanilla HMMs are notorious for their mediocre BW The Baum-Welch algorithm, an EM algorithm
performance on this task, and it is well known that for HMMs (Baum and Petrie, 1966).
they perform poorly largely because of model mis-
CLUSTER A parameter estimation scheme for
specification, not because of suboptimal parameter
HMMs based on Brown clustering (Brown et al.,
estimation (e.g., because EM gets stuck in local op-
1992). We run the Brown clustering algorithm1 to
tima). More generally, a large body of work points
obtain 12 word clusters C1 . . . C12 . Then we set
to the inappropriateness of simple generative mod-
els for unsupervised induction of linguistic structure 1
We use the implementation of Liang (2005).

251
the emission parameters o(x|h), transition param- Choice of Ω Accuracy
eters t(h0 |h), and prior π(h) to be the maximum- Random 48.2
likelihood estimates under the fixed clusters. Best-Fit 53.4
CCA 57.0
ANCHOR Our algorithm Learn-Anchor-HMM in
Brown 66.1
Figure 2 but with the constrained optimization (8)
and (9) for estimating π and T .2 Table 1: Many-to-one accuracy on the English data with
different choices of the convex hull Ω (Figure 3). These
ANCHOR - FEATURES Same as ANCHOR but em- results do not use spelling features.
ploys the feature augmentation scheme described in
Section 4.4.4.
construction (τ = brown in Figure 3) clearly per-
LOG - LINEAR The unsupervised log-linear model
forms the best: essentially, the anchor algorithm is
described in Berg-Kirkpatrick et al. (2010). Instead
used to extract the HMM parameters from the CCA-
of emission parameters o(x|h), the model maintains
based word embeddings of Stratos et al. (2015).
a miniature log-linear model with a weight vector w
We also explore feature augmentation discussed
and a feature function φ. The probability of a word
in Section 4.4.4. For comparison, we employ the
x given tag h is computed as
same word features used by Berg-Kirkpatrick et al.
exp(w> φ(x, h)) (2010):
p(x|h) = P >
x∈[n] exp(w φ(x, h)) • Indicators for whether a word is capitalized,
contains a hyphen, or contains a digit
The model can be trained by maximizing the like-
lihood of observed sequences. We use L-BFGS to • Suffixes of length 1, 2, and 3
directly optimize this objective.3 This approach ob-
tains the current state-of-the-art accuracy on fine- We weigh the l2 norm of these extra dimensions
grained (45 tags) English WSJ dataset. in relation to the original dimensions: we find a
small weight (e.g., 0.1 of the norm of the original
We use maximum marginal decoding for HMM dimensions) works well. We also find that these fea-
predictions: i.e., at each position, we predict the tures can sometimes significantly improve the per-
most likely tag given the entire sentence. formance. For instance, the accuracy on the English
portion can be improved from 66.1% to 71.4% with
5.3 Practical Issues with the Anchor Algorithm feature augmentation.
In our experiments, we find that Anchor-NMF (Fig- Another natural experiment is to refine the HMM
ure 1) tends to propose extremely rare words as an- parameters obtained from the anchor algorithm (or
chors. A simple fix is to search for anchors only Brown clusters) with a few iterations of the Baum-
among relatively frequent words. We find that any Welch algorithm. In our experiments, however, it
reasonable frequency threshold works well; we use did not significantly improve the tagging perfor-
the 300 most frequent words. Note that this is not mance, so we omit this result.
a problem if these 300 words include anchor words
corresponding to all the 12 tags. 5.4 Tagging Accuracy
We must define the context for constructing Ω. Table 2 shows the many-to-one accuracy on all lan-
We use the previous and next words (i.e., context guages in the dataset. For the Baum-Welch algo-
size L = 2) marked with relative positions. Thus Ω rithm and the unsupervised log-linear models, we
has 2n columns before dimension reduction. Table 1 report the mean and the standard deviation (in paren-
shows the performance on the English portion with theses) of 10 random restarts run for 1,000 itera-
different construction methods for Ω. The Brown tions.
2
https://github.com/karlstratos/anchor
Both ANCHOR and ANCHOR - FEATURES compete
3
We use the implementation of Berg-Kirkpatrick et al. favorably. On 5 out of 10 languages, ANCHOR -
(2010) (personal communication). FEATURES achieves the highest accuracy, often

252
Model de en es fr id it ja ko pt-br sv
(4.8) (3.4) (2.2) (3.6) (3.1) (2.6) (2.1) (0.6) (3.7) (3.0)
BW
45.5 59.8 60.6 60.1 49.6 51.5 59.5 51.7 59.5 42.4
CLUSTER 60.0 62.9 67.4 66.4 59.3 66.1 60.3 47.5 67.4 61.9
ANCHOR 61.1 66.1 69.0 68.2 63.7 60.4 65.3 53.8 64.9 51.1
ANCHOR - FEATURES 63.4 71.4 74.3 71.9 67.3 60.2 69.4 61.8 65.8 61.0
(1.8) (3.5) (3.1) (4.5) (3.9) (2.9) (2.9) (3.6) (2.2) (2.5)
LOG - LINEAR
67.5 62.4 67.1 62.1 61.3 52.9 78.2 60.5 63.2 56.7

Table 2: Many-to-one accuracy on each language using 12 universal tags. The first four models are HMMs estimated
with the Baum-Welch algorithm (BW), the clustering algorithm of Brown et al. (1992), the anchor algorithm without
(ANCHOR) and with (ANCHOR - FEATURES) feature augmentation. LOG - LINEAR is the model of Berg-Kirkpatrick et
al. (2010) trained with the direct-gradient method using L-BFGS. For BW and LOG - LINEAR, we report the mean and
the standard deviation (in parentheses) of 10 random restarts run for 1,000 iterations.

closely followed by ANCHOR. The Brown clustering that an anchor is representative of a certain group
estimation is also competitive and has the highest of words. For instance, the state “loss” rep-
accuracy on 3 languages. Not surprisingly, vanilla resents noun-like words, “1” represents numbers,
HMMs trained with BW perform the worst (see Sec- “on” represents preposition-like words, “one” rep-
tion 5.1.1 for a discussion). resents determiner-like words, and “closed” repre-
LOG - LINEAR is a robust baseline and performs sents verb-like words. The conditional distribution
the best on the remaining 2 languages. It per- is peaked for anchors that represent function tags
forms especially strongly on Japanese and Korean (e.g., determiners, punctuation) and flat for anchors
datasets in which poorly segmented strings such as that represent content tags (e.g., nouns). Occasion-
“1950年11月5日には” (on November 5, 1950) and ally, an anchor assigns high probabilities to words
“40.3%ᄅ ᅩ” (by 40.3%) abound. In these datasets, that do not seem to belong to the corresponding
it is crucial to make effective use of morphological POS tag. But this is to be expected since o(x|h) ∝
features. P (XI = x) is generally larger for frequent words.

5.5 Qualitative Analysis 5.5.2 Model Assumptions on Data


Table 5 checks the assumptions in A-HMMs and
5.5.1 A-HMM Parameters
Brown models on the universal treebank dataset.
An A-HMM can be easily interpreted since each The anchor assumption is indeed satisfied with 12
hidden state is marked with an anchor observation. universal tags: in every language, each tag has at
Table 3 shows the 12 anchors found in each lan- least one word uniquely associated with the tag.
guage. Note that these anchor words generally have The Brown assumption (each word has exactly one
a wide coverage of possible POS tags. possible tag) is of course not satisfied, since some
We also experimented with using true anchor words are genuinely ambiguous with respect to their
words (obtained from labeled data), but they did not POS tags. However, the percentage of unambigu-
improve performance over automatically induced ous words is very high (well over 90%). This analy-
anchors. Since anchor discovery is inherently tied to sis supports that the model assumptions made by A-
parameter estimation, it is better to obtain anchors in HMMs and Brown models are appropriate for POS
a data-driven manner. In particular, certain POS tags tagging.
(e.g., X) appear quite infrequently, and the model is Table 6 reports the log likelihood (normalized by
worse off by being forced to allocate a hidden state the number of words) on the English portion of
for such a tag. different estimation methods for HMMs. BW and
Table 4 shows words with highest emission prob- CLUSTER obtain higher likelihood than the anchor
abilities o(x|h) under each anchor. We observe algorithm, but this is expected given that both EM

253
de en es fr id it ja ko pt-br sv
empfehlen loss y avait bulan radar お世話に ᅪ
ᄋᄌ
ᆫ ᆫ
ᅥ E och
wie 1 hizo commune tetapi però ないと ᆼᄋ

ᅮ ᅦ de bör
; on - Le wilayah sulle ことにより ᆼᄋ

ᅧ ᅮ partida grund
Sein one especie de - - されている。 ᆯ

ᅮ fazer mellan
Berlin closed Además président Bagaimana Stati ものを ᇀᄋ

가ᄋ ᅭ meses i
und are el qui , Lo , ᆭᄋ

ᄆ ᆫ
ᅳ os sociala
, take paı́ses ( sama legge した , : .
- , la à . al それは ᆯ

ᅩ diretor bli
der vice España États dan far- 、 ᆫᄋ
ᅡᄉ
지 ᅴ 2010 den
im to en Unis Utara di 幸福の ᆮᄀ

ᄇ ᅩ , ,
des York de Cette pada la ことが ᅡ
ᄆᆻᄂ
ᆺᄋ
ᅵ ᆫ
ᅳ uma tid
Region Japan municipio quelques yang art. 通常の ᅱᅡ
ᄋ ᆫ
ᄒ O Detta

Table 3: Anchor words found in each language (model ANCHOR - FEATURES).

loss year (.02) market (.01) share (.01) company (.01) stock (.01) quarter (.01) shares (.01) price (.01)
1 1 (.03) 10 (.02) 30 (.02) 15 (.02) 8 (.02) 2 (.01) 20 (.01) 50 (.01)
on of (.14) in (.12) . (.08) for (.06) on (.04) by (.04) from (.04) and (.03)
one the (.23) a (.12) “ (.03) an (.03) $ (.02) its (.02) that (.02) this (.02)
closed said (.05) ’s (.02) is (.02) says (.02) was (.01) has (.01) had (.01) expected (.01)
are and (.08) is (.08) are (.05) was (.04) ’s (.04) “ (.04) has (.03) of (.03)
take be (.04) % (.02) have (.02) million (.02) But (.02) do (.01) The (.01) make (.01)
, , (.53) . (.25) and (.05) ” (.04) % (.01) million (.01) – (.01) that (.01)
vice ’s (.03) The (.02) “ (.02) New (.01) and (.01) new (.01) first (.01) chief (.01)
to to (.39) . (.11) a (.06) will (.04) $ (.03) n’t (.03) would (.02) % (.02)
York the (.15) a (.05) The (.04) of (.04) ’s (.04) million (.01) % (.01) its (.01)
Japan Mr. (.03) it (.02) ” (.02) $ (.02) he (.02) that (.02) which (.01) company (.01)

Table 4: Most likely words under each anchor word (English model ANCHOR - FEATURES). Emission probabilities
o(x|h) are given in parentheses.

and Brown clustering directly optimize likelihood. rect learning algorithm for topic models with a cer-
In contrast, the anchor algorithm is based on the tain parameter structure.
method of moments and does not (at least directly) The anchor-based framework has been originally
optimize likelihood. Note that high likelihood does formulated for learning topic models (Arora et al.,
not imply high accuracy under HMMs. 2013). It has been subsequently adopted to learn
other models such as latent-variable probabilistic
6 Related Work context-free grammars (Cohen and Collins, 2014).
6.1 Latent-Variable Models In our work, we have extended this framework to
address unsupervised sequence labeling.
There has recently been great progress in estima-
Zhou et al. (2014) also extend Arora et al.
tion of models with latent variables. Despite the
(2013)’s framework to learn various models includ-
NP-hardness in general cases (Terwijn, 2002; Arora
ing HMMs, but they address a more general prob-
et al., 2012), many algorithms with strong theoreti-
lem. Consequently, their algorithm draws from
cal guarantees have emerged under natural assump-
Anandkumar et al. (2012) and is substantially dif-
tions. For example, for HMMs with full-rank condi-
ferent from ours.
tions, Hsu et al. (2012) derive a consistent estimator
of the marginal distribution of observed sequences.
6.2 Unsupervised POS Tagging
Anandkumar et al. (2014) propose an exact tensor
decomposition method for learning a wide class of Unsupervised POS tagging is a classic problem in
latent variable models with similar non-degeneracy unsupervised learning that has been tackled with
conditions. Arora et al. (2013) derive a provably cor- various approaches. Johnson (2007) observes that

254
de en es fr id it ja ko pt-br sv
% anchored tags 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
% unambig words 96.6 91.5 94.0 94.2 94.8 94.8 99.5 98.4 94.8 97.4

. VERB PRON ADP NOUN ADV CONJ DET NUM ADJ X PRT
, said it from Mr. n’t or which billion new bono na
53928 6339 5147 4856 4436 3582 2748 2458 1935 1542 8 5

Table 5: Verifying model assumptions on the universal treebank. The anchor assumption is satisfied in every language.
The Brown assumption (each word has exactly one possible tag) is violated but not by a large margin. The lower table
shows the most frequent anchor word and its count under each tag on the English portion.

Model Normalized LL Acc Models Accuracy


BW -6.45 59.8 BW 62.6 (1.1)
CLUSTER -6.71 62.9 CLUSTER 65.6
ANCHOR -7.06 66.1 ANCHOR 67.2
ANCHOR - FEATURES -7.05 71.4 ANCHOR - FEATURES 67.7
LOG - LINEAR 74.9 (1.5)
Table 6: Log likelihood normalized by the number of
words on English (along with accuracy). For BW, we Table 7: Many-to-one accuracy on the English data with
report the mean of 10 random restarts run for 1,000 it- 45 original tags. We use the same setting as in Table 2.
erations. For BW and LOG - LINEAR, we report the mean and the
standard deviation (in parentheses) of 10 random restarts
run for 1,000 iterations.
EM performs poorly in this task because it induces
flat distributions; this is not the case with our algo-
they leverage additional resources. For instance, Das
rithm as seen in the peaky distributions in Table 4.
and Petrov (2011) and Täckström et al. (2013) use
Haghighi and Klein (2006) assume a set of proto-
parallel data to project POS tags from a supervised
typical words for each tag and report high accuracy.
source language. Li et al. (2012) use tag dictionar-
In contrast, our algorithm automatically finds such
ies built from Wiktionary. Thus their results are not
prototypes in a subroutine.
directly comparable to ours.4
Berg-Kirkpatrick et al. (2010) achieve the state-
of-the-art result in unsupervised fine-grained POS
tagging (mid-70%). As described in Section 5.2, 7 Conclusion
their model is an HMM in which probabilties are
given by log-linear models. Table 7 provides a We have presented an exact estimation method for
point of reference comparing our work with Berg- learning anchor HMMs from unlabeled data. There
Kirkpatrick et al. (2010) in their setting: models are are several directions for future work. An important
trained and tested on the entire 45-tag WSJ dataset. direction is to extend the method to a richer family
Their model outperforms our approach in this set- of models such as log-linear models or neural net-
ting: with fine-grained tags, spelling features be- works. Another direction is to further generalize the
come more important, for instance to distinguish method to handle a wider class of HMMs by relax-
“played” (VBD) from “play” (VBZ). Nonetheless, we ing the anchor condition (Condition 4.1). This will
have shown that our approach is competitive when require a significant extension of the NMF algorithm
universal tags are used (Table 2). in Figure 1.
Many past works on POS induction predate the
introduction of the universal tagset by Petrov et al. 4
Das and Petrov (2011) conduct unsupervised experiments
(2012) and thus report results with fine-grained tags. using the model of Berg-Kirkpatrick et al. (2010), but their
More recent works adopt the universal tagset but dataset and evaluation method differ from ours.

255
Acknowledgments Dipanjan Das and Slav Petrov. 2011. Unsupervised
part-of-speech tagging with bilingual graph-based pro-
We thank Taylor Berg-Kirkpatrick for providing the jections. In Proceedings of the 49th Annual Meet-
implementation of Berg-Kirkpatrick et al. (2010). ing of the Association for Computational Linguistics:
We also thank anonymous reviewers for their con- Human Language Technologies-Volume 1, pages 600–
structive comments. 609. Association for Computational Linguistics.
Marguerite Frank and Philip Wolfe. 1956. An algorithm
for quadratic programming. Naval Research Logistics
References Quarterly, 3(1-2):95–110.
Aria Haghighi and Dan Klein. 2006. Prototype-driven
Animashree Anandkumar, Daniel Hsu, and Sham M. learning for sequence models. In Proceedings of
Kakade. 2012. A method of moments for mixture the main conference on Human Language Technol-
models and hidden Markov models. In Twenty-Fifth ogy Conference of the North American Chapter of the
Annual Conference on Learning Theory. Association of Computational Linguistics, pages 320–
Animashree Anandkumar, Rong Ge, Daniel Hsu, 327. Association for Computational Linguistics.
Sham M. Kakade, and Matus Telgarsky. 2014. Ten- Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A
sor decompositions for learning latent variable models. spectral algorithm for learning hidden Markov mod-
Journal of Machine Learning Research, 15(1):2773– els. Journal of Computer and System Sciences,
2832. 78(5):1460–1480.
Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Mark Johnson. 2007. Why doesn’t EM find good HMM
Learning topic models–going beyond SVD. In Foun- POS-taggers? In EMNLP-CoNLL, pages 296–305.
dations of Computer Science (FOCS), 2012 IEEE 53rd Shen Li, Joao V. Graça, and Ben Taskar. 2012. Wiki-
Annual Symposium on, pages 1–10. IEEE. ly supervised part-of-speech tagging. In Proceedings
Sanjeev Arora, Rong Ge, Yonatan Halpern, David of the 2012 Joint Conference on Empirical Methods
Mimno, Ankur Moitra, David Sontag, Yichen Wu, and in Natural Language Processing and Computational
Michael Zhu. 2013. A practical algorithm for topic Natural Language Learning, pages 1389–1398. Asso-
modeling with provable guarantees. In Proceedings of ciation for Computational Linguistics.
the 30th International Conference on Machine Learn- Percy Liang and Dan Klein. 2008. Analyzing the errors
ing (ICML-13), pages 280–288. of unsupervised learning. In ACL, pages 879–887.
Percy Liang. 2005. Semi-supervised learning for natural
Leonard E. Baum and Ted Petrie. 1966. Statisti-
language. Master’s thesis, Massachusetts Institute of
cal inference for probabilistic functions of finite state
Technology.
Markov chains. The Annals of Mathematical Statis-
Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-
tics, 37(6):1554–1563.
Brundage, Yoav Goldberg, Dipanjan Das, Kuzman
Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Os-
John DeNero, and Dan Klein. 2010. Painless unsu- car Täckström, Claudia Bedini, Núria B. Castelló, and
pervised learning with features. In Human Language Jungmee Lee. 2013. Universal dependency annota-
Technologies: The 2010 Annual Conference of the tion for multilingual parsing. In ACL, pages 92–97.
North American Chapter of the Association for Com- Bernard Merialdo. 1994. Tagging English text with
putational Linguistics, pages 582–590. Association for a probabilistic model. Computational Linguistics,
Computational Linguistics. 20(2):155–171.
Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vin- Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012.
cent J. Della Pietra, and Jenifer C. Lai. 1992. Class- A universal part-of-speech tagset. In Proceedings of
based n-gram models of natural language. Computa- the Eighth International Conference on Language Re-
tional Linguistics, 18(4):467–479. sources and Evaluation (LREC’12).
Christos Christodoulopoulos, Sharon Goldwater, and Noah A. Smith and Jason Eisner. 2005a. Contrastive
Mark Steedman. 2010. Two decades of unsupervised estimation: Training log-linear models on unlabeled
POS induction: How far have we come? In Proceed- data. In Proceedings of the 43rd Annual Meeting
ings of the 2010 Conference on Empirical Methods in on Association for Computational Linguistics, pages
Natural Language Processing, pages 575–584. Asso- 354–362. Association for Computational Linguistics.
ciation for Computational Linguistics. Noah A. Smith and Jason Eisner. 2005b. Guiding un-
Shay B. Cohen and Michael Collins. 2014. A provably supervised grammar induction using contrastive esti-
correct learning algorithm for latent-variable PCFGs. mation. In Proc. of IJCAI Workshop on Grammatical
In Proceedings of ACL. Inference Applications, pages 73–82.

256
Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel
Hsu. 2014. A spectral algorithm for learning class-
based n-gram models of natural language. In Proceed-
ings of the thirtieth conference on Uncertainty in Arti-
ficial Intelligence.
Karl Stratos, Michael Collins, and Daniel Hsu. 2015.
Model-based word embeddings from decompositions
of count matrices. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguis-
tics and the 7th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers),
pages 1282–1291, Beijing, China, July. Association
for Computational Linguistics.
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan Mc-
Donald, and Joakim Nivre. 2013. Token and type
constraints for cross-lingual part-of-speech tagging.
Transactions of the Association for Computational
Linguistics, 1:1–12.
Sebastiaan A. Terwijn. 2002. On the learnability of hid-
den Markov models. In Grammatical Inference: Al-
gorithms and Applications, pages 261–268. Springer.
Kristina Toutanova and Mark Johnson. 2007. A
Bayesian LDA-based model for semi-supervised part-
of-speech tagging. In Advances in Neural Information
Processing Systems, pages 1521–1528.
Stephen A. Vavasis. 2009. On the complexity of nonneg-
ative matrix factorization. SIAM Journal on Optimiza-
tion, 20(3):1364–1377.
Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. 2014.
Divide-and-conquer learning by anchoring a conical
hull. In Advances in Neural Information Processing
Systems, pages 1242–1250.

257
258

You might also like