The Infinite Hidden Markov Model
Abstract
We show that it is possible to extend hidden Markov models to have
a countably infinite number of hidden states. By using the theory of
Dirichlet processes we can implicitly integrate out the infinitely many
transition parameters, leaving only three hyperparameters which can be
learned from data. These three hyperparameters define a hierarchical
Dirichlet process capable of capturing a rich set of transition dynamics.
The three hyperparameters control the time scale of the dynamics, the
sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite—consider, for example, symbols being possible words appearing in English text.
1 Introduction
Hidden Markov models (HMMs) are one of the most popular methods in machine
learning and statistics for modelling sequences such as speech and proteins. An
HMM defines a probability distribution over sequences of observations (symbols) y = {y_1, . . . , y_t, . . . , y_T} by invoking another sequence of unobserved, or hidden, discrete state variables s = {s_1, . . . , s_t, . . . , s_T}. The basic idea in an HMM is that the sequence of hidden states has Markov dynamics—i.e. given s_t, s_τ is independent of s_ρ for all τ < t < ρ—and that the observations y_t are independent of all other variables given s_t. The model is defined in terms of two sets of parameters: the transition matrix, whose (i, j)th element is P(s_{t+1} = j | s_t = i), and the emission matrix, whose (i, q)th element is P(y_t = q | s_t = i). The usual procedure for estimating the parameters of an HMM is the Baum-Welch algorithm, a special case of EM, which estimates expected values of two matrices n and m corresponding to counts of transitions and emissions respectively, where the expectation is taken over the posterior probability of hidden state sequences [6].
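For concreteness, the finite model just described can be sampled from directly. The sketch below is ours (the names A, B and pi0 are illustrative, not from the paper); A and B play the roles of the transition and emission matrices defined above.

import numpy as np

def sample_hmm(A, B, pi0, T, rng=None):
    """Sample hidden states s and observations y of length T from a finite HMM.
    A[i, j] = P(s_{t+1} = j | s_t = i), B[i, q] = P(y_t = q | s_t = i),
    and pi0 is the initial state distribution."""
    rng = np.random.default_rng() if rng is None else rng
    K, Q = B.shape
    s = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    s[0] = rng.choice(K, p=pi0)
    y[0] = rng.choice(Q, p=B[s[0]])
    for t in range(1, T):
        s[t] = rng.choice(K, p=A[s[t - 1]])   # Markov dynamics of the hidden state
        y[t] = rng.choice(Q, p=B[s[t]])       # emission depends only on s_t
    return s, y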
Both the standard estimation procedure and the model definition for HMMs suffer from
important limitations. First, maximum likelihood estimation procedures do not consider
the complexity of the model, making it hard to avoid over or underfitting. Second, the
model structure has to be specified in advance. Motivated in part by these problems there
have been attempts to approximate a full Bayesian analysis of HMMs which integrates over,
rather than optimises, the parameters. It has been proposed to approximate such Bayesian
integration both using variational methods [3] and by conditioning on a single most likely
hidden state sequence [8].
In this paper we start from the point of view that the basic modelling assumption of
HMMs—that the data was generated by some discrete state variable which can take on
one of several values—is unreasonable for most real-world problems. Instead we formulate the idea of HMMs with a countably infinite number of hidden states. In principle,
such models have infinitely many parameters in the state transition matrix. Obviously it
would not be sensible to optimise these parameters; instead we use the theory of Dirichlet
processes (DPs) [2, 1] to implicitly integrate them out, leaving just three hyperparameters
defining the prior over transition dynamics.
The idea of using DPs to define mixture models with an infinite number of components has been previously explored in [5] and [7]. This simple form of the DP turns out to be inadequate for HMMs. Because of this we have extended the notion of a DP to a two-stage hierarchical process which couples transitions between different states. It should be stressed that Dirichlet distributions have been used extensively both as priors for mixing proportions and to smooth n-gram models over finite alphabets [4], which differs considerably
from the model presented here. To our knowledge no one has studied inference in discrete
infinite-state HMMs.
We begin with a review of Dirichlet processes in section 2 which we will use as the basis
for the notion of a hierarchical Dirichlet process (HDP) described in section 3. We explore
properties of the HDP prior, showing that it can generate interesting hidden state sequences
and that it can also be used as an emission model for an infinite alphabet of symbols. This
infinite emission model is controlled by two additional hyperparameters. In section 4 we
describe the procedures for inference (Gibbs sampling the hidden states), learning (optimising the hyperparameters), and likelihood evaluation (infinite-state particle filtering).
We present experimental results in section 5 and conclude in section 6.
where we have used the Kronecker-delta function (δ(a, b) = 1 iff a = b, and 0 otherwise) to count the number of times n_j that s_{t+1} = j has been drawn. Let us see what happens to the distribution of these indicators when we integrate out the mixing proportions π under a conjugate prior. We give the mixing proportions a symmetric Dirichlet prior with positive concentration hyperparameter β:
\[
P(\pi \mid \beta) \sim \mathrm{Dirichlet}(\beta/k, \ldots, \beta/k)
  = \frac{\Gamma(\beta)}{\Gamma(\beta/k)^{k}} \prod_{j=1}^{k} \pi_j^{\beta/k - 1} ,
\tag{2}
\]
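Carrying out this integral is a standard Dirichlet-multinomial calculation. For reference (this step is implied rather than written out above), the resulting predictive probability of the next indicator depends only on the counts n_j and β:

\[
P(s_{t+1} = j \mid \mathbf{n}, \beta)
  = \int P(s_{t+1} = j \mid \pi)\, P(\pi \mid \mathbf{n}, \beta)\, d\pi
  = \frac{n_j + \beta/k}{\sum_{j'=1}^{k} n_{j'} + \beta} .
\]

As k → ∞, represented states keep probability proportional to their counts n_j while a total mass proportional to β remains for the infinitely many unrepresented states; this is the standard route to the Dirichlet process on which the hierarchical construction below builds.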
[Figure 1 (left), decision tree for the state transition mechanism: from state i, a self-transition is chosen with probability proportional to n_{ii} + α, a transition to an existing state j ≠ i with probability proportional to n_{ij}, and the oracle is consulted with probability proportional to β; under the oracle, an existing state j is chosen with probability n^o_j / (Σ_j n^o_j + γ) and a new state with probability γ / (Σ_j n^o_j + γ).]
Figure 1: (left) State transition generative mechanism. (right a-d) Sampled state trajectories
of length T = 250 (time along horizontal axis) from the HDP: we give examples of four modes of
behaviour. (a) α = 0.1, β = 1000, γ = 100, explores many states with a sparse transition matrix. (b)
α = 0, β = 0.1, γ = 100, retraces multiple interacting trajectory segments. (c) α = 8, β = 2, γ = 2,
switches between a few different states. (d) α = 1, β = 1, γ = 10000, has strict left-to-right transition
dynamics with long linger time.
Under the oracle, with probability proportional to γ an entirely new state is transitioned to. This is the only mechanism for visiting new states from the infinitely many available to us. After each transition we set n_{ij} ← n_{ij} + 1 and, if we transitioned to state j via the oracle DP just described, then in addition we set n^o_j ← n^o_j + 1. If we transitioned to a new state then the sizes of n and n^o increase.
Self-transitions are special because their probability defines a time scale over which the dynamics of the hidden state evolve. We assign a finite prior mass α to self-transitions for each state; this is the third hyperparameter in our model. Therefore, when a state is first visited (via γ in the HDP), its self-transition count is initialised to α.
The full hidden state transition mechanism is a two-level DP hierarchy shown in decision
tree form in Figure 1. Alongside are shown typical state trajectories under the prior with
different hyperparameters. We can see that, with just three hyperparameters, there is a wealth of possible trajectory types. Note that γ controls the expected number of represented hidden states, and β influences the tendency to explore new transitions, corresponding to the size and density respectively of the resulting transition count matrix. Finally, α controls the prior tendency to linger in a state.
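To make this generative mechanism concrete, the sketch below draws a trajectory from the two-level prior following the decision tree of Figure 1. The function and variable names are ours, and we seed the process with a single initial state created via the oracle (an assumption not spelled out in the text).

import numpy as np

def sample_hdp_trajectory(T, alpha, beta, gamma, rng=None):
    """Draw a hidden state trajectory of length T from the two-level HDP prior.
    n[i, j] counts transitions i -> j; n_o[j] counts oracle choices of state j."""
    rng = np.random.default_rng() if rng is None else rng
    n = np.zeros((1, 1))   # transition counts, starting with one represented state
    n_o = np.ones(1)       # oracle counts; the first state is assumed created via the oracle
    s = [0]
    for _ in range(T - 1):
        i = s[-1]
        K = n.shape[0]
        # Level one: self-transition (weight n_ii + alpha), existing transition
        # (weight n_ij), or consult the oracle (weight beta).
        weights = n[i].copy()
        weights[i] += alpha
        weights = np.append(weights, beta)
        choice = rng.choice(K + 1, p=weights / weights.sum())
        if choice < K:
            j = choice
        else:
            # Level two (oracle): existing state (weight n_o[j]) or a new state (weight gamma).
            o_weights = np.append(n_o, gamma)
            j = rng.choice(K + 1, p=o_weights / o_weights.sum())
            if j == K:                        # brand new state: grow the count matrices
                n = np.pad(n, ((0, 1), (0, 1)))
                n_o = np.append(n_o, 0.0)
            n_o[j] += 1
        n[i, j] += 1
        s.append(j)
    return np.array(s)

Varying α, β and γ as in Figure 1(a)–(d) should reproduce the qualitatively different behaviours described in the caption.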
The role of the oracle is two-fold. First it serves to couple the transition DPs from different
hidden states. Since a newly visited state has no previous transitions to existing states,
without an oracle (which necessarily has knowledge of all represented states as it created
them) it would transition to itself or yet another new state with probability 1. By consulting
the oracle, new states can have finite probability of transitioning to represented states. The
second role of the oracle is to allow some states to be more influential (more commonly
transitioned to) than others.
[Figure 2 (left), decision tree for the emission mechanism: under the emission oracle, an existing symbol q is chosen with probability m^o_q / (Σ_q m^o_q + γ_e) and a new symbol with probability γ_e / (Σ_q m^o_q + γ_e).]
Figure 2: (left) State emission generative mechanism. (middle) Word occurrence for the entire Alice novel: each word is assigned a unique integer identity as it appears. Word identity (vertical) is plotted
against the word position (horizontal) in the text. (right) (Exp 1) Evolution of number of represented
states K (vertical), plotted against iterations of Gibbs sweeps (horizontal) during learning of the
ascending-descending sequence which requires exactly 10 states to model the data perfectly. Each
line represents initialising the hidden state to a random sequence containing K = {1, 2, 4, . . . , 128}
distinct represented states. (Hyperparameters are not optimised.)
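Analogously to the transitions, the emission mechanism of Figure 2 (left) can be sketched as a two-level DP over symbols. The code below assumes it mirrors the transition case, with per-state emission counts m_{iq}, oracle counts m^o_q, and the two additional emission hyperparameters written as beta_e and gamma_e; γ_e appears in the figure, while the name β_e and the per-state first level are our reading rather than something stated in the text.

import numpy as np

def sample_emission(i, m, m_o, beta_e, gamma_e, rng):
    """Emit one symbol from hidden state i under a two-level emission DP.
    m is a (K x Q) array of per-state emission counts, m_o a length-Q array of
    oracle counts.  Returns (q, m, m_o); the arrays grow when a new symbol appears."""
    Q = len(m_o)
    # Level one: re-use a symbol this state has emitted before (weight m[i, q]),
    # or consult the emission oracle (weight beta_e).
    weights = np.append(m[i], beta_e)
    choice = rng.choice(Q + 1, p=weights / weights.sum())
    if choice < Q:
        q = choice
    else:
        # Oracle level: an existing symbol (weight m_o[q]) or a new one (weight gamma_e).
        o_weights = np.append(m_o, gamma_e)
        q = rng.choice(Q + 1, p=o_weights / o_weights.sum())
        if q == Q:                            # new symbol: extend the alphabet
            m = np.pad(m, ((0, 0), (0, 1)))
            m_o = np.append(m_o, 0.0)
        m_o[q] += 1
    m[i, q] += 1
    return q, m, m_o

Starting from m = np.zeros((K, 0)) and m_o = np.zeros(0), repeated calls let the alphabet grow without bound, which is what makes modelling a stream of English words (Figure 2, middle) natural in this framework.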
5 Synthetic experiments
Exp 1: Discovering the number of hidden states We applied the infinite HMM inference algorithm to the ascending-descending observation sequence consisting of 30 concatenated copies of ABCDEFEDCB. The most parsimonious HMM which models this data perfectly has exactly 10 hidden states. The infinite HMM was initialised with a random hidden state sequence, containing K distinct represented states. In Figure 2 (right) we
show how the number of represented states evolves with successive Gibbs sweeps, starting
from a variety of initial K. In all cases K converges to 10, while occasionally exploring 9
and 11.
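For reference, the training sequence of Exp 1 is fully specified by the text and can be constructed as below; ten states are needed because each of B, C, D and E occurs in both an ascending and a descending context (two states apiece), while A and F each require only one.

# Exp 1 training data: 30 concatenated copies of the ascending-descending motif.
motif = "ABCDEFEDCB"
sequence = motif * 30                                  # length 300, 6 distinct symbols
observations = [ord(c) - ord("A") for c in sequence]   # integer-coded observations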
Exp 2: Expansive A sequence of length T = 800 was generated from a 4-state 8-symbol
HMM with the transition and emission probabilities as shown in Figure 3 (top left).
Exp 3: Compressive A sequence of length T = 800 was generated from a 4-state 3-symbol
HMM with the transition and emission probabilities as shown in Figure 3 (bottom left).
In both Exp 2 and Exp 3 the infinite HMM was initialised with a hidden state sequence
with K = 20 distinct states. Figure 3 shows that, over successive Gibbs sweeps and hyperparameter learning, the count matrices for the infinite HMM converge to resemble the true
probability matrices as shown on the far left.
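One way to make the visual comparison in Figure 3 quantitative is to row-normalise the learned count matrices before comparing them with the true probability matrices; the helper below is purely illustrative and not part of the paper's procedure.

import numpy as np

def row_normalise(counts, eps=1e-12):
    """Turn a matrix of transition or emission counts into a stochastic matrix
    whose rows sum to one (eps guards against rows with no counts)."""
    counts = np.asarray(counts, dtype=float)
    return counts / (counts.sum(axis=1, keepdims=True) + eps)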
6 Discussion
We have shown how a two-level hierarchical Dirichlet process can be used to define a nonparametric Bayesian HMM. The HDP implicitly integrates out the transition and emission
parameters of the HMM. An advantage of this is that it is no longer necessary to constrain
the HMM to have finitely many states and observation symbols. The prior over hidden state
transitions defined by the HDP is capable of producing a wealth of interesting trajectories
by varying the three hyperparameters that control it.
We have presented the necessary tools for using the infinite HMM, namely a linear-time
approximate Gibbs sampler for inference, equations for hyperparameter learning, and a
particle filter for likelihood evaluation.
(Footnote: different particle initialisations apply if we do not assume that the test sequence immediately follows the training sequence.)
[Figure 3: true transition and emission probability matrices used for Exp 2 (top left) and Exp 3 (bottom left), together with the infinite HMM's learned transition and emission count matrices.]
On synthetic data we have shown that the infinite HMM discovers both the appropriate
number of states required to model the data and the structure of the emission and transition
matrices. It is important to emphasise that although the count matrices found by the infinite
HMM resemble point estimates of HMM parameters (e.g. Figure 3), they are better thought
of as the sufficient statistics for the HDP posterior distribution over parameters.
We believe that for many problems the infinite HMM's flexible nature and its ability to automatically determine the required number of hidden states make it superior to the conventional treatment of HMMs, with its associated difficult model selection problem. While
the results in this paper are promising, they are limited to synthetic data; in future we hope
to explore the potential of this model on real-world problems.
Acknowledgements
The authors would like to thank David MacKay for suggesting the use of an oracle, and
Quaid Morris for his Perl expertise.
References
[1] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric
problems. Annals of Statistics, 2(6):1152–1174, 1974.
[2] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics,
1(2):209–230, March 1973.
[3] D. J. C. MacKay. Ensemble learning for hidden Markov models. Technical report, Cavendish
Laboratory, University of Cambridge, 1997.
[4] D. J. C. MacKay and L. C. Peto. A hierarchical Dirichlet language model. Natural Language
Engineering, 1(3):1–19, 1995.
[5] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Technical
Report 9815, Dept. of Statistics, University of Toronto, 1998.
[6] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE Acoustics,
Speech & Signal Processing Magazine, 3:4–16, 1986.
[7] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information
Processing Systems 12, Cambridge, MA, 2000. MIT Press.
[8] A. Stolcke and S. Omohundro. Hidden Markov model induction by Bayesian model merging. In
S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing
Systems 5, pages 11–18, San Francisco, CA, 1993. Morgan Kaufmann.