On-Line Viterbi Algorithm For Analysis of Long Biological Sequences
Abstract. Hidden Markov models (HMMs) are routinely used for anal-
ysis of long genomic sequences to identify various features such as genes,
CpG islands, and conserved elements. A commonly used Viterbi algo-
rithm requires O(mn) memory to annotate a sequence of length n with
an m-state HMM, which is impractical for analyzing whole chromosomes.
In this paper, we introduce the on-line Viterbi algorithm for decoding
HMMs in much smaller space. Our analysis shows that our algorithm
has expected maximum memory Θ(m log n) on two-state HMMs. We
also experimentally demonstrate that our algorithm significantly reduces
memory of decoding a simple HMM for gene finding on both simulated
and real DNA sequences, without a significant slow-down compared to
the classical Viterbi algorithm.
1 Introduction
Hidden Markov models (HMMs) are generative probabilistic models that have
been successfully used for annotation of protein and DNA sequences. Their nu-
merous applications in bioinformatics include gene finding [1], protein secondary
structure prediction [2], promoter detection [3], and CpG island detection [4].
More complex phylogenetic HMMs are used to analyze multiple sequence align-
ments in comparative gene finding [5] and detection of conserved elements [6].
The linear-time Viterbi algorithm [7] is the most commonly used algorithm for
these tasks. Unfortunately, the space required by the Viterbi algorithm grows lin-
early with the length of the sequence (with a high constant factor), which makes
it unsuitable for analysis of very long sequences, such as whole chromosomes
or whole-genome multiple alignments. In this paper, we address this problem
by proposing an on-line Viterbi algorithm that on average requires much less
memory and that can even annotate continuous streams of data on-line without
reading the complete input sequence first.
An HMM, composed of states and transitions, is a probabilistic model that
generates sequences over a given alphabet. In each step of this generative process,
the current state generates one symbol of the sequence according to the emission
probabilities associated with that state. Then, an outgoing transition is randomly
chosen according to the transition probability table, and this transition is followed
to the new state. This process is repeated until the whole sequence is generated.
The states of the HMM represent distinct features of the observed sequences
(such as protein coding and non-coding sequences in a genome), and the emission
probabilities in each state represent statistical properties of these features. The
HMM thus defines a joint probability Pr(X, S) over all possible sequences X
and all state paths S through the HMM that could generate these sequences.
To annotate a given sequence X, we find the state path S that maximizes this
joint probability. For example, in an HMM with one state for protein-coding
sequences, and one state for non-coding sequences, the most probable state path
marks each symbol of sequence X as either protein coding or non-coding.
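To make the generative process above concrete, the following minimal sketch (ours, not from the paper) samples a sequence and its state path from a small two-state HMM; the state labels and all probabilities are illustrative assumptions.

    import random

    # Illustrative two-state HMM: state 0 ~ non-coding, state 1 ~ coding.
    trans = [[0.95, 0.05],   # trans[k][j]: probability of transition k -> j
             [0.10, 0.90]]
    emit = [{'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3},   # state 0 emissions
            {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2}]   # state 1 emissions

    def generate(n, state=0):
        """Sample a (sequence, state path) pair of length n from the HMM."""
        seq, path = [], []
        for _ in range(n):
            symbols, probs = zip(*emit[state].items())
            seq.append(random.choices(symbols, probs)[0])    # emit a symbol
            path.append(state)
            state = random.choices([0, 1], trans[state])[0]  # follow a transition
        return ''.join(seq), path

    seq, path = generate(20)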
To compute the most probable state path, we use the Viterbi dynamic pro-
gramming algorithm [7]. For every prefix X_1 ... X_i of sequence X and for every
state j, we compute the most probable state path generating this prefix ending
in state j. We store the probability of this path in table P (i, j) and its second
last state in table B(i, j). These values can be computed from left to right, using
the recurrence P(i, j) = max_k {P(i − 1, k) · t_k(j) · e_j(X_i)}, where t_k(j) is the
transition probability from state k to state j, and e_j(X_i) is the emission probability
of symbol X_i in state j. Back pointer B(i, j) is the value of k that maximizes
P(i, j). After computing these values, we can recover the most probable state
path S = s_1, ..., s_n by setting the last state s_n = arg max_k {P(n, k)}, and
then following the back pointers from right to left (i.e., s_i = B(i + 1, s_{i+1})).
For an HMM with m states and a sequence X of length n, the running time of
the Viterbi algorithm is Θ(nm²), and the space is Θ(nm).
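A direct transcription of this recurrence follows (our sketch, not the paper's code). Log probabilities are used to avoid numeric underflow on long sequences, and init denotes initial state probabilities, which the text above leaves implicit; all probabilities are assumed nonzero.

    import math

    def viterbi(seq, init, trans, emit):
        """Classical Viterbi decoding.
        P[i][j]: log-probability of the best path for seq[0..i] ending in state j.
        B[i][j]: second-to-last state of that path (the back pointer)."""
        m, n = len(init), len(seq)
        P = [[0.0] * m for _ in range(n)]
        B = [[0] * m for _ in range(n)]
        for j in range(m):
            P[0][j] = math.log(init[j]) + math.log(emit[j][seq[0]])
        for i in range(1, n):
            for j in range(m):
                best = max(range(m), key=lambda k: P[i - 1][k] + math.log(trans[k][j]))
                P[i][j] = (P[i - 1][best] + math.log(trans[best][j])
                           + math.log(emit[j][seq[i]]))
                B[i][j] = best
        # Recover the most probable state path from right to left.
        path = [max(range(m), key=lambda j: P[n - 1][j])]
        for i in range(n - 1, 0, -1):
            path.append(B[i][path[-1]])
        return path[::-1]

Note that only two columns of P are needed at any moment; it is the back pointer table B that forces the Θ(nm) space.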
This algorithm is well suited for sequences and models of moderate size.
However, to annotate all 250 million symbols of the human chromosome 1 with
a gene finding HMM consisting of a hundred states, we would require 25 GB of
memory to store the back pointers B(i, j). This is clearly impractical on most
computational platforms.
Several solutions are used in practice to overcome this problem. For example,
most practical gene finding programs process only sequences of limited size. The
long input sequence is split into several shorter sequences which are processed
separately. Afterwards, the results are merged and conflicts are resolved heuris-
tically. This approach leads to suboptimal solutions, especially if the genes we
are looking for cross the boundaries of the split.
Grice et al. [8] proposed a checkpointing algorithm that trades running time
for space. We divide the input sequence into K blocks of L symbols, and during
the forward pass, we only keep the first column of each block. To obtain the most
probable state path, we recompute the last block and use the back pointers to
recover the last L states of the path, as well as the last state of the previous
block. This information can now be used to recompute the most probable state
path within the previous block in the same way, and the process is repeated for
all blocks. If we set K = L = ⌈√n⌉, this algorithm only requires Θ(√n + m√n)
memory, at the cost of additional running time for recomputing the blocks.
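The trade-off is easy to quantify. A back-of-the-envelope comparison (our own illustration; we assume one byte per back pointer, which suffices for up to 256 states, and we count checkpoint columns and back-pointer columns as the same size):

    import math

    n, m = 250_000_000, 100        # chromosome 1 with a 100-state gene finder

    full = n * m                   # classical Viterbi: Theta(nm) back pointers
    K = L = math.isqrt(n)          # checkpointing with K = L = ceil(sqrt(n))
    checkpointed = (K + L) * m     # Theta(sqrt(n)*m): checkpoints + one block

    print(f"classical: {full / 1e9:.0f} GB")          # ~25 GB, as in the text
    print(f"checkpointing: {checkpointed / 1e6:.1f} MB")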
[Figure 1 appears here; axes: states (vertical) by sequence positions (horizontal).]
Fig. 1. Example of the back pointer tree structure. Dashed lines mark the edges
that cannot be part of the most probable state path. The square marks the coalescence
point of the remaining paths.
In our algorithm, we represent the back pointer matrix B in the Viterbi algorithm
by a tree structure (see [7]), with node (i, j) for each sequence position i and
state j. The parent of node (i, j) is the node (i − 1, B(i, j)). In this data structure, the
most probable state path is the path from the leaf node (n, j) with the highest
probability P (n, j) to the root of the tree (see Figure 1).
This tree is built as the Viterbi algorithm progresses from left to right. After
computing column i, all edges that do not lie on one of the paths ending in column i
can be removed; these edges will not be used in the most probable path [11]. The
remaining m paths represent all possible initial segments of the most probable
state path. These paths are not necessarily edge disjoint; in fact, often all the
paths share the same prefix up to some node that we call a coalescence point
(see Figure 1).
To the left of the coalescence point there is only a single candidate for the initial
segment of the most probable state path. Therefore we can output this segment
and remove all edges and nodes of the tree up to the coalescence point. Forney
[7] describes an algorithm that after processing D symbols of the input sequence
checks whether a coalescence point has been reached; if so, the initial
segment of the most probable state path is output. If the coalescence point
was not reached, one potential initial segment is chosen heuristically. Several
[Figure 2 appears here: a start state s and three states labeled 0, 1, and 2.]
Fig. 2. An HMM requiring Ω(n) space. Every state emits only the symbol shown,
with probability 1. Transition probability is evenly divided among transitions
outgoing from a given state. For sequences of the form s{0, 1}^n{0, 2}, any correct
decoding algorithm must retain some representation of the input before discovering
whether the last symbol is 0 or 2.
studies [12, 13] suggest how to choose D to limit the expected error caused by
such heuristic steps in the context of convolutional codes.
Here we show how to detect the existence of a coalescence point dynamically
without introducing significant overhead to the whole computation. We maintain
a compressed version of the back pointer tree, where we omit all internal nodes
that have less than two children. Any path consisting of such nodes will be
contracted to a single edge. This compressed tree has m leaves and at most
m − 1 internal nodes. Each node stores the number of its children and a pointer
to its parent node. We also keep a linked list of all the nodes of the compressed
tree ordered by the sequence position. Finally, we also keep the list of pointers
to all the leaves.
When processing the i-th sequence position in the Viterbi algorithm, we
update the compressed tree as follows. First, we create a new leaf for each node
at position i, link it to its parent (one of the former leaves), and insert it into the
linked list. Once these new leaves are created, we remove all the former leaves
that have no children, and recursively all of their ancestors that would not have
any children.
Finally, we need to compress the new tree: we examine all the nodes in the
linked list in order of decreasing sequence position. If the node has zero or one
child and is not a current leaf, we simply delete it. For each leaf or node that has
at least two children, we follow the parent links until we find its first ancestor
(if any) that has at least two children and link the current node directly to that
ancestor. A node (ℓ, j) that does not have an ancestor with at least two children
is the coalescence point; it will become a new root. We can output the most
probable state path for all sequence positions up to ℓ, and remove all results of
computation for these positions from memory.
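The per-position update can be sketched in code. The following Python sketch (ours, deliberately simplified) keeps every surviving node with a child count and a parent pointer instead of maintaining the compressed tree and the ordered linked list described above; the per-step overhead is worse, but the pruning logic and the coalescence test are the same.

    class Node:
        __slots__ = ('pos', 'state', 'parent', 'nchildren')
        def __init__(self, pos, state, parent):
            self.pos, self.state, self.parent = pos, state, parent
            self.nchildren = 0
            if parent is not None:
                parent.nchildren += 1

    def advance(leaves, pos, backptr):
        """Extend the back pointer tree by one column and prune dead branches.
        leaves[j] is the node for state j at the previous position and
        backptr[j] is the Viterbi back pointer B(pos, j)."""
        new_leaves = [Node(pos, j, leaves[backptr[j]]) for j in range(len(leaves))]
        for node in leaves:                 # former leaves with no children die,
            while node is not None and node.nchildren == 0:
                if node.parent is not None: # ...and so do childless ancestors
                    node.parent.nchildren -= 1
                node = node.parent
        return new_leaves

    def coalescence_point(leaf):
        """The branching node closest to the root is an ancestor of all current
        leaves; everything above it is a single path that can be output."""
        node, coal = leaf.parent, None
        while node is not None:
            if node.nchildren >= 2:
                coal = node
            node = node.parent
        return coal

Once coalescence_point returns a node, the unique path from the current root down to that node is a fixed prefix of the most probable state path and can be output, as described above.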
The running time of this update is O(m) per sequence position, and the
representation of the compressed tree takes O(m) space. Thus the asymptotic
running time of the Viterbi algorithm is not increased by the maintenance of
the compressed tree. Moreover, we have implemented both the standard Viterbi
algorithm and our new on-line extension, and the time measurements suggest
that the overhead required for the compressed tree updates is less than 5%.
The worst-case space required by this algorithm is still O(nm). In fact, any
algorithm that correctly finds the most probable state path for the HMM shown
in Figure 2 requires Ω(n) space in the worst case.
However, our algorithm rarely requires linear space for realistic data; the
space changes dynamically depending on the input. In the next section, we show
that for simple HMMs the expected maximum space required for processing a
sequence of length n is Θ(m log n). This is much better than checkpointing,
which requires Θ(√n + m√n) space with a significant increase in running time.
We conjecture that this trend extends to more complex cases. We also present
experimental results on a gene finding HMM and real DNA sequences showing
that the on-line Viterbi algorithm leads to significant savings in memory.
Another advantage of our algorithm is that it can construct initial segments
of the most probable state path before the whole input sequence is read. This
feature makes it ideal for on-line processing of signal streams (such as sensor
readings).
[Figure 3 appears here: states A and B with self-transition probability 1 − t and
cross-transition probability t; A emits 0 with probability 1 − e and 1 with
probability e, B the reverse; panel (b) shows back-pointer configurations,
including configurations iii and iv.]
Fig. 3. (a) Symmetric two-state HMM with two parameters: e for emission
probabilities and t for transition probabilities. (b) Possible back-pointer
configurations for a two-state HMM.
K = ⌈2 · (log(1 − t) − log(t)) / (log(1 − e) − log(e))⌉. The quantity p_A − p_B is
updated by log(1 − e) − log(e) if the symbol at the corresponding sequence position
is 0, or by log(e) − log(1 − e) if this symbol is 1. This corresponds to updating X
by +1 or −1.
When X reaches 0, we have a coalescence point in configuration iii, and
p_A − p_B is reinitialized to log t − log(1 − t) ± (log e − log(1 − e)), which means
either initialization of X to +1, or another coalescence point, depending on the symbol
at the corresponding sequence position. The other case, when X reaches K and
we have a coalescence point in configuration ii, is symmetric. We can now apply
the classical results from the theory of random walks (see [14, ch.14.3,14.5]) to
analyze the expected length of runs.
Lemma 1. Assuming that the input sequence is uniformly i.i.d., the expected
length of a run of a symmetric two-state HMM is K − 1.
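Lemma 1 is easy to check numerically. The sketch below (ours) computes K from e and t and estimates the expected run length by simulating the symmetric ±1 walk on (0, K) with absorbing barriers, started at 1 as in the discussion above; the parameters e = 0.3, t = 0.2 are illustrative only.

    import math, random

    def walk_interval(e, t):
        """K as defined above; assumes e, t < 1/2."""
        return math.ceil(2 * (math.log(1 - t) - math.log(t))
                           / (math.log(1 - e) - math.log(e)))

    def mean_run_length(K, trials=100_000):
        total = 0
        for _ in range(trials):
            x, steps = 1, 0
            while 0 < x < K:                 # walk until absorbed at 0 or K
                x += random.choice((-1, 1))
                steps += 1
            total += steps
        return total / trials

    K = walk_interval(e=0.3, t=0.2)
    print(K, mean_run_length(K))             # empirical mean is close to K - 1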
The larger K is, the more memory is required to decode the HMM; the worst
case occurs as e approaches 1/2 and the two states become indistinguishable. From
the theory of random walks, we can characterize the distribution of run lengths.
Lemma 2. Let R_ℓ be the event that the run length of a symmetric two-state
HMM is either 2ℓ + 1 or 2ℓ + 2. Then, assuming that the input sequence is
uniformly i.i.d., for some constants b, c > 0:

    b · cos^{2ℓ}(π/K) ≤ Pr(R_ℓ) ≤ c · cos^{2ℓ}(π/K)    (1)
Proof. For a symmetric random walk on interval (0, K) with absorbing barriers
and with starting point z, the probability of event W_{z,n} that this random walk
ends in point 0 after n steps is zero if n − z is odd, and the following quantity
if n − z is even [14, ch.14.5]:

    Pr(W_{z,n}) = (2/K) · Σ_{0<v<K/2} cos^{n−1}(πv/K) · sin(πv/K) · sin(πzv/K)    (2)
Using symmetry, note that the probability of the same random walk ending after
n steps at barrier K is the same as the probability of W_{K−z,n}. Thus, if K is odd:
There are at most K/4 terms in the sum and they can all be bounded from
above by cos^{2ℓ}(πv/K). Thus, we can give both upper and lower bounds on Pr(R_ℓ)
using only the first term of the sum as follows:

    (4/K) · sin²(π/K) · cos^{2ℓ}(π/K) ≤ Pr(R_ℓ) ≤ cos^{2ℓ}(π/K)    (4)
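Formula (2) from the proof above can be cross-checked by direct simulation; a small sketch (ours, with illustrative parameters z = 1, n = 9, K = 5):

    import math, random

    def absorb_at_0(z, n, K):
        """Formula (2): probability that the symmetric walk started at z on
        (0, K) is absorbed at barrier 0 exactly at step n."""
        if (n - z) % 2:
            return 0.0
        return 2.0 / K * sum(
            math.cos(math.pi * v / K) ** (n - 1)
            * math.sin(math.pi * v / K) * math.sin(math.pi * z * v / K)
            for v in range(1, (K + 1) // 2))    # sum over 0 < v < K/2

    z, n, K, trials = 1, 9, 5, 200_000
    hits = 0
    for _ in range(trials):
        x = z
        for step in range(1, n + 1):
            x += random.choice((-1, 1))
            if x in (0, K):                     # walk absorbed at a barrier
                hits += (x == 0 and step == n)
                break
    print(absorb_at_0(z, n, K), hits / trials)  # the two values should agree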
    Pr(Z_n ≤ x) ≥ (1 − c·a^x)^n
               = (1 − a^{x − log_{1/a}(c)})^n
               ≥ (1 − a^{x − log_{1/a}(c) − 1})^n
               = Pr(Z'_n ≤ x − log_{1/a}(c) − 1)
               = Pr(Z'_n + log_{1/a}(c) + 1 ≤ x)

This inequality holds even for x < log_{1/a}(c) + 1, since the right-hand side is zero
in that case. Therefore, E[Z_n] ≤ E[Z'_n + log_{1/a}(c) + 1] = E[Z'_n] + O(1). The expected
value of Z'_n is log_{1/a}(n) + o(log n) [17], which proves our claim.
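The appeal to [17] in the last step can also be illustrated numerically: the maximum of n i.i.d. geometrically decaying variables concentrates around log_{1/a}(n). A sketch (ours; the tail used here is a^{x−1} rather than exactly a^x, which only shifts the maximum by a constant):

    import math, random

    def max_of_geometrics(n, a):
        """Maximum of n i.i.d. variables with Pr(X >= x) = a**(x - 1)."""
        return max(math.ceil(math.log(1 - random.random()) / math.log(a))
                   for _ in range(n))

    a = 0.8
    for n in (10**3, 10**4, 10**5):
        runs = [max_of_geometrics(n, a) for _ in range(30)]
        print(n, sum(runs) / len(runs), math.log(n, 1 / a))  # ~log_{1/a}(n)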
Using the results of Lemma 3 together with the characterization of run-length
distributions by Lemma 2, we can conclude that for symmetric two-state HMMs, the
expected maximum memory required to process a uniform i.i.d. input sequence
of length n is (1/ln(1/cos(π/K))) ln n + o(log n).⁴ Using the Taylor expansion of
the constant term as K grows to infinity, 1/ln(1/cos(π/K)) = 2K²/π² + O(1),
we obtain that the maximum memory grows approximately as (2K²/π²) ln n.
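The Taylor expansion step, spelled out (a standard small-angle computation):

\[
\ln\frac{1}{\cos(\pi/K)} \;=\; -\ln\!\Bigl(1 - \frac{\pi^2}{2K^2} + O(K^{-4})\Bigr)
\;=\; \frac{\pi^2}{2K^2} + O(K^{-4}),
\]
\[
\text{and therefore}\qquad
\frac{1}{\ln\bigl(1/\cos(\pi/K)\bigr)} \;=\; \frac{2K^2}{\pi^2}\bigl(1 + O(K^{-2})\bigr)
\;=\; \frac{2K^2}{\pi^2} + O(1).
\]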
The asymptotic bound Θ(log n) can be easily extended to sequences that
are generated by the symmetric HMM itself, instead of uniform i.i.d. The underlying
process can be described as a random walk with approximately 2K states on
two copies of the interval (0, K), each copy corresponding to sequence symbols generated by one
of the two states. The distribution of run lengths still decays geometrically as
required by Lemma 3; the base of the exponent is the largest eigenvalue of the
transition matrix with absorbing states omitted (see e.g. [18, Claim 2]).
⁴ We omitted the first run, which has a different starting point and thus does not
follow the distribution outlined in Lemma 2. However, the expected length of this
run does not depend on n and thus contributes only a lower-order term. We also
omitted the runs of length one that start outside the interval (0, K); these runs again
contribute only to lower order terms of the lower bound.
Theorem 1. The expected maximum memory used by the on-line Viterbi algo-
rithm on a symmetric two-state HMM with t, e < 1/2 is Θ(m log n), assuming
sequences generated by either uniform i.i.d. or the HMM itself. Moreover, for
the uniform i.i.d. sequences, the constant factor is approximately 2K²/π², where
K = ⌈2 · (log(1 − t) − log(t)) / (log(1 − e) − log(e))⌉.
[Figure 4 appears here. (a) Table length along a section of human chromosome 1
(positions 15.2M–15.5M). (b) Maximum table length versus sequence length
(0–20M). Vertical axes show table length (ticks 10K–40K).]
for self-transitions. We have tested the algorithm for m ≤ 6 and sequences gen-
erated both by a uniform i.i.d. process, and by the HMM itself. Observed data
are consistent with the logarithmic growth of average maximum memory needed
to decode a sequence of length n (data not shown).
We have also evaluated the algorithm using a simplified HMM for gene finding
with 265 states. The emission probabilities of the states are defined using
Markov chains of order at most four, and the structure of the HMM reflects known
properties of genes (similar to the structure shown in [19]). The HMM was
trained on RefSeq annotations of human chromosomes 1 and 22.
In gene finding, we segment the input DNA sequence into exons (protein-
coding sequence intervals), introns (non-coding sequence separating exons within
a gene), and intergenic regions (sequence separating genes). A common measure
of accuracy is exon sensitivity (the fraction of real exons that we predict exactly).
The implementation used here has an exon sensitivity of 37% on
the ENCODE testing set [20]. A realistic gene finder, such as ExonHunter [21],
achieves a sensitivity of 53%. This difference is due to additional features that are
not implemented in our test, namely GC content levels, non-geometric length
distributions, and sophisticated signal models.
We have tested the algorithm on 20 MB long sequences: regions from the hu-
man genome, simulated sequences generated by the HMM, and i.i.d. sequences
(see Figure 4b). Regions of the human genome were chosen from hg18 assem-
bly so that they do not contain sequencing gaps. The distribution for the i.i.d.
sequences mirrors the distribution of bases in the human chromosome 1. The av-
erage maximum length of the table over several samples appears to grow faster
than logarithmically with the length of the sequence, though it seems to be
bounded by c log² n. It is not clear whether the faster growth is an artifact that
would disappear with longer sequences or a larger number of samples.
The HMM for gene finding has a special structure, with three copies of the
state for introns that have the same emission probabilities and the same self-
transition probability. In two-state symmetric HMMs, similar emission proba-
bilities of the two states lead to an increase in the length of individual runs. Intron
states of a gene finder are an extreme example of this phenomenon.
Nonetheless, on average a table of length roughly 100,000 is sufficient to
process sequences of length 20 MB, which is a 200-fold improvement compared
to the trivial Viterbi algorithm. In addition, the length of the table did not
exceed 222,000 on any of the 20MB human segments. As we can see in Figure
4a, most of the time the program keeps only a relatively short table; the average
length on the human segments is 11,000. The low average length can be a
significant advantage if multiple processes share the same memory.
4 Conclusion
In this paper, we introduced the on-line Viterbi algorithm. Our algorithm is
based on efficient detection of coalescence points in the trees representing the
state paths under consideration by the dynamic programming algorithm. The algo-
rithm requires variable space that depends on the HMM and on the local prop-
erties of the analyzed sequence. Memory savings enable analysis of long biolog-
ical sequences in gene finding, promoter detection, and detection of conserved
elements. Previous approaches often artificially split the sequence into smaller
pieces which can negatively influence the prediction accuracy.
For two-state symmetric HMMs, we have shown that the expected maximum
memory needed for a sequence of length n is approximately only (2K²/π²) ln n.
Our experiments on both simulated and real data suggest that the asymptotic
bound Θ(m log n) also extends to multi-state HMMs and that, in fact, for most of
the execution the algorithm uses much less memory.
Our algorithm can also be used for on-line processing of streamed sequences;
all previous algorithms that are guaranteed to produce the optimal state path
require the whole sequence to be read before the output can be started.
There are still many open problems. We have only been able to analyze the
algorithm for two-state HMMs, though trends predicted by our analysis seem to
generalize even to more complex cases. Can our analysis be extended to multi-
state HMMs? Apparently, the design of the HMM affects the memory needed by
the decoding algorithm; for example, the presence of states with similar emission
probabilities tends to increase memory requirements. Is it possible to characterize
HMMs that require large amounts of memory to decode? Can we characterize
the states that are likely to serve as coalescence points?