sklearn.hmm ideas #1817

Closed
kmike opened this issue Mar 27, 2013 · 49 comments

@kmike
Contributor

kmike commented Mar 27, 2013

Hi,

I think there are some things in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/hmm.py that could be improved.

_BaseHMM.algorithm handling

  • _BaseHMM.decode has an 'algorithm' argument that has no effect on the decoding algorithm if an algorithm was already set in the constructor OR if _BaseHMM.fit was called.
  • Getting/setting of the algorithm in _BaseHMM code often bypasses the property (there are direct accesses to _algorithm).
  • It is not possible to change the list of supported algorithms because decoder_algorithms is not a class attribute (and it is an immutable tuple), so the purpose of this variable is unclear.

Ideas:

  • Make the decode_viterbi and decode_map methods public OR make the 'algorithm' argument override the default algorithm (this would require changing the argument's default to None); a sketch of the latter follows this list.
    I'd prefer making decode_viterbi and decode_map public, because flags that completely change what a function does make for a bad API (see e.g. http://mail.python.org/pipermail/python-dev/2005-August/055907.html). Is it correct that "decode" is required in the first place because of grid search, etc.?
  • always use a property to access 'algorithm' to make the code cleaner;
  • kill 'decoder_algorithms' (and use simple ifs) OR make it a class attribute of _BaseHMM.
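
A minimal sketch of what the override variant could look like (hypothetical code, only loosely modeled on sklearn/hmm.py; _decode_viterbi and _decode_map exist there, the rest is illustrative):

class _BaseHMM(object):
    # Made a class attribute so subclasses could extend it (third idea above).
    decoder_algorithms = ("viterbi", "map")

    def __init__(self, algorithm="viterbi"):
        self.algorithm = algorithm

    def decode(self, obs, algorithm=None):
        # None means "use the default chosen in the constructor";
        # an explicit value always overrides it, even after fit().
        if algorithm is None:
            algorithm = self.algorithm
        if algorithm not in self.decoder_algorithms:
            raise ValueError("unknown decoder algorithm: %r" % algorithm)
        return getattr(self, "_decode_%s" % algorithm)(obs)

    def _decode_viterbi(self, obs):
        raise NotImplementedError  # implemented in the real class

    def _decode_map(self, obs):
        raise NotImplementedError  # implemented in the real class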

I don't understand why _BaseHMM.fit sets algorithm.

Docstrings

  • The _BaseHMM._decode_map docstring says: "Find most likely state sequence corresponding to obs". In my understanding this is a bit misleading, because (unlike _BaseHMM._decode_viterbi) it maximizes the likelihood of individual states, not the likelihood of the whole sequence.
  • params and init_params are not properly documented. For example, for MultinomialHMM the documented 'c' is not valid, and the most interesting 'e' is not documented;
  • the module docstring says the module is to be removed in scikit-learn 0.11, but we are at 0.14dev and it has not been removed; what are the plans for sklearn.hmm?

Ideas: fix this.

fit method

  • There is no supervised fit variant for MultinomialHMM.
  • The 'obs' parameter name is quite obscure because it means different things in fit vs. all other methods: it is a list of sequences for fit, but an individual sequence in the other methods.
  • I don't understand the note in the _BaseHMM.fit docstring: _BaseHMM shouldn't care about covars_prior, and there is no pruning in the current implementation.

Ideas:

  • change _BaseHMM.fit signature to def fit(self, X, y=None)

My knowledge is lacking and I'm clueless about GaussianHMM and GMMHMM - does supervised learning make sense for them?

Normalization

  • I think there are cases when "hard" zeros in the transition and emission matrices are wanted, and cases when the matrices should be sparse, but sklearn.hmm unconditionally adds EPS to each element of transmat and emissionprob, and this can be an issue.

I don't know if normalization is a real issue - maybe there are better solutions for sparse problems (like n-gram POS tagging with |tagset| > 1000) - but sklearn.hmm is currently unusable for such problems, at least because of the normalization.

misc

  • I think the default _compute_log_likelihood and _generate_sample_from_state implementations should raise NotImplementedError instead of returning None.
  • tiny cosmetic changes to _hmmc.pyx: kmike@ae0c4ab (xrange -> range; unneeded dashes are removed)

I'm looking for feedback (and would be happy to submit a pull request with fixes for some of these issues if that makes sense).

@GaelVaroquaux
Member

Hey Mikhail,

All good remarks. The problem is that we are seriously considering phasing out HMMs from scikit-learn, because they are the only objects in the library that deal with structured output, and we cannot find an API that accommodates that.

What's the status on that, @amueller?

@mblondel
Member

I had a look at the HMM API recently to reply to someone's question on the ML. The main concern I have with the current API is not so much how we represent sequences but more that some methods (namely predict, predict_proba and score) operate on a single instance (sequence), while in the rest of the scikit, they operate on several instances (a set of vectors, i.e., a matrix).

With HMMs, the basic data instance is not the vector but the sequence of vectors (each sequence possibly having a variable length). To be compatible with our API, fit, predict, predict_proba and score should thus accept a list of sequences (of length n_sequences). This is already the case for fit, so I'm fine with it. On the other hand, predict, predict_proba and score work on a single sequence. So here are my suggestions:

  • drop predict since it's just a wrapper around decode.
  • modify predict_proba so that it takes as input a list of sequences (not just a single sequence) and returns an array of shape (n_sequences,) containing the probabilities. Moreover, add an option to choose between computing the probabilities with the forward algorithm (probabilities of the sequences under the model) or the Viterbi algorithm (probabilities of the most likely state sequence, numerically more stable)
  • modify score so that it takes as input a list of sequences (not just a single sequence) and returns a float
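
A rough sketch of the proposed semantics, written as standalone wrappers (sequence_log_scores and score_sequences are hypothetical names; hmm is assumed to be a fitted model whose score() currently takes a single sequence):

import numpy as np

def sequence_log_scores(hmm, sequences):
    # Proposed predict_proba(): a list of sequences in,
    # an array of shape (n_sequences,) out.
    return np.array([hmm.score(seq) for seq in sequences])

def score_sequences(hmm, sequences):
    # Proposed score(): a list of sequences in, a single float out.
    return float(sequence_log_scores(hmm, sequences).sum())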

@GaelVaroquaux
Member

• drop predict since it's just a wrapper around decode.
• modify predict_proba so that it takes as input a list of sequences (not just a single sequence) and returns an array of shape (n_sequences,) containing the probabilities. Moreover, add an option to choose between computing the probabilities with the forward algorithm (probabilities of the sequences under the model) or the Viterbi algorithm (probabilities of the most likely state sequence, numerically more stable)
• modify score so that it takes as input a list of sequences (not just a single sequence) and returns a float

Am I wrong in thinking that these API decisions would make it impossible to use the HMM as a plug-in replacement for another estimator? That said, it is not clear to me that there is an abstraction that allows this.

@larsmans
Member

The problem with devising an API for HMMs is that it's always going to be a special case since they're univariate; other sequence models such as MEMMs or LC-CRFs are going to be different because they'd take sequences of samples, where each sample is in turn a 2-d matrix.

(To make matters worse, I've been told on MetaOptimize that multivariate HMMs are possible, but not in the multinomial case...)

@mblondel
Member

@larsmans Here's an example of a multivariate HMM using scikit-learn... (Reproduced from the message I sent earlier to the ML)

import numpy as np

from sklearn.hmm import GaussianHMM

# generate 3 sequences containing a variable number of 4-dimensional observations
# (note: the 3 sequences are completely random; sampling from an HMM
# would be better)
seq1 = np.random.random((10,4))
seq2 = np.random.random((8,4))
seq3 = np.random.random((6,4))

# the training set consists of 3 sequences
train = [seq1, seq2, seq3]

# assume that 2 states are enough to capture the temporal behavior of the sequences
hmm = GaussianHMM(n_components=2)
hmm.fit(train)

@larsmans
Member

Ah! I was looking at MultinomialHMM. So we already have an inconsistent API.

@larsmans
Member

We could decide to just ditch the multinomial one and keep GaussianHMM. We don't have any examples to go with it, nor any utilities for transforming data into the format that MultinomialHMM wants.

@mblondel
Member

@larsmans Basically, to define an HMM, one needs a way to compute the probability that a given observation belongs to some state (the emission probability in HMM jargon). When observations are multivariate, one can use a multivariate gaussian (GaussianHMM) or a mixture of multivariate gaussians (GMMHMM) to do that. A multivariate gaussian is defined by a mean vector and a covariance matrix. So at training time, it is necessary to learn those for each possible state, plus the mixture coefficients in the case of GMMs.
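
For illustration, here is how the Gaussian emission probability just described can be computed with scipy (a sketch of the concept, not sklearn's internal code; the means and covariances are made up):

import numpy as np
from scipy.stats import multivariate_normal

# One mean vector and one covariance matrix per hidden state.
means = [np.zeros(4), np.ones(4)]
covs = [np.eye(4), 2 * np.eye(4)]

x = np.random.random(4)  # a single 4-dimensional observation

# Emission log-probability of x under each state: the log-density of
# the corresponding multivariate Gaussian.
log_emission = [multivariate_normal(mean=m, cov=c).logpdf(x)
                for m, c in zip(means, covs)]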

@mblondel
Member

For multinomial HMM, the supervised setting seems more useful (POS tagging, ...).

@mblondel
Member

BTW, the term "supervision" can be ambiguous in the context of HMMs. Depending on the application, labels can be associated with entire sequences (spoken word recognition, handwritten digit recognition) or with each observation in the sequence (POS tagging, NER, ...).

@mblondel
Member

to use the HMM as a plug-in replacement for another estimator?

What kind of interoperability are you thinking of?

@mblondel
Member

If someone knows how to create a drawing pad using tkinter or matplotlib, I can create a handwritten digit example using GaussianHMM.

@larsmans
Member

For POS tagging, you definitely need second-order models (in addition to supervised training) to achieve acceptable accuracy, so we'd have to introduce a new class or do extensive rewriting of MultinomialHMM. (And you'd also want smoothing/back-off/interpolation to get really good results, but that's doable.)

@kmike recently did some work on the NLTK HMMs to make them faster; we could refer users who want to do POS tagging there instead of duplicating the effort.

@GaelVaroquaux
Member

to use the HMM as a plug-in replacement for another estimator?

What kind of interoperability are you thinking of?

I don't know. I don't have any use case for HMMs. All I know is that so far they feel odd, and it's basically not easy to move from another part of the scikit to HMMs, both as a user and as a developer. But if people are able to use them along with the rest of the scikit (and ideally to provide an example of this), I am very happy and will gladly retract the above statement.

@mblondel
Member

Personally, I don't think HMMs can be used as a plug-in replacement for other estimators in the scikit: HMMs are just too different a beast. However, we should strive for consistency with the rest of the scikit whenever possible (to minimize surprise).

@kmike
Contributor Author

kmike commented Mar 30, 2013

To make _BaseHMM more generic, at least 2 more changes are needed:

  1. it shouldn't call np.asarray(obs) in eval, score and _decode_viterbi, because the only method that directly deals with obs is _compute_log_likelihood, which can be overridden in a subclass (e.g. to accept sequences of string tokens instead of numbers) - see the sketch after this list. In my case I don't store emissionprob because it would be huge and there are many unseen tokens; instead, emission probabilities are computed online using a morphological analyzer and some simplifying assumptions.

  2. .fit should cast obs to a list because it doesn't work with iterables otherwise - OR it should somehow be modified to work with iterators (but this is not going to be efficient unless the user is allowed to pass several iterators - e.g. iterators over the same huge file with data).
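
A hedged sketch of the kind of subclass point (1) would enable (the _compute_log_likelihood hook and the _BaseHMM constructor are real; the analyzer and its log_probs interface are hypothetical stand-ins):

import numpy as np
from sklearn.hmm import _BaseHMM

class AnalyzerHMM(_BaseHMM):
    """HMM whose emission log-probabilities are computed on the fly
    instead of being stored in an emissionprob_ matrix."""

    def __init__(self, analyzer, n_components):
        _BaseHMM.__init__(self, n_components=n_components)
        self.analyzer = analyzer  # e.g. a morphological analyzer

    def _compute_log_likelihood(self, obs):
        # obs is a sequence of raw tokens (strings), not integers.
        # Returns an (n_observations, n_components) array of log-probs;
        # analyzer.log_probs is assumed to return one row per token.
        return np.array([self.analyzer.log_probs(token) for token in obs])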


Currently POS tagging with nltk.tag.hmm is no easier than with sklearn.hmm.MultinomialHMM because they share the same limitations (all emission and transition probabilities must be stored in memory). I'm still trying to understand how to fix nltk.tag.hmm to remove these limitations. With minimal changes it is possible to fix sklearn.hmm to allow emission probabilities to be computed online (see (1)), but transition probabilities are still an issue.

My hope is to change _BaseHMM to work with sparse matrices, and then store n-gram transition probabilities (with back-off precomputed) in a sparse matrix whose state space is the Cartesian product of the original state space. If my understanding is correct, a dense matrix would require N^(2*(n-1)) cells (where n is the n-gram size), while a sparse matrix should need N^n (because many state changes are impossible) - e.g. N^4 vs N^3 for trigrams. This is still a lot of memory, but maybe forcing some probabilities to zero would help... It is possible to create an explicit second-order HMM, but (as far as I can tell) it would have the same space/time complexity (though with better constants), and it is not clear how to introduce an explicit second-order HMM to sklearn.hmm without rewriting most of the code.


So, I want to make fixes to _BaseHMM (and, most probably, to _hmmc.pyx), but I don't care about MultinomialHMM (it has too many assumptions - are there people using it?). My knowledge is extremely lacking: I don't know how GaussianHMM and GMMHMM work or what they are for, so I don't care about them either.

//By the way, _logsum from _hmmc.pyx duplicates sklearn.utils.extmath.logsumexp without "axis" support, and without "axis" support it could be made about 1.5x faster, like this: https://gist.github.com/kmike/5278404#file-math-pyx-L54
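
For reference, the axis-free computation in question looks like this in plain numpy (a sketch; the gist linked above is a Cython version):

import numpy as np

def logsumexp_1d(arr):
    # Subtract the max before exponentiating for numerical stability:
    # log(sum(exp(arr))) == vmax + log(sum(exp(arr - vmax))).
    vmax = arr.max()
    return vmax + np.log(np.exp(arr - vmax).sum())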

At this point, there are 3 ways:

a) extract sklearn.hmm into a separate package and try to fix it there (easier, as I won't have to care about interface changes, MultinomialHMM, and maybe GaussianHMM+GMMHMM);
b) try to improve it as a part of scikit-learn (harder);
c) do (a) and then maybe port some changes back to scikit-learn.

What do you think?

@kmike
Contributor Author

kmike commented Mar 30, 2013

... more on iterators: it turns out that itertools.tee has undocumented support for efficient cloning of iterators that have a __copy__ method (this doesn't require storing all consumed data in a buffer), so instead of casting to a list it may be better to use itertools.tee if iterator support in .fit is desired.
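
A sketch of what such a cloneable iterator could look like (make_iter and the data source are illustrative; note that re-opening in __copy__ is only equivalent to a true copy while the iterator is still unconsumed, which holds when tee is called up front in fit):

import itertools

class ReopeningIterator(object):
    """Iterator over a re-readable source (e.g. a file) that supports
    cheap cloning via __copy__, so itertools.tee doesn't buffer items."""

    def __init__(self, make_iter):
        self.make_iter = make_iter  # zero-argument factory
        self.it = make_iter()

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.it)

    next = __next__  # Python 2 compatibility (the era of this thread)

    def __copy__(self):
        # Re-open the source instead of buffering consumed items.
        return ReopeningIterator(self.make_iter)

passes = itertools.tee(ReopeningIterator(lambda: iter([1, 2, 3])), 3)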

@GaelVaroquaux
Member

  1. it shouldn't call np.asarray(obs) in eval, score and _decode_viterbi
    because the only method that directly deals with obs is
    _compute_log_likelihood which can be overridden in a subclass (e.g. to
    accept string token sequences instead of numbers).

The standard way to deal with string input is to use _, new_input = np.unique(input, return_inverse=True) to turn the strings into an array of integers. Arrays of integers are much more compact in memory, so there is a clear benefit to using them.
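
For example (the first return value, the sorted unique tokens, is the mapping back and is discarded here):

>>> import numpy as np
>>> _, new_input = np.unique(["dog", "cat", "dog", "bird"], return_inverse=True)
>>> new_input
array([2, 1, 2, 0])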

In my case I don't store emissionprob because it would be huge and there are many unseen tokens; instead, emission probabilities are computed online using a morphological analyzer and some simplifying assumptions.

Makes sense.

  2. .fit should cast obs to a list because it doesn't work with iterables otherwise - OR it should somehow be modified to work with iterators (but this is not going to be efficient unless the user is allowed to pass several iterators - e.g. iterators over the same huge file with data).

The way we tackle this type of problem in scikit-learn is different: we provide a 'partial_fit' method that is called successively on subsets of the data. The advantage compared to working with iterators is that it enables putting the logic for chunking the data outside of the learner, as this logic is most likely related to the data source. One example of using this pattern on biggish data is:
http://scikit-learn.org/dev/auto_examples/cluster/plot_dict_face_patches.html

Note that although the pattern is different from that of iterators, it can usually be implemented with the same algorithmic core.
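
Schematically, with a hypothetical estimator supporting partial_fit (fit_in_chunks and chunk_size are illustrative names, not scikit-learn API):

def fit_in_chunks(estimator, X, chunk_size=1000):
    # The chunking logic lives outside the learner, next to the data source.
    for start in range(0, len(X), chunk_size):
        estimator.partial_fit(X[start:start + chunk_size])
    return estimator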

//By the way, _logsum from _hmmc.pyx duplicates sklearn.utils.extmath.logsumexp without "axis" support, and without "axis" support it could be made about 1.5x faster, like this: https://gist.github.com/kmike/5278404#file-math-pyx-L54

Sounds like a definite improvement. Also, it would probably be useful to move it next to logsumexp so that people see it when they are looking for logsumexp.

a) extract sklearn.hmm into a separate package and try to fix it there (easier, as I won't have to care about interface changes, MultinomialHMM, and maybe GaussianHMM+GMMHMM);
b) try to improve it as a part of scikit-learn (harder);
c) do (a) and then maybe port some changes back to scikit-learn.

What do you think?

I don't know. If the changes are not huge, I'd say b), because you should be able to get them reviewed and merged in a reasonable amount of time, and worst case we can always separate out the code later. If the changes are large, a) is more realistic, as we won't be able to review them. For NLP, it would be really nice to have HMMs that work well in the scikit. What do others think?

@kmike
Contributor Author

kmike commented Mar 31, 2013

The standard way to deal with string input is to use _, new_input = np.unique(input, return_inverse=True) to turn the strings into an array of integers. Arrays of integers are much more compact in memory, so there is a clear benefit to using them.

Yes, but, as the emissions example shows, using numbers may be inefficient (a morphological analyzer doesn't work on numbers, and I don't want to keep a token -> number mapping in memory). Moving np.asarray(obs) to the methods that need numbers would fix this.

The way we tackle this type of problem in scikit-learn is different: we provide a 'partial_fit' method that is called successively on subsets of the data.

I think the 'partial_fit' interface won't work if n_iter > 1: we can't start the next EM iteration until all the data has been seen (and the model updated); we also can't check for convergence without all the data. On the other hand, just calling itertools.tee(X, n_iter) would make it possible to feed the data incrementally without loading more than a single observation into memory at once (by creating a custom input iterator with a __copy__ method).

Also, it would probably be useful to move it next to logsumexp so that people see it when they are looking for logsumexp.

Agreed. Should it be a new utils/_extmath.pyx extension?

@GaelVaroquaux
Member

On Sun, Mar 31, 2013 at 01:53:02AM -0700, Mikhail Korobov wrote:

The standard way to deal with string input is to use _, new_input = np.unique(input, return_inverse=True) to turn the strings into an array of integers. Arrays of integers are much more compact in memory, so there is a clear benefit to using them.

Yes, but, as the emissions example shows, using numbers may be inefficient (a morphological analyzer doesn't work on numbers, and I don't want to keep a token -> number mapping in memory).

An array of integers will have a much smaller in-memory footprint than an array (or worse, a list) of strings. In general, the rule in the scikit is to convert to arrays of integers as early as possible. In this specific case the gain may not be huge, but I would prefer that we do it the standard way. In addition, just glancing at the code, I see many places where it could be improved by relying on the fact that obs should be an array of integers (for instance, set(obs) should be replaced by np.unique(obs)). Granted, this code is not currently written in a way that relies efficiently on numpy arrays, but that's more a bug than a feature in my eyes. Rewriting it cleanly would bring improvements in memory usage and computation speed.

The way we tackle this type of problem in scikit-learn is different: we provide a 'partial_fit' method that is called successively on subsets of the data.

I think the 'partial_fit' interface won't work if n_iter > 1: we can't start the next EM iteration until all the data has been seen (and the model updated); we also can't check for convergence without all the data.

OK, so it's not a real online setting. Sorry, I was confused. I don't
know HMMs terribly well.

On the other hand, just calling itertools.tee(X, n_iter) would make it possible to feed the data incrementally without loading more than a single observation into memory at once (by creating a custom input iterator with a __copy__ method).

Yes, but we don't want to support this pattern in the scikit because it makes coding algorithms efficiently much more difficult. I see your point, and I definitely agree that it would probably help for your specific use case, but generalizing it to various learners would put a large burden on the development of the package.

Also, it would probably be useful to move it next to logsumexp so that people see it when they are looking for logsumexp.

Agreed. Should it be a new utils/_extmath.pyx extension?

I think that this would be the way to go. Thanks!

@kmike
Contributor Author

kmike commented Mar 31, 2013

An array of integers will have a much smaller in-memory footprint than an array (or worse, a list) of strings. In general, the rule in the scikit is to convert to arrays of integers as early as possible. In this specific case the gain may not be huge, but I would prefer that we do it the standard way. In addition, just glancing at the code, I see many places where it could be improved by relying on the fact that obs should be an array of integers (for instance, set(obs) should be replaced by np.unique(obs)). Granted, this code is not currently written in a way that relies efficiently on numpy arrays, but that's more a bug than a feature in my eyes. Rewriting it cleanly would bring improvements in memory usage and computation speed.

In this particular example, _BaseHMM doesn't use 'obs' in any way itself. It is passed to _compute_log_likelihood, which is empty and meant to be overridden in subclasses; it is also passed to _init but unused there. I think this interface is great (this is one of the reasons why fixing nltk.tag.hmm is harder).

So _BaseHMM doesn't need 'obs' to be integers, and casting obs to an array in _BaseHMM can't improve speed or memory footprint in any way, but it can reduce flexibility.

The subclasses of _BaseHMM provided in scikit-learn require obs to be an integer array - my point is that they should call np.asarray, not _BaseHMM, because in some cases converting obs to integers would be impossible or would require more memory (this is the case when emission probabilities are computed online - how could I encode a token unseen in training as a number, and why should I encode tokens as numbers, store the mapping somewhere, and then decode the tokens back to pass them to a morphological analyzer?).

OK, so it's not a real online setting.

Yes, the idea is to iterate over the same data several times (e.g. read it from a file several times).

Yes, but we don't want to support this pattern in the scikit because it makes coding algorithms efficiently much more difficult. I see your point, and I definitely agree that it would probably help for your specific use case, but generalizing it to various learners would put a large burden on the development of the package.

What about supporting this in sklearn.hmm without generalizing it to various learners? The fit method accepts a list of sequences, not a matrix, so supporting iterators looks quite natural here.

@amueller
Member

Sorry, I'm coming a bit late to the discussion. One more thing to consider is that HMMs in sklearn really make it hard to delimit the scope of sklearn. We rejected Kalman filters, which are very closely related. Also, we don't have supervised HMMs. Do we want them? Do we want CRFs? Do we want general graphs?

We could go with "let's try to keep this algorithm in good enough shape and not add anything more". The question then is who is able to maintain it. Recently, it didn't seem like any of the core devs had much enthusiasm for it. I think @lucidfrontier45 was the only one seriously working on it.

@kmike
Contributor Author

kmike commented Mar 31, 2013

If I go the "make a new package" route (I'm open to other routes, though), how should I acknowledge the original authors?

The scikit-learn license says that "Neither the name of the Scikit-learn Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission." - am I allowed to say that the code originated from scikit-learn and to list all the people who contributed to sklearn.hmm?

@amueller
Member

@kmike I'm pretty sure that would be ok.
I am not entirely sure, but I think permanently factoring out HMMs from sklearn would be a good thing.
It might be quite a bit of work to maintain them, though.

@kmike
Contributor Author

kmike commented Apr 1, 2013

You're right, it might be quite a bit of work to maintain, and I can't commit to maintaining anything but _BaseHMM (and maybe MultinomialHMM). My goal is to get a versatile BaseHMM with a sparse transition matrix. If you are thinking about permanently factoring out HMMs, we need more people willing to help.

Update on _logsum: I'm trying to make the transition matrices sparse; as a first step I'm rewriting _hmmc.pyx without Cython (numpy+python) to make changes easier. To my surprise, the partially vectorized numpy+python version (see https://gist.github.com/kmike/5285124) already runs about 20+% faster than the current Cython extension (measured on the eval() method, n_components=1000, len(obs)=10). So a faster _logsum may be unnecessary.

@kmike
Contributor Author

kmike commented Apr 1, 2013

Update 2: the bug in the gist mentioned above is fixed; a pure-Python vectorized version of Viterbi that is 6x faster than _hmmc._viterbi has been added.

@GaelVaroquaux
Member

Update 2: the bug in the gist mentioned above is fixed; a pure-Python vectorized version of Viterbi that is 6x faster than _hmmc._viterbi has been added.

Whatever the future of HMMs in scikit-learn is, I would be interested in
seeing a pull request implementing this. Thanks a lot!

@kmike
Contributor Author

kmike commented Apr 2, 2013

Done as #1825.

@mblondel
Member

mblondel commented Apr 2, 2013

Whatever the future of HMMs in scikit-learn is, I would be interested in
seeing a pull request implementing this. Thanks a lot!

It would be nice to create a benchmark for different numbers of states, features and levels of sparsity. I suspect that the Cython version may be faster in some cases.

@kmike
Contributor Author

kmike commented Apr 2, 2013

You're right: just checked it with 2 states, and with 2 states .fit and .eval are 2.5x slower in my pull request than in scikit-learn master; _decode_viterbi is still 1.5x faster. A proper benchmark would be better, but I don't have time to create one, sorry.

I think the Cython extension is slow because it has a lot of Python-level calls (this is visible in the annotated HTML from Cython); using typed memoryviews could make it a lot faster.

@larsmans
Member

larsmans commented Apr 2, 2013

The scikit-learn license says that "Neither the name of the Scikit-learn Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission." - am I allowed to say that the code originated from scikit-learn and to list all the people who contributed to sklearn.hmm?

Including a copyright notice is not promoting/endorsing. Also:

a. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

so you'd have to list scikit-learn somewhere. The endorse/promote clause is mostly there to prevent our names or "scikit-learn" from being used in advertising by forks.

As for the separate package route, I would be in favor of starting a scikit-structlearn/scikit-sequence toolkit and would likely contribute to it.

@amueller
Member

amueller commented Apr 2, 2013

On 04/02/2013 01:52 PM, Lars Buitinck wrote:

As for the separate package route, I would be in favor of starting a
scikit-structlearn/scikit-sequence toolkit and would likely contribute
to it.

So should I rename pystruct to scikit-structlearn ;)

@larsmans
Member

larsmans commented Apr 2, 2013

Yep. And support sparse matrices throughout... are you doing that?

@kmike
Contributor Author

kmike commented Apr 2, 2013

By the way, do you know a good way to support sparse matrices with non-zero default values (which is the case for matrices of log-probabilities, where "empty" values are -inf)? Currently I see 2 possibilities: (a) remove the log-space calculations and (b) subclass something from scipy.sparse, but (a) may cause numerical stability issues and (b) looks quite laborious.
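
One partial workaround, sketched below: keep plain probabilities in the sparse matrix and take logs of the explicitly stored entries only, so absent entries implicitly stand for log(0) = -inf. This is only a sketch (and not claimed to be the trick mentioned in the reply below); any algorithm that touches "empty" cells still needs to special-case them.

import numpy as np
from scipy import sparse

transmat = sparse.rand(5, 5, density=0.3, format="csr")  # toy probabilities
log_transmat = transmat.copy()
log_transmat.data = np.log(log_transmat.data)  # only stored (non-zero) cells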

@larsmans
Member

larsmans commented Apr 2, 2013

(a) is numerically unstable, (b) is a maintenance nightmare. @jakevdp knows tricks for this.

But are you sure sparsifying the matrices inside an HMM is worth it? The emission probabilities are usually smoothed and thus all non-zero, while the transition matrix shouldn't be that large (even a second-order model with 100 states takes just a few MB).

@kmike
Contributor Author

kmike commented Apr 2, 2013

I don't know whether sparsifying the matrices is worth it because I haven't tried it yet :) My goal is to try higher-order models, and the state space for full Russian tag disambiguation can easily be > 1500 for a first-order HMM (of course there are ways to simplify the tags, but 100 states is quite a low number). Also, I thought this was not only about memory requirements - calculations on sparse matrices should also be orders of magnitude faster, especially with higher-order models.

@kmike
Contributor Author

kmike commented Apr 2, 2013

Also, a second-order model with 100 states would take 100^4 cells (~800 MB as float64) for the transition matrix if the states are just the Cartesian product of the first-order states; to bring that down to 100^3, the algorithms have to be rewritten to handle the second-order case explicitly, another rewrite is needed for third-order models, and so on. On the other hand, sparse matrices should solve all these problems and allow HMMs of arbitrary order. A back-of-the-envelope sketch of these numbers follows.
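
A back-of-the-envelope check of those numbers (pure arithmetic, assuming 8-byte float64 cells):

# Second-order model over N = 100 tags, encoded as a first-order model
# over (previous tag, current tag) pairs:
N = 100
pair_states = N ** 2              # 10,000 composite states
dense_cells = pair_states ** 2    # 100**4 = 100,000,000 cells
dense_bytes = dense_cells * 8     # ~800 MB as float64
# Only (a, b) -> (b, c) transitions are possible, so at most
reachable_cells = N ** 3          # 1,000,000 cells need to be stored sparsely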

@larsmans
Member

larsmans commented Apr 2, 2013

I was thinking exactly of a specialized second-order Viterbi algorithm. As you point out, that gives a big saving in complexity, and I think it should be preferred to a generic k-order HMM. Models of order > 2 are probably a YAGNI feature. First-order (or even 0th-order) discriminative models tend to work better, at least for POS tagging (ok, Dutch and English POS tagging), so why bother?

The same applies to very large tagsets. I'd switch to discriminative models and maybe multi-stage architectures for such problems: get the coarse POS from an HMM, then fill in the details with per-POS classifiers that can see word features.

@kmike
Contributor Author

kmike commented Apr 2, 2013

I think you're right about HMMs with order > 2. I once read a paper (on unsupervised Russian POS disambiguation) where (t-2, t-1, t+1, t+2) was taken into account, and I thought it was a fourth-order HMM (well, not actually an HMM), but now I've re-read it and realized they used 2 trigram HMMs run on the straight and reversed sentences. There is also a paper on Persian unsupervised disambiguation that uses 4-grams (http://airccse.org/journal/ijaia/papers/3212ijaia04.pdf), but it is not clear whether the 4-grams bring improvements.

One more argument for a generic k-order HMM: with sparse matrices it would be trivial to use HMMs as "cyclic dependency networks" (see http://nlp.stanford.edu/kristina/papers/tagging.pdf) that use the right context explicitly; this approach was used e.g. in Hebrew unsupervised POS tagging (http://acl.eldoc.ub.rug.nl/mirror/P/P08/P08-1085.pdf).

..it may also be trivial to handle explicit right contexts in the explicit second-order case - I don't understand this yet.

@larsmans
Member

larsmans commented Apr 2, 2013

I'm having a hard time understanding the paper about Persian, and I've never heard of the journal either. It may be legit, but I really can't tell.

IIRC, the "cyclic dependency models" of Toutanova et al. are actually discriminative, MEMM-like. They're closer to CRFs than to HMMs. As for Goldberg et al., you should check the reference (Adler 2007) to find out whether it's really that trivial -- I couldn't find the algorithm.

For handling right contexts, the trick with two HMMs running in opposite directions is, I think, the most common one. Note that the Viterbi algorithm computes the most probable path, so it will consider things like P(verb|STOP), which is a limited form of right context.

@kmike
Contributor Author

kmike commented Apr 2, 2013

Thanks for the reference, I'll take a look at it - looks very useful!

Toutanova et al. state that "In essence, there is no difference between inference on this network and a second-order left-to-right CMM or HMM. The only difference is that, when the Markov window is at a position i, rather than receiving the score for P(ti | ti-1, ti-2, wi), one receives the score for P(ti-1 | ti, ti-2, wi-1)" - so in my understanding this means that while theoretically they are not HMMs, at least Viterbi stays the same (just use ti-1, ti+1 instead of ti-1, ti-2).

@larsmans
Member

larsmans commented Apr 2, 2013

Interesting trick! It may be worthwhile to try and implement that.

@kmike
Contributor Author

kmike commented Apr 2, 2013

To recap: @larsmans' suggestions are all sound, and sparse matrices are not an absolute requirement.

Sparse matrices are still appealing, though, because they would allow keeping a single copy of the algorithms (instead of specialized versions for first-order, second-order and "cyclic dependency" models, and maybe others) while still providing the same time and space complexity as the hand-coded models.

@rainmansr

Hello all. This is related to the discussion of training on multiple sequences: it seems that there is an error if any of the sequences has length 1. E.g.

import numpy as np

from sklearn.hmm import GaussianHMM
seq1 = np.random.random((10,4))
seq2 = np.random.random((8,4))
seq3 = np.random.random((1,4))
train = [seq1, seq2, seq3]

hmm = GaussianHMM(n_components=2)
hmm.fit(train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 441, in fit
    bwdlattice, self.params)
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 788, in _accumulate_sufficient_statistics
    params)
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 574, in _accumulate_sufficient_statistics
    stats["trans"] += np.exp(logsumexp(lneta, 0))
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/utils/extmath.py", line 243, in logsumexp
    vmax = arr.max(axis=0)
ValueError: zero-size array to maximum.reduce without identity

This does arise in applications, e.g. in bioinformatics, when HMM-based clustering is done on a (disconnected) subset of the genome.

@rvsasseen

_accumulate_sufficient_statistics() should probably skip the update of the transition stats in that case. Someone should probably file a bug report.

rmcgibbo added a commit to rmcgibbo/scikit-learn that referenced this issue Oct 15, 2013
ogrisel added a commit that referenced this issue Oct 25, 2013
Fix bug identified in #1817, comment 17340049
@ogrisel ogrisel closed this as completed Oct 25, 2013
@ogrisel
Member

ogrisel commented Oct 25, 2013

Oops, reopening, as PR #2525 only solves one of these issues. It would be great to split this issue into sub-issues.

@ogrisel ogrisel reopened this Oct 25, 2013
@larsmans
Member

I'm going to close this issue as we've decided to ditch the HMMs altogether.

@rainmansr

Hi, I was wondering why the HMMs are being ditched? What is going to happen to the code?

@larsmans
Member

HMMs don't fit with the API, and no one really understands the code anymore. We've decided not to support structured learning, including sequence prediction, because doing it properly would require changes all over the codebase.

The code will be put into a separate repo for others to use/fork/maintain.
