sklearn.hmm ideas #1817
Hey Mikhail, all good remarks. The problem is that we are seriously considering removing the HMM code from scikit-learn.

What's the status on that, @amueller?
I had a look at the HMM API recently to reply to someone's question on the ML. The main concern I have with the current API is not so much how we represent sequences but more that some methods don't follow our usual conventions.

With HMMs, the basic data instance is not the vector but the sequence of vectors (each sequence possibly having a variable length). To be compatible with our API, fit has to accept a list of such sequences rather than a single 2-d array.
Am I wrong in thinking that these API decisions would make it impossible ...
The problem with devising an API for HMMs is that it's always going to be a special case since they're univariate; other sequence models such as MEMMs or LC-CRFs are going to be different because they'd take sequences of samples, where each sample is in turn a 2-d matrix. (To make matters worse, I've been told on MetaOptimize that multivariate HMMs are possible, but not in the multinomial case...)
@larsmans Here's an example of a multivariate HMM using scikit-learn (reproduced from the message I sent earlier to the ML):

```python
import numpy as np
from sklearn.hmm import GaussianHMM

# generate 3 sequences containing a variable number of 4-dimensional observations
# (note: the 3 sequences are completely random; sampling from an HMM would be better)
seq1 = np.random.random((10, 4))
seq2 = np.random.random((8, 4))
seq3 = np.random.random((6, 4))

# the training set consists of 3 sequences
train = [seq1, seq2, seq3]

# assume that 2 states are enough to capture the temporal behavior of the sequences
hmm = GaussianHMM(n_components=2)
hmm.fit(train)
```
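For completeness, and assuming the decode/score methods of the old sklearn.hmm API operate on one sequence at a time (which is my recollection, not something stated above), prediction would then look like:

```python
logprob, states = hmm.decode(seq1)   # Viterbi state sequence for seq1
print(hmm.score(seq1))               # log-likelihood of seq1 under the fitted model
```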
Ah! I was looking at MultinomialHMM. So we already have an inconsistent API.
We could decide to just ditch the multinomial one and keep the Gaussian ones.
@larsmans Basically, to define an HMM, one needs a way to compute the probability that a given observation belongs to some state (the emission probability in HMM jargon). When observations are multivariate, one can use a multivariate Gaussian (this is what GaussianHMM does).
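To make the emission-probability idea concrete, here is a standalone numpy/scipy sketch of what a Gaussian-emission HMM computes per sequence; the means, covariances and shapes below are made up for illustration and this is not the library's actual code:

```python
import numpy as np
from scipy.stats import multivariate_normal

# toy setup: 2 hidden states, 4-dimensional observations
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0]])        # one mean vector per state
covs = [np.eye(4) * 0.5, np.eye(4) * 2.0]       # one covariance matrix per state

seq = np.random.random((10, 4))                 # one observation sequence

# per-state emission log-probabilities, shape (n_observations, n_states)
framelogprob = np.column_stack([
    multivariate_normal(mean=m, cov=c).logpdf(seq)
    for m, c in zip(means, covs)
])
print(framelogprob.shape)                       # (10, 2)
```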
For the multinomial HMM, the supervised setting seems more useful (POS tagging, ...).
BTW, the term "supervision" can be ambiguous in the context of HMMs. Depending on the application, labels can be associated with entire sequences (spoken word recognition, handwritten digit recognition) or with each observation in the sequences (POS tagging, NER, ...).
What kind of interoperability are you thinking of?
If someone knows how to create a drawing pad using tkinter or matplotlib, I can create a handwritten digit example using GaussianHMM.
For POS tagging, you definitely need second-order models (in addition to supervised training) to achieve acceptable accuracy, so we'd have to introduce a new class or do extensive rewriting of the existing one.

@kmike recently did some work on the NLTK HMMs to make them faster; we could refer users who want to do POS tagging there instead of duplicating the effort.
I don't know. I don't have any use case for HMMs. All I know is that ...
Personally, I don't think HMMs can be used as a plug-in replacement for other estimators in the scikit: HMMs are just too different a beast. However, we should strive for consistency with the rest of the scikit whenever possible (to minimize surprise).
To make _BaseHMM more generic, at least 2 more changes are needed: ...
Currently POS tagging with nltk.tag.hmm is not easier than with sklearn.hmm.MultinomialHMM because they share the same limitations (all emission and transition probabilities must be stored in memory). I'm still trying to understand how to fix nltk.tag.hmm to remove these limitations.

With minimal changes it is possible to fix sklearn.hmm to allow emission probabilities to be computed online (see (1)), but transition probabilities are still an issue. My hope is to change _BaseHMM to work with sparse matrices and then store n-gram transition probabilities (with back-off precomputed) in a sparse matrix where the state space is a Cartesian product of the original state space. If my understanding is correct, with a dense matrix this would require N^(2*(n-1)) cells (where n is the n-gram size), while with a sparse matrix this should be N^n (because there are many impossible state changes) - e.g. N^4 vs N^3 for trigrams. This is still a lot of memory, but maybe forcing some probabilities to zero helps...

It is possible to create an explicit second-order HMM, but (as far as I can tell) it would have the same space/time complexity (though being more efficient in practice), and it is not clear how to introduce an explicit second-order HMM to sklearn.hmm without rewriting most of the code. So, I want to make fixes to _BaseHMM (and, most probably, to _hmmc.pyx), but I don't care about MultinomialHMM (it has too many assumptions - are there people using it?). My knowledge is extremely lacking and I don't know how GaussianHMM and GMMHMM work and what they are for, so I don't care about them either.

By the way, _logsum from _hmmc.pyx is a duplicate of sklearn.utils.extmath.logsumexp without "axis" support, and without "axis" support it could be made about 1.5x faster like this: https://gist.github.com/kmike/5278404#file-math-pyx-L54

At this point, there are 3 ways: a) extract sklearn.hmm to a separate package and try to fix it there (easier as I won't care about interface changes, MultinomialHMM and maybe GaussianHMM+GMMHMM); ...

What do you think?
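A toy sketch of the Cartesian-product encoding described above: a trigram (second-order) model over N states becomes a first-order model over ordered pairs (a, b), and a transition (a, b) -> (b', c) is only possible when b == b', so at most N^3 of the N^4 entries are non-zero and a sparse matrix only stores those. The numbers and names here are purely illustrative:

```python
import numpy as np
from scipy.sparse import lil_matrix

N = 4                                 # number of original (first-order) states
pair = lambda a, b: a * N + b         # index of the composite state (a, b)

# trigram transition probabilities P(c | a, b); random here, just for the sketch
trigram = np.random.dirichlet(np.ones(N), size=(N, N))   # shape (N, N, N)

# first-order transition matrix over pairs: (a, b) -> (b, c) with prob P(c | a, b)
transmat = lil_matrix((N * N, N * N))
for a in range(N):
    for b in range(N):
        for c in range(N):
            transmat[pair(a, b), pair(b, c)] = trigram[a, b, c]

print(transmat.nnz, "stored entries out of", (N * N) ** 2)   # N**3 vs N**4
```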
... more on iterators: it turns out that itertools.tee has undocumented support for efficient cloning of iterators which have a __copy__ method.
The standard way to deal with input being strings is to use an encoding that maps them to integers.
Makes sense.
The way we tackle this type of problem in scikit-learn is not with iterators. Instead we do a partial_fit. Note that although the pattern is different from that of iterators, it serves the same out-of-core purpose (see the sketch below).
That definitely sounds like an improvement. Also, it would probably be useful to ...
I don't know. If the changes are not huge, I'd say b), because you should ...
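For reference, the partial_fit pattern mentioned above, sketched with MiniBatchKMeans (an estimator that already supports it); iter_chunks is a made-up stand-in for whatever streams the data from disk:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def iter_chunks(n_chunks=5, chunk_size=100, n_features=4):
    """Hypothetical chunked data source; in practice this would read from disk."""
    for _ in range(n_chunks):
        yield np.random.random((chunk_size, n_features))

est = MiniBatchKMeans(n_clusters=3)
for chunk in iter_chunks():
    est.partial_fit(chunk)    # the model is updated incrementally, chunk by chunk
```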
Yes, but, as the example with emissions shows, using numbers may be ineffective (a morphological analyzer doesn't work on numbers, and I don't want to store a token -> number mapping in memory). Moving np.asarray(obs) to the methods that actually need numbers would fix this.
I think the 'partial_fit' interface won't work if n_iter > 1: we can't start the next EM iteration until all the data has been seen (and the model updated); we also can't check for convergence without all the data. On the other hand, just calling itertools.tee(X, n_iter) would make it possible to feed data incrementally without loading more than a single observation into memory at once, by creating a custom input iterator with a __copy__ method (sketched below).
Agreed. Should it be a new utils/_extmath.pyx extension?
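A sketch of the kind of __copy__-aware iterator mentioned above, assuming itertools.tee really does use the undocumented __copy__ hook; the file format and names are invented for illustration:

```python
import itertools

class FileSequenceIterator(object):
    """Yield one token sequence per line of a text file.

    __copy__ re-opens the file, so itertools.tee can clone this iterator
    without buffering everything that has already been yielded.
    """

    def __init__(self, path):
        self.path = path
        self._fh = open(path)

    def __iter__(self):
        return self

    def __next__(self):
        line = self._fh.readline()
        if not line:
            raise StopIteration
        return line.split()

    next = __next__        # Python 2 compatibility

    def __copy__(self):
        return FileSequenceIterator(self.path)

# one cheap clone of the data stream per EM iteration, e.g.:
# streams = itertools.tee(FileSequenceIterator("corpus.txt"), n_iter)
```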
On Sun, Mar 31, 2013 at 01:53:02AM -0700, Mikhail Korobov wrote:
An array of integers will have a much smaller in-memory footprint than an array of strings.
OK, so it's not a real online setting. Sorry, I was confused. I don't ...
Yes, but we don't want to support this pattern in the scikit because it ...
I think that this would be the way to go. Thanks!
In this particular example of _BaseHMM, 'obs' is not used in any way by scikit-learn code. It is passed to _compute_log_likelihood, which is empty and meant to be overridden in subclasses; it is also passed to _init but unused there. I think this interface is great (this is one of the reasons why fixing nltk.tag.hmm is harder). So _BaseHMM doesn't need 'obs' to be integers, and casting obs to an array in _BaseHMM can't improve speed or memory footprint in any way, but it can reduce flexibility. The subclasses of _BaseHMM provided in scikit-learn do require numeric observations, though (see the sketch below).
Yes, the idea is to iterate over the same data several times (e.g. read it from a file several times).
What about supporting this in sklearn.hmm without generalizing to various learners? The "fit" method accepts a list of sequences, not a matrix, so supporting iterators looks quite natural here.
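A rough sketch of the kind of subclass described above, assuming the old sklearn.hmm._BaseHMM interface where _compute_log_likelihood(obs) returns an (n_observations, n_components) array of per-state emission log-likelihoods and obs is not forced to be numeric; the scorer and class name are invented for illustration:

```python
import numpy as np
from sklearn.hmm import _BaseHMM

class TokenHMM(_BaseHMM):
    """HMM whose observations are string tokens rather than integer-coded symbols.

    Emission log-probabilities are computed on the fly by a scoring callable
    (e.g. a morphological analyzer), so no token -> integer mapping is kept
    in memory.
    """

    def __init__(self, emission_scorer, n_components=1, **kwargs):
        super(TokenHMM, self).__init__(n_components=n_components, **kwargs)
        # callable: token -> array of n_components emission log-probabilities
        self.emission_scorer = emission_scorer

    def _compute_log_likelihood(self, obs):
        # obs is one sequence of tokens; result has shape (len(obs), n_components)
        return np.array([self.emission_scorer(token) for token in obs])
```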
Sorry, I'm coming a bit late to the discussion. One more thing to consider is that HMMs in sklearn really make it hard to delineate the scope of sklearn. We rejected Kalman filters, which are very closely related. We could go with "let's try and keep this algorithm in good enough shape and not add anything more".
If I go the "make a new package" route (I'm open to other routes though), how should I acknowledge the original authors? The scikit-learn license says that "Neither the name of the Scikit-learn Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission." - am I allowed to say that the code originated from scikit-learn and to list all the people who contributed to sklearn.hmm?
@kmike I'm pretty sure that would be ok.
You're right, it might be quite a bit of work to maintain it, and I don't feel entitled to maintain anything but _BaseHMM (and maybe MultinomialHMM). My goal is to get a versatile BaseHMM with a sparse transition matrix. If you think about permanently factoring out HMMs, we need more people willing to help.

Update on _logsum: I'm trying to make transition matrices sparse; as a first step I'm rewriting _hmmc.pyx without Cython (numpy + Python) to make changes easier. To my surprise, the partially vectorized numpy + Python version (see https://gist.github.com/kmike/5285124) already runs about 20+% faster than the current Cython extension (measured on the eval() method, n_components=1000, len(obs)=10). So a faster _logsum may be unnecessary.
Update 2: the bug is fixed in the gist mentioned above; a pure-Python vectorized version of Viterbi that is 6x faster than _hmmc._viterbi has been added.
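For reference, the vectorized Viterbi idea boils down to something like the following generic log-space sketch (not the gist's exact code, and independent of sklearn.hmm internals):

```python
import numpy as np

def viterbi(log_startprob, log_transmat, framelogprob):
    """Most likely state path; all inputs are in log space.

    framelogprob has shape (n_obs, n_states) and holds per-state
    emission log-likelihoods for one sequence.
    """
    n_obs, n_states = framelogprob.shape
    delta = np.empty((n_obs, n_states))
    psi = np.empty((n_obs, n_states), dtype=int)
    delta[0] = log_startprob + framelogprob[0]
    for t in range(1, n_obs):
        # scores[i, j]: best score ending in state i at t-1, then moving to j
        scores = delta[t - 1][:, np.newaxis] + log_transmat
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + framelogprob[t]
    # backtrack the best path
    path = np.empty(n_obs, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_obs - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# quick smoke test with random log-probabilities
rng = np.random.RandomState(0)
n_states = 3
logA = np.log(rng.dirichlet(np.ones(n_states), size=n_states))
logpi = np.log(rng.dirichlet(np.ones(n_states)))
loglik = np.log(rng.random_sample((10, n_states)))
print(viterbi(logpi, logA, loglik))
```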
Whatever the future of HMMs in scikit-learn is, I would be interested in ...
Done as #1825.
It would be nice to create a benchmark for different numbers of states, ...
You're right - I just checked with 2 states, and with 2 states .fit and .eval are 2.5x slower in my pull request than in scikit-learn master; _decode_viterbi is still 1.5x faster. A proper benchmark would be better, but I don't have time to create it, sorry. I think the Cython extension is slow because it has a lot of Python-level calls (this is visible in the annotated HTML from Cython), and using typed memoryviews could make it a lot faster.
Including a copyright notice is not promoting/endorsing. Also:

"a. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer."

so you'd have to list scikit-learn somewhere. The endorse/promote clause is mostly there to prevent our names or "scikit-learn" from being used in advertising by forks.

As for the separate package route, I would be in favor of starting a scikit-structlearn/scikit-sequence toolkit and would likely contribute to it.
On 04/02/2013 01:52 PM, Lars Buitinck wrote:
Yep. And support sparse matrices throughout... are you doing that?
By the way, do you know a good way to support sparse matrices with non-zero default values? (This is the case for matrices of log-probabilities, where "empty" values are -inf.) Currently I see 2 possibilities: (a) remove log-space calculations, or (b) subclass something from scipy.sparse, but (a) may cause numeric stability issues and (b) looks quite laborious.
(a) is numerically unstable, (b) is a maintenance nightmare. @jakevdp knows tricks for this. But are you sure sparsifying the matrices inside an HMM is worth it? The emission probabilities are usually smoothed and thus all non-zero, while the transition matrix shouldn't be that large (even a second-order model with 100 states takes just a few MB).
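One possible workaround for the -inf default problem, sketched with toy numbers (just an illustration, not necessarily the trick referred to above): keep the matrix in plain probability space, take logs of the explicitly stored entries only, and map unstored entries to -inf at lookup time:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy transition matrix kept in plain probability space
probs = csr_matrix(np.array([[0.7, 0.3, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.5, 0.0, 0.5]]))

# log-space copy: take logs of the stored entries only
logprobs = probs.copy()
logprobs.data = np.log(logprobs.data)

def get_logprob(m, i, j):
    """Read one entry of a log-space CSR matrix, mapping unstored entries to -inf.

    m[i, j] would return 0.0 for an unstored entry, which in log space would
    silently mean probability 1, so the sparsity pattern has to be checked.
    """
    start, end = m.indptr[i], m.indptr[i + 1]
    hit = np.where(m.indices[start:end] == j)[0]
    return m.data[start + hit[0]] if hit.size else -np.inf

print(get_logprob(logprobs, 0, 1))   # log(0.3)
print(get_logprob(logprobs, 0, 2))   # -inf
```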
I don't know whether sparsifying the matrices is worth it because I haven't tried the result yet :) My goal is to try higher-order models, and the state space for full Russian tag disambiguation can easily be > 1500 for a first-order HMM (of course there are ways to simplify tags, but 100 states is quite a low number). Also, I thought this is not only about memory requirements - calculations on sparse matrices should also be orders of magnitude faster, especially with higher-order models.
Also, a second-order model with 100 states would take 100^4 cells = 800 MB for the transition matrix if states are just a Cartesian product of first-order states; to make it 100^3, rewriting the algorithms to explicitly handle the second-order case is necessary, another rewrite is needed for third-order models, and so on. On the other hand, sparse matrices should solve all these problems and allow arbitrary-order HMMs.
I was thinking exactly of a specialized 2nd-order Viterbi algorithm. As you point out, that gives a big saving in complexity, and I think it should be preferred to a generic k-order HMM. >2nd-order models are probably a YAGNI feature. First-order (or even 0th-order) discriminative models tend to work better, at least for POS tagging (ok, Dutch and English POS tagging), so why bother? The same applies to very large tagsets. I'd switch to discriminative models and maybe multi-stage architectures for such problems: get coarse POS from an HMM, fill in details with per-POS classifiers that can see word features.
I think you're right about HMMs with order > 2. I read a paper (on unsupervised Russian POS disambiguation) once where (t-2, t-1, t+1, t+2) was taken into account, and I thought it was a 4th-order HMM (well, not actually an HMM), but now I've re-read it and realized they used 2 trigram HMMs run on the straight and reversed sentences. There is also a paper on unsupervised Persian disambiguation that uses 4-grams (http://airccse.org/journal/ijaia/papers/3212ijaia04.pdf), but it is not clear whether the 4-grams bring improvements.

One more argument for a generic k-order HMM: with sparse matrices it would be trivial to use HMMs as "cyclic dependency networks" (see http://nlp.stanford.edu/kristina/papers/tagging.pdf) that use the right context explicitly; this approach was used e.g. in Hebrew unsupervised POS tagging (http://acl.eldoc.ub.rug.nl/mirror/P/P08/P08-1085.pdf). It may also be trivial to handle explicit right contexts in the explicit 2nd-order case - I don't understand this yet.
I'm having a hard time understanding the paper about Persian, and I've never heard of the journal either. It may be legit, but I really can't tell. IIRC, the "cyclic dependency models" of Toutanova et al. are actually discriminative, MEMM-like; they're closer to CRFs than to HMMs. As for Goldberg et al., you should check the reference Adler 2007 to find out if it's really that trivial -- I couldn't find the algorithm. For handling right contexts, the trick with two HMMs running in opposite directions is, I think, the most common one. Note that the Viterbi algorithm computes the most probable path, so it will consider things like P(verb|STOP), which is a limited form of right context.
Thanks for the reference, I'll take a look at it - looks very useful! It is stated in Toutanova et al. that "In essence, there is no difference between inference on this network and a second-order left-to-right CMM or HMM. The only difference is that, when the Markov window is at a position i, rather ..."
Interesting trick! It may be worthwhile to try and implement that.
To recap: @larsmans' suggestions are all sound, and sparse matrices are not an absolute requirement. Sparse matrices are still appealing though, because they would allow having a single copy of the algorithms (instead of specialized versions for first-order, second-order and "cyclic dependency" models, and maybe others) while still providing the same time and space complexity as hand-coded models.
Hello all. Related to the discussion of training on multiple sequences: it seems that there is an error if any of the sequences is of length 1. E.g.

```python
import numpy as np
from sklearn.hmm import GaussianHMM

seq1 = np.random.random((10, 4))
seq2 = np.random.random((8, 4))
seq3 = np.random.random((1, 4))

train = [seq1, seq2, seq3]
hmm = GaussianHMM(n_components=2)
hmm.fit(train)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 441, in fit
    bwdlattice, self.params)
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 788, in _accumulate_sufficient_statistics
    params)
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/hmm.py", line 574, in _accumulate_sufficient_statistics
    stats["trans"] += np.exp(logsumexp(lneta, 0))
  File "/home/user/arch/Linux-x86_64/lib/python2.7/sklearn/utils/extmath.py", line 243, in logsumexp
    vmax = arr.max(axis=0)
ValueError: zero-size array to maximum.reduce without identity
```

This does arise as an application in e.g. bioinformatics, when HMM-based clustering is done on a (disconnected) subset of the genome.
_accumulate_sufficient_statistics() should probably skip the update of the transition stats in that case. Should probably file a bug report.
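A minimal sketch of the guard being suggested, using the names that appear in the traceback above (illustrative only, not the actual patch): lneta holds per-transition posteriors with shape (n_obs - 1, n_components, n_components), so for a length-1 sequence it is empty and the transition statistics should simply not be updated.

```python
import numpy as np
from sklearn.utils.extmath import logsumexp

def accumulate_trans_stats(stats, lneta):
    """Add transition statistics, skipping empty lattices from length-1 sequences."""
    if lneta.shape[0] > 0:
        stats["trans"] += np.exp(logsumexp(lneta, 0))

stats = {"trans": np.zeros((2, 2))}
accumulate_trans_stats(stats, np.empty((0, 2, 2)))   # length-1 sequence: no-op, no crash
```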
Fix bug identified in #1817, comment 17340049
Oops, reopening, as PR #2525 only solves one of those issues. It would be great to split this issue into sub-issues.
I'm going to close this issue as we've decided to ditch the HMMs altogether.
Hi, I was wondering why the HMMs are being ditched? What is going to happen to the code?
HMMs don't fit with the API and no one really understands the code anymore. We've decided not to support structured learning, including sequence prediction, because doing it properly would require changes all over the codebase. The code will be put into a separate repo for others to use/fork/maintain.
Hi,
I think there are some things in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/hmm.py that could be improved.
_BaseHMM.algorithm handling

Ideas:

- I'd prefer making decode_viterbi and decode_map public, because flags that totally change what a function does are bad API (see e.g. http://mail.python.org/pipermail/python-dev/2005-August/055907.html). Is it correct that "decode" is required in the first place because of grid search, etc.?

I don't understand why _BaseHMM.fit sets algorithm.
Docstrings

- The _decode_map docstring says "Find most likely state sequence corresponding to obs". In my understanding this is a bit misleading, because (unlike _BaseHMM._decode_viterbi) the likelihood of individual states is maximized, not the likelihood of a sequence.
- params and init_params are not properly documented. For example, for MultinomialHMM the documented 'c' is not valid, and the most interesting 'e' is not documented.

Ideas: fix this.
fit method

- There is no supervised fit variant for MultinomialHMM.
- fit vs all other methods: the input should be a list of sequences for fit and an individual sequence in the other methods.

Ideas: def fit(self, X, y=None) ...

My knowledge is lacking and I'm clueless about GaussianHMM and GMMHMM - does supervised learning make sense for them?
Normalization
I don't know if normalization is a real issue - maybe there are better solutions for sparse problems (like n-gram POS tagging with |tagset| > 1000) - but sklearn.hmm is currently unusable for such problems, at least because of normalization.
Misc
I'm looking for feedback (and would be happy to submit a pull request with fixes to some of these issues if this makes sense).