MRG feature hashing transformer #909
Conversation
Very interesting. I was about to work on that too (actually I have a WIP branch somewhere). Stuff to add (to this PR or to a later one):
For the name, I would keep it. Related, but not to implement in this PR:
BTW: Edit: hum, your description makes it explicit you are already aware of this...
Where's your branch? Maybe we can combine things.
It's very early work, more a stash than a branch. The goal would be to create a
The other branch (not yet pushed on GitHub either) is about the label hashing, but that will happen in the
I've done some work (not pushed yet) regarding the handling of arbitrary hash functions. This turns out to be trickier than I thought, because hash functions may be either 32 or 64 bits (or maybe even larger), and either signed or unsigned. I'll think this through and maybe come up with new code next weekend. In other news, hashing works quite well for a hacked version of
Out of curiosity, what value for
I was using
@larsmans great contribution indeed - I've hacked something similar for my work, but this looks much more thorough. @ogrisel regarding the multiclass / multilabel wrapper: I thought about something similar for SGDClassifier, basically a specialized
No news, but finally pushed what I had a week ago. Please review; there are comments in the
Any performance difference between Python's hash and murmurhash? I am assuming that the generic untyped Python function call overhead is dominating here.
h = hashfn(f)
sign = (h & 1) * 2 - 1   # lowest bit of the hash gives the sign (-1 or +1)
h = <uint32_t>(h)        # reinterpret as unsigned 32-bit
h >>= 1                  # remaining 31 bits index the column
This line deserves a comment such as: we use the first bit of the hash to compute the sign, and the remaining 31 bits to hash the index into a 2 ** 31-dimensional space. This makes it possible to do a single call to the hash function instead of two.
I was going to put that in the narrative docs, but I haven't written those yet.
I think it's good to have that in the source code as well.
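A minimal Python sketch of the trick discussed in this thread; `sign_and_index`, `hash_value`, and `n_features` are illustrative names, not the PR's Cython code:

```python
def sign_and_index(hash_value, n_features):
    # Split one unsigned 32-bit hash into a sign and a column index:
    sign = (hash_value & 1) * 2 - 1          # lowest bit -> -1 or +1
    index = (hash_value >> 1) % n_features   # remaining 31 bits -> column
    return sign, index
```

This halves the number of hash calls compared to running a separate hash function for the sign, which is the point raised in the review comment above.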
Murmurhash is a bit slower than
Alright. For the text feature extraction version it might be reasonable to handle tokens as UTF-8 encoded byte arrays and use
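For example, hashing a token's UTF-8 bytes could look like this (a sketch, assuming scikit-learn's `sklearn.utils.murmurhash3_32` wrapper):

```python
# Sketch: hash a token's UTF-8 bytes with the MurmurHash3 wrapper.
from sklearn.utils import murmurhash3_32

token = u"fiancée"
h = murmurhash3_32(token.encode("utf-8"), seed=0, positive=True)
print(h)  # an unsigned 32-bit hash of the byte string
```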
BTW, I have lost the early branches I was toying around with (laptop crash without pushing those branches to my public repo). So please feel free to move on without waiting for my input.
Oh, I wasn't waiting -- I'm just very busy ATM. Will resume work when I have more time.
I pushed a bunch more commits. I dropped custom hash functions because they're currently too hard to support, putting in a no-op
I had to change
@ogrisel: Do I understand correctly that you wanted to work on this during the next sprint? I won't be there, but you have my blessing to improve on this. In particular, you seem to be knowledgeable wrt. the internals of CSR matrices, and maybe building a CSR matrix directly could provide some speedup.
I think the refactoring that @ogrisel suggested sounds good. I'd be somewhat interested and I could try to help. But I'm also not going to be at the sprint.
My priority for this sprint will be joblib parallel stuff + Python 3 port. I don't think I will have time to work on feature hashing, although I think this is a very important feature. Please @amueller feel free to poke around in a branch if you want to suggest further improvements. We can keep the feature hashing transformer like this and independently write a
@larsmans could you please update the summary / todo with the current state of the code and what you are planning to add before switching it to "MRG"? Do you think it would be possible to implement a variant that could accept the same input format as
@ogrisel Updated the TODO. And yes, we could implement a variant that takes iterables of pairs as samples. We could even do the one-of-K expansion using C string manipulation, if we're careful. (Or C++'s
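For illustration, such a variant might accept samples shaped like this (hypothetical data, not from the PR):

```python
# Hypothetical "iterable of (feature, value) pairs" sample format:
samples = [
    [("word=hello", 1.0), ("word=world", 2.0)],
    [("word=foo", 1.0)],
]
```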
I don't know if this behavior is intended, but currently a smaller n_features value doesn't lead to a smaller memory footprint. Here's a modified version of _hashing.pyx: https://github.com/ephes/scikit-learn/blob/hashing-trick_array/sklearn/feature_extraction/_hashing.pyx
Hi @ephes, thanks for the suggestion, but how did you determine the memory usage? Also, I've been trying to avoid building a hash table, relying on
@larsmans by looking at the output of "top" :) - here are some more details: #1122 (comment) And no, it's quite a bit slower. But I think coo_matrix does not handle duplicates (quite astonishing):
I've remeasured memory usage (500K documents):
With aggregation of token counts:
Hmm, different sizes of n_features have little influence at all. The main improvement comes from not storing an (i, j, value) triple for every token.
About duplicates:
From the docs:
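A quick check of scipy's documented behavior, which the thread is referring to: duplicate (i, j) entries are stored separately in COO format and only summed on conversion to CSR/CSC (a minimal demo, not from the thread):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two entries for the same (row, col) position:
data = np.array([1.0, 2.0])
row = np.array([0, 0])
col = np.array([1, 1])
m = coo_matrix((data, (row, col)), shape=(1, 2))

print(m.nnz)            # 2: duplicates kept as separate entries
print(m.tocsr()[0, 1])  # 3.0: duplicates summed when converting to CSR
```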
Indeed,
The idea with such large numbers of documents is that you vectorize them in batches, then feed them to a learner that has a partial_fit method.
Besides, if you have 500,000 samples with a similar number of features (in the NLP problems that I handle, usually
In other words, I don't intend to use your
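A sketch of that out-of-core workflow, assuming the FeatureHasher introduced by this PR and a classifier with partial_fit; `batches` (an iterator of `(list_of_dicts, labels)`) is an assumed stand-in:

```python
# Vectorize documents in batches, then learn incrementally.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 20, input_type="dict")
clf = SGDClassifier()
for docs, labels in batches:
    X = hasher.transform(docs)  # stateless: no fitting, no vocabulary
    clf.partial_fit(X, labels, classes=[0, 1])
```

This way the full feature matrix never has to fit in memory, only one batch at a time.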
@larsmans thanks for the clarification of the intended use case. My actual dataset is even larger and has 3.4 million documents (all questions ever posted on Stack Overflow - it's a public dataset). Number of unique tokens: 5054787. Number of tokens after removing all tokens with a document frequency < 2 (not useful for training): 1288578. So I won't hit the 2^31 dimensions limit anytime soon (but nice to know). But I wonder whether the claim that the hashing trick saves memory is valid at all. Without a vocabulary dict, you have to store all tokens as features, and this makes the feature matrix about 4 times as big (at least in this case). And since the vocabulary dict is negligibly small in comparison to the feature matrix, you have to pay big time in memory for not having one. Hmm, in an online setting with partial_fit you don't have to hold the complete feature matrix in memory, I guess, which makes the memory overhead bearable...
When using the hashing trick you don't store the feature names (the string of the token): you hash the bytes of the string while reading them to find the integer index of the feature, using a modulo into a fixed size of features (e.g. 10e5 possible features). The learning algorithms are usually more than able to deal with the occasional collisions. It's a kind of random projection where the projection matrix is implicitly defined by the hash function and its seed. No need for a vocabulary, no need to store the token string to feature index mapping and no
That's quite a lot of memory; what's the average number of tokens per document? I'm +1 for building CSR matrices directly - most (all?) of your sparse classifiers/regressors operate on CSR matrices.
No I didn't, see updated comment :)
Ok :) more tests are better, but if you don't want to fiddle with it, just put it in
!@#$% (excuse me)
Turns out
Do you think we should handle this case as well? Is it likely that users will try to input NumPy arrays of exotic
No wait, I was mistaken. The problem is that I tried
but that's not the kind of object
I think it was:
I give up on trying to support this kind of thing. Please pull when you think this is done.
It would be nice to add support for dictionaries as default input_type. Being a drop-in replacement for DictVectorizer by default is important, as we don't want users to learn too many input formats. If I'm not mistaken you just need to create a generator that consumes the dicts and yields pairs instead.
for k, v in dict.items(): yield k, v

Note that iteritems is deprecated in Python 3 and that items() has been lazy for a little while.
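Spelled out as the adapter @ogrisel describes (a sketch; `iter_pairs` is a hypothetical name):

```python
def iter_pairs(dicts):
    # Consume dict samples and lazily yield (feature, value) pairs.
    for d in dicts:
        for k, v in d.items():
            yield k, v
```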
oh, ok. |
I'll change the input format to dicts since they're actually faster to process.
+1 for dict as the default input format for this kind of vectorizer.
@larsmans can you please merge master into that branch, or rebase your branch, for a final review of the mergeable state?
Build CSR instead of COO matrix in FeatureHasher. Partially followed Jochen Wersdörfer's (@ephes's) suggestions.
Suggestions from @amueller and @GaelVaroquaux:
* default value for n_features
* input validation in fit, with test
* lots of small docfixes
* capitalization and full stop in error messages
Rebased. I also squashed a lot of the microcommits to prevent
Looks good: +1 for merging!
Have you pointed out that it is now a drop-in replacement for DictVectorizer? (It is after your last change, right?)
Not exactly, as it doesn't do 1-of-K coding. I might add that later once I've found out if I can do it fast enough and factor out the common code. (The NLTK wrapper might also benefit from it.)
@mblondel a final comment before merging?
+1 for merge
MRG feature hashing transformer
Done! Thanks @larsmans, this will be very useful.
Awesome! Thanks @larsmans :)
Thanks for reviewing, everyone!
This PR implements a feature hasher for large-scale learning or memory-restricted prediction.

It implements Weinberger et al.'s algorithm, except that it cuts one corner: instead of running two hash functions on the input, it runs a single one, taking its lowest bit to determine the sign and the rest to determine the column index. That should be ok with a 32-bit hash function, as scipy.sparse matrices cannot hold more than 2**31 elements anyway.

Todo:
* Support multiple hash functions, in particular @ogrisel's implementation of murmurhash.
* Do we want to support the same style of input as DictVectorizer? I have a gut feeling that that's going to slow things down. If we don't, we should explicitly document this.
* Needs optimization.
* Needs more tests.
* More documentation needed.
* Should this thing be called HashingVectorizer instead? That means more typing than FeatureHasher, but it is a vectorizer...
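Putting the pieces of this description together, here is a compact Python sketch of the hashed transform it describes, using murmurhash3_32 as the single hash function and dict inputs. This is an illustration of the technique, not the PR's Cython implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils import murmurhash3_32

def hashed_transform(samples, n_features=2 ** 20):
    """Hash dict samples into a sparse matrix, one hash call per feature."""
    indices, data, indptr = [], [], [0]
    for sample in samples:
        for feature, value in sample.items():
            h = murmurhash3_32(feature, positive=True)
            sign = (h & 1) * 2 - 1               # lowest bit -> sign
            indices.append((h >> 1) % n_features)  # remaining bits -> column
            data.append(sign * value)
        indptr.append(len(indices))
    X = csr_matrix((np.array(data), np.array(indices), np.array(indptr)),
                   shape=(len(samples), n_features))
    X.sum_duplicates()  # merge colliding columns within each row
    return X

X = hashed_transform([{"foo": 1, "bar": 2}, {"baz": 1}])
```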