MRG feature hashing transformer #909


Merged: 7 commits merged into scikit-learn:master on Nov 18, 2012

Conversation

larsmans
Member

This PR implements a feature hasher for large-scale learning or memory-restricted prediction.

It implements Weinberger et al.'s algorithm, except that it cuts one corner: instead of running two hash functions on the input, it runs a single one, taking its lowest bit to determine the sign and the rest to determine the column index. That should be ok with a 32-bit hash function, as scipy.sparse matrices cannot hold more than 2**31 elements anyway.
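For illustration, here is a minimal Python sketch of that corner-cutting scheme (using sklearn's bundled MurmurHash as a stand-in for whichever hash function ends up being used):

from sklearn.utils import murmurhash3_32

def hashed_index_and_sign(feature, n_features):
    # One 32-bit hash per feature: the lowest bit picks the sign,
    # the remaining 31 bits pick the column index.
    h = murmurhash3_32(feature.encode("utf-8"), positive=True)
    sign = 1 if h & 1 else -1
    index = (h >> 1) % n_features
    return index, sign

print(hashed_index_and_sign("token", 2 ** 20))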

Todo:

  • Support multiple hash functions, in particular @ogrisel's implementation of murmurhash.
  • Do we want to support the same style of input as DictVectorizer? I have a gut feeling that that's going to slow things down. If we don't, we should explicitly document this.
  • Needs optimization.
  • Needs more tests.
  • More documentation needed.
  • Should this thing be called HashingVectorizer instead? That means more typing than FeatureHasher, but it is a vectorizer...

@ogrisel
Member

ogrisel commented Jun 15, 2012

Very interesting. I was about to work on that too (actually I have a WIP branch somewhere).

Stuff to add (to this PR or to a later one):

  • support for hashed feature cross-products (I think VW does this), using the hash of the first term as the seed when rehashing the hash of the second term of the pair (see the sketch after this list).
  • a combined hashing text vectorizer that does not use an in-memory vocabulary but hashes the terms as they are read (I agree that the transformer API with sequences of pretokenized inputs is interesting too).
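A rough sketch of the cross-product idea from the first bullet above (the function name is hypothetical, and VW's actual scheme may differ):

from sklearn.utils import murmurhash3_32

def crossed_feature_index(term_a, term_b, n_features):
    # Hash the first term, then reuse that hash as the seed
    # when hashing the second term of the pair.
    h_a = murmurhash3_32(term_a.encode("utf-8"), positive=True)
    h_ab = murmurhash3_32(term_b.encode("utf-8"), seed=h_a, positive=True)
    return h_ab % n_features

print(crossed_feature_index("user=alice", "item=42", 2 ** 20))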

For the name I would keep FeatureHasher for the current API and write a new class dedicated to text input called HashingVectorizer (maybe reusing the TF-IDF normalization as an option too).

Related but not to implement in this PR:

  • multiclass / multilabel wrapper with the hashing trick on both the label and the input (I am working on this currently), also described in the paper you reference.

@ogrisel
Member

ogrisel commented Jun 15, 2012

BTW: murmurhash3_32 is already included in sklearn.utils if you missed it.

Edit: hum, your description makes it explicit you are already aware of this...

@larsmans
Member Author

Where's your branch? Maybe we can combine things.

@ogrisel
Member

ogrisel commented Jun 16, 2012

It's very early work, more a stash than a branch. The goal would be to create a BaseVectorizer ABC that factors out all the preprocessing / tokenization logic, then derive CountVectorizer and HashingVectorizer from it. CountVectorizer has methods such as inverse_transform that don't make sense for HashingVectorizer, hence HashingVectorizer cannot be a direct subclass of CountVectorizer.

The other branch (not yet pushed to github either) is about label hashing, but that will happen in the sklearn.multiclass module independently of this PR, I think.

@larsmans
Member Author

I've done some work (not pushed yet) regarding the handling of arbitrary hash functions. This turns out to be trickier than I thought because hash functions may be either 32 or 64 bits (or maybe even larger), and either signed or unsigned.

I'll think this through and maybe come up with new code next weekend.

In other news, hashing works quite well for a hacked version of examples/document_classification.py that extracts terms using re.findall(r'\w+', text_file); vectorizing is some 27% faster with no noticeable drop in F1-score. (However, running LinearSVC for very large n_features is very slow.)
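(For reference, the term extraction described above amounts to something like this; the actual modified benchmark script isn't shown in this thread.)

import re

def extract_terms(text):
    # crude tokenizer from the hacked benchmark: every \w+ run is a term
    return re.findall(r"\w+", text)

print(extract_terms("Vectorizing is some 27% faster."))
# ['Vectorizing', 'is', 'some', '27', 'faster']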

@ogrisel
Member

ogrisel commented Jun 18, 2012

Out of curiosity what value for n_features are you using in your tests? For 20 newsgroups 1e5 is probably more than enough.

@larsmans
Member Author

I was using 2**20. But indeed, 2**16 works quite well too and is a lot easier on the learners.

@pprett
Member

pprett commented Jun 18, 2012

@larsmans great contribution indeed - I've hacked something similar for my work, but this looks much more thorough.

@ogrisel regarding the multiclass / multilabel wrapper: I thought about something similar for SGDClassifier, basically a specialized WeightVector class that implements dot via hashing. The downside is that it would require lots of refactoring of the other code too. Even worse: you cannot use hashing combined with OvR... (the order in which the binary tasks are executed affects collisions/updates)

@larsmans
Member Author

No news, but I finally pushed what I had a week ago. Please review; there are comments in the .pyx regarding how to handle hash functions. The main issue is casting a signed integer of unknown size to int32_t, preferably without losing bits, and in a way that is reproducible.
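One candidate approach, sketched in plain Python (an assumption about how it could be done, not necessarily what the branch does): mask to a fixed width, which is well defined for Python integers of any size and signedness, then fold and reinterpret.

def fold_to_int32(h):
    # View the hash as (at most) 64 unsigned bits; negative inputs take
    # their two's-complement form, so the result is reproducible.
    h &= (1 << 64) - 1
    # XOR-fold the upper 32 bits into the lower 32 rather than simply
    # discarding them.
    h = (h >> 32) ^ (h & 0xFFFFFFFF)
    # Reinterpret the result as a signed 32-bit integer (int32_t).
    return h - (1 << 32) if h >= (1 << 31) else h

print(fold_to_int32(-1), fold_to_int32(2 ** 40 + 5))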

@ogrisel
Member

ogrisel commented Jun 24, 2012

Any performance difference between Python's built-in hash and murmurhash? I am assuming that the generic untyped Python function call overhead dominates here.

h = hashfn(f)
sign = (h & 1) * 2 - 1
h = <uint32_t>(h)
h >>= 1
Member

This line deserves a comment such as: we use the lowest bit of the hash to compute the sign and the remaining 31 bits to index into a 2 ** 31-dimensional space. This makes it possible to do a single call to the hash function instead of two.

Member Author

I was going to put that in the narrative docs, but I didn't write those yet.

Member

I think it's good to have that in the source code as well.

@larsmans
Member Author

Murmurhash is a bit slower than hash. I think I'm going to drop hash support, since it just doesn't provide enough guarantees. Worse, in future Python versions hash randomization will be enabled by default. hash also won't take a seed argument.

@ogrisel
Member

ogrisel commented Jun 24, 2012

Alright. For the text feature extraction version, it might be reasonable to handle tokens as utf-8 encoded byte arrays and use murmurhash3_bytes_u32 directly inside the Cython loop, to get rid of some Python function call overhead.
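From Python, the same hash is reachable as sklearn.utils.murmurhash3_32, so the proposal looks roughly like this (the Cython version would just skip the interpreter overhead):

from sklearn.utils import murmurhash3_32

def token_indices(tokens, n_features):
    # Encode each token to utf-8 bytes and hash straight to a column index.
    for token in tokens:
        yield murmurhash3_32(token.encode("utf-8"), positive=True) % n_features

print(list(token_indices(["feature", "hashing", "rocks"], 2 ** 16)))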

@ogrisel
Member

ogrisel commented Jul 18, 2012

BTW, I have lost the early branches I was toying around with (laptop crash without pushing those branches to my public repo). So please feel free to move on without waiting for my input.

@larsmans
Member Author

Oh, I wasn't waiting -- I'm just very busy ATM. Will resume work when I have more time.

@larsmans
Member Author

larsmans commented Sep 4, 2012

I pushed a bunch more commits. I dropped custom hash functions because they're currently too hard to support, putting in a no-op hashfn argument for forward compatibility. I also wrote up some narrative documentation.

I had to change murmurhash.pxd to get this to build. I should probably have changed some setup.py but I'm not going to fix that now.

@ogrisel: Do I understand correctly that you wanted to work on this during the next sprint? I won't be there, but you have my blessing to improve on this. In particular, you seem to be knowledgeable wrt. the internals of CSR matrices, and maybe building a CSR matrix directly could provide some speedup.

@amueller
Member

amueller commented Sep 4, 2012

I think the refactoring that @ogrisel suggested sounds good. I'd be somewhat interested and I could try to help. But I'm also not going to be at the sprint.

@ogrisel
Member

ogrisel commented Sep 4, 2012

My priority for this sprint will be the joblib parallel stuff + the Python 3 port. I don't think I will have time to work on feature hashing, although I think this is a very important feature. @amueller, please feel free to poke around in a branch if you want to suggest further improvements.

We can keep the feature hashing transformer as it is and independently write a HashingVectorizer for text input in a separate PR; both can be useful and will likely share little code, since the input data structures are different and the code should be optimized for each.

@ogrisel
Member

ogrisel commented Sep 4, 2012

@larsmans could you please update the summary / todo with the current state of the code and what you are planning to add before switching it to "MRG"?

Do you think it would be possible to implement a variant that accepts the same input format as DictVectorizer, but uses a hash function instead of a materialized mapping?

@larsmans
Member Author

larsmans commented Sep 4, 2012

@ogrisel Updated the TODO.

And yes, we could implement a variant that takes iterables of pairs as samples. We could even do the one-of-K expansion using C string manipulation, if we're careful. (Or C++'s std::string, if @GaelVaroquaux isn't watching. ;)
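In plain Python, the one-of-K expansion being alluded to is just name mangling; a hedged sketch (a C or C++ version would do the same concatenation without interpreter overhead):

def expand_one_of_k(pairs):
    # Categorical (name, string) pairs become indicator features by
    # folding the value into the feature name; numeric values pass through.
    for name, value in pairs:
        if isinstance(value, str):
            yield "%s=%s" % (name, value), 1.0
        else:
            yield name, float(value)

print(list(expand_one_of_k([("color", "red"), ("length", 4)])))
# [('color=red', 1.0), ('length', 4.0)]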

@ephes
Contributor

ephes commented Sep 10, 2012

I don't know if this behavior is intended, but currently a smaller n_features value doesn't lead to a smaller memory footprint. Here's a modified version of _hashing.pyx: https://github.com/ephes/scikit-learn/blob/hashing-trick_array/sklearn/feature_extraction/_hashing.pyx

@larsmans
Member Author

Hi @ephes, thanks for the suggestion, but how did you determine the memory usage? Also, I've been trying to avoid building a hash table, relying instead on coo_matrix's built-in handling of duplicates; is your version just as fast?

@ephes
Contributor

ephes commented Sep 10, 2012

@larsmans by looking at the output of "top" :) - here are some more details: #1122 (comment)

And no, it's quite a bit slower. But I think coo_matrix does not handle duplicates (quite astonishing):

>>> x = sparse.coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 1, 1, 1], [1, 1, 1, 2, 2, 2])), shape=(2, 3))
>>> print x
  (0, 1)    1
  (0, 1)    1
  (0, 1)    1
  (1, 2)    1
  (1, 2)    1
  (1, 2)    1

@ephes
Contributor

ephes commented Sep 10, 2012

I've remeasured memory usage (500K documents):

  • 5.0 GB (lists, no aggregation)
  • 1.3 GB (arrays, no aggregation)

With aggregation of token counts:

  • 871 MB for n_features=100000000
  • 500 MB for n_features=100

Hmm, different n_features values have hardly any influence at all. The main improvement comes from not storing an (i, j, value) triple for every token.

@ogrisel
Member

ogrisel commented Sep 10, 2012

About duplicates: scipy.sparse.csr_matrix has eliminate_zeros and sort_indices methods. I don't think COO has such methods.

@amueller
Member

From the docs:
"By default when converting [COO] to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)"

@larsmans
Member Author

Indeed, coo_matrix does handle duplicates, in its conversion routines to other formats :)
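Rerunning the earlier snippet with a conversion shows the summing:

from scipy import sparse

x = sparse.coo_matrix(([1, 1, 1, 1, 1, 1],
                       ([0, 0, 0, 1, 1, 1], [1, 1, 1, 2, 2, 2])),
                      shape=(2, 3))
print(x.tocsr().toarray())   # duplicate (i, j) entries are summed
# [[0 3 0]
#  [0 0 3]]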

The idea with such large numbers of documents is that you vectorize them in batches, then feed them to a learner that has a partial_fit method or a trained model. I do intend to optimize for that case, not for vectorizing very large numbers of samples.

Besides, if you have 500,000 samples with a similar number of features (in the NLP problems that I handle, usually n_features > n_samples in training), you've got 5e5² > 2**31 elements, which is unreliable in scipy.sparse matrices anyway. (And yes, we should document that somewhere.)

In other words, I don't intend to use your dict trick. But please leave the commit around, since I do like the idea of using growing arrays to save some memory.
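The batch workflow described above would look something like this (a sketch only: stream_batches is a hypothetical data source, and FeatureHasher's final constructor signature may differ):

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 18)   # stateless: no fit needed
clf = SGDClassifier()

for raw_batch, y_batch in stream_batches():  # hypothetical batch source
    X_batch = hasher.transform(raw_batch)    # e.g. one dict of features per sample
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])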

@ephes
Contributor

ephes commented Sep 10, 2012

@larsmans thanks for clarifying the intended use case. My actual dataset is even larger: 3.4 million documents (all questions ever posted on Stack Overflow; it's a public dataset). Number of unique tokens: 5054787. Number of tokens after removing all tokens with a document frequency < 2 (not useful for training): 1288578. So I won't hit the 2^31 dimensions limit anytime soon (but nice to know).

But I wonder whether the claim that the hashing trick saves memory is valid at all. Without a vocabulary dict, you have to store all tokens as features, and this makes the feature matrix about 4 times as big (at least in this case). And since the vocabulary dict is negligibly small compared to the feature matrix, you pay big time in memory for not having one. Hmm, in an online setting with partial_fit you don't have to hold the complete feature matrix in memory, I guess, which makes the memory overhead bearable...

@ogrisel
Member

ogrisel commented Sep 10, 2012

When using the hashing trick you don't store the feature names (the string of the token): you hash the bytes of the string while reading them to find the integer index of the feature, using a modulo into a fixed-size feature space (e.g. 10e5 possible features). The learning algorithms are usually more than able to deal with the occasional collisions. It's a kind of random projection where the projection matrix is implicitly defined by the hash function and its seed. No need for a vocabulary, no need to store the token-string-to-feature-index mapping, and no fit-time state: this makes it possible to perform the transform operation directly, in parallel, on streamed chunks of data, with low and constant memory usage. The cost is losing the explainability of your model (there is no way to invert the mapping to recover the feature names).

@pprett
Member

pprett commented Sep 10, 2012

I've remeasured memory usage (500K documents):

  • 5.0 GB (lists, no aggregation)
  • 1.3 GB (arrays, no aggregation)

That's quite a lot of memory; what's the average number of tokens per document?

I'm +1 for building CSR matrices directly - most (all?) of our sparse classifiers/regressors operate on CSR matrices. Furthermore, CSR matrices are more memory efficient than COO matrices.

@larsmans
Member Author

No I didn't, see updated comment :)

@amueller
Member

Ok :) more tests are better, but if you don't want to fiddle with it, just put it in dont_test for the moment.

@larsmans
Member Author

!@#$% (excuse me)

Turns out np.string_ objects are all instances of bytes, but Cython doesn't recognize them as strings. Sigh...

Do you think we should handle this case as well? Is it likely that users will try to input NumPy arrays of exotic dtype, even if we don't hint at the possibility in the docs?

@larsmans
Member Author

No wait, I was mistaken. The problem is that I tried

In [15]: a = np.array([[("foo", 1.), ("bar", 2.)]], dtype=object)

In [16]: b = np.array([[("foo", 1.), ("bar", 2.)]])

but that's not the kind of object I thought it was:

In [17]: a[0]
Out[17]: 
array([[foo, 1.0],
       [bar, 2.0]], dtype=object)

In [18]: b[0]
Out[18]: 
array([['foo', '1.0'],
       ['bar', '2.0']], 
      dtype='|S3')

I give up on trying to support this kind of thing. Please pull when you think this is done.

@mblondel
Member

It would be nice to add support for dictionaries as the default input_type. Being a drop-in replacement for DictVectorizer by default is important, as we don't want users to learn too many input formats. If I'm not mistaken, you just need to create a generator that consumes the dicts and yields pairs instead.

@amueller
Member

yield from dict.iteritems() ^^ (python 3.3 though :-/)

@GaelVaroquaux
Member

yield from dict.iteritems() ^^ (python 3.3 though :-/)

for k, v in dict.items():
    yield k, v

Note that iteritems no longer exists in Python 3, and items() has been a lazy view for a little while.

@amueller
Member

oh, ok.

@larsmans
Member Author

I'll change the input format to dicts since they're actually faster to process.

@ogrisel
Member

ogrisel commented Nov 16, 2012

+1 for dict as the default input format for this kind of vectorizer.

@ogrisel
Member

ogrisel commented Nov 17, 2012

@larsmans can you please merge master into this branch, or rebase it, for a final review of the mergeable state?

Build CSR instead of COO matrix in FeatureHasher.
Partially followed Jochen Wersdörfer's (@ephes's) suggestions.
Suggestions from @amueller and @GaelVaroquaux:
* default value for n_features
* input validation in fit, with test
* lots of small docfixes
* capitalization and full stop in error messages
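For readers unfamiliar with the format: building CSR directly means filling the data / indices / indptr triplet yourself. A minimal illustrative sketch of the pattern (not the PR's actual Cython code):

import numpy as np
from scipy import sparse

def build_csr(samples, n_features, column_of):
    # samples: iterable of iterables of (feature, value) pairs;
    # column_of: maps a feature name to a column index (e.g. via hashing).
    data, indices, indptr = [], [], [0]
    for sample in samples:
        for feature, value in sample:
            indices.append(column_of(feature))
            data.append(value)
        indptr.append(len(indices))  # row i spans indptr[i]:indptr[i+1]
    X = sparse.csr_matrix(
        (np.asarray(data), np.asarray(indices), np.asarray(indptr)),
        shape=(len(indptr) - 1, n_features))
    X.sum_duplicates()               # fold repeated columns within a row
    return X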
@larsmans
Member Author

Rebased. I also squashed a lot of the microcommits to prevent master from getting cluttered.

@ogrisel
Member

ogrisel commented Nov 17, 2012

Looks good: +1 for merging!

@amueller
Member

Have you pointed out that it is now a drop-in replacement for DictVectorizer? (It is, after your last change, right?)
Otherwise +1 for merge from me, too :)
This is a very awesome addition to sklearn!

@larsmans
Member Author

Not exactly, as it doesn't do one-of-K coding. I might add that later, once I've figured out whether I can do it fast enough and factor out the common code. (The NLTK wrapper might also benefit from it.)

@ogrisel
Member

ogrisel commented Nov 17, 2012

@mblondel a final comment before merging?

@mblondel
Member

+1 for merge

ogrisel added a commit that referenced this pull request on Nov 18, 2012: MRG feature hashing transformer
ogrisel merged commit c89d208 into scikit-learn:master on Nov 18, 2012
@ogrisel
Member

ogrisel commented Nov 18, 2012

Done! Thanks @larsmans this will be very useful.

@amueller
Member

Awesome! Thanks @larsmans :)

@larsmans
Member Author

Thanks for the reviews, everyone!
