MRG feature hashing transformer #909


Merged: 7 commits merged into scikit-learn:master on Nov 18, 2012

Conversation

larsmans
Member

This PR implements a feature hasher for large-scale learning or memory-restricted prediction.

It implements Weinberger et al.'s algorithm, except that it cuts one corner: instead of running two hash functions on the input, it runs a single one, taking its lowest bit to determine the sign and the rest to determine the column index. That should be ok with a 32-bit hash function, as scipy.sparse matrices cannot hold more than 2**31 elements anyway.
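For illustration, here is a minimal Python sketch of that corner-cutting scheme (using sklearn's bundled MurmurHash as a stand-in for whichever hash function ends up being used):

from sklearn.utils import murmurhash3_32

def hashed_index_and_sign(feature, n_features):
    # One 32-bit hash per feature: the lowest bit picks the sign,
    # the remaining 31 bits pick the column index.
    h = murmurhash3_32(feature.encode("utf-8"), positive=True)
    sign = 1 if h & 1 else -1
    index = (h >> 1) % n_features
    return index, sign

print(hashed_index_and_sign("token", 2 ** 20))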

Todo:

  • Support multiple hash functions, in particular @ogrisel's implementation of murmurhash.
  • Do we want to support the same style of input as DictVectorizer? I have a gut feeling that that's going to slow things down. If we don't, we should explicitly document this.
  • Needs optimization.
  • Needs more tests.
  • More documentation needed.
  • Should this thing be called HashingVectorizer instead? That means more typing than FeatureHasher, but it is a vectorizer...

@ogrisel
Member

ogrisel commented Jun 15, 2012

Very interesting. I was about to work on that too (actually I have a WIP branch somewhere).

Stuff to add (to this PR or to a later one):

  • support for hashed feature cross-products (I think VW does this), using the hash of the first term as the seed when rehashing the hash of the second term of the pair (see the sketch after this list).
  • a combined hashing text vectorizer that does not use an in-memory vocabulary but hashes the terms as they are read (I agree that the transformer API with sequences of pretokenized inputs is interesting too).
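A rough sketch of the cross-product idea from the first bullet above (the function name is hypothetical, and VW's actual scheme may differ):

from sklearn.utils import murmurhash3_32

def crossed_feature_index(term_a, term_b, n_features):
    # Hash the first term, then reuse that hash as the seed
    # when hashing the second term of the pair.
    h_a = murmurhash3_32(term_a.encode("utf-8"), positive=True)
    h_ab = murmurhash3_32(term_b.encode("utf-8"), seed=h_a, positive=True)
    return h_ab % n_features

print(crossed_feature_index("user=alice", "item=42", 2 ** 20))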

For the name I would keep FeatureHasher for the current API and write a new class dedicated to text input called HashingVectorizer (maybe reusing the TF-IDF normalization as an option too).

Related but not to implement in this PR:

  • multiclass / multilabel wrapper with the hashing trick on both the label and the input (I am working on this currently), also described in the paper you reference.

@ogrisel
Member

ogrisel commented Jun 15, 2012

BTW: murmurhash3_32 is already included in sklearn.utils if you missed it.

Edit: hum, your description makes it explicit you are already aware of this...

@larsmans
Member Author

Where's your branch? Maybe we can combine things.

@ogrisel
Member

ogrisel commented Jun 16, 2012

It's very early work, more a stash than a branch. The goal would be to create a BaseVectorizer ABC that factors out all the preprocessing / tokenization logic, then derive CountVectorizer and HashingVectorizer from it. CountVectorizer has methods such as inverse_transform that don't make sense for HashingVectorizer, hence HashingVectorizer cannot be a direct subclass of CountVectorizer.

The other branch (not yet pushed to github either) is about label hashing, but that will happen in the sklearn.multiclass module independently of this PR, I think.

@larsmans
Member Author

I've done some work (not pushed yet) regarding the handling of arbitrary hash functions. This turns out to be trickier than I thought because hash functions may be either 32 or 64 bits (or maybe even larger), and either signed or unsigned.

I'll think this through and maybe come up with new code next weekend.

In other news, hashing works quite well for a hacked version of examples/document_classification.py that extracts terms using re.findall(r'\w+', text_file); vectorizing is some 27% faster with no noticeable drop in F1-score. (However, running LinearSVC for very large n_features is very slow.)
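(For reference, the term extraction described above amounts to something like this; the actual modified benchmark script isn't shown in this thread.)

import re

def extract_terms(text):
    # crude tokenizer from the hacked benchmark: every \w+ run is a term
    return re.findall(r"\w+", text)

print(extract_terms("Vectorizing is some 27% faster."))
# ['Vectorizing', 'is', 'some', '27', 'faster']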

@ogrisel
Member

ogrisel commented Jun 18, 2012

Out of curiosity what value for n_features are you using in your tests? For 20 newsgroups 1e5 is probably more than enough.

@larsmans
Member Author

I was using 2**20. But indeed, 2**16 works quite well too and is a lot easier on the learners.

@pprett
Member

pprett commented Jun 18, 2012

@larsmans great contribution indeed - I've hacked something similar for my work, but this looks much more thorough.

@ogrisel regarding the multiclass / multilabel wrapper: I thought about something similar for SGDClassifier, basically a specialized WeightVector class that implements dot via hashing. The downside is that it would require lots of refactoring of the other code too. Even worse: you cannot use hashing combined with OvR... (the order in which the binary tasks are executed affects collisions/updates)

@larsmans
Member Author

No news, but I finally pushed what I had a week ago. Please review; there are comments in the .pyx regarding how to handle hash functions. The main issue is casting a signed integer of unknown size to int32_t, preferably without losing bits, and in a way that is reproducible.
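One candidate approach, sketched in plain Python (an assumption about how it could be done, not necessarily what the branch does): mask to a fixed width, which is well defined for Python integers of any size and signedness, then fold and reinterpret.

def fold_to_int32(h):
    # View the hash as (at most) 64 unsigned bits; negative inputs take
    # their two's-complement form, so the result is reproducible.
    h &= (1 << 64) - 1
    # XOR-fold the upper 32 bits into the lower 32 rather than simply
    # discarding them.
    h = (h >> 32) ^ (h & 0xFFFFFFFF)
    # Reinterpret the result as a signed 32-bit integer (int32_t).
    return h - (1 << 32) if h >= (1 << 31) else h

print(fold_to_int32(-1), fold_to_int32(2 ** 40 + 5))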

@ogrisel
Member

ogrisel commented Jun 24, 2012

Any performance difference between Python's built-in hash and murmurhash? I am assuming that the generic untyped Python function call overhead dominates here.

h = hashfn(f)
sign = (h & 1) * 2 - 1
h = <uint32_t>(h)
h >>= 1
Member

This line deserves a comment such as: we use the lowest bit of the hash to compute the sign and the remaining 31 bits to index into a 2 ** 31-dimensional space. This makes it possible to do a single call to the hash function instead of two.

Member Author

I was going to put that in the narrative docs, but I didn't write those yet.

Member

I think it's good to have that in the source code as well.

@larsmans
Member Author

Murmurhash is a bit slower than hash. I think I'm going to drop hash support, since it just doesn't provide enough guarantees. Worse, in future Python versions hash randomization will be enabled by default. hash also won't take a seed argument.

@ogrisel
Member

ogrisel commented Jun 24, 2012

Alright. For the text feature extraction version, it might be reasonable to handle tokens as utf-8 encoded byte arrays and use murmurhash3_bytes_u32 directly inside the Cython loop, to get rid of some Python function call overhead.
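From Python, the same hash is reachable as sklearn.utils.murmurhash3_32, so the proposal looks roughly like this (the Cython version would just skip the interpreter overhead):

from sklearn.utils import murmurhash3_32

def token_indices(tokens, n_features):
    # Encode each token to utf-8 bytes and hash straight to a column index.
    for token in tokens:
        yield murmurhash3_32(token.encode("utf-8"), positive=True) % n_features

print(list(token_indices(["feature", "hashing", "rocks"], 2 ** 16)))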

@ogrisel
Member

ogrisel commented Jul 18, 2012

BTW, I have lost the early branches I was toying around with (laptop crash without pushing those branches to my public repo). So please feel free to move on without waiting for my input.

@larsmans
Member Author

Oh, I wasn't waiting -- I'm just very busy ATM. Will resume work when I have more time.

@larsmans
Member Author

larsmans commented Sep 4, 2012

I pushed a bunch more commits. I dropped custom hash functions because they're currently too hard to support, putting in a no-op hashfn argument for forward compatibility. I also wrote up some narrative documentation.

I had to change murmurhash.pxd to get this to build. I should probably have changed some setup.py but I'm not going to fix that now.

@ogrisel: Do I understand correctly that you wanted to work on this during the next sprint? I won't be there, but you have my blessing to improve on this. In particular, you seem to be knowledgeable wrt. the internals of CSR matrices, and maybe building a CSR matrix directly could provide some speedup.

@amueller
Member

amueller commented Sep 4, 2012

I think the refactoring that @ogrisel suggested sounds good. I'd be somewhat interested and I could try to help. But I'm also not going to be at the sprint.

@ogrisel
Member

ogrisel commented Sep 4, 2012

My priority for this sprint will be the joblib parallel stuff + the Python 3 port. I don't think I will have time to work on feature hashing, although I think this is a very important feature. @amueller, please feel free to poke around in a branch if you want to suggest further improvements.

We can keep the feature hashing transformer as it is and independently write a HashingVectorizer for text input in a separate PR; both can be useful and will likely share little code, since the input data structures are different and the code should be optimized for each.

@ogrisel
Member

ogrisel commented Sep 4, 2012

@larsmans could you please update the summary / todo with the current state of the code and what you are planning to add before switching it to "MRG"?

Do you think it would be possible to implement a variant that accepts the same input format as DictVectorizer, but uses a hash function instead of a materialized mapping?

@larsmans
Member Author

larsmans commented Sep 4, 2012

@ogrisel Updated the TODO.

And yes, we could implement a variant that takes iterables of pairs as samples. We could even do the one-of-K expansion using C string manipulation, if we're careful. (Or C++'s std::string, if @GaelVaroquaux isn't watching. ;)
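In plain Python, the one-of-K expansion being alluded to is just name mangling; a hedged sketch (a C or C++ version would do the same concatenation without interpreter overhead):

def expand_one_of_k(pairs):
    # Categorical (name, string) pairs become indicator features by
    # folding the value into the feature name; numeric values pass through.
    for name, value in pairs:
        if isinstance(value, str):
            yield "%s=%s" % (name, value), 1.0
        else:
            yield name, float(value)

print(list(expand_one_of_k([("color", "red"), ("length", 4)])))
# [('color=red', 1.0), ('length', 4.0)]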

@ephes
Contributor

ephes commented Sep 10, 2012

I don't know if this behavior is intended, but currently a smaller n_features value doesn't lead to a smaller memory footprint. Here's a modified version of _hashing.pyx: https://github.com/ephes/scikit-learn/blob/hashing-trick_array/sklearn/feature_extraction/_hashing.pyx

@larsmans
Member Author

Hi @ephes, thanks for the suggestion, but how did you determine the memory usage? Also, I've been trying to avoid building a hash table, relying instead on coo_matrix's built-in handling of duplicates; is your version just as fast?

@ephes
Contributor

ephes commented Sep 10, 2012

@larsmans by looking at the output of "top" :) - here are some more details: #1122 (comment)

And no, it's quite a bit slower. But I think coo_matrix does not handle duplicates (quite astonishing):

>>> x = sparse.coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 1, 1, 1], [1, 1, 1, 2, 2, 2])), shape=(2, 3))
>>> print x
  (0, 1)    1
  (0, 1)    1
  (0, 1)    1
  (1, 2)    1
  (1, 2)    1
  (1, 2)    1

@ephes
Contributor

ephes commented Sep 10, 2012

I've remeasured memory usage (500K documents):

  • 5.0 GB (lists, no aggregation)
  • 1.3 GB (arrays, no aggregation)

With aggregation of token counts:

  • 871 MB for n_features=100000000
  • 500 MB for n_features=100

Hmm, different n_features values have hardly any influence at all. The main improvement comes from not storing an (i, j, value) triple for every token.

@ogrisel
Member

ogrisel commented Sep 10, 2012

About duplicates: scipy.sparse.csr_matrix has eliminate_zeros and sort_indices methods. I don't think COO has such methods.

@amueller
Member

From the docs:
"By default when converting [COO] to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)"

@larsmans
Member Author

Indeed, coo_matrix does handle duplicates, in its conversion routines to other formats :)
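Rerunning the earlier snippet with a conversion shows the summing:

from scipy import sparse

x = sparse.coo_matrix(([1, 1, 1, 1, 1, 1],
                       ([0, 0, 0, 1, 1, 1], [1, 1, 1, 2, 2, 2])),
                      shape=(2, 3))
print(x.tocsr().toarray())   # duplicate (i, j) entries are summed
# [[0 3 0]
#  [0 0 3]]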

The idea with such large numbers of documents is that you vectorize them in batches, then feed them to a learner that has a partial_fit method or a trained model. I do intend to optimize for that case, not for vectorizing very large numbers of samples.

Besides, if you have 500,000 samples with a similar number of features (in the NLP problems that I handle, usually n_features > n_samples in training), you've got 5e5² > 2**31 elements, which is unreliable in scipy.sparse matrices anyway. (And yes, we should document that somewhere.)

In other words, I don't intend to use your dict trick. But please leave the commit around, since I do like the idea of using growing arrays to save some memory.
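The batch workflow described above would look something like this (a sketch only: stream_batches is a hypothetical data source, and FeatureHasher's final constructor signature may differ):

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

hasher = FeatureHasher(n_features=2 ** 18)   # stateless: no fit needed
clf = SGDClassifier()

for raw_batch, y_batch in stream_batches():  # hypothetical batch source
    X_batch = hasher.transform(raw_batch)    # e.g. one dict of features per sample
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])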

@ephes
Contributor

ephes commented Sep 10, 2012

@larsmans thanks for clarifying the intended use case. My actual dataset is even larger: 3.4 million documents (all questions ever posted on Stack Overflow; it's a public dataset). Number of unique tokens: 5054787. Number of tokens after removing all tokens with a document frequency < 2 (not useful for training): 1288578. So I won't hit the 2^31 dimensions limit anytime soon (but nice to know).

But I wonder whether the claim that the hashing trick saves memory is valid at all. Without a vocabulary dict, you have to store all tokens as features, and this makes the feature matrix about 4 times as big (at least in this case). And since the vocabulary dict is negligibly small compared to the feature matrix, you pay big time in memory for not having one. Hmm, in an online setting with partial_fit you don't have to hold the complete feature matrix in memory, I guess, which makes the memory overhead bearable...

@ogrisel
Member

ogrisel commented Sep 10, 2012

When using the hashing trick you don't store the feature names (the string of the token): you hash the bytes of the string while reading them to find the integer index of the feature, using a modulo into a fixed-size feature space (e.g. 10e5 possible features). The learning algorithms are usually more than able to deal with the occasional collisions. It's a kind of random projection where the projection matrix is implicitly defined by the hash function and its seed. No need for a vocabulary, no need to store the token-string-to-feature-index mapping, and no fit-time state: this makes it possible to perform the transform operation directly, in parallel, on streamed chunks of data, with low and constant memory usage. The cost is losing the explainability of your model (there is no way to invert the mapping to recover the feature names).

@pprett
Member

pprett commented Sep 10, 2012

I've remeasured memory usage (500K documents):

  • 5.0 GB (lists, no aggregation)
  • 1.3 GB (arrays, no aggregation)

That's quite a lot of memory; what's the average number of tokens per document?

I'm +1 for building CSR matrices directly - most (all?) of our sparse classifiers/regressors operate on CSR matrices. Furthermore, CSR matrices are more memory efficient than COO matrices.

@larsmans
Member Author

No I didn't, see updated comment :)

@amueller
Member

Ok :) more tests are better, but if you don't want to fiddle with it, just put it in dont_test for the moment.

@larsmans
Member Author

!@#$% (excuse me)

Turns out np.string_ objects are all instances of bytes, but Cython doesn't recognize them as strings. Sigh...

Do you think we should handle this case as well? Is it likely that users will try to input NumPy arrays of exotic dtype, even if we don't hint at the possibility in the docs?

@larsmans
Member Author

No wait, I was mistaken. The problem is that I tried

In [15]: a = np.array([[("foo", 1.), ("bar", 2.)]], dtype=object)

In [16]: b = np.array([[("foo", 1.), ("bar", 2.)]])

but that's not the kind of object I thought it was:

In [17]: a[0]
Out[17]: 
array([[foo, 1.0],
       [bar, 2.0]], dtype=object)

In [18]: b[0]
Out[18]: 
array([['foo', '1.0'],
       ['bar', '2.0']], 
      dtype='|S3')

I give up on trying to support this kind of thing. Please pull when you think this is done.

@mblondel
Member

It would be nice to add support for dictionaries as the default input_type. Being a drop-in replacement for DictVectorizer by default is important, as we don't want users to learn too many input formats. If I'm not mistaken, you just need to create a generator that consumes the dicts and yields pairs instead.

@amueller
Member

yield from dict.iteritems() ^^ (python 3.3 though :-/)

@GaelVaroquaux
Member

yield from dict.iteritems() ^^ (python 3.3 though :-/)

for k, v in dict.items():
    yield k, v

Note that iteritems no longer exists in Python 3, and items() has been a lazy view for a little while.

@amueller
Member

oh, ok.

@larsmans
Member Author

I'll change the input format to dicts since they're actually faster to process.

@ogrisel
Member

ogrisel commented Nov 16, 2012

+1 for dict as the default input format for this kind of vectorizer.

@ogrisel
Member

ogrisel commented Nov 17, 2012

@larsmans can you please merge master into this branch, or rebase it, for a final review of the mergeable state?

Build CSR instead of COO matrix in FeatureHasher.
Partially followed Jochen Wersdörfer's (@ephes's) suggestions.
Suggestions from @amueller and @GaelVaroquaux:
* default value for n_features
* input validation in fit, with test
* lots of small docfixes
* capitalization and full stop in error messages
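For readers unfamiliar with the format: building CSR directly means filling the data / indices / indptr triplet yourself. A minimal illustrative sketch of the pattern (not the PR's actual Cython code):

import numpy as np
from scipy import sparse

def build_csr(samples, n_features, column_of):
    # samples: iterable of iterables of (feature, value) pairs;
    # column_of: maps a feature name to a column index (e.g. via hashing).
    data, indices, indptr = [], [], [0]
    for sample in samples:
        for feature, value in sample:
            indices.append(column_of(feature))
            data.append(value)
        indptr.append(len(indices))  # row i spans indptr[i]:indptr[i+1]
    X = sparse.csr_matrix(
        (np.asarray(data), np.asarray(indices), np.asarray(indptr)),
        shape=(len(indptr) - 1, n_features))
    X.sum_duplicates()               # fold repeated columns within a row
    return X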
@larsmans
Member Author

Rebased. I also squashed a lot of the microcommits to prevent master from getting cluttered.

@ogrisel
Member

ogrisel commented Nov 17, 2012

Looks good: +1 for merging!

@amueller
Member

Have you pointed out that it is now a drop-in replacement for DictVectorizer? (It is, after your last change, right?)
Otherwise +1 for merge from me, too :)
This is a very awesome addition to sklearn!

@larsmans
Member Author

Not exactly, as it doesn't do one-of-K coding. I might add that later, once I've figured out whether I can do it fast enough and factor out the common code. (The NLTK wrapper might also benefit from it.)

@ogrisel
Member

ogrisel commented Nov 17, 2012

@mblondel a final comment before merging?

@mblondel
Member

+1 for merge

ogrisel added a commit that referenced this pull request on Nov 18, 2012: MRG feature hashing transformer
ogrisel merged commit c89d208 into scikit-learn:master on Nov 18, 2012
@ogrisel
Member

ogrisel commented Nov 18, 2012

Done! Thanks @larsmans this will be very useful.

@amueller
Member

Awesome! Thanks @larsmans :)

@larsmans
Member Author

Thanks for the reviews, everyone!
