Hash collisions in the FeatureHasher #7513
Comments
We've got two different concepts here:
- Feature hashing as a way of obtaining (approximate) term counts. As counts, negative values don't make much sense.
- Feature hashing as matrix compression. Viewed this way, feature hashing is effectively an extreme form of sparse random projections: one gets theoretical guarantees about approximate equivalence of Euclidean distances in the input and reduced spaces, and the features are also centred on 0. Given that there is a theoretical basis for this model, I don't think the default should be changed lightly.

Note (@amueller) that an L2 norm on this reduced space means that vector products in the hashed space approximate cosine distance in the input space.

Also, a caveat: I've not read Weinberger et al.
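A minimal sketch (not from the thread) of the random-projection view: with the signed hashing and a reasonably large table, inner products between documents in the hashed space match the inner products of the exact counts. The documents and table size below are illustrative.

```python
# Illustrative sketch: signed feature hashing approximately preserves
# inner products between documents (the sparse-random-projection view).
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Exact counts as the reference space.
counts = CountVectorizer().fit_transform(docs).toarray()
exact_dot = counts[0] @ counts[1]

# Signed hashing, no normalisation; few collisions expected with 2**10 buckets.
hashed = HashingVectorizer(n_features=2**10, norm=None).transform(docs).toarray()
hashed_dot = hashed[0] @ hashed[1]

print(exact_dot, hashed_dot)  # typically identical here; close in general
```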
Very interesting, thanks for pointing out the link with sparse random projections! I have only skimmed through Weinberger et al. (and I didn't understand all the practical implications), but I also had the impression that preserving the properties of the inner product was a major point.

The thing is, though, that for a `CountVectorizer`, applying the L2 norm makes the dot product compute the cosine similarity. The same is true when hashing the keys with the `HashingVectorizer`, assuming that the number of collisions stays negligible. The default options for `HashingVectorizer` are Bag of Words with `norm='l2'` (and the signed hashing in the `FeatureHasher`).

The case of matrix compression is interesting, and I believe it should be explicitly mentioned in the documentation of the `HashingVectorizer`, but IMO this should be something that users activate because they know what they are doing, not something that is on by default. Currently the `HashingVectorizer` is presented in the docs as just a `CountVectorizer` replacement for large document collections without the inverse transform, while more things actually happen by default without the user being aware of it.

I'll try to run a few benchmarks on this as soon as I find some time.
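A small check of the claim above (the toy documents are made up): after L2 normalisation of the counts, a plain dot product gives the cosine similarity of the raw count vectors.

```python
# With L2-normalised counts, the dot product equals the cosine
# similarity of the raw count vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

docs = ["apple banana apple", "banana cherry banana"]

counts = CountVectorizer().fit_transform(docs)

# Cosine similarity computed directly on the raw counts ...
cos = cosine_similarity(counts)[0, 1]

# ... equals the dot product of the L2-normalised rows.
normed = normalize(counts, norm="l2")
dot = (normed[0] @ normed[1].T).toarray()[0, 0]

print(cos, dot)  # identical up to floating point rounding
```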
(Yes, I meant that the inner product is cosine sim, not that it preserves cosine distance.) Your argument is fair, and in reality there are many ways we can spin this. But changing a default value is a tricky deprecation and just not worth it unless the case is definitive or users are being frequently misled. Will you offer a PR to extend / fix the behaviour of `non_negative`?
OK, I finally looked a bit more into this. Here are the classification accuracy and the relative error between the inner product in the hashed space and the one before hashing, as a function of the hashing table size, for three cases.
Full results are available here, run using the code from PR #7565.
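A rough sketch of how this kind of relative error could be measured (this is not the benchmark code from PR #7565; the synthetic corpus and table sizes below are placeholders).

```python
# Rough sketch: compare inner products before and after hashing, for
# several hash table sizes, on a synthetic corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

rng = np.random.RandomState(0)
vocab = ["w%d" % i for i in range(5000)]
docs = [" ".join(rng.choice(vocab, size=100)) for _ in range(200)]

# Reference: pairwise inner products of the exact counts.
counts = CountVectorizer().fit_transform(docs)
reference = (counts @ counts.T).toarray().astype(float)

for n_features in (2**8, 2**12, 2**16):
    hashed = HashingVectorizer(n_features=n_features, norm=None).transform(docs)
    approx = (hashed @ hashed.T).toarray()
    rel_err = np.abs(approx - reference).sum() / reference.sum()
    print(n_features, rel_err)  # error shrinks as the table grows
```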
Thanks for this, and fair enough. The inner product preservation only seems […]. I've looked at #909, where this was introduced. Apart from the case of […]. Let's be bold and deprecate the current `non_negative=True` semantics.
Description
I find the current output of the `HashingVectorizer` quite counter-intuitive, mostly due to the mechanism used to handle hash collisions in the `FeatureHasher` (the current implementation is conceptually close to the second example here). Without this mechanism, if two words have the same hash, the corresponding count value would simply be 2 (i.e. there was a collision). With this mechanism a random sign is added to the feature count (+1 or -1), so that when a collision occurs the result can be -2, 0 or +2.

The second case (0) is also the reason for issues #3637 and #2665, which I tried to address in PR #7502 by adding an `eliminate_zeros()` call (as suggested in the parent issues), but I'm no longer convinced this is the right thing to do (i.e. why would we remove any mention that some words were in the corpus just because they happen to have the same hash?).

I understand that the current implementation is based on the Weinberger et al. 2009 paper and is what is implemented in Vowpal Wabbit. However, are there any other studies demonstrating that this hash collision resolution is necessary and consistently gives better results?
Example
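The original code example did not survive in this copy of the issue; the snippet below is a hypothetical stand-in that shows the behaviour being described, using a deliberately tiny hash table to force collisions.

```python
# Hypothetical stand-in for the missing example: with a tiny hash table
# collisions are frequent, and because each token's count is multiplied
# by a pseudo-random +1/-1 sign, colliding tokens can cancel to 0 or
# produce negative entries.
from sklearn.feature_extraction.text import HashingVectorizer

doc = ["apple banana cherry date elderberry fig grape honeydew"]

vec = HashingVectorizer(n_features=4, norm=None)
X = vec.transform(doc)

print(X.toarray())
# Exact values depend on which tokens collide, but some entries can be
# negative and colliding pairs of opposite sign sum to zero.
```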
Issues
1. Negative feature counts in the output of `HashingVectorizer` do not make sense. It is not logical that, roughly speaking, hashing the keys of the `CountVectorizer` would also add a random sign to the counts.
2. The documentation states […]. I don't think the latter part is correct: the collisions occur at the `sum_duplicates` call, before the absolute value is taken, so this has no impact on collision handling IMO (see the sketch after this list).
3. Issues "HashingVectorizer + TfidfTransformer fails because of a stored zero" #3637 and "possibly inefficient sparse matrix creation in FeatureHasher" #2665.
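A small scipy sketch of the point in item 2 (the positions and values are made up): duplicate entries at the same (row, column) are summed by `sum_duplicates`, so a +1 and a -1 landing in the same bucket cancel before any absolute value could be applied.

```python
# Two entries landing in the same (row, col) position -- i.e. a hash
# collision -- are summed by sum_duplicates, so opposite signs cancel
# into a stored, explicit zero.
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 0])
cols = np.array([3, 3])           # same hashed column: a collision
data = np.array([+1.0, -1.0])     # opposite signs from the hash function

X = coo_matrix((data, (rows, cols)), shape=(1, 8))
X.sum_duplicates()

print(X.toarray())  # all zeros: the colliding entries have cancelled
print(X.nnz)        # 1 -- the cancelled entry is still stored as an explicit zero
```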
Proposed solution
It would be more logical to make `non_negative=True` disable the hash collision resolution mechanism (i.e. make all the counts positive) in the `HashingVectorizer`, which would address issues 2 and 3 above. Making `non_negative=True` the default would address issue 1 (though it is not backward compatible).

cc @jnothman @larsmans
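For context, a sketch of what the proposal looks like in practice, assuming a scikit-learn version that exposes the `alternate_sign` parameter (if your version only has `non_negative`, the keyword will differ):

```python
# Sketch of the proposed behaviour, assuming a scikit-learn version with
# the `alternate_sign` parameter: disabling the random sign keeps all
# counts non-negative, and colliding tokens simply add up.
from sklearn.feature_extraction.text import HashingVectorizer

doc = ["apple banana cherry date elderberry fig grape honeydew"]

signed = HashingVectorizer(n_features=4, norm=None)
unsigned = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)

print(signed.transform(doc).toarray())    # may contain negative entries
print(unsigned.transform(doc).toarray())  # plain non-negative counts
```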