
Hash collisions in the FeatureHasher #7513


Closed
rth opened this issue Sep 28, 2016 · 7 comments · Fixed by #7565
Comments

@rth
Member

rth commented Sep 28, 2016

Description

I find the current output of the HashingVectorizer quite counter-intuitive, mostly due to the mechanism used to handle hash collisions in the FeatureHasher (the current implementation is conceptually close to the second example here). Without this mechanism, if two words have the same hash, the corresponding count value would simply be 2 (i.e. there was a collision). With this mechanism a random sign (+1 or -1) is added to each feature count, so that when a collision occurs the result can be -2, 0 or +2 (a minimal sketch of this mechanism is given after the list below).

  • the +2 or -2 outcome is essentially what would have happened without this mechanism, and is equivalent to saying that the two colliding words are identical (from the point of view of this algorithm), which is expected.
  • the case of 0 (50% chance) is considered a successful hash collision resolution, and is equivalent to saying that neither of those two words was present in the text document to begin with. I fail to see how this is better than the case above.
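
For concreteness, here is a minimal sketch of the mechanism (not the actual FeatureHasher implementation, which uses MurmurHash3 in Cython; the hashed_counts helper and the alternate_sign flag are hypothetical names used only for illustration):

import hashlib
import numpy as np

def hashed_counts(tokens, n_features=8, alternate_sign=True):
    """Toy feature hasher: accumulate token counts into n_features buckets."""
    x = np.zeros(n_features)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)  # stable toy hash
        idx = (h >> 1) % n_features                         # bucket index
        sign = -1 if (alternate_sign and h & 1) else 1      # pseudo-random +/-1 per token
        x[idx] += sign
    return x

# Two distinct tokens that land in the same bucket sum to -2, 0 or +2 depending
# on their signs; with alternate_sign=False the sum would always be 2.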

The second case is also the reason for issues #3637 and #2665, which I tried to address in PR #7502 by adding an eliminate_zeros() call (as suggested in those issues), but I'm no longer convinced this is the right thing to do (i.e. why would we remove any mention that some words were in the corpus just because they happen to have the same hash?).

I understand that the current implementation is based on the Weinberger et al. (2009) paper and is also implemented in Vowpal Wabbit. However, are there any other studies that demonstrate that this hash collision resolution is necessary and consistently gives better results?

Example

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
text = ['This is the first document.']
CountVectorizer().fit_transform(text).data
# => array([1, 1, 1, 1, 1], dtype=int64), as expected

HashingVectorizer(norm=None).fit_transform(text).data
# => array([ 1.,  1., -1., -1., -1.])  # an additional random sign is added to the extracted elements

HashingVectorizer(norm=None, n_features=10).fit_transform(text).data
# => array([ 0., -1.,  1., -1.])  # one collision of a +1 and a -1 feature producing 0

Issues

  1. Negative feature counts in the output of HashingVectorizer do not make sense. It is not logical that, roughly speaking, hashing the keys of the CountVectorizer output should also add a random sign to the counts.

  2. The documentation states

    If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling [..]

    I don't think the latter part is correct. The collisions occur at the sum_duplicates call, before the absolute value is taken, so taking the absolute value has no impact on collision handling IMO (see the short worked example after this list).

  3. Issues #3637 (HashingVectorizer + TfidfTransformer fails because of a stored zero) and #2665 (possibly inefficient sparse matrix creation in FeatureHasher).
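
To illustrate point 2 above with a schematic worked example: the cancellation happens when duplicate entries are summed, before the absolute value is applied, so abs() cannot recover the lost counts.

signed = [+1, -1]      # two colliding tokens that received opposite signs
total = sum(signed)    # the sum_duplicates step -> 0
abs(total)             # the non_negative=True step -> still 0, both tokens are lost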

Proposed solution

It would be more logical to make non_negative=True disable the hash collision resolution mechanism in the HashingVectorizer (i.e. make all the counts positive), which would address issues 2 and 3 above. Making non_negative=True the default would address issue 1 (though it is not backward compatible).
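
On the example above, the proposed semantics would give plain counts (hypothetical output, assuming non_negative=True disables the sign mechanism as proposed; this is not the current behaviour):

HashingVectorizer(norm=None, non_negative=True).fit_transform(text).data
# => array([ 1.,  1.,  1.,  1.,  1.])  # plain counts; a collision would yield 2., not 0.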

cc @jnothman @larsmans

@amueller amueller added the Bug label Sep 28, 2016
@amueller amueller added this to the 0.19 milestone Sep 28, 2016
@jnothman
Member

We've got two different concepts here:

  • a hashed feature matrix as counts
  • a hashed feature matrix as a compression of the input space

As counts, in the HashingVectorizer context, negative values don't make much sense. And you're right: I think there should be an option that disables the "hash collision resolution mechanism", as you say. I'd be happy to see that as a further option for non_negative, and perhaps the current non_negative=True behaviour could then be deprecated. In the context of n-gram features, collisions are equivalent to creating new polysemy.

As a matrix compression, feature hashing is effectively an extreme form of sparse random projection. In this model, one gets theoretical guarantees about the approximate equivalence of Euclidean distances in the input and reduced spaces. The features are also centred on 0. Given that there is a theoretical basis for this model, I don't think non_negative=True as a default is necessarily appropriate, although I appreciate that this model may be less pertinent for text than when hashing non-text features.

Note (@amueller) that an L2-norm on this reduced space means that vector products in the hashed space approximate cosine distance in the input space...
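
A rough numerical check of this random-projection view (my own sketch, not code from the thread): with alternating signs, inner products in the hashed space approximate those in the original space, and the approximation improves as the table grows.

import numpy as np

rng = np.random.default_rng(0)
n_orig = 5000
x = rng.poisson(1.0, n_orig).astype(float)           # non-negative "count" vectors
y = rng.poisson(1.0, n_orig).astype(float)

def hash_project(v, n_hashed, bucket, sign):
    out = np.zeros(n_hashed)
    np.add.at(out, bucket, sign * v)                 # scatter-add the signed values
    return out

print("exact inner product:", x @ y)
for n_hashed in (64, 1024, 16384):
    bucket = rng.integers(0, n_hashed, size=n_orig)  # hash bucket of each input feature
    sign = rng.choice([-1.0, 1.0], size=n_orig)      # alternating sign of each input feature
    est = hash_project(x, n_hashed, bucket, sign) @ hash_project(y, n_hashed, bucket, sign)
    print(n_hashed, est)                             # approaches the exact value as n_hashed grows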

@jnothman
Member

TLDR:

  • +1 for having a non_negative='total'
  • -1/2 for changing the non_negative=False default

@jnothman
Member

Also, a caveat: I've not read Weinberger et al.

@rth
Member Author

rth commented Sep 29, 2016

Very interesting, thanks for pointing out the link with sparse random projections! I have only skimmed Weinberger et al. (and I didn't understand all the practical implications), but I also had the impression that preserving the properties of the inner product was a major point.

The thing is, though, that for a CountVectorizer applying the L2 norm makes the dot product equal to the cosine similarity. The same is true when hashing the keys with the HashingVectorizer, assuming that n_features is large and there are few or no collisions. I would assume that this inner-product-preserving property only becomes relevant once there is a significant number of collisions (as otherwise it holds approximately anyway).
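
A quick way to see this (my sketch, not from the thread): with the default large hashing table, the cosine similarities computed from the hashed counts match those from the raw counts almost exactly, sign trick or not.

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['This is the first document.',
        'This document is the second document.']
print(cosine_similarity(CountVectorizer().fit_transform(docs)))
print(cosine_similarity(HashingVectorizer(norm=None).fit_transform(docs)))
# matching words contribute sign * sign = +1 to the dot product, so with few
# collisions the two similarity matrices agree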

The default options for HashingVectorizer are bag-of-words with n_features=1048576, which would certainly produce almost no collisions. It does not make sense to also enable by default an option that is only useful in the high-collision case. The whole spirit of the documentation is currently "get a hashing table big enough to avoid collisions", and I think that is certainly how it is mostly used.

The case of matrix compression is interesting, and I believe it should be explicitly mentioned in the documentation of the HashingVectorizer, but IMO this should be something that users activate because they know what they are doing, not the default. Currently HashingVectorizer is presented in the docs as just a CountVectorizer replacement for large document collections without the inverse transform, while more is actually happening by default without the user being aware of it.

I'll try to run a few benchmarks on this as soon as I find some time.

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@jnothman
Member

(Yes, I meant that the inner product is cosine sim, not that it preserves cosine distance.)

Your argument is fair, and in reality there are many ways we can spin this. But changing a default value is a tricky deprecation and just not worth it unless the case is definitive or users are frequently being misled.

Will you offer a PR to extend / fix the behaviour of non_negative?

@rth
Member Author

rth commented Oct 7, 2016

OK, I finally looked into this a bit more. Here are the classification accuracy and the relative error between the inner product in the hashed space and the one before hashing, as a function of the hashing table size, for 3 cases:

  • non_negative=False: with the "alternating sign" / collision compensation mechanism
  • non_negative=True: same as above + absolute value of the result
  • non_negative="total": disabling this collision compensation mechanism
    [figure: classification accuracy and relative inner-product error vs. hashing table size
    (https://cloud.githubusercontent.com/assets/630936/19200076/5ec7bcfe-8cc7-11e6-8b45-acf7768a665b.png)]

The observations are the following:
  • all this does not seem to matter for the text classification accuracy (here using logistic regression on the 2/20 newsgroups dataset); if anything, it can occasionally make it slightly worse
  • the "alternating sign" mechanism does indeed preserve the inner product much better; however,
    • this only matters for small hashing table sizes: n_features < 1000 with BOW, n_features < 10000 for 4-char ngrams, etc.
    • the inner product is still mostly preserved even if we apply some other post-normalization (e.g. an L2 norm to compute the cosine similarity, or TF-IDF weights)
    • for binary counts, this algorithm has almost no impact
  • Setting non_negative=True makes the inner product preservation much worse
  • The region where this "alternating sign" mechanism becomes relevant is also the region where the classification accuracy drops (i.e. where a bigger hashing table is needed).

Full results are available here: https://github.com/rth/notebooks/blob/master/sklearn/HashingVectorizer_collisions.ipynb, run using the code from PR #7565.
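
A minimal sketch of how the relative inner-product error can be measured (this is not the actual benchmark code from the notebook or PR #7565):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = fetch_20newsgroups(subset='train', categories=['sci.space']).data[:200]
X_ref = CountVectorizer().fit_transform(docs)
G_ref = X_ref.dot(X_ref.T).toarray()                 # reference document Gram matrix

for n_features in (2**8, 2**12, 2**16, 2**20):
    X = HashingVectorizer(norm=None, n_features=n_features).fit_transform(docs)
    G = X.dot(X.T).toarray()
    err = np.abs(G - G_ref).sum() / np.abs(G_ref).sum()
    print(n_features, err)                           # relative inner-product error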

@jnothman
Member

jnothman commented Oct 8, 2016

Thanks for this, and fair enough. The inner product preservation only seems relevant to me in the case where features are not all positive, i.e. not relevant for text.

I've looked at #909, where this was introduced. Apart from the case of binary data, where it might be beneficial to preserve binarity (although this can be done another way), I agree that it's better to avoid this sign alternation when non-negativity is sought. I'd be happy to hear if @larsmans disagrees.

Let's be bold and deprecate the current non_negative=True semantics. So: add an option non_negative='total', then, whenever a user specifies non_negative=True, warn that this will become the meaning of non_negative=True in two versions' time. If someone objects upon seeing this warning, hopefully we'll hear about it and can backtrack on this decision.
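
A possible shape for that deprecation path (a hypothetical sketch only, not the code that was eventually merged in #7565):

import warnings

def _validate_non_negative(non_negative):
    # hypothetical helper, sketching the warning text only
    if non_negative is True:
        warnings.warn(
            "non_negative=True currently takes the absolute value after the "
            "signed hashing; in two releases it will instead disable the sign "
            "alternation (the behaviour proposed as non_negative='total').",
            DeprecationWarning)
    return non_negative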

