[MRG+1] FIX Hash collisions in the FeatureHasher #7565
Conversation
Thanks for this. I'll try to take a look tomorrow.
@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
     array.resize_smart(indices, len(indices) + 1)
     indices[len(indices) - 1] = abs(h) % n_features
-    value *= (h >= 0) * 2 - 1
+    if alternate_sign:  # counter the effect of hash collision (issue #7513)
+        value *= (h >= 0) * 2 - 1
I don't think this description is correct.
(also, PEP8: keep two spaces before a comment)
I'm coming to consider the idea that the current non_negative is rubbish, and that maybe we should deprecate it and install another parameter in its place...
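For intuition, the signed-hashing mechanism under discussion can be sketched in pure Python. This is a toy stand-in, not the scikit-learn code: the real Cython implementation uses MurmurHash3, while this sketch uses Python's built-in hash.

```python
from collections import defaultdict


def toy_feature_hash(tokens, n_features, alternate_sign=True):
    """Toy signed feature hasher; illustration only, not the sklearn code."""
    out = defaultdict(float)
    for tok in tokens:
        h = hash(tok)
        # the diff's `value *= (h >= 0) * 2 - 1` maps the hash sign to +1/-1
        sign = 1.0 if (h >= 0 or not alternate_sign) else -1.0
        out[abs(h) % n_features] += sign
    return [out[i] for i in range(n_features)]


vec = toy_feature_hash(["foo", "bar", "baz"], n_features=8,
                       alternate_sign=False)
```

With alternate_sign=False every token contributes +1, so the entries always sum to the token count; with the sign flip enabled, colliding tokens of opposite sign cancel instead of inflating a single bucket.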
@@ -15,7 +15,7 @@ np.import_array()
 @cython.boundscheck(False)
 @cython.cdivision(True)
-def transform(raw_X, Py_ssize_t n_features, dtype):
+def transform(raw_X, Py_ssize_t n_features, dtype, char alternate_sign):
type should be bint, not char
    When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
    When True or False, an alternating sign is added to the counts as to
    approximately conserve the inner product in the hashed space.
usually one would use "preserve" not "conserve"
    When True, output values can be interpreted as frequencies.
    When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
    When True or False, an alternating sign is added to the counts as to
Not necessarily counts, here.
    approximately conserve the inner product in the hashed space.
    When True, an absolute value is additionally applied to the result
    prior to returning it.
    When 'total' all counts are positive which disables collision handling.
It doesn't really disable collision handling, does it? It just does not faithfully preserve the inner product. The output, however, can still include negative values if input features had negative values. So 'total' might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
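The point above can be made concrete with a toy collision example. The tokens and signs here are invented for illustration; in the real hasher the sign is derived from the hash value.

```python
# Force a collision: imagine a single-bucket hash table.
values = {"cat": 2.0, "dog": 3.0}

# 'total'-style behaviour: colliding feature values simply sum.
total = sum(values.values())          # 5.0 -- "sum of all features with that hash"

# With alternating signs, opposite-signed collisions partially cancel,
# which keeps the *expected* contribution of a collision at zero.
signs = {"cat": 1.0, "dog": -1.0}     # hypothetical signs from the hash
bucket = sum(signs[k] * v for k, v in values.items())
```

Here `total` is 5.0, while the signed `bucket` is -1.0: the collision no longer inflates the count, at the price of no longer being a plain frequency.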
@@ -148,6 +152,6 @@ def transform(self, raw_X, y=None):
     X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
                       shape=(n_samples, self.n_features))
     X.sum_duplicates()  # also sorts the indices
-    if self.non_negative:
+    if self.non_negative is True:  # if non_negative == 'total', X>0 anyway
Not necessarily true. We need to decide how to handle that case.
One solution is to add a further parameter... But honestly, if there is negative input, non_negative should be False.
def test_hasher_non_negative():
    raw_X = [["foo", "bar", "baz"]]

    def it():  # iterable
why?
why not use raw_X directly? And convention would use X for input, and Xt for output.
@jnothman Thanks a lot for the detailed review! I wasn't sure about the boolean type in Cython, and I did miss the second space before the comment. As for the formulation, I still think that saying that this mechanism would "approximately preserve the inner product in the hashed space" is misleading. As illustrated in my other comment, a more exact formulation would be that it better preserves the inner product a) for small hashing table sizes and b) when binary counts are not used; so in the end it is still about how collisions are handled with respect to this property. I'll look into addressing the other comments in your review soon.
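The claim that the sign flip matters mainly for small hash table sizes can be checked with a quick simulation. This is a pure-Python sketch: hash indices and signs are drawn at random rather than computed by a real hash function.

```python
import random

rng = random.Random(0)
n_terms, n_features = 1000, 64        # many terms, small hash table

x = [rng.random() for _ in range(n_terms)]
y = [rng.random() for _ in range(n_terms)]
buckets = [rng.randrange(n_features) for _ in range(n_terms)]  # fake hash indices
signs = [rng.choice((-1.0, 1.0)) for _ in range(n_terms)]      # fake hash signs


def project(v, use_sign):
    """Hash v into n_features buckets, optionally flipping signs."""
    out = [0.0] * n_features
    for i, val in enumerate(v):
        out[buckets[i]] += val * (signs[i] if use_sign else 1.0)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


true_ip = dot(x, y)
err_signed = abs(dot(project(x, True), project(y, True)) - true_ip)
err_unsigned = abs(dot(project(x, False), project(y, False)) - true_ip)
```

Without signs, every collision adds a positive cross-term, so the hashed inner product is systematically biased upward; with signs, the cross-terms have zero mean and err_signed comes out far smaller than err_unsigned.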
Sorry for the late response. @jnothman I addressed the review comments above, and the ones in your post in the original issue. I'm just not sure about when the deprecation warning should be raised: in the current PR it is raised both for non_negative=True and non_negative='total'.
Yes, don't deprecate
                             " True, False, 'total'.")
        if non_negative in ['total', True]:
            warnings.warn("the option non_negative=True has been deprecated"
                          " in 0.19. As of 0.21 non_negative='total' would be"
Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."
    X = FeatureHasher(non_negative=False,
                      input_type='string').fit_transform(it())
    assert_true((X.data > 0).any() and (X.data < 0).any())
surely this test is more useful if we check that there's a zero?
Well the point here was to test that the output has both negative and positive values, not that it has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.
Thanks! I made the changes requested above. The only comment I'm not sure how to address is the one about when the absolute value should be applied. You mean that if the feature counts are negative to begin with, then with this code
    assert_true((Xt.data >= 0).all())  # zeros are acceptable
    Xt = FeatureHasher(non_negative='total',
                       input_type='string').fit_transform(X)
    assert_true((Xt.data > 0).all())  # strictly positive counts
But this isn't testing that non_negative='total' is working, is it?
You are right, thanks, changed the test to actually have a hash collision so this part is tested.
Could you run a benchmark on the overhead of abs?
I checked the overhead of abs on the 20 newsgroups when using HashingVectorizer; it is negligible (1 ms out of a 3.65 s total run time). Another issue is that right now there is no way of doing just the plain feature hashing (i.e. just count the occurrences in each element of the hashing table). I agree that adding a new parameter (how about
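The measurement quoted above was made on the 20 newsgroups run; a generic micro-benchmark along the same lines can be sketched with the stdlib alone (the array size and iteration count here are arbitrary stand-ins; in practice one would time np.abs(X.data) on the actual sparse matrix):

```python
import array
import timeit

# Stand-in for the non-zero values of a large sparse matrix.
data = array.array("d", (float(i % 7 - 3) for i in range(1_000_000)))


def apply_abs():
    # element-wise absolute value over the whole value buffer
    return array.array("d", map(abs, data))


t = timeit.timeit(apply_abs, number=5) / 5
print(f"abs over {len(data)} values: {t * 1e3:.1f} ms per pass")
```

The takeaway matches the comment above: a single elementwise abs pass is cheap relative to tokenizing and hashing an entire corpus.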
I think
Make it
    When True, an absolute value is applied to the features matrix prior to
    returning it. When used in conjunction with alternate_sign=True, this
    significantly reduces the inner product preservation property.
    This option is deprecated as of 0.19.
you can use the ``.. deprecated::`` directive. Be sure to specify when it'll be removed (0.21).
Actually, not so important seeing as you have 0.21 in the code
This looks good. But in case someone were to reimplement it with an abs, please add a test with mixed-sign data.
@jnothman Thanks for the feedback! I added a test with mixed-sign data and the
LGTM. Thanks very much!
Please add an entry in whats_new
Thanks for the comment @lesteve, fixed the rst formatting...
I am not very familiar with text so I am afraid I am not the best one to review this one.
@jnothman do you think it's better to have Maybe
It's not random, it's a function of the hash...
Thanks for the PR. Looks good to me, pending minor comments... I'll make a final pass and merge once you address those...
@@ -15,7 +15,7 @@ np.import_array()
 @cython.boundscheck(False)
 @cython.cdivision(True)
-def transform(raw_X, Py_ssize_t n_features, dtype):
+def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign):
Can you set the default value to be True here? I know it's not the public API but it would be nice to preserve the previous functionality... (Feel free to ignore the suggestion)
def test_hasher_alternate_sign():
    X = [["foo", "bar", "baz", "investigation need", "records"]]
    # the last two tokens produce a hash collision that sums as 0
Can you move this comment above the previous line?
(+1 for a well-thought-out test case. Thx)
    assert_true(len(Xt.data) < len(X[0]))
    assert_true((Xt.data == 0.).any())

    Xt = FeatureHasher(alternate_sign=True, non_negative=True,
This will raise a deprecation warning, right? Can you wrap it using ignore_warnings(DeprecationWarning) maybe?
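The helper mentioned above is sklearn's testing utility; the same effect can be had with the stdlib alone. The legacy_api function below is a hypothetical stand-in for a call that would trigger the deprecation, such as constructing a FeatureHasher with non_negative=True.

```python
import warnings


def legacy_api():
    # Hypothetical stand-in for e.g. FeatureHasher(non_negative=True)
    warnings.warn("non_negative=True is deprecated", DeprecationWarning)
    return "ok"


# Equivalent of sklearn's ignore_warnings helper, stdlib-only:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    result = legacy_api()   # the warning is suppressed inside the block
```

This keeps the test output clean while still exercising the deprecated code path.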
alternate_sign : boolean, optional, default True
    When True, an alternating sign is added to the features as to
    approximately conserve the inner product in the hashed space even for
    small n_features. This approach is similar to sparse random projection.
you need a new feature (or feature added?) tag?
(has been addressed below)
Ah! got it, thanks :)
Thanks for the review @raghavrv! I addressed all your comments (I think). Appveyor would take a day or so to build, unfortunately...
By induction, appveyor will provably pass :p Thanks @rth!
* HashingVectorizer: optionally disable alternate signs
This PR fixes #7513 and extends the non_negative parameter in FeatureHasher and HashingVectorizer with a value of 'total', which disables the hash collision handling (and makes the output more intuitive). Here are the annotations for the compiled _hashing.pyx. Suggestions on how to better explain the non_negative behavior in the doc-strings are welcome.

The documentation for the HashingVectorizer should also be improved IMO, e.g. to explain the preservation of the inner product in the hashed space, but maybe in a separate PR.

Update: this PR also supersedes #7502, which was a continuation of #5861 aiming to fix #3637. It also resolves #2665.
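As a summary of the semantics that came out of this discussion, here is a toy model of how the two parameters interact. This is an illustration only, not the scikit-learn implementation: it uses Python's built-in hash, and the function name is invented.

```python
def toy_hash_vector(items, n_features, alternate_sign=True,
                    non_negative=False):
    """Toy model of the post-PR parameter semantics (not the real API)."""
    out = [0.0] * n_features
    for token, value in items:
        h = hash(token)
        # alternate_sign=True flips the sign based on the hash value
        sign = 1.0 if (h >= 0 or not alternate_sign) else -1.0
        out[abs(h) % n_features] += sign * value
    if non_negative:                 # deprecated path: abs() after summation
        out = [abs(v) for v in out]
    return out


items = [("cat", 2.0), ("dog", 3.0)]
plain = toy_hash_vector(items, 8, alternate_sign=False)
```

With alternate_sign=False the entries sum to the total input mass (plain feature hashing); with non_negative=True the output is forced non-negative by an abs, which is exactly the behaviour the review argues degrades the inner-product preservation when combined with sign alternation.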