[MRG+1] FIX Hash collisions in the FeatureHasher #7565

Merged: 17 commits from rth:hashing_vect-nn into scikit-learn:master on Jun 8, 2017

Conversation

@rth (Member) commented Oct 3, 2016

This PR fixes #7513 and extends the non_negative parameter in FeatureHasher and HashingVectorizer with a value of 'total', which disables the hash collision handling (and makes the output more intuitive).

Here are the annotations for the compiled _hashing.pyx. Suggestions on how to better explain the non_negative behavior in the doc-strings are welcome.

The documentation for the HashingVectorizer should also be improved IMO, e.g. to explain the preservation of the inner product in the hashed space, but maybe in a separate PR.

Update: this PR also supersedes #7502, which was a continuation of #5861 aiming to fix #3637. It also resolves #2665.
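
To make the behaviour concrete, here is a minimal sketch (editorial illustration, not part of the PR) using the alternate_sign spelling that the review below eventually settles on; at this point in the thread the option was still spelled non_negative='total'. With 3 tokens and only 2 buckets, a collision is guaranteed:

from sklearn.feature_extraction import FeatureHasher

X = [["foo", "bar", "baz"]]  # 3 tokens into 2 buckets: collision guaranteed

for alternate_sign in (True, False):
    hasher = FeatureHasher(n_features=2, input_type="string",
                           alternate_sign=alternate_sign)
    Xt = hasher.transform(X)
    # with sign alternation, colliding tokens may partially cancel;
    # without it, each bucket holds the plain sum of colliding counts
    print(alternate_sign, Xt.toarray())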

@rth rth changed the title FIX Hash collisions in the FeatureHasher [MRG] FIX Hash collisions in the FeatureHasher Oct 3, 2016
@jnothman (Member) commented Oct 4, 2016

Thanks for this. I'll try to take a look tomorrow.

@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype):

array.resize_smart(indices, len(indices) + 1)
indices[len(indices) - 1] = abs(h) % n_features
value *= (h >= 0) * 2 - 1
if alternate_sign: # counter the effect of hash collision (issue #7513)
Member:
I don't think this description is correct.

Member:
(also, PEP8: keep two spaces before a comment)

@jnothman (Member) left a comment:
I'm coming to consider the idea that the current non_negative is rubbish, and that maybe we should deprecate it and install another parameter in its place...

@@ -15,7 +15,7 @@ np.import_array()

@cython.boundscheck(False)
@cython.cdivision(True)
def transform(raw_X, Py_ssize_t n_features, dtype):
def transform(raw_X, Py_ssize_t n_features, dtype, char alternate_sign):
Member:
type should be bint, not char

When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
When True or False, an alternating sign is added to the counts as to
approximately conserve the inner product in the hashed space.
Member:
usually one would use "preserve" not "conserve"

When True, output values can be interpreted as frequencies.
When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
When True or False, an alternating sign is added to the counts as to
Member:
Not necessarily counts, here.

approximately conserve the inner product in the hashed space.
When True, an absolute value is additionally applied to the result
prior to returning it.
When 'total' all counts are positive which disables collision handling.
Member:
It doesn't really disable collision handling, does it? It just does not faithfully preserve inner product. The output, however, can still include negative values if input features had negative values. So total might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
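
A quick sketch of the point about negative inputs (editorial illustration using today's alternate_sign spelling rather than the non_negative='total' spelling under review):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=4, input_type="dict",
                       alternate_sign=False)
# each bucket is the sum of every value hashed into it, so a negative
# input value passes straight through to the output
Xt = hasher.transform([{"gain": 2.0, "boost": 1.5, "loss": -3.0}])
print(Xt.toarray())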

@@ -148,6 +152,6 @@ def transform(self, raw_X, y=None):
X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
shape=(n_samples, self.n_features))
X.sum_duplicates() # also sorts the indices
if self.non_negative:
if self.non_negative is True: # if non_negative == 'total', X>0 anyway
Member:
Not necessarily true. We need to decide how to handle that case.

One solution is to add a further parameter... But honestly, if there is negative input, non_negative should be False.

def test_hasher_non_negative():
raw_X = [["foo", "bar", "baz"]]

def it(): # iterable
Member:
why?

Member:
why not use raw_X directly?

And convention would use X for input, and Xt for output.

@rth (Member, Author) commented Oct 7, 2016

@jnothman Thanks a lot for the detailed review! I wasn't sure about booleans in Cython and did miss the second space before the comment in the .pyx. I agree that having 3 values for the current non_negative parameter is not great and that something should be done about it.

For the formulation, I still think that saying this mechanism "approximately preserves the inner product in the hashed space" is misleading. As illustrated in my other comment, a more exact formulation would be that it better preserves the inner product (a) for small hashing table sizes and (b) when binary counts are not used; so in the end it is still about how collisions are handled with respect to this property.

I'll look into addressing the other comments in your review soon.
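
A rough empirical check of this claim (editorial sketch, again written against the alternate_sign spelling adopted later in this thread): compare the exact inner product of two documents' token counts with the hashed inner product at a deliberately small table size, with and without sign alternation:

from collections import Counter
import numpy as np
from sklearn.feature_extraction import FeatureHasher

rng = np.random.RandomState(0)
docs = [["tok%d" % i for i in rng.randint(0, 1000, size=200)]
        for _ in range(2)]

# exact inner product computed from explicit token counts
c0, c1 = Counter(docs[0]), Counter(docs[1])
exact = sum(c0[t] * c1[t] for t in c0)

for alternate_sign in (True, False):
    hasher = FeatureHasher(n_features=64, input_type="string",
                           alternate_sign=alternate_sign)
    Xt = hasher.transform(docs)
    hashed = Xt[0].multiply(Xt[1]).sum()  # inner product in hashed space
    print(alternate_sign, exact, hashed)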

@rth (Member, Author) commented Nov 2, 2016

Sorry for the late response. @jnothman I addressed the review comments above, and the ones from your post in the original issue.

I'm just not sure when the deprecation warning should be raised: in the current PR it is raised both for non_negative=True (to warn that it's deprecated) and for non_negative='total' (to warn that this option will be renamed to non_negative=True), but that doesn't sound right, as it means that users won't be able to do anything to make this warning go away for two versions. Should I deprecate just non_negative=True then?

@jnothman (Member) commented Nov 2, 2016

Yes, don't deprecate non_negative='total'.

" True, False, 'total'.")
if non_negative in ['total', True]:
warnings.warn("the option non_negative=True has been deprecated"
" in 0.19. As of 0.21 non_negative='total' would be"
Member:
Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."


X = FeatureHasher(non_negative=False,
input_type='string').fit_transform(it())
assert_true((X.data > 0).any() and (X.data < 0).any())
Member:
surely this test is more useful if we check that there's a zero?

@rth (Member, Author) Nov 2, 2016:

Well, the point here was to test that the output has both negative and positive values, not that it has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.

@rth (Member, Author) commented Nov 2, 2016

Thanks! I made the changes requested above.

The only comment I'm not sure how to address is about when the absolute value should be applied. You mean that if the feature counts are negative to begin with, then with this code self.non_negative==True will make the output positive, but self.non_negative=='total' will not? But at the same time, unnecessarily applying the absolute value to feature counts with HashingVectorizer (probably 95% of the use cases for this) would have some overhead, wouldn't it? Otherwise the solution could be to rename the non_negative parameter to something else (as with this PR, making the output positive/negative would be just a side effect of the improved inner product preservation, which is what this parameter does in the end), but that would give additional headache with backward compatibility...

assert_true((Xt.data >= 0).all()) # zeros are acceptable
Xt = FeatureHasher(non_negative='total',
input_type='string').fit_transform(X)
assert_true((Xt.data > 0).all()) # strictly positive counts
Member:
But this isn't testing that non_negative='total' is working, is it?

@rth (Member, Author):

You are right, thanks, changed the test to actually have a hash collision so this part is tested.

@jnothman (Member) commented Nov 2, 2016

Could you run a benchmark on the overhead of abs? Otherwise maybe we could indeed introduce a parameter sign_flipping=True (or something) and deprecate non_negative.

@rth rth force-pushed the hashing_vect-nn branch from a850b29 to e6004f6 Compare November 2, 2016 13:13
@rth (Member, Author) commented Nov 2, 2016

I checked the overhead of abs on the 20 newsgroups dataset when using HashingVectorizer; it is negligible (1 ms out of a 3.65 s total run time). Another issue is that right now there is no way of doing just the plain feature hashing (i.e. just counting the occurrences in each element of the hashing table): if non_negative=False the alternating sign is applied, and if non_negative in [True, 'total'] we also apply the absolute value. So in the case when some features are negative initially, there is no way to just count them without an additional abs. This is because the current non_negative parameter mixes abs with the "alternating sign" functionality.

I agree that adding a new parameter (how about rand_sparse_proj or altern_sign?) and deprecating non_negative would be annoying, but it may be the best thing to do here?
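
For reference, one way such a measurement could be run (editorial sketch; fetching 20 newsgroups needs network access, and the abs step is timed in isolation rather than through the non_negative parameter):

from time import time
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

data = fetch_20newsgroups(subset="train").data

t0 = time()
Xt = HashingVectorizer().fit_transform(data)
print("transform: %.3f s" % (time() - t0))

t0 = time()
np.abs(Xt.data, out=Xt.data)  # the extra step that non_negative adds
print("abs:       %.6f s" % (time() - t0))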

@rth rth force-pushed the hashing_vect-nn branch from e6004f6 to 498b7f5 Compare November 3, 2016 14:16
@jnothman (Member) commented Nov 6, 2016

I think altern_sign=True would be clear enough. Let's go with that..?

@jnothman (Member) commented Nov 6, 2016

Make it alternate_sign

@rth rth force-pushed the hashing_vect-nn branch from e5cd4ff to 56bc5a5 Compare November 6, 2016 21:20
When True, an absolute value is applied to the features matrix prior to
returning it. When used in conjunction with alternate_sign=True, this
significantly reduces the inner product preservation property.
This option is deprecated as of 0.19.
Member:
you can use the .. deprecated directive. Be sure to specify when it'll be removed (0.21).

Member:
Actually, not so important seeing as you have 0.21 in the code

@jnothman (Member) commented Nov 6, 2016

This looks good. But in case someone were to reimplement it with an abs, please add a test with mixed-sign data.
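
Such a test might look like the following (hypothetical sketch with the new parameter name; with no sign flips and no abs, the bucket totals must sum to the sum of the inputs, which an abs-based reimplementation would break):

from sklearn.feature_extraction import FeatureHasher

def test_hashing_mixed_sign():
    hasher = FeatureHasher(n_features=16, input_type="dict",
                           alternate_sign=False)
    Xt = hasher.transform([{"a": 1.0, "b": -2.0}])
    # bucket totals must sum to 1.0 + (-2.0); an abs() would give 3.0
    assert Xt.sum() == -1.0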

@rth (Member, Author) commented Nov 6, 2016

@jnothman Thanks for the feedback! I added a test with mixed-sign data and the ..deprecated directives.

@jnothman (Member) commented Nov 6, 2016

LGTM. Thanks very much!

@jnothman jnothman changed the title [MRG] FIX Hash collisions in the FeatureHasher [MRG+1] FIX Hash collisions in the FeatureHasher Nov 6, 2016
@jnothman (Member) commented Nov 6, 2016

Please add an entry in whats_new

@rth (Member, Author) commented Mar 30, 2017

Thanks for the comment @lesteve, fixed the rst formatting...

@jnothman (Member) commented:
@raghavrv and @lesteve, you both made requests here. Are we ready to merge?

@lesteve (Member) commented Jun 1, 2017

I am not very familiar with text feature extraction, so I am afraid I am not the best one to review this one.

@raghavrv raghavrv added the Sprint label Jun 6, 2017
@raghavrv (Member) commented Jun 7, 2017

@jnothman do you think it's better to have random somewhere in the parameter name just to make things clear that it does not always alternate...

Maybe randomly_alternate_sign (or random_alter_sign if you prefer something short)?

@jnothman (Member) commented Jun 7, 2017

It's not random, it's a function of the hash...

@raghavrv (Member) left a comment:
Thanks for the PR. Looks good to me, pending minor comments... I'll make a final pass and merge once you address those...

@@ -15,7 +15,7 @@ np.import_array()

@cython.boundscheck(False)
@cython.cdivision(True)
def transform(raw_X, Py_ssize_t n_features, dtype):
def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign):
Member:
Can you set the default value to be True here?

I know it's not the public API but it would be nice to preserve the previous functionality...

(Feel free to ignore the suggestion)


def test_hasher_alternate_sign():
X = [["foo", "bar", "baz", "investigation need", "records"]]
# the last two tokens produce a hash collision that sums as 0
Member:
Can you move this comment above the previous line?

(+1 for a well-thought-out test case. Thx)
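
For anyone curious how such a colliding pair can be found or verified: the bucket index and sign can be computed directly with scikit-learn's MurmurHash wrapper, mirroring the indexing code quoted earlier in this PR (editorial sketch; whether a given token pair collides depends on the table size):

from sklearn.utils.murmurhash import murmurhash3_32

def bucket_and_sign(token, n_features=2 ** 20):
    # mirrors _hashing.pyx: index = abs(h) % n_features, sign from h >= 0
    h = murmurhash3_32(token, positive=False)
    return abs(h) % n_features, 1 if h >= 0 else -1

print(bucket_and_sign("investigation need"))
print(bucket_and_sign("records"))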

assert_true(len(Xt.data) < len(X[0]))
assert_true((Xt.data == 0.).any())

Xt = FeatureHasher(alternate_sign=True, non_negative=True,
Member:
This will raise a deprecation warning, right? Can you wrap it using ignore_warnings(DeprecationWarning) maybe?

alternate_sign : boolean, optional, default True
When True, an alternating sign is added to the features as to
approximately conserve the inner product in the hashed space even for
small n_features. This approach is similar to sparse random projection.
Member:
you need a new feature (or feature added?) tag?

Member:
(has been addressed below)

@raghavrv (Member) commented Jun 7, 2017

It's not random, it's a function of the hash...

Ah! got it thanks :)

@rth rth force-pushed the hashing_vect-nn branch from f6be0db to b25e235 Compare June 8, 2017 06:53
@rth (Member, Author) commented Jun 8, 2017

Thanks for the review @raghavrv! I addressed all your comments (I think). AppVeyor would take a day or so to build, unfortunately...

@raghavrv (Member) commented Jun 8, 2017

Thanks @rth. I'll approve it for now... Please ping me when appveyor passes. I'll merge it... (Or is it okay to merge without appveyor, @jnothman?)

@jnothman (Member) commented Jun 8, 2017

By induction, appveyor will provably pass :p

Thanks @rth!

@jnothman jnothman merged commit 925a017 into scikit-learn:master Jun 8, 2017
@rth rth deleted the hashing_vect-nn branch June 8, 2017 11:36
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* HashingVectorizer: optionally disable alternate signs
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
* HashingVectorizer: optionally disable alternate signs
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
* HashingVectorizer: optionally disable alternate signs
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* HashingVectorizer: optionally disable alternate signs
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
* HashingVectorizer: optionally disable alternate signs
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
* HashingVectorizer: optionally disable alternate signs
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
* HashingVectorizer: optionally disable alternate signs