[MRG+1] FIX Hash collisions in the FeatureHasher #7565
Conversation
Thanks for this. I'll try to take a look tomorrow.
@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
     array.resize_smart(indices, len(indices) + 1)
     indices[len(indices) - 1] = abs(h) % n_features
-    value *= (h >= 0) * 2 - 1
+    if alternate_sign:  # counter the effect of hash collision (issue #7513)
+        value *= (h >= 0) * 2 - 1
I don't think this description is correct.
(also, PEP8: keep two spaces before a comment)
I'm coming to consider the idea that the current non_negative is rubbish, and that maybe we should deprecate it and install another parameter in its place...
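For intuition, the signed-hashing mechanism under discussion can be sketched in pure Python. This is a toy stand-in, not the scikit-learn code: the real Cython implementation uses MurmurHash3, while this sketch uses Python's built-in hash.

```python
from collections import defaultdict


def toy_feature_hash(tokens, n_features, alternate_sign=True):
    """Toy signed feature hasher; illustration only, not the sklearn code."""
    out = defaultdict(float)
    for tok in tokens:
        h = hash(tok)
        # the diff's `value *= (h >= 0) * 2 - 1` maps the hash sign to +1/-1
        sign = 1.0 if (h >= 0 or not alternate_sign) else -1.0
        out[abs(h) % n_features] += sign
    return [out[i] for i in range(n_features)]


vec = toy_feature_hash(["foo", "bar", "baz"], n_features=8,
                       alternate_sign=False)
```

With alternate_sign=False every token contributes +1, so the entries always sum to the token count; with the sign flip enabled, colliding tokens of opposite sign cancel instead of inflating a single bucket.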
@@ -15,7 +15,7 @@ np.import_array()
 @cython.boundscheck(False)
 @cython.cdivision(True)
-def transform(raw_X, Py_ssize_t n_features, dtype):
+def transform(raw_X, Py_ssize_t n_features, dtype, char alternate_sign):
type should be bint, not char
    When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
    When True or False, an alternating sign is added to the counts as to
    approximately conserve the inner product in the hashed space.
usually one would use "preserve" not "conserve"
    When True, output values can be interpreted as frequencies.
    When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
    When True or False, an alternating sign is added to the counts as to
Not necessarily counts, here.
    approximately conserve the inner product in the hashed space.
    When True, an absolute value is additionally applied to the result
    prior to returning it.
    When 'total' all counts are positive which disables collision handling.
It doesn't really disable collision handling, does it? It just does not faithfully preserve the inner product. The output, however, can still include negative values if input features had negative values. So 'total' might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
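The point above can be made concrete with a toy collision example. The tokens and signs here are invented for illustration; in the real hasher the sign is derived from the hash value.

```python
# Force a collision: imagine a single-bucket hash table.
values = {"cat": 2.0, "dog": 3.0}

# 'total'-style behaviour: colliding feature values simply sum.
total = sum(values.values())          # 5.0 -- "sum of all features with that hash"

# With alternating signs, opposite-signed collisions partially cancel,
# which keeps the *expected* contribution of a collision at zero.
signs = {"cat": 1.0, "dog": -1.0}     # hypothetical signs from the hash
bucket = sum(signs[k] * v for k, v in values.items())
```

Here `total` is 5.0, while the signed `bucket` is -1.0: the collision no longer inflates the count, at the price of no longer being a plain frequency.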
@@ -148,6 +152,6 @@ def transform(self, raw_X, y=None):
     X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
                       shape=(n_samples, self.n_features))
     X.sum_duplicates()  # also sorts the indices
-    if self.non_negative:
+    if self.non_negative is True:  # if non_negative == 'total', X>0 anyway
Not necessarily true. We need to decide how to handle that case.
One solution is to add a further parameter... But honestly, if there is negative input, non_negative should be False.
def test_hasher_non_negative():
    raw_X = [["foo", "bar", "baz"]]

    def it():  # iterable
why?
why not use raw_X directly? And convention would use X for input, and Xt for output.
@jnothman Thanks a lot for the detailed review! I wasn't sure about the boolean type in Cython, and I did miss the second space before the comment. As for the formulation, I still think that saying that this mechanism would "approximately preserve the inner product in the hashed space" is misleading. As illustrated in my other comment, a more exact formulation would be that it better preserves the inner product a) for small hashing table sizes and b) when binary counts are not used; so in the end it is still about how collisions are handled with respect to this property. I'll look into addressing the other comments in your review soon.
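The claim that the sign flip matters mainly for small hash table sizes can be checked with a quick simulation. This is a pure-Python sketch: hash indices and signs are drawn at random rather than computed by a real hash function.

```python
import random

rng = random.Random(0)
n_terms, n_features = 1000, 64        # many terms, small hash table

x = [rng.random() for _ in range(n_terms)]
y = [rng.random() for _ in range(n_terms)]
buckets = [rng.randrange(n_features) for _ in range(n_terms)]  # fake hash indices
signs = [rng.choice((-1.0, 1.0)) for _ in range(n_terms)]      # fake hash signs


def project(v, use_sign):
    """Hash v into n_features buckets, optionally flipping signs."""
    out = [0.0] * n_features
    for i, val in enumerate(v):
        out[buckets[i]] += val * (signs[i] if use_sign else 1.0)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


true_ip = dot(x, y)
err_signed = abs(dot(project(x, True), project(y, True)) - true_ip)
err_unsigned = abs(dot(project(x, False), project(y, False)) - true_ip)
```

Without signs, every collision adds a positive cross-term, so the hashed inner product is systematically biased upward; with signs, the cross-terms have zero mean and err_signed comes out far smaller than err_unsigned.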
Sorry for the late response. @jnothman I addressed the review comments above, and the ones in your post in the original issue. I'm just not sure about when the deprecation warning should be raised: in the current PR it is raised both for non_negative=True and non_negative='total'.
Yes, don't deprecate
                             " True, False, 'total'.")
        if non_negative in ['total', True]:
            warnings.warn("the option non_negative=True has been deprecated"
                          " in 0.19. As of 0.21 non_negative='total' would be"
Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."
    X = FeatureHasher(non_negative=False,
                      input_type='string').fit_transform(it())
    assert_true((X.data > 0).any() and (X.data < 0).any())
surely this test is more useful if we check that there's a zero?
Well the point here was to test that the output has both negative and positive values, not that it has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.
Thanks! I made the changes requested above. The only comment I'm not sure how to address is the one about when the absolute value should be applied. You mean that if the feature counts are negative to begin with, then with this code
    assert_true((Xt.data >= 0).all())  # zeros are acceptable
    Xt = FeatureHasher(non_negative='total',
                       input_type='string').fit_transform(X)
    assert_true((Xt.data > 0).all())  # strictly positive counts
But this isn't testing that non_negative='total' is working, is it?
You are right, thanks, changed the test to actually have a hash collision so this part is tested.
Could you run a benchmark on the overhead of abs?
I checked the overhead of abs on the 20 newsgroups when using HashingVectorizer; it is negligible (1 ms out of a 3.65 s total run time). Another issue is that right now there is no way of doing just the plain feature hashing (i.e. just count the occurrences in each element of the hashing table). I agree that adding a new parameter (how about
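The measurement quoted above was made on the 20 newsgroups run; a generic micro-benchmark along the same lines can be sketched with the stdlib alone (the array size and iteration count here are arbitrary stand-ins; in practice one would time np.abs(X.data) on the actual sparse matrix):

```python
import array
import timeit

# Stand-in for the non-zero values of a large sparse matrix.
data = array.array("d", (float(i % 7 - 3) for i in range(1_000_000)))


def apply_abs():
    # element-wise absolute value over the whole value buffer
    return array.array("d", map(abs, data))


t = timeit.timeit(apply_abs, number=5) / 5
print(f"abs over {len(data)} values: {t * 1e3:.1f} ms per pass")
```

The takeaway matches the comment above: a single elementwise abs pass is cheap relative to tokenizing and hashing an entire corpus.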
I think
Make it
    When True, an absolute value is applied to the features matrix prior to
    returning it. When used in conjunction with alternate_sign=True, this
    significantly reduces the inner product preservation property.
    This option is deprecated as of 0.19.
you can use the ``.. deprecated::`` directive. Be sure to specify when it'll be removed (0.21).
Actually, not so important seeing as you have 0.21 in the code
This looks good. But in case someone were to reimplement it with an abs, please add a test with mixed-sign data.
@jnothman Thanks for the feedback! I added a test with mixed-sign data and the
LGTM. Thanks very much!
Please add an entry in whats_new
Thanks for the comment @lesteve, fixed the rst formatting...
I am not very familiar with text so I am afraid I am not the best one to review this one.
@jnothman do you think it's better to have Maybe
It's not random, it's a function of the hash...
Thanks for the PR. Looks good to me, pending minor comments... I'll make a final pass and merge once you address those...
@@ -15,7 +15,7 @@ np.import_array()
 @cython.boundscheck(False)
 @cython.cdivision(True)
-def transform(raw_X, Py_ssize_t n_features, dtype):
+def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign):
Can you set the default value to be True here? I know it's not the public API but it would be nice to preserve the previous functionality... (Feel free to ignore the suggestion)
def test_hasher_alternate_sign():
    X = [["foo", "bar", "baz", "investigation need", "records"]]
    # the last two tokens produce a hash collision that sums as 0
Can you move this comment above the previous line?
(+1 for a well-thought-out test case. Thx)
    assert_true(len(Xt.data) < len(X[0]))
    assert_true((Xt.data == 0.).any())

    Xt = FeatureHasher(alternate_sign=True, non_negative=True,
This will raise a deprecation warning, right? Can you wrap it using ignore_warnings(DeprecationWarning) maybe?
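The helper mentioned above is sklearn's testing utility; the same effect can be had with the stdlib alone. The legacy_api function below is a hypothetical stand-in for a call that would trigger the deprecation, such as constructing a FeatureHasher with non_negative=True.

```python
import warnings


def legacy_api():
    # Hypothetical stand-in for e.g. FeatureHasher(non_negative=True)
    warnings.warn("non_negative=True is deprecated", DeprecationWarning)
    return "ok"


# Equivalent of sklearn's ignore_warnings helper, stdlib-only:
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    result = legacy_api()   # the warning is suppressed inside the block
```

This keeps the test output clean while still exercising the deprecated code path.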
alternate_sign : boolean, optional, default True
    When True, an alternating sign is added to the features as to
    approximately conserve the inner product in the hashed space even for
    small n_features. This approach is similar to sparse random projection.
you need a new feature (or feature added?) tag?
(has been addressed below)
Ah! got it, thanks :)
Thanks for the review @raghavrv! I addressed all your comments (I think). Appveyor would take a day or so to build, unfortunately...
By induction, appveyor will provably pass :p Thanks @rth!
* HashingVectorizer: optionally disable alternate signs
This PR fixes #7513 and extends the non_negative parameter in FeatureHasher and HashingVectorizer with a value of 'total', which disables the hash collision handling (and makes the output more intuitive). Here are the annotations for the compiled _hashing.pyx. Suggestions on how to better explain the non_negative behavior in the doc-strings are welcome.

The documentation for the HashingVectorizer should also be improved IMO, e.g. to explain the preservation of the inner product in the hashed space, but maybe in a separate PR.

Update: this PR also supersedes #7502, which was a continuation of #5861 aiming to fix #3637. It also resolves #2665.
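As a summary of the semantics that came out of this discussion, here is a toy model of how the two parameters interact. This is an illustration only, not the scikit-learn implementation: it uses Python's built-in hash, and the function name is invented.

```python
def toy_hash_vector(items, n_features, alternate_sign=True,
                    non_negative=False):
    """Toy model of the post-PR parameter semantics (not the real API)."""
    out = [0.0] * n_features
    for token, value in items:
        h = hash(token)
        # alternate_sign=True flips the sign based on the hash value
        sign = 1.0 if (h >= 0 or not alternate_sign) else -1.0
        out[abs(h) % n_features] += sign * value
    if non_negative:                 # deprecated path: abs() after summation
        out = [abs(v) for v in out]
    return out


items = [("cat", 2.0), ("dog", 3.0)]
plain = toy_hash_vector(items, 8, alternate_sign=False)
```

With alternate_sign=False the entries sum to the total input mass (plain feature hashing); with non_negative=True the output is forced non-negative by an abs, which is exactly the behaviour the review argues degrades the inner-product preservation when combined with sign alternation.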