
[MRG] FIX HashingVectorizer + TfidfTransformer fails because of a stored zero #7502


Closed
rth wants to merge 3 commits from the hash_vect_collision branch

Conversation

@rth (Member) commented Sep 27, 2016

Continuation of PR #5861; fixes issue #3637.

This PR eliminates explicit zeros that can occur after a hash collision in the sparse matrix output of HashingVectorizer.
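For illustration, a minimal sketch of the mechanism (not this PR's actual diff): a collision whose signed values sum to zero leaves an explicitly stored zero in the CSR matrix, and scipy's eliminate_zeros() drops such entries.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0, 2.0]]))
X.data[0] = 0.0       # simulate a collision that summed to zero
print(X.nnz)          # 2 -- the zero is still stored explicitly
X.eliminate_zeros()
print(X.nnz)          # 1 -- the stored zero is gone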

Also added a regression test, and checked that this has no measurable impact on performance (benchmarked on the 20 newsgroups dataset).

Edit: CircleCI failed due to a timeout ("command ./build_tools/circle/build_doc.sh took more than 10 minutes since last output"); I can't do much about that, and I think it's unrelated to this PR.

@rth force-pushed the hash_vect_collision branch from 2c64064 to c9452cf on September 27, 2016 21:54
@jnothman (Member) commented:

Related: #2665

@jnothman (Member) commented Sep 27, 2016

I think that for this to be a strong test, you need to ensure there's enough data for a zero to be expected, or else write a test that can infer a zero was produced, such as checking that 5 words produced only 3 nonzero hashes (and therefore a hash collision).

I suggest you use HashingVectorizer(analyzer='char', n_features=1). Then fit_transform(['ab', 'ac', 'ad', 'ae', 'af', ...]) should produce at least one row where a zero needs to be eliminated. The chance of this occurring is 1/2 ('a' being 1 and the other char being -1, or vice versa).
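A hedged sketch of such a test (the document list and the loose check are my own, not the PR's):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['a' + c for c in 'bcdefgh']   # 'ab', 'ac', ..., 'ah'
X = HashingVectorizer(analyzer='char', n_features=1).fit_transform(docs)
# Each document's two chars collide in the single bucket; with sign
# flipping they cancel about half the time, so with 7 documents at least
# one row should end up empty once stored zeros are eliminated.
print(X.nnz)   # expected < len(docs) with probability ~1 - 0.5**7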

However, this is actually an issue in FeatureHasher, not HashingVectorizer, so you should really be testing that.
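And a similarly hedged sketch against FeatureHasher directly (parameter choices are assumptions):

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=1, input_type='string')
# Every pair collides in the single bucket; pairs whose hashed signs
# differ sum to an explicit zero that should not stay stored.
X = h.transform([['a', c] for c in 'bcdefgh'])
print(X.nnz)   # rows whose pair cancelled should store nothing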

@rth (Member, Author) commented Sep 28, 2016

@jnothman Thanks for the feedback, and for pointing out that other issue. For a hash collision the chance of getting a 0 is 1/2, but the whole process is deterministic, so once we have found one case it should be reproducible as long as the hashing function does not change. I did check that the previous unit test failed before this PR and passed after it: i.e. "investigation" and "need records" produce a hash collision that consistently sums to 0 (example taken from one of the parent issues). Nevertheless, I updated the unit tests following your suggestion, as that is indeed probably more robust.

However, I'm now not convinced that using eliminate_zeros() is a good solution here (cf. the issue linked above), so I'm putting this PR on hold for the moment.

@rth force-pushed the hash_vect_collision branch from 6a332d3 to 4c1f7df on September 28, 2016 18:00
@jnothman (Member) commented:

When I tried locally it didn't seem like the previous example created a collision with the default n_features. I got 5 nonzero entries for 3 unigrams and 2 bigrams.

@jnothman (Member) commented:

Apart from time, in what way does #2665 argue against eliminate_zeros?

@rth force-pushed the hash_vect_collision branch from 4c1f7df to 5614e67 on September 28, 2016 22:51
@rth (Member, Author) commented Sep 29, 2016

Well, I did get a zero due to a collision before this PR, in the unit test:

>>> HashingVectorizer(ngram_range=(1, 2), non_negative=True).fit_transform(['investigation need records']).data
array([ 0.57735027,  0.57735027,  0.        ,  0.57735027])

but then maybe some part of the processing is system dependent. Anyway, I changed the unit tests to be more robust following your suggestions.

No, I meant that addressing issue #7513 would probably be a better way of solving this than using eliminate_zeros.

@jnothman (Member) commented:

Oh, I must have typed it in incorrectly. Yes, that's a fair test... as long as you then assert that it meets your expectation of having fewer than 5 elements in data.
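A hedged sketch of that assertion (as noted above, whether this particular collision occurs may depend on platform/version, so the count is printed rather than asserted): the document yields 5 terms (3 unigrams + 2 bigrams), and if the colliding pair sums to zero and is eliminated, fewer than 5 values remain stored.

from sklearn.feature_extraction.text import HashingVectorizer

X = HashingVectorizer(ngram_range=(1, 2)).fit_transform(
    ['investigation need records'])
print(len(X.data))   # < 5 only if a collision occurred and was eliminated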


@amueller added the Bug label Sep 29, 2016
@amueller added this to the 0.19 milestone Sep 29, 2016
@rth (Member, Author) commented Jan 23, 2017

Closing this PR as it's superseded by PR #7565.

@rth closed this Jan 23, 2017