
[MRG] FIX HashingVectorizer + TfidfTransformer fails because of a stored zero #7502


Closed
rth wants to merge 3 commits from the hash_vect_collision branch

Conversation

@rth (Member) commented Sep 27, 2016

Continuation of PR #5861; fixes issue #3637.

This PR eliminates explicit zeros that can occur after a hash collision in the sparse matrix output of HashingVectorizer.
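For illustration, a minimal sketch of the mechanism (not this PR's actual diff): a collision whose signed values sum to zero leaves an explicitly stored zero in the CSR matrix, and scipy's eliminate_zeros() drops such entries.

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0, 2.0]]))
X.data[0] = 0.0       # simulate a collision that summed to zero
print(X.nnz)          # 2 -- the zero is still stored explicitly
X.eliminate_zeros()
print(X.nnz)          # 1 -- the stored zero is gone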

Also added a regression test, and checked that this has no measurable impact on performance (benchmarked on the 20 newsgroups dataset).

Edit: CircleCI failed due to a timeout ("command ./build_tools/circle/build_doc.sh took more than 10 minutes since last output"); I can't do much about that, and I think it's unrelated to this PR.

@rth force-pushed the hash_vect_collision branch from 2c64064 to c9452cf on September 27, 2016 21:54
@jnothman (Member) commented:

Related: #2665

@jnothman (Member) commented Sep 27, 2016

I think that for this to be a strong test, you need to ensure there's enough data for a zero to be expected, or else write a test that can infer a zero was produced, such as checking that 5 words produced only 3 nonzero hashes (and therefore a hash collision).

I suggest you use HashingVectorizer(analyzer='char', n_features=1). Then fit_transform(['ab', 'ac', 'ad', 'ae', 'af', ...]) should produce at least one row where a zero needs to be eliminated. The chance of this occurring is 1/2 ('a' being 1 and the other char being -1, or vice versa).
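A hedged sketch of such a test (the document list and the loose check are my own, not the PR's):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['a' + c for c in 'bcdefgh']   # 'ab', 'ac', ..., 'ah'
X = HashingVectorizer(analyzer='char', n_features=1).fit_transform(docs)
# Each document's two chars collide in the single bucket; with sign
# flipping they cancel about half the time, so with 7 documents at least
# one row should end up empty once stored zeros are eliminated.
print(X.nnz)   # expected < len(docs) with probability ~1 - 0.5**7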

However, this is actually an issue in FeatureHasher, not HashingVectorizer, so you should really be testing that.
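And a similarly hedged sketch against FeatureHasher directly (parameter choices are assumptions):

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=1, input_type='string')
# Every pair collides in the single bucket; pairs whose hashed signs
# differ sum to an explicit zero that should not stay stored.
X = h.transform([['a', c] for c in 'bcdefgh'])
print(X.nnz)   # rows whose pair cancelled should store nothing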

@rth (Member, Author) commented Sep 28, 2016

@jnothman Thanks for the feedback, and for pointing out that other issue. For a hash collision the chance of getting a 0 is 1/2, but the whole process is deterministic, so once we have found one case it should be reproducible as long as the hashing function does not change. I did check that the previous unit test failed before this PR and passed after it: i.e. "investigation" and "need records" produce a hash collision that consistently sums to 0 (example taken from one of the parent issues). Nevertheless, I updated the unit tests following your suggestion, as that is indeed probably more robust.

However, I'm now not convinced that using eliminate_zeros() is a good solution here (cf. the issue linked above), so I'm putting this PR on hold for the moment.

@rth force-pushed the hash_vect_collision branch from 6a332d3 to 4c1f7df on September 28, 2016 18:00
@jnothman (Member) commented:

When I tried locally it didn't seem like the previous example created a collision with the default n_features. I got 5 nonzero entries for 3 unigrams and 2 bigrams.

@jnothman (Member) commented:

Apart from time, in what way does #2665 argue against eliminate_zeros?

@rth force-pushed the hash_vect_collision branch from 4c1f7df to 5614e67 on September 28, 2016 22:51
@rth (Member, Author) commented Sep 29, 2016

Well, I did get a zero due to a collision before this PR, in the unit test:

>>> HashingVectorizer(ngram_range=(1, 2), non_negative=True).fit_transform(['investigation need records']).data
array([ 0.57735027,  0.57735027,  0.        ,  0.57735027])

but then maybe some part of the processing is system dependent. Anyway, I changed the unit tests to be more robust following your suggestions.

No, I meant that addressing issue #7513 would probably be a better way of solving this than using eliminate_zeros.

@jnothman (Member) commented:

Oh, I must have typed it in incorrectly. Yes, that's a fair test... as long as you then assert that it meets your expectation of having fewer than 5 elements in data.
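A hedged sketch of that assertion (as noted above, whether this particular collision occurs may depend on platform/version, so the count is printed rather than asserted): the document yields 5 terms (3 unigrams + 2 bigrams), and if the colliding pair sums to zero and is eliminated, fewer than 5 values remain stored.

from sklearn.feature_extraction.text import HashingVectorizer

X = HashingVectorizer(ngram_range=(1, 2)).fit_transform(
    ['investigation need records'])
print(len(X.data))   # < 5 only if a collision occurred and was eliminated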


@amueller added the Bug label Sep 29, 2016
@amueller added this to the 0.19 milestone Sep 29, 2016
@rth (Member, Author) commented Jan 23, 2017

Closing this PR as it's superseded by PR #7565.

@rth closed this Jan 23, 2017