-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] FIX HashingVectorizer + TfidfTransformer fails because of a stored zero #7502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2c64064
to
c9452cf
Compare
Related: #2665 |
I think to be a strong test, you need to ensure that there's enough data to have expected a zero, or else have a test that can infer that a zero was produced, such as checking that 5 words produced 3 nonzero hashes (therefore a hash collision). I suggest you use However, this is actually an issue in |
@jnothman Thanks for the feedback, and for pointing out that other issue. For a hash collision the chance of getting a 0 is 1/2, but the whole process is deterministic, so once we found one case, it should be reproducible as long as the hashing function does not change. I did test that the previous unit test fail before the PR and passed after it: i.e. "investigation" and "need records" produce a hash collision that consistently sums up as 0 (example taken from one of the parent issues). Nevertheless, I updated the unit tests following your suggestion as that is indeed probably more robust. However, now I'm not convinced that using |
6a332d3
to
4c1f7df
Compare
When I tried locally it didn't seem like the previous example created a collision with the default |
Apart from time, in what way does #2655 argue against |
4c1f7df
to
5614e67
Compare
Well I did get a zero due to collision before that PR, in the unit test,
but then maybe some part of the processing is system dependent. Anyway, changed the unit tests to be more robust following your suggestions. No I meant that addressing issue #7513 would probably a better way of solving this than using |
oh. I must have typed it in incorrectly. Yes, that's a fair test... as long On 29 September 2016 at 18:09, Roman Yurchak notifications@github.com
|
Closing this PR as it's superseded by PR #7565. |
Continuation of PR #5861, fixes issue #3637 .
This eliminates zeros that can occur after a hash collision in the sparse array output of the HashingVectorizer.
Also added a regression test, and checked that this had non-measurable impact on performance (benchmarked on the 20 newsgroup dataset).
Edit: circle-ci failed due to timeout, "command ./build_tools/circle/build_doc.sh took more than 10 minutes since last output": I can't do much about that as I think it's unrelated to this PR.