[MRG + 1] MNT Platform independent hash collision tests in FeatureHasher #9710
@@ -137,6 +133,25 @@ def test_hasher_alternate_sign():

@ignore_warnings(category=DeprecationWarning)
def test_hash_collisions():
    X = [["a", "b", "c", "d", "e", "f", "g", "h"]]
You could be really sure and do X = [list("Thequickbrownfoxjumped")]
LGTM. (I think you meant .5 ** 8 = 0.004)
@amueller Thanks for the review. Increased the vocabulary size as you suggested.
Yes thanks, I keep making typos in every other comment, apparently.
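For reference, the arithmetic behind the vocabulary change can be sketched as follows. This is a minimal illustration, assuming each distinct token gets an independent, fair 50/50 sign; `prob_all_same_sign` is a hypothetical helper, not part of scikit-learn. Repeated tokens hash identically, so only distinct tokens count.

```python
def prob_all_same_sign(n_distinct: int) -> float:
    """Probability that n_distinct tokens all receive a positive sign,
    assuming independent fair signs (an idealisation of the hash)."""
    if n_distinct < 1:
        raise ValueError("need at least one token")
    return 0.5 ** n_distinct


small_vocab = ["a", "b", "c", "d", "e", "f", "g", "h"]
large_vocab = list("Thequickbrownfoxjumped")

for name, vocab in [("8 letters", small_vocab), ("longer string", large_vocab)]:
    # Identical tokens always share a sign, so count distinct tokens only.
    n_distinct = len(set(vocab))
    p = prob_all_same_sign(n_distinct)
    print(f"{name}: {n_distinct} distinct tokens, P(all positive) = {p:.2e}")
```

With the longer string the chance of a spurious failure drops by several orders of magnitude, which is presumably why the larger vocabulary was suggested.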
Have you tried finding a Docker image to reproduce this somehow?
On 8 Sep 2017 10:47 pm, "Roman Yurchak" wrote:
This PR aims to address the current failures of test_hasher_alternate_sign
on non-amd64 platforms (#9393 (comment)), which are likely due to the fact
that the current tests rely on MurmurHash3 yielding a particular hash value
(one that produces a collision), while that value is actually platform
dependent (#9393 (comment)). Since the original issue couldn't be
reproduced, there is no guarantee that this fixes it (hopefully it does),
but in any case it makes test_hasher_alternate_sign more robust...
*Note:* these tests here rely on the fact that when hashing 8 strings
with alternate_sign=True, some of them will get a negative sign and some
a positive one (it's a 50%/50% probability). However, there is still a
(0.5)**2 = .004 probability that on a given platform all the signs will be
positive (in which case these tests will fail) but hopefully, that's
unlikely enough...
cc @jnothman <https://github.com/jnothman>
Commit Summary
- More robust hash collision tests in the FeatureHasher
File Changes
- *M* sklearn/feature_extraction/tests/test_feature_hasher.py <https://github.com/scikit-learn/scikit-learn/pull/9710/files#diff-0> (37)
Very nice, thanks @rth!
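The idea behind the revised test can be sketched without scikit-learn. This is a toy illustration only: `toy_signed_hash` and `hash_tokens` are hypothetical helpers, and MD5 stands in for MurmurHash3 purely because it is in the standard library. With a single output bucket every token collides, so the check only needs some signs to differ rather than any particular hash value.

```python
import hashlib
from typing import List, Tuple


def toy_signed_hash(token: str, n_features: int) -> Tuple[int, int]:
    """Toy stand-in for a signed feature hash (NOT MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "little") % n_features
    sign = 1 if digest[4] & 1 else -1  # one digest bit decides the sign
    return index, sign


def hash_tokens(tokens: List[str], n_features: int,
                alternate_sign: bool = True) -> List[int]:
    """Accumulate +1/-1 (or always +1) per token into n_features buckets."""
    row = [0] * n_features
    for token in tokens:
        index, sign = toy_signed_hash(token, n_features)
        row[index] += sign if alternate_sign else 1
    return row


tokens = list("Thequickbrownfoxjumped")
row_unsigned = hash_tokens(tokens, n_features=1, alternate_sign=False)
row_signed = hash_tokens(tokens, n_features=1, alternate_sign=True)

# Without signs, the single bucket simply counts the tokens.
print(row_unsigned[0])
# With signs, the magnitude is below len(tokens) unless every sign agrees.
print(abs(row_signed[0]))
```

A test written this way asserts only that `abs(row_signed[0]) < len(tokens)`, which holds on any platform except in the rare case where all signs happen to agree.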