[MRG + 1] MNT Platform independent hash collision tests in FeatureHasher #9710

rth · 2017-09-08T12:47:29Z

This PR aims to address the current failures of test_hasher_alternate_sign on non-amd64 platforms (#9393 (comment)), which are likely due to the fact that the current test relies on MurmurHash3 yielding a particular hash value (one that produces a collision), while that result is actually platform dependent (#9393 (comment)). Since the original issue couldn't be reproduced, there is no guarantee that this will fix it (hopefully it will), but in any case it makes test_hasher_alternate_sign more robust ...

Note: these tests here rely on the fact that when hashing 8 strings with alternate_sign=True, some of them will get a negative sign and some a positive one (it's a 50%/50% probability). However, there is still a (0.5)**2 = .004 probability that on a given platform all the signs will be positive (in which case these tests will fail) but hopefully, that's unlikely enough...
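To make the idea concrete, here is a minimal, self-contained sketch of signed feature hashing with a single output column, so that every token collides. It is an illustration under stated assumptions, not scikit-learn's implementation: `hashlib.md5` stands in for MurmurHash3, and the helpers `signed_hash` and `hash_row` are hypothetical names introduced only for this example.

```python
import hashlib


def signed_hash(token, n_features):
    """Map a token to (column, sign).

    Assumption: md5 is used here as a stand-in for MurmurHash3; the first
    four digest bytes are read as a signed 32-bit integer.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "little", signed=True)
    sign = 1 if h >= 0 else -1
    return abs(h) % n_features, sign


def hash_row(tokens, n_features, alternate_sign=True):
    """Accumulate tokens into a dense row of length n_features."""
    row = [0] * n_features
    for tok in tokens:
        col, sign = signed_hash(tok, n_features)
        # With alternate_sign, colliding tokens may cancel each other out.
        row[col] += sign if alternate_sign else 1
    return row


tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]

# n_features=1 forces every token into the same column (a guaranteed collision).
signed = hash_row(tokens, n_features=1, alternate_sign=True)
unsigned = hash_row(tokens, n_features=1, alternate_sign=False)

print("with alternating signs:", signed)
print("without alternating signs:", unsigned)
```

Without alternating signs the single column always equals the number of tokens. With alternating signs its absolute value is smaller whenever at least one token receives each sign, which is a property that does not depend on the particular hash values a platform produces.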

cc @jnothman

amueller · 2017-09-08T18:50:46Z

sklearn/feature_extraction/tests/test_feature_hasher.py

@@ -137,6 +133,25 @@ def test_hasher_alternate_sign():


 @ignore_warnings(category=DeprecationWarning)
+def test_hash_collisions():
+    X = [["a", "b", "c", "d", "e", "f", "g", "h"]]


You could be really sure and do X = [list("Thequickbrownfoxjumped")]

amueller · 2017-09-08T18:52:40Z

LGTM. (I think you meant .5 ** 8 = 0.004)

rth · 2017-09-08T19:18:03Z

@amueller Thanks for the review. Increased the vocabulary size as you suggested.

(I think you meant .5 ** 8 = 0.004)

Yes thanks, I keep making typos in every other comment, apparently.
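For reference, the corrected figure and the effect of the larger vocabulary can be checked with a few lines of Python. This is only a sketch under an idealized assumption that is not stated in the PR: each distinct token's sign is an independent fair coin flip, and repeated tokens share a sign because the sign is derived from the token's hash. The helper name `prob_all_positive` is hypothetical.

```python
def prob_all_positive(tokens):
    """Probability that every distinct token gets a positive sign,
    assuming independent 50/50 signs per distinct token."""
    return 0.5 ** len(set(tokens))


small = ["a", "b", "c", "d", "e", "f", "g", "h"]
large = list("Thequickbrownfoxjumped")

print(len(set(small)), prob_all_positive(small))  # 8 distinct tokens, 0.00390625
print(len(set(large)), prob_all_positive(large))  # 19 distinct tokens, about 1.9e-06
```

Note that "Thequickbrownfoxjumped" has 22 characters but only 19 distinct ones, since "e", "o" and "u" each appear twice.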

jnothman · 2017-09-09T10:30:20Z

Have you tried finding a Docker image to reproduce this somehow?

jnothman

Very nice, thanks @rth!

rth changed the title ~~Platform independent hash collision tests in FeatureHasher~~ [MRG] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017

More robust hash collision tests in the FeatureHasher

4f55747

rth force-pushed the robust-hash-collision-tests branch from d1ebfad to 4f55747 Compare September 8, 2017 12:49

amueller reviewed Sep 8, 2017

amueller changed the title ~~[MRG] MNT Platform independent hash collision tests in FeatureHasher~~ [MRG + 1] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017

jnothman mentioned this pull request Sep 8, 2017

Debian test failures (was test_preserve_trustworthiness_approximately fails on 32bit: AssertionError: 0.89166666666666661 not greater than 0.9) #9393

Closed

Use larger vocabulary for the hash collision tests

c637174

jnothman reviewed Sep 12, 2017

jnothman merged commit e88baea into scikit-learn:master Sep 12, 2017

jnothman added this to the 0.19.1 milestone Sep 12, 2017

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 12, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

752e458

…t-learn#9710)

amueller pushed a commit to amueller/scikit-learn that referenced this pull request Sep 12, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

7a82e94

…t-learn#9710)

massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

1a94271

…t-learn#9710)

rth deleted the robust-hash-collision-tests branch October 6, 2017 15:13

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

01dc44a

…t-learn#9710)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

17d6a35

…t-learn#9710)
