Python wrapper and tests for sklearn/svm/newrand/ #17157

Merged
merged 23 commits into scikit-learn:master on Jun 28, 2020

Conversation

Justin-Huber
Contributor

@Justin-Huber Justin-Huber commented May 7, 2020

Reference Issues/PRs

Resolves #16862

What does this implement/fix? Explain your changes.

Created a Python wrapper and tests for sklearn/svm/newrand/.
Tests that newrand creates an unbiased (uniform) distribution of random numbers in the given range, by comparing a deterministic sampling against scipy.stats.uniform with scipy.stats.kstest (the Kolmogorov-Smirnov test) at a significance level of 0.05.
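
In sketch form, the check described above looks like this (check_uniform and the draw callable are illustrative names, not part of the PR; the real test calls the bounded_rand_int_wrap wrapper):

import numpy as np
from scipy import stats

def check_uniform(draw, orig_range, n_pts=1000, alpha=0.05):
    # Draw n_pts values in [0, orig_range) and compare the empirical sample
    # against the uniform CDF with a Kolmogorov-Smirnov test.
    sample = [draw(orig_range) for _ in range(n_pts)]
    uniform_dist = stats.uniform(loc=0, scale=orig_range)
    res = stats.kstest(sample, uniform_dist.cdf)
    # A p-value <= alpha rejects the null hypothesis of uniformity.
    return res.pvalue > alpha

# Example usage with numpy's own generator as a stand-in source:
rng = np.random.RandomState(0)
print(check_uniform(lambda r: rng.randint(r), 1000))  # True for a healthy RNG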

@smarie
Contributor

smarie commented May 9, 2020

I guess that "Fixes #16862" is missing from the description :)

@smarie
Contributor

smarie commented May 9, 2020

To answer your question about why this is failing: the build log states:

sklearn/svm/_newrand.c:623:10: fatal error: newrand-cache.h: No such file or directory
 #include "newrand-cache.h"
          ^~~~~~~~~~~~~~~~~
compilation terminated.

Does the build work for you locally? If so, that means you probably forgot to add this newrand-cache.h file.

Note that you also seem to have forgotten the .cpp one.

@Justin-Huber
Contributor Author

Haha, I had the tag in a comment so it didn't show. And yeah, that was careless of me: I was playing around with Cython and never changed it back before pushing.

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
@Justin-Huber
Contributor Author

Oh cool, didn't realize that's what that was.

*/
#ifndef _NEWRAND_H
#define _NEWRAND_H

#ifdef __cplusplus
#include <random>
Contributor

Any reason for adding this include? It did not seem to be needed before.

Contributor Author

That's so Cython can create a .cpp file from newrand.h.

Contributor

OK, I see. Maybe we can add this as a comment before the include line?

Comment on lines 101 to 103
@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
def test_bounded_rand_int(orig_range):
n_pts = 1000
Contributor

Maybe it would be good to also parametrize the number of points, as _MAX_INT is, well, a bit large. Something like this?

Suggested change
@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
def test_bounded_rand_int(orig_range):
n_pts = 1000
@pytest.mark.parametrize("orig_range,n_pts", [(_MAX_INT, 10000), (100, 10)])
def test_bounded_rand_int(orig_range, n_pts):

sample = [bounded_rand_int_wrap(orig_range) for _ in range(n_pts)]
res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
Contributor

@smarie smarie May 15, 2020

We could be a bit more strict and use a quantile. In my tests, using the lower 10% quantile seems to pass:

min_10pct_pval = np.quantile(ks_pvals, q=0.1)

Contributor Author

Good idea; any reason for the 10% quantile in particular?

Contributor

@smarie smarie May 18, 2020

At first I was expecting the minimum to be OK (as you first tried). However, this did not work, so I tried to find the lowest possible quantile that works :)

I finally found the right scientific explanation in https://stats.stackexchange.com/questions/298350/uniform-distribution-testing

That is how all hypothesis tests work. Under the null hypothesis, p is uniformly distributed.

We should probably include this as a comment, as it is non-trivial for non-statistics experts.

So maybe "the" best way to do the final test (in addition to the current assert on quantile, which can not hurt) is to do a final additional K-S uniformity test on the p-values themselves. They should be uniform on [0, 1] if null hypothesis is not rejected, so the final p-value of that last test is the one to compare with 0.05

I learned something today :)
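
In sketch form, that extra "meta" test is (illustrative helper name; the actual test code appears further down in this conversation):

from scipy import stats

def pvalues_look_uniform(ks_pvals, alpha=0.05):
    # Under the null hypothesis each K-S p-value is itself uniform on [0, 1],
    # so run one more K-S test on the collected p-values.
    res = stats.kstest(ks_pvals, stats.uniform(loc=0, scale=1).cdf)
    return res.pvalue > alpha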

res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
assert(avg_pval > 0.05),\
Contributor

@smarie smarie May 15, 2020

I would suggest to add a comment for readability/maintenance:

"pval < 0.05 means that the test rejects the null hypothesis that the sample came from the uniform distribution"

Contributor Author

In addition to the test failure message?

Contributor

If you wish to leave all the details in the messages you can, too. In that case the message should be a bit more detailed. I usually put a very detailed comment in the code, and a brief assert message. But that's a matter of style; I do not know what the rule is here in sklearn.

@Justin-Huber
Contributor Author

Justin-Huber commented May 15, 2020

Also, an unrelated question: is there a make target for running the pull-request checks (pep8, tests, coverage, etc.)?

@thomasjpfan
Member

For testing you would need to build from source. You can find more information in the contributing guide.

@Justin-Huber
Contributor Author

It seems the numpy version in the Azure pipeline is 1.13.3, which doesn't have quantile. Does the pipeline use the same versions as the latest version of sklearn?

@thomasjpfan
Member

We test with multiple versions of numpy. For now our minimum supported version of numpy is 1.13. An alternative to quantile would be np.percentile.
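
For reference, the two calls are equivalent up to the scale of q: np.percentile takes q in [0, 100], while np.quantile (added in numpy 1.15) takes q in [0, 1].

import numpy as np

vals = np.linspace(0, 1, 11)
# Same statistic, different q scale:
assert np.percentile(vals, q=10) == np.quantile(vals, q=0.1)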

@Justin-Huber
Contributor Author

Ah ok, thanks for the clarification!

@Justin-Huber
Contributor Author

Hey @smarie, I wanted to check in on this PR. What's left that needs to be done?

@smarie
Contributor

smarie commented Jun 16, 2020

Hi @Justin-Huber, thanks for asking. It seems that my previous review comments were somehow hidden from you by GitHub, so I re-created both (one on the include header comment and one on the test per se). The comment concerning the test seems important: I discovered that p-values should be uniformly distributed under the null hypothesis, so we can run a statistical test on the uniformity of the p-values themselves. This does not add much, but it should provide a more understandable answer to the question "why a quantile?"

If you're OK with these two proposals and merge them in, the rest is fine for me and I'll hand over to the "official" scikit-learn reviewers (@thomasjpfan seems to be the one) for style and other nitpicks, since I have not completely mastered all the official style rules in sklearn ;)

Merged suggestion

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Comment on lines 114 to 117
# Under the null hypothesis,
# p-values should be uniformly distributed and not concentrated
# on low values
# (this may seem counter-intuitive but is backed by multiple refs)
Contributor

Minor formatting issue

@smarie
Contributor

smarie commented Jun 23, 2020

@Justin-Huber I think that your idea of adding a test for determinism is indeed great. Could you please complete it? Other than that and a few nitpicks, LGTM :)

@Justin-Huber
Contributor Author

So I added the deterministic check back in, which should only pass for the 32-bit mt19937, but everything passes, so I think the mt19937_64 is never created in any of the tests. Are there any tests run on a 64-bit system where the INT_MAX in random.h would be 64-bit?
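
In sketch form, a determinism check of that sort seeds, draws, re-seeds, and compares (the wrapper functions are the ones discussed in this PR; the seed, bound, and count here are illustrative):

set_seed_wrap(42)
first = [bounded_rand_int_wrap(100) for _ in range(5)]
set_seed_wrap(42)
second = [bounded_rand_int_wrap(100) for _ in range(5)]
# For a fixed seed, mt19937 must replay exactly the same stream.
assert first == second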

[(None, 81),
(0, 54),
(_MAX_UNSIGNED_INT, 9)])
def test_set_seed(seed, val):
Contributor

Since this is located in a file shared with other tests, I recommend giving it a specific name and an explanation:

Suggested change
def test_set_seed(seed, val):
def test_newrand_seed(seed, val):
"""Test that set_seed_wrap works correctly with bounded_rand_int_wrap"""


@pytest.mark.parametrize('orig_range, n_pts',
[(_MAX_INT, 10000), (100, 10)])
def test_bounded_rand_int(orig_range, n_pts):
Contributor

Since this is located in a file shared with other tests, I recommend giving it a specific name and an explanation:

Suggested change
def test_bounded_rand_int(orig_range, n_pts):
def test_newrand_bounded_rand_int(orig_range, n_pts):
"""Test that `bounded_rand_int` follows a uniform distribution"""

@smarie
Contributor

smarie commented Jun 24, 2020

(EDITED 24/06/2020 13:41)

Thanks @Justin-Huber for the update. I suggested a few minor changes concerning test names and docstring, see above.

Concerning your last question: can you explain why you think your test_set_seed test should not pass for a mt19937_64? Is this because of the _MAX_UNSIGNED_INT seed?

It seems from this reference that LONG_MAX would be more appropriate than INT_MAX to detect 64-bit systems in our _newrand.h file. However, reading back the results from our main source for newrand, and in particular the sentence just before the "conclusions" chapter:

[...] perhaps more importantly, these results also show that if you don't actually need 64-bit output, using a 64-bit PRNG is usually slower than using a 32-bit PRNG

So since the current behaviour seems to always use a 32-bit PRNG (which is recommended), maybe we should keep it like this and remove the currently unused mt19937_64 switch entirely. We would also need to add a comment in bounded_rand_int like this: // Note: orig_range is actually a 32 bits unsigned int so in [0; 4,294,967,295]

To conclude, according to what you have found, I suggest the following:

  • Remove the mt19937_64 switch as explained above
  • Change the bounded_rand_int function in _newrand.h to at least include a comment as suggested above, or change its signature to inline uint32_t bounded_rand_int(uint32_t orig_range) {
  • Change the _newrand.pyx so that unsigned or unsigned int (more readable) is used everywhere in this wrapper instead of int
  • Add the case (_MAX_UNSIGNED_INT, 10000) to test_bounded_rand_int. It should now work (before the above modification it would lead to an overflow)
  • Add two negative tests showing which values can overflow the wrapper:
@pytest.mark.parametrize('seed',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_set_seed_overflow(seed):
    """Test that `set_seed_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        set_seed_wrap(seed)

@pytest.mark.parametrize('orig_range',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_bounded_rand_int_limits(orig_range):
    """Test that `bounded_rand_int_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        bounded_rand_int_wrap(orig_range)

What do you think?

I would love to get feedback here from people more at ease with C/C++ than I am, and who were involved in the original PR. Maybe @jnothman and @thomasjpfan?

@smarie
Contributor

smarie commented Jun 24, 2020

Note for reference: in order to double-check that the current implementation does not accept more than a 32-bit unsigned int for orig_range, I temporarily unlocked the types to unsigned long long in the Cython wrapper and ran the uniformity test with _MAX_UNSIGNED_INT * 2. It correctly failed: all numbers returned were below _MAX_UNSIGNED_INT. Such a double-check may seem naive, but at least I'm now confident in the above proposals :)
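
In sketch form, that double-check was (illustrative only: it assumes the temporarily widened wrapper types described above; with the final wrapper this call raises OverflowError instead):

# Assumes bounded_rand_int_wrap temporarily accepts 64-bit values.
sample = [bounded_rand_int_wrap(_MAX_UNSIGNED_INT * 2) for _ in range(10000)]
# The C++ side truncates orig_range to 32 bits, so no draw can exceed it.
assert max(sample) <= _MAX_UNSIGNED_INT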

@Justin-Huber
Contributor Author

Added the changes, let me know what you think. :)

Comment on lines 35 to 36
// from http://www.pcg-random.org/posts/bounded-rands.html
// TODO how could we make this casting safer, raising an error if lost information?
uint32_t range = uint32_t(orig_range);
uint32_t x = mt_rand();
Contributor

@smarie smarie Jun 25, 2020

Don't you think that it would be safer to keep this? Indeed, bounded_rand_int is marked as inline, so I guess (but I'm no C++ expert) that the input can actually be anything; there is absolutely no protection. At least having this cast would ensure that there are no side effects when this code is used badly directly from C++ (from Python it's already safe thanks to the Cython wrapper). What do you think?

Contributor Author

Now that the input to bounded_rand_int is uint32_t, I think this extra step is redundant (casting uint32_t to uint32_t).

Contributor

OK!
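
For reference, the technique behind bounded_rand_int is the multiply-shift rejection method from the pcg-random.org post cited in the hunk above. A rough Python emulation (32-bit arithmetic emulated with Python ints and masks; the names here are illustrative, not the actual C++ code):

import random

def bounded_rand(rand32, rng_range):
    # Map a 32-bit draw x to floor(x * rng_range / 2**32); reject the few
    # draws whose low product bits would make some outputs over-represented.
    threshold = (1 << 32) % rng_range       # count of "biased" low-bit values
    while True:
        x = rand32()                        # uniform draw in [0, 2**32)
        m = x * rng_range                   # exact 64-bit product
        if (m & 0xFFFFFFFF) >= threshold:   # accept
            return m >> 32                  # unbiased result in [0, rng_range)

rng = random.Random(42)
print([bounded_rand(lambda: rng.getrandbits(32), 100) for _ in range(5)])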

@smarie
Contributor

smarie commented Jun 25, 2020

Great @Justin-Huber! Apart from two remaining comments (see above), this looks very good to me. Almost there! Thanks for your patience.

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Member

@jnothman jnothman left a comment

This looks good, only stylistic nitpicks

# (1) check uniformity of p-values
uniform_p_vals_dist = stats.uniform(loc=0, scale=1)
res_pvals = stats.kstest(ks_pvals, uniform_p_vals_dist.cdf)
assert res_pvals.pvalue > 0.05,\
Member

We prefer to avoid \ for line continuation. Instead, open a parenthesis, like:

    assert res_pvals.pvalue > 0.05, (
        "Null hypothesis rejected: generated random numbers are not uniform."
        " Details: the (meta) p-value of the test of uniform distribution"
        f" of p-values is {res_pvals.pvalue} which is not > 0.05")

min_10pct_pval = np.percentile(ks_pvals, q=10)
# lower 10th quantile pvalue <= 0.05 means that the test rejects the
# null hypothesis that the sample came from the uniform distribution
assert(min_10pct_pval > 0.05),\
Member

Please remove the parentheses around this expression, and avoid \ for line continuation

Contributor

@smarie smarie left a comment

All good for me, thanks for your patience!

Justin-Huber and others added 4 commits June 26, 2020 12:44
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
@Justin-Huber
Contributor Author

Awesome, I made those last style fixes :)

Member

@adrinjalali adrinjalali left a comment

@Justin-Huber thanks for the persistence and your patience, and @smarie thanks a whole ton for your continued detailed review. Really appreciate it :)

@adrinjalali adrinjalali merged commit 430c208 into scikit-learn:master Jun 28, 2020
@smarie
Contributor

smarie commented Jun 28, 2020

You're welcome @adrinjalali :)
Thanks @Justin-Huber !

(Note, for what it's worth: two style comments from @jnothman seem to have been forgotten; maybe this can be done by a maintainer directly on master?)

@Justin-Huber
Contributor Author

Awesome! It's been a pleasure; thanks for the help, @smarie :)

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
…7157)

* Implemented issue scikit-learn#16862

* Implemented issue scikit-learn#16862

* Ready for pull request

* Cleaned up redundancies

* Removed test_set_seed_wrap for now

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Review feedback changes

* Style fixes

* np.quantile -> np.percentile

* Update sklearn/svm/src/newrand/newrand.h

Merged suggestion

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Update sklearn/svm/tests/test_bounds.py

Merge suggestions

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* linting fix

* linting fix

* linting fix

* Added deterministic test for mt19937 in svm/newrand.h

* Linting fix

* Review changes

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

* Update test_bounds.py

* Linting fixes

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Successfully merging this pull request may close these issues.

Create a python wrapper and unit test for the new bounded_rand_int C function used in LibSVM and LibLinear