Python wrapper and tests for sklearn/svm/newrand/ #17157

Merged
merged 23 commits into scikit-learn:master on Jun 28, 2020

Conversation

Justin-Huber
Contributor

@Justin-Huber Justin-Huber commented May 7, 2020

Reference Issues/PRs

Resolves #16862

What does this implement/fix? Explain your changes.

Created a Python wrapper and tests for sklearn/svm/newrand/.
Tests that newrand creates an unbiased (uniform) distribution of random numbers in the given range, by comparing a deterministic sampling against scipy.stats.uniform with scipy.stats.kstest (the Kolmogorov-Smirnov test) at a significance level of 0.05.
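
In sketch form, the check described above looks like this (check_uniform and the draw callable are illustrative names, not part of the PR; the real test calls the bounded_rand_int_wrap wrapper):

import numpy as np
from scipy import stats

def check_uniform(draw, orig_range, n_pts=1000, alpha=0.05):
    # Draw n_pts values in [0, orig_range) and compare the empirical sample
    # against the uniform CDF with a Kolmogorov-Smirnov test.
    sample = [draw(orig_range) for _ in range(n_pts)]
    uniform_dist = stats.uniform(loc=0, scale=orig_range)
    res = stats.kstest(sample, uniform_dist.cdf)
    # A p-value <= alpha rejects the null hypothesis of uniformity.
    return res.pvalue > alpha

# Example usage with numpy's own generator as a stand-in source:
rng = np.random.RandomState(0)
print(check_uniform(lambda r: rng.randint(r), 1000))  # True for a healthy RNG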

@smarie
Contributor

smarie commented May 9, 2020

I guess that "Fixes #16862" is missing from the description :)

@smarie
Contributor

smarie commented May 9, 2020

To answer your question about why this is failing: the build log states:

sklearn/svm/_newrand.c:623:10: fatal error: newrand-cache.h: No such file or directory
 #include "newrand-cache.h"
          ^~~~~~~~~~~~~~~~~
compilation terminated.

Does the build work for you locally? If so, that means you probably forgot to add this newrand-cache.h file.

Note that you also seem to have forgotten the .cpp one.

@Justin-Huber
Contributor Author

Haha, I had the tag in a comment so it didn't show. And yeah, that was careless of me: I was playing around with Cython and never changed it back before pushing.

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
@Justin-Huber
Contributor Author

Oh cool, didn't realize that's what that was.

*/
#ifndef _NEWRAND_H
#define _NEWRAND_H

#ifdef __cplusplus
#include <random>
Contributor

Any reason for adding this include? It did not seem to be needed before.

Contributor Author

That's so Cython can create a .cpp file from newrand.h.

Contributor

OK, I see. Maybe we can add this as a comment before the include line?

Comment on lines 101 to 103
@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
def test_bounded_rand_int(orig_range):
n_pts = 1000
Contributor

Maybe it would be good to also parametrize the number of points, as _MAX_INT is, well, a bit large. Something like this?

Suggested change
@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
def test_bounded_rand_int(orig_range):
n_pts = 1000
@pytest.mark.parametrize("orig_range,n_pts", [(_MAX_INT, 10000), (100, 10)])
def test_bounded_rand_int(orig_range, n_pts):

sample = [bounded_rand_int_wrap(orig_range) for _ in range(n_pts)]
res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
Contributor

@smarie smarie May 15, 2020

We could be a bit more strict and use a quantile. In my tests, using the lower 10% quantile seems to pass:

min_10pct_pval = np.quantile(ks_pvals, q=0.1)

Contributor Author

Good idea; any reason for the 10% quantile in particular?

Contributor

@smarie smarie May 18, 2020

At first I was expecting the minimum to be OK (as you first tried). However, this did not work, so I tried to find the lowest possible quantile that works :)

I finally found the right scientific explanation in https://stats.stackexchange.com/questions/298350/uniform-distribution-testing

That is how all hypothesis tests work. Under the null hypothesis, p is uniformly distributed.

We should probably include this as a comment, as it is non-trivial for non-statistics experts.

So maybe "the" best way to do the final test (in addition to the current assert on quantile, which can not hurt) is to do a final additional K-S uniformity test on the p-values themselves. They should be uniform on [0, 1] if null hypothesis is not rejected, so the final p-value of that last test is the one to compare with 0.05

I learned something today :)
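
In sketch form, that extra "meta" test is (illustrative helper name; the actual test code appears further down in this conversation):

from scipy import stats

def pvalues_look_uniform(ks_pvals, alpha=0.05):
    # Under the null hypothesis each K-S p-value is itself uniform on [0, 1],
    # so run one more K-S test on the collected p-values.
    res = stats.kstest(ks_pvals, stats.uniform(loc=0, scale=1).cdf)
    return res.pvalue > alpha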

res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
assert(avg_pval > 0.05),\
Contributor

@smarie smarie May 15, 2020

I would suggest to add a comment for readability/maintenance:

"pval < 0.05 means that the test rejects the null hypothesis that the sample came from the uniform distribution"

Contributor Author

In addition to the test failure message?

Contributor

If you wish to leave all the details in the messages you can, too. In that case the message should be a bit more detailed. I usually put a very detailed comment in the code, and a brief assert message. But that's a matter of style; I do not know what the rule is here in sklearn.

@Justin-Huber
Contributor Author

Justin-Huber commented May 15, 2020

Also, an unrelated question: is there a make target for running the pull-request checks (pep8, tests, coverage, etc.)?

@thomasjpfan
Member

For testing you would need to build from source. You can find more information in the contributing guide.

@Justin-Huber
Contributor Author

It seems the numpy version in the Azure pipeline is 1.13.3, which doesn't have quantile. Does the pipeline use the same versions as the latest version of sklearn?

@thomasjpfan
Member

We test with multiple versions of numpy. For now our minimum supported version of numpy is 1.13. An alternative to quantile would be np.percentile.
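
For reference, the two calls are equivalent up to the scale of q: np.percentile takes q in [0, 100], while np.quantile (added in numpy 1.15) takes q in [0, 1].

import numpy as np

vals = np.linspace(0, 1, 11)
# Same statistic, different q scale:
assert np.percentile(vals, q=10) == np.quantile(vals, q=0.1)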

@Justin-Huber
Contributor Author

Ah ok, thanks for the clarification!

@Justin-Huber
Contributor Author

Hey @smarie, I wanted to check in on this PR. What's left that needs to be done?

@smarie
Contributor

smarie commented Jun 16, 2020

Hi @Justin-Huber, thanks for asking. It seems that my previous review comments were somehow hidden from you by GitHub, so I re-created both (one on the include header comment and one on the test per se). The comment concerning the test seems important: I discovered that p-values should be uniformly distributed under the null hypothesis, so we can run a statistical test on the uniformity of the p-values themselves. This does not add much, but it should provide a more understandable answer to the question "why a quantile?"

If you're OK with these two proposals and merge them in, the rest is fine for me and I'll hand over to the "official" scikit-learn reviewers (@thomasjpfan seems to be the one) for style and other nitpicks, since I have not completely mastered all the official style rules in sklearn ;)

Merged suggestion

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Comment on lines 114 to 117
# Under the null hypothesis,
# p-values should be uniformly distributed and not concentrated
# on low values
# (this may seem counter-intuitive but is backed by multiple refs)
Contributor

Minor formatting issue

@smarie
Contributor

smarie commented Jun 23, 2020

@Justin-Huber I think that your idea of adding a test for determinism is indeed great. Could you please complete it? Other than that and a few nitpicks, LGTM :)

@Justin-Huber
Contributor Author

So I added the deterministic check back in, which should only pass for the 32-bit mt19937, but everything passes, so I think the mt19937_64 is never created in any of the tests. Are there any tests run on a 64-bit system where the INT_MAX in random.h would be 64-bit?
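
In sketch form, a determinism check of that sort seeds, draws, re-seeds, and compares (the wrapper functions are the ones discussed in this PR; the seed, bound, and count here are illustrative):

set_seed_wrap(42)
first = [bounded_rand_int_wrap(100) for _ in range(5)]
set_seed_wrap(42)
second = [bounded_rand_int_wrap(100) for _ in range(5)]
# For a fixed seed, mt19937 must replay exactly the same stream.
assert first == second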

[(None, 81),
(0, 54),
(_MAX_UNSIGNED_INT, 9)])
def test_set_seed(seed, val):
Contributor

Since this is located in a file shared with other tests, I recommend giving it a specific name and an explanation:

Suggested change
def test_set_seed(seed, val):
def test_newrand_seed(seed, val):
"""Test that set_seed_wrap works correctly with bounded_rand_int_wrap"""


@pytest.mark.parametrize('orig_range, n_pts',
[(_MAX_INT, 10000), (100, 10)])
def test_bounded_rand_int(orig_range, n_pts):
Contributor

Since this is located in a file shared with other tests, I recommend giving it a specific name and an explanation:

Suggested change
def test_bounded_rand_int(orig_range, n_pts):
def test_newrand_bounded_rand_int(orig_range, n_pts):
"""Test that `bounded_rand_int` follows a uniform distribution"""

@smarie
Contributor

smarie commented Jun 24, 2020

(EDITED 24/06/2020 13:41)

Thanks @Justin-Huber for the update. I suggested a few minor changes concerning test names and docstring, see above.

Concerning your last question: can you explain why you think your test_set_seed test should not pass for a mt19937_64? Is this because of the _MAX_UNSIGNED_INT seed?

It seems from this reference that LONG_MAX would be more appropriate than INT_MAX to detect 64-bit systems in our _newrand.h file. However, reading back the results from our main source for newrand, and in particular the sentence just before the "conclusions" chapter:

[...] perhaps more importantly, these results also show that if you don't actually need 64-bit output, using a 64-bit PRNG is usually slower than using a 32-bit PRNG

So since the current behaviour seems to always use a 32-bit PRNG (which is recommended), maybe we should keep it like this and remove the currently unused mt19937_64 switch entirely. We would also need to add a comment in bounded_rand_int like this: // Note: orig_range is actually a 32 bits unsigned int so in [0; 4,294,967,295]

To conclude, according to what you have found, I suggest the following:

  • Remove the mt19937_64 switch as explained above
  • Change the bounded_rand_int function in _newrand.h to at least include a comment as suggested above, or change its signature to inline uint32_t bounded_rand_int(uint32_t orig_range) {
  • Change the _newrand.pyx so that unsigned or unsigned int (more readable) is used everywhere in this wrapper instead of int
  • Add the case (_MAX_UNSIGNED_INT, 10000) to test_bounded_rand_int. It should now work (before the above modification it would lead to an overflow)
  • Add two negative tests showing which values can overflow the wrapper:
@pytest.mark.parametrize('seed',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_set_seed_overflow(seed):
    """Test that `set_seed_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        set_seed_wrap(seed)

@pytest.mark.parametrize('orig_range',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_bounded_rand_int_limits(orig_range):
    """Test that `bounded_rand_int_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        bounded_rand_int_wrap(orig_range)

What do you think?

I would love to get feedback here from people more at ease with C/C++ than I am, and who were involved in the original PR. Maybe @jnothman and @thomasjpfan?

@smarie
Contributor

smarie commented Jun 24, 2020

Note for reference: in order to double-check that the current implementation does not accept more than a 32-bit unsigned int for orig_range, I temporarily unlocked the types to unsigned long long in the Cython wrapper and ran the uniformity test with _MAX_UNSIGNED_INT * 2. It correctly failed: all numbers returned were below _MAX_UNSIGNED_INT. Such a double-check may seem naive, but at least I'm now confident in the above proposals :)
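
In sketch form, that double-check was (illustrative only: it assumes the temporarily widened wrapper types described above; with the final wrapper this call raises OverflowError instead):

# Assumes bounded_rand_int_wrap temporarily accepts 64-bit values.
sample = [bounded_rand_int_wrap(_MAX_UNSIGNED_INT * 2) for _ in range(10000)]
# The C++ side truncates orig_range to 32 bits, so no draw can exceed it.
assert max(sample) <= _MAX_UNSIGNED_INT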

@Justin-Huber
Contributor Author

Added the changes, let me know what you think. :)

Comment on lines 35 to 36
// from http://www.pcg-random.org/posts/bounded-rands.html
// TODO how could we make this casting safer, raising an error if lost information?
uint32_t range = uint32_t(orig_range);
uint32_t x = mt_rand();
Contributor

@smarie smarie Jun 25, 2020

Don't you think that it would be safer to keep this? Indeed, bounded_rand_int is marked as inline, so I guess (but I'm no C++ expert) that the input can actually be anything; there is absolutely no protection. At least having this cast would ensure that there are no side effects when this code is used badly directly from C++ (from Python it's already safe thanks to the Cython wrapper). What do you think?

Contributor Author

Now that the input to bounded_rand_int is uint32_t, I think this extra step is redundant (casting uint32_t to uint32_t).

Contributor

OK!
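
For reference, the technique behind bounded_rand_int is the multiply-shift rejection method from the pcg-random.org post cited in the hunk above. A rough Python emulation (32-bit arithmetic emulated with Python ints and masks; the names here are illustrative, not the actual C++ code):

import random

def bounded_rand(rand32, rng_range):
    # Map a 32-bit draw x to floor(x * rng_range / 2**32); reject the few
    # draws whose low product bits would make some outputs over-represented.
    threshold = (1 << 32) % rng_range       # count of "biased" low-bit values
    while True:
        x = rand32()                        # uniform draw in [0, 2**32)
        m = x * rng_range                   # exact 64-bit product
        if (m & 0xFFFFFFFF) >= threshold:   # accept
            return m >> 32                  # unbiased result in [0, rng_range)

rng = random.Random(42)
print([bounded_rand(lambda: rng.getrandbits(32), 100) for _ in range(5)])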

@smarie
Contributor

smarie commented Jun 25, 2020

Great @Justin-Huber! Apart from two remaining comments (see above), this looks very good to me. Almost there! Thanks for your patience.

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Member

@jnothman jnothman left a comment

This looks good, only stylistic nitpicks

# (1) check uniformity of p-values
uniform_p_vals_dist = stats.uniform(loc=0, scale=1)
res_pvals = stats.kstest(ks_pvals, uniform_p_vals_dist.cdf)
assert res_pvals.pvalue > 0.05,\
Member

We prefer to avoid \ for line continuation. Instead, open a parenthesis, like:

    assert res_pvals.pvalue > 0.05, (
        "Null hypothesis rejected: generated random numbers are not uniform."
        " Details: the (meta) p-value of the test of uniform distribution"
        f" of p-values is {res_pvals.pvalue} which is not > 0.05")

min_10pct_pval = np.percentile(ks_pvals, q=10)
# lower 10th quantile pvalue <= 0.05 means that the test rejects the
# null hypothesis that the sample came from the uniform distribution
assert(min_10pct_pval > 0.05),\
Member

Please remove the parentheses around this expression, and avoid \ for line continuation

Contributor

@smarie smarie left a comment

All good for me, thanks for your patience!

Justin-Huber and others added 4 commits June 26, 2020 12:44
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
@Justin-Huber
Contributor Author

Awesome, I made those last style fixes :)

Member

@adrinjalali adrinjalali left a comment

@Justin-Huber thanks for the persistence and your patience, and @smarie thanks a whole ton for your continued detailed review. Really appreciate it :)

@adrinjalali adrinjalali merged commit 430c208 into scikit-learn:master Jun 28, 2020
@smarie
Contributor

smarie commented Jun 28, 2020

You're welcome @adrinjalali :)
Thanks @Justin-Huber !

(Note, for what it's worth: two style comments from @jnothman seem to have been forgotten; maybe this can be done by a maintainer directly on master?)

@Justin-Huber
Contributor Author

Awesome! It's been a pleasure; thanks for the help, @smarie :)

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020
…7157)

* Implemented issue scikit-learn#16862

* Implemented issue scikit-learn#16862

* Ready for pull request

* Cleaned up redundancies

* Removed test_set_seed_wrap for now

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Review feedback changes

* Style fixes

* np.quantile -> np.percentile

* Update sklearn/svm/src/newrand/newrand.h

Merged suggestion

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Update sklearn/svm/tests/test_bounds.py

Merge suggestions

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* linting fix

* linting fix

* linting fix

* Added deterministic test for mt19937 in svm/newrand.h

* Linting fix

* Review changes

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

* Update sklearn/svm/tests/test_bounds.py

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

* Update test_bounds.py

* Linting fixes

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Successfully merging this pull request may close these issues.

Create a python wrapper and unit test for the new bounded_rand_int C function used in LibSVM and LibLinear