Python wrapper and tests for sklearn/svm/newrand/ #17157
Conversation
I guess that "Fixes #16862" is missing from the description :)
To answer your question about why this is failing: the build log states:
Does the build work for you locally? If so, that means that you probably forgot to add this. Note that you seem to also have forgotten the
Haha I had the tag in a comment so it didn't show. And yeah I was just an idiot. I was playing around with Cython and never changed it back before pushing.
Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Oh cool, didn't realize that's what that was.
sklearn/svm/src/newrand/newrand.h
Outdated
*/
#ifndef _NEWRAND_H
#define _NEWRAND_H

#ifdef __cplusplus
#include <random>
Any reason for adding this include? It did not seem to be useful before.
That's so Cython can create a .cpp file from newrand.h.
OK, I see. Maybe we can add this as a comment before the include line?
sklearn/svm/tests/test_bounds.py
Outdated
@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
def test_bounded_rand_int(orig_range):
    n_pts = 1000
Maybe it would be good to also parametrize the number of points, as _MAX_INT is, well, a bit large. Something like this?
-@pytest.mark.parametrize("orig_range", [_MAX_INT, 100])
-def test_bounded_rand_int(orig_range):
-    n_pts = 1000
+@pytest.mark.parametrize("orig_range,n_pts", [(_MAX_INT, 10000), (100, 10)])
+def test_bounded_rand_int(orig_range, n_pts):
sklearn/svm/tests/test_bounds.py
Outdated
sample = [bounded_rand_int_wrap(orig_range) for _ in range(n_pts)]
res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
We could be a bit more strict and use a quantile. In my tests, using the lower 10% quantile seems to pass:
min_10pct_pval = np.quantile(ks_pvals, q=0.1)
Good idea, any reason 10% quantile in particular?
At first I was expecting the minimum to be OK (as you first tried). However, this did not work.
So I tried to find the lowest possible quantile that works :)
I finally found the right scientific explanation in https://stats.stackexchange.com/questions/298350/uniform-distribution-testing
That is how all hypothesis tests work. Under the null hypothesis, p is uniformly distributed.
We should probably include this as a comment, as it is non-trivial for non-statistics experts.
So maybe "the" best way to do the final test (in addition to the current assert on the quantile, which cannot hurt) is to do a final additional K-S uniformity test on the p-values themselves. They should be uniform on [0, 1] if the null hypothesis is not rejected, so the final p-value of that last test is the one to compare with 0.05.
I learned something today :)
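As a rough sketch of that suggestion, the meta-test could look like this (it assumes ks_pvals is the list of per-sample K-S p-values collected in the loop above):

from scipy import stats

# ks_pvals: list of per-sample K-S p-values collected in the loop above (assumed)
# Under the null hypothesis these p-values are themselves uniformly
# distributed on [0, 1], so run one more K-S test on them and compare
# its p-value with 0.05.
uniform_p_vals_dist = stats.uniform(loc=0, scale=1)
res_pvals = stats.kstest(ks_pvals, uniform_p_vals_dist.cdf)
assert res_pvals.pvalue > 0.05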
sklearn/svm/tests/test_bounds.py
Outdated
res = stats.kstest(sample, uniform_dist.cdf)
ks_pvals.append(res.pvalue)
avg_pval = np.mean(ks_pvals)
assert(avg_pval > 0.05),\
I would suggest adding a comment for readability/maintenance:
"pval < 0.05 means that the test rejects the null hypothesis that the sample came from the uniform distribution"
In addition to the test failure message?
If you wish to leave all the details in the messages you can, too. In that case the message should be a bit more detailed. I usually put a very detailed comment in the code comments, and a brief assert message. But that's a matter of style; I do not know what the rule is here in sklearn.
Also, an unrelated question: is there a make target for running the pull-request checks (pep8, tests, coverage, etc.)?
For testing you would need to build from source. You can find more information in the contributing guide.
Seems like the numpy version in the Azure pipeline is 1.13.3, which doesn't have quantile. Does the pipeline use the same versions as the latest version of sklearn?
We test with multiple versions of numpy. For now our minimum supported version of numpy is 1.13. An alternative to np.quantile is np.percentile.
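For illustration, the two calls are interchangeable apart from the scale of the argument: np.percentile expects the quantile in percent and predates np.quantile (which was only added in numpy 1.15). The example values below are placeholders, not from the PR:

import numpy as np

ks_pvals = [0.2, 0.5, 0.9]  # placeholder values, for illustration only
# lower 10% quantile, expressed as a percentage instead of a fraction
min_10pct_pval = np.percentile(ks_pvals, q=10)  # same result as np.quantile(ks_pvals, 0.1)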
Ah ok, thanks for the clarification!
Hey @smarie, I wanted to check in on this PR. What's left that needs to be done for this?
Hi @Justin-Huber, thanks for asking. It seems that my previous review comments were somehow hidden from you by GitHub, so I re-created both (one on the include header comment and one on the test per se). The comment concerning the test seems important: I discovered that p-values should be uniformly distributed under the null hypothesis, so we can do a statistical test on the uniformity of the p-values themselves. This does not add much but should provide a more understandable answer to the question "why a quantile?". If you're OK with these two proposals and merge them in, the rest is OK for me and I'll hand over to the "official" scikit-learn reviewers (@thomasjpfan seems to be the one) for style and other nitpicks, since I do not completely master all the official style rules in sklearn ;)
Merged suggestion Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
sklearn/svm/tests/test_bounds.py
Outdated
# Under the null hypothesis,
# p-values should be uniformly distributed and not concentrated
# on low values
# (this may seem counter-intuitive but is backed by multiple refs)
Minor formatting issue
@Justin-Huber I think that your idea of adding a test for determinism is indeed great. Could you please complete it? Other than that and a few nitpicks, LGTM :)
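As a rough illustration of what such a determinism check could look like (a sketch, not the test that ended up in the PR; the wrapper names come from this PR, and the import path is assumed):

from sklearn.svm._newrand import set_seed_wrap, bounded_rand_int_wrap  # import path assumed


def _draw(seed, n=10, bound=100):
    # re-seed the newrand generator, then draw n bounded integers
    set_seed_wrap(seed)
    return [bounded_rand_int_wrap(bound) for _ in range(n)]


def test_newrand_determinism():
    # re-seeding with the same value must reproduce the exact same sequence
    assert _draw(42) == _draw(42)
    # a different seed should, with overwhelming probability, give a different sequence
    assert _draw(42) != _draw(43)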
So I added the deterministic check back in, which should only pass for the 32-bit mt19937, but it all passes, so I think the mt19937_64 is never being created among all the tests. Are there any tests run on a 64-bit system where the INT_MAX in random.h would be 64-bit?
sklearn/svm/tests/test_bounds.py
Outdated
[(None, 81),
 (0, 54),
 (_MAX_UNSIGNED_INT, 9)])
def test_set_seed(seed, val):
Since this is located in a file shared with other tests, I recommend giving it a more specific name and an explanation:
-def test_set_seed(seed, val):
+def test_newrand_seed(seed, val):
+    """Test that set_seed_wrap works correctly with bounded_rand_int_wrap"""
sklearn/svm/tests/test_bounds.py
Outdated
@pytest.mark.parametrize('orig_range, n_pts',
                         [(_MAX_INT, 10000), (100, 10)])
def test_bounded_rand_int(orig_range, n_pts):
Since this is located in a file shared with other tests, I recommend giving it a more specific name and an explanation:
-def test_bounded_rand_int(orig_range, n_pts):
+def test_newrand_bounded_rand_int(orig_range, n_pts):
+    """Test that `bounded_rand_int` follows a uniform distribution"""
(EDITED 24/06/2020 13:41) Thanks @Justin-Huber for the update. I suggested a few minor changes concerning test names and docstrings, see above. Concerning your last question, can you explain why you think your
It seems from this reference that
So, since the current behaviour seems to always actually use a 32-bit PRNG (which is recommended), maybe we should keep it like this and remove the currently unused mt19937_64. To conclude, according to what you have found, I suggest doing the following things:
@pytest.mark.parametrize('seed',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_set_seed_overflow(seed):
    """Test that `set_seed_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        set_seed_wrap(seed)


@pytest.mark.parametrize('orig_range',
                         [-1, _MAX_UNSIGNED_INT + 1])
def test_bounded_rand_int_limits(orig_range):
    """Test that `bounded_rand_int_wrap` is defined for unsigned 32bits ints"""
    with pytest.raises(OverflowError):
        bounded_rand_int_wrap(orig_range)

What do you think? I would love to get the feedback from people more at ease with C/C++ than myself here, and involved in the original PR. Maybe @jnothman and @thomasjpfan?
Note for reference: in order to double-check that the current implementation does not accept more than a 32-bit unsigned int for
Added the changes, let me know what you think. :)
// from http://www.pcg-random.org/posts/bounded-rands.html
// TODO how could we make this casting safer, raising an error if lost information?
uint32_t range = uint32_t(orig_range);
uint32_t x = mt_rand();
Don't you think that it would be safer to keep this? Indeed, bounded_rand_int is marked as inline, so I guess (but I'm no C++ expert) that the input can actually be anything; there is absolutely no protection. At least having this cast would ensure that when this code is badly used directly from C++ there are no side effects (from Python that's already safe thanks to the Cython wrapper). What do you think?
Now that the input to bounded_rand_int is uint32_t, I think this extra step is redundant (casting uint32_t to uint32_t)
ok!
Great @Justin-Huber! Apart from two remaining comments (see above), looks very good to me. Almost there! Thanks for your patience.
Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
This looks good, only stylistic nitpicks
sklearn/svm/tests/test_bounds.py
Outdated
# (1) check uniformity of p-values
uniform_p_vals_dist = stats.uniform(loc=0, scale=1)
res_pvals = stats.kstest(ks_pvals, uniform_p_vals_dist.cdf)
assert res_pvals.pvalue > 0.05,\
We prefer to avoid \ for line continuation. Instead, open a parenthesis, like:
assert res_pvals.pvalue > 0.05, (
    "Null hypothesis rejected: generated random numbers are not uniform."
    " Details: the (meta) p-value of the test of uniform distribution"
    f" of p-values is {res_pvals.pvalue} which is not > 0.05")
sklearn/svm/tests/test_bounds.py
Outdated
min_10pct_pval = np.percentile(ks_pvals, q=10)
# lower 10th quantile pvalue <= 0.05 means that the test rejects the
# null hypothesis that the sample came from the uniform distribution
assert(min_10pct_pval > 0.05),\
Please remove the parentheses around this expression, and avoid \ for line continuation.
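Applying the same pattern as in the previous suggestion, the assert could become something like this (the message wording below is illustrative, not taken from the diff):

# detailed explanation stays in the comment above; brief message in the assert
assert min_10pct_pval > 0.05, (
    "lower 10th percentile p-value <= 0.05: null hypothesis of a uniform"
    " distribution is rejected")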
All good for me, thanks for your patience!
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Awesome, I made those last style fixes :)
@Justin-Huber thanks for the persistence and your patience, and @smarie thanks a whole ton for your continued detailed review. Really appreciate it :)
You're welcome @adrinjalali :) (note, for what it's worth: two style comments from @jnothman seem to have been forgotten - maybe this can be done by a maintainer directly on master?)
Awesome! It's been a pleasure, thanks for the help! @smarie :)
…7157)

* Implemented issue scikit-learn#16862
* Implemented issue scikit-learn#16862
* Ready for pull request
* Cleaned up redundancies
* Removed test_set_seed_wrap for now
* Update sklearn/svm/tests/test_bounds.py
  Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
* Review feedback changes
* Style fixes
* np.quantile -> np.percentile
* Update sklearn/svm/src/newrand/newrand.h
  Merged suggestion
  Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
* Update sklearn/svm/tests/test_bounds.py
  Merge suggestions
  Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
* linting fix
* linting fix
* linting fix
* Added deterministic test for mt19937 in svm/newrand.h
* Linting fix
* Review changes
* Update sklearn/svm/tests/test_bounds.py
  Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
* Update sklearn/svm/tests/test_bounds.py
  Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
* Update sklearn/svm/tests/test_bounds.py
  Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
* Update test_bounds.py
* Linting fixes

Co-authored-by: smarie <sylvain.marie@schneider-electric.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Reference Issues/PRs
Resolves #16862
What does this implement/fix? Explain your changes.
Created a Python wrapper and tests for sklearn/svm/newrand/.
Test that newrand creates an unbiased distribution of random numbers in the given range by comparing deterministic sampling against scipy's scipy.stats.uniform using scipy.stats.kstest (Kolmogorov-Smirnov test) at a significance level of 0.05.
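For readers who want the shape of the test without opening the diff, the core of it roughly follows this sketch (the wrapper import path, the helper name, and the iteration counts are assumptions for illustration, not copied from the PR):

import numpy as np
from scipy import stats

from sklearn.svm._newrand import bounded_rand_int_wrap  # import path assumed


def check_uniformity(orig_range, n_pts, n_iter=100):
    """Repeatedly K-S-test samples of bounded_rand_int_wrap against U(0, orig_range)."""
    uniform_dist = stats.uniform(loc=0, scale=orig_range)
    ks_pvals = []
    for _ in range(n_iter):
        sample = [bounded_rand_int_wrap(orig_range) for _ in range(n_pts)]
        ks_pvals.append(stats.kstest(sample, uniform_dist.cdf).pvalue)
    # under the null hypothesis the p-values are themselves uniform on [0, 1],
    # so a second K-S test on them yields a single p-value to compare with 0.05
    meta_pval = stats.kstest(ks_pvals, stats.uniform(loc=0, scale=1).cdf).pvalue
    assert meta_pval > 0.05
    # complementary, stricter check on the lower 10th percentile of the p-values
    assert np.percentile(ks_pvals, q=10) > 0.05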