[WIP] Make random_state accept np.random.Generator #23962
Conversation
sklearn/utils/fixes.py (outdated)

```diff
@@ -21,6 +21,13 @@
 from .._config import config_context, get_config
 from ..externals._packaging.version import parse as parse_version

+# below copied verbatim from scipy._lib._util.py to be used in rng_integers
+try:
+    from numpy.random import Generator as Generator
```
Our minimum supported NumPy version is 1.17.3, so we can assume that Generators can be imported.
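For context, the helper being vendored dispatches on the rng type. A rough sketch, modeled on `scipy._lib._util.rng_integers` (simplified, so details may differ from the vendored version):

```python
import numpy as np

def rng_integers(gen, low, high=None, size=None, dtype="int64", endpoint=False):
    """Sample integers uniformly from both RandomState and Generator.

    Sketch modeled on scipy._lib._util.rng_integers: a Generator uses
    ``integers`` (optionally inclusive of ``high``), while a RandomState
    uses the legacy ``randint`` (always exclusive of ``high``).
    """
    if isinstance(gen, np.random.Generator):
        return gen.integers(low, high=high, size=size, dtype=dtype, endpoint=endpoint)
    if endpoint:
        # randint excludes the upper bound, so shift it to emulate endpoint=True
        if high is None:
            return gen.randint(low + 1, size=size, dtype=dtype)
        return gen.randint(low, high=high + 1, size=size, dtype=dtype)
    return gen.randint(low, high=high, size=size, dtype=dtype)
```

With such a shim, `rng_integers(rng, 0, 4, size=(20,))` behaves the same whether `rng` is a `RandomState` or a `Generator`.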
I missed that 👍
sklearn/tree/tests/test_tree.py (outdated)

```diff
@@ -175,7 +176,7 @@
 # NB: despite their names X_sparse_* are numpy arrays (and not sparse matrices)
 X_sparse_pos = random_state.uniform(size=(20, 5))
 X_sparse_pos[X_sparse_pos <= 0.8] = 0.0
-y_random = random_state.randint(0, 4, size=(20,))
+y_random = rng_integers(random_state, 0, 4, size=(20,))
```
To make this PR smaller, I prefer to leave the test unchanged. Currently, the tests are always using a `RandomState` object.
Done
```diff
@@ -73,8 +74,8 @@ def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=None
 (
     rng.randn(n_nonzeros),
     (
-        rng.randint(n_samples, size=n_nonzeros),
-        rng.randint(n_features, size=n_nonzeros),
+        rng_integers(rng, n_samples, size=n_nonzeros),
```
Same here regarding not needing to change files in the main sklearn files.
Done
- Import can assume that Generator exists
- Revert rng_integers use where not necessary
The linting error and CI failure look related to this PR. If it's okay, I would address the linting problems later, before creating the non-draft PR.
The CI does not fully run unless linting passes. This makes it harder to evaluate the PR even as a draft.
@thomasjpfan Regarding the failing tests, I think we have an interesting problem here.
For me, it is about the scope. This PR turns on Generator support everywhere, which touches a lot of estimators all at once. I prefer to incrementally turn on Generator support.
- Remove rng_integers call where not necessary
- Isinstance check for random ints also accepts np.int_
- Black formatting
I changed the tests to also accept `np.int_`.
This is certainly feasible. A disadvantage would be that
There are still problems stemming from the integer dtype. I believe those go back to the problem I mentioned earlier.
There are some possible solutions to that problem but I'm not sure which one to take.
I am okay with that as long as we document which estimators support generators in their docstrings. We can incrementally update the docstrings of
I'm thinking more about estimators opting in to Generator support, not about user opt-in. Let's say I want MDS to opt in: we add `"random_state": ["random_state", np.random.Generator]` to its parameter constraints and include it in the common test. During review, we can look at MDS's code to make sure the estimator is configured in a way that actually uses the random state. With this PR turning on Generator support everywhere, it is hard to confirm that all estimators are configured to actually use the generator. For me, this makes it harder to review.
Pass
We decided to opt in estimators (and all the rest) step by step into using Generators. Therefore, I reverted all the changes in the actual estimators that were necessary to accommodate Generators, which comes down to the use of rng_integers for now.

The common test has been adjusted to have a long list of excluded estimators -- currently containing all estimators -- that are skipped for testing. The idea is that if a new PR comes along that opts an estimator in, it should be as easy as crossing that estimator off the list to be able to check if it still works.

Note for future developers: the "random_state" variable is sometimes also referred to as "rng" or "rnd" (and maybe others that I missed), so a simple grep for "random_state" is not enough.
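The exclusion-list idea could be sketched like this; the list entries and the helper are hypothetical, not the PR's actual test code:

```python
import numpy as np

# Hypothetical exclusion list: it initially contains every estimator, and
# opting an estimator in means crossing its name off the list.
GENERATOR_NOT_YET_SUPPORTED = {"KMeans", "MDS"}  # illustrative entries

def fit_accepts_generator(name, estimator_factory, X, y):
    """Return None if the estimator is still excluded, True if it fits
    successfully with a np.random.Generator as random_state."""
    if name in GENERATOR_NOT_YET_SUPPORTED:
        return None  # skipped until a PR opts this estimator in
    est = estimator_factory(random_state=np.random.default_rng(0))
    est.fit(X, y)
    return True
```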
Updated status

After discussion with maintainers, we decided that estimators, splitters, and other functions should be opted in step by step into allowing `np.random.Generator`. Currently, this PR thus contains tests for estimators that check if they can be fitted with a `Generator`.

Guide how to opt an estimator into allowing `Generator`:
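For an estimator to accept a `Generator`, validation has to pass it through. A minimal sketch of what an extended `check_random_state` could look like, modeled on `sklearn.utils.check_random_state` with the `Generator` branch added (an assumption about the approach, not the PR's exact code):

```python
import numbers
import numpy as np

def check_random_state(seed):
    """Turn seed into a RandomState, or pass a Generator through unchanged.

    Sketch of sklearn.utils.check_random_state with an extra branch that
    accepts np.random.Generator instances (the behavior this PR enables).
    """
    if seed is None or seed is np.random:
        return np.random.mtrand._rand  # the global RandomState singleton
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, (np.random.RandomState, np.random.Generator)):
        return seed
    raise ValueError(
        f"{seed!r} cannot be used to seed a numpy.random.RandomState instance"
    )
```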
No actual changes to how the code works, since KBinsDiscretizer only uses the 'choice' method, which is backwards compatible.
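This works because `choice` has a compatible signature on both interfaces, as a quick illustration (not code from the PR) shows:

```python
import numpy as np

# `choice` exists with a compatible signature on both the legacy RandomState
# and the new Generator, so code using only `choice` needs no adaptation.
rs = np.random.RandomState(0)
gen = np.random.default_rng(0)

a = rs.choice(10, size=5, replace=False)
b = gen.choice(10, size=5, replace=False)

assert a.shape == b.shape == (5,)
assert len(set(a.tolist())) == 5  # replace=False: all values distinct
assert len(set(b.tolist())) == 5
```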
TODOs

Here is a list of classes and functions I could find that use a `random_state`:

- Estimators
- Splitters
- Rest
Documentation

Besides the individual docstrings of the classes/functions mentioned above, the documentation should be adjusted here:
@thomasjpfan as discussed, I changed the PR to only test a single estimator to decrease the review burden. That estimator is `KBinsDiscretizer`.

As for the updated docstring, for now I went with this very simple change:

```diff
- random_state : int, RandomState instance or None, default=None
+ random_state : int, RandomState/Generator instance or None, default=None
```

The reason is that this line is already quite long, and from what I can tell, very long lines for parameter types (or line breaks, for that matter) are not desired. The body itself has not been altered. LMK if we want to do that.
Reference Issues/PRs
Fixes #16988
What does this implement/fix? Explain your changes.
The `random_state` argument accepts `numpy.random.Generator`.

Any other comments?
Context
Update: Please see this comment.
This is WIP and I discussed with @thomasjpfan that it would make sense to share the current progress to evaluate if the scope is sufficiently small for a single PR or if we need to split it.
Done

- Added tests for estimators
- Made tests for estimators pass (reverted)

Missing

- splitters
- other components, e.g. for creating random datasets
- documentation
- docstrings
- `n_jobs>1` is probably out of scope

Implementation
One difficulty is that `Generator` has a slightly different API than the existing `RandomState` class, namely that creating integers now happens through the `integers` method, not `randint`. We (Thomas and I) discussed 3 different approaches to support `Generator`s:

1. Adapter for `RandomState`: If `check_random_state` sees a `Generator`, it returns an adapter that supports the `randint` method with the old signature. This would be backwards compatible with all existing code but locks sklearn into the "old way". Also, the appearance of this new class could be surprising to users.
2. Adapter for `Generator`: If `check_random_state` sees a `RandomState`, it returns an adapter that supports the `integers` method with the old signature. This would be forwards compatible with the "new way". However, it requires changing all existing calls to `randint`, and the appearance of this new class could be surprising to users.
3. A `rng_integers` helper function: This is the way that scipy approached the problem. It also requires changing all the calls to `randint`, but it's more transparent than solution 2. One disadvantage is that all other sampling functions are method calls on the object; only integers require this function, which can be surprising.

In the end, we decided to go with option 3 because we assume that it worked well for scipy and should thus also serve sklearn well.
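The API difference driving this decision can be seen directly; a quick illustration (not code from the PR):

```python
import numpy as np

rs = np.random.RandomState(42)   # legacy interface
gen = np.random.default_rng(42)  # new Generator interface

# Integer sampling is the method that was renamed:
a = rs.randint(0, 10, size=5)    # RandomState: randint
b = gen.integers(0, 10, size=5)  # Generator: integers

# Neither object exposes the other's spelling...
assert not hasattr(gen, "randint")
assert not hasattr(rs, "integers")

# ...while most other sampling methods exist on both with compatible names.
assert hasattr(rs, "choice") and hasattr(gen, "choice")
assert hasattr(rs, "uniform") and hasattr(gen, "uniform")
```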
Another decision that I made while working on the feature is not to change `randint` method calls where the object is known to be a `RandomState`, as in many existing tests. Therefore, grepping through the repo for `randint` still reveals many direct calls, but unless I overlooked something, they should all be safe.

Caveats

It's almost impossible to have complete test coverage for this feature. The reason is that even though we check all estimators that support `random_state`, we don't know if the code path that actually uses `random_state` is being taken or not, since it might depend on hyper-parameters. A similar argument applies to splitters and other functions.