[MRG + 1] Set a random random_state at init to ensure deterministic randomness when random_state is None
#7935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Sorry about this.
No problem at all! :)
# are random for each initialization of splitter but consistent
# across multiple calls for the same initialization.
self._random_state = check_random_state(
    random_state).randint(99999999)
np.iinfo(np.int32).max instead of 99999999? Same below.
you forgot this one
Ahh sorry. Done :)
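The seed-at-init pattern under review can be sketched in isolation as follows. `_SplitterSketch` is an illustrative stand-in for a CV splitter, not the actual scikit-learn class; it uses `np.random.RandomState` directly in place of sklearn's `check_random_state` helper so the sketch is self-contained:

```python
import numpy as np

class _SplitterSketch:
    """Illustrative stand-in for a CV splitter that seeds at init."""

    def __init__(self, random_state=None):
        self.random_state = random_state  # kept as given, for repr
        # Draw one concrete seed up front: splits are random per
        # initialization but identical across split() calls. This mirrors
        # check_random_state(random_state).randint(np.iinfo(np.int32).max)
        self._random_state = np.random.RandomState(random_state).randint(
            np.iinfo(np.int32).max)

    def split(self, X):
        # Reseed from the stored integer on every call.
        rng = np.random.RandomState(self._random_state)
        return rng.permutation(len(X))

cv = _SplitterSketch()  # random_state=None, yet split() is repeatable
X = list(range(20))
assert np.array_equal(cv.split(X), cv.split(X))
```

Two separately constructed `_SplitterSketch` objects still draw different seeds, so randomness across initializations is preserved.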
for data in zip((X, X2), (y, y2)):
    # For the same initialilzation, splits should be same across
initialilzation
    # multiple split calls, even when random_state is not set.
    np.testing.assert_equal(list(cvs[0].split(*data)),
                            list(cvs[0].split(*data)))
Add # For different initialisations, splits should not be same when random_state is not set.
Have addressed your comments @TomDLT... Thx :)
@GaelVaroquaux Would you have some time for this one?
You also need to undo all the list-casting that is already in master, right? Also, I think I'm +.5 for not introducing a private attribute and using random_state instead directly.
@@ -284,7 +284,16 @@ def __init__(self, n_splits, shuffle, random_state):
        self.n_splits = n_splits
        self.shuffle = shuffle
        # For repr
What do you mean by this comment? You don't like an explicit random state in the __repr__? Because it's surprising for the user? It would make reproducing easier and not introduce a private variable that is dependent on a global state....
This means that doing cv.random_state = 10 doesn't do anything, though, even though it changes the __repr__.
On the other hand, having a number pop up here if it wasn't specified might be surprising for users that expect this to work like an Estimator object. Which it does not...
I thought about not having this but assumed it would be cleaner to reproduce the input that the user gave in repr instead of random. But your legitimate reasons justify the removal of this... I'll do and push in a while... Thnx
Done...
@@ -560,9 +569,9 @@ def __init__(self, n_splits=3, shuffle=False, random_state=None):
     def _make_test_folds(self, X, y=None, groups=None):
         if self.shuffle:
-            rng = check_random_state(self.random_state)
+            rng = check_random_state(self._random_state)
this if still bothers me.
try:
    np.testing.assert_equal(list(cv.split(*data)),
                            list(cv.split(*data)))
    np.testing.assert_equal(list(cvs[1].split(*data)),
why don't you do assert_false(list(...) == list(...))?
That doesn't compare nested lists of ndarrays...
assert_false(all([(x == y).all() for x, y in zip(cvs[1].split(*data), cv[2].split(*data))]))
hm... maybe not much better than the current one? It took me several reads to understand the error message you provide.
As @amueller pointed out I had a similar implementation of numpy.testing.assert_not_equal in #7946. I'd suggest we add that as an assert helper in sklearn/utils/testing.py.
@amueller I don't think your solution is future proof because you are in an edge case of array comparison when comparing arrays containing numpy arrays. Example:
import numpy as np
arr = np.array([np.array([1, 2]), np.array([1, 2, 3])])
arr == arr
# Returns: array([True, True], dtype=bool)
# with warning: DeprecationWarning: numpy equal will not check object identity in the future.
# The error trying to get the boolean value of the comparison result will be raised.
arr1 = np.array([np.array([1, 2]), np.array([1, 2, 3])])
arr == arr1
# Returns: False
# with warning: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
Basically the first case kind of works because it falls back to comparing identity of the inner arrays when element comparison does not return a scalar. The second case returns False because the inner arrays are not identical.
More details at numpy/numpy#8323 (comment).
I was going by the current numpy implementation, which I guess you found a bug in?
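Given the object-array pitfalls discussed above, a comparison helper that sidesteps them could look like this. This is only a sketch; `splits_differ` is an illustrative name, not an existing scikit-learn or numpy utility:

```python
import numpy as np

def splits_differ(splits_a, splits_b):
    """Return True if any (train, test) index array differs between two
    materialized lists of splits. Avoids elementwise == on ragged
    object arrays by comparing each inner array with np.array_equal."""
    for (tr_a, te_a), (tr_b, te_b) in zip(splits_a, splits_b):
        if not (np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)):
            return True
    return False

a = [(np.array([0, 1]), np.array([2, 3]))]
b = [(np.array([0, 2]), np.array([1, 3]))]
assert splits_differ(a, b)
assert not splits_differ(a, a)
```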
Thanks for the review. The undoing of list casting is done at #7941 which would benefit from a review!
@raghavrv how are these two separate PRs? The undoing of list casting can only be done once the change in this PR is merged, right?
also, we're doing 0.18.2 for this? Sorry I was offline over the weekend but that's news to me.
# are random for each initialization of splitter but consistent
# across multiple calls for the same initialization.
self._random_state = check_random_state(
    random_state).randint(np.iinfo(np.int32).max)
Gaël had instead suggested the use of get_state. Apart from the ability to represent many more than 2**32 possible states, it is faster to set the state than to seed a new RandomState.
I'm okay with this though.
Doing check_random_state(self.random_state).get_state() in all cases will ensure similar behaviour regardless of the form of the parameter.
Sorry could you elaborate a bit more? get_state gets me a tuple of the RandomState(...) object which represents its state... Do you mean we store this at random_state? (That garbles the repr...)
What do you mean by "garble"? Show a RandomState object? That's fine, isn't it?
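For reference, the get_state/set_state approach Joel mentions would look roughly like this: the full generator state is captured once and restored later, instead of drawing an integer seed and reseeding:

```python
import numpy as np

rng = np.random.RandomState()  # arbitrary (possibly global) state
state = rng.get_state()        # full Mersenne Twister state tuple

first = rng.randint(100, size=5)
rng.set_state(state)           # rewind the generator instead of reseeding
second = rng.randint(100, size=5)

assert (first == second).all()  # identical draws after restoring the state
```

Restoring a saved state is cheaper than constructing a new `RandomState`, and it can represent any reachable generator state, not just the 2**32 seedable ones.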
    self._random_state = check_random_state(
        random_state).randint(np.iinfo(np.int32).max)
else:
    self._random_state = random_state
This will not produce the same splits each time if this is a RandomState instance.
good catch...
Wait, I thought about this and did a copy, I think... Or maybe I had it in my mind but didn't do anything... Anyway this needs a test... Thx!
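The pitfall Joel points out is easy to demonstrate with plain numpy: a stored RandomState instance advances its state on every use, while a stored integer seed does not:

```python
import numpy as np

# Using a shared RandomState instance directly: state advances per call,
# so repeated "splits" differ.
rng = np.random.RandomState(0)
a = rng.randint(10, size=5)   # consumes generator state
b = rng.randint(10, size=5)   # different draws
assert not np.array_equal(a, b)

# Drawing an integer seed once and reseeding from it: repeatable.
seed = np.random.RandomState(0).randint(np.iinfo(np.int32).max)
c = np.random.RandomState(seed).randint(10, size=5)
d = np.random.RandomState(seed).randint(10, size=5)
assert np.array_equal(c, d)
```

This is why the `else` branch above needs a test: if `random_state` is a `RandomState` instance, storing it unchanged in `self._random_state` reintroduces the call-to-call variability the PR is trying to remove.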
Not exactly as we need to support one time iterables (at least for the time being) in grid search, I ensured the order of iteration while populating jobs for parallel is such that the splits are generated once and used for all parameters...
@raghavrv that seems to be true for some of the materializations but not all, which means that we could do that but then would have another PR to fix the remaining one. If we accept this PR, there is no need to change the loops, right?
All except
But what about one time iterables? Do we still support them?
I'm afraid that I'm gonna suggest yes, for, as @GaelVaroquaux pointed out, it really consumes a lot of memory especially if your data and/or splits are pretty large...
I'm sorry these approaches didn't cross my mind back then :/
Hm ok, for one-time iterables, I guess we do need to change the order. Did we support them in 0.17?
No; currently in 0.18.1, even for a normal cv splitter, the list gets materialized and hence consumes memory. I'll post a snippet comparing the memory tomorrow... But something like GridSearchCV(est, ..., cv=StratifiedShuffledSplit(n_splits=100)).split(X, y) where X is ~1M samples would quickly give you an idea...
From my perspective, anything without split() has to be materialised as a list.
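The memory point can be illustrated without scikit-learn: materializing all splits keeps O(n_splits × n_samples) index entries alive at once, whereas a generator holds a single split at a time. The sizes below are illustrative, not measurements from the actual GridSearchCV code path:

```python
import numpy as np

n_samples, n_splits = 100_000, 100

def iter_splits(seed=0):
    # Yield (train, test) index arrays one split at a time.
    rng = np.random.RandomState(seed)
    for _ in range(n_splits):
        idx = rng.permutation(n_samples)
        yield idx[: n_samples // 2], idx[n_samples // 2 :]

# Materializing keeps every index array alive simultaneously:
splits = list(iter_splits())
total = sum(tr.nbytes + te.nbytes for tr, te in splits)
print(total / 1e6, "MB held at once")  # tens of MB for 64-bit indices
```

Iterating `iter_splits()` directly instead of calling `list()` would keep only one pair of arrays in memory at any point.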
I guess I'm not sure I want to push out another hotfix for something we are not entirely sure about. The stronger contract is certainly better, but the back and forth with the behavior is also confusing. And for people doing repeated CV - which is relatively common - the new behavior could give pretty wrong results.
Ah sorry I meant a fit not a split there. Have edited accordingly... |
Hm... do we want
I thought that we wanted the splits to be consistent for multiple split calls? Joel's comment
@raghavrv indeed, but that's not what your code currently implements, right? Have you tested that?
oh you changed it 4 minutes ago lol. I agree with your newest version, I was looking at the previous version and wanted to suggest exactly the fix you made (I think, haven't reviewed).
lol. I was skeptical when I got a review as soon as I pushed ;P
looks good apart from nitpicks.
@@ -378,11 +378,11 @@ class KFold(_BaseKFold):
     >>> from sklearn.model_selection import KFold
     >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
     >>> y = np.array([1, 2, 3, 4])
-    >>> kf = KFold(n_splits=2)
+    >>> kf = KFold(n_splits=2, random_state=0)
why? There is no shuffle
The repr displays a random random_state that fails the test... So I explicitly set one...
@@ -543,11 +543,11 @@ class StratifiedKFold(_BaseKFold):
     >>> from sklearn.model_selection import StratifiedKFold
     >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
     >>> y = np.array([0, 0, 1, 1])
-    >>> skf = StratifiedKFold(n_splits=2)
+    >>> skf = StratifiedKFold(n_splits=2, random_state=0)
no shuffle?
(for repr)
I don't like this. What we could do is instantiate the RandomState object only if shuffle=True? That would break if someone assigned cv.shuffle = True but that seems like an odd thing to do.
This is just confusing to users as is.
Which is why I wanted to preserve the user-set random state by preserving it in the original random_state variable. You suggested that it would be clear to the user to display the explicitly set random_state and hence the current setup...
I don't understand how your comment relates to my comment. You said the tests failed. You could have fixed that with a ....
If we want the repr to be nicer if shuffle is not set, we could not change it then.
I don't think we should change the repr in other cases and hide information from the user.
Okay so we want repr to preserve the user-set random state or the default of None if shuffle is False, correct? In other cases set an explicit random_state if None and display the explicitly set random_state in repr... For the tests use ellipsis to suppress the test error. Sounds good?
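The ellipsis trick being agreed on here is standard doctest behaviour: with the ELLIPSIS option, `...` in the expected output matches any substring, so a run-dependent random_state in a repr no longer fails the example. A minimal self-contained illustration (the KFold repr string below is hard-coded, not produced by scikit-learn):

```python
import doctest

SAMPLE = """
>>> repr_text = "KFold(n_splits=2, random_state=1357924680, shuffle=True)"
>>> print(repr_text)  # doctest: +ELLIPSIS
KFold(n_splits=2, random_state=..., shuffle=True)
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(SAMPLE, {}, "sample", None, 0)
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
runner.run(test)
assert runner.failures == 0  # the ... swallowed the random seed value
```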
@@ -284,7 +284,14 @@ def __init__(self, n_splits, shuffle, random_state):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        if not isinstance(random_state, (np.integer, numbers.Integral)):
Would you mind changing the order of the if instead of negating?
@@ -922,7 +926,14 @@ def __init__(self, n_splits=10, test_size=0.1, train_size=None,
        self.n_splits = n_splits
        self.test_size = test_size
        self.train_size = train_size
        self.random_state = random_state
        if not isinstance(random_state, (np.integer, numbers.Integral)):
again, maybe change the order?
Done
@@ -147,17 +147,17 @@ def test_cross_validator_with_default_params():
     groups = np.array([1, 2, 3, 4])
     loo = LeaveOneOut()
     lpo = LeavePOut(p)
-    kf = KFold(n_splits)
-    skf = StratifiedKFold(n_splits)
+    kf = KFold(n_splits, random_state=0)
why? There's no shuffle
Same comment as above (repr)
-for cv in (kf, skf):
+for cv in (kf, skf, kf2, skf2):
maybe add a comment saying "check that calling split twice yields the same results"
for data in zip((X, X2), (y, y2)):
    # For the same initialization, splits should be same across
    # multiple split calls, even when random_state is not set.
    np.testing.assert_equal(list(cvs[0].split(*data)),
we only need one cross-validation object for that. Why is that not in the loop above?
try:
    np.testing.assert_equal(list(cv.split(*data)),
                            list(cv.split(*data)))
    np.testing.assert_equal(list(cvs[1].split(*data)),
why do we need three objects?
1st object is used to test if multiple calls to split produce the same result if random_state is not set...
2nd and 3rd objects are used to test if they produce different results if random_state is not set... (To ensure that by implicitly setting random_state we are not making it any less random for splits from separate initializations.) Though I could reuse the 1st, I'd prefer having it this way to ensure that the previous test's split call on the 1st object does not affect the next test...
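The three-object scheme described above can be sketched abstractly; `make_splitter` is an illustrative stand-in that is seeded the way this PR seeds splitters, not a scikit-learn API:

```python
import numpy as np

def make_splitter():
    # Stand-in: draw a seed at "init", reseed identically per "split" call.
    seed = np.random.RandomState().randint(np.iinfo(np.int32).max)
    return lambda n: np.random.RandomState(seed).permutation(n)

cv0, cv1, cv2 = (make_splitter() for _ in range(3))

# 1st object: repeated split calls agree even though no seed was given.
assert np.array_equal(cv0(1000), cv0(1000))

# 2nd and 3rd: separate initializations should still differ (with
# overwhelming probability), i.e. implicit seeding keeps splits random
# across initializations.
assert not np.array_equal(cv1(1000), cv2(1000))
```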
why not do the first one in the loop above? that makes much more sense to me.
Have addressed / clarified your comments. Could I have your +1?
try:
    np.testing.assert_equal(list(cv.split(*data)),
                            list(cv.split(*data)))
    np.testing.assert_equal(list(cv1.split(*data)),
I still think this is an ugly pattern but whatever... Is this the only place where we use this? Otherwise we might want to put this into utils?
Maybe for another PR? There are quite a few places (even outside model_selection) using np.testing.assert_*equal...
Sure. Though I was referring to the try: else: construct.
LGTM
Before I review this again, I think we still need to determine whether the current approach is broken enough to justify this kind of backwards-incompatible change. With #7941 we're no longer materialising the list of CV indices, so we no longer have the memory issue. The problem is then that it may be surprising but documented behaviour that multiple calls to split
Hm... we introduced this behavior in 0.18 and now we're going back to where we were in 0.17, right? We didn't really intentionally change that in 0.18 and one could argue that this is a bug-fix to restore the old behavior. On the other hand I guess changing twice is worse than changing once. I'm kind of torn on this. Actually the behavior in current master is consistent with how estimators behave on fit.
We had a different interface in 0.17. We had sufficient information at __init__ to determine the entire partition statically. Here we only have that information when split is called.
You might make the analogy that current behaviour considers split like fit; proposed behaviour here considers split like transform.
So shall I leave this open for further discussions or can we close this as won't fix? With #7941, the more serious underlying issue (of grid search not evaluating on consistent splits for different params) is fixed. This is just a syntactic nice-to-have to ensure compatibility with ... The implications are that with ... Is there a direct unpleasant practical consequence of not having this PR? Could I request a final vote from @GaelVaroquaux @vene @agramfort apart from Andy and Joel?
I'm not sure if this should be tagged a "bug" anymore. Removing "blocker" and "bug" labels... |
Closing as #7941 is sufficient... Comment if you want this reopened... |
This is a better fix for #6726 as opposed to my previous #7660
Ensures deterministic randomness across multiple calls to split even when random_state is not set... This replicates the behaviour of cross-validators in pre-0.18...
In master (0.18.1)
In this branch, splits are different for multiple initializations of cross-validators but same across multiple split calls for the same initialization.
Ping @GaelVaroquaux @jnothman @amueller
(Sorry for the tongue twisting PR title :p)