[MRG + 1] Set a random random_state at init to ensure deterministic randomness when random_state is None #7935


Closed

Conversation


@raghavrv raghavrv commented Nov 24, 2016

This is a better fix for #6726 than my previous #7660.

Ensures deterministic randomness across multiple calls to split even when random_state is not set...

This replicates the pre-0.18 behaviour of the cross-validators...

In master (0.18.1)

>>> from sklearn.model_selection import KFold

# Splits are different for multiple split calls when random_state is not set
>>> kf = KFold(n_splits=2, shuffle=True, random_state=None)
>>> for _ in range(2):
...     print(list(kf.split([1,] * 4)))
[(array([2, 3]), array([0, 1])), (array([0, 1]), array([2, 3]))]
[(array([1, 2]), array([0, 3])), (array([0, 3]), array([1, 2]))]

In this branch, splits are different for multiple initializations of cross-validators but the same across multiple split calls for the same initialization.

>>> from sklearn.model_selection import KFold

# Splits are the same for the same initialization in successive split calls.
>>> kf = KFold(n_splits=2, shuffle=True, random_state=None)
>>> for _ in range(2):
...     print(list(kf.split([1,] * 4)))
[(array([2, 3]), array([0, 1])), (array([0, 1]), array([2, 3]))]
[(array([2, 3]), array([0, 1])), (array([0, 1]), array([2, 3]))]

# Splits are different for different initializations of cross-validators.
>>> for _ in range(2):
...     print(list(KFold(n_splits=2, shuffle=True, random_state=None).split([1,] * 4)))
[(array([0, 2]), array([1, 3])), (array([1, 3]), array([0, 2]))]
[(array([0, 3]), array([1, 2])), (array([1, 2]), array([0, 3]))]

Ping @GaelVaroquaux @jnothman @amueller

(Sorry for the tongue twisting PR title :p)

@raghavrv raghavrv added this to the 0.18.2 milestone Nov 24, 2016
@raghavrv raghavrv changed the title [MRG] Set a random random_state at init to ensure deterministic randomness [MRG] Set a random random_state at init to ensure deterministic randomness when random_state is None Nov 24, 2016
@jnothman

Sorry about this.

@raghavrv

No problem at all! :)

        # are random for each initialization of splitter but consistent
        # across multiple calls for the same initialization.
        self._random_state = check_random_state(
            random_state).randint(99999999)
Member

np.iinfo(np.int32).max instead of 99999999?
same below
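For reference, the suggested bound evaluates to:

>>> import numpy as np
>>> np.iinfo(np.int32).max
2147483647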

Member

you forgot this one

Member Author

Ahh sorry. Done :)

    for data in zip((X, X2), (y, y2)):
        # For the same initialilzation, splits should be same across
Member

initialilzation

        # multiple split calls, even when random_state is not set.
        np.testing.assert_equal(list(cvs[0].split(*data)),
                                list(cvs[0].split(*data)))

Member

Add # For different initialisations, splits should not be same when random_state is not set.

@raghavrv

Have addressed your comments @TomDLT... Thx :)

@raghavrv

@GaelVaroquaux Would you have some time for this one?

@amueller amueller left a comment

You also need to undo all the list-casting that is already in master, right? Also, I think I'm +.5 for not introducing a private attribute and using random_state directly instead.

@@ -284,7 +284,16 @@ def __init__(self, n_splits, shuffle, random_state):

         self.n_splits = n_splits
         self.shuffle = shuffle
+        # For repr
Member

What do you mean by this comment? You don't like an explicit random state in the __repr__? Because it's surprising for the user? It would make reproducing easier and not introduce a private variable that is dependent on a global state....

This means that doing cv.random_state = 10 doesn't do anything, though, even though it changes the __repr__.

On the other hand, having a number pop up here if it wasn't specified might be surprising for users that expect this to work like an Estimator object. Which it does not...
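A sketch of the concern (illustrative; the comments describe this branch's caching behaviour, not released scikit-learn):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)
kf = KFold(n_splits=2, shuffle=True, random_state=None)  # branch draws _random_state here
kf.random_state = 10   # on the branch this changes what repr(kf) displays...
list(kf.split(X))      # ...but shuffling would still use the cached _random_state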

Member Author

I thought about not having this, but assumed it would be cleaner for the repr to reproduce the input the user gave instead of a random value. But your legitimate reasons justify removing it... I'll do that and push in a while... Thnx

Member Author

Done...

@@ -560,9 +569,9 @@ def __init__(self, n_splits=3, shuffle=False, random_state=None):

     def _make_test_folds(self, X, y=None, groups=None):
         if self.shuffle:
-            rng = check_random_state(self.random_state)
+            rng = check_random_state(self._random_state)
Member

this if still bothers me.


        try:
            np.testing.assert_equal(list(cv.split(*data)),
                                    list(cv.split(*data)))
            np.testing.assert_equal(list(cvs[1].split(*data)),
Member

why don't you do assert_false(list(...) == list(...))?

Member Author

That doesn't compare nested lists of ndarrays...
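For context, a plain == on materialised splits raises rather than returning a boolean, because the tuple comparison ends up truth-testing an ndarray:

import numpy as np

a = [(np.array([2, 3]), np.array([0, 1]))]
b = [(np.array([2, 3]), np.array([0, 1]))]
a == b  # raises ValueError: the truth value of an array with more than one element is ambiguous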

Member

assert_false(all([(x == y).all() for x, y in zip(cvs[1].split(*data), cvs[2].split(*data))]))

hm... maybe not much better than the current one? It took me several reads to understand the error message you provide.

@lesteve lesteve Nov 29, 2016

As @amueller pointed out, I had a similar implementation of numpy.testing.assert_not_equal in #7946. I'd suggest we add that as an assert helper in sklearn/utils/testing.py.

@amueller I don't think your solution is future-proof because you are in an edge case of array comparison when comparing arrays containing numpy arrays. Example:

import numpy as np

arr = np.array([np.array([1, 2]), np.array([1, 2, 3])])
arr == arr
# Returns: array([True, True], dtype=bool)
# with warning: DeprecationWarning: numpy equal will not check object identity in the future.
# The error trying to get the boolean value of the comparison result will be raised.
arr1 = np.array([np.array([1, 2]), np.array([1, 2, 3])])
arr == arr1
# Returns: False
# with warning: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.

Basically the first case kind of works because it falls back to comparing identity of the inner arrays when element comparison does not return a scalar. The second case returns False because the inner arrays are not identical.
More details at numpy/numpy#8323 (comment).
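A minimal sketch of such a helper (the name and placement are suggestions here, not code from #7946):

import numpy as np

def assert_not_equal(actual, desired):
    """Assert that two nested structures of arrays are NOT equal."""
    try:
        np.testing.assert_equal(actual, desired)
    except AssertionError:
        return  # they differ, as required
    raise AssertionError("Arguments are unexpectedly equal")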

Member

I was going by the current numpy implementation, which I guess you found a bug in?

@raghavrv

Thanks for the review. The undoing of list casting is done in #7941, which would benefit from a review!

@amueller

@raghavrv how are these two separate PRs? The undoing of list casting can only be done once the change in this PR is merged, right?

@amueller

also, we're doing 0.18.2 for this? Sorry I was offline over the weekend but that's news to me.

        # are random for each initialization of splitter but consistent
        # across multiple calls for the same initialization.
        self._random_state = check_random_state(
            random_state).randint(np.iinfo(np.int32).max)
Member

Gaël had instead suggested the use of get_state. Apart from the ability to represent many more than 2**32 possible states, it is faster to set the state than to seed a new RandomState.

I'm okay with this though.

Member

Doing check_random_state(self.random_state).get_state() in all cases will ensure similar behaviour regardless of the form of the parameter.
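Roughly, the suggestion amounts to (a sketch, not the PR's code):

import numpy as np
from sklearn.utils import check_random_state

random_state = None  # whatever the user passed at construction

# At __init__: capture the full generator state (far more than 2**32 possibilities).
state = check_random_state(random_state).get_state()

# At each split call: restoring the state is faster than seeding a new RandomState.
rng = np.random.RandomState()
rng.set_state(state)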

Member Author

Sorry, could you elaborate a bit more? get_state gets me a tuple representing the state of the RandomState(...) object... Do you mean we store this at random_state? (That garbles the repr...)

Member

What do you mean by "garble"? Show a RandomState object? That's fine, isn't it?

            self._random_state = check_random_state(
                random_state).randint(np.iinfo(np.int32).max)
        else:
            self._random_state = random_state
Member

This will not produce the same splits each time if this is a RandomState instance.
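That is, an illustration of the failure mode being pointed out (not code from the PR):

import numpy as np

rng = np.random.RandomState(0)
# If the instance itself is stored, each split call consumes entropy from it:
print(rng.permutation(4))  # one permutation
print(rng.permutation(4))  # generally a different one: the state has advanced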

Member

good catch...

Member Author

Wait, I thought about this and did a copy, I think... Or maybe I had it in mind but didn't do anything... Anyway, this needs a test... Thx!

@raghavrv

> @raghavrv how are these two separate PRs? The undoing of list casting can only be done once the change in this PR is merged, right?

Not exactly. As we need to support one-time iterables (at least for the time being) in grid search, I ensured that the order of iteration while populating the parallel jobs is such that the splits are generated once and used for all parameters...

@amueller

@raghavrv that seems to be true for some of the materializations but not all, which means that we could do that but then would have another PR to fix the remaining one. If we accept this PR, there is no need to change the loops, right?

@raghavrv

> seems to be true for some

All except learning_curve, which was like that pre-0.18.

> If we accept this PR, there is no need to change the loops, right?

But what about one-time iterables? Do we still support them?

> also, we're doing 0.18.2 for this? Sorry I was offline over the weekend but that's news to me.

I'm afraid I'm gonna suggest yes, for, as @GaelVaroquaux pointed out, it really consumes a lot of memory, especially if your data and/or splits are pretty large...

@raghavrv

I'm sorry these approaches didn't cross my mind back then :/

@amueller

Hm ok, for one-time iterables, I guess we do need to change the order. Did we support them in 0.17?
And I don't think this is worth a minor release. Can you come up with a scenario using scikit-learn built-in functions in which this will actually fix a memory issue?

@raghavrv raghavrv commented Nov 28, 2016

No; currently in 0.18.1, even for a normal CV splitter, the list gets materialized and hence consumes memory. I'll post a snippet comparing the memory tomorrow... But something like GridSearchCV(est, ..., cv=StratifiedShuffleSplit(n_splits=100)).fit(X, y) where X is ~1M samples would quickly give you an idea...
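A back-of-the-envelope estimate of just the index arrays for such a setup (illustrative numbers, not a measured benchmark):

import numpy as np

n_samples, n_splits = 1000000, 100
# Materialising train + test indices for every split costs roughly:
total_bytes = n_splits * n_samples * np.dtype(np.intp).itemsize
print(total_bytes / 1e9)  # ~0.8 GB on a 64-bit platform, before any model fitting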

@jnothman jnothman commented Nov 28, 2016 via email

@amueller

> No currently in 0.18.1, even for a normal cv splitter, the list gets materialized and hence consumes memory. I'll post a snipped comparing the memory tomorrow... But something like GridSearchCV(est, ..., cv=StratifiedShuffledSplit(n_splits=100)).split(X, y) where X is ~ 1M samples would quickly give you an idea...

I guess if n_splits >> n_features this could run you into trouble.

I'm not sure I want to push out another hotfix for something we are not entirely sure about. The stronger contract is certainly better, but the back and forth with the behavior is also confusing. And for people doing repeated CV - which is relatively common - the new behavior could give pretty wrong results.

@raghavrv

> From my perspective, anything without split() has to be materialised as a list

Ah sorry, I meant a fit, not a split, there. Have edited accordingly...

@amueller amueller commented Dec 2, 2016

Hm... do we want KFold to be deterministic if random_state is a random state object?

@raghavrv raghavrv commented Dec 2, 2016

I thought that we wanted the splits to be consistent for multiple split calls? See Joel's comment above.

@amueller amueller commented Dec 2, 2016

@raghavrv indeed, but that's not what your code currently implements, right? Have you tested that?

@amueller amueller commented Dec 2, 2016

oh you changed it 4 minutes ago lol. I agree with your newest version, I was looking at the previous version and wanted to suggest exactly the fix you made (I think, haven't reviewed).

@raghavrv raghavrv commented Dec 2, 2016

lol. I was skeptical when I got a review as soon as I pushed ;P

@amueller amueller left a comment

looks good apart from nitpicks.

@@ -378,11 +378,11 @@ class KFold(_BaseKFold):
 >>> from sklearn.model_selection import KFold
 >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
 >>> y = np.array([1, 2, 3, 4])
->>> kf = KFold(n_splits=2)
+>>> kf = KFold(n_splits=2, random_state=0)
Member

why? There is no shuffle

Member Author

The repr displays a random random_state that fails the test... So I explicitly set one...

@@ -543,11 +543,11 @@ class StratifiedKFold(_BaseKFold):
 >>> from sklearn.model_selection import StratifiedKFold
 >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
 >>> y = np.array([0, 0, 1, 1])
->>> skf = StratifiedKFold(n_splits=2)
+>>> skf = StratifiedKFold(n_splits=2, random_state=0)
Member

no shuffle?

Member Author

(for repr)

Member

I don't like this. What we could do is instantiate the RandomState object only if shuffle=True? That would break if someone assigned cv.shuffle = True but that seems like an odd thing to do.

Member

This is just confusing to users as is.

Member Author

Which is why I wanted to preserve the user-set random state by keeping it in the original random_state variable. You suggested that it would be clearer to the user to display the explicitly set random_state, hence the current setup...

Member

I don't understand how your comment relates to my comment. You said the tests failed. You could have fixed that with a ... (doctest ellipsis).
If we want the repr to be nicer when shuffle is not set, we could simply not change it then.
I don't think we should change the repr in other cases and hide information from the user.

Member Author

Okay, so we want repr to preserve the user-set random state (or the default of None) if shuffle is False, correct? In other cases, set an explicit random_state if None and display the explicitly set random_state in repr... For the tests, use ellipsis to suppress the test error. Sounds good?
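Presumably the ellipsis fix in the doctests would look something like (a sketch):

>>> kf = KFold(n_splits=2, shuffle=True)
>>> kf  # doctest: +ELLIPSIS
KFold(n_splits=2, random_state=..., shuffle=True)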

@@ -284,7 +284,14 @@ def __init__(self, n_splits, shuffle, random_state):

         self.n_splits = n_splits
         self.shuffle = shuffle
         self.random_state = random_state
+        if not isinstance(random_state, (np.integer, numbers.Integral)):
Member

Would you mind changing the order of the if instead of negating?
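i.e. roughly (a sketch of the reordered branch, rendered as a standalone function for clarity; the function name is hypothetical):

import numbers
import numpy as np
from sklearn.utils import check_random_state

def _pick_random_state(random_state):
    # Integers pass through unchanged; anything else is frozen into an int seed.
    if isinstance(random_state, (np.integer, numbers.Integral)):
        return random_state
    return check_random_state(random_state).randint(np.iinfo(np.int32).max)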

@@ -922,7 +926,14 @@ def __init__(self, n_splits=10, test_size=0.1, train_size=None,
         self.n_splits = n_splits
         self.test_size = test_size
         self.train_size = train_size
         self.random_state = random_state
+        if not isinstance(random_state, (np.integer, numbers.Integral)):
Member

again, maybe change the order?

Member Author

Done

@@ -147,17 +147,17 @@ def test_cross_validator_with_default_params():
     groups = np.array([1, 2, 3, 4])
     loo = LeaveOneOut()
     lpo = LeavePOut(p)
-    kf = KFold(n_splits)
-    skf = StratifiedKFold(n_splits)
+    kf = KFold(n_splits, random_state=0)
Member

why? There's no shuffle

Member Author

Same comment as above (repr)


-    for cv in (kf, skf):
+    for cv in (kf, skf, kf2, skf2):
Member

maybe add a comment saying "check that calling split twice yields the same results"

    for data in zip((X, X2), (y, y2)):
        # For the same initialization, splits should be same across
        # multiple split calls, even when random_state is not set.
        np.testing.assert_equal(list(cvs[0].split(*data)),
Member

we only need one cross-validation object for that. Why is that not in the loop above?

        try:
            np.testing.assert_equal(list(cv.split(*data)),
                                    list(cv.split(*data)))
            np.testing.assert_equal(list(cvs[1].split(*data)),
Member

why do we need three objects?

@raghavrv raghavrv Dec 5, 2016

The 1st object is used to test whether multiple calls to split produce the same result when random_state is not set...

The 2nd and 3rd objects are used to test whether they produce different results when random_state is not set... (To ensure that by implicitly setting random_state we are not making the splits from separate initializations any less random.) Though I could reuse the 1st, I'd prefer having it this way to ensure that the previous test's split call on the 1st object does not affect the next test...

Member

why not do the first one in the loop above? that makes much more sense to me.

@raghavrv raghavrv commented Dec 5, 2016

Have addressed / clarified your comments. Could I have your +1?

@raghavrv raghavrv commented Dec 6, 2016

@amueller @jnothman Could you see if the latest changes look okay?

        try:
            np.testing.assert_equal(list(cv.split(*data)),
                                    list(cv.split(*data)))
            np.testing.assert_equal(list(cv1.split(*data)),
Member

I still think this is an ugly pattern but whatever... Is this the only place where we use this? Otherwise we might want to put this into utils?

Member Author

Maybe for another PR? There are quite a few places (even outside model_selection) using np.testing.assert_*equal...

Member

Sure. Though I was referring to the try: else: construct.

@amueller amueller commented Dec 6, 2016

LGTM

@amueller amueller changed the title [MRG] Set a random random_state at init to ensure deterministic randomness when random_state is None [MRG + 1] Set a random random_state at init to ensure deterministic randomness when random_state is None Dec 6, 2016
@jnothman jnothman commented Dec 7, 2016

Before I review this again, I think we still need to determine whether the current approach is broken enough to justify this kind of backwards-incompatible change. With #7941 we're no longer materialising the list of CV indices, so we no longer have the memory issue. The problem is then that it may be surprising but documented behaviour that multiple calls to split will return different partitions if random_state does not make it deterministic. Is that bad enough to justify breaking compatibility between 0.18 and 0.19?

@amueller amueller commented Dec 7, 2016

Hm... we introduced this behavior in 0.18 and now we're going back to where we were in 0.17, right? We didn't really intentionally change that in 0.18 and one could argue that this is a bug-fix to restore the old behavior.
On the other hand I guess changing twice is worse than changing once.

I'm kind of torn on this. Actually the behavior in current master is consistent with how estimators behave on fit.

@jnothman jnothman commented Dec 7, 2016 via email

@jnothman jnothman commented Dec 7, 2016 via email

@raghavrv

So shall I leave this open for further discussion, or can we close this as a won't-fix? With #7941, the more serious underlying issue (of grid search not evaluating different params on consistent splits) is fixed. This is just a syntactic nice-to-have to ensure compatibility with cross_validation but adds complexity, I guess...

The implication is that the model_selection interface is now different: it is similar to fit, and hence you are expected to explicitly set an integer random_state if you want shuffling to be consistent across multiple split calls... Using None or a numpy RandomState object will not produce consistent splits...
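Under that contract, reproducible shuffling means pinning an integer seed, e.g. (a minimal sketch):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)
kf = KFold(n_splits=2, shuffle=True, random_state=42)  # integer seed => consistent splits
first = [test for _, test in kf.split(X)]
second = [test for _, test in kf.split(X)]
assert all(np.array_equal(a, b) for a, b in zip(first, second))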

Is there a direct unpleasant practical consequence of not having this PR?

Could I request a final vote from @GaelVaroquaux @vene @agramfort apart from Andy and Joel?

@raghavrv

I'm not sure if this should be tagged a "bug" anymore. Removing "blocker" and "bug" labels...

@raghavrv

Closing as #7941 is sufficient... Comment if you want this reopened...
