
[MRG + 2] FIX Be robust to non re-entrant/ non deterministic cv.split calls #7660


Merged
merged 20 commits on Oct 30, 2016

Conversation

raghavrv
Member

Fixes #6726

  • To make the splits consistent across all parameter settings when the cv iterator is non re-entrant (thanks to Joel for the word!) or non-deterministic across successive split calls, we call split once and store the result as a list that can be iterated multiple times (sketched below).
  • Similarly, in _CVIterableWrapper we store the list rather than the cv object and iterate over it... (NOTE that this makes previously, intentionally, non-deterministic cv iterables deterministic, with consistent output for successive split calls. Do we want that to happen?)
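
A minimal sketch of the idea (not the actual diff in this PR; fit_grid and candidate_params are hypothetical names):

import numpy as np
from sklearn.base import clone, is_classifier
from sklearn.model_selection import check_cv

def fit_grid(estimator, X, y, candidate_params, cv=None, groups=None):
    cv = check_cv(cv, y, classifier=is_classifier(estimator))
    cv_iter = list(cv.split(X, y, groups))  # call split() exactly once
    results = {}
    for params in candidate_params:
        fold_scores = []
        for train, test in cv_iter:  # the stored list can be iterated repeatedly
            est = clone(estimator).set_params(**params)
            est.fit(X[train], y[train])
            fold_scores.append(est.score(X[test], y[test]))
        results[tuple(sorted(params.items()))] = np.mean(fold_scores)
    return results

Every parameter setting then sees exactly the same train/test indices, even if the underlying splitter is a one-shot generator.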

@raghavrv raghavrv added this to the 0.18.1 milestone Oct 13, 2016
@raghavrv
Member Author

Ping @jnothman @amueller for review...

@raghavrv
Member Author

Also, what kind of tests should I add here? I can make a mock splitter whose split can be iterated only once and pass it as cv to grid search (see the sketch below); that fails on master but passes in this branch. Any other tests?
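
For reference, a sketch of such a one-shot splitter (hypothetical name; the tests in this PR later add a similar CustomSplitter):

import numpy as np
from sklearn.model_selection import KFold

class OneTimeSplitter:
    # A splitter whose splits come from a generator, so they can be consumed only once.
    def __init__(self, n_splits=5, n_samples=100):
        self.n_splits = n_splits
        # KFold(...).split(...) returns a generator; it is exhausted after one pass.
        self.indices = KFold(n_splits=n_splits).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):
        for train, test in self.indices:
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

Passing OneTimeSplitter() as cv to GridSearchCV exhausts the splits after the first parameter setting on master, but works with this branch.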

@jnothman
Member

Passing in a generator expression for cv would be a sufficient non-regression test.
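
For instance, something along these lines (a sketch with an assumed dataset and estimator, not the exact test that was added):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)
# A plain generator as cv: on master it is exhausted after the first parameter
# setting, whereas with this fix every setting is evaluated on the same folds.
cv_gen = KFold(n_splits=3).split(X, y)
gs = GridSearchCV(LinearSVC(random_state=0), {'C': [0.1, 1.0]}, cv=cv_gen)
gs.fit(X, y)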

Without tests, surely this should be WIP?

@raghavrv raghavrv changed the title [MRG] FIX Be robust to non re-entrant/ non deterministic cv.split calls [WIP] FIX Be robust to non re-entrant/ non deterministic cv.split calls Oct 13, 2016
@raghavrv
Member Author

Without tests, surely this should be WIP?

Indeed. Sorry for marking it MRG

@raghavrv raghavrv changed the title [WIP] FIX Be robust to non re-entrant/ non deterministic cv.split calls [MRG] FIX Be robust to non re-entrant/ non deterministic cv.split calls Oct 13, 2016
@raghavrv
Member Author

Done! Reviews please ;)

@@ -129,6 +128,7 @@ def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
X, y, groups = indexable(X, y, groups)

cv = check_cv(cv, y, classifier=is_classifier(estimator))
cv_iter = cv.split(X, y, groups)
Member

why this change?

Member Author

Just as a matter of best practice, to denote that we obtain the split indices and store them as soon as we have checked the cv param...

Member
@amueller amueller Oct 14, 2016

I'm not sure I get the point. cv_iter is a generator, right? This adds a line and doesn't make the code easier to follow.

Member Author

Argh, sorry, I somehow kept thinking I had written list(cv.split(...))...

@@ -384,6 +384,7 @@ def cross_val_predict(estimator, X, y=None, groups=None, cv=None, n_jobs=1,
X, y, groups = indexable(X, y, groups)

cv = check_cv(cv, y, classifier=is_classifier(estimator))
cv_iter = cv.split(X, y, groups)
Member

why?

Member Author
@raghavrv raghavrv Oct 14, 2016

(Same comment as above)

Should I leave it as before?

'mean_fit_time', 'std_fit_time'):
gs.cv_results_.pop(key)
gs2.cv_results_.pop(key)
np.testing.assert_equal(gs.cv_results_, gs2.cv_results_)
Member

why not scikit-learn testing?

Member Author

NumPy's testing.assert_equal is used to check equivalence of nested dicts. We don't have an equivalent AFAIK...
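
For instance (a toy illustration):

import numpy as np

a = {'param_C': [1, 10], 'mean_test_score': np.array([0.9, 0.8])}
b = {'param_C': [1, 10], 'mean_test_score': np.array([0.9, 0.8])}
np.testing.assert_equal(a, b)  # recurses into the dict and compares values/arrays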

@amueller
Member

you're not testing the consistency of folds in GridSearchCV or learning_curve or validation_curve.

@raghavrv raghavrv force-pushed the use_split_only_once branch from 1898288 to d8d42a8 Compare October 14, 2016 15:37
    def __init__(self, n_samples=100):
        self.indices = KFold(n_splits=5).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):
Member

maybe add a comment that says that this split can only be called once.

_pop_time_keys(gs2.cv_results_))

# Check consistency of folds across the parameters
gs = GridSearchCV(LinearSVC(random_state=0),
Member

Did this test fail before? Maybe also add a one-line comment explaining what's happening.

Member Author

Yes it fails on master...

@raghavrv
Member Author

@amueller I've addressed your comments. I'm not able to find a way to test the consistency of cv folds in learning_curve; it seems to filter duplicate values from the train sizes. Otherwise I could have passed two identical sizes and compared their scores as proof of split consistency...

Other than that, I have addressed all your comments, added explanatory comments in tests + all tests pass. One more review please?

cc: @jnothman too if you find time... :)

@@ -1467,7 +1467,7 @@ def get_n_splits(self, X=None, y=None, groups=None):
class _CVIterableWrapper(BaseCrossValidator):
"""Wrapper class for old style cv objects and iterables."""
def __init__(self, cv):
self.cv = cv
self.cv = list(cv)
Member Author
@raghavrv raghavrv Oct 16, 2016

@jnothman @amueller

Do we want this change? It would mean check_cv(old_type_one_time_iterator) --> a new-style, re-entrant iterator.
I am not able to dig through the huge pile of conversation, but I recall us choosing this on the grounds that if the wrapped iterable is non re-entrant, the wrapped splitter should also be non re-entrant...

For instance

import numpy as np

from sklearn.model_selection import KFold
from sklearn.model_selection import check_cv
from sklearn.datasets import make_classification

from sklearn.utils.testing import assert_array_equal

X, y = make_classification(random_state=0)

# At master - This test will pass ------------------------------------------

kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# First split call will work
list(kf_iter_wrapped.split(X, y))
# But not subsequent
assert_array_equal(list(kf_iter_wrapped.split(X, y)), [])

# At this PR - This test will pass ------------------------------------------

kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# Since the wrapped iterable is enlisted and stored,
# split can be called any number of times to produce
# consistent results
assert_array_equal(
    list(kf_iter_wrapped.split(X, y)),
    list(kf_iter_wrapped.split(X, y)))

Member

I think this change is fine. This is only used internally, right? All sklearn functions worked with a single-run iterator before, and now they will do that again after this PR. So there are no behavior changes, right?
If someone used check_cv they will get something different now, which is not entirely great, but this is a bug-fix on 0.18 so we have to change behavior. We never promised our wrapper would be one-iteration if the original cv was.

Member Author

Ok thanks for the comment...

Member Author

(Just to note that I've added the 2nd test to ensure this change has an effect and will fail if it is reverted later...)

Member

I suppose infinite CV strategies are unlikely. Expensive ones are a bit more likely, but I think this approach is okay.

Member Author

Do we revert this change too? @jnothman @amueller?

@@ -61,6 +61,22 @@
from sklearn.linear_model import SGDClassifier


class CustomSplitter():
Member

If this is used in two tests, it should probably be in utils.testing.

@@ -61,6 +61,7 @@
from sklearn.datasets import make_multilabel_classification

from sklearn.model_selection.tests.test_split import MockClassifier
Member

I don't think tests should import from other tests.

Member Author

There are quite a few MockClassifiers that are specific to their tests and module. I'm not sure there is much benefit in grouping them into utils.testing...

Member

Well if they are specific to a test, then I think they should stay there. If they are not, then maybe they should be shared? I'm not sure.

Member Author

Another option would be to have a separate private file (in the tests directory) for these kinds of classifiers/mock splitters and import from that?

Member Author

Well if they are specific to a test, then I think they should stay there.

They are specific to 2 sets of tests in 2 different test files, but not used anywhere outside the model_selection tests, hence my disinclination to move them to utils.testing :)

Maybe you want them moved to sklearn.model_selection.tests._test_setup?

cv=CustomSplitter(n_splits=n_splits, n_samples=30),
train_sizes=np.linspace(0.1, 1.0, 10))
if len(w) > 0:
raise RuntimeError("Unexpected warning: %r" % w[0].message)
Member

AssertionError?

Member

Or just do an assert instead of an if?

Member

apart from this LGTM

@jnothman
Member

Elsewhere we have ./sklearn/cluster/tests/common.py


@raghavrv
Member Author

Elsewhere we have ./sklearn/cluster/tests/common.py

Thanks for checking. I'll move them to model_selection/tests/common.py...

@raghavrv raghavrv force-pushed the use_split_only_once branch from cf986ed to d84b517 Compare October 19, 2016 19:13
@jnothman
Member

jnothman commented Nov 2, 2016

split(..) evaluates to a generator, not a generator function.

We could call split multiple times in gscv, but at the risk that multiple calls return distinct splits, as they currently do if random_state is not an int in any splitter that supports shuffle=True.
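
For illustration (assuming the behaviour described above, with shuffle=True and the default random_state=None):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
cv = KFold(n_splits=5, shuffle=True)  # random_state left at None
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]
print(first == second)  # typically False: successive split() calls give different folds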

@jnothman
Member

jnothman commented Nov 2, 2016

I wouldn't usually support materialising an iterator as a list unnecessarily, but I think this is the solution of least surprise to the user, especially as CV splits aren't usually expensive to compute or store relative to the rest of the problem, and their number is necessarily finite and predetermined, at least in gscv/rscv.

amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
@amueller
Member

This needs a whatsnew entry.

@GaelVaroquaux
Member

I think that the solutions used in this PR are not correct: it is important to rely on iterators and not lists. Lists lead to heavy data duplication (#7919).

@raghavrv
Member Author

Ahh okay should we resort to itertools.tee then?

@jnothman
Member

itertools.tee effectively stores things. We may just have to prohibit
non-reentrant iterators??


@raghavrv
Member Author

I'm +1 for that. It would simplify a few things and discourage hacked-up cv objects that do not conform to our splitter API (though I am not sure whether re-entrancy is implied by our API)...

@jnothman
Member

I think this still comes back to things like randomization between calls to split(). I'll try to think about this more later.

:(

@amueller
Member

@GaelVaroquaux how do you ensure that a generator returns identical iterators when called twice without storing it?
So we could give up on guaranteeing consistency across runs with arbitrary cross-validation iterators, and make sure ours generate consistent sequences?

What do you think the semantics of passing KFold(shuffle=True) to GridSearchCV should be?
Our previous behavior (pre 0.18 and post 0.18.1) is that all parameters get the same folds.
In fact, iterating over KFold(shuffle=True) any number of times used to create exactly the same folds. Since 0.18, each iteration over KFold(shuffle=True) returns different folds.
And that is something we were not really careful enough about :(

I would argue that having it always return the same sequence is somewhat counter-intuitive. What do you think?

@GaelVaroquaux
Member

@GaelVaroquaux how do you ensure that a generator returns identical iterators when called twice without storing it?

Well, strictly speaking a generator cannot. But we could ensure that the split method of the CV object returns the same thing.

I would argue that having it always return the same sequence is somewhat counter-intuitive. What do you think?

In terms of intuition, maybe: if you didn't specify the random_state, I would agree with you.

However, that's not a nice semantic, because in many cases it would require the user to store the list of generated indices, for instance when running multiple estimators and wanting to compare the results. It would be easy to get wrong. If you look at the bug that was fixed by @raghavrv, I believe that the code would be wrong without the list if the iteration were not reproducible.
https://github.com/scikit-learn/scikit-learn/pull/7660/files#diff-676ff6854a280680ed7d91103ef4b0b5R553

The list expands the generator, and hence avoids any irreproducibility and side effects. But in a sense it defeats the purpose of having generators. If we consider the extreme case of really large n and small p (say p = 30 and n in the millions), doing 100 iterations of shuffle-split would store more indices in memory than actual data (see the rough estimate below).
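
A back-of-the-envelope illustration of that point (numbers assumed, only to show the order of magnitude):

import numpy as np

n, p, n_iter = 10**6, 30, 100
data_bytes = n * p * 8  # float64 design matrix
# Each shuffle-split iteration stores train + test index arrays covering ~n samples.
index_bytes = n_iter * n * np.dtype(np.intp).itemsize
print(data_bytes / 1e9, index_bytes / 1e9)  # ~0.24 GB of data vs ~0.8 GB of indices (64-bit)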

@amueller
Member

The semantics of always producing the same result are somewhat nicer than what's there now, and yes, using lists defeats the purpose of iterators.
We can go back to checking the random_state on __init__ if we agree that "deterministic randomness" is fine. We should be very explicit about that, though.

@GaelVaroquaux
Member

We can go back to checking the random_state on __init__ if we agree that "deterministic randomness" is fine. We should be very explicit about that, though.

In hindsight, I think that this is better. The bug and the corresponding fix point it out clearly, IMHO.

@raghavrv
Member Author

raghavrv commented Nov 23, 2016

Since 0.18, each iteration over KFold(shuffle=True) returns different folds.

If that is the only concern, we could change the order of iteration here. That would parallelize over parameters rather than splits. The problem with that approach is that it could be slower when one parameter setting happens to train faster than another, unless joblib is smart enough to figure that out and let the next job take over that core. (For instance, say I iterate over n_estimators=[1, 10, 100, 2000] of a random forest; then on a 4-core machine the first core could sit idle unless joblib has some nice heuristics that I am unaware of.)

@jnothman
Member

Joblib has no problem with pooling and taking the next job off the queue afaik, so that's not an issue.

@raghavrv
Member Author

Shall I proceed with the PR? Are you okay with that approach @amueller @GaelVaroquaux?

@raghavrv
Member Author

raghavrv commented Nov 24, 2016

And for the problem of shuffle in splitters, what do you think of setting a random int as random_state if random_state is None, and then reusing that value to regenerate the RNG at split time?

from sklearn.utils import check_random_state

# At __init__: draw a seed once if random_state is None
if random_state is None:
    random_state = check_random_state(random_state).randint(0, 100000)

# At split: rebuild an identical RNG from the stored seed
rng = check_random_state(random_state)

This would take care of determinism for successive splits, and would also ensure that the splits are random across multiple initializations but consistent across split calls after initialization (a toy splitter sketching this pattern is below).
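
A toy splitter sketching that pattern (hypothetical, not scikit-learn code):

import numpy as np
from sklearn.utils import check_random_state

class ReentrantShuffleKFold:
    # Toy shuffled K-fold whose split() is deterministic after __init__.
    def __init__(self, n_splits=5, random_state=None):
        self.n_splits = n_splits
        if random_state is None:
            # Draw the seed once at construction so later split() calls agree.
            random_state = check_random_state(None).randint(0, 100000)
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        # Rebuilding the RNG from the stored seed makes every call identical.
        rng = check_random_state(self.random_state)
        indices = rng.permutation(len(X))
        for test in np.array_split(indices, self.n_splits):
            yield np.setdiff1d(indices, test), test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

cv = ReentrantShuffleKFold(n_splits=3)
X = np.zeros((9, 2))
assert [t.tolist() for _, t in cv.split(X)] == [t.tolist() for _, t in cv.split(X)]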

@raghavrv
Member Author

If, after both my changes, all tests pass we are good to go I think...

@raghavrv
Member Author

raghavrv commented Nov 24, 2016

we could change the order of iteration here.

Sweeeet, now we have to worry about one-time parameter samplers / grid iterators ;@ I think copying / changing the parameter grid / sampler would break user code more than explicitly not supporting non re-entrant cvs. A lot of users hack ParameterGrid in a variety of ways that may not be compatible with copying at all...

@raghavrv
Member Author

raghavrv commented Nov 24, 2016

I think we can have #7935, revert all the list(iter) changes brought in by this PR, and add a note in the cv docs stating that multiple iterations are performed and hence we expect the cv not to be non re-entrant. If users want, they can materialize the iterator externally... WDYT?

@raghavrv
Member Author

Sweeeet, now we have to worry about one-time parameter samplers / grid iterators

Wait, since all the parameter dicts are generated and stored anyway, it wouldn't actually hurt to materialize them as a list and store them. If my intuition here is correct, doing this would not consume any additional memory (a quick illustration below)...
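
For example (a quick illustration of how cheap that is for a typical grid):

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']})
# ParameterGrid is re-entrant and cheap to expand: the candidates are just a
# handful of small dicts, so storing them as a list costs essentially nothing.
candidates = list(grid)
print(len(candidates), candidates[0])  # 6 candidates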

@jnothman
Member

jnothman commented Nov 28, 2016 via email

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017