[MRG + 2] FIX Be robust to non re-entrant/ non deterministic cv.split calls #7660
Conversation
Also what kind of tests should I add here? I can make a mock splitter which will iterate only once and pass it as cv to grid search. That will fail in master but will pass in this branch. Any other tests?
Passing in a generator expression for ...? Without tests, surely this should be WIP?
Indeed. Sorry for marking it MRG.
Done! Reviews please ;)
@@ -129,6 +128,7 @@ def cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None,
    X, y, groups = indexable(X, y, groups)

    cv = check_cv(cv, y, classifier=is_classifier(estimator))
    cv_iter = cv.split(X, y, groups)
why this change?
Just as a sign of best practice, denoting that we enlist the indices and store them as soon as we check the cv param...
I'm not sure I get the point. cv_iter is a generator, right? This adds a line and doesn't make the code easier to follow.
Argh sorry, I somehow kept thinking I did list(cv.split(...))...
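(For context, the underlying issue being fixed: the object returned by cv.split is a one-shot generator, so re-iterating it silently yields nothing. A minimal illustration, independent of the diff above:)

import numpy as np
from sklearn.model_selection import KFold

X = np.ones(10)
gen = KFold(n_splits=5).split(X)
print(len(list(gen)))  # 5 splits on the first pass
print(len(list(gen)))  # 0 -- the generator is already exhausted

# Materializing the splits once keeps every later pass consistent.
splits = list(KFold(n_splits=5).split(X))
print(len(list(splits)), len(list(splits)))  # 5 5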
@@ -384,6 +384,7 @@ def cross_val_predict(estimator, X, y=None, groups=None, cv=None, n_jobs=1,
    X, y, groups = indexable(X, y, groups)

    cv = check_cv(cv, y, classifier=is_classifier(estimator))
    cv_iter = cv.split(X, y, groups)
why?
(Same comment as above)
Should I leave it as before?
                        'mean_fit_time', 'std_fit_time'):
        gs.cv_results_.pop(key)
        gs2.cv_results_.pop(key)
    np.testing.assert_equal(gs.cv_results_, gs2.cv_results_)
why not scikit-learn testing?
Numpy's testing.assert_equal is to check equivalence of nested dicts. We don't have an equivalent AFAIK...
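(For illustration, what numpy's assert_equal handles; the values here are made up, not taken from the test:)

import numpy as np
a = {'param_C': np.array([1, 10]), 'mean_test_score': [0.8, 0.9]}
b = {'param_C': np.array([1, 10]), 'mean_test_score': [0.8, 0.9]}
np.testing.assert_equal(a, b)  # recurses into the dicts and compares arrays element-wise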
you're not testing the consistency of folds in ...
(Force-pushed from 1898288 to d8d42a8)
    def __init__(self, n_samples=100):
        self.indices = KFold(n_splits=5).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):
Maybe add a comment that says that this split can only be called once.
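(For reference, a minimal sketch of such a one-time splitter; the name and defaults are illustrative, not necessarily those used in the test:)

import numpy as np
from sklearn.model_selection import KFold

class OneTimeSplitter:
    # A mock cv whose split() can effectively be called only once: the
    # underlying KFold generator is created at __init__ and exhausted
    # by the first full iteration.
    def __init__(self, n_splits=5, n_samples=100):
        self.n_splits = n_splits
        self.indices = KFold(n_splits=n_splits).split(np.ones(n_samples))

    def split(self, X=None, y=None, groups=None):
        # NOTE: can only be iterated once; later calls yield nothing.
        for train, test in self.indices:
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits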
                       _pop_time_keys(gs2.cv_results_))

    # Check consistency of folds across the parameters
    gs = GridSearchCV(LinearSVC(random_state=0),
Did this test fail before? Maybe also add a one-line comment explaining what's happening.
Yes it fails on master...
@amueller Have fixed your comments. I'm not able to find a way to test the consistency of cv folds in ... Other than that, I have addressed all your comments, added explanatory comments in tests + all tests pass. One more review please? cc @jnothman too if you find time... :)
@@ -1467,7 +1467,7 @@ def get_n_splits(self, X=None, y=None, groups=None):
 class _CVIterableWrapper(BaseCrossValidator):
     """Wrapper class for old style cv objects and iterables."""
     def __init__(self, cv):
-        self.cv = cv
+        self.cv = list(cv)
Do we want this change? This would mean check_cv(old_type_one_time_iterator) --> new-type multi-entrant iterator?
I am not able to dig through the huge pile of conversations, but I recall us choosing this because, if the wrapped iterable is non-re-entrant, the wrapped splitter should also be non-re-entrant...
For instance:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import check_cv
from sklearn.datasets import make_classification
from sklearn.utils.testing import assert_array_equal
X, y = make_classification(random_state=0)
# At master - This test will pass ------------------------------------------
kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# First split call will work
list(kf_iter_wrapped.split(X, y))
# But not subsequent
assert_array_equal(list(kf_iter_wrapped.split(X, y)), [])
# At this PR - This test will pass ------------------------------------------
kf_iter = KFold(n_splits=5).split(X, y)
kf_iter_wrapped = check_cv(kf_iter)
# Since the wrapped iterable is enlisted and stored,
# split can be called any number of times to produce
# consistent results
assert_array_equal(
list(kf_iter_wrapped.split(X, y)),
list(kf_iter_wrapped.split(X, y)))
I think this change is fine. This is only used internally, right? All sklearn functions worked with a single-run iterator before, and now they will do that again after this PR. So there are no behavior changes, right?
If someone used check_cv they will get something different now, which is not entirely great, but this is a bug-fix on 0.18 so we have to change behavior. We never promised our wrapper would be one-iteration if the original cv was.
Ok thanks for the comment...
(Just to note that I've added the 2nd test just to ensure this change has an effect and to fail if it is reverted later...)
I suppose infinite CV strategies are unlikely. Expensive ones are a bit more likely, but I think this approach is okay.
@@ -61,6 +61,22 @@
 from sklearn.linear_model import SGDClassifier


 class CustomSplitter():
If this is used in two tests, it should probably be in utils.testing.
@@ -61,6 +61,7 @@
 from sklearn.datasets import make_multilabel_classification

 from sklearn.model_selection.tests.test_split import MockClassifier
I don't think tests should import from other tests.
There are quite a few MockClassifier's that are specific to the tests and the module. I'm not sure there is much benefit in grouping them into utils.testing...
Well if they are specific to a test, then I think they should stay there. If they are not, then maybe they should be shared? I'm not sure.
Another option would be to have a separate private file (in the tests directory) for these kinds of mock classifiers/splitters and import from that?
Well if they are specific to a test, then I think they should stay there.
They are specific to 2 sets of tests in 2 different test files, but not used anywhere else except the model_selection tests, hence my disinclination to move them to utils.testing :)
Maybe you want them moved to sklearn.model_selection.tests._test_setup?
                        cv=CustomSplitter(n_splits=n_splits, n_samples=30),
                        train_sizes=np.linspace(0.1, 1.0, 10))
        if len(w) > 0:
            raise RuntimeError("Unexpected warning: %r" % w[0].message)
AssertionError?
Or just do an assert instead of an if?
Apart from this, LGTM.
Elsewhere we have ./sklearn/cluster/tests/common.py
Thanks for checking. I'll move them to ...
(Force-pushed from cf986ed to d84b517)
We could call ...
I wouldn't usually support materialising an iterator as a list unnecessarily, but I think this is the solution of least surprise to the user, especially as CV splits aren't usually expensive to compute or store relative to the rest of the problem, and their number is necessarily finite and predetermined, at least in gscv/rscv.
This needs a whatsnew entry.
I think that the solutions used in this PR are not correct: it is important to rely on iterators and not lists. Lists lead to heavy data duplication (see #7919).
Ahh okay, should we resort to itertools.tee then?
itertools.tee effectively stores things. We may just have to prohibit ...
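(A quick illustration of why tee does not avoid the storage: each tee iterator keeps a buffer of the items the other has not consumed yet, so fully iterating one copy buffers everything in memory anyway:)

from itertools import tee

def gen():
    for i in range(3):
        print("producing", i)
        yield i

a, b = tee(gen())
list(a)          # consumes the generator; items are buffered so b can replay them
print(list(b))   # [0, 1, 2] -- replayed from tee's internal buffer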
I'm +1 for that. That would simplify a few things + discourage hacked-up ...
I think this still comes back to things like randomization between calls to ... :(
@GaelVaroquaux how do you ensure that a generator returns identical iterators when called twice without storing it? What do you think should be the semantics of passing ...? I would argue that having it always return the same sequence is somewhat counter-intuitive. What do you think?
Well, a generator, strictly speaking, not. But we could ensure that the split ...
In terms of intuition, maybe if you didn't specify the random_state, I ... However, that's not a nice semantic, because it will require the user to ... The list expands the generator, and hence avoids any irreproducibility.
The semantics of always producing the same result are somewhat nicer than what's there now, and yes, using lists defeats the purpose of iterators.
With hindsight, I think that this is better. The bug and corresponding fix ...
If that is the only concern, we could change the order of iteration here. It would parallelize over parameters rather than splits. The problem with that approach is that it could be slower when one parameter setting happens to train faster than another, unless joblib is smart enough to figure that out and let the next job take over that core. (For instance, say I iterate over ...)
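(A schematic, self-contained sketch of the order swap being discussed; candidate_params here is just an illustrative list, and this is not the actual _search code:)

import numpy as np
from sklearn.model_selection import KFold

X, y = np.ones((30, 2)), np.arange(30) % 2
candidate_params = [{'C': 0.1}, {'C': 1.0}]

# Current order: the cv splits (inner loop) are re-iterated once per
# parameter setting; with a one-shot generator only the first setting
# gets any splits.
cv_iter = KFold(n_splits=3).split(X, y)
jobs = [(params, train, test)
        for params in candidate_params
        for train, test in cv_iter]
print(len(jobs))  # 3, not 6 -- the generator was exhausted after the first pass

# Swapped order: cv_iter is consumed exactly once as the outer loop; the
# already-materialized parameter list is what gets re-iterated per split.
cv_iter = KFold(n_splits=3).split(X, y)
jobs = [(params, train, test)
        for train, test in cv_iter
        for params in candidate_params]
print(len(jobs))  # 6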
Joblib has no problem with pooling and taking the next job off the queue AFAIK, so that's not an issue.
Shall I proceed with the PR? Are you okay with that approach @amueller @GaelVaroquaux?
And for the problem of shuffle in splitters, what do you think of setting a random int as the random_state at init, and then reusing it to regenerate the rng at split?

# At init
if random_state is None:
    random_state = check_random_state(random_state).randint(0, 100000)

# At split
rng = check_random_state(random_state)

This would take care of determinism for successive splits and will also ensure the splits are random across multiple initializations but consistent across splits after initialization.
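(A self-contained sketch of that idea: a hypothetical shuffling splitter, not the actual sklearn classes:)

import numpy as np
from sklearn.utils import check_random_state

class ShuffledSplitSketch:
    def __init__(self, n_splits=5, random_state=None):
        self.n_splits = n_splits
        # Draw a concrete seed once at init, so that successive split()
        # calls are consistent with each other, while different instances
        # (with random_state=None) still shuffle differently.
        if random_state is None:
            random_state = check_random_state(random_state).randint(0, 100000)
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        # Rebuild the rng from the stored seed at every call.
        rng = check_random_state(self.random_state)
        indices = rng.permutation(len(X))
        for test in np.array_split(indices, self.n_splits):
            train = np.setdiff1d(indices, test)
            yield train, test

cv = ShuffledSplitSketch(n_splits=3)
X = np.ones((12, 2))
first = [test.tolist() for _, test in cv.split(X)]
second = [test.tolist() for _, test in cv.split(X)]
assert first == second  # deterministic across successive split() calls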
If, after both my changes, all tests pass, we are good to go I think...
Sweeeet, now we have to worry about one-time parameter samplers / grid iterators ;@ I think copying / changing the parameter grid / sampler will break user code more than explicitly not supporting non-re-entrant cvs. A lot of users hack the ...
I think we can have #7935, revert all the ...
Wait, since all the parameter dicts are generated and stored anyway, it wouldn't actually hurt to materialize them as a list and store them. If my intuition here is correct, doing this will not consume any additional memory...
Yes, we store parameter lists anyway.
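(A quick check of that intuition with the public ParameterGrid: listing it just stores a handful of small dicts, re-iterable any number of times:)

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]})
candidate_params = list(grid)
print(len(candidate_params))  # 6
print(candidate_params[0])    # e.g. {'C': 0.1, 'gamma': 0.01}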
Fixes #6726
- To be robust to non-re-entrant split calls, we call split once and store the result as a list that will be iterated multiple times.
- In _CVIterableWrapper, we store the list rather than the cv object and iterate on it... (NOTE that this would make previously (intentionally) non-deterministic cv iterables deterministic, with consistent output for successive split calls. Do we want that to happen?)
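(As a usage-level illustration of the behaviour described above; a sketch, not the exact test added in the PR: with this change, a one-shot generator passed as cv to GridSearchCV works because the splits are materialized internally.)

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, random_state=0)
# A one-shot generator of (train, test) indices passed directly as cv.
cv_gen = KFold(n_splits=3).split(X, y)
gs = GridSearchCV(LinearSVC(random_state=0), param_grid={'C': [0.1, 1]},
                  cv=cv_gen)
gs.fit(X, y)  # would break once the generator is exhausted without the fix
print(gs.best_params_)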