-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Add repeated cross-validations #7960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add repeated cross-validations #7960
Conversation
I think the point of this is that shuffle will always be true. we may also want to reuse KFold internally. |
@jnothman Is it okay to call |
@neerajgangwar yes, though depends on what exactly you wanna do. |
I want to inherit |
Calling |
n_reps : int, default=2 | ||
Number of times KFold needs to be repeated. | ||
|
||
n_splits : int, default=3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a bit inconsistent, but if I could go back, I'd rather have this as 5 in KFold. Do we maybe want 5 here?
From what I understood from the code and changes in #7935, if def __init__(self, n_reps=2, n_splits=3, random_state=None):
if random_state is None:
random_state = check_random_state(random_state)
self.n_reps = n_reps
super(RepeatedKFold, self).__init__(n_splits, True, random_state) and use |
We want each call to |
I missed that point. |
@neerajgangwar well it is not true in master and the discussion in the issue I linked is somewhat complicated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs more tests, documentation and maybe an example?
|
||
Parameters | ||
---------- | ||
n_reps : int, default=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n_iter
?
n_splits : int, default=3 | ||
Number of folds. Must be atleast 2. | ||
|
||
shuffle : boolean, default=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are not shuffling what would be the use of a repeated k fold?
for fold_size in fold_sizes: | ||
start, stop = current, current + fold_size | ||
yield indices[start:stop] | ||
current = stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be simpler to re-use the k fold's code here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how? (see my comments)
@@ -465,6 +531,7 @@ class GroupKFold(_BaseKFold): | |||
For splitting the data according to explicit domain-specific | |||
stratification of the dataset. | |||
""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can be reverted. We don't usually follow the one line after docstring convention...
Thanks for the PR! Right now in master, it is different. But after #7935, it will not be so... So I suggest, Reinitialing a new |
It will generate different splits for each |
Well you could create |
Yes that's an option. I was wondering if there is a better way? |
In case, we create |
I don't think there's any reason to have RepeatedKFold inherit from KFold.
I would appreciate a design in which this could be easily replicated with
minimal code for other underlying CV iterators, but I don't think that
inheritance adds value. On the other hand, I'm still not altogether
committed to the idea that this feature requires a separate class.
(I'm still not entirely confident that we can change existing behaviour as
much as we will in #7935, without warning.)
…On 3 December 2016 at 16:15, Neeraj Gangwar ***@***.***> wrote:
In case, we create n_repetitions many integer random states and use these
to create a KFold object at each repetition, RepeatedKFold can't
inherited from KFold. Would it make sense to inherit RepeatedKFold from
_BaseKFold in that case?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7960 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6yowWxeHTeofpGyT2uTNoFLMZCWPks5rEPr2gaJpZM4LBokV>
.
|
regardless of implementation, this needs tests. |
self._random_states = [] | ||
for _ in range(n_splits): | ||
random_state = check_random_state( | ||
None).randint(np.iinfo(np.int32).max) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still needs to be possible to pass and int random_state
for deterministic splitting.
I am adding test cases and documentation. |
Does that mean you should mark this MRG? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GaelVaroquaux, @amueller who proposed the need for a RepeatedKFold
: how about a RepeatedSplits
or repeat_splits
meta-splitter that wraps any splitter taking shuffle
and random_state
? Thus RepeatedSplits(KFold(n_splits=5), n_repeats=3)
.
class BaseRepeatedCrossValidator(with_metaclass(ABCMeta)): | ||
"""Base class for repeated cross validators | ||
|
||
Implementations must define `_get_cv`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems inelegant to me. A mixin is an option, perhaps. Otherwise, see my main comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would a mixin be better in this case?
How would you control |
One option is to draw a random int for each instance. The other is to share a random_state and ensure they are iterated in order. |
I meant once you pass an instance of |
Fair point. I'd been thinking of making get_params for CVs anyway, which
would enable cloning... (but implementing `set_params` would require some
design decisions...)
…On 6 December 2016 at 00:38, Neeraj Gangwar ***@***.***> wrote:
I meant once you pass an instance of KFold to RepeatedSplits (the first
parameter in RepeatedSplits signature), you can't change it in the
current implementation.
Or did you mean that use can tell RepeatedSplits or meta-splitter which
cross validator instance to create?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7960 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6yDQ_vGSMv0BXSrmaAN_dLToxoZeks5rFBPIgaJpZM4LBokV>
.
|
I'm undecided. What other cv objects fulfil that? See http://scikit-learn.org/dev/modules/classes.html#splitter-classes For the The main reason I wanted a class was discoverability, and because I feel these are "first class citizens" for many people. |
Okay. Happy not to make it too generic.
…On 6 December 2016 at 09:58, Andreas Mueller ***@***.***> wrote:
@GaelVaroquaux <https://github.com/GaelVaroquaux>, @amueller
<https://github.com/amueller> who proposed the need for a RepeatedKFold:
how about a RepeatedSplits or repeat_splits meta-splitter that wraps any
splitter taking shuffle and random_state? Thus
RepeatedSplits(KFold(n_splits=5), n_repeats=3).
I'm undecided. What other cv objects fulfil that? See
http://scikit-learn.org/dev/modules/classes.html#splitter-classes
For the ShuffleSplit this doesn't really make sense, as you could just
increase the number of iterations. All the others are deterministic apart
from KFold and StratifiedKFold.
The main reason I wanted a class was discoverability, and because I feel
these are "first class citizens" for many people.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7960 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz616NZegRkTr-6z71Yno0kIFylOA7ks5rFJcmgaJpZM4LBokV>
.
|
Which option are we going ahead with? |
|
Sounds good |
For this, we need Also, I would like to submit a new PR for the new implementation. This one has a lot of unnecessary commits. |
Remind me why we need |
If we are going with |
Yes, but under the current setup you can just set a |
I have opened a new PR #8120. Is it okay if I close this one? |
sure.
…On 27 December 2016 at 22:33, Neeraj Gangwar ***@***.***> wrote:
I have opened a new PR #8120
<#8120>. Is it okay if I
close this one?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7960 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz63ukYCYGU8CsvLr2bMdI1IIXMhKdks5rMPeDgaJpZM4LBokV>
.
|
Reference Issue
#7948
What does this implement/fix? Explain your changes.
Implements RepeatedKFold
Any other comments?
No