Skip to content

[WIP] Add repeated cross-validations #7960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

neerajgangwar
Copy link
Contributor

Reference Issue

#7948

What does this implement/fix? Explain your changes.

Implements RepeatedKFold

Any other comments?

No

@jnothman
Copy link
Member

jnothman commented Dec 2, 2016

I think the point of this is that shuffle will always be true. we may also want to reuse KFold internally.

@neerajgangwar
Copy link
Contributor Author

@jnothman Is it okay to call _iter_test_indices of KFold inside RepeatedKFold if I am inheriting RepeatedKFold from KFold?

@amueller
Copy link
Member

amueller commented Dec 2, 2016

@neerajgangwar yes, though depends on what exactly you wanna do.

@neerajgangwar
Copy link
Contributor Author

I want to inherit RepeatedKFold from KFold and override _iter_test_indices which will use _iter_test_indices of KFold internally.

@amueller
Copy link
Member

amueller commented Dec 2, 2016

Calling KFold._iter_test_indices is a bit tricky unless we want to modify self.random_state which will mess with repeatability.

n_reps : int, default=2
Number of times KFold needs to be repeated.

n_splits : int, default=3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a bit inconsistent, but if I could go back, I'd rather have this as 5 in KFold. Do we maybe want 5 here?

@amueller
Copy link
Member

amueller commented Dec 2, 2016

@jnothman do you see an easy way to reuse KFold._iter_test_indices? It also depends somewhat on what we decide in #7935, I guess?

@neerajgangwar
Copy link
Contributor Author

From what I understood from the code and changes in #7935, if KFold.random_state is set to anything but int, every split call will generate different splits. So I was thinking what if we do the following RepeatedKFold.__init__

def __init__(self, n_reps=2, n_splits=3, random_state=None):
        if random_state is None:
            random_state = check_random_state(random_state)
        self.n_reps = n_reps
        super(RepeatedKFold, self).__init__(n_splits, True, random_state)

and use KFold._iter_test_indices inside RepeatedKFold._iter_test_indices since KFold._iter_test_indices shuffles the indices on every call?

@amueller
Copy link
Member

amueller commented Dec 2, 2016

We want each call to split to yield the same results, but we want each internal repetition shuffled differently. If we change self.random_state each call to split will yield a different result.

@neerajgangwar
Copy link
Contributor Author

I missed that point.

@amueller
Copy link
Member

amueller commented Dec 2, 2016

@neerajgangwar well it is not true in master and the discussion in the issue I linked is somewhat complicated.

Copy link
Member

@raghavrv raghavrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs more tests, documentation and maybe an example?


Parameters
----------
n_reps : int, default=2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_iter?

n_splits : int, default=3
Number of folds. Must be atleast 2.

shuffle : boolean, default=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not shuffling what would be the use of a repeated k fold?

for fold_size in fold_sizes:
start, stop = current, current + fold_size
yield indices[start:stop]
current = stop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be simpler to re-use the k fold's code here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how? (see my comments)

@@ -465,6 +531,7 @@ class GroupKFold(_BaseKFold):
For splitting the data according to explicit domain-specific
stratification of the dataset.
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can be reverted. We don't usually follow the one line after docstring convention...

@raghavrv
Copy link
Member

raghavrv commented Dec 2, 2016

Thanks for the PR! Right now in master, it is different. But after #7935, it will not be so... So I suggest, Reinitialing a new KFold object for each repetition. Also like Andy suggests please reuse the KFold...

@neerajgangwar
Copy link
Contributor Author

So I suggest, Reinitialing a new KFold object for each repetition.

It will generate different splits for each split function call.

@amueller
Copy link
Member

amueller commented Dec 3, 2016

Well you could create n_repetitions many integer random states during __init__ and then use these to create the KFold objects each time?

@neerajgangwar
Copy link
Contributor Author

Yes that's an option. I was wondering if there is a better way?

@neerajgangwar
Copy link
Contributor Author

neerajgangwar commented Dec 3, 2016

In case, we create n_repetitions many integer random states and use these to create a KFold object at each repetition, RepeatedKFold can't be inherited from KFold. Would it make sense to inherit RepeatedKFold from _BaseKFold in that case?

@jnothman
Copy link
Member

jnothman commented Dec 3, 2016 via email

@jnothman
Copy link
Member

jnothman commented Dec 4, 2016

regardless of implementation, this needs tests.

self._random_states = []
for _ in range(n_splits):
random_state = check_random_state(
None).randint(np.iinfo(np.int32).max)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still needs to be possible to pass and int random_state for deterministic splitting.

@neerajgangwar
Copy link
Contributor Author

I am adding test cases and documentation.

@raghavrv raghavrv added this to the 0.19 milestone Dec 4, 2016
@jnothman
Copy link
Member

jnothman commented Dec 5, 2016

Does that mean you should mark this MRG?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GaelVaroquaux, @amueller who proposed the need for a RepeatedKFold: how about a RepeatedSplits or repeat_splits meta-splitter that wraps any splitter taking shuffle and random_state? Thus RepeatedSplits(KFold(n_splits=5), n_repeats=3).

class BaseRepeatedCrossValidator(with_metaclass(ABCMeta)):
"""Base class for repeated cross validators

Implementations must define `_get_cv`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems inelegant to me. A mixin is an option, perhaps. Otherwise, see my main comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a mixin be better in this case?

@neerajgangwar
Copy link
Contributor Author

Thus RepeatedSplits(KFold(n_splits=5), n_repeats=3)

How would you control random_state for each repetition?

@jnothman
Copy link
Member

jnothman commented Dec 5, 2016

How would you control random_state for each repetition?

One option is to draw a random int for each instance. The other is to share a random_state and ensure they are iterated in order.

@neerajgangwar
Copy link
Contributor Author

neerajgangwar commented Dec 5, 2016

I meant once you pass an instance of KFold to RepeatedSplits (the first parameter in RepeatedSplits signature), you can't change random_state (in the current implementation).
Or did you mean that user tells RepeatedSplits or meta-splitter which cross validator instance to create and instance is created internally?

@jnothman
Copy link
Member

jnothman commented Dec 5, 2016 via email

@amueller
Copy link
Member

amueller commented Dec 5, 2016

@GaelVaroquaux, @amueller who proposed the need for a RepeatedKFold: how about a RepeatedSplits or repeat_splits meta-splitter that wraps any splitter taking shuffle and random_state? Thus RepeatedSplits(KFold(n_splits=5), n_repeats=3).

I'm undecided. What other cv objects fulfil that? See http://scikit-learn.org/dev/modules/classes.html#splitter-classes

For the ShuffleSplit this doesn't really make sense, as you could just increase the number of iterations. All the others are deterministic apart from KFold and StratifiedKFold.

The main reason I wanted a class was discoverability, and because I feel these are "first class citizens" for many people.

@jnothman
Copy link
Member

jnothman commented Dec 5, 2016 via email

@neerajgangwar
Copy link
Contributor Author

Which option are we going ahead with? RepeatedSplits or RepeatedKFold?

@jnothman
Copy link
Member

_RepeatedSplits and RepeatedKFold for now? We can make _RepeatedSplits public when someone finds a use for it!

@amueller
Copy link
Member

Sounds good

@neerajgangwar
Copy link
Contributor Author

For this, we need set_params implemented in KFold class. Do we follow any convention for set_params and get_params methods? Any code that I can look at?

Also, I would like to submit a new PR for the new implementation. This one has a lot of unnecessary commits.

@jnothman
Copy link
Member

Remind me why we need set_params?

@neerajgangwar
Copy link
Contributor Author

If we are going with RepeatedSplits(KFold(n_splits=5), n_repeats=3) signature then once we pass an instance of KFold to RepeatedSplits, we need to control the random_state of this instance for each repetition. This is how we are implementing it or I missed something?

@jnothman
Copy link
Member

Yes, but under the current setup you can just set a random_state attribute in the instance, no?

@neerajgangwar
Copy link
Contributor Author

I have opened a new PR #8120. Is it okay if I close this one?

@jnothman
Copy link
Member

jnothman commented Dec 27, 2016 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants