[MRG+1] Repeated K-Fold and Repeated Stratified K-Fold #8120

neerajgangwar · 2016-12-27T07:33:32Z

Reference Issue

Fixes #7948

What does this implement/fix? Explain your changes.

Implements RepeatedKFold and RepeatedStratifiedKFold

Any other comments?

For previous discussion on this, please refer to #7960

jnothman · 2016-12-27T11:25:31Z

Do you intend to close #7960?

neerajgangwar · 2016-12-27T11:32:49Z

Yes.

jnothman · 2016-12-27T11:53:02Z

doc/modules/cross_validation.rst

+  >>> from sklearn.model_selection import RepeatedKFold, KFold
+  >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
+  >>> random_states = [12883823, 28827347]
+  >>> rkf = RepeatedKFold(KFold(n_splits=2), n_repeats=2, random_states=random_states)


We want a solution that can accept a single random_state. It can generate random states for each KFold instance.

I also don't get why you should accept a KFold instance to the constructor of RepeatedKFold.

Yes, it would be better to accept n_splits to the constructor of RepeatedKFold and create an instance of KFold inside. Likewise for RepeatedStratifiedKFold.

I'll make the changes.
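The constructor shape agreed on here (n_splits, n_repeats and a single random_state) can be illustrated with RepeatedKFold as released in scikit-learn. This is only a minimal sketch: it reuses the X array and one of the seeds from the doc snippet above, and the name splits is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

# A single seed configures the whole splitter; no KFold instance is passed in.
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=12883823)

# Collect (train, test) index lists: n_splits * n_repeats pairs in total.
splits = [(train.tolist(), test.tolist()) for train, test in rkf.split(X)]
for train, test in splits:
    print(train, test)
```

Each repetition is a full 2-fold partition of the four samples, so four (train, test) pairs are produced.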

jnothman · 2016-12-27T21:34:05Z

sklearn/model_selection/_split.py

+        if n_repeats <= 1:
+            raise ValueError("Number of repetitions must be greater than 1.")
+
+        rng = check_random_state(random_state)


we are not certain it's the best design, but currently all the splitters do this in split, not in __init__

Is there any way to achieve this other than initializing self.random_states = [] in __init__ and generating the random states when split is called for the first time? Would the code at lines 946-950 then move into split, inside an if condition?

jnothman · 2016-12-28T08:45:40Z

you don't need to store the random states, just generate them from the initial random state in split.
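A minimal sketch of that suggestion, assuming a free-standing generator rather than the PR's class: the seeds are drawn inside the split routine from the initial random state and nothing is stored on the object. The function name repeated_kfold_split and its defaults are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state


def repeated_kfold_split(X, n_splits=2, n_repeats=2, random_state=None):
    """Hypothetical helper: derive one seed per repetition at split time."""
    rng = check_random_state(random_state)
    for _ in range(n_repeats):
        # Draw a fresh integer seed for this repetition from the initial state.
        seed = rng.randint(np.iinfo(np.int32).max)
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        yield from cv.split(X)


X = np.arange(12).reshape(6, 2)
first = [test.tolist() for _, test in repeated_kfold_split(X, random_state=0)]
second = [test.tolist() for _, test in repeated_kfold_split(X, random_state=0)]
```

With an integer random_state, both calls rebuild the same generator and therefore draw the same seeds, so first and second are identical.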


neerajgangwar · 2016-12-29T03:54:11Z

Do you mean initializing a RandomState in each split call with check_random_state and generating the random states from it? If the initial random_state is an int, that works: check_random_state returns a RandomState with the same seed on every call. But if it is None, it returns a RandomState with a different seed on every call, and if it is a RandomState instance, it returns the same object. In both of those cases, split will produce different splits on different calls. To generate the same splits across split calls, the initial state probably needs to be stored somewhere?

Or are you referring to some other way?

jnothman · 2016-12-29T04:27:04Z

Currently KFold with shuffle=True will generate different splits on different calls to split, will it not? This should behave the same.
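The behaviour referred to here can be checked with a short sketch. It assumes a toy array of ten samples, and the helper name fold_indices is hypothetical. An integer seed is turned into a fresh generator on every call to split, while a RandomState instance is handed back unchanged by check_random_state and keeps advancing.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state

X = np.arange(20).reshape(10, 2)


def fold_indices(cv):
    """Return the test indices of every fold as plain lists."""
    return [test.tolist() for _, test in cv.split(X)]


# Integer seed: each call to split starts from the same generator state.
seeded = KFold(n_splits=2, shuffle=True, random_state=0)
same_each_call = fold_indices(seeded) == fold_indices(seeded)

# RandomState instance: check_random_state returns the very same object,
# so its state carries over from one call to split to the next.
rs = np.random.RandomState(0)
returns_same_object = check_random_state(rs) is rs
shared = KFold(n_splits=2, shuffle=True, random_state=rs)
first_call, second_call = fold_indices(shared), fold_indices(shared)

print(same_each_call, returns_same_object, first_call == second_call)
```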


jnothman

thanks

jnothman · 2016-12-29T12:23:41Z

doc/modules/cross_validation.rst

+  [1 2] [0 3]
+  [0 3] [1 2]
+
+


I think you should just mention RepeatedStratifiedKFold here and under StratifiedKFold. Also in the "see also"s of relevant classes.

jnothman · 2016-12-29T12:24:04Z

doc/modules/cross_validation.rst

@@ -409,6 +432,30 @@ two slightly unbalanced classes::
  [0 1 3 4 5 8 9] [2 6 7]
  [0 1 2 4 5 6 7] [3 8 9]

+Repeated Stratified K-Fold


I.e. I think this is overkill

jnothman · 2016-12-29T12:24:37Z

sklearn/model_selection/_split.py

@@ -913,6 +915,238 @@ def get_n_splits(self, X, y, groups):
        return int(comb(len(np.unique(groups)), self.n_groups, exact=True))


+class _RepeatedSplits(with_metaclass(ABCMeta)):
+    """Repeated splits for K-Fold and Stratified K-Fold


"for an arbitrary randomized CV splitter"

jnothman · 2016-12-29T12:24:46Z

sklearn/model_selection/_split.py

+class _RepeatedSplits(with_metaclass(ABCMeta)):
+    """Repeated splits for K-Fold and Stratified K-Fold
+
+    Repeats splits for cross-validators n times.


with different randomization

jnothman · 2016-12-29T12:26:18Z

sklearn/model_selection/_split.py

+        return self._repeated_splits.get_n_repeats()
+
+
+class RepeatedStratifiedKFold(with_metaclass(ABCMeta)):


Why is this an ABC?

jnothman · 2016-12-29T12:27:57Z

sklearn/model_selection/_split.py

+    def __init__(self, cv, n_repeats=5, random_state=None):
+        if not isinstance(cv, (KFold, StratifiedKFold)):
+            raise ValueError(
+                "cv must be an instance of KFold or StratifiedKFold.")


I think only KFold and StratifiedKFold use random_state. That's why I added this check. Should I remove it? And also, is there a way to check if cv is an instance of cross-validator with randomized split functionality?

Yes, remove it.

It's a private class, it doesn't require such validation.

jnothman · 2016-12-29T12:28:43Z

sklearn/model_selection/_split.py

+            for train_index, test_index in cv.split(X, y, groups):
+                yield train_index, test_index
+
+    def get_n_repeats(self):


I don't think we need this. n_repeats is already an attribute.

Removing from all classes.

jnothman · 2016-12-29T12:30:05Z

sklearn/model_selection/_split.py

+        test : ndarray
+            The testing set indices for that split.
+        """
+        cv = self.cv


I wonder if instead we should be constructing new CV objects in here (i.e. in split). Thus _RepeatedSplits.__init__ would take a constructor for cv rather than cv.

It might be nice as it would remove the dependency of KFold from RepeatedKFold in terms of parameters. I am thinking something like def __init__(self, cv, n_repeats=5, random_state=None, **cvargs):. It will be called as _RepeatedSplits(KFold, n_repeats, random_state, n_splits=n_splits). This is what you meant, right?

I have one doubt though. Since shuffle should always be True and random_state will be generated inside the split function, would it be okay to just document that the user should not pass these arguments, and that the single random_state that is passed does not correspond to the random_state parameter of KFold?

You can have **cvargs in the _RepeatedSplits class while still only allowing specified named args in RepeatedKFold
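One way to read this design is sketched below, under stated assumptions: the class names ending in Sketch are hypothetical stand-ins rather than the PR's code, validation is left out, and the defaults mirror the values under discussion at this point in the review. The private helper takes a splitter class plus arbitrary keyword arguments, while the public wrappers expose named arguments only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.utils import check_random_state


class RepeatedSplitsSketch:
    """Hypothetical private helper: a splitter class plus its kwargs."""

    def __init__(self, cv, n_repeats=5, random_state=None, **cvargs):
        self.cv = cv
        self.n_repeats = n_repeats
        self.random_state = random_state
        self.cvargs = cvargs

    def split(self, X, y=None, groups=None):
        rng = check_random_state(self.random_state)
        for _ in range(self.n_repeats):
            seed = rng.randint(np.iinfo(np.int32).max)
            # shuffle and random_state are set here, never by the caller.
            cv = self.cv(random_state=seed, shuffle=True, **self.cvargs)
            yield from cv.split(X, y, groups)


class RepeatedKFoldSketch(RepeatedSplitsSketch):
    """Public-facing wrapper: named arguments only, no **cvargs."""

    def __init__(self, n_splits=3, n_repeats=5, random_state=None):
        super().__init__(KFold, n_repeats, random_state, n_splits=n_splits)


class RepeatedStratifiedKFoldSketch(RepeatedSplitsSketch):
    """Same wrapper shape for the stratified splitter."""

    def __init__(self, n_splits=3, n_repeats=5, random_state=None):
        super().__init__(StratifiedKFold, n_repeats, random_state,
                         n_splits=n_splits)


X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
plain = RepeatedKFoldSketch(n_splits=2, n_repeats=3, random_state=0)
strat = RepeatedStratifiedKFoldSketch(n_splits=2, n_repeats=3, random_state=0)
n_plain = sum(1 for _ in plain.split(X))
n_strat = sum(1 for _ in strat.split(X, y))
```

Because the wrappers do not accept **cvargs, a call such as RepeatedKFoldSketch(shuffle=False) fails with a TypeError instead of silently overriding the shuffling.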

jnothman · 2016-12-29T12:31:51Z

sklearn/model_selection/tests/test_split.py

+    train, test = next(splits)
+    assert_array_equal(train, [0, 1, 2])
+    assert_array_equal(test, [3, 4])
+


please also check that a second call to split produces the same sets.

add a comment to explain why this is repeated.
perhaps use a loop or a helper function to avoid duplicated code.

you also don't check here that the iterator is exhausted after 4 elements
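A rough sketch of how these three requests could be combined in one helper. The name check_repeated_split, the toy data and the seed are all hypothetical, and the released RepeatedKFold is used so the snippet runs on its own. No specific index values are asserted.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold


def check_repeated_split(cv, X, n_expected):
    """Hypothetical test helper for a splitter with an integer random_state."""
    reference = None
    # Run twice: the second call to split must reproduce the first one.
    for attempt in range(2):
        splits = cv.split(X)
        seen = []
        for _ in range(n_expected):
            train, test = next(splits)
            seen.append((train.tolist(), test.tolist()))
        # The iterator must be exhausted after n_expected elements.
        try:
            next(splits)
        except StopIteration:
            pass
        else:
            raise AssertionError("split yielded more than n_expected elements")
        if reference is None:
            reference = seen
        assert seen == reference, "second call to split gave different sets"
    return reference


X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=258173307)
result = check_repeated_split(rkf, X, n_expected=4)
```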

jnothman · 2016-12-29T12:32:11Z

sklearn/model_selection/tests/test_split.py

+
+def test_repeated_stratified_kfold_errors():
+    # n_repeats is not integer or <= 1
+    assert_raises(ValueError, RepeatedStratifiedKFold, n_repeats=1)


do this in a loop together with the RepeatedKFold case.
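A sketch of the looped form, written without a test framework so it stays self-contained. The helper raises_value_error is hypothetical. The values 0 and 1.5 are assumptions chosen because they should be rejected whether the lower bound is "greater than 1", as in this diff, or something looser.

```python
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold


def raises_value_error(cls, n_repeats):
    """Return True if constructing cls with this n_repeats raises ValueError."""
    try:
        cls(n_repeats=n_repeats)
    except ValueError:
        return True
    return False


# One loop covers both splitters and both kinds of invalid n_repeats.
results = {
    (cls.__name__, n_repeats): raises_value_error(cls, n_repeats)
    for cls in (RepeatedKFold, RepeatedStratifiedKFold)
    for n_repeats in (0, 1.5)
}
print(results)
```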

jnothman

This is looking much better, thanks!

jnothman · 2017-01-01T01:10:38Z

sklearn/model_selection/_split.py

+    See also
+    --------
+    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
+


please remove blank line

jnothman · 2017-01-01T01:19:38Z

sklearn/model_selection/tests/test_split.py

+    train, test = next(splits)
+    assert_array_equal(train, [0, 1, 2])
+    assert_array_equal(test, [3, 4])
+


add a comment to explain why this is repeated.
perhaps use a loop or a helper function to avoid duplicated code.

jnothman · 2017-01-01T01:20:19Z

sklearn/model_selection/tests/test_split.py

+    train, test = next(splits)
+    assert_array_equal(train, [0, 1, 2])
+    assert_array_equal(test, [3, 4])
+


you also don't check here that the iterator is exhausted after 4 elements

neerajgangwar · 2017-01-01T04:48:32Z

Thanks @jnothman for the review. And a very happy new year :)

jnothman · 2017-01-01T09:51:13Z

LGTM!

tguillemot · 2017-01-18T14:47:49Z

sklearn/model_selection/_split.py

+            raise ValueError("Number of repetitions must be of Integral type.")
+
+        if n_repeats <= 1:
+            raise ValueError("Number of repetitions must be greater than 1.")


Never check values in __init__. Move it to split.

Shouldn't error be thrown at the construction time if there is some discrepancy with the parameters passed? In _BaseKFold also, values are checked in __init__.

In sklearn estimators, we never validate parameters in __init__ because of set_params, but these classes are not estimators, so I imagine that rule does not apply here.

@jnothman As I'm not 100% sure can you confirm that ?

For now, at least, CV splitters are a bit special in this regard. Checking in __init__ is consistent with other splitters.

ok thx @jnothman

tguillemot · 2017-01-18T14:48:18Z

sklearn/model_selection/_split.py

+
+    **cvargs : additional params
+        Constructor parameters for cv. Must not contain random_state
+        and shuffle.


Not an obligation as _RepeatedSplits is private but can you raise an error in split to check that ?

tguillemot · 2017-01-18T14:49:49Z

sklearn/model_selection/_split.py

+        rng = check_random_state(self.random_state)
+
+        for idx in range(n_repeats):
+            random_state = rng.randint(np.iinfo(np.int32).max)


Maybe I'm missing something, but why not send rng directly as random_state?

Integer random state generated by rng is sent as random_state. Do you have any other way in mind?

Sorry I was not clear.
Can you remove the line random_state = rng.randint(np.iinfo(np.int32).max) and replace random_state with rng afterwards?

Do you mean after creating the object for cv? If yes, how would it make a difference?

I mean : cv = self.cv(random_state=rng, shuffle=True, **self.cvargs)

Your code is similar to the code following :

In [1]: from sklearn.utils import check_random_state

In [2]: class Foo:
   ...:     def __init__(self, random_state):
   ...:         self.rng = check_random_state(random_state)
   ...:
   ...:     def fit(self):
   ...:         print(self.rng.randint(1000))
   ...:

In [3]: rng = check_random_state(0)

In [4]: f1 = Foo(rng)

In [5]: f2 = Foo(rng)

In [6]: f1.fit()
684

In [7]: f1.fit()
559

In [8]: f2.fit()
629

In [9]: f2.fit()
192

rng is an object and it will be modified every time you call cv.split. So to me it is not necessary to generate a specific random_state at each iteration. Maybe I am missing something?

check_random_state does not create a copy of rng if rng is already a RandomState instance.

I can't find any case which we'll miss by this approach. But I think current implementation keeps the use of random_state clean. I am not really sure.

@jnothman thoughts?

I think passing rng directly should be okay. If it's not okay, we need to be able to construct a test case that proves so!

I am not able to find any such testcase. So making the changes. Thanks!
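The approach settled on here can be sketched as follows, assuming a hypothetical helper named repeated_test_folds and toy data: one generator is created per call and handed to every KFold in turn, so each repetition shuffles from wherever the previous one left the generator.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state


def repeated_test_folds(X, n_repeats, random_state):
    """Hypothetical helper: one shared rng, a new KFold per repetition."""
    rng = check_random_state(random_state)
    repetitions = []
    for _ in range(n_repeats):
        # The same rng object is passed each time; shuffling advances it.
        cv = KFold(n_splits=2, shuffle=True, random_state=rng)
        repetitions.append([test.tolist() for _, test in cv.split(X)])
    return repetitions


X = np.arange(40).reshape(20, 2)
run_a = repeated_test_folds(X, n_repeats=3, random_state=0)
run_b = repeated_test_folds(X, n_repeats=3, random_state=0)
```

With an integer seed the whole sequence is still reproducible across runs, which is the property the earlier per-repetition integer seeds were meant to provide.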

tguillemot

LGTM.

Circle seems unrelated.

amueller · 2017-02-19T17:57:57Z

sklearn/model_selection/_split.py

+
+    Parameters
+    ----------
+    n_splits : int, default=3


This is consistent with the other estimators but seems pretty useless in practice. I would make 10 times 10-fold the default. Why would you want to do 3x5 instead of 10 fold?

Changing it to 5x10 (n_splits x n_repeats) by default. Is that fine?

amueller

Looks good apart from minor changes, in particular default parameters. How do others feel about adding this to check_cv? We could have a mxn syntax to have m repetitions of n-fold with automatically detecting stratified vs not, so you could do cv="10x10". Not entirely sure about that though.
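To make the "mxn" idea concrete, here is a purely hypothetical sketch. The function parse_repeated_cv is not part of scikit-learn or of this PR, and the stratified flag stands in for the automatic detection that check_cv would have to perform.

```python
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold


def parse_repeated_cv(spec, stratified=False, random_state=None):
    """Hypothetical helper: turn 'mxn' into m repetitions of n-fold CV."""
    n_repeats_text, sep, n_splits_text = spec.partition("x")
    if not sep or not n_repeats_text.isdigit() or not n_splits_text.isdigit():
        raise ValueError("expected a string like '10x10', got %r" % (spec,))
    cls = RepeatedStratifiedKFold if stratified else RepeatedKFold
    return cls(n_splits=int(n_splits_text), n_repeats=int(n_repeats_text),
               random_state=random_state)


cv = parse_repeated_cv("10x10", stratified=True, random_state=0)
cv_small = parse_repeated_cv("3x5")
```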

amueller · 2017-02-19T17:59:31Z

doc/modules/cross_validation.rst

+---------------
+
+:class:`RepeatedKFold` repeats K-Fold n times. It can be used when one
+requires to run :class:`KFold` n times, producing different splits in


I would say "it can be used to run KFold multiple times to increase the fidelity of the estimate"? Or can we say "to decrease the variance"? Is that accurate?
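For context, a small sketch of the usage being described. The synthetic data and the Ridge estimator are assumptions picked only to keep the example short. It shows that a repeated splitter yields one score per fold per repetition, and makes no claim about how much the variance of the averaged estimate changes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data (illustrative assumption, not from the PR).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv)

# One score per (repetition, fold) pair; the mean averages over all of them.
print(len(scores), scores.mean())
```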

amueller · 2017-02-19T18:00:30Z

sklearn/model_selection/_split.py

+    Parameters
+    ----------
+    cv : callable
+        Constructor of cross-validator.


Isn't this the cross validation class itself? That seems more natural than passing the __init__ method.

We are not passing the __init__ method. It's called as RepeatedSplits(KFold, n_repeats, random_state, n_splits=n_splits). I am changing description to "Cross-validator class.".

amueller · 2017-02-19T18:01:08Z

sklearn/model_selection/_split.py

+    cv : callable
+        Constructor of cross-validator.
+
+    n_repeats : int, default=5


I would probably do 10x10 by default, or maybe 5 times 10 fold. This is something you use when you care about accuracy but not necessarily time.

Changing default values of n_splits to 5 and n_repeats to 10.

amueller · 2017-02-19T18:03:03Z

sklearn/model_selection/_split.py

+        if n_repeats <= 1:
+            raise ValueError("Number of repetitions must be greater than 1.")
+
+        if any(key in cvargs for key in ('random_state', 'shuffle')):


if set(cvargs).intersection({'random_state', 'shuffle'})? Though not really shorter :-/

Keeping the same as both are of equal length. :P

amueller · 2017-02-19T18:05:24Z

sklearn/model_selection/_split.py

+        rng = check_random_state(self.random_state)
+
+        for idx in range(n_repeats):
+            cv = self.cv(random_state=rng, shuffle=True,


Do we maybe want to raise nice errors if these arguments are not present?

I didn't get you. Which arguments?

Honestly I have no idea what I meant.... ?

amueller · 2017-03-04T01:38:01Z

LGTM. Can you add an entry to whatsnew.rst?

neerajgangwar · 2017-03-04T15:34:14Z

@amueller Conflict in doc/whats_new.rst. Help?

amueller · 2017-03-04T16:10:35Z

Fixed it (which you could have also done locally ;)). Omg, this GitHub feature is amazing!!

amueller · 2017-03-04T16:10:57Z

It's customary to add your username to the entry, but you don't have to.

codecov · 2017-03-04T16:43:55Z

Codecov Report

Merging #8120 into master will increase coverage by <.01%.
The diff coverage is 98.61%.

@@            Coverage Diff             @@
##           master    #8120      +/-   ##
==========================================
+ Coverage   95.48%   95.48%   +<.01%     
==========================================
  Files         342      342              
  Lines       60913    60985      +72     
==========================================
+ Hits        58160    58231      +71     
- Misses       2753     2754       +1

Impacted Files	Coverage Δ
sklearn/model_selection/tests/test_split.py	`95.96% <100%> (+0.25%)`	✅
sklearn/model_selection/__init__.py	`100% <100%> (ø)`	✅
sklearn/model_selection/_split.py	`98.6% <96%> (-0.17%)`	❌

neerajgangwar · 2017-03-04T20:12:38Z

@amueller Added my name :)

…8120)
* Add _RepeatedSplits and RepeatedKFold class
* Add RepeatedStratifiedKFold and doc for repeated cvs
* Change default value of n_repeats
* Change input parameters of repeated cv constructor to n_splits, n_repeats, random_state
* Generate random states in split function rather than store it beforehand
* Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _RepeatedSplits and other review changes
* Remove blank line, put testcases for deterministic split in loop and add StopIteration check in testcase
* Using rng directly as random_state param to create cv instance and added a check for cvargs
* Fix pep8 warnings
* Changing default values for n_splits and n_repeats and add entry in changelog
* Adding name to the feature
* Missing space

neerajgangwar added 3 commits December 22, 2016 15:18

Add _RepeatedSplits and RepeatedKFold class

3742d69

Add RepeatedStratifiedKFold and doc for repeated cvs

42f711f

Change default value of n_repeats

ad2753c

neerajgangwar changed the title ~~Feature/repeated splits~~ [MRG] Repeated K-Fold and Stratified K-Fold Dec 27, 2016

neerajgangwar changed the title ~~[MRG] Repeated K-Fold and Stratified K-Fold~~ [MRG] Repeated K-Fold and Repeated Stratified K-Fold Dec 27, 2016

neerajgangwar mentioned this pull request Dec 27, 2016

[WIP] Add repeated cross-validations #7960

Closed

jnothman reviewed Dec 27, 2016

Change input parameters of repeated cv constructor to n_splits, n_rep…

e62bc27

…eats, random_state

jnothman reviewed Dec 27, 2016

Generate random states in split function rather than store it beforehand

c631b31

jnothman reviewed Dec 29, 2016

Doc changes, inheriting RepeatedKFold, RepeatedStratifiedKFold from _…

5e3078d

…RepeatedSplits and other review changes

jnothman reviewed Jan 1, 2017

Remove blank line, put testcases for deterministic split in loop and …

373e5a5

…add StopIteration check in testcase

jnothman changed the title ~~[MRG] Repeated K-Fold and Repeated Stratified K-Fold~~ [MRG+1] Repeated K-Fold and Repeated Stratified K-Fold Jan 1, 2017

tguillemot suggested changes Jan 18, 2017

neerajgangwar added 2 commits January 20, 2017 14:09

Using rng directly as random_state param to create cv instance and ad…

cba817a

…ded a check for cvargs

Fix pep8 warnings

218e409

tguillemot approved these changes Jan 24, 2017

amueller reviewed Feb 19, 2017

Changing default values for n_splits and n_repeats and add entry in c…

f9f48a8

…hangelog

Merge branch 'master' into feature/repeated-splits

feb9172

Adding name to the feature

eb38ebe

Missing space

338997e

amueller merged commit af1796e into scikit-learn:master Mar 4, 2017

neerajgangwar deleted the feature/repeated-splits branch March 5, 2017 03:39

Przemo10 mentioned this pull request Mar 17, 2017

update fork (#1) #8606

Closed
