
[MRG + 1] Added DisjointLabelKFold to perform K-Fold cv on sets with disjoint labels. #4444


Closed
wants to merge 27 commits into scikit-learn:master from JeanKossaifi:sid_kfold

Conversation

JeanKossaifi
Contributor

Added a SubjectIndependentKFold class to create subject independent folds.

@landscape-bot

Code Health
Code quality remained the same when pulling 53063ba on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@amueller
Member

This is the same as LeavePLabelOut, right?
http://scikit-learn.org/dev/modules/generated/sklearn.cross_validation.LeavePLabelOut.html#sklearn.cross_validation.LeavePLabelOut
Maybe we should rename it, it is not super clear...

@amueller
Member

SubjectIndependent is not a great name either, though, as it is very domain specific.

@amueller
Member

Sorry I just now saw the discussion on the mailing list. If it is very similar to LeavePLabelOut, we indeed should either add a Stratified version or add a stratified parameter. We don't want domain-specific names.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 95.1% when pulling 53063ba on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@JeanKossaifi
Contributor Author

Hi Andreas,

Thanks for the feedback.
My issue with adding a stratified parameter is that the name might be a bit misleading: we don't want to leave P labels out, we just want disjoint training and testing sets, while keeping approximately equilibrated folds.

Cheers,

Jean


@amueller
Member

I think adding what you want seems like a good idea. I just don't have a good name. Maybe GroupIndependentKFold?

@amueller amueller closed this Mar 24, 2015
@amueller amueller reopened this Mar 24, 2015
@amueller
Member

sorry misclick.

with no subject appearing in two different folds
"""
# Fix the seed for reproducibility
np.random.seed(0)
Member

To avoid side effect between tests, please avoid seeding the global PRNG instance but instead use:

rng = np.random.RandomState(0)

...
subjects = rng.randint(0, n_subjects, n_samples)
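The pattern suggested in this review can be sketched as follows (`n_samples` and `n_subjects` are illustrative values, not taken from the patch):

```python
import numpy as np

# A local PRNG instance avoids mutating the global NumPy generator,
# which would leak state between tests.
rng = np.random.RandomState(0)

n_samples, n_subjects = 100, 10
subjects = rng.randint(0, n_subjects, n_samples)

# Re-seeding a fresh instance reproduces exactly the same draw,
# independently of anything other tests did to np.random.
rng2 = np.random.RandomState(0)
assert (rng2.randint(0, n_subjects, n_samples) == subjects).all()
```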

@ogrisel
Member

ogrisel commented Mar 24, 2015

I agree with @amueller with respect to the name. Maybe we could use DisjointGroupKFold instead? I find the "subject" naming too specific and the "independent" naming confusing / misleading.

cosmetic changes to test (fix seed correctly, use assert_equal for meaningful error messages)
@JeanKossaifi
Contributor Author

Thanks @ogrisel for the review!

Made the corrections and changed the name to DisjointGroupKFold.

@landscape-bot

Code Health
Code quality remained the same when pulling 37ecdd7 on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@jnothman
Member

I don't like the use of Group when elsewhere in CV Label actually means the
same thing. Group might be the better name, but we should be consistent.


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 26, 2015 via email

@JeanKossaifi
Contributor Author

@jnothman, @GaelVaroquaux +1 for consistency
I changed the name to DisjointLabelKFold.

The folds are built so that the same label doesn't appear in two different folds

n_folds: int, default is 3
number of folds
Member

Could you add a nice Examples section like we have for LeavePLabelOut here?
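For reference, the kind of Examples section requested here can be sketched with the iterator's eventual descendant, `GroupKFold` from `sklearn.model_selection` (the modern name for this class; the sample data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold  # later name of this iterator

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
# Four samples share label "a", four share label "b".
labels = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

for train, test in GroupKFold(n_splits=2).split(X, y, groups=labels):
    # No label straddles the train/test boundary.
    assert set(labels[train]).isdisjoint(labels[test])
```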

@raghavrv
Member

Thanks for adding the example :)

( There is a whitespace related failure at https://travis-ci.org/scikit-learn/scikit-learn/jobs/55942000#L1336 )

def test_disjoint_label_folds():
""" Check that the function produces equilibrated folds
with no label appearing in two different folds
"""
Member

Could you also convert this to a comment following #4432 ? (sorry for the trouble)

Member

Thanks for the quick response!

Contributor Author

Thanks, I didn't see that issue. Just changed the docstring into comment.

assert_greater_equal(tolerance, abs(sum(folds == i) - ideal_n_labels_per_fold))

# Check that each label appears in only one fold
for label in np.unique(labels):
Member

you could remove the for loop by constructing a coo_matrix instead, but I'm not sure it's worth it. this one might be cleaner.

Contributor Author

I agree, to the reader it is probably easier to have the loopy version.
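Both variants discussed here can be sketched side by side (the `labels`/`folds` arrays are made-up data for illustration):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical assignment: folds[i] is the fold index of sample i.
labels = np.array([0, 0, 1, 1, 2, 2, 2, 3])
folds = np.array([0, 0, 1, 1, 2, 2, 2, 0])

# Loop version, as in the test: each label must map to exactly one fold.
for label in np.unique(labels):
    assert len(np.unique(folds[labels == label])) == 1

# coo_matrix alternative: duplicate (label, fold) entries are summed on
# conversion, so each row (label) should keep a single stored column (fold).
m = coo_matrix((np.ones_like(labels), (labels, folds))).tocsr()
assert (np.diff(m.indptr) == 1).all()
```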

# number of occurrence of each label (its "weight")
samples_per_label = np.bincount(y)
# We want to distribute the most frequent labels first
ind = np.argsort(samples_per_label)[::-1]
Member

I wonder if we should not use a stable sort algorithm like np.argsort(samples_per_label, kind="mergesort") here.

@larsmans any idea if a non-stable sort could cause reproducibility issues in this case?

Member

If the NumPy implementation of quicksort changes, you can get a different ordering for tied labels. From a quick glance at the code, I don't see how that would not affect the output.

Use mergesort.
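The stability issue can be demonstrated on a small example with tied counts:

```python
import numpy as np

samples_per_label = np.array([5, 3, 5, 3])

# quicksort may order tied entries differently across NumPy versions;
# mergesort is stable, so ties keep their original relative order.
asc = np.argsort(samples_per_label, kind="mergesort")
assert list(asc) == [1, 3, 0, 2]

# Note: reversing the ascending result also reverses the order of ties.
# For a descending order that is itself stable on ties, sort the negated
# weights instead.
desc = np.argsort(-samples_per_label, kind="mergesort")
assert list(desc) == [0, 2, 1, 3]
```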

@@ -332,6 +332,119 @@ def __len__(self):
return self.n_folds


def disjoint_label_folds(y, n_folds=3):
Member

I'm sure I missed discussion somewhere, but why did this get called y?

Contributor Author

We could call it labels; y is simply for consistency with other cross-validation methods (e.g. StratifiedKFold).

Member

But y means something different there; there it is the target of prediction over which samples are stratified. This variable is much more like label in LeaveOneLabelOut

Member

+1, you cannot predict something you have never seen on the training set. y in scikit-learn is always the target variable for supervised prediction. I agree, it's a bad name but it's too late to change.

Contributor Author

@jnothman @ogrisel ok, changed y to labels :)

@jnothman
Member

I'm trying to think of degenerate cases. We should raise an error if n_labels < n_folds, or else we'll produce an empty test set. Perhaps we should also say in the documentation that this will work best if n_labels >> n_folds and the samples are somewhat evenly spread among the labels.
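The guard being suggested can be sketched as follows (the function name is illustrative, not the one in the patch):

```python
import numpy as np

def check_enough_labels(labels, n_folds=3):
    # With fewer distinct labels than folds, at least one test fold
    # would necessarily be empty, so fail early with a clear message.
    n_labels = len(np.unique(labels))
    if n_labels < n_folds:
        raise ValueError("Cannot have n_folds=%d greater than the number "
                         "of labels: %d." % (n_folds, n_labels))

check_enough_labels([1, 1, 2, 2, 3, 3], n_folds=3)  # fine: 3 labels
# check_enough_labels([1, 1, 2, 2], n_folds=3)      # raises ValueError
```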

@JeanKossaifi
Contributor Author

@jnothman good point, I added a test.



class DisjointLabelKFold(_BaseKFold):
"""Creates K approximately equilibrated folds.
Member

I would not put emphasis on balancing but rather on mutual label exclusion:

class DisjointLabelKFold(_BaseKFold):
    """K-fold iterator variant with non-overlapping labels.

    The same label will not appear in two different folds (the number of
    labels has to be at least equal to the number of folds).

    The folds are approximately balanced in the sense that the number of
    distinct labels is approximately the same in each fold.
    """

Member

On 07/01/2015 12:46 PM, Olivier Grisel wrote:

> I would not put emphasis on balancing but rather on mutual label exclusion:

We also have that in #4583 though, right?

@ogrisel
Member

ogrisel commented Jul 1, 2015

Please mention this new class in the "see also" section of the KFold, StratifiedKFold, LeaveOneLabelOut and LeavePLabelOut class docstrings.

Please also introduce this tool in the user guide in a new section in sklearn/doc/modules/cross_validation.rst, probably after the section on StratifiedKFold.

You also need to add a new entry for this class at the appropriate location in sklearn/doc/modules/classes.rst (reference documentation).

samples_per_fold = np.zeros(n_folds)

# Mapping from label index to fold index
label_to_fold = np.zeros(len(unique_labels))
Member

I think this should be np.zeros(len(unique_labels), dtype=np.uintp) instead.
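The point of the suggested dtype can be illustrated in isolation (sizes are made up):

```python
import numpy as np

n_labels, n_folds = 5, 3
samples_per_fold = np.zeros(n_folds)  # float64 is fine for these counts

# np.zeros defaults to float64, and float arrays cannot be used for
# indexing; an integer dtype such as np.uintp makes the mapping directly
# usable as an index.
label_to_fold = np.zeros(n_labels, dtype=np.uintp)
label_to_fold[2] = 1
assert samples_per_fold[label_to_fold[2]] == 0.0  # integer indexing works

# A float-typed mapping would fail at the same indexing step.
bad = np.zeros(n_labels)
try:
    samples_per_fold[bad[2]]
except IndexError:
    pass  # "only integers ... are valid indices"
```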

@JeanKossaifi
Contributor Author

Thanks @ogrisel, I added the documentation and addressed the other issues.

@glouppe
Contributor

glouppe commented Aug 30, 2015

I know there was already some discussion and changes regarding the naming of this iterator, but what about simply LabelKFold?

(We just added LabelShuffleSplit as a label-variant of ShuffleSplit. Hence, LabelKFold would be the label-variant of KFold)

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 30, 2015 via email

@jnothman
Member

I know we're stuck with "Label" for now, but "GroupedKFold" is much clearer!

On 30 August 2015 at 23:32, Gael Varoquaux notifications@github.com wrote:

>> what about simply LabelKFold?
>
> I would be happy with that.

@agramfort
Member

agramfort commented Aug 30, 2015 via email

@glouppe
Contributor

glouppe commented Aug 30, 2015

I am taking care of rebasing and renaming.

@glouppe glouppe mentioned this pull request Aug 30, 2015
@glouppe glouppe closed this Aug 30, 2015
@JeanKossaifi JeanKossaifi deleted the sid_kfold branch September 24, 2015 11:59