
[MRG + 1] Added DisjointLabelKFold to perform K-Fold cv on sets with disjoint labels. #4444


Closed
wants to merge 27 commits into scikit-learn:master from JeanKossaifi:sid_kfold

Conversation

JeanKossaifi
Contributor

Added a SubjectIndependentKFold class to create subject independent folds.

@landscape-bot

Code Health
Code quality remained the same when pulling 53063ba on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@amueller
Member

This is the same as LeavePLabelOut, right?
http://scikit-learn.org/dev/modules/generated/sklearn.cross_validation.LeavePLabelOut.html#sklearn.cross_validation.LeavePLabelOut
Maybe we should rename it, it is not super clear...

@amueller
Member

SubjectIndependent is not a great name either, though, as it is very domain specific.

@amueller
Member

Sorry I just now saw the discussion on the mailing list. If it is very similar to LeavePLabelOut, we indeed should either add a Stratified version or add a stratified parameter. We don't want domain-specific names.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 95.1% when pulling 53063ba on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@JeanKossaifi
Contributor Author

Hi Andreas,

Thanks for the feedback.
My issue with adding a stratified parameter is that the name might be a bit misleading: we don't want to leave P labels out, we just want disjoint training and testing sets, while keeping approximately equilibrated folds.

Cheers,

Jean


@amueller
Member

I think adding what you want seems like a good idea. I just don't have a good name. Maybe GroupIndependentKFold?

@amueller amueller closed this Mar 24, 2015
@amueller amueller reopened this Mar 24, 2015
@amueller
Member

sorry misclick.

with no subject appearing in two different folds
"""
# Fix the seed for reproducibility
np.random.seed(0)
Member

To avoid side effect between tests, please avoid seeding the global PRNG instance but instead use:

rng = np.random.RandomState(0)

...
subjects = rng.randint(0, n_subjects, n_samples)
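The pattern suggested in this review can be sketched as follows (`n_samples` and `n_subjects` are illustrative values, not taken from the patch):

```python
import numpy as np

# A local PRNG instance avoids mutating the global NumPy generator,
# which would leak state between tests.
rng = np.random.RandomState(0)

n_samples, n_subjects = 100, 10
subjects = rng.randint(0, n_subjects, n_samples)

# Re-seeding a fresh instance reproduces exactly the same draw,
# independently of anything other tests did to np.random.
rng2 = np.random.RandomState(0)
assert (rng2.randint(0, n_subjects, n_samples) == subjects).all()
```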

@ogrisel
Member

ogrisel commented Mar 24, 2015

I agree with @amueller with respect to the name. Maybe we could use DisjointGroupKFold instead? I find the "subject" naming too specific and the "independent" naming confusing / misleading.

cosmetic changes to test (fix seed correctly, use assert_equal for meaningful error messages)
@JeanKossaifi
Contributor Author

Thanks @ogrisel for the review!

Made the corrections and changed the name to DisjointGroupKFold.

@landscape-bot

Code Health
Code quality remained the same when pulling 37ecdd7 on JeanKossaifi:sid_kfold into 6354e45 on scikit-learn:master.

@jnothman
Member

I don't like the use of Group when elsewhere in CV Label actually means the
same thing. Group might be the better name, but we should be consistent.


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 26, 2015 via email

@JeanKossaifi
Contributor Author

@jnothman, @GaelVaroquaux +1 for consistency
I changed the name to DisjointLabelKFold.

The folds are built so that the same label doesn't appear in two different folds

n_folds: int, default is 3
number of folds
Member

Could you add a nice Examples section like we have for LeavePLabelOut here?
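For reference, the kind of Examples section requested here can be sketched with the iterator's eventual descendant, `GroupKFold` from `sklearn.model_selection` (the modern name for this class; the sample data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold  # later name of this iterator

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
# Four samples share label "a", four share label "b".
labels = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

for train, test in GroupKFold(n_splits=2).split(X, y, groups=labels):
    # No label straddles the train/test boundary.
    assert set(labels[train]).isdisjoint(labels[test])
```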

@raghavrv
Member

Thanks for adding the example :)

( There is a whitespace related failure at https://travis-ci.org/scikit-learn/scikit-learn/jobs/55942000#L1336 )

def test_disjoint_label_folds():
""" Check that the function produces equilibrated folds
with no label appearing in two different folds
"""
Member

Could you also convert this to a comment following #4432 ? (sorry for the trouble)

Member

Thanks for the quick response!

Contributor Author

Thanks, I didn't see that issue. Just changed the docstring into comment.

assert_greater_equal(tolerance, abs(sum(folds == i) - ideal_n_labels_per_fold))

# Check that each label appears in only one fold
for label in np.unique(labels):
Member

you could remove the for loop by constructing a coo_matrix instead, but I'm not sure it's worth it. this one might be cleaner.

Contributor Author

I agree, to the reader it is probably easier to have the loopy version.
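Both variants discussed here can be sketched side by side (the `labels`/`folds` arrays are made-up data for illustration):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical assignment: folds[i] is the fold index of sample i.
labels = np.array([0, 0, 1, 1, 2, 2, 2, 3])
folds = np.array([0, 0, 1, 1, 2, 2, 2, 0])

# Loop version, as in the test: each label must map to exactly one fold.
for label in np.unique(labels):
    assert len(np.unique(folds[labels == label])) == 1

# coo_matrix alternative: duplicate (label, fold) entries are summed on
# conversion, so each row (label) should keep a single stored column (fold).
m = coo_matrix((np.ones_like(labels), (labels, folds))).tocsr()
assert (np.diff(m.indptr) == 1).all()
```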

# number of occurrence of each label (its "weight")
samples_per_label = np.bincount(y)
# We want to distribute the most frequent labels first
ind = np.argsort(samples_per_label)[::-1]
Member

I wonder if we should not use a stable sort algorithm like np.argsort(samples_per_label, kind="mergesort") here.

@larsmans any idea if a non-stable sort could cause reproducibility issues in this case?

Member

If the NumPy implementation of quicksort changes, you can get a different ordering for tied labels. From a quick glance at the code, I don't see how that would not affect the output.

Use mergesort.
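The stability issue can be demonstrated on a small example with tied counts:

```python
import numpy as np

samples_per_label = np.array([5, 3, 5, 3])

# quicksort may order tied entries differently across NumPy versions;
# mergesort is stable, so ties keep their original relative order.
asc = np.argsort(samples_per_label, kind="mergesort")
assert list(asc) == [1, 3, 0, 2]

# Note: reversing the ascending result also reverses the order of ties.
# For a descending order that is itself stable on ties, sort the negated
# weights instead.
desc = np.argsort(-samples_per_label, kind="mergesort")
assert list(desc) == [0, 2, 1, 3]
```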

@@ -332,6 +332,119 @@ def __len__(self):
return self.n_folds


def disjoint_label_folds(y, n_folds=3):
Member

I'm sure I missed discussion somewhere, but why did this get called y?

Contributor Author

We could call it labels; y is simply for consistency with other cross-validation methods (e.g. StratifiedKFold).

Member

But y means something different there; there it is the target of prediction over which samples are stratified. This variable is much more like label in LeaveOneLabelOut

Member

+1, you cannot predict something you have never seen on the training set. y in scikit-learn is always the target variable for supervised prediction. I agree, it's a bad name but it's too late to change.

Contributor Author

@jnothman @ogrisel ok, changed y to labels :)

@jnothman
Member

I'm trying to think of degenerate cases. We should raise an error if n_labels < n_folds, or else we'll produce an empty test set. Perhaps we should also say in the documentation that this will work best if n_labels >> n_folds and the samples are somewhat evenly spread among the labels.
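The guard being suggested can be sketched as follows (the function name is illustrative, not the one in the patch):

```python
import numpy as np

def check_enough_labels(labels, n_folds=3):
    # With fewer distinct labels than folds, at least one test fold
    # would necessarily be empty, so fail early with a clear message.
    n_labels = len(np.unique(labels))
    if n_labels < n_folds:
        raise ValueError("Cannot have n_folds=%d greater than the number "
                         "of labels: %d." % (n_folds, n_labels))

check_enough_labels([1, 1, 2, 2, 3, 3], n_folds=3)  # fine: 3 labels
# check_enough_labels([1, 1, 2, 2], n_folds=3)      # raises ValueError
```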

@JeanKossaifi
Contributor Author

@jnothman good point, I added a test.



class DisjointLabelKFold(_BaseKFold):
"""Creates K approximately equilibrated folds.
Member

I would not put emphasis on balancing but rather on mutual label exclusion:

class DisjointLabelKFold(_BaseKFold):
    """K-fold iterator variant with non-overlapping labels.

    The same label will not appear in two different folds (the number of
    labels has to be at least equal to the number of folds).

    The folds are approximately balanced in the sense that the number of
    distinct labels is approximately the same in each fold.
    """

Member

On 07/01/2015 12:46 PM, Olivier Grisel wrote:

> I would not put emphasis on balancing but rather on mutual label exclusion:

We also have that in #4583 though, right?

@ogrisel
Member

ogrisel commented Jul 1, 2015

Please mention this new class in the "see also" section of the KFold, StratifiedKFold, LeaveOneLabelOut and LeavePLabelOut class docstrings.

Please also introduce this tool in the user guide in a new section in sklearn/doc/modules/cross_validation.rst, probably after the section on StratifiedKFold.

You also need to add a new entry for this class at the appropriate location in sklearn/doc/modules/classes.rst (reference documentation).

samples_per_fold = np.zeros(n_folds)

# Mapping from label index to fold index
label_to_fold = np.zeros(len(unique_labels))
Member

I think this should be np.zeros(len(unique_labels), dtype=np.uintp) instead.
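The point of the suggested dtype can be illustrated in isolation (sizes are made up):

```python
import numpy as np

n_labels, n_folds = 5, 3
samples_per_fold = np.zeros(n_folds)  # float64 is fine for these counts

# np.zeros defaults to float64, and float arrays cannot be used for
# indexing; an integer dtype such as np.uintp makes the mapping directly
# usable as an index.
label_to_fold = np.zeros(n_labels, dtype=np.uintp)
label_to_fold[2] = 1
assert samples_per_fold[label_to_fold[2]] == 0.0  # integer indexing works

# A float-typed mapping would fail at the same indexing step.
bad = np.zeros(n_labels)
try:
    samples_per_fold[bad[2]]
except IndexError:
    pass  # "only integers ... are valid indices"
```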

@JeanKossaifi
Contributor Author

Thanks @ogrisel, I added the documentation and addressed the other issues.

@glouppe
Contributor

glouppe commented Aug 30, 2015

I know there was already some discussion and changes regarding the naming of this iterator, but what about simply LabelKFold?

(We just added LabelShuffleSplit as a label-variant of ShuffleSplit. Hence, LabelKFold would be the label-variant of KFold)

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 30, 2015 via email

@jnothman
Member

I know we're stuck with "Label" for now, but "GroupedKFold" is much clearer!

On 30 August 2015 at 23:32, Gael Varoquaux notifications@github.com wrote:

>> what about simply LabelKFold?
>
> I would be happy with that.

@agramfort
Member

agramfort commented Aug 30, 2015 via email

@glouppe
Contributor

glouppe commented Aug 30, 2015

I am taking care of rebasing and renaming.

@glouppe glouppe mentioned this pull request Aug 30, 2015
@glouppe glouppe closed this Aug 30, 2015
@JeanKossaifi JeanKossaifi deleted the sid_kfold branch September 24, 2015 11:59