[MRG + 1] Added DisjointLabelKFold to perform K-Fold cv on sets with disjoint labels. #4444
Conversation
This is the same as LeavePLabelOut, right?
SubjectIndependent is not a great name either, though, as it is very domain specific.
Sorry, I just now saw the discussion on the mailing list. If it is very similar to LeavePLabelOut, we indeed should either add a
Hi Andreas, thanks for the feedback. Cheers, Jean
I think adding what you want seems like a good idea. I just don't have a good name. Maybe GroupIndependentKFold?
Sorry, misclick.
    with no subject appearing in two different folds
    """
    # Fix the seed for reproducibility
    np.random.seed(0)
To avoid side effects between tests, please avoid seeding the global PRNG instance but instead use:

    rng = np.random.RandomState(0)
    ...
    subjects = rng.randint(0, n_subjects, n_samples)
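For illustration, a minimal self-contained version of that setup could look as follows (the sizes are made up, not taken from the PR):

    import numpy as np

    n_subjects, n_samples = 15, 1000
    rng = np.random.RandomState(0)  # local PRNG: no side effects on other tests
    subjects = rng.randint(0, n_subjects, n_samples)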
I agree with @amueller with respect to the name. Maybe we could use
cosmetic changes to the test (fix seed correctly, use assert_equal for meaningful error messages)
Thanks @ogrisel for the review! Made the corrections and changed the name to DisjointGroupKFold.
I don't like the use of Group when elsewhere in CV Label actually means the same thing. Group might be the better name, but we should be consistent.
+1
@jnothman, @GaelVaroquaux +1 for consistency
    The folds are built so that the same label doesn't appear in two different folds

    n_folds: int, default is 3
        number of folds
Could you add a nice Examples section like we have for LeavePLabelOut here?
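For reference, such an Examples section could look roughly like the sketch below; the constructor signature DisjointLabelKFold(labels, n_folds) and the import path are assumptions based on the rest of this PR, not the final code:

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.cross_validation import DisjointLabelKFold
    >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    >>> y = np.array([1, 2, 3, 4])
    >>> labels = np.array([0, 0, 2, 2])
    >>> cv = DisjointLabelKFold(labels, n_folds=2)
    >>> for train_index, test_index in cv:
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]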
Thanks for adding the example :) (There is a whitespace-related failure at https://travis-ci.org/scikit-learn/scikit-learn/jobs/55942000#L1336)
def test_disjoint_label_folds():
    """ Check that the function produces equilibrated folds
    with no label appearing in two different folds
    """
Could you also convert this to a comment following #4432 ? (sorry for the trouble)
Thanks for the quick response!
Thanks, I didn't see that issue. Just changed the docstring into a comment.
        assert_greater_equal(tolerance, abs(sum(folds == i) - ideal_n_labels_per_fold))

    # Check that each label appears only in 1 fold
    for label in np.unique(labels):
You could remove the for loop by constructing a coo_matrix instead, but I'm not sure it's worth it. This one might be cleaner.
I agree, to the reader it is probably easier to have the loopy version.
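For the record, a loop-free version along those lines might look like this sketch (assuming labels and folds are same-length 1-D arrays of non-negative integers and every label value actually occurs; the helper name is made up):

    import numpy as np
    from scipy.sparse import coo_matrix

    def each_label_in_one_fold(labels, folds):
        # One entry per sample at position (label, fold); duplicate entries
        # are summed when converting to CSR.
        data = np.ones(len(labels), dtype=int)
        incidence = coo_matrix((data, (labels, folds))).tocsr()
        # Each label (row) should have non-zero entries in exactly one fold.
        return bool(np.all(np.diff(incidence.indptr) == 1))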
    # Number of occurrences of each label (its "weight")
    samples_per_label = np.bincount(y)
    # We want to distribute the most frequent labels first
    ind = np.argsort(samples_per_label)[::-1]
I wonder if we should not use a stable sort algorithm like np.argsort(samples_per_label, kind="mergesort") here. @larsmans, any idea if a non-stable sort could cause reproducibility issues in this case?
If the NumPy implementation of quicksort changes, you can get a different ordering for tied labels. From a quick glance at the code, I don't see how that would not affect the output.
Use mergesort.
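A tiny illustration of the difference (the tie between the counts of labels 0 and 2 is made up for the example):

    import numpy as np

    samples_per_label = np.array([3, 5, 3])  # labels 0 and 2 are tied
    # mergesort is stable, so tied labels keep a deterministic relative order
    # regardless of the NumPy version; the default quicksort gives no such
    # guarantee.
    ind = np.argsort(samples_per_label, kind="mergesort")[::-1]
    # ind == array([1, 2, 0]): heaviest label first, ties reproducibly ordered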
@@ -332,6 +332,119 @@ def __len__(self):
        return self.n_folds


def disjoint_label_folds(y, n_folds=3):
I'm sure I missed discussion somewhere, but why did this get called y?
We could call it labels; y is simply for consistency with other cross-validation methods (e.g. StratifiedKFold).
But y means something different there; there it is the target of prediction over which samples are stratified. This variable is much more like label in LeaveOneLabelOut.
+1, you cannot predict something you have never seen in the training set. y in scikit-learn is always the target variable for supervised prediction. I agree, it's a bad name, but it's too late to change.
I'm trying to think of degenerate cases. We should raise an error if the number of folds is greater than the number of distinct labels.
@jnothman good point, I added a test.
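Such a test might look roughly like the following sketch (the error type, test name and call signature are assumptions, not the PR's actual test):

    import numpy as np
    from numpy.testing import assert_raises

    def test_disjoint_label_folds_value_error():
        # Only 2 distinct labels, so 3 mutually label-exclusive folds are impossible.
        labels = np.array([0, 0, 1, 1])
        assert_raises(ValueError, disjoint_label_folds, labels, 3)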
class DisjointLabelKFold(_BaseKFold):
    """Creates K approximately equilibrated folds.
I would not put emphasis on balancing but rather on mutual label exclusion:

    class DisjointLabelKFold(_BaseKFold):
        """K-fold iterator variant with non-overlapping labels.

        The same label will not appear in two different folds (the number of
        labels has to be at least equal to the number of folds).

        The folds are approximately balanced in the sense that the number of
        distinct labels is approximately the same in each fold.
        """
On 07/01/2015 12:46 PM, Olivier Grisel wrote:
> I would not put emphasis on balancing but rather on mutual label exclusion:
We also have that in #4583 though, right?
Please mention this new class in the "see also" section of the related cross-validation iterators. Please also introduce this tool in the user guide, in a new section of the cross-validation documentation. You also need to add a new entry for this class at the appropriate location in the API reference.
    samples_per_fold = np.zeros(n_folds)

    # Mapping from label index to fold index
    label_to_fold = np.zeros(len(unique_labels))
I think this should be np.zeros(len(unique_labels), dtype=np.uintp) instead.
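To show why an integer dtype matters here, a rough sketch of the kind of greedy assignment these arrays presumably support (heaviest label to the currently lightest fold; this is not the exact PR code, and the function name is made up):

    import numpy as np

    def greedy_label_folds(labels, n_folds=3):
        # Encode arbitrary labels as integers 0..n_labels-1.
        unique_labels, y = np.unique(labels, return_inverse=True)
        # Number of samples per label, heaviest labels first (stable sort).
        samples_per_label = np.bincount(y)
        ind = np.argsort(samples_per_label, kind="mergesort")[::-1]
        samples_per_fold = np.zeros(n_folds)
        # Integer dtype so the array can later be used as an index array.
        label_to_fold = np.zeros(len(unique_labels), dtype=np.uintp)
        for label_index in ind:
            lightest_fold = np.argmin(samples_per_fold)
            samples_per_fold[lightest_fold] += samples_per_label[label_index]
            label_to_fold[label_index] = lightest_fold
        # Fold index for each sample.
        return label_to_fold[y]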
Thanks @ogrisel, I added the documentation and addressed the other issues.
I know there was already some discussion and changes regarding the naming of this iterator, but what about simply LabelKFold? (We just added
I would be happy with that.
I know we're stuck with "Label" for now, but "GroupedKFold" is much clearer!
I like LabelKFold :)
I am taking care of rebasing and renaming.
Added a SubjectIndependentKFold class to create subject-independent folds.