-
-
Notifications
You must be signed in to change notification settings - Fork 26k
DOC Clarify when GroupKFold same as LeaveOneGroupOut #24104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
Thanks @thomasjpfan - Have amended, I think it's less messy but maybe there could be better way still? |
sklearn/model_selection/_split.py
Outdated
... y_train, y_test = y[train_index], y[test_index] | ||
... print(X_train, X_test, y_train, y_test) | ||
... train_group, test_group = groups[train_index], groups[test_index] | ||
... print("TRAIN:\tIndex:", train_index, ", Group:", train_group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To remove the space between the comma and the train_index
array:
... print("TRAIN:\tIndex:", train_index, ", Group:", train_group) | |
... print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}") |
This results in:
TRAIN: Index: [2 3], Group: [2 2]
doc/modules/cross_validation.rst
Outdated
:class:`GroupKFold` is the same as :class:`LeaveOneGroupOut` in the case where | ||
`n_splits` is equal to the number of groups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this information is better in the LeaveOneGroupOut
section. As a reader learning about just GroupKFold
, I think this is too much information. But if I was reading about LeaveOneGroupOut
, it is nice to know that LeaveOneGroupOut
is a special case of GroupKFold
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, otherwise LGTM
doc/modules/cross_validation.rst
Outdated
``n_groups=1`` and the same as :class:`GroupKFold` in the case where | ||
`n_splits` is equal to the number of groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Using "with" here is more consistent with the first part of the sentence which says "with n_groups=1"
``n_groups=1`` and the same as :class:`GroupKFold` in the case where | |
`n_splits` is equal to the number of groups | |
`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal | |
to the number of groups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucyleeow - I marked a couple of things on formatting to consider.
sklearn/model_selection/_split.py
Outdated
... train_group, test_group = groups[train_index], groups[test_index] | ||
... print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}") | ||
... print(f"TEST:\tIndex: {test_index}, Group: {test_group}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about including information on the fold to make it a little more clear what is happening?
Fold 1:
train_index=[2 3], train_group=[2 2]
test_index=[0 1 4 5], test_group=[0 0 3 3]
Fold 2:
train_index=[0 1 4 5], train_group=[0 0 3 3]
test_index=[2 3], test_group=[2 2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this but couldn't think of a way to do it neatly, I got (first line is a bit too long):
>>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
... train_group, test_group = groups[train_index], groups[test_index]
... print(f"Fold {i}")
... print(f"\tTRAIN:\tIndex: {train_index}, Group: {train_group}")
... print(f"\tTEST:\tIndex: {test_index}, Group: {test_group}")
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't mind this, and I don't think you can make that first line shorter without changing the variable names or doing something non-Pythonic. Another way would be the following, which produces what I had above – it makes the print()
statements easier to follow, but the first line is the same:
>>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
... train_group, test_group = groups[train_index], groups[test_index]
... print(f"Fold {i + 1}")
... print(f"\t{train_index=}, {train_group=}")
... print(f"\t{test_index=}, {test_group=}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, happy to amend but would like to get some other opinions (maybe @thomasjpfan?) just because I would like to update all other similar example sections for other classes in _split.py
(e.g., GroupKFold) to the same format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given what we decide here will influence other examples, I prefer to split this PR into two. This PR can only do what the title states and update doc/modules/cross_validation.rst
.
Then you can open a new PR for this docstring's example where we can agree on a formatting. As for my opinion, I like:
for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
print(f"Fold {i}:")
print(f" Train: index={train_index}, group={groups[train_index]}")
print(f" Test: index={test_index}, group={groups[test_index]}")
which turns into:
Fold 0:
Train: index=[2 3], group=[2 2]
Test: index=[0 1 4 5], group=[0 0 3 3]
Fold 1:
Train: index=[0 1 4 5], group=[0 0 3 3]
Test: index=[2 3], group=[2 2]
Edit: Corrected with group=
As suggested by @stefmolin, I like having the fold number in, but I find https://github.com/scikit-learn/scikit-learn/pull/24104/files#r964290370 harder to parse:
Fold 1
train_index=array([2, 3]), train_group=array([2, 2])
test_index=array([0, 1, 4, 5]), test_group=array([0, 0, 3, 3])
Fold 2
train_index=array([0, 1, 4, 5]), train_group=array([0, 0, 3, 3])
test_index=array([2, 3]), test_group=array([2, 2])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That also LGTM. Should it be group=
to match index=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should be group=
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from comment LGTM, thanks for the contribution @lucyleeow!
doc/modules/cross_validation.rst
Outdated
:class:`LeavePGroupsOut` with ``n_groups=1``. | ||
related to a specific group. This is the same as :class:`LeavePGroupsOut` with | ||
`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal to the | ||
number of groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number of groups | |
number of unique labels passed to the `groups` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wording tweak to avoid possible confusion with the parameter n_groups
.
Thanks for the reviews! I have amended so |
Yup, let's follow up with a PR on improving the the example. I think the formatting suggested in #24104 (comment) is sufficient. |
Reference Issues/PRs
closes #16853
closes #16869 (supersedes)
What does this implement/fix? Explain your changes.
Clarify
GroupKFold
same asLeaveOneGroupOut
whenn_splits
is same as number of groups. Also amends the code exampleinGroupKFold
such thatn_splits
is NOT the same as number of groupsAny other comments?