Skip to content

DOC Clarify when GroupKFold same as LeaveOneGroupOut #24104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 10, 2022

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented Aug 4, 2022

Reference Issues/PRs

closes #16853
closes #16869 (supersedes)

What does this implement/fix? Explain your changes.

Clarify GroupKFold same as LeaveOneGroupOut when n_splits is same as number of groups. Also amends the code examplein GroupKFold such that n_splits is NOT the same as number of groups

Any other comments?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

@lucyleeow
Copy link
Member Author

Thanks @thomasjpfan - Have amended, I think it's less messy but maybe there could be better way still?

... y_train, y_test = y[train_index], y[test_index]
... print(X_train, X_test, y_train, y_test)
... train_group, test_group = groups[train_index], groups[test_index]
... print("TRAIN:\tIndex:", train_index, ", Group:", train_group)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To remove the space between the comma and the train_index array:

Suggested change
... print("TRAIN:\tIndex:", train_index, ", Group:", train_group)
... print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}")

This results in:

TRAIN:  Index: [2 3], Group: [2 2]

Comment on lines 627 to 628
:class:`GroupKFold` is the same as :class:`LeaveOneGroupOut` in the case where
`n_splits` is equal to the number of groups.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this information is better in the LeaveOneGroupOut section. As a reader learning about just GroupKFold, I think this is too much information. But if I was reading about LeaveOneGroupOut, it is nice to know that LeaveOneGroupOut is a special case of GroupKFold.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, otherwise LGTM

Comment on lines 723 to 724
``n_groups=1`` and the same as :class:`GroupKFold` in the case where
`n_splits` is equal to the number of groups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Using "with" here is more consistent with the first part of the sentence which says "with n_groups=1"

Suggested change
``n_groups=1`` and the same as :class:`GroupKFold` in the case where
`n_splits` is equal to the number of groups
`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal
to the number of groups.

Copy link
Contributor

@stefmolin stefmolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucyleeow - I marked a couple of things on formatting to consider.

Comment on lines 488 to 490
... train_group, test_group = groups[train_index], groups[test_index]
... print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}")
... print(f"TEST:\tIndex: {test_index}, Group: {test_group}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about including information on the fold to make it a little more clear what is happening?

Fold 1:
    train_index=[2 3], train_group=[2 2]
    test_index=[0 1 4 5], test_group=[0 0 3 3]
Fold 2:
    train_index=[0 1 4 5],  train_group=[0 0 3 3]
    test_index=[2 3], test_group=[2 2]

Copy link
Member Author

@lucyleeow lucyleeow Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but couldn't think of a way to do it neatly, I got (first line is a bit too long):

    >>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
    ...     train_group, test_group = groups[train_index], groups[test_index]
    ...     print(f"Fold {i}")
    ...     print(f"\tTRAIN:\tIndex: {train_index}, Group: {train_group}")
    ...     print(f"\tTEST:\tIndex: {test_index}, Group: {test_group}")

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't mind this, and I don't think you can make that first line shorter without changing the variable names or doing something non-Pythonic. Another way would be the following, which produces what I had above – it makes the print() statements easier to follow, but the first line is the same:

    >>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
    ...     train_group, test_group = groups[train_index], groups[test_index]
    ...     print(f"Fold {i + 1}")
    ...     print(f"\t{train_index=}, {train_group=}")
    ...     print(f"\t{test_index=}, {test_group=}")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, happy to amend but would like to get some other opinions (maybe @thomasjpfan?) just because I would like to update all other similar example sections for other classes in _split.py (e.g., GroupKFold) to the same format.

Copy link
Member

@thomasjpfan thomasjpfan Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given what we decide here will influence other examples, I prefer to split this PR into two. This PR can only do what the title states and update doc/modules/cross_validation.rst.

Then you can open a new PR for this docstring's example where we can agree on a formatting. As for my opinion, I like:

for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={groups[train_index]}")
    print(f"  Test:  index={test_index}, group={groups[test_index]}")

which turns into:

Fold 0:
  Train: index=[2 3], group=[2 2]
  Test:  index=[0 1 4 5], group=[0 0 3 3]
Fold 1:
  Train: index=[0 1 4 5], group=[0 0 3 3]
  Test:  index=[2 3], group=[2 2]

Edit: Corrected with group=

As suggested by @stefmolin, I like having the fold number in, but I find https://github.com/scikit-learn/scikit-learn/pull/24104/files#r964290370 harder to parse:

Fold 1
	train_index=array([2, 3]), train_group=array([2, 2])
	test_index=array([0, 1, 4, 5]), test_group=array([0, 0, 3, 3])
Fold 2
	train_index=array([0, 1, 4, 5]), train_group=array([0, 0, 3, 3])
	test_index=array([2, 3]), test_group=array([2, 2])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also LGTM. Should it be group= to match index=?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be group=.

Copy link
Member

@ArturoAmorQ ArturoAmorQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from comment LGTM, thanks for the contribution @lucyleeow!

:class:`LeavePGroupsOut` with ``n_groups=1``.
related to a specific group. This is the same as :class:`LeavePGroupsOut` with
`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal to the
number of groups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
number of groups
number of unique labels passed to the `groups` parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording tweak to avoid possible confusion with the parameter n_groups.

@lucyleeow
Copy link
Member Author

lucyleeow commented Sep 10, 2022

Thanks for the reviews!

I have amended so cross_validation.rst is amended, and the example in GroupKFold such that n_splits is NOT the same as number of groups.
Reverted formatting which I will do in another PR.

@thomasjpfan thomasjpfan merged commit b4c2ebf into scikit-learn:main Sep 10, 2022
@thomasjpfan
Copy link
Member

Yup, let's follow up with a PR on improving the the example. I think the formatting suggested in #24104 (comment) is sufficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify that LeaveOneGroupOut is a particular case of GroupKFold
4 participants