DOC Clarify when GroupKFold same as LeaveOneGroupOut #24104

lucyleeow · 2022-08-04T04:22:42Z

Reference Issues/PRs

closes #16853
closes #16869 (supersedes)

What does this implement/fix? Explain your changes.

Clarify that GroupKFold is the same as LeaveOneGroupOut when n_splits is the same as the number of groups. Also amends the code example in GroupKFold such that n_splits is NOT the same as the number of groups.

Any other comments?

thomasjpfan

Thank you for the PR!

sklearn/model_selection/_split.py

lucyleeow · 2022-08-05T04:07:38Z

Thanks @thomasjpfan - I have amended it. I think it's less messy, but maybe there is still a better way?

thomasjpfan · 2022-08-05T19:51:03Z

sklearn/model_selection/_split.py

-    ...     y_train, y_test = y[train_index], y[test_index]
-    ...     print(X_train, X_test, y_train, y_test)
+    ...     train_group, test_group = groups[train_index], groups[test_index]
+    ...     print("TRAIN:\tIndex:", train_index, ", Group:", train_group)


To remove the space between the comma and the train_index array:

Suggested change

-    ...     print("TRAIN:\tIndex:", train_index, ", Group:", train_group)
+    ...     print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}")

This results in:

TRAIN: Index: [2 3], Group: [2 2]

thomasjpfan · 2022-08-08T14:42:40Z

doc/modules/cross_validation.rst

+:class:`GroupKFold` is the same as :class:`LeaveOneGroupOut` in the case where
+`n_splits` is equal to the number of groups.


I'm thinking this information is better in the LeaveOneGroupOut section. As a reader learning about just GroupKFold, I think this is too much information. But if I was reading about LeaveOneGroupOut, it is nice to know that LeaveOneGroupOut is a special case of GroupKFold.
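The equivalence discussed here can be checked with a short sketch. The toy arrays and the helper `collect_test_folds` below are assumptions made for illustration and are not part of the PR. The folds are compared as a set because the two splitters are not guaranteed to yield them in the same order.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Hypothetical toy data: 6 samples spread over 3 groups.
X = np.arange(12).reshape(6, 2)
y = np.arange(6)
groups = np.array([0, 0, 2, 2, 3, 3])

# Number of unique labels passed as `groups`.
n_unique_groups = len(np.unique(groups))


def collect_test_folds(splitter):
    """Return the test indices of every fold as an order-independent set."""
    return {tuple(test.tolist()) for _, test in splitter.split(X, y, groups)}


# GroupKFold with n_splits equal to the number of unique groups...
gkf_folds = collect_test_folds(GroupKFold(n_splits=n_unique_groups))
# ...holds out the same groups as LeaveOneGroupOut.
logo_folds = collect_test_folds(LeaveOneGroupOut())

print(gkf_folds == logo_folds)
```

With fewer splits than groups, GroupKFold places several groups in one test fold, so the two sets no longer match.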

thomasjpfan

Minor nit, otherwise LGTM

thomasjpfan · 2022-08-09T16:50:31Z

doc/modules/cross_validation.rst

+``n_groups=1`` and the same as :class:`GroupKFold` in the case where
+`n_splits` is equal to the number of groups


Nit: Using "with" here is more consistent with the first part of the sentence which says "with n_groups=1"

Suggested change

-``n_groups=1`` and the same as :class:`GroupKFold` in the case where
-`n_splits` is equal to the number of groups
+`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal
+to the number of groups.

stefmolin

@lucyleeow - I marked a couple of things on formatting to consider.

doc/modules/cross_validation.rst

stefmolin · 2022-09-06T02:03:12Z

sklearn/model_selection/_split.py

+    ...     train_group, test_group = groups[train_index], groups[test_index]
+    ...     print(f"TRAIN:\tIndex: {train_index}, Group: {train_group}")
+    ...     print(f"TEST:\tIndex: {test_index}, Group: {test_group}")


What about including information on the fold to make it a little more clear what is happening?

    Fold 1:
        train_index=[2 3], train_group=[2 2]
        test_index=[0 1 4 5], test_group=[0 0 3 3]
    Fold 2:
        train_index=[0 1 4 5], train_group=[0 0 3 3]
        test_index=[2 3], test_group=[2 2]

I thought about this but couldn't think of a way to do it neatly, I got (first line is a bit too long):

    >>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
    ...     train_group, test_group = groups[train_index], groups[test_index]
    ...     print(f"Fold {i}")
    ...     print(f"\tTRAIN:\tIndex: {train_index}, Group: {train_group}")
    ...     print(f"\tTEST:\tIndex: {test_index}, Group: {test_group}")

WDYT?

I personally don't mind this, and I don't think you can make that first line shorter without changing the variable names or doing something non-Pythonic. Another way would be the following, which produces what I had above – it makes the print() statements easier to follow, but the first line is the same:

    >>> for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
    ...     train_group, test_group = groups[train_index], groups[test_index]
    ...     print(f"Fold {i + 1}")
    ...     print(f"\t{train_index=}, {train_group=}")
    ...     print(f"\t{test_index=}, {test_group=}")

Thanks, happy to amend but would like to get some other opinions (maybe @thomasjpfan?) just because I would like to update all other similar example sections for other classes in _split.py (e.g., GroupKFold) to the same format.

Given what we decide here will influence other examples, I prefer to split this PR into two. This PR can only do what the title states and update doc/modules/cross_validation.rst.

Then you can open a new PR for this docstring's example where we can agree on a formatting. As for my opinion, I like:

    for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, groups)):
        print(f"Fold {i}:")
        print(f" Train: index={train_index}, group={groups[train_index]}")
        print(f" Test: index={test_index}, group={groups[test_index]}")

which turns into:

    Fold 0:
     Train: index=[2 3], group=[2 2]
     Test: index=[0 1 4 5], group=[0 0 3 3]
    Fold 1:
     Train: index=[0 1 4 5], group=[0 0 3 3]
     Test: index=[2 3], group=[2 2]

Edit: Corrected with group=

As suggested by @stefmolin, I like having the fold number in, but I find https://github.com/scikit-learn/scikit-learn/pull/24104/files#r964290370 harder to parse:

    Fold 1
        train_index=array([2, 3]), train_group=array([2, 2])
        test_index=array([0, 1, 4, 5]), test_group=array([0, 0, 3, 3])
    Fold 2
        train_index=array([0, 1, 4, 5]), train_group=array([0, 0, 3, 3])
        test_index=array([2, 3]), test_group=array([2, 2])

That also LGTM. Should it be group= to match index=?

Yes it should be group=.
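For reference, the format agreed on above can be run end to end roughly as follows. The thread does not show the setup, so `X`, `y`, `groups` and `n_splits=2` here are assumptions inferred from the printed indices and groups. The exact spacing inside the f-strings is also a guess, since whitespace was lost in the quoted snippets.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Assumed setup: 6 samples in 3 groups, with fewer splits than groups.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 2, 2, 3, 3])

group_kfold = GroupKFold(n_splits=2)

# Materialise the folds once so they can be printed and inspected afterwards.
folds = list(group_kfold.split(X, y, groups))

for i, (train_index, test_index) in enumerate(folds):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}, group={groups[train_index]}")
    print(f"  Test:  index={test_index}, group={groups[test_index]}")
```

Indexing `groups` directly inside the f-string avoids the extra `train_group, test_group` assignment and keeps the loop body to three lines.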

ArturoAmorQ

Apart from comment LGTM, thanks for the contribution @lucyleeow!

ArturoAmorQ · 2022-09-09T08:52:03Z

doc/modules/cross_validation.rst

-:class:`LeavePGroupsOut` with ``n_groups=1``.
+related to a specific group. This is the same as :class:`LeavePGroupsOut` with
+`n_groups=1` and the same as :class:`GroupKFold` with `n_splits` equal to the
+number of groups


Suggested change

-number of groups
+number of unique labels passed to the `groups` parameter.

Wording tweak to avoid possible confusion with the parameter n_groups.
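The two uses of "groups" in that sentence can be told apart with a small sketch. The toy data is hypothetical. `n_groups` is the `LeavePGroupsOut` parameter (how many groups to hold out per split), while the number of splits produced by `LeaveOneGroupOut` equals the number of unique labels in `groups`.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, LeavePGroupsOut

# Hypothetical toy data: 6 samples, 3 unique group labels.
X = np.arange(12).reshape(6, 2)
y = np.arange(6)
groups = np.array([0, 0, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
lpgo = LeavePGroupsOut(n_groups=1)  # hold out exactly one group per split

# Sort the test folds so the comparison does not depend on iteration order.
logo_tests = sorted(tuple(t.tolist()) for _, t in logo.split(X, y, groups))
lpgo_tests = sorted(tuple(t.tolist()) for _, t in lpgo.split(X, y, groups))

n_unique_labels = len(np.unique(groups))
print(logo_tests == lpgo_tests)
print(logo.get_n_splits(groups=groups) == n_unique_labels)
```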

lucyleeow · 2022-09-10T00:11:20Z

Thanks for the reviews!

I have amended cross_validation.rst and the example in GroupKFold such that n_splits is NOT the same as the number of groups.
I have reverted the formatting changes, which I will do in another PR.

thomasjpfan · 2022-09-10T01:22:01Z

Yup, let's follow up with a PR on improving the example. I think the formatting suggested in #24104 (comment) is sufficient.

clarify groupkfold

f635863

github-actions bot added module:model_selection Documentation labels Aug 4, 2022

thomasjpfan reviewed Aug 4, 2022

change print

f2c53ff

thomasjpfan reviewed Aug 5, 2022

review

4b438b4

thomasjpfan reviewed Aug 8, 2022

move to leaveonegroupout

82f0886

thomasjpfan approved these changes Aug 9, 2022

review

3b81044

thomasjpfan added the Quick Review For PRs that are quick to review label Aug 10, 2022

cmarmo mentioned this pull request Aug 15, 2022

DOC: Clarify that LeaveOneGroupOut can be a particular case of GroupKFold #16869

Closed

stefmolin reviewed Sep 6, 2022

lucyleeow mentioned this pull request Sep 6, 2022

[MRG] DOC Link items explictly #14817

Merged

Merge branch 'main' into doc_group

5711383

ArturoAmorQ approved these changes Sep 9, 2022

review

61807b5

thomasjpfan merged commit b4c2ebf into scikit-learn:main Sep 10, 2022

lucyleeow deleted the doc_group branch September 10, 2022 01:25

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022

DOC Clarify when GroupKFold same as LeaveOneGroupOut (scikit-learn#24104)

cdd14f5

This was referenced Sep 19, 2022

DOC Improve format in code examples of splitters #24466

Merged

DOC Improve format in docstring code examples of splitters #24475

Merged
