[MRG + 1] FIX Validate and convert X, y and groups to ndarray before splitting #7593
Conversation
@@ -1708,3 +1714,23 @@ def _build_repr(self):
        params[key] = value

    return '%s(%s)' % (class_name, _pprint(params, offset=len(class_name)))


def _check_X_y_groups(X, y, groups):
Should this reside inside `utils.validation`?
Probably. Is the same applicable for `sample_weights`? What do we usually do with `sample_weights`? We might just write `check_X_y` and then do a `check_consistent_length(X, sample_weights)` and `check_array(sample_weights)`.
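The pattern suggested above — convert everything to ndarray up front and check that lengths agree — can be sketched without sklearn at all. This is a hypothetical numpy-only helper (the name `check_inputs` and its exact behaviour are assumptions for illustration, not the PR's `_check_X_y_groups`):

```python
import numpy as np


def check_inputs(X, y=None, sample_weights=None):
    """Hypothetical sketch: convert inputs to ndarray and check lengths.

    Mirrors the check_X_y / check_consistent_length / check_array
    pattern discussed above, using plain numpy.
    """
    X = np.asarray(X)
    if y is not None:
        y = np.asarray(y)
        if len(X) != len(y):
            raise ValueError("X and y have inconsistent lengths: "
                             "%d vs %d" % (len(X), len(y)))
    if sample_weights is not None:
        sample_weights = np.asarray(sample_weights)
        if len(X) != len(sample_weights):
            raise ValueError("X and sample_weights have "
                             "inconsistent lengths")
    return X, y, sample_weights
```

After this conversion, downstream splitting code can rely on fancy indexing (`X[train]`) working, which is exactly what breaks when users pass plain Python lists.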
                            allow_nd=True)
        check_consistent_length(X, y)
        if groups is not None:
            groups = check_array(groups, accept_sparse=['coo', 'csr', 'csc'],
groups can be infinite? and sparse? and nd? Is that tested? ;)
cannot be sparse, surely.
or nd
                            dtype=None, force_all_finite=False, ensure_2d=False,
                            allow_nd=True)
        if y is not None:
            y = check_array(y, accept_sparse=['coo', 'csr', 'csc'],
Same for y. Are these tested? Should they be? I guess we should be as loose as possible with the test as long as the cross-validation classes work.
There is a test for … And we could do the …
@@ -843,6 +877,20 @@ def test_shufflesplit_reproducible():
                 list(a for a, b in ss.split(X)))


def test_shufflesplit_list_input():
    # Check that when y is a list / list of string labels, it works.
    ss = ShuffleSplit(random_state=42)
Shouldn't that be `StratifiedShuffleSplit`?
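The behaviour the test above is after can be sketched like this: after the fix, both splitters should accept `X` and `y` given as plain lists (including string labels), because inputs are converted to ndarray before splitting. This assumes scikit-learn is installed; the data here is made up for illustration:

```python
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# y given as a plain list of string labels; after this PR both
# splitters accept it because inputs are converted first.
X = [[i] for i in range(10)]
y = ['a', 'b'] * 5

ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
splits = list(ss.split(X, y))

# StratifiedShuffleSplit actually uses y, so it exercises the
# string-label path more thoroughly than ShuffleSplit does.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.4, random_state=42)
strat_splits = list(sss.split(X, y))
```

Using `StratifiedShuffleSplit` here matters because plain `ShuffleSplit` ignores `y`, so it would not catch a regression in label handling.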
@@ -1087,6 +1091,8 @@ def __init__(self, n_splits=5, test_size=0.2, train_size=None,
    def _iter_indices(self, X, y, groups):
        if groups is None:
            raise ValueError("The groups parameter should not be None")
        groups = check_array(groups, ensure_2d=False, dtype=None)
How about `GroupKFold`, `LeaveOneGroupOut`, `LeavePGroupsOut`?
Fixed... Thanks for the catch!!
Argh. There seemed to have been no tests for …
@@ -891,6 +901,8 @@ def get_n_splits(self, X, y, groups):
        """
        if groups is None:
            raise ValueError("The groups parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        groups = check_array(groups, ensure_2d=False, dtype=None)
I'd do it the other way around, I think.
`check_array` followed by `indexable`?
yeah
Looks good apart from some nitpicks.
    for j, (cv, p_groups_out) in enumerate(((logo, 1), (lpgo_1, 1),
                                            (lpgo_2, 2))):

    groups = (np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]),
do we want these to be file-level constants?
    logo = LeaveOneGroupOut()
    lpgo_1 = LeavePGroupsOut(n_groups=1)
    lpgo_2 = LeavePGroupsOut(n_groups=2)
    lpgo_3 = LeavePGroupsOut(n_groups=3)
for this one you only test the repr, right?
              [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
              ['1', '1', '1', '1', '2', '2', '2', '3', '3', '3', '3', '3'])

    all_n_splits = np.array([[3, 3, 3],
Why do you hard-code it like this? That seems hard to validate. It's just `scipy.misc.comb(len(np.unique(groups_i)), p_groups_out)`, right?
> It's just scipy.misc.comb(len(np.unique(groups_i)), p_groups_out), right

That is the implementation in `_split.py`. I thought it would be better to compare it against hand-calculated values?
Hm. The correctness of your "hand-calculated values" is not immediately obvious to me. How about

    n_groups = len(np.unique(groups_i))
    n_splits = n_groups if p_groups_out == 1 else n_groups * (n_groups - 1) / 2

? But I'm also fine leaving it as it is.

Why is `all_n_splits` of length 7 when `groups` is of length 6? (or github shows me a weird diff)
        # First test: no train group is in the test set and vice versa
        grps_train_unique = np.unique(groups_arr[train])
        grps_test_unique = np.unique(groups_arr[test])
        assert_false(np.any(np.in1d(groups_arr[train],
Why not test that the intersection is empty?

    assert_equal(set(groups_arr[train]).intersection(groups_arr[test]), set())
(or intersect1d if you prefer)
Sure. Thanks
Wait that is already done in the next 2 lines...
("third test")
The third test checks whether the indices are disjoint; my code checks whether the groups are disjoint.
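The disjointness checks being compared above are equivalent; here is a small self-contained illustration with hand-made indices (the data is an assumption, not taken from the PR). `np.isin` is used in place of the older `np.in1d` spelling from the diff:

```python
import numpy as np

groups_arr = np.array([1, 1, 2, 2, 3, 3])
train = np.array([0, 1, 2, 3])   # samples from groups 1 and 2
test = np.array([4, 5])          # samples from group 3

# Set-based check suggested by the reviewer: group labels are disjoint.
assert set(groups_arr[train]).intersection(groups_arr[test]) == set()

# Equivalent numpy formulations.
assert np.intersect1d(groups_arr[train], groups_arr[test]).size == 0
assert not np.any(np.isin(groups_arr[train], groups_arr[test]))
```

Note this checks disjointness of *group labels*, which is strictly stronger than checking that the train and test *index* arrays are disjoint.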
                                     grps_train_unique)))

        # Second test: train and test add up to all the data
        assert_equal(groups_arr[train].size +
`len(train) + len(test) == len(groups)`?
lgtm apart from minor comments
travis fails?
Sorry about that. Should be fixed now...
    np.testing.assert_equal(y_train2, y_train3)
    np.testing.assert_equal(X_test1, X_test3)
    np.testing.assert_equal(y_test3, y_test2)
    for stratify in ((y1, y2, y3), (None, None, None)):
Apologies for the delay! Have rebased and added the test... Could you check if it's okay?
    for stratify in ((y1, y2, y3), (None, None, None)):
        X_train1, X_test1, y_train1, y_test1 = train_test_split(
            X, y1, stratify=stratify[0], random_state=0)
I think `stratify=y1 if stratify else None` would be more readable (where `stratify in (True, False)` is iterated).
Done :)
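The suggested refactor — iterating a boolean instead of tuples of label arrays — can be sketched as follows. This assumes scikit-learn is installed and uses made-up data; it is the shape of the test, not the PR's exact code:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y1 = [0, 1] * 5

# Iterate over a boolean and derive the stratify argument from it,
# which reads better than indexing into a tuple of label arrays.
results = {}
for stratify in (True, False):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y1, stratify=y1 if stratify else None, random_state=0)
    results[stratify] = (y_train, y_test)
```

Both branches split the same data; only the stratified branch guarantees the class proportions of `y1` are preserved in the test set.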
(Maybe we should allow …)
Thanks for the patient review and merge!
needs a whatsnew maybe?
Fixes #7582 and #7126.

At sklearn 0.18.0: … That is fixed after this PR.

This PR also cleans up some docstrings and adds tests for `LeavePGroupsOut` and `LeaveOneGroupOut`.

@jnothman @amueller @lesteve Reviews please :)