[QST] Updating indices creation for split #14036


Open
quasiben opened this issue Jun 6, 2019 · 18 comments

@quasiben

quasiben commented Jun 6, 2019

I am interested in using GridSearchCV machinery from sklearn with non-numpy libraries like cuDF or dask.array. In exploring this work, I think functions like _iter_test_indices and _iter_test_masks could be amended slightly to return slices instead of numpy arrays.

For example in _iter_test_indices:

def _iter_test_indices(self, X, y=None, groups=None):
    n_samples = _num_samples(X)
    indices = np.arange(n_samples)
    if self.shuffle:
        check_random_state(self.random_state).shuffle(indices)
    n_splits = self.n_splits
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=np.int)
    fold_sizes[:n_samples % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield indices[start:stop]
        current = stop

If shuffle is enabled, much of this code could be moved inside the shuffle conditional. And when shuffle is disabled, we could yield the appropriate slice object, e.g. slice(0, 71)...

I wanted to check here before proceeding with the work and/or if folks have any advice before I begin?

@jnothman
Member

jnothman commented Jun 6, 2019 via email

@mrocklin

mrocklin commented Jun 7, 2019

I think that even a conservative change like the following would be welcome:

chunk = x[slc].copy()

In Dask at least the copy is free for us.
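For context (my reading of why the explicit copy is conservative, not stated in the thread): in NumPy, slicing returns a view while integer-array indexing copies, so a slice followed by `.copy()` reproduces the current copy semantics for NumPy arrays while still letting lazy libraries fuse the slice:

```python
import numpy as np

x = np.arange(10)

view = x[2:5]                    # basic slicing returns a view
assert view.base is x

fancy = x[np.array([2, 3, 4])]   # integer-array indexing always copies
assert fancy.base is None

chunk = x[2:5].copy()            # slice + copy: same semantics as fancy indexing
x[2] = 99
assert chunk[0] == 2             # chunk is unaffected by later writes to x
```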

@jnothman
Member

jnothman commented Jun 7, 2019 via email

@jnothman
Member

jnothman commented Jun 7, 2019 via email

@mrocklin

We could also maybe send in a cupy.random.RandomState object rather than a numpy.random.RandomState object so that we generate the random values on the GPU instead of on the CPU. This doesn't solve the np.arange problem, but maybe that points us in another possible direction. We would still need to remove explicit np.asarray calls on inputs, and allow array-like arguments through in some cases.

This came up in conversation with @ogrisel a couple of months ago.

Also cc @jakirkham , who has been looking at related problems recently.

@jakirkham
Contributor

Yeah with np.arange, I was hoping we could write something like this...

np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1
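A quick sanity check that this expression matches `np.arange` for a NumPy input (the `shape=` override on `ones_like` requires NumPy >= 1.17):

```python
import numpy as np

X = np.empty(10)
n_samples = 7

# ones_like(..., shape=...) keeps the array type of X but overrides the
# shape, so for a cupy or dask input the result would live in that library.
generalized = np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1

assert np.array_equal(generalized, np.arange(n_samples))
```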

@jakirkham
Contributor

There's also a call to np.full. Currently that looks like this...

np.full(n_splits, n_samples // n_splits, dtype=np.int)

We might want to generalize it to something like this...

np.full_like(X, n_samples // n_splits, shape=(n_splits,), dtype=np.int)
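The same sanity check for the `full_like` variant (again assuming NumPy >= 1.17 for the `shape=` override; `np.int` has since been removed from NumPy, so plain `int` is used here):

```python
import numpy as np

X = np.empty(10)
n_samples, n_splits = 10, 3

original = np.full(n_splits, n_samples // n_splits, dtype=int)
# full_like dispatches on the type of X, so a non-NumPy X would yield
# an array of the same library rather than a NumPy array.
generalized = np.full_like(X, n_samples // n_splits, shape=(n_splits,), dtype=int)

assert np.array_equal(original, generalized)
```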

@mrocklin

Right, so that opens up a question for Scikit-Learn maintainers of whether they're willing to accept code like this, which is less intuitive in a few places, for the case where the inputs are NumPy-like arrays.

My guess is that accepting a user-defined RandomState object is fine (it looks like there is good precedent for this already) while replacing np.arange calls with ones_like/cumsum is maybe less welcome.

@jakirkham
Contributor

There's of course the point that this syntax only works with NumPy 1.17+, which may be its own kind of deal breaker. Anyway, I figured I'd float these snippets to start the conversation.

@jakirkham
Contributor

Also there's a separate discussion about having a way to cast between different arrays in a subthread of issue ( numpy/numpy#13831 ), which is ostensibly about handling asarray without forcing array-like things to become NumPy arrays.

cc @pentschev (who's involved in that discussion and is also thinking about these things)

@amueller
Member

Yeah it'll be a while till we can use numpy 1.17.

@jakirkham
Contributor

My guess is that accepting a user-defined RandomState object is fine...

Yeah, it would be great if RandomState created the array and shuffled it for us, but I couldn't find the API we would need to do that. .shuffle does not seem to work if one does rs.shuffle(5), for example.
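For what it's worth (an observation of mine, not from the thread): `RandomState.shuffle` only mutates an existing array in place, but `RandomState.permutation(n)` does build and shuffle an arange in one call; the catch is that a NumPy RandomState still returns a NumPy array, so a cupy RandomState would be needed to keep the result on the GPU:

```python
import numpy as np

rs = np.random.RandomState(42)

# shuffle() mutates an existing array in place and returns None...
arr = np.arange(5)
rs.shuffle(arr)

# ...while permutation(n) creates the arange and shuffles it in one step.
perm = rs.permutation(5)
assert sorted(perm.tolist()) == [0, 1, 2, 3, 4]

# shuffle() with a bare int raises, matching the observation above.
try:
    rs.shuffle(5)
except TypeError:
    pass
```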

@jakirkham
Contributor

Another concern that Ben helped identify earlier is that if we want to allow a user-defined RandomState object, we'd want to make sure this passes check_random_state. Currently that means the user-defined RandomState object would need to subclass NumPy's RandomState object (please see code below). This is a bit undesirable as we will end up carrying around a Cython base class that has some irrelevant state. Not sure how well this will behave in practice either. Would it be possible to generalize this check to merely make sure the RandomState satisfies some abstract base class or has some particular methods?

    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)

@amueller
Member

@jakirkham I don't see why not.

@thomasjpfan
Member

Would it be possible to generalize this check to merely make sure the RandomState satisfies some abstract base class or has some particular methods?

Which interface do you have in mind? The interfaces for cupy.random.RandomState and np.random.RandomState are pretty extensive.

@amueller
Member

amueller commented Aug 21, 2019

We kinda only need to distinguish it from an integer or None, right? We could check for a single common method existing. If we then try to use some obscure method and it's not implemented, well, that's a reasonable error.
If it has a uniform method, it's much more likely to be something that implements sampling than an integer, I think ;)
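A sketch of what that duck-typed check could look like (hypothetical; the function name and the choice of `uniform` as the probe method are assumptions following the suggestion above, not an actual scikit-learn API):

```python
import numbers
import numpy as np

def check_random_state_ducktyped(seed):
    # Hypothetical generalization: accept any object exposing a `uniform`
    # method (e.g. cupy.random.RandomState) instead of requiring a
    # numpy.random.RandomState subclass.
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if hasattr(seed, "uniform"):
        return seed
    raise ValueError(
        "%r cannot be used to seed a random state instance" % seed
    )
```

With this, a cupy RandomState (or any sampling object) passes through unchanged, while integers and None keep their current behavior.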

@jakirkham
Contributor

cc @ogrisel (as we discussed this during EuroSciPy)

@ogrisel
Member

ogrisel commented May 13, 2022

It would be great if someone wanted to try to prototype Array API spec support in scikit-learn Cross-Validation tools in a draft PR on top of the #22554 PR.
