[QST] Updating indices creation for split #14036


Open
quasiben opened this issue Jun 6, 2019 · 18 comments

@quasiben

quasiben commented Jun 6, 2019

I am interested in using GridSearchCV machinery from sklearn with non-numpy libraries like cuDF or dask.array. In exploring this work, I think functions like _iter_test_indices and _iter_test_masks could be amended slightly to return slices instead of numpy arrays.

For example in _iter_test_indices:

def _iter_test_indices(self, X, y=None, groups=None):
    n_samples = _num_samples(X)
    indices = np.arange(n_samples)
    if self.shuffle:
        check_random_state(self.random_state).shuffle(indices)
    n_splits = self.n_splits
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=np.int)
    fold_sizes[:n_samples % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield indices[start:stop]
        current = stop

If shuffle is enabled, much of this code could be moved inside the shuffle conditional. And when shuffle is disabled, we could yield the appropriate slice object, e.g. slice(0, 71)...

I wanted to check here before proceeding with the work and/or if folks have any advice before I begin?

@jnothman
Member

jnothman commented Jun 6, 2019 via email

@mrocklin

mrocklin commented Jun 7, 2019

I think that even a conservative change like the following would be welcome:

chunk = x[slc].copy()

In Dask at least the copy is free for us.
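For context (my reading of why the explicit copy is conservative, not stated in the thread): in NumPy, slicing returns a view while integer-array indexing copies, so a slice followed by `.copy()` reproduces the current copy semantics for NumPy arrays while still letting lazy libraries fuse the slice:

```python
import numpy as np

x = np.arange(10)

view = x[2:5]                    # basic slicing returns a view
assert view.base is x

fancy = x[np.array([2, 3, 4])]   # integer-array indexing always copies
assert fancy.base is None

chunk = x[2:5].copy()            # slice + copy: same semantics as fancy indexing
x[2] = 99
assert chunk[0] == 2             # chunk is unaffected by later writes to x
```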

@jnothman
Member

jnothman commented Jun 7, 2019 via email

@jnothman
Member

jnothman commented Jun 7, 2019 via email

@mrocklin

We could also maybe send in a cupy.random.RandomState object rather than a numpy.random.RandomState object so that we generate the random values on the GPU instead of on the CPU. This doesn't solve the np.arange problem, but maybe that points us in another possible direction. We would still need to remove explicit np.asarray calls on inputs, and allow array-like arguments through in some cases.

This came up in conversation with @ogrisel a couple of months ago.

Also cc @jakirkham , who has been looking at related problems recently.

@jakirkham
Contributor

Yeah with np.arange, I was hoping we could write something like this...

np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1
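A quick sanity check that this expression matches `np.arange` for a NumPy input (the `shape=` override on `ones_like` requires NumPy >= 1.17):

```python
import numpy as np

X = np.empty(10)
n_samples = 7

# ones_like(..., shape=...) keeps the array type of X but overrides the
# shape, so for a cupy or dask input the result would live in that library.
generalized = np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1

assert np.array_equal(generalized, np.arange(n_samples))
```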

@jakirkham
Contributor

There's also a call to np.full. Currently that looks like this...

np.full(n_splits, n_samples // n_splits, dtype=np.int)

We might want to generalize it to something like this...

np.full_like(X, n_samples // n_splits, shape=(n_splits,), dtype=np.int)
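The same sanity check for the `full_like` variant (again assuming NumPy >= 1.17 for the `shape=` override; `np.int` has since been removed from NumPy, so plain `int` is used here):

```python
import numpy as np

X = np.empty(10)
n_samples, n_splits = 10, 3

original = np.full(n_splits, n_samples // n_splits, dtype=int)
# full_like dispatches on the type of X, so a non-NumPy X would yield
# an array of the same library rather than a NumPy array.
generalized = np.full_like(X, n_samples // n_splits, shape=(n_splits,), dtype=int)

assert np.array_equal(original, generalized)
```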

@mrocklin

Right, so that opens up a question for Scikit-Learn maintainers of whether they're willing to accept code like this, which is less intuitive in a few places, for the case where the inputs are NumPy-like arrays.

My guess is that accepting a user-defined RandomState object is fine (it looks like there is good precedent for this already) while replacing np.arange calls with ones_like/cumsum is maybe less welcome.

@jakirkham
Contributor

There's of course the point that this syntax only works with NumPy 1.17+, which may be its own kind of deal breaker. Anyway, I figured I'd float these snippets to start the conversation.

@jakirkham
Contributor

Also there's a separate discussion about having a way to cast between different arrays in a subthread of issue ( numpy/numpy#13831 ), which is ostensibly about handling asarray without forcing array-like things to become NumPy arrays.

cc @pentschev (who's involved in that discussion and is also thinking about these things)

@amueller
Member

Yeah it'll be a while till we can use numpy 1.17.

@jakirkham
Contributor

My guess is that accepting a user-defined RandomState object is fine...

Yeah, it would be great if RandomState created the array and shuffled it for us, but I couldn't find the API we would need to do that. .shuffle does not seem to work if one does rs.shuffle(5), for example.
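For what it's worth (an observation of mine, not from the thread): `RandomState.shuffle` only mutates an existing array in place, but `RandomState.permutation(n)` does build and shuffle an arange in one call; the catch is that a NumPy RandomState still returns a NumPy array, so a cupy RandomState would be needed to keep the result on the GPU:

```python
import numpy as np

rs = np.random.RandomState(42)

# shuffle() mutates an existing array in place and returns None...
arr = np.arange(5)
rs.shuffle(arr)

# ...while permutation(n) creates the arange and shuffles it in one step.
perm = rs.permutation(5)
assert sorted(perm.tolist()) == [0, 1, 2, 3, 4]

# shuffle() with a bare int raises, matching the observation above.
try:
    rs.shuffle(5)
except TypeError:
    pass
```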

@jakirkham
Contributor

Another concern that Ben helped identify earlier is that if we want to allow a user-defined RandomState object, we'd want to make sure this passes check_random_state. Currently that means the user-defined RandomState object would need to subclass NumPy's RandomState object (please see code below). This is a bit undesirable as we will end up carrying around a Cython base class that has some irrelevant state. Not sure how well this will behave in practice either. Would it be possible to generalize this check to merely make sure the RandomState satisfies some abstract base class or has some particular methods?

    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)

@amueller
Member

@jakirkham I don't see why not.

@thomasjpfan
Member

Would it be possible to generalize this check to merely make sure the RandomState satisfies some abstract base class or has some particular methods?

Which interface do you have in mind? The interfaces for cupy.random.RandomState and np.random.RandomState are pretty extensive.

@amueller
Member

amueller commented Aug 21, 2019

We kinda only need to distinguish it from an integer or None, right? We could check for a single common method existing. If we then try to use some obscure method and it's not implemented, well, that's a reasonable error.
If it has a uniform method, it's much more likely to be something that implements sampling than an integer, I think ;)
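A sketch of what that duck-typed check could look like (hypothetical; the function name and the choice of `uniform` as the probe method are assumptions following the suggestion above, not an actual scikit-learn API):

```python
import numbers
import numpy as np

def check_random_state_ducktyped(seed):
    # Hypothetical generalization: accept any object exposing a `uniform`
    # method (e.g. cupy.random.RandomState) instead of requiring a
    # numpy.random.RandomState subclass.
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if hasattr(seed, "uniform"):
        return seed
    raise ValueError(
        "%r cannot be used to seed a random state instance" % seed
    )
```

With this, a cupy RandomState (or any sampling object) passes through unchanged, while integers and None keep their current behavior.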

@jakirkham
Contributor

cc @ogrisel (as we discussed this during EuroSciPy)

@ogrisel
Member

ogrisel commented May 13, 2022

It would be great if someone wanted to try to prototype Array API spec support in scikit-learn Cross-Validation tools in a draft PR on top of the #22554 PR.
