[QST] Updating indices creation for split #14036
Comments
I have wondered about this in the past. One complication is simply that we
have no idea how much code out there was relying on the sample being a copy
and not a view...
I think that even a conservative change like the following would be welcome: `chunk = x[slc].copy()`. In Dask at least the copy is free for us.
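To illustrate the copy-vs-view distinction being discussed, here is a minimal NumPy sketch (illustrative, not scikit-learn code): basic slicing returns a view, fancy indexing already returns a copy, and an explicit `.copy()` makes the behavior uniform however the chunk was selected.

```python
import numpy as np

x = np.arange(10)

# Basic slicing returns a view: mutating the result would mutate x.
view = x[2:5]
assert view.base is x

# Fancy indexing already returns a copy that owns its data.
fancy = x[np.array([2, 3, 4])]
assert fancy.base is None

# The conservative change discussed above: copy the chunk explicitly,
# so downstream code can never mutate the caller's array through a view.
slc = slice(2, 5)
chunk = x[slc].copy()
chunk[0] = 99
assert x[2] == 2  # the original array is untouched
```

For eager NumPy this copy costs memory bandwidth, but for a lazy library like Dask the copy is essentially free, which is the point being made above.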
Why is that beneficial then?
I guess I can see why it's beneficial to sparse matrices, so I get it.
I suspect that this is not a trivial change in the code, particularly because
of the separation of duties between the splitter and the indexing.
We could also maybe send in a

This came up in conversation with @ogrisel a couple of months ago. Also cc @jakirkham, who has been looking at related problems recently.
Yeah, with `np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1`
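As a sketch of that idea (assuming NumPy >= 1.17, where the `*_like` creation functions accept a `shape=` argument), the cumsum-of-ones trick reproduces `np.arange` while dispatching on the type of `X`:

```python
import numpy as np

n_samples = 8
X = np.empty((n_samples, 3))

# np.ones_like dispatches on type(X) (NumPy, CuPy, ...), so with a CuPy
# input the resulting index array would be a CuPy array as well.
indices = np.cumsum(np.ones_like(X, dtype="i8", shape=(n_samples,))) - 1

# For a NumPy input this is equivalent to np.arange(n_samples).
assert np.array_equal(indices, np.arange(n_samples))
```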
There's also a call to `np.full(n_splits, n_samples // n_splits, dtype=np.int)`. We might want to generalize it to something like this: `np.full_like(X, n_samples // n_splits, shape=(n_splits,), dtype=np.int)`
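A sketch comparing the two variants, using the fold-size computation from `KFold` as the running example (`np.int` is written as plain `int` here, since that alias was later deprecated in NumPy; `shape=` again assumes NumPy >= 1.17):

```python
import numpy as np

n_samples, n_splits = 10, 3
X = np.empty((n_samples, 2))

# Current style: plain np.full always produces a NumPy array.
fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
fold_sizes[: n_samples % n_splits] += 1

# Duck-typed variant: np.full_like dispatches on type(X), so a CuPy
# input would yield CuPy fold sizes.
fold_sizes_like = np.full_like(X, n_samples // n_splits,
                               shape=(n_splits,), dtype=int)
fold_sizes_like[: n_samples % n_splits] += 1

# 10 samples over 3 folds: the remainder goes to the first fold.
assert fold_sizes.tolist() == [4, 3, 3]
assert np.array_equal(fold_sizes, fold_sizes_like)
```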
Right, so that opens up a question for Scikit-Learn maintainers of whether they're willing to accept code like this that is less intuitive in a few places, for the case where the inputs are NumPy-like arrays. My guess is that accepting a user-defined RandomState object is fine (it looks like there is good precedent for this already) while replacing
There's of course the point that this syntax only works with NumPy 1.17+, which may be its own kind of deal breaker. Anyway, figured I'd float these snippets to start the conversation.
Also there's a separate discussion about having a way to cast between different arrays in a subthread of issue numpy/numpy#13831, which is ostensibly about handling. cc @pentschev (who's involved in that discussion and also thinking about these things)
Yeah, it'll be a while till we can use NumPy 1.17.
Yeah it would be great if
Another concern that Ben helped identify earlier is that if we want to allow a user-defined RandomState object, the existing validation needs to handle it (see scikit-learn/sklearn/utils/validation.py, lines 775 to 782 at 4b6273b).
@jakirkham I don't see why not. |
Which interface do you have in mind? The interfaces of `cupy.random.RandomState` and `np.random.RandomState` are pretty extensive.
We kinda only need to distinguish it from an integer or None, right? We could check for a single common method existing. If we then try to use some obscure method and it's not implemented, well, that's a reasonable error.
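A minimal sketch of that idea, assuming a hypothetical relaxed variant of scikit-learn's `check_random_state` (the function name and the choice of `randint` as the probe method are illustrative, not from this thread):

```python
import numbers
import numpy as np


def check_random_state_ducktyped(seed):
    """Hypothetical relaxation of sklearn's check_random_state: instead of
    an isinstance check against np.random.RandomState, accept any object
    exposing a common method, so e.g. a cupy.random.RandomState passes."""
    if seed is None:
        # Same global singleton sklearn's check_random_state returns.
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if hasattr(seed, "randint"):  # duck-type check instead of isinstance
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")


rs = check_random_state_ducktyped(42)
assert isinstance(rs, np.random.RandomState)
# A RandomState-like object is passed through unchanged.
assert check_random_state_ducktyped(rs) is rs
```

If an accepted object later lacks some obscure method the splitter calls, the resulting `AttributeError` is arguably a reasonable failure mode, per the comment above.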
cc @ogrisel (as we discussed this during EuroSciPy) |
It would be great if someone wanted to try to prototype Array API spec support in scikit-learn Cross-Validation tools in a draft PR on top of the #22554 PR. |
I am interested in using the GridSearchCV machinery from sklearn with non-NumPy libraries like cuDF or dask.array. In exploring this work, I think functions like `_iter_test_indices` and `_iter_test_masks` could be amended slightly to return slices instead of NumPy arrays. For example, in `_iter_test_indices` (scikit-learn/sklearn/model_selection/_split.py, lines 423 to 436 at ec35ed2): if there is a shuffle, much of the code could be encapsulated in the shuffle conditional, and when there is no shuffle, we can yield the appropriate slice object: `slice(0, 71)` ...

I wanted to check here before proceeding with the work, and to ask whether folks have any advice before I begin.
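A rough sketch of what the no-shuffle path could look like if it yielded slices; the helper name is hypothetical, and the fold-size arithmetic follows the `KFold._iter_test_indices` logic referenced above:

```python
import numpy as np


def iter_test_slices(n_samples, n_splits):
    """Hypothetical no-shuffle variant of KFold._iter_test_indices that
    yields slice objects instead of NumPy index arrays. Slices select
    views (cheap, library-agnostic) rather than materializing indices."""
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    fold_sizes[: n_samples % n_splits] += 1
    current = 0
    for fold_size in fold_sizes:
        yield slice(current, current + fold_size)
        current += fold_size


# 355 samples over 5 folds: each test fold spans 71 contiguous samples.
folds = list(iter_test_slices(355, 5))
assert folds[0] == slice(0, 71)
assert folds[-1] == slice(284, 355)
```

Because a `slice` is plain Python, any array library (NumPy, CuPy, dask.array, cuDF) can consume it directly, whereas a NumPy fancy-index array forces a device transfer or materialization for non-NumPy inputs.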