-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FEA Add writeable parameter to check_array #29018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.
sklearn/tests/test_common.py
Outdated
_set_checking_parameters(estimator) | ||
|
||
# The following estimators can work inplace only with certain settings | ||
if estimator.__class__.__name__ == "HDBSCAN": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are in our testing code sklearn/tests/test_common.py
can we use isinstance
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the class name because it's what we use in _set_checking_parameters(estimator)
. Otherwise we have to import all all estimators for which we want to set a specific param which feels a bit cumbersome.
I modified the behavior a bit. Previously I tried to change the writeability without copy and only copy if it failed. I felt that it was not a healthy behavior for the user. If a user calls Arguably, we could raise an error in the ambiguous |
I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR: import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False) # dataframe backed by a read-only array
LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation |
When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.
Alright, I'm going to add the new kwarg in estimators. In the mean time, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and update the PR description with corresponding links.
I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6. |
@thomasjpfan, given that #29018 (comment) is a regression that impacts all estimators, has been reported by other users (e.g. #29182), and is not fixed by #29103, I now think that we should release it in 1.5.1 despite adding a new public param to check_array. |
sklearn/utils/validation.py
Outdated
@@ -1187,6 +1207,13 @@ def check_X_y( | |||
Whether an array will be forced to be fortran or c-style. If | |||
`None`, then the input data's order is preserved when possible. | |||
|
|||
writeable : True or None, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think accepting True
but not False
is a bit weird. Can we use more descriptive names? For example:
writeable : True or None, default=None | |
writeable : "force" or "preserve", default="preserve" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's a bit weird. This is because I wanted to give it a similar status as order
and dtype
. I think a better option would have been ensure_writeable = True/False
but the existing parameters with this naming pattern (ensure_xxx) are just to enable or disable a check, not to act on the output array.
Maybe we could go with force_writeable = True/False
or make_writeable=True/False
. There's already force_all_finite
that is just to enable a check but arguably it should be ensure_all_finite
. What do you think ? I think my preference is force_writeable
and rename force_all_finite
into ensure_all_finite
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, I like force_writeable
.
Can you open an issue on ensure_all_finite
vs force_all_finite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I made the change and opened #29262
seems to workaround of: scikit-learn/scikit-learn#29018 failing otherwise with ~: python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array array.flags.writeable = True
…k-array-writeable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I triggered the array API tests for CUDA GPU on this branch (after an update with https://github.com/scikit-learn/scikit-learn/actions/runs/9568320710 EDIT: there are failing tests with CuPy:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pass of feedback.
@@ -2848,6 +2865,7 @@ def _check_inputs(self, X, in_fit, accept_sparse_negative=False, copy=False): | |||
accept_sparse="csc", | |||
copy=copy, | |||
dtype=FLOAT_DTYPES, | |||
force_writeable=True if not in_fit else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would deserve an inline comment to explain why force_writeable
is needed only in transform
.
out = check_array(df, copy=False, force_writeable=True) | ||
# df is backed by a read-only array, a copy is made | ||
assert not np.may_share_memory(out, df) | ||
assert out.flags.writeable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a similar test for array API inputs using the generic yield_namespace_device_dtype_combinations
helper.
The array API states the following: https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html
So maybe we could just raise an exception when copy is False and force_writeable and not _is_numpy_namespace(xp)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide that we should not raise an exception in that case for some reason (e.g. by always triggering a copy for safety?), then we should have a dedicated test such as:
xp = pytest.importorskip("array_api_strict")
X_np = np.random.uniform(size=(10, 10))
X_np.flags.writeable = False
X_np_copy = X_np.copy()
X_xp = xp.asarray(X_np)
with sklearn.config_context(array_api_dispatch=True):
X_xp_checked = check_array(X_xp, copy=False, force_writeable=True)
out_ns, is_array_api = get_namespace(X_xp_checked)
assert is_array_api
assert out_ns == xp
assert_allclose(_convert_to_numpy(X_xp_checked), X_np)
X_xp_checked[:] = 0
assert_allclose(X_np, X_np_copy)
And maybe something similar for PyTorch. I am not sure if it's possible to create readonly PyTorch tensors. On CPU it might be possible with memmaping? EDIT: I experimented with joblib.load("serialized_pytorch_cpu_tensor.pkl", mmap_mode="r")
and I don't think it's possible: the result is a writeable PyTorch tensor.
However I am not sure this is what we want...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array API draft spec for the 2024 version was updated to mention readonly flags exposed by the DLPack interchange protocol:
https://github.com/data-apis/array-api/pull/749/files
However numpy 1.16.6 does not support this (yet) and raises instead...
In [1]: import numpy as np
In [2]: a = np.random.randn(10)
In [3]: a.flags.writeable = False
In [4]: a.__dlpack__()
---------------------------------------------------------------------------
BufferError Traceback (most recent call last)
Cell In[4], line 1
----> 1 a.__dlpack__()
BufferError: Cannot export readonly array since signalling readonly is unsupported by DLPack.
maybe we can reconsider inspecting __dlpack__
attributes later, once it's more widely adopted by libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In retrospect, I think I would be in favor of raising an exception as first suggested in https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1644733447.
However to keep that PR minimally focused on the changes actually needed to fix the blocking bugs for 1.5.1, we can defer the new array API specific tests and the exception to a dedicated follow-up PR for 1.6 and only fix:
https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645720337
as part of the current PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, let's do that in a follow-up PR, the current fix should be enough for 1.5.1
sklearn/utils/validation.py
Outdated
if force_writeable: | ||
array_data = array.data if sp.issparse(array) else array | ||
copy_params = {"order": "K"} if not sp.issparse(array) else {} | ||
if hasattr(array_data, "flags") and not array_data.flags.writeable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if hasattr(array_data, "flags") and not array_data.flags.writeable: | |
flags = getattr(array_data, "flags", None) | |
if not getattr(flags, "writeable", True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we need to trigger the array API tests again after this fix to be safe ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me do that here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failures seem unrelated to this PR. We need to check if they also occur on main
:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed, those 8 failures are not related to this specific PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than fixing the existing array API tests (https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645849964) and other small details in the previous review, LGTM.
…k-array-writeable
I merged too quickly, we now get:
on |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
closes #28824
closes #14481
closes #29103
Fixes #28899
Fixes #29182
This PR proposes to add a
writeable
parameter tocheck_array
. It acts as a toggle: it can be True meaning it's set or None (unset). If True,check_array
will make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.It has the same status as the
dtype
ororder
parameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.Writeable arrays are required in estimators that can perform inplace operations. These estimators have a
copy
orcopy_X
parameter and currently they raise an error ifcopy=False
and X is read-only. This behavior seems expected of rthe basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, in can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step which triggers an (unexpected to me) error, the last one being #28781.This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. And it happens even if the array is aleardy writeable, just trying to set the flag is not allowed. That's what happens in #28899.
I added a common test, which for now fails as expected for most estimators because I haven't applied the
writeable
param to all inplace estimators, so it shows the current behavior. A few of them still pass:writeable
in this PRwriteable
in this PRwriteable
in this PRcopy=False
so needs investigation (Fix Make centering inplace in KernelPCA #29100)copy=False
so needs investigation(I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices which goes against our splitter doc, see doc, so we could deprecate the copy param or leave it as is because we don't want to put effort on this estimator and don't want to break user code)
cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)