Skip to content

FEA Add writeable parameter to check_array #29018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Jun 20, 2024

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented May 14, 2024

closes #28824
closes #14481
closes #29103
Fixes #28899
Fixes #29182

This PR proposes to add a writeable parameter to check_array. It acts as a toggle: it can be True meaning it's set or None (unset). If True, check_array will make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.

It has the same status as the dtype or order parameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.

Writeable arrays are required in estimators that can perform inplace operations. These estimators have a copy or copy_X parameter and currently they raise an error if copy=False and X is read-only. This behavior seems expected of rthe basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, in can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step which triggers an (unexpected to me) error, the last one being #28781.

This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. And it happens even if the array is aleardy writeable, just trying to set the flag is not allowed. That's what happens in #28899.

I added a common test, which for now fails as expected for most estimators because I haven't applied the writeable param to all inplace estimators, so it shows the current behavior. A few of them still pass:

  • AffinityPropagation: already applied writeable in this PR
  • FactorAnalysis: already applied writeable in this PR
  • SimpleImputer: already applied writeable in this PR
  • Birch: doesn't perform inplace operations so the param copy should be deprecated (Deprecate copy in Birch #29092)
  • TheilSenRegressor: doesn't perform inplace operations so the copy param should be deprecated (Deprecate copy_X in TheilSenRegressor #29098)
  • KernelPCA: seems to always make a copy even if copy=False so needs investigation (Fix Make centering inplace in KernelPCA #29100)
  • OrthogonalMatchingPursuitCV: seems to always make a copy even if copy=False so needs investigation
    (I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices which goes against our splitter doc, see doc, so we could deprecate the copy param or leave it as is because we don't want to put effort on this estimator and don't want to break user code)

cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)

Copy link

github-actions bot commented May 14, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 19dfb21. Link to the linter CI: here

@glemaitre glemaitre self-assigned this May 16, 2024
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.

_set_checking_parameters(estimator)

# The following estimators can work inplace only with certain settings
if estimator.__class__.__name__ == "HDBSCAN":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are in our testing code sklearn/tests/test_common.py can we use isinstance here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the class name because it's what we use in _set_checking_parameters(estimator). Otherwise we have to import all all estimators for which we want to set a specific param which feels a bit cumbersome.

@glemaitre glemaitre removed their assignment May 20, 2024
@glemaitre glemaitre self-requested a review May 20, 2024 13:35
@jeremiedbb
Copy link
Member Author

I modified the behavior a bit. Previously I tried to change the writeability without copy and only copy if it failed. I felt that it was not a healthy behavior for the user. If a user calls StandardScaler(copy=False).fit_transform(X) where X is read-only, it's an ambiguous situation and it should not be scikit-learn's role to decide which path to chose. So now we always make a copy in that situation. There's one exception that comes from #28348, where it's a intermediate array that is created in read-only mode.

Arguably, we could raise an error in the ambiguous copy=False + read-only situation, but I think it's better to not raise and make estimators work in that case because intermediate arrays in complex pipelines can be read-only due to joblib's auto-memmapping (last time it happened was #28781). There are always workarounds but with this PR, we should not have to worry about this issue in the future.

@jeremiedbb
Copy link
Member Author

jeremiedbb commented May 23, 2024

I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False)  # dataframe backed by a read-only array

LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation

@jeremiedbb
Copy link
Member Author

When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

@jeremiedbb
Copy link
Member Author

jeremiedbb commented May 24, 2024

Alright, I'm going to add the new kwarg in estimators. In the mean time, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and update the PR description with corresponding links.

This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6.

Edited: quick fix for #28899 in #29103.

@jeremiedbb
Copy link
Member Author

@thomasjpfan, given that #29018 (comment) is a regression that impacts all estimators, has been reported by other users (e.g. #29182), and is not fixed by #29103, I now think that we should release it in 1.5.1 despite adding a new public param to check_array.

@@ -1187,6 +1207,13 @@ def check_X_y(
Whether an array will be forced to be fortran or c-style. If
`None`, then the input data's order is preserved when possible.

writeable : True or None, default=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think accepting True but not False is a bit weird. Can we use more descriptive names? For example:

Suggested change
writeable : True or None, default=None
writeable : "force" or "preserve", default="preserve"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's a bit weird. This is because I wanted to give it a similar status as order and dtype. I think a better option would have been ensure_writeable = True/False but the existing parameters with this naming pattern (ensure_xxx) are just to enable or disable a check, not to act on the output array.

Maybe we could go with force_writeable = True/False or make_writeable=True/False. There's already force_all_finite that is just to enable a check but arguably it should be ensure_all_finite. What do you think ? I think my preference is force_writeable and rename force_all_finite into ensure_all_finite later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR, I like force_writeable.

Can you open an issue on ensure_all_finite vs force_all_finite?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I made the change and opened #29262

tarilabs added a commit to tarilabs/demo20240608-KFcomponents that referenced this pull request Jun 9, 2024
seems to workaround of:
scikit-learn/scikit-learn#29018

failing otherwise with ~:
python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array array.flags.writeable = True
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ogrisel
Copy link
Member

ogrisel commented Jun 18, 2024

I triggered the array API tests for CUDA GPU on this branch (after an update with main) as I suspect that it might not be 100% neutral for non-numpy inputs:

https://github.com/scikit-learn/scikit-learn/actions/runs/9568320710

EDIT: there are failing tests with CuPy:

>           if hasattr(array_data, "flags") and not array_data.flags.writeable:
E           AttributeError: 'Flags' object has no attribute 'writeable'

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pass of feedback.

@@ -2848,6 +2865,7 @@ def _check_inputs(self, X, in_fit, accept_sparse_negative=False, copy=False):
accept_sparse="csc",
copy=copy,
dtype=FLOAT_DTYPES,
force_writeable=True if not in_fit else None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would deserve an inline comment to explain why force_writeable is needed only in transform.

out = check_array(df, copy=False, force_writeable=True)
# df is backed by a read-only array, a copy is made
assert not np.may_share_memory(out, df)
assert out.flags.writeable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a similar test for array API inputs using the generic yield_namespace_device_dtype_combinations helper.

The array API states the following: https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html

So maybe we could just raise an exception when copy is False and force_writeable and not _is_numpy_namespace(xp).

Copy link
Member

@ogrisel ogrisel Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide that we should not raise an exception in that case for some reason (e.g. by always triggering a copy for safety?), then we should have a dedicated test such as:

xp = pytest.importorskip("array_api_strict")
X_np = np.random.uniform(size=(10, 10))
X_np.flags.writeable = False
X_np_copy = X_np.copy()

X_xp = xp.asarray(X_np)
with sklearn.config_context(array_api_dispatch=True):
    X_xp_checked = check_array(X_xp, copy=False, force_writeable=True)
    out_ns, is_array_api = get_namespace(X_xp_checked)
    assert is_array_api
    assert out_ns == xp
    assert_allclose(_convert_to_numpy(X_xp_checked), X_np)
    X_xp_checked[:] = 0

assert_allclose(X_np, X_np_copy)

And maybe something similar for PyTorch. I am not sure if it's possible to create readonly PyTorch tensors. On CPU it might be possible with memmaping? EDIT: I experimented with joblib.load("serialized_pytorch_cpu_tensor.pkl", mmap_mode="r") and I don't think it's possible: the result is a writeable PyTorch tensor.

However I am not sure this is what we want...

Copy link
Member

@ogrisel ogrisel Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array API draft spec for the 2024 version was updated to mention readonly flags exposed by the DLPack interchange protocol:

https://github.com/data-apis/array-api/pull/749/files

However numpy 1.16.6 does not support this (yet) and raises instead...

In [1]: import numpy as np

In [2]: a = np.random.randn(10)

In [3]: a.flags.writeable = False

In [4]: a.__dlpack__()
---------------------------------------------------------------------------
BufferError                               Traceback (most recent call last)
Cell In[4], line 1
----> 1 a.__dlpack__()

BufferError: Cannot export readonly array since signalling readonly is unsupported by DLPack.

maybe we can reconsider inspecting __dlpack__ attributes later, once it's more widely adopted by libraries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In retrospect, I think I would be in favor of raising an exception as first suggested in https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1644733447.

However to keep that PR minimally focused on the changes actually needed to fix the blocking bugs for 1.5.1, we can defer the new array API specific tests and the exception to a dedicated follow-up PR for 1.6 and only fix:

https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645720337

as part of the current PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, let's do that in a follow-up PR, the current fix should be enough for 1.5.1

if force_writeable:
array_data = array.data if sp.issparse(array) else array
copy_params = {"order": "K"} if not sp.issparse(array) else {}
if hasattr(array_data, "flags") and not array_data.flags.writeable:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if hasattr(array_data, "flags") and not array_data.flags.writeable:
flags = getattr(array_data, "flags", None)
if not getattr(flags, "writeable", True):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need to trigger the array API tests again after this fix to be safe ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failures seem unrelated to this PR. We need to check if they also occur on main:

Copy link
Member

@ogrisel ogrisel Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed, those 8 failures are not related to this specific PR.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than fixing the existing array API tests (https://github.com/scikit-learn/scikit-learn/pull/29018/files#r1645849964) and other small details in the previous review, LGTM.

@ogrisel ogrisel merged commit ef6efef into scikit-learn:main Jun 20, 2024
30 checks passed
@ogrisel
Copy link
Member

ogrisel commented Jun 21, 2024

I merged too quickly, we now get:

FAILED tests/test_common.py::test_check_inplace_ensure_writeable[KernelPCA()] - ValueError: output array is read-only

on main.

jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb added a commit that referenced this pull request Jul 2, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants