Skip to content

FIX change writeability only if not already writeable #29103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jeremiedbb
Copy link
Member

Fixes #28899

This is a quick and easy fix for the regression reported in #28899. The issue is that setting the flag is not always possible, even if the new value is the same as the old value, which is the case there.

I haven't been able to make a simpler reproducer and I'm not sure that we want to add a test using this complex reproducer, but maybe we don't need a test given that the fix is pretty straightforward.

Note that the lines of code involved cause other issues that can't be fixed as easily, see #29018 (comment). For that, there's ongoing work in #29018, but the changes are bigger and may not fit in a bug-fix release.

@jeremiedbb jeremiedbb added the Quick Review For PRs that are quick to review label May 24, 2024
Copy link

github-actions bot commented May 24, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 226e1f7. Link to the linter CI: here

@OmarManzoor
Copy link
Contributor

OmarManzoor commented May 29, 2024

Hi @jeremiedbb

I think this can be reproduced without the ProcessPoolExecutor part. The code in the details produces this output:

Traceback (most recent call last):
  File "/Users/omar.salman/PycharmProjects/pythonProject/pandas_error.py", line 37, in main
    regressor.fit(X_train, y_train)
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1289, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1299, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


class MyManager(BaseManager):
    pass


def main():
    MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception as _:
        print(traceback.format_exc())
    return regressor


if __name__ == '__main__':
    main()

Could this have something to do with pandas.Series objects being loaded from a read only buffer which is why we can't set the flags? Consider the following code which can also reproduce the above exception

import traceback

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train_df = pd.Series(rng.random(n_samples))
    y_filename = "y_train.pkl"
    joblib.dump(y_train_df, y_filename)
    y_train = joblib.load(y_filename, mmap_mode="r")
    regressor = LinearRegression()
    try:
        regressor.fit(X_train, y_train)
    except Exception as _:
        print(traceback.format_exc())
    return regressor


if __name__ == '__main__':
    main()

@jeremiedbb
Copy link
Member Author

@OmarManzoor it's more subtle. In your reproducer, the Series is backed by a read-only buffer, so the array is read-only and we try to make it writeable which is not possible. In the original reproducer, the array is already writeable, and just the attempt to szet the writeable flag to the same value it already has (i.e. True) fails.

@jeremiedbb
Copy link
Member Author

For the more general issue regarding arrays/dataframes/series backed by read-only buffers, I opened a different PR, #29018.

@OmarManzoor
Copy link
Contributor

@OmarManzoor it's more subtle. In your reproducer, the Series is backed by a read-only buffer, so the array is read-only and we try to make it writeable which is not possible. In the original reproducer, the array is already writeable, and just the attempt to szet the writeable flag to the same value it already has (i.e. True) fails.

Isn't it possible that the array in the original reproducer also comes from a read only buffer because of the memory manager and parallel processes involved.

@jeremiedbb
Copy link
Member Author

jeremiedbb commented Jun 4, 2024

I looked at the array flags just before calling

if _is_pandas_df_or_series(array_orig) and hasattr(array, "flags"):
array.flags.writeable = True
, and the writeable flag is True in the original reproducer.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a test?

@OmarManzoor
Copy link
Contributor

OmarManzoor commented Jun 4, 2024

@jeremiedbb

The writeable flag is also set to True in the code snippet that I shared above. Here is the output of the debugger on the line before the error
occurs

if _is_pandas_df_or_series(array_orig) and hasattr(array, "flags"):
hasattr(array, "flags")
Out[2]: True
array.flags
Out[3]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

@jeremiedbb
Copy link
Member Author

jeremiedbb commented Jun 4, 2024

@OmarManzoor, Are you sure it's not the check_array of X (this one is not backed by a read-only buffer) ? because when I do the same thing, I get

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False

for the call of check_array on y

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@OmarManzoor
Copy link
Contributor

Are you sure it's not the check_array of X

Yes I made the mistake of checking the first call which is for X. For y its False as you stated.

if (
_is_pandas_df_or_series(array_orig)
and hasattr(array, "flags")
and not array.flags.writeable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the array is read from a read only buffer such that its writeable is expected to be False as we discussed, won't this be an issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is and I opened #29018 to treat this case correctly for arrays and dataframes. See in particular this #29018 (comment) where I show that our current code fails for a very simple use case.

Copy link
Contributor

@OmarManzoor OmarManzoor Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try this or will this result in other issues or have unnecessary memory problems?

try:
    array.flags.writeable = True
except ValueError:
    array = array.copy()
    array.flags.writeable = True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ruled this out because I think that we should not change the writeability of the user's input data to prevent catastrophic mistakes. Let's discuss this kind of questions in #29018.

Copy link
Contributor

@OmarManzoor OmarManzoor Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I had a look at the other PR and some of the associated discussion. Since the actual solution is being implemented there, I think this may be okay as a temporary workaround. Maybe we could raise an exception with a more informative message for the case where the array is from a read only buffer?

@jeremiedbb
Copy link
Member Author

Given that #29018 fixes more than what this PR does and that I know think that we should include it in 1.5.1, let's leave this one for now. I'm leaving it open in the mean time in case we finally don't want to include #29018 in 1.5.1 but let's move the reviews and discussions there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:utils Quick Review For PRs that are quick to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation step fails when using shared memory with multiprocessing.managers.BaseManager
3 participants