-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX change writeability only if not already writeable #29103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX change writeability only if not already writeable #29103
Conversation
Hi @jeremiedbb I think this can be reproduced without the ProcessPoolExecutor part. The code in the details produces this output: Traceback (most recent call last):
File "/Users/omar.salman/PycharmProjects/pythonProject/pandas_error.py", line 37, in main
regressor.fit(X_train, y_train)
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
X, y = self._validate_data(
^^^^^^^^^^^^^^^^^^^^
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1289, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1299, in _check_y
y = check_array(
^^^^^^^^^^^^
File "/Users/omar.salman/.pyenv/versions/test-env/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array
array.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array from multiprocessing.managers import BaseManager
import traceback
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
class MemoryDataset:
def __init__(self):
self._ds = None
def save(self, ds):
self._ds = ds
def load(self):
return self._ds
class MyManager(BaseManager):
pass
def main():
MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))
rng = np.random.default_rng()
n_samples = 1000
X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
y_train = pd.Series(rng.random(n_samples))
manager = MyManager()
manager.start()
dataset = manager.MemoryDataset()
dataset.save((X_train, y_train))
regressor = LinearRegression()
X_train, y_train = dataset.load()
try:
regressor.fit(X_train, y_train)
except Exception as _:
print(traceback.format_exc())
return regressor
if __name__ == '__main__':
main() Could this have something to do with pandas.Series objects being loaded from a read only buffer which is why we can't set the flags? Consider the following code which can also reproduce the above exception import traceback
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
def main():
rng = np.random.default_rng()
n_samples = 1000
X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
y_train_df = pd.Series(rng.random(n_samples))
y_filename = "y_train.pkl"
joblib.dump(y_train_df, y_filename)
y_train = joblib.load(y_filename, mmap_mode="r")
regressor = LinearRegression()
try:
regressor.fit(X_train, y_train)
except Exception as _:
print(traceback.format_exc())
return regressor
if __name__ == '__main__':
main() |
@OmarManzoor it's more subtle. In your reproducer, the Series is backed by a read-only buffer, so the array is read-only and we try to make it writeable which is not possible. In the original reproducer, the array is already writeable, and just the attempt to szet the writeable flag to the same value it already has (i.e. True) fails. |
For the more general issue regarding arrays/dataframes/series backed by read-only buffers, I opened a different PR, #29018. |
Isn't it possible that the array in the original reproducer also comes from a read only buffer because of the memory manager and parallel processes involved. |
I looked at the array flags just before calling scikit-learn/sklearn/utils/validation.py Lines 1103 to 1104 in 0878f91
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a test?
The writeable flag is also set to if _is_pandas_df_or_series(array_orig) and hasattr(array, "flags"): hasattr(array, "flags")
Out[2]: True
array.flags
Out[3]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False |
@OmarManzoor, Are you sure it's not the check_array of X (this one is not backed by a read-only buffer) ? because when I do the same thing, I get
for the call of check_array on y |
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Yes I made the mistake of checking the first call which is for X. For y its False as you stated. |
if ( | ||
_is_pandas_df_or_series(array_orig) | ||
and hasattr(array, "flags") | ||
and not array.flags.writeable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the array is read from a read only buffer such that its writeable is expected to be False as we discussed, won't this be an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is and I opened #29018 to treat this case correctly for arrays and dataframes. See in particular this #29018 (comment) where I show that our current code fails for a very simple use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try this or will this result in other issues or have unnecessary memory problems?
try:
array.flags.writeable = True
except ValueError:
array = array.copy()
array.flags.writeable = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ruled this out because I think that we should not change the writeability of the user's input data to prevent catastrophic mistakes. Let's discuss this kind of questions in #29018.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I had a look at the other PR and some of the associated discussion. Since the actual solution is being implemented there, I think this may be okay as a temporary workaround. Maybe we could raise an exception with a more informative message for the case where the array is from a read only buffer?
Fixes #28899
This is a quick and easy fix for the regression reported in #28899. The issue is that setting the flag is not always possible, even if the new value is the same as the old value, which is the case there.
I haven't been able to make a simpler reproducer and I'm not sure that we want to add a test using this complex reproducer, but maybe we don't need a test given that the fix is pretty straightforward.
Note that the lines of code involved cause other issues that can't be fixed as easily, see #29018 (comment). For that, there's ongoing work in #29018, but the changes are bigger and may not fit in a bug-fix release.