Skip to content

Pandas Copy-on-Write mode should be enabled in all tests #27879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
s-banach opened this issue Nov 30, 2023 · 9 comments
Open

Pandas Copy-on-Write mode should be enabled in all tests #27879

s-banach opened this issue Nov 30, 2023 · 9 comments
Labels

Comments

@s-banach
Copy link
Contributor

s-banach commented Nov 30, 2023

Describe the bug

Pandas COW will be enabled by default in version 3.0.
For example, today I just found that TargetEncoder doesn't work properly with it enabled.
There are probably many other examples that could be uncovered by testing.

Steps/Code to Reproduce

import pandas as pd
from sklearn.preprocessing import TargetEncoder
pd.options.mode.copy_on_write = True

df = pd.DataFrame({
    "x": ["a", "b", "c", "c"],
    "y": [4., 5., 6., 7.]
})
t = TargetEncoder(target_type="continuous")
t.fit(df[["x"]], df["y"])

Expected Results

No error.

Actual Results

ValueError                                Traceback (most recent call last)
Cell In[2], line 10
      5 df = pd.DataFrame({
      6     "x": ["a", "b", "c", "c"],
      7     "y": [4., 5., 6., 7.]
      8 })
      9 t = TargetEncoder(target_type="continuous")
---> 10 t.fit(df[["x"]], df["y"])

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1145     estimator._validate_params()
   1147 with config_context(
   1148     skip_parameter_validation=(
   1149         prefer_skip_nested_validation or global_skip_validation
   1150     )
   1151 ):
-> 1152     return fit_method(estimator, *args, **kwargs)

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/preprocessing/_target_encoder.py:203, in TargetEncoder.fit(self, X, y)
    186 @_fit_context(prefer_skip_nested_validation=True)
    187 def fit(self, X, y):
    188     """Fit the :class:`TargetEncoder` to X and y.
    189 
    190     Parameters
   (...)
    201         Fitted encoder.
    202     """
--> 203     self._fit_encodings_all(X, y)
    204     return self

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/preprocessing/_target_encoder.py:332, in TargetEncoder._fit_encodings_all(self, X, y)
    330 if self.smooth == "auto":
    331     y_variance = np.var(y)
--> 332     self.encodings_ = _fit_encoding_fast_auto_smooth(
    333         X_ordinal, y, n_categories, self.target_mean_, y_variance
    334     )
    335 else:
    336     self.encodings_ = _fit_encoding_fast(
    337         X_ordinal, y, n_categories, self.smooth, self.target_mean_
    338     )

File sklearn/preprocessing/_target_encoder_fast.pyx:82, in sklearn.preprocessing._target_encoder_fast._fit_encoding_fast_auto_smooth()

File stringsource:660, in View.MemoryView.memoryview_cwrapper()

File stringsource:350, in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

Versions

System:
    python: 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]
executable: /home/jhopfens/.conda/envs/jhop311/bin/python
   machine: Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17

Python dependencies:
      sklearn: 1.3.2
          pip: 23.0.1
   setuptools: 67.6.1
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: 3.0.0
       pandas: 2.1.0
   matplotlib: 3.7.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 128

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64
@s-banach s-banach added Bug Needs Triage Issue requires triage labels Nov 30, 2023
@glemaitre
Copy link
Member

Thanks @s-banach. It should be linked to: https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays

We should probably modify check_array but I don't know yet what the fix (if we make the array writeable or we make a copy)

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Nov 30, 2023
@s-banach
Copy link
Contributor Author

Is it a Cython thing where you can just add const or something?
I don't write cython, I'm just seeing that something similar was done here:
#25341

@s-banach
Copy link
Contributor Author

s-banach commented Nov 30, 2023

Okay, I see this issue: cython/cython#5230
I guess this is why we haven't fixed it yet.

In _target_encoder_fast.pyx, I believe it would suffice to replace both instances of Y_DTYPE[:] y with const Y_DTYPE[:] y.
This works if Y_DTYPE is a fused type of builtin cython types like float, double, int, long.
But cython currently fails to compile if we import type aliases like float64_t from some other file.

@glemaitre
Copy link
Member

Yep the signature is expecting something that is not read-only in the cython side. But I don't know if the estimator expect read-only data or it is expected to change the data inplace. I mean this is something that we will have to check for all estimators with the option activated.

@s-banach
Copy link
Contributor Author

s-banach commented Nov 30, 2023

Oh, yeah. I would hope that the only public-api estimators that should ever try to modify data inplace would be ones with an explicit copy=False option.

P.S. What I mean is that I have built sklearn from source with const added to the cython code, and it appears to have fixed the bug.

@lesteve
Copy link
Member

lesteve commented Dec 5, 2023

FYI, I searched and according to this pandas 3.0 is expected around April 2024.

@lesteve
Copy link
Member

lesteve commented Dec 5, 2023

A quick and dirty attempt seems to highlight only two failing tests in our test suite:

FAILED sklearn/impute/tests/test_impute.py::test_simple_impute_pd_na - ValueError: assignment destination is read-only
FAILED sklearn/tests/test_common.py::test_pandas_column_name_consistency[KernelDensity()] - ValueError: buffer source array is read-only

In particular it seems like the TargetEncoder issue reported in the top post is not covered by our tests (I think because only the default parameter is tested in the common tests).

@s-banach
Copy link
Contributor Author

s-banach commented Dec 5, 2023

Re pandas testing, I think there’s an option to output dataframes from transformers which is also not tested. It fails when the output would be a scipy matrix.
Edit: Invalid complaint.

@lesteve
Copy link
Member

lesteve commented Dec 6, 2023

Re pandas testing, I think there’s an option to output dataframes from transformers which is also not tested. It fails when the output would be a scipy matrix.

If you find the time, it would be great if you create a separate issue about this with a bit more details 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants