Pandas Copy-on-Write mode should be enabled in all tests #27879

s-banach · 2023-11-30T16:54:28Z

Describe the bug

Pandas COW will be enabled by default in version 3.0.
For example, today I just found that TargetEncoder doesn't work properly with it enabled.
There are probably many other examples that could be uncovered by testing.

Steps/Code to Reproduce

import pandas as pd
from sklearn.preprocessing import TargetEncoder
pd.options.mode.copy_on_write = True

df = pd.DataFrame({
    "x": ["a", "b", "c", "c"],
    "y": [4., 5., 6., 7.]
})
t = TargetEncoder(target_type="continuous")
t.fit(df[["x"]], df["y"])

Expected Results

No error.

Actual Results

ValueError                                Traceback (most recent call last)
Cell In[2], line 10
      5 df = pd.DataFrame({
      6     "x": ["a", "b", "c", "c"],
      7     "y": [4., 5., 6., 7.]
      8 })
      9 t = TargetEncoder(target_type="continuous")
---> 10 t.fit(df[["x"]], df["y"])

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1145     estimator._validate_params()
   1147 with config_context(
   1148     skip_parameter_validation=(
   1149         prefer_skip_nested_validation or global_skip_validation
   1150     )
   1151 ):
-> 1152     return fit_method(estimator, *args, **kwargs)

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/preprocessing/_target_encoder.py:203, in TargetEncoder.fit(self, X, y)
    186 @_fit_context(prefer_skip_nested_validation=True)
    187 def fit(self, X, y):
    188     """Fit the :class:`TargetEncoder` to X and y.
    189 
    190     Parameters
   (...)
    201         Fitted encoder.
    202     """
--> 203     self._fit_encodings_all(X, y)
    204     return self

File ~/.conda/envs/jhop311/lib/python3.11/site-packages/sklearn/preprocessing/_target_encoder.py:332, in TargetEncoder._fit_encodings_all(self, X, y)
    330 if self.smooth == "auto":
    331     y_variance = np.var(y)
--> 332     self.encodings_ = _fit_encoding_fast_auto_smooth(
    333         X_ordinal, y, n_categories, self.target_mean_, y_variance
    334     )
    335 else:
    336     self.encodings_ = _fit_encoding_fast(
    337         X_ordinal, y, n_categories, self.smooth, self.target_mean_
    338     )

File sklearn/preprocessing/_target_encoder_fast.pyx:82, in sklearn.preprocessing._target_encoder_fast._fit_encoding_fast_auto_smooth()

File stringsource:660, in View.MemoryView.memoryview_cwrapper()

File stringsource:350, in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

Versions

System:
    python: 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]
executable: /home/jhopfens/.conda/envs/jhop311/bin/python
   machine: Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17

Python dependencies:
      sklearn: 1.3.2
          pip: 23.0.1
   setuptools: 67.6.1
        numpy: 1.25.2
        scipy: 1.11.2
       Cython: 3.0.0
       pandas: 2.1.0
   matplotlib: 3.7.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 128

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/jhopfens/.conda/envs/jhop311/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 64

The text was updated successfully, but these errors were encountered:

glemaitre · 2023-11-30T17:13:44Z

Thanks @s-banach. It should be linked to: https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays

We should probably modify check_array but I don't know yet what the fix (if we make the array writeable or we make a copy)

s-banach · 2023-11-30T19:06:57Z

Is it a Cython thing where you can just add const or something?
I don't write cython, I'm just seeing that something similar was done here:
#25341

s-banach · 2023-11-30T20:59:31Z

Okay, I see this issue: cython/cython#5230
I guess this is why we haven't fixed it yet.

In _target_encoder_fast.pyx, I believe it would suffice to replace both instances of Y_DTYPE[:] y with const Y_DTYPE[:] y.
This works if Y_DTYPE is a fused type of builtin cython types like float, double, int, long.
But cython currently fails to compile if we import type aliases like float64_t from some other file.

glemaitre · 2023-11-30T21:37:09Z

Yep the signature is expecting something that is not read-only in the cython side. But I don't know if the estimator expect read-only data or it is expected to change the data inplace. I mean this is something that we will have to check for all estimators with the option activated.

s-banach · 2023-11-30T21:43:24Z

Oh, yeah. I would hope that the only public-api estimators that should ever try to modify data inplace would be ones with an explicit copy=False option.

P.S. What I mean is that I have built sklearn from source with const added to the cython code, and it appears to have fixed the bug.

lesteve · 2023-12-05T16:42:57Z

FYI, I searched and according to this pandas 3.0 is expected around April 2024.

lesteve · 2023-12-05T17:18:42Z

A quick and dirty attempt seems to highlight only two failing tests in our test suite:

FAILED sklearn/impute/tests/test_impute.py::test_simple_impute_pd_na - ValueError: assignment destination is read-only
FAILED sklearn/tests/test_common.py::test_pandas_column_name_consistency[KernelDensity()] - ValueError: buffer source array is read-only

In particular it seems like the TargetEncoder issue reported in the top post is not covered by our tests (I think because only the default parameter is tested in the common tests).

s-banach · 2023-12-05T17:24:24Z

~~Re pandas testing, I think there’s an option to output dataframes from transformers which is also not tested. It fails when the output would be a scipy matrix.~~
Edit: Invalid complaint.

lesteve · 2023-12-06T09:21:52Z

Re pandas testing, I think there’s an option to output dataframes from transformers which is also not tested. It fails when the output would be a scipy matrix.

If you find the time, it would be great if you create a separate issue about this with a bit more details 🙏

s-banach added Bug Needs Triage Issue requires triage labels Nov 30, 2023

glemaitre removed the Needs Triage Issue requires triage label Nov 30, 2023

s-banach mentioned this issue Jan 23, 2024

FIX Target encoder const y #28233

Merged

This was referenced Feb 1, 2024

RFC switch to Polars as the default dataframe lib in our examples #28341

Open

CI Fix pandas copy-on-write issues in scipy-dev #28348

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas Copy-on-Write mode should be enabled in all tests #27879

Pandas Copy-on-Write mode should be enabled in all tests #27879

s-banach commented Nov 30, 2023 •

edited

Loading

glemaitre commented Nov 30, 2023

s-banach commented Nov 30, 2023

s-banach commented Nov 30, 2023 •

edited

Loading

glemaitre commented Nov 30, 2023

s-banach commented Nov 30, 2023 •

edited

Loading

lesteve commented Dec 5, 2023

lesteve commented Dec 5, 2023 •

edited

Loading

s-banach commented Dec 5, 2023 •

edited

Loading

lesteve commented Dec 6, 2023

Pandas Copy-on-Write mode should be enabled in all tests #27879

Pandas Copy-on-Write mode should be enabled in all tests #27879

Comments

s-banach commented Nov 30, 2023 • edited Loading

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

glemaitre commented Nov 30, 2023

s-banach commented Nov 30, 2023

s-banach commented Nov 30, 2023 • edited Loading

glemaitre commented Nov 30, 2023

s-banach commented Nov 30, 2023 • edited Loading

lesteve commented Dec 5, 2023

lesteve commented Dec 5, 2023 • edited Loading

s-banach commented Dec 5, 2023 • edited Loading

lesteve commented Dec 6, 2023

s-banach commented Nov 30, 2023 •

edited

Loading

s-banach commented Nov 30, 2023 •

edited

Loading

s-banach commented Nov 30, 2023 •

edited

Loading

lesteve commented Dec 5, 2023 •

edited

Loading

s-banach commented Dec 5, 2023 •

edited

Loading