_BaseEncoder with boolean categories_ that include nan fails on transform when X is boolean #29241

@StijnDebackere

Describe the bug

An encoder that was fitted on a DataFrame with a boolean column that includes NaN fails when transforming a boolean X, because of a dtype mismatch inside _check_unknown. Since X does not have object dtype, _check_unknown calls np.isnan(known_values), which raises a TypeError because known_values does have object dtype.
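To make the dtype mismatch concrete, here is a minimal NumPy-only sketch. It assumes the fitted categories look like an object array holding False, True and nan; the helper name isnan_raises is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

# Assumed shape of the fitted categories: mixing bool and nan forces object dtype.
known_values = np.array([False, True, np.nan], dtype=object)

# Boolean input as seen at transform time (plain bool dtype).
values = np.array([True, True, False])


def isnan_raises(arr):
    """Hypothetical helper: report whether np.isnan rejects `arr` with a TypeError."""
    try:
        np.isnan(arr)
    except TypeError:
        return True
    return False


# np.isnan has no loop for object arrays, so the known categories trigger the error,
# while the boolean input itself is accepted.
object_fails = isnan_raises(known_values)
bool_fails = isnan_raises(values)
```

So the failure comes from the stored categories, not from the boolean input being transformed.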

As far as I can tell, this can be fixed by casting the dtype of values in _check_unknown to the dtype of known_values:

if values.dtype != known_values.dtype:
    values = values.astype(known_values.dtype)
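A rough sketch of how such a cast could change which code path runs. The function find_unknown below is a simplified, hypothetical stand-in written for this illustration; it is not the actual _check_unknown implementation and only assumes that object arrays are handled with plain Python set logic while other dtypes go through np.isnan.

```python
import numpy as np


def find_unknown(values, known_values):
    """Hypothetical, simplified membership check that dispatches on dtype."""
    # Proposed fix: align the dtype of `values` with `known_values` first.
    if values.dtype != known_values.dtype:
        values = values.astype(known_values.dtype)

    if values.dtype.kind == "O":
        # Object path: Python set logic, np.isnan is never called.
        known = set(known_values.tolist())
        return [v for v in values.tolist() if v not in known]

    # Numeric path: np.isnan is safe here because both arrays are numeric.
    diff = np.setdiff1d(values, known_values)
    if np.isnan(known_values).any():
        # nan is a known category, so do not report it as unknown.
        diff = diff[~np.isnan(diff)]
    return diff.tolist()


# Boolean input checked against object-dtype categories that include nan.
unknown_bool = find_unknown(
    np.array([True, True, False]),
    np.array([False, True, np.nan], dtype=object),
)

# Float input checked against float categories, for comparison.
unknown_float = find_unknown(np.array([1.0, 3.0]), np.array([1.0, 2.0, np.nan]))
```

With the cast in place, the boolean input is compared through the object path and no call to np.isnan on an object array is made. Whether casting is the right fix for the real function is a separate question for the maintainers.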

Steps/Code to Reproduce

import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder

x = pd.DataFrame({'a': [True, False, np.nan]})
o = OrdinalEncoder()
o.fit_transform(x)

y = pd.DataFrame({'a': [True, True, False]})
o.transform(y)

Expected Results

I expect the array to be transformed according to the known classes:

array([[1.],
       [1.],
       [0.]])

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 10
      7 o.fit_transform(x)
      9 y = pd.DataFrame({'a': [True, True, False]})
---> 10 o.transform(y)

File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:1578, in OrdinalEncoder.transform(self, X)
   1564 """
   1565 Transform X to ordinal codes.
   1566 
   (...)
   1575     Transformed input.
   1576 """
   1577 check_is_fitted(self, "categories_")
-> 1578 X_int, X_mask = self._transform(
   1579     X,
   1580     handle_unknown=self.handle_unknown,
   1581     force_all_finite="allow-nan",
   1582     ignore_category_indices=self._missing_indices,
   1583 )
   1584 X_trans = X_int.astype(self.dtype, copy=False)
   1586 for cat_idx, missing_idx in self._missing_indices.items():

File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:206, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
    204 Xi = X_list[i]
    205 breakpoint()
--> 206 diff, valid_mask = _check_unknown(Xi, self.categories_[i], return_mask=True)
    208 if not np.all(valid_mask):
    209     if handle_unknown == "error":

File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/utils/_encode.py:307, in _check_unknown(values, known_values, return_mask)
    304         valid_mask = np.ones(len(values), dtype=bool)
    306 # check for nans in the known_values
--> 307 if np.isnan(known_values).any():
    308     diff_is_nan = np.isnan(diff)
    309     if diff_is_nan.any():
    310         # removes nan from valid_mask

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Versions

System:
    python: 3.11.8 (main, Feb 26 2024, 15:43:17) [Clang 14.0.6 ]
executable: ~/miniconda3/envs/analytics-models-v2/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.1.4
   matplotlib: 3.8.4
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 6
         prefix: libopenblas
       filepath: ~/miniconda3/envs/analytics-models-v2/lib/libopenblasp-r0.3.21.dylib
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
