Closed
Description
Describe the bug
An Encoder that was fitted on a DataFrame with boolean columns that include NaN will fail when transforming a boolean X, due to a mismatch in the dtypes when calling _check_unknown. Since X has no object dtype, there is an attempt to call np.isnan(known_values), which fails because known_values does have an object dtype.
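The failing call can be shown without scikit-learn. This is a minimal sketch: the two arrays are stand-ins I am assuming for what _check_unknown receives (an object-dtype known_values containing NaN and a bool-dtype values), and isnan_raises is a hypothetical helper used only for this illustration.

```python
import numpy as np

# Assumed stand-ins for the arrays seen by _check_unknown in this report.
known_values = np.array([False, True, np.nan], dtype=object)  # fitted categories
values = np.array([True, True, False])                        # boolean X column


def isnan_raises(arr):
    """Return True if np.isnan raises TypeError for this array."""
    try:
        np.isnan(arr)
    except TypeError:
        return True
    return False


print(values.dtype, known_values.dtype)
print(isnan_raises(known_values))  # object dtype: the ufunc has no matching loop
print(isnan_raises(values))        # bool dtype: np.isnan runs without error
```

The dtype mismatch is what matters here: the check on values passes, but np.isnan is then applied to known_values, whose dtype is object.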
As far as I can tell, this can be fixed by casting the dtype of values in _check_unknown to the dtype of known_values:
if values.dtype != known_values.dtype:
    values = values.astype(known_values.dtype)
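To see what the proposed cast does to the arrays, it can be tried in isolation. This is only a sketch: align_dtype is a hypothetical name, not a scikit-learn function, and the arrays are assumed stand-ins for the inputs of _check_unknown. It shows the effect of the cast, not that the cast is a complete fix.

```python
import numpy as np


def align_dtype(values, known_values):
    # Hypothetical helper wrapping the two lines proposed above.
    if values.dtype != known_values.dtype:
        values = values.astype(known_values.dtype)
    return values


# Assumed stand-ins for the arrays in this report.
known_values = np.array([False, True, np.nan], dtype=object)
values = np.array([True, True, False])

aligned = align_dtype(values, known_values)
print(values.dtype, "->", aligned.dtype)
print(aligned.tolist())
```

After the cast both arrays share the object dtype, so code that branches on the dtype of values would treat them consistently.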
Steps/Code to Reproduce
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

x = pd.DataFrame({'a': [True, False, np.nan]})
o = OrdinalEncoder()
o.fit_transform(x)

y = pd.DataFrame({'a': [True, True, False]})
o.transform(y)
Expected Results
I expect the array to be transformed according to the known classes:
array([[1.],
       [1.],
       [0.]])
Actual Results
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 10
7 o.fit_transform(x)
9 y = pd.DataFrame({'a': [True, True, False]})
---> 10 o.transform(y)
File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
293 @wraps(f)
294 def wrapped(self, X, *args, **kwargs):
--> 295 data_to_wrap = f(self, X, *args, **kwargs)
296 if isinstance(data_to_wrap, tuple):
297 # only wrap the first output for cross decomposition
298 return_tuple = (
299 _wrap_data_with_container(method, data_to_wrap[0], X, self),
300 *data_to_wrap[1:],
301 )
File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:1578, in OrdinalEncoder.transform(self, X)
1564 """
1565 Transform X to ordinal codes.
1566
(...)
1575 Transformed input.
1576 """
1577 check_is_fitted(self, "categories_")
-> 1578 X_int, X_mask = self._transform(
1579 X,
1580 handle_unknown=self.handle_unknown,
1581 force_all_finite="allow-nan",
1582 ignore_category_indices=self._missing_indices,
1583 )
1584 X_trans = X_int.astype(self.dtype, copy=False)
1586 for cat_idx, missing_idx in self._missing_indices.items():
File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:206, in _BaseEncoder._transform(self, X, handle_unknown, force_all_finite, warn_on_unknown, ignore_category_indices)
204 Xi = X_list[i]
205 breakpoint()
--> 206 diff, valid_mask = _check_unknown(Xi, self.categories_[i], return_mask=True)
208 if not np.all(valid_mask):
209 if handle_unknown == "error":
File ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/utils/_encode.py:307, in _check_unknown(values, known_values, return_mask)
304 valid_mask = np.ones(len(values), dtype=bool)
306 # check for nans in the known_values
--> 307 if np.isnan(known_values).any():
308 diff_is_nan = np.isnan(diff)
309 if diff_is_nan.any():
310 # removes nan from valid_mask
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Versions
System:
    python: 3.11.8 (main, Feb 26 2024, 15:43:17) [Clang 14.0.6 ]
    executable: ~/miniconda3/envs/analytics-models-v2/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.4.1.post1
    pip: 23.3.1
    setuptools: 68.2.2
    numpy: 1.26.4
    scipy: 1.13.0
    Cython: None
    pandas: 2.1.4
    matplotlib: 3.8.4
    joblib: 1.4.0
    threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 6
    prefix: libopenblas
    filepath: ~/miniconda3/envs/analytics-models-v2/lib/libopenblasp-r0.3.21.dylib
    version: 0.3.21
    threading_layer: pthreads
    architecture: Haswell

    user_api: openmp
    internal_api: openmp
    num_threads: 12
    prefix: libomp
    filepath: ~/miniconda3/envs/analytics-models-v2/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
    version: None