SimpleImputer converts int32[pyarrow] extension array to float64, subsequently crashing with numpy int32 values #31373

Open
I-Al-Istannen opened this issue May 16, 2025 · 5 comments

@I-Al-Istannen

Describe the bug

When using SimpleImputer with a pyarrow-backed pandas DataFrame, integer and float columns are converted to float64.
This causes the imputer to be fitted on float64, so it later crashes on a dtype assertion when a numpy-backed int32 DataFrame is passed after fitting.

The flow is the following (a standalone demonstration follows the list):

  1. The imputer calls _validate_input:

       def _validate_input(self, X, in_fit):

  2. This calls validate_data:

       X = validate_data(
           self,
           X,
           reset=in_fit,
           accept_sparse="csc",
           dtype=dtype,
           force_writeable=True if not in_fit else None,
           ensure_all_finite=ensure_all_finite,
           copy=self.copy,
       )

  3. This calls check_array:

       elif not no_val_X and no_val_y:
           out = check_array(X, input_name="X", **check_params)

  4. Our input is a pandas DataFrame:

       if hasattr(array, "dtypes") and hasattr(array.dtypes, "__array__"):

  5. This now checks whether the dtypes need to be converted:

       pandas_requires_conversion = any(
           _pandas_dtype_needs_early_conversion(i) for i in dtypes_orig
       )

  6. Our input is backed by an extension array and int32[pyarrow] is an integer dtype, so we return True here:

       if isinstance(pd_dtype, SparseDtype) or not is_extension_array_dtype(pd_dtype):
           # Sparse arrays will be converted later in `check_array`
           # Only handle extension arrays for integer and floats
           return False
       elif is_float_dtype(pd_dtype):
           # Float ndarrays can normally support nans. They need to be converted
           # first to map pd.NA to np.nan
           return True
       elif is_integer_dtype(pd_dtype):
           # XXX: Warn when converting from a high integer to a float
           return True

  7. Finally we pass the "needs conversion" check and convert the DataFrame with astype(new_dtype); new_dtype resolves to None here, and astype(None) apparently means float64:

       if pandas_requires_conversion:
           # pandas dataframe requires conversion earlier to handle extension dtypes with
           # nans
           # Use the original dtype for conversion if dtype is None
           new_dtype = dtype_orig if dtype is None else dtype
           array = array.astype(new_dtype)
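
Steps 6 and 7 can be reproduced in isolation without SimpleImputer. A minimal sketch against sklearn 1.6.x; note that _pandas_dtype_needs_early_conversion is a private helper and may move or change between releases:

import pandas as pd
import pyarrow as pa
from sklearn.utils import check_array
from sklearn.utils.validation import _pandas_dtype_needs_early_conversion

# Step 6: int32[pyarrow] is an extension dtype that reports as integer,
# so early conversion is requested
print(_pandas_dtype_needs_early_conversion(pd.ArrowDtype(pa.int32())))  # True

# Steps 3-7 end to end: check_array up-casts the whole frame to float64,
# because new_dtype resolves to None and astype(None) defaults to float64
X = pd.DataFrame({"a": pd.array([10], dtype=pd.ArrowDtype(pa.int32()))})
print(check_array(X, dtype=None).dtype)  # float64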

Steps/Code to Reproduce

import polars as pl
from sklearn.impute import SimpleImputer
import numpy as np

print(
    SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
      .fit(pl.DataFrame({"a": [10]}, schema={"a": pl.Int32}).to_pandas(use_pyarrow_extension_array=False))
      ._fit_dtype
)
# prints dtype('int32'), as expected

print(
    SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
      .fit(pl.DataFrame({"a": [10]}, schema={"a": pl.Int32}).to_pandas(use_pyarrow_extension_array=True))
      ._fit_dtype
)
# prints dtype('float64') (!!)

Expected Results

Both imputers should be fitted with int32 values.

Actual Results

The imputer using the pyarrow extension array is fitted with float64.

This causes crashes when using the imputer with ordinary numpy-backed int32 columns, as those are not converted and the dtypes therefore differ.
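
For illustration, a minimal sketch of the follow-on crash (the exact exception depends on the sklearn version; per this report it fails on a dtype assertion):

import numpy as np
import polars as pl
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)

# Fit on the pyarrow-backed frame: _fit_dtype silently becomes float64
imputer.fit(pl.DataFrame({"a": [10]}, schema={"a": pl.Int32})
              .to_pandas(use_pyarrow_extension_array=True))

# Transform a numpy-backed int32 frame: the dtypes now disagree and,
# per the report above, this crashes on a dtype assertion
imputer.transform(pl.DataFrame({"a": [10]}, schema={"a": pl.Int32})
                    .to_pandas(use_pyarrow_extension_array=False))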

Versions

System:
    python: 3.12.9 (main, Mar 11 2025, 17:26:57) [Clang 20.1.0 ]
executable: /tmp/scikit/.venv/bin/python3
   machine: Linux-6.14.4-arch1-2-x86_64-with-glibc2.41

Python dependencies:
      sklearn: 1.6.1
          pip: None
   setuptools: None
        numpy: 2.2.5
        scipy: 1.15.3
       Cython: None
       pandas: 2.2.3
   matplotlib: None
       joblib: 1.5.0
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /tmp/scikit/.venv/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /tmp/scikit/.venv/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-68440149.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /tmp/scikit/.venv/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
I-Al-Istannen added the Bug and Needs Triage (Issue requires triage) labels on May 16, 2025
@gdacciaro
Contributor

The core issue stems from the fact that pandas.api.types.is_integer_dtype does not recognize polars (or pyarrow-backed) integer dtypes as integer types. For example:

from pandas.api.types import is_integer_dtype
import polars as pl
import pandas as pd

print(is_integer_dtype(int))              # True
print(is_integer_dtype(pl.Int32))         # False <--- problematic
print(is_integer_dtype(pd.Int32Dtype()))  # True

This leads to unexpected conversion of pyarrow-backed integer columns to float64 inside check_array in scikit-learn, causing inconsistent dtype handling and eventual crashes when mixing pandas and polars dataframes or numpy arrays.

Proposal

I would like to work on making the check_array function dtype-checking logic more agnostic and robust, so that it correctly recognizes integer types from both pandas and polars (including pyarrow extension arrays). This would avoid unwanted dtype conversions and improve compatibility.

Happy to discuss the best approach and how to write tests for this.

/take

@I-Al-Istannen
Author

I am not passing in a polars type, I am passing in a pandas DataFrame backed by pyarrow. The is_integer_dtype check actually passes, as outlined in my issue — if it returned False, it wouldn't try to convert it :)

>>> typ = pl.DataFrame({"a": [10]}, schema={"a": pl.Int32}).to_pandas(use_pyarrow_extension_array=True).dtypes["a"]
>>> is_integer_dtype(typ)
True
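
The same check can be written without polars at all; a pandas ArrowDtype integer column behaves identically (sketch, assuming pandas 2.x with pyarrow installed):

import pandas as pd
import pyarrow as pa
from pandas.api.types import is_integer_dtype

# The pyarrow-backed extension dtype is recognized as integer,
# which is exactly why the early conversion is triggered
print(is_integer_dtype(pd.ArrowDtype(pa.int32())))  # True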

@Astroficboy

Astroficboy commented May 21, 2025

The issue could be the first check in def _validate_input(self, X, in_fit):

[screenshot: the dtype-selection logic at the top of SimpleImputer._validate_input]

It checks the data types and assigns a default from FLOAT_DTYPES = (np.float64, np.float32, np.float16).

@gdacciaro Would adding support for pyarrow dtypes in the if/else block work?

from pandas.api.types import is_extension_array_dtype, is_numeric_dtype

if self.strategy in ("most_frequent", "constant"):
    if isinstance(X, list) and any(
        isinstance(elem, str) for row in X for elem in row
    ):
        dtype = object
    else:
        dtype = None
else:
    # Allow Arrow-backed numeric dtypes
    if hasattr(X, "dtypes") and all(
        is_numeric_dtype(dt) or is_extension_array_dtype(dt)
        for dt in X.dtypes
    ):
        dtype = None  # Let sklearn decide; don't enforce float64
    else:
        dtype = FLOAT_DTYPES

return dtype
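
If this route is taken, a regression test along these lines could pin the behavior down (a sketch only; the test name and expected fit dtype reflect the fix proposed above, not current behavior):

import numpy as np
import pandas as pd
import pyarrow as pa
from sklearn.impute import SimpleImputer

def test_simple_imputer_preserves_pyarrow_int32():
    # Hypothetical regression test: fitting on an int32[pyarrow] column
    # should record int32, not float64, as the fit dtype
    X = pd.DataFrame({"a": pd.array([10, 20], dtype=pd.ArrowDtype(pa.int32()))})
    imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
    imputer.fit(X)
    assert imputer._fit_dtype == np.dtype("int32")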

@gdacciaro
Contributor

@Astroficboy
Thanks for the tag! No worries - feel free to implement it if you'd like.

@Astroficboy

@gdacciaro Will try for a pull request.
