SimpleImputer converts int32[pyarrow] extension array to float64, subsequently crashing with numpy int32 values
#31412
Labels
Needs Triage
Describe the bug
When using the SimpleImputer with a pyarrow-backed pandas DataFrame, any float/integer data is converted to None/float64 instead. This causes the imputer to be fitted to float64, crashing on a dtype assertion when passing it a numpy-backed int32 DataFrame after fitting.
The flow is the following:
_validate_input: scikit-learn/sklearn/impute/_base.py, Line 319 in 675736a
validate_data: scikit-learn/sklearn/impute/_base.py, Lines 344 to 353 in 675736a
check_array: scikit-learn/sklearn/utils/validation.py, Lines 2951 to 2952 in 675736a
scikit-learn/sklearn/utils/validation.py, Line 909 in 675736a
scikit-learn/sklearn/utils/validation.py, Lines 925 to 927 in 675736a
int32[pyarrow] is an integer datatype, so we return True here: scikit-learn/sklearn/utils/validation.py, Lines 714 to 724 in 675736a
dtype (which is None here, which apparently means float64): scikit-learn/sklearn/utils/validation.py, Lines 966 to 971 in 675736a
Steps/Code to Reproduce
Expected Results
Both imputers should be fitted with int32 values.
Actual Results
The imputer using the pyarrow extension array is fitted with float64. This causes crashes when using the imputer with normal int32 columns backed by numpy, as they won't be converted and therefore the dtypes differ.
Versions