Description
Describe the bug
In check_array, dtype_orig is determined for array objects that are pandas DataFrames by checking all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig). This excludes the pandas nullable extension types such as boolean, Int64, and Float64, resulting in a dtype_orig of None.
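A minimal sketch of why that check fails for nullable columns, using only pandas and NumPy. The helper name all_numpy_dtypes and the two example frames are illustrative assumptions, not scikit-learn code; the helper only mirrors the expression quoted above.

```python
import numpy as np
import pandas as pd


def all_numpy_dtypes(df):
    """Hypothetical helper: True only if every column dtype is a np.dtype instance."""
    return all(isinstance(dtype_iter, np.dtype) for dtype_iter in df.dtypes)


# A NumPy-backed column passes the check.
numpy_backed = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="int64")})

# A nullable extension column does not, because Int64Dtype is a pandas
# ExtensionDtype rather than a np.dtype.
nullable = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="Int64")})

print(all_numpy_dtypes(numpy_backed), all_numpy_dtypes(nullable))
```

A single extension-typed column is enough to make the whole expression evaluate to False, even when every other column is NumPy-backed.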
If pandas_requires_conversion is true, there ends up being a call to array = array.astype(None), which pandas takes to mean that a conversion to float64 should be attempted. If non-numeric, non-boolean data is present in array, this can raise ValueError: could not convert string to float: if the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 if the data has the category dtype with object categories.
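The float64 fallback can be illustrated outside of scikit-learn. This sketch assumes only that NumPy resolves a dtype of None to its default, float64; it passes that resolved dtype to astype explicitly rather than passing None. The helper cast_error is a hypothetical name introduced for the example.

```python
import numpy as np
import pandas as pd

# NumPy resolves a dtype of None to its default, float64.
default_dtype = np.dtype(None)

# An object column holding strings, like column "b" in the reproduction.
strings = pd.Series(["a", "b", "c", "d"], dtype="object")


def cast_error(series, dtype):
    """Hypothetical helper: return the ValueError message from the cast, or None on success."""
    try:
        series.astype(dtype)
    except ValueError as exc:
        return str(exc)
    return None


message = cast_error(strings, default_dtype)
print(default_dtype, "|", message)
```

The message reported for the object column begins with the same "could not convert string to float" text quoted in this report.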
I first found this when using the imblearn SMOTEN and SMOTENC oversamplers, but this could happen from other uses of check_array.
Steps/Code to Reproduce
Reproduction via oversamplers
import pandas as pd
import pytest
from imblearn import over_sampling as im

# Each nullable dtype for column "a" triggers the error.
for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    y = pd.Series([0, 1, 1, 0], dtype="int64")
    for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
        with pytest.raises(ValueError):
            oversampler.fit_resample(X, y)
Reproduction via check_array directly
import pandas as pd
import pytest
from sklearn.utils.validation import check_array

# Each nullable dtype for column "a" triggers the error.
for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    with pytest.raises(ValueError):
        check_array(X, dtype=None)
Expected Results
We should get the same behavior that's seen with the non-nullable equivalents ["bool", "int64", "float64"], which is no error.
import pandas as pd
from sklearn.utils.validation import check_array

for dtype in ["bool", "int64", "float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    check_array(X, dtype=None)
Actual Results
The actual result is a ValueError: could not convert string to float: being raised if the data has the object dtype with string data, or a ValueError: Cannot cast object dtype to float64 if the data has the category dtype with object categories.
Versions
System:
    python: 3.8.2 (default, May 21 2021, 12:12:59) [Clang 11.0.3 (clang-1103.0.32.62)]
    executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.2.2
    pip: 22.2.2
    setuptools: 59.8.0
    numpy: 1.22.4
    scipy: 1.8.1
    Cython: 0.29.32
    pandas: 1.5.3
    matplotlib: 3.5.3
    joblib: 1.2.0
    threadpoolctl: 3.1.0

Built with OpenMP: True