check_array can call array.astype(None), raising ValueError if pandas extension types are present in a pd.DataFrame array  #25798

Closed
@tamargrey

Describe the bug

In check_array, when array is a pandas DataFrame, dtype_orig is only set if all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig) holds. The pandas nullable extension dtypes such as boolean, Int64, and Float64 are not np.dtype instances, so a DataFrame containing any of them ends up with a dtype_orig of None.

If pandas_requires_conversion is true, check_array then calls array = array.astype(None), which pandas interprets as a request to convert to float64. If array contains data that is neither numeric nor boolean, the conversion fails: with ValueError: could not convert string to float: when the data has the object dtype with string data, or with ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
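The dtype check described above can be reproduced outside scikit-learn. The following is a minimal sketch, assuming pandas and NumPy are installed; the variable names `is_numpy` and `all_numpy_dtypes` are illustrative and are not taken from the check_array source.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "a": pd.Series([1, 0, 1, 0], dtype="Int64"),  # nullable extension dtype
        "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
        "c": pd.Series([9, 8, 7, 6], dtype="int64"),
    }
)

# Same test as the one quoted from check_array, applied per column.
is_numpy = [isinstance(dtype_iter, np.dtype) for dtype_iter in X.dtypes]
all_numpy_dtypes = all(is_numpy)

# Column "a" fails the isinstance check, so the all(...) condition is False.
print(dict(zip(X.columns, is_numpy)))
print(all_numpy_dtypes)
```

With "int64" in place of "Int64" for column "a", every entry of `is_numpy` is True and the condition holds.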

I first ran into this while using the imblearn SMOTEN and SMOTENC oversamplers, but it could be triggered by any other use of check_array.

Steps/Code to Reproduce

Reproduction via oversamplers

    import pandas as pd
    import pytest
    from imblearn import over_sampling as im

    # Each nullable extension dtype triggers the error.
    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )
        y = pd.Series([0, 1, 1, 0], dtype="int64")

        for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
            with pytest.raises(ValueError):
                oversampler.fit_resample(X, y)

Reproduction via check_array directly

    import pandas as pd
    import pytest
    from sklearn.utils.validation import check_array

    # Each nullable extension dtype triggers the error.
    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        with pytest.raises(ValueError):
            check_array(X, dtype=None)

Expected Results

We should get the same behavior as with the non-nullable equivalents ["bool", "int64", "float64"], which is no error.

    import pandas as pd
    from sklearn.utils.validation import check_array

    for dtype in ["bool", "int64", "float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        check_array(X, dtype=None)

Actual Results

A ValueError is raised: ValueError: could not convert string to float: if the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 if the data has the category dtype with object categories.
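The object-dtype failure can be seen in isolation with an explicit cast. This is a minimal sketch that asks pandas for float64 directly rather than going through check_array; the names `strings` and `error` are illustrative, and the exact message text may differ between pandas and NumPy versions.

```python
import pandas as pd

# An object-dtype column holding strings, as in column "b" of the reproduction.
strings = pd.Series(["a", "b", "c", "d"], dtype="object")

try:
    strings.astype("float64")  # the strings cannot be parsed as floats
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```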

Versions

System:
    python: 3.8.2 (default, May 21 2021, 12:12:59)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 22.2.2
   setuptools: 59.8.0
        numpy: 1.22.4
        scipy: 1.8.1
       Cython: 0.29.32
       pandas: 1.5.3
   matplotlib: 3.5.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True
