check_array can call array.astype(None), raising ValueError if pandas extension types are present in a pd.DataFrame array #25798

Closed
tamargrey opened this issue Mar 9, 2023 · 2 comments · Fixed by #25814

@tamargrey

Describe the bug

In check_array, when array is a pandas DataFrame, dtype_orig is only determined if all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig) holds. The pandas nullable extension types such as boolean, Int64, and Float64 are not np.dtype instances, so the check fails and dtype_orig ends up as None.
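
The predicate quoted above can be tried in isolation. The following is a minimal sketch, not the scikit-learn source: the helper name all_numpy_dtypes and the three small frames are made up for illustration, and only the isinstance check itself comes from this report.

```python
import numpy as np
import pandas as pd


def all_numpy_dtypes(df: pd.DataFrame) -> bool:
    """Return True only if every column dtype is a plain NumPy dtype."""
    # Same predicate as quoted in the report, applied to the column dtypes.
    return all(isinstance(dtype_iter, np.dtype) for dtype_iter in df.dtypes)


# Hypothetical frames: NumPy-backed, nullable extension, and categorical.
plain = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="int64")})
nullable = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="Int64")})
categorical = pd.DataFrame({"a": pd.Series(["x", "y", "x", "y"], dtype="category")})

print(all_numpy_dtypes(plain))        # True
print(all_numpy_dtypes(nullable))     # False: Int64 is an extension dtype
print(all_numpy_dtypes(categorical))  # False: category is an extension dtype
```

A single extension-typed column is enough to make the check fail for the whole frame, which matches the mixed-dtype frames in the reproductions.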

If pandas_requires_conversion is true, check_array then calls array = array.astype(None), which pandas interprets as a request to convert to float64. If array contains non-numeric, non-boolean data, this raises ValueError: could not convert string to float: when a column has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 when a column has the category dtype with object categories.
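
A small sketch of why a dtype of None ends in a float conversion. It relies on NumPy resolving np.dtype(None) to float64 and then performs the cast with that explicit dtype rather than calling astype(None), since how pandas handles a literal None may differ between versions. The helper cast_error is hypothetical and exists only to capture the message.

```python
import numpy as np
import pandas as pd

# NumPy resolves a dtype of None to its default dtype, float64.
default_dtype = np.dtype(None)
print(default_dtype)

# String data stored with the object dtype, as in column "b" of the reproductions.
strings = pd.Series(["a", "b", "c", "d"], dtype="object")


def cast_error(series: pd.Series, dtype: np.dtype):
    """Return the ValueError message raised by astype, or None if the cast works."""
    try:
        series.astype(dtype)
    except ValueError as exc:
        return str(exc)
    return None


# Casting the strings to float64 fails with the first error quoted in the report.
message = cast_error(strings, default_dtype)
print(message)
```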

I first ran into this while using the imblearn SMOTEN and SMOTENC oversamplers, but it could be triggered by any other use of check_array.

Steps/Code to Reproduce

Reproduction via oversamplers

    import pandas as pd
    import pytest
    from imblearn import over_sampling as im

    # Each nullable extension dtype triggers the failure on its own.
    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )
        y = pd.Series([0, 1, 1, 0], dtype="int64")

        for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
            # Documents the buggy behavior: the ValueError is the bug being reported.
            with pytest.raises(ValueError):
                oversampler.fit_resample(X, y)

Reproduction via check_array directly

    import pandas as pd
    import pytest
    from sklearn.utils.validation import check_array

    # Each nullable extension dtype triggers the failure on its own.
    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        # Documents the buggy behavior: the ValueError is the bug being reported.
        with pytest.raises(ValueError):
            check_array(X, dtype=None)

Expected Results

We should get the same behavior as with the non-nullable equivalents ["bool", "int64", "float64"], which is no error.

    import pandas as pd
    from sklearn.utils.validation import check_array

    for dtype in ["bool", "int64", "float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        check_array(X, dtype=None)
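
Until a fix is available, one possible workaround is to cast the nullable columns to their NumPy equivalents before validation. This is a sketch under stated assumptions, not the change made in #25814: the mapping NULLABLE_TO_NUMPY and the helper to_numpy_dtypes are hypothetical names, the mapping covers only the three dtypes named in this report, and the cast assumes the affected columns contain no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping, limited to the nullable dtypes named in this report.
NULLABLE_TO_NUMPY = {"boolean": "bool", "Int64": "int64", "Float64": "float64"}


def to_numpy_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the listed nullable columns cast to NumPy dtypes.

    Assumes those columns hold no missing values; a nullable integer
    column containing pd.NA cannot be cast to int64.
    """
    targets = {
        col: NULLABLE_TO_NUMPY[str(dtype)]
        for col, dtype in df.dtypes.items()
        if str(dtype) in NULLABLE_TO_NUMPY
    }
    return df.astype(targets)


X = pd.DataFrame(
    {
        "a": pd.Series([1, 0, 1, 0], dtype="Int64"),
        "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
        "c": pd.Series([9, 8, 7, 6], dtype="int64"),
    }
)

converted = to_numpy_dtypes(X)
print(converted.dtypes)
```

After the cast every column dtype is a plain NumPy dtype, so the frame has the same dtypes as the non-nullable example above, which the report says check_array accepts.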

Actual Results

A ValueError is raised: ValueError: could not convert string to float: if the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 if the data has the category dtype with object categories.

Versions

System:
    python: 3.8.2 (default, May 21 2021, 12:12:59)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 22.2.2
   setuptools: 59.8.0
        numpy: 1.22.4
        scipy: 1.8.1
       Cython: 0.29.32
       pandas: 1.5.3
   matplotlib: 3.5.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@tamargrey tamargrey added Bug Needs Triage Issue requires triage labels Mar 9, 2023
@tamargrey tamargrey changed the title check_array can call array.astype(None) if pandas extension types are present in a pd.DataFrame array check_array can call array.astype(None), raising ValueError if pandas extension types are present in a pd.DataFrame array Mar 9, 2023
@tamargrey
Author

This also seems to happen when any category dtype with non-numeric categories is present.

@thomasjpfan
Member

Thank you for opening the issue! I opened #25814 to fix it.

@thomasjpfan thomasjpfan added module:utils and removed Needs Triage Issue requires triage labels Mar 10, 2023