Support nullable pandas dtypes in unique_labels #25634

Closed
tamargrey opened this issue Feb 17, 2023 · 1 comment · Fixed by #25638

Comments

@tamargrey

Describe the workflow you want to enable

I would like to be able to pass the nullable pandas dtypes ("Int64", "Float64", "boolean") into sklearn's unique_labels function. Because these dtypes become object dtype when converted to NumPy arrays, we get ValueError: Mix type of y not allowed, got types {'binary', 'unknown'}:

Repro with sklearn 1.2.1

    import pandas as pd
    import pytest
    from sklearn.utils.multiclass import unique_labels
    
    for dtype in ["Int64", "Float64", "boolean"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)
        y_predicted = pd.Series([0, 0, 1, 1, 0, 1, 1, 1, 1], dtype="int64")

        with pytest.raises(ValueError, match="Mix type of y not allowed, got types"):
            unique_labels(y_true, y_predicted)

Describe your proposed solution

We should get the same behavior as when int64, float64, and bool dtypes are used, which is no error:

    import pandas as pd
    from sklearn.utils.multiclass import unique_labels
    
    for dtype in ["int64", "float64", "bool"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)
        y_predicted = pd.Series([0, 0, 1, 1, 0, 1, 1, 1, 1], dtype="int64")

        unique_labels(y_true, y_predicted)

Describe alternatives you've considered, if relevant

Our current workaround is to convert the data to NumPy arrays with the corresponding non-nullable dtype before passing it into unique_labels.
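
A minimal sketch of that workaround, assuming the Series contains no missing values. The mapping `NULLABLE_TO_NUMPY` and the helper `to_numpy_labels` are hypothetical names introduced here for illustration; they are not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from pandas nullable dtype names to NumPy dtype names.
NULLABLE_TO_NUMPY = {"Int64": "int64", "Float64": "float64", "boolean": "bool"}


def to_numpy_labels(y):
    """Convert a pandas Series to a NumPy array with a non-nullable dtype.

    Hypothetical helper. Assumes `y` has no missing values: pandas raises
    a ValueError when asked to convert pd.NA to an integer or bool dtype.
    """
    target = NULLABLE_TO_NUMPY.get(str(y.dtype))
    if target is None:
        # Not one of the nullable dtypes above; use the default conversion.
        return y.to_numpy()
    return y.to_numpy(dtype=target)


for dtype in ["Int64", "Float64", "boolean"]:
    y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)
    converted = to_numpy_labels(y_true)
    print(dtype, "->", converted.dtype)
```

The converted arrays no longer have object dtype, so they can be passed to unique_labels in place of the original Series.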

Additional context

No response

@thomasjpfan
Member

Thank you for opening the issue! The underlying issue is that pandas' __array__ returns an object-dtype array for Series with nullable dtypes:

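    import numpy as np
    import pandas as pd

    # Nullable extension dtype: conversion to NumPy yields an object array.
    y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype="Int64")
    print(np.asarray(y_true).dtype)
    # object

    # Plain NumPy-backed dtype: conversion keeps int64.
    y_pred = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype="int64")
    print(np.asarray(y_pred).dtype)
    # int64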

Scikit-learn already has check_array, which handles converting pandas nullable dtypes into their corresponding NumPy dtypes, so I opened #25638 to use check_array to fix this issue.
