fix check_array dtype check for pandas series #12625
sklearn/utils/validation.py
Outdated
@@ -478,7 +478,7 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     # DataFrame), and store them. If not, store None.
     dtypes_orig = None
     if hasattr(array, "dtypes") and hasattr(array, "__array__"):
-        dtypes_orig = np.array(array.dtypes)
+        dtypes_orig = np.array(array.dtypes, ndmin=1)
I am not sure this is the "correct" fix. It fixes the problem, but in essence a Series should not take this path, as it can never have multiple dtypes in it. So I would rather ensure that a Series does not pass this `if` check.
@qinhanmin2014 had one in #12622 (comment)
I am trying to think of a robust "duck type check" for Series. Personally, I would start doing actual isinstance checks (we could e.g. have a util function that combines that with trying to import pandas), but that's maybe a broader issue.
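For illustration, such a util function might look roughly like the sketch below. The name `is_pandas_series` is hypothetical and is not an existing scikit-learn helper; the only assumption is that pandas is an optional dependency, so the import has to be guarded.

```python
def is_pandas_series(obj):
    """Return True only if pandas is installed and obj is a pandas Series.

    Hypothetical helper, not part of scikit-learn.
    """
    try:
        import pandas as pd
    except ImportError:
        # Without pandas installed, nothing can be a pandas Series.
        return False
    return isinstance(obj, pd.Series)


# Plain Python containers are never reported as a Series,
# whether or not pandas is available.
result_for_list = is_pandas_series([1, 2, 3])
```

The trade-off compared with duck typing is that this only recognises real pandas objects, not other containers that happen to expose `dtypes`.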
we could also do `and array.dtypes.ndim` so it stays `None`?
Yes, that's a good idea (for sure nicer than the `hasattr(array.dtypes, "__array__")`)
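A rough sketch of how the `ndim`-based duck-type check could behave, using stand-in classes instead of real pandas objects so it runs without pandas. `FakeSeries`, `FakeFrame` and `dtypes_orig_or_none` are hypothetical names, and `getattr(dtypes, "ndim", 0)` is a slightly more defensive variant of the `array.dtypes.ndim` suggested above.

```python
import numpy as np


class FakeSeries:
    # Like a pandas Series: `dtypes` is a single dtype object (ndim == 0).
    dtypes = np.dtype("int64")

    def __array__(self, dtype=None, copy=None):
        return np.array([1, 2, 3])


class FakeFrame:
    # Like a pandas DataFrame: `dtypes` is a 1-D collection, one per column.
    dtypes = np.array([np.dtype("int64"), np.dtype("float64")], dtype=object)

    def __array__(self, dtype=None, copy=None):
        return np.array([[1, 1.5], [2, 2.5]])


def dtypes_orig_or_none(array):
    """Return per-column dtypes for DataFrame-like input, otherwise None."""
    dtypes = getattr(array, "dtypes", None)
    if (dtypes is not None and hasattr(array, "__array__")
            and getattr(dtypes, "ndim", 0)):
        return np.array(dtypes)
    return None


series_result = dtypes_orig_or_none(FakeSeries())
frame_result = dtypes_orig_or_none(FakeFrame())
```

With this condition the Series-like object is skipped (the result stays `None`), while the DataFrame-like object still yields one dtype per column.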
LGTM, a "what's new" is probably needed?
def test_check_array_series():
    # regression test that check_array works on pandas Series
    pd = importorskip("pandas")
    check_array(pd.Series([1, 2, 3]), ensure_2d=False, warn_on_dtype=True)
maybe assert that the output is equal to `array([1, 2, 3])`?
LGTM as well besides the CI failure on style :) Too many reviews.
hopefully good now ;)
doc/whats_new/v0.20.rst
Outdated
@@ -176,6 +176,10 @@ Changelog
   precision issues in :class:`preprocessing.StandardScaler` and
   :class:`decomposition.IncrementalPCA` when using float32 datasets.
   :issue:`12338` by :user:`bauks <bauks>`.
+
+- |Fix| Calling :func:`utils.check_array` on pandas `Series`, which
(I think you can use `pandas.Series` for an intersphinx link)
I was wondering after I saw the edit. Would be cool ;)
thanks!
Reference Issues/PRs
Example: Fixes #12622
What does this implement/fix? Explain your changes.
Can't call `set` on a zero-dimensional array.
Does this need a whatsnew? Probably?
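A minimal way to see the failure described above without pandas, assuming a plain `np.dtype` as a stand-in for what `Series.dtypes` returns (a single dtype rather than one per column):

```python
import numpy as np

# Stand-in for `pd.Series([1, 2, 3]).dtypes`, which is a single dtype object.
series_dtypes = np.dtype("int64")

# Wrapping a single dtype gives a zero-dimensional object array ...
zero_d = np.array(series_dtypes)

# ... and a zero-dimensional array cannot be iterated, so set() fails.
try:
    set(zero_d)
    set_failed = False
except TypeError:
    set_failed = True

# With ndmin=1 the result is one-dimensional, so set() works.
one_d = np.array(series_dtypes, ndmin=1)
unique_dtypes = set(one_d)
```

This only reproduces the NumPy behaviour behind the error; it does not exercise `check_array` itself.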