-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
check_array(X, dtype='numeric') should fail if X has strings #10229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
ping @jnothman
I think you mean #9835 (#9835 (comment)) ? Yes, from my perspective, if it is not intended, it seems a bug. |
it seems memorising four digit numbers is not my speciality
|
Five now! |
I think, @amueller, for the next little while, the mere knowledge that a
number has five digits leaves the first with distinctly low entropy
|
Well, I guess it's one more bit, though ;) |
@jnothman I'd be happy to give this a go with some limited guidance if no one else is working on it already. Looks like the behavior you noted comes from this line, where we're checking the array against the numpy object type when we'd like to check it against string and unicode types as well -- the |
I'd also be curious to hear what you had in mind in terms of longer term solution, i.e., what would replace
|
Something like that. Basically if dtype_numeric and array.dtype is not an
object dtype or a numeric dtype, we should raise.
…On 17 January 2018 at 13:17, Ryan ***@***.***> wrote:
@jnothman <https://github.com/jnothman> I'd be happy to give this a go
with some limited guidance if no one else is working on it already. Looks
like the behavior you noted comes from this line
<https://github.com/scikit-learn/scikit-learn/blob/202b5321f1798c4980abf69ac8c0a0969f01a2ec/sklearn/utils/validation.py#L480>,
where we're checking the array against the numpy object type when we'd like
to check it against string and unicode types as well -- the [['a', 'b',
'c']] list in your example appears to be cast to the numpy unicode array
type in your example by the time it reaches that line. Sound right?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10229 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz69hcymywNoXaDwoNalOeRc93uF3Uks5tLVg8gaJpZM4QwJIl>
.
|
We wouldn't deprecate |
Hi, I'm the newcomer here, can I take this? |
There is already a PR open on this one #10229, so I would encourage you to find another issue to work on. If you are a newcomer I would encourage you to read our contribution guidelines, in particular look at issues with the "good first issue" tag. |
Iesteve, I find this issue by this tag. |
I'm confused about the fix here. I'm using sklearn 0.23.2, and the behavior that @jnothman called out as a problem is still the same as he described. To reproduce:
Now everything's a string. It looks like the warning message was added in 2018 but the behavior was never changed. Am I missing something? |
Currently, dtype='numeric' is defined as "dtype is preserved unless array.dtype is object". This seems overly lenient and strange behaviour, as in #9342 where @qinhanmin2014 shows that
check_array(['a', 'b', 'c'], dtype='numeric')
works without error and produces an array of strings! This behaviour is not tested and it's hard to believe that it is useful and intended. Perhaps we need a deprecation cycle, but I think dtype='numeric' should raise an error, or attempt to coerce, if the data does not actually have a numeric, real-valued dtype.The text was updated successfully, but these errors were encountered: