Skip to content

check_array(X, dtype='numeric') should fail if X has strings #10229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jnothman opened this issue Nov 30, 2017 · 14 comments · Fixed by #10495
Closed

check_array(X, dtype='numeric') should fail if X has strings #10229

jnothman opened this issue Nov 30, 2017 · 14 comments · Fixed by #10495
Labels
Bug Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted

Comments

@jnothman
Copy link
Member

Currently, dtype='numeric' is defined as "dtype is preserved unless array.dtype is object". This seems overly lenient and strange behaviour, as in #9342 where @qinhanmin2014 shows that check_array(['a', 'b', 'c'], dtype='numeric') works without error and produces an array of strings! This behaviour is not tested and it's hard to believe that it is useful and intended. Perhaps we need a deprecation cycle, but I think dtype='numeric' should raise an error, or attempt to coerce, if the data does not actually have a numeric, real-valued dtype.

@jnothman jnothman added the Bug label Nov 30, 2017
@qinhanmin2014
Copy link
Member

ping @jnothman

This seems overly lenient and strange behaviour, as in #9342 where @qinhanmin2014 shows that check_array(['a', 'b', 'c'], dtype='numeric') works without error and produces an array of strings!

I think you mean #9835 (#9835 (comment)) ?

Yes, from my perspective, if it is not intended, it seems a bug.

@jnothman
Copy link
Member Author

jnothman commented Nov 30, 2017 via email

@amueller
Copy link
Member

Five now!
And I'm not entirely sure what my intended behavior was, but I agree with your assessment. This should error on strings.

@jnothman
Copy link
Member Author

jnothman commented Dec 12, 2017 via email

@amueller
Copy link
Member

Well, I guess it's one more bit, though ;)

@jnothman jnothman added Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted labels Jan 10, 2018
@jnothman jnothman changed the title check_array(X, dtype='numeric') should fail if X has strings? check_array(X, dtype='numeric') should fail if X has strings Jan 10, 2018
@rtlee9
Copy link
Contributor

rtlee9 commented Jan 17, 2018

@jnothman I'd be happy to give this a go with some limited guidance if no one else is working on it already. Looks like the behavior you noted comes from this line, where we're checking the array against the numpy object type when we'd like to check it against string and unicode types as well -- the [['a', 'b', 'c']] list in your example appears to be cast to the numpy unicode array type in your example by the time it reaches that line. Sound right?

@rtlee9
Copy link
Contributor

rtlee9 commented Jan 17, 2018

I'd also be curious to hear what you had in mind in terms of longer term solution, i.e., what would replace check_array if deprecated?

Perhaps we need a deprecation cycle

@jnothman
Copy link
Member Author

jnothman commented Jan 17, 2018 via email

@jnothman
Copy link
Member Author

We wouldn't deprecate check_array entirely, but we would warn for two releases that "In the future, this data with dtype('Uxx') would be rejected because it is not of a numeric dtype."

@SpirinEgor
Copy link

Hi, I'm the newcomer here, can I take this?

@lesteve
Copy link
Member

lesteve commented Feb 19, 2018

There is already a PR open on this one #10229, so I would encourage you to find another issue to work on.

If you are a newcomer I would encourage you to read our contribution guidelines, in particular look at issues with the "good first issue" tag.

@SpirinEgor
Copy link

Iesteve, I find this issue by this tag.
As far as I see, it was reopened, so I think, that I can help.
Nevertheless thank you, I will find another issue, where I can start my contributing way :)

jnothman pushed a commit that referenced this issue Feb 22, 2018
)

* Add deprecation warning to check_array for flexible array w/ dtype=numeric
@zeromh
Copy link

zeromh commented Sep 28, 2020

I'm confused about the fix here. I'm using sklearn 0.23.2, and the behavior that @jnothman called out as a problem is still the same as he described. To reproduce:

arr = np.array([[1, 's'],
                [1, 1]])
check_array(arr, dtype='numeric')

FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers
 if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in 
scikit-learn, for example by using your_array = your_array.astype(np.float64).
  return f(**kwargs)

array([['1', 's'],
       ['1', '1']], dtype='<U21')

Now everything's a string. It looks like the warning message was added in 2018 but the behavior was never changed. Am I missing something?

@ogrisel
Copy link
Member

ogrisel commented Oct 26, 2020

@zeromh I agree. We have a new opportunity to fix it in #18496.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants