-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MNT Correctly errors in check_array with dtype=numeric and string/bytes #18496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Correctly errors in check_array with dtype=numeric and string/bytes #18496
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Could you mention closing #18660
I think that we will need to an entry in what's new mention it as a fix and also mention the change of behaviour.
Hi. I was wondering, is there any reason, why all So with the proposed changes named_uint64_dtype = np.dtype([('named_value', np.uint64, (1,))])
named_uint64_array = np.full((1, 1), np.iinfo(np.uint64).max, dtype=named_uint64_dtype)
print(named_uint64_array) # [([18446744073709551615],)]
checked = sklearn.utils.validation.check_array(named_uint64_array)
print(checked) # [[1.8447e+19]] Notice, that the checked array is silently converted to Also, structured dtypes with more than 1 element now will just break, when trying to convert to custom_complex_dtype = np.dtype([('real', np.float, (1,)), ('imag', np.float, (1,))])
custom_complex_array = np.zeros((1, 1), dtype=custom_complex_dtype)
print(custom_complex_array) # [[([0.], [0.])]]
sklearn.utils.validation.check_array(custom_complex_array)
# TypeError: Cannot cast array data from dtype([('real', '<f8', (1,)), ('imag', '<f8', (1,))]) to dtype('float64') according to the rule 'unsafe' I am not sure, whether the above is a desirable behaviour. It seems to me, that the "convert strings/bytes to floats" issue is covered by just converting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still deprecate string / bytes inputs when dtype_numeric
, even in the case where they can be converted to numerical values with .astype(np.float64)
. I think we should even consider doing it for object dtype arrays.
For instance in the past @jnothman said: #10495 (review) and even before, in the parent issue, @amueller said: #10229 (comment)
dtype.kind == 'V'
should just error directly. Implicit conversion should probably always considered a bug as @ruro explained in #18496 (comment).
@ogrisel well, I would have preferred to keep the current behavior of passing the array unchanged for structured Although, I'd understand, if you still chose to reject |
@ogrisel So you are proposing to raise a new warning and wait for 2 more versions before raising the error? Would be fine with me. @thomasjpfan does it look OK with you? |
That will probably crash with a lower level error message on many scikit-learn estimators, for instance those written in Cython, right? What is your use case @ruro? Which sklearn estimators do you feed structured dtypes?
Yes sorry I don't thing we should support untested features that lack a meaning usecase.
I was proposing to start the deprecation for any string / bytes arrays in 0.24 (schedule for removal in 0.26). |
Rather than feeding structured As I mentioned, my particular use case is 100% in the "definitely-not-supported-messing-with-sklearn-internals" category. vbrute = sklearn.neighbors.VALID_METRICS['brute']
vmetrics = sklearn.metrics.pairwise._VALID_METRICS
pairwise = sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS
def install_metric(f):
name = f.__name__
if name not in vbrute:
vbrute.append(name)
if name not in vmetrics:
vmetrics.append(name)
pairwise[name] = f
return f Which allows me to register my own custom vectorized metrics: @install_metric
def custom_metric(X, Z, **kwargs):
"""
params:
X: ndarray of shape {x_samples, features}
Z: ndarray of shape {z_samples, features}
returns: ndarray of shape {x_samples, z_samples}
"""
pass And then use them in any sklearn algo, which accepts a metric: nn = sklearn.neighbors.NearestNeighbors(
n_neighbors=3, algorithm='brute',
metric='custom_metric', metric_params={'kwargs': 'here'}
) I know, that you can pass custom metric functions as There are probably similar (in spirit) use cases which don't rely on messing with sklearn internals, but I don't know any. |
ping @adrinjalali :) |
I wouldn't mind ignoring the On silently converting strings to numeric, I'd side on always warn, since this PR can make things a bit inconsistent. For instance, if a dataframe/array has some string columns, and in some folds we can convert things to numeric and in some folds we can't, then the two sets of folds are gonna show a very different behavior. I rather always warn in these cases. |
Updated PR with a warning when a conversation takes place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
# make sure we actually converted to numeric: | ||
if dtype_numeric and array.dtype.kind == "O": | ||
array = array.astype(np.float64) | ||
if dtype_numeric and array.dtype.kind in "OUSV": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't mind leaving U
out of this here. But not strong feelings either way.
@ogrisel If I follow the conversation and the code correctly, your requested changes are addressed and this PR is ready for merge, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan's new test fails on some platforms:
We shall probably silence the warning in this test to get a chance to reach the ValueError
first.
I am not sure why it does not fail consistently across all platforms, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming my last commit fixes the tests, I will merge when green.
Thank you for finishing this up @ogrisel ! |
Reference Issues/PRs
Finishes #10495
Edit: Closes #18660
What does this implement/fix? Explain your changes.
This PR finishes the deprecation in
check_array
by raising for object/str/bytes dtype whendtype='numeric'
.