MNT Correctly errors in check_array with dtype=numeric and string/bytes #18496


Merged

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Sep 29, 2020

Reference Issues/PRs

Finishes #10495

Edit: Closes #18660

What does this implement/fix? Explain your changes.

This PR finishes the deprecation in check_array by raising for object/str/bytes dtype when dtype='numeric'.
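A minimal sketch of the resulting behaviour, assuming a hypothetical standalone helper written in plain NumPy (not the actual check_array implementation):

```python
import numpy as np

def check_numeric_dtype(array):
    """Hypothetical sketch: when a numeric dtype is requested,
    object/str/bytes (and void) arrays must cast cleanly to float64;
    otherwise a ValueError is raised instead of silently passing
    string data through."""
    array = np.asarray(array)
    if array.dtype.kind in "OUSV":  # object, unicode, bytes, void
        try:
            array = array.astype(np.float64)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                "dtype='numeric' is not compatible with arrays that "
                "cannot be cast to float64"
            ) from exc
    return array
```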

Member

@glemaitre glemaitre left a comment


LGTM. Could you mention that this closes #18660?
I think we will need an entry in what's new that mentions it as a fix and also mentions the change of behaviour.

@glemaitre glemaitre added this to the 0.24 milestone Oct 21, 2020
@ruro

ruro commented Oct 23, 2020

Hi. I was wondering, is there any reason why all np.void (meaning dtype.kind == 'V') arrays are also getting converted to float64? AFAIK, np.void is only used for structured dtypes.

So with the proposed changes:

named_uint64_dtype = np.dtype([('named_value', np.uint64, (1,))])
named_uint64_array = np.full((1, 1), np.iinfo(np.uint64).max, dtype=named_uint64_dtype)
print(named_uint64_array) # [([18446744073709551615],)]
checked = sklearn.utils.validation.check_array(named_uint64_array)
print(checked) # [[1.8447e+19]]

Notice that the checked array is silently converted to np.float64, which actually changes the value slightly, since the uint64 value range can't be exactly represented by 64-bit floats.

Also, structured dtypes with more than one field will now just break when trying to convert to np.float64:

custom_complex_dtype = np.dtype([('real', np.float64, (1,)), ('imag', np.float64, (1,))])
custom_complex_array = np.zeros((1, 1), dtype=custom_complex_dtype)
print(custom_complex_array) # [[([0.], [0.])]]
sklearn.utils.validation.check_array(custom_complex_array)
# TypeError: Cannot cast array data from dtype([('real', '<f8', (1,)), ('imag', '<f8', (1,))]) to dtype('float64') according to the rule 'unsafe'

I am not sure whether the above is desirable behaviour. It seems to me that the "convert strings/bytes to floats" issue is covered by just converting arrays whose dtype.kind is in 'OUS'?
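For reference, the dtype.kind codes mentioned here, and the uint64 precision loss described above, can be demonstrated with plain NumPy (illustrative snippet, not part of the PR):

```python
import numpy as np

# The dtype.kind codes discussed above:
print(np.dtype(object).kind)              # 'O' (object)
print(np.dtype('U10').kind)               # 'U' (unicode string)
print(np.dtype('S10').kind)               # 'S' (bytes)
print(np.dtype([('v', np.uint64)]).kind)  # 'V' (void / structured)
print(np.dtype(np.float64).kind)          # 'f' (plain numeric)

# The precision loss: uint64's maximum value is not exactly
# representable as a float64, so the cast silently changes it.
big = int(np.iinfo(np.uint64).max)        # 18446744073709551615
print(float(big) == big)                  # False: rounds up to 2**64
```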

Member

@ogrisel ogrisel left a comment


I think we should still deprecate string / bytes inputs when dtype='numeric', even in the case where they can be converted to numerical values with .astype(np.float64). I think we should even consider doing it for object dtype arrays.

For instance in the past @jnothman said: #10495 (review) and even before, in the parent issue, @amueller said: #10229 (comment)

dtype.kind == 'V' should just error directly. Implicit conversion should probably always be considered a bug, as @ruro explained in #18496 (comment).

@ruro

ruro commented Oct 26, 2020

@ogrisel well, I would have preferred to keep the current behavior of passing the array through unchanged for structured dtypes, since there are currently no known (AFAIK) issues with void-typed ndarrays. Somebody (ahem, me) might be passing structured-dtype arrays through the scikit-learn machinery, relying on the fact that structured dtypes are preserved.

Although, I'd understand if you still chose to reject void arrays, since this is definitely not a documented feature.

@glemaitre
Member

@ogrisel So you are proposing to raise a new warning and wait for 2 more versions before raising the error?

That would be fine with me. @thomasjpfan, does that look OK to you?

@ogrisel
Member

ogrisel commented Oct 30, 2020

@ogrisel well, I would have preferred to keep the current behavior of passing the array through unchanged for structured dtypes, since there are currently no known (AFAIK) issues with void-typed ndarrays. Somebody (ahem, me) might be passing structured-dtype arrays through the scikit-learn machinery, relying on the fact that structured dtypes are preserved.

That will probably crash with a lower level error message on many scikit-learn estimators, for instance those written in Cython, right?

What is your use case @ruro? Which sklearn estimators do you feed structured dtypes?

Although, I'd understand if you still chose to reject void arrays, since this is definitely not a documented feature.

Yes, sorry, I don't think we should support untested features that lack a meaningful use case.

@ogrisel So you are proposing to raise a new warning and wait for 2 more versions before raising the error?

I was proposing to start the deprecation for any string / bytes arrays in 0.24 (scheduled for removal in 0.26).

@ruro

ruro commented Oct 31, 2020

What is your use case @ruro? Which sklearn estimators do you feed structured dtypes?

Rather than feeding structured dtypes to built-in estimators, the main use case IMO is when sklearn acts as a middleman between two pieces of user code. For example, some estimators allow the user to provide their own custom metrics, loss functions, etc. In these cases, sklearn often tends to compulsively cast everything to float, which is kind of annoying. You also have to pack features of different types (categorical, ordinal, boolean, real-valued, textual, etc.) into composite dtypes if you don't want to end up with floats inside your losses/metrics.

As I mentioned, my particular use case is 100% in the "definitely-not-supported-messing-with-sklearn-internals" category.
I use the following hack

import sklearn.neighbors
import sklearn.metrics.pairwise

# Internal registries of metric names/functions (not public API):
vbrute = sklearn.neighbors.VALID_METRICS['brute']
vmetrics = sklearn.metrics.pairwise._VALID_METRICS
pairwise = sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS

def install_metric(f):
    """Register a vectorized distance function under its own name."""
    name = f.__name__
    if name not in vbrute:
        vbrute.append(name)
    if name not in vmetrics:
        vmetrics.append(name)

    pairwise[name] = f
    return f

Which allows me to register my own custom vectorized metrics:

@install_metric
def custom_metric(X, Z, **kwargs):
    """
    params:
        X: ndarray of shape (x_samples, features)
        Z: ndarray of shape (z_samples, features)
    returns: ndarray of shape (x_samples, z_samples)
    """
    pass

And then use them in any sklearn algo, which accepts a metric:

nn = sklearn.neighbors.NearestNeighbors(
    n_neighbors=3, algorithm='brute',
    metric='custom_metric', metric_params={'kwargs': 'here'}
)

I know that you can pass custom metric functions as Python callables, but then the function is evaluated for each pair of samples, which is insanely slow, and everything is unconditionally cast to float, which is really annoying.

There are probably similar (in spirit) use cases which don't rely on messing with sklearn internals, but I don't know any.

@ogrisel
Member

ogrisel commented Nov 5, 2020

ping @adrinjalali :)

@adrinjalali
Member

I wouldn't mind ignoring the void issue, I guess in almost all cases, if the user passes that kinda data, they know what they're doing.

On silently converting strings to numeric, I'd side with always warning, since this PR can make things a bit inconsistent. For instance, if a dataframe/array has some string columns, and in some folds we can convert things to numeric and in some folds we can't, then the two sets of folds are going to show very different behavior. I'd rather always warn in these cases.
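The concern can be illustrated with a small NumPy example (hypothetical data): the same string column may cast cleanly in one subset of rows and fail in another.

```python
import numpy as np

column = np.array(["1.0", "2.5", "n/a", "4.0"])

fold_a = column[:2]  # ["1.0", "2.5"] -> casts cleanly to numeric
fold_b = column[2:]  # ["n/a", "4.0"] -> cast raises

print(fold_a.astype(np.float64))  # [1.  2.5]
try:
    fold_b.astype(np.float64)
except ValueError:
    print("fold_b cannot be converted to numeric")
```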

@thomasjpfan
Member Author

Updated PR with a warning when a conversion takes place.

Member

@adrinjalali adrinjalali left a comment


Otherwise LGTM

  # make sure we actually converted to numeric:
- if dtype_numeric and array.dtype.kind == "O":
+ if dtype_numeric and array.dtype.kind in "OUSV":
      array = array.astype(np.float64)

I wouldn't mind leaving U out of this here. But no strong feelings either way.

@lorentzenchr
Member

@ogrisel If I follow the conversation and the code correctly, your requested changes are addressed and this PR is ready for merge, right?

Member

@ogrisel ogrisel left a comment


@thomasjpfan's new test fails on some platforms:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=23933&view=ms.vss-test-web.build-test-results-tab

We should probably silence the warning in this test to get a chance to reach the ValueError first.

I am not sure why it does not fail consistently across all platforms, though.
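One plausible mechanism, sketched with only the stdlib (toy function, not the actual check_array code): where the test configuration promotes warnings to errors, the conversion warning is raised before execution ever reaches the ValueError the test expects, so silencing the warning lets the real error surface.

```python
import warnings

def convert(value):
    """Toy stand-in for the code path under test: warn about the
    implicit conversion, then raise if the conversion fails."""
    warnings.warn("implicit string-to-float conversion attempted")
    return float(value)  # ValueError for non-numeric strings

# With warnings promoted to errors, the warning masks the ValueError:
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        convert("not a number")
    except UserWarning:
        print("warning escalated before the ValueError was reached")

# Silencing the warning lets the test reach the actual ValueError:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        convert("not a number")
    except ValueError:
        print("reached the ValueError")
```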

Member

@ogrisel ogrisel left a comment


Assuming my last commit fixes the tests, I will merge when green.

@ogrisel ogrisel merged commit b197cc0 into scikit-learn:master Nov 18, 2020
@thomasjpfan
Member Author

Thank you for finishing this up @ogrisel !

Development

Successfully merging this pull request may close these issues.

check_array(X, dtype='numeric') converts all data to strings if X contains any strings
7 participants