MNT Correctly errors in check_array with dtype=numeric and string/bytes #18496


Merged

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Sep 29, 2020

Reference Issues/PRs

Finishes #10495

Edit: Closes #18660

What does this implement/fix? Explain your changes.

This PR finishes the deprecation in check_array by raising for object/str/bytes dtype when dtype='numeric'.
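A minimal sketch of the resulting behaviour, assuming a hypothetical standalone helper written in plain NumPy (not the actual check_array implementation):

```python
import numpy as np

def check_numeric_dtype(array):
    """Hypothetical sketch: when a numeric dtype is requested,
    object/str/bytes (and void) arrays must cast cleanly to float64;
    otherwise a ValueError is raised instead of silently passing
    string data through."""
    array = np.asarray(array)
    if array.dtype.kind in "OUSV":  # object, unicode, bytes, void
        try:
            array = array.astype(np.float64)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                "dtype='numeric' is not compatible with arrays that "
                "cannot be cast to float64"
            ) from exc
    return array
```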

Member

@glemaitre glemaitre left a comment


LGTM. Could you mention that this closes #18660?
I think we will need an entry in what's new that mentions it as a fix and also mentions the change of behaviour.

@glemaitre glemaitre added this to the 0.24 milestone Oct 21, 2020
@ruro

ruro commented Oct 23, 2020

Hi. I was wondering, is there any reason why all np.void (meaning dtype.kind == 'V') arrays are also getting converted to float64? AFAIK, np.void is only used for structured dtypes.

So with the proposed changes:

named_uint64_dtype = np.dtype([('named_value', np.uint64, (1,))])
named_uint64_array = np.full((1, 1), np.iinfo(np.uint64).max, dtype=named_uint64_dtype)
print(named_uint64_array) # [([18446744073709551615],)]
checked = sklearn.utils.validation.check_array(named_uint64_array)
print(checked) # [[1.8447e+19]]

Notice that the checked array is silently converted to np.float64, which actually changes the value slightly, since the uint64 value range can't be exactly represented by 64-bit floats.

Also, structured dtypes with more than one field will now just break when trying to convert to np.float64:

custom_complex_dtype = np.dtype([('real', np.float64, (1,)), ('imag', np.float64, (1,))])
custom_complex_array = np.zeros((1, 1), dtype=custom_complex_dtype)
print(custom_complex_array) # [[([0.], [0.])]]
sklearn.utils.validation.check_array(custom_complex_array)
# TypeError: Cannot cast array data from dtype([('real', '<f8', (1,)), ('imag', '<f8', (1,))]) to dtype('float64') according to the rule 'unsafe'

I am not sure whether the above is desirable behaviour. It seems to me that the "convert strings/bytes to floats" issue is covered by just converting arrays whose dtype.kind is in 'OUS'?
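For reference, the dtype.kind codes mentioned here, and the uint64 precision loss described above, can be demonstrated with plain NumPy (illustrative snippet, not part of the PR):

```python
import numpy as np

# The dtype.kind codes discussed above:
print(np.dtype(object).kind)              # 'O' (object)
print(np.dtype('U10').kind)               # 'U' (unicode string)
print(np.dtype('S10').kind)               # 'S' (bytes)
print(np.dtype([('v', np.uint64)]).kind)  # 'V' (void / structured)
print(np.dtype(np.float64).kind)          # 'f' (plain numeric)

# The precision loss: uint64's maximum value is not exactly
# representable as a float64, so the cast silently changes it.
big = int(np.iinfo(np.uint64).max)        # 18446744073709551615
print(float(big) == big)                  # False: rounds up to 2**64
```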

Member

@ogrisel ogrisel left a comment


I think we should still deprecate string / bytes inputs when dtype='numeric', even in the case where they can be converted to numerical values with .astype(np.float64). I think we should even consider doing it for object dtype arrays.

For instance in the past @jnothman said: #10495 (review) and even before, in the parent issue, @amueller said: #10229 (comment)

dtype.kind == 'V' should just error directly. Implicit conversion should probably always be considered a bug, as @ruro explained in #18496 (comment).

@ruro

ruro commented Oct 26, 2020

@ogrisel well, I would have preferred to keep the current behavior of passing the array through unchanged for structured dtypes, since there are currently no known (AFAIK) issues with void-typed ndarrays. Somebody (ahem, me) might be passing structured-dtype arrays through the scikit-learn machinery, relying on the fact that structured dtypes are preserved.

Although, I'd understand if you still chose to reject void arrays, since this is definitely not a documented feature.

@glemaitre
Member

@ogrisel So you are proposing to raise a new warning and wait for 2 more versions before raising the error?

That would be fine with me. @thomasjpfan, does that look OK to you?

@ogrisel
Member

ogrisel commented Oct 30, 2020

@ogrisel well, I would have preferred to keep the current behavior of passing the array through unchanged for structured dtypes, since there are currently no known (AFAIK) issues with void-typed ndarrays. Somebody (ahem, me) might be passing structured-dtype arrays through the scikit-learn machinery, relying on the fact that structured dtypes are preserved.

That will probably crash with a lower level error message on many scikit-learn estimators, for instance those written in Cython, right?

What is your use case @ruro? Which sklearn estimators do you feed structured dtypes?

Although, I'd understand if you still chose to reject void arrays, since this is definitely not a documented feature.

Yes, sorry, I don't think we should support untested features that lack a meaningful use case.

@ogrisel So you are proposing to raise a new warning and wait for 2 more versions before raising the error?

I was proposing to start the deprecation for any string / bytes arrays in 0.24 (scheduled for removal in 0.26).

@ruro

ruro commented Oct 31, 2020

What is your use case @ruro? Which sklearn estimators do you feed structured dtypes?

Rather than feeding structured dtypes to built-in estimators, the main use case IMO is when sklearn acts as a middleman between two pieces of user code. For example, some estimators allow the user to provide their own custom metrics, loss functions, etc. In these cases, sklearn often tends to compulsively cast everything to float, which is kind of annoying. You also have to pack features of different types (categorical, ordinal, boolean, real-valued, textual, etc.) into composite dtypes if you don't want to end up with floats inside your losses/metrics.

As I mentioned, my particular use case is 100% in the "definitely-not-supported-messing-with-sklearn-internals" category.
I use the following hack

import sklearn.neighbors
import sklearn.metrics.pairwise

# Internal registries of metric names/functions (not public API):
vbrute = sklearn.neighbors.VALID_METRICS['brute']
vmetrics = sklearn.metrics.pairwise._VALID_METRICS
pairwise = sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS

def install_metric(f):
    """Register a vectorized distance function under its own name."""
    name = f.__name__
    if name not in vbrute:
        vbrute.append(name)
    if name not in vmetrics:
        vmetrics.append(name)

    pairwise[name] = f
    return f

Which allows me to register my own custom vectorized metrics:

@install_metric
def custom_metric(X, Z, **kwargs):
    """
    params:
        X: ndarray of shape (x_samples, features)
        Z: ndarray of shape (z_samples, features)
    returns: ndarray of shape (x_samples, z_samples)
    """
    pass

And then use them in any sklearn algo, which accepts a metric:

nn = sklearn.neighbors.NearestNeighbors(
    n_neighbors=3, algorithm='brute',
    metric='custom_metric', metric_params={'kwargs': 'here'}
)

I know that you can pass custom metric functions as Python callables, but then the function is evaluated for each pair of samples, which is insanely slow, and everything is unconditionally cast to float, which is really annoying.

There are probably similar (in spirit) use cases which don't rely on messing with sklearn internals, but I don't know any.

@ogrisel
Member

ogrisel commented Nov 5, 2020

ping @adrinjalali :)

@adrinjalali
Member

I wouldn't mind ignoring the void issue, I guess in almost all cases, if the user passes that kinda data, they know what they're doing.

On silently converting strings to numeric, I'd side with always warning, since this PR can make things a bit inconsistent. For instance, if a dataframe/array has some string columns, and in some folds we can convert things to numeric and in some folds we can't, then the two sets of folds are going to show very different behavior. I'd rather always warn in these cases.
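The concern can be illustrated with a small NumPy example (hypothetical data): the same string column may cast cleanly in one subset of rows and fail in another.

```python
import numpy as np

column = np.array(["1.0", "2.5", "n/a", "4.0"])

fold_a = column[:2]  # ["1.0", "2.5"] -> casts cleanly to numeric
fold_b = column[2:]  # ["n/a", "4.0"] -> cast raises

print(fold_a.astype(np.float64))  # [1.  2.5]
try:
    fold_b.astype(np.float64)
except ValueError:
    print("fold_b cannot be converted to numeric")
```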

@thomasjpfan
Member Author

Updated PR with a warning when a conversion takes place.

Member

@adrinjalali adrinjalali left a comment


Otherwise LGTM

  # make sure we actually converted to numeric:
- if dtype_numeric and array.dtype.kind == "O":
+ if dtype_numeric and array.dtype.kind in "OUSV":
      array = array.astype(np.float64)

I wouldn't mind leaving U out of this here. But no strong feelings either way.

@lorentzenchr
Member

@ogrisel If I follow the conversation and the code correctly, your requested changes are addressed and this PR is ready for merge, right?

Member

@ogrisel ogrisel left a comment


@thomasjpfan's new test fails on some platforms:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=23933&view=ms.vss-test-web.build-test-results-tab

We should probably silence the warning in this test to get a chance to reach the ValueError first.

I am not sure why it does not fail consistently across all platforms, though.
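One plausible mechanism, sketched with only the stdlib (toy function, not the actual check_array code): where the test configuration promotes warnings to errors, the conversion warning is raised before execution ever reaches the ValueError the test expects, so silencing the warning lets the real error surface.

```python
import warnings

def convert(value):
    """Toy stand-in for the code path under test: warn about the
    implicit conversion, then raise if the conversion fails."""
    warnings.warn("implicit string-to-float conversion attempted")
    return float(value)  # ValueError for non-numeric strings

# With warnings promoted to errors, the warning masks the ValueError:
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        convert("not a number")
    except UserWarning:
        print("warning escalated before the ValueError was reached")

# Silencing the warning lets the test reach the actual ValueError:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        convert("not a number")
    except ValueError:
        print("reached the ValueError")
```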

Member

@ogrisel ogrisel left a comment


Assuming my last commit fixes the tests, I will merge when green.

@ogrisel ogrisel merged commit b197cc0 into scikit-learn:master Nov 18, 2020
@thomasjpfan
Member Author

Thank you for finishing this up @ogrisel !

Development

Successfully merging this pull request may close these issues.

check_array(X, dtype='numeric') converts all data to strings if X contains any strings
7 participants