Skip to content

ENH: change object-array comparisons to prefer OO->O unfuncs #14800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 5, 2019

Conversation

mattip
Copy link
Member

@mattip mattip commented Oct 29, 2019

We have many FutureDeprecationWarnings, DeprecationWarnings, and inconsistent behaviour in array_richcompare functions. xref gh-577. In discussion there, it was suggested to prioritize OO->O over OO->? for object array richcompare, then to work through the rest of the warnings in array_richcompare and _failed_comparison_workaround. For now I completely removed OO->? and only needed to add result.astype(bool) in a few code paths.

The change did not adversely affect the sympy test suite.

@mattip
Copy link
Member Author

mattip commented Oct 29, 2019

I wasn't sure of the proper labels

a = np.array([1, np.array([1,2,3])], dtype=object)
b = np.array([1, np.array([1,2,3])], dtype=object)
self.assert_deprecated(op, args=(a, b), num=None)
res = op(a, b)
assert res.dtype == 'object'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now returns an object array instead of deprecating.

if isinstance(r, bool):
d = type(self)(r)
else:
d = r.view(type(self))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since all is np.logical_and.reduce, the OO->O mapping reduces the output to a boolean scalar, which has no view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, could pass keepdims=True, and then throw away the dimension at the end

@@ -99,7 +99,7 @@ def _replace_nan(a, val):

if a.dtype == np.object_:
# object arrays do not support `isnan` (gh-9009), so make a guess
mask = a != a
mask = (a != a).astype(bool)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref gh-14802

a = np.array([1, np.array([1,2,3])], dtype=object)
b = np.array([1, np.array([1,2,3])], dtype=object)
self.assert_deprecated(op, args=(a, b), num=None)
res = op(a, b)
assert res.dtype == 'object'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert res.dtype == 'object'
assert res.dtype == object

@mattip mattip changed the title WIP, DEP, ENH: finish richcompare changes from 1.10 ENH: change object-array comparisons to prefer OO->O unfuncs Nov 2, 2019
@mattip
Copy link
Member Author

mattip commented Nov 2, 2019

Let's put this in as a first step to finishing the deprecations in richcompare. It cleans up some of the potential blockers to the deeper changes. i changed the PR title appropriately

@charris charris merged commit 4393e0c into numpy:master Nov 5, 2019
@charris
Copy link
Member

charris commented Nov 5, 2019

Let's try, thanks Matti. I do expect that there may be some fallout downstream. Perhaps the docstrings of the affected ufuncs should have a ..versionchanged:: someplace because the version dependence of the results may be confusing.

@TomAugspurger
Copy link

This seems to have caused downstream failures in pandas. pandas-dev/pandas#29432

Essentially, pandas has a datetime subclass, Timestamp, which we'd like to compare with a datetime64[ns] ndarray and have it return an boolean ndarray. The same behavior occurs with datetime.datetime

# Before this PR
In [6]: import numpy as np

In [7]: import datetime

In [8]: np.array(['1970-01-01T00:00:00.000000000', '1970-01-01T00:00:00.000000001'],
   ...:       dtype='datetime64[ns]') == datetime.datetime(1970, 1, 1)
Out[8]: array([False, False])

And with this PR

# With #14800
In [6]: import numpy as np

In [7]: import datetime

In [8]: np.array(['1970-01-01T00:00:00.000000000', '1970-01-01T00:00:00.000000001'],
   ...:       dtype='datetime64[ns]') == datetime.datetime(1970, 1, 1)
Out[8]: array([False, False], dtype=object)

This caused failures when we tried to mask with that (now object-dtype) array.

Should pandas update, or would that be considered a regression?

@charris
Copy link
Member

charris commented Nov 6, 2019

@TomAugspurger That's unfortunate, but not unexpected. I don't think we can afford to break Pandas just like that. What do you think is the best way forward here? How long do we need to maintain compatibility with Pandas?

I'd like to leave this in for a bit to see who else has problems, but will revert it for 1.18, which I plan to branch in a week or so.

@TomAugspurger
Copy link

TomAugspurger commented Nov 6, 2019

It's easy for us to workaround (pandas-dev/pandas#29433), so not at all a problem to leave it in master (though see my note below about performance).

And just to be clear, this only failed a single test, so it's not world-ending from pandas' point of view.

I worry somewhat that the pattern of

mask = ndarray == <scalar object>
other_ndarray[mask]

is somewhat common and will cause issues for other projects.

Second concern: will this operation be slower now that we're getting an object-dtype ndarray of bools? Or were we always getting object-dtype, and NumPy converted them to bool dtype before returning?

@charris
Copy link
Member

charris commented Nov 6, 2019

@TomAugspurger Does using np.equal(a, b, dtype=bool) work for pandas? I think it uses the old version of the ufunc loop, so should not change the performance. @mattip should be back in a day or two to comment.

@jbrockmendel
Copy link
Contributor

Is ndarray[datetime64].__equals__(Timestamp) no longer returning NotImplemented? If it still defers to Timestamp, then this is something we can address in Timestamp.__richcmp__

@TomAugspurger
Copy link

TomAugspurger commented Nov 6, 2019

Oh, nevermind, it broke a lot more tests. Just didn't see it since the first one errored while pytest was collecting tests :)

For example, this is object-dtype now, where it was previously bool

In [4]: np.array(['a', 'b', 'c'], dtype="object") == 'c'
Out[4]: array([False, False, True], dtype=object)

which does break the world for pandas & pandas users :)

@mattip
Copy link
Member Author

mattip commented Nov 6, 2019

So I guess if we want to pursue this, it needs to go through a deprecation cycle. It does sound like we should revert the change for now.

@eric-wieser
Copy link
Member

In my opinion, the breakage of a[a == x] for well-behaved objects is a pretty strong argument for this change being a bad idea.

charris added a commit that referenced this pull request Nov 6, 2019
MAINT: revert gh-14800, which gave precedence to OO->O over OO->?
@mattip mattip deleted the reorder-obj-comparison-loop branch November 2, 2020 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants