ENH: Add dtype support to the array comparison ops #18128


Merged: 5 commits into numpy:master, Jan 21, 2021

Conversation

@BvB93 (Member) commented Jan 5, 2021

Per the title: this PR adds dtype support to the array comparison ops (__lt__, __le__, etc.).

Note that in order to get this to work, the variance of the `dtype` and `ndarray` parameters had to be changed from invariant (the default) to covariant. Without this, the likes of `np.dtype[np.int64]` would not be considered sub-types of `np.dtype[np.integer]`, making things needlessly complicated.
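As a minimal sketch of what the covariance change enables (the names `dt_i8` and `expects_integer_dtype` below are illustrative, not taken from this PR):

from __future__ import annotations

from typing import Any, TYPE_CHECKING
import numpy as np

def expects_integer_dtype(dt: np.dtype[np.integer[Any]]) -> None: ...

dt_i8: np.dtype[np.int64]

if TYPE_CHECKING:
    # Accepted once `np.dtype` is covariant in its scalar type:
    # `dtype[int64]` is now a subtype of `dtype[integer[Any]]`.
    expects_integer_dtype(dt_i8)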

Examples

from __future__ import annotations

from typing import Any, TYPE_CHECKING
import numpy as np

AR_i8: np.ndarray[Any, np.dtype[np.int64]]
AR_f8: np.ndarray[Any, np.dtype[np.float64]]
AR_m: np.ndarray[Any, np.dtype[np.timedelta64]]

if TYPE_CHECKING:
    # note: Revealed type is 'Union[numpy.bool_, numpy.ndarray[Any, numpy.dtype[numpy.bool_]]]'
    reveal_type(AR_f8 < AR_i8)

    # error: Unsupported operand types for < ("ndarray[Any, dtype[floating[_64Bit]]]" and "ndarray[Any, dtype[timedelta64]]")  [operator]
    reveal_type(AR_f8 < AR_m)

# `Sequence[int]`) and `str`. As `str` is a recursive sequence of
# strings, it will pass through the final overload otherwise

@overload
@BvB93 (Member Author) commented:

To summarize the various overloads (a rough sketch follows the list):

  1. Overload for filtering out any str/bytes-based array-likes. This is needed as `bytes` would otherwise be recognized as a `Sequence[int]` sub-type (which is bad) and `str` would otherwise be caught by overload 6 (as `str` is a sequence of strings, which is a sequence of strings, and so on).
  2. Number-based overload
  3. Timedelta-based overload
  4. Datetime-based overload
  5. Object-based overload. other is typed as Any here as object arrays can be extremely flexible (depending on the actual underlying objects).
  6. A final overload to handle all sequences whose nesting level is too deep for the previous overloads.
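A rough, hypothetical sketch of that ordering (the aliases `_NumberLike` and `_TD64Like`, and the class itself, are placeholders rather than the actual definitions in the stubs):

from __future__ import annotations

from typing import Any, Sequence, Union, overload
import numpy as np

_NumberLike = Union[bool, int, float, complex, np.number]
_TD64Like = Union[np.timedelta64, Sequence[np.timedelta64]]

class _ComparisonSketch:
    @overload
    def __lt__(self, other: Union[str, bytes]) -> Any: ...   # 1. catch str/bytes first
    @overload
    def __lt__(self, other: _NumberLike) -> np.bool_: ...    # 2. numbers
    @overload
    def __lt__(self, other: _TD64Like) -> np.bool_: ...      # 3. timedeltas
    @overload
    def __lt__(self, other: np.datetime64) -> np.bool_: ...  # 4. datetimes
    @overload
    def __lt__(self, other: Any) -> Any: ...                 # 5./6. object arrays and deeply nested sequences
    def __lt__(self, other: Any) -> Any:
        raise NotImplementedError("illustrative sketch only")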

@a-reich commented Jan 10, 2021

Just curious about something, since I've been following the work to add typing support for dtypes: won't making a mutable container type covariant cause issues? For the usual reason that, e.g., if `my_func` accepts `ndarray[Any, np.dtype[np.float64]]` it could assign `float64` values to the array, yet calling `my_func` with an `ndarray[Any, np.dtype[np.float32]]` would no longer raise type-check errors despite being wrong.
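For instance, a hypothetical illustration of that hazard (`my_func` as in the comment above):

from __future__ import annotations

from typing import Any
import numpy as np

def my_func(arr: np.ndarray[Any, np.dtype[np.float64]]) -> None:
    # Perfectly legal for float64, but overflows float32 to `inf` at runtime.
    arr[0] = 1e300

ar_f4: np.ndarray[Any, np.dtype[np.float32]] = np.zeros(3, dtype=np.float32)
my_func(ar_f4)  # with a covariant dtype parameter this would no longer be flagged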

@BvB93 (Member Author) commented Jan 11, 2021

@a-reich this is arguably more of a problem with the variance of `number`, and one where I'm not quite sure what the best solution would be. Namely, making `number[~NBit]` invariant (instead of covariant) w.r.t. its precision would resolve the issue. On the other hand, it would also make precision-based casting even more difficult than it already is, to the point where I'm not convinced it'd be worthwhile. The latter would mean ditching the limited cases where we can currently deal with precisions, i.e. changing the return types to the likes of `number[Any]`.

In any case, besides the abovementioned issue, the problem with leaving `ndarray` (and `dtype`) invariant is that it leads to some strange situations with the abstract(-ish) part of the generic hierarchy. For example, it would mean that a `float64` array would not be acceptable where an `inexact` array is expected.
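For example (an illustrative snippet; `expects_inexact` is a made-up name):

from __future__ import annotations

from typing import Any, TYPE_CHECKING
import numpy as np

def expects_inexact(arr: np.ndarray[Any, np.dtype[np.inexact[Any]]]) -> None: ...

AR_f8: np.ndarray[Any, np.dtype[np.float64]]

if TYPE_CHECKING:
    # Fine with a covariant dtype parameter; an invariant one would reject
    # this call, as `dtype[float64]` would not be a subtype of `dtype[inexact[Any]]`.
    expects_inexact(AR_f8)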

@seberg (Member) commented Jan 11, 2021

@BvB93 I am trying to get a full picture of this. For your last example, it seems clear that `inexact` must behave covariantly (as must all abstract dtypes). The concrete dtypes/types must be invariant (in most ways at least), but because of that, my preferred solution is to not allow subclassing at all (and if we ever do allow it, we would be very careful to note that it must be extremely limited, whatever that means exactly).

As for `number[Any]`, I am a bit confused about how/where this is used (with a bit-width).
My first thought would be that `number[Any]` should be covariant as well (it is an abstract dtype/type), but that it would be "invariant" with respect to precision, since a higher precision is not a subtype to begin with?

@a-reich commented Jan 12, 2021

@BvB93 I see what you mean about the abstract parts of the dtype hierarchy making this difficult. However, your point about the invariance of `number` w.r.t. precision is interesting.
I guess that ultimately there might not be a way to have the whole type system work "correctly" here, so it comes down to which approach generates more false positives/false negatives and is more annoying for downstream users to patch over. For example, how common are operations between different precisions in practice, and if the lack of automatic inference by mypy means users have to annotate their casts explicitly, how bad is that? Compare that to the false negatives for assignments with the wrong precision (which admittedly do not raise at runtime, but silently store truncated results).

@BvB93 (Member Author) commented Jan 12, 2021

> As for `number[Any]`, I am a bit confused about how/where this is used (with a bit-width).

To give a bit of background: `number` is currently parametrized w.r.t. a set of `npt.NBitBase` subclasses, the latter being used to represent precision via an object-based approach.

_NBit_co = TypeVar("_NBit_co", covariant=True)

class floating(inexact[_NBit_co]):
    ...

# Note that the types below are not necessarily treated as `floating` subclasses;
# it's more along the lines of how `Sequence[int]` is a subtype of plain `Sequence`.
#
# This greatly simplifies the rather complicated typing of precisions, as you can
# always just hit the panic button when things get too complicated: e.g. `return floating[Any]`
float16 = floating[_16Bit]
float32 = floating[_32Bit]
float64 = floating[_64Bit]

The key point is that, because `NBitBase` subclasses inherit from each other, whenever one of them is used as a covariant parameter mypy is allowed to simplify the likes of `Union[_64Bit, _32Bit]` to `_64Bit`, allowing for some basic precision-based casting without adding loads of overloads.
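For instance (a small sketch; the exact revealed type depends on the overloads in the stubs):

from __future__ import annotations

from typing import TYPE_CHECKING
import numpy as np

f8 = np.float64(1.0)
f4 = np.float32(1.0)

if TYPE_CHECKING:
    # Because `_32Bit` subclasses `_64Bit`, `floating[_32Bit]` is a subtype of
    # `floating[_64Bit]`, so the union of the two collapses to `floating[_64Bit]`.
    reveal_type(f8 + f4)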

I recall from #17540 that something like this was actually possible for invariant parameters, but I believe there were more immediate issues with it, at least compared to its covariant counterpart. I'd have to do a bit of digging, as I don't quite remember the details.

@BvB93 (Member Author) commented Jan 12, 2021

As a side note:
Let's hold off on merging until #18155 is in, as it would allow for the removal of an overload.

@BvB93 (Member Author) commented Jan 15, 2021

Never mind #18128 (comment); it turns out that, unfortunately, there were some serious problems with the PR referenced therein.

@charris (Member) commented Jan 19, 2021

Needs rebase

Bas van Beek added 5 commits January 19, 2021 20:09
The dtype's scalar type and the ndarray's dtype are now covariant instead of invariant.
This change is necessary in order to ensure that all `generic` subclasses can be used as the underlying scalar type.
…ere neglected

More specifically: operations between array-likes of `timedelta64` and `ndarray`s that can be cast into `timedelta64`.

For example:
    ar_i = np.array([1])
    seq_m = [np.timedelta64()]
    ar_i > seq_m
@BvB93 (Member Author) commented Jan 21, 2021

The branch has been rebased and the tests seem to be passing.

@charris merged commit 33273e4 into numpy:master Jan 21, 2021
@charris (Member) commented Jan 21, 2021

Thanks Bas.
