ENH: Make isfinite work with datetime/timedelta. #7856
Conversation
Hrm, the short-circuit here seems to be changing how relational operators get resolved.

So, it turns out the build failures here were actually just because of bugs that were introduced in
```c
else if (types[i] == PyArray_TYPE(op[i])) {
    /*
     * Don't bother doing casting checks if the type numbers exactly
     * match.
```
It might be safer to do this check outside the loop so that it only checks the first element, possibly special cased for timedelta/datetime. I can see this check going wrong for multiple input arguments.
> I can see this check going wrong for multiple input arguments.
I'm not sure I understand why the ufunc arity would make a difference. As I understand it, this function is used to check whether a ufunc-implementation registered for a specific set of input/output type codes is compatible with a set of supplied input/output arrays, and the loop here is iterating through the type codes registered for the inputs and checking whether the supplied arrays are compatible with those type codes. I can't think of a circumstance where we'd register a ufunc with input typecodes (a, b, c) where we wouldn't accept arrays with exactly those typecodes.
Is the worry that we might have something like a binary function taking two datetime/timedelta inputs that depends on the unit metadata matching exactly? I'd naively expect that it'd be that function's responsibility to validate metadata-specific constraints.
My concern with a ufunc taking multiple variables is that we don't have any way to distinguish between ufuncs that accept arguments with the same time units and ones that accept any time units. E.g., if `S` and `T` are type variables, we can't distinguish between the signatures `(datetime64[T], datetime64[T])` and `(datetime64[S], datetime64[T])`.
In practice, I'm not sure this is a major concern for ufuncs added in NumPy, because unit tests should catch these sorts of issues. But at least in theory, datetime64 ufuncs can also be written in third party code (e.g., with Numba).
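A quick way to see the problem from Python (this snippet is illustrative, not part of the PR): registered loop signatures only record type characters, so unit metadata is invisible to typenum-based matching.

```python
import numpy as np

# Loop signatures record only type characters ('M' for datetime64,
# 'm' for timedelta64); units like [s] or [ms] are not part of them.
print(np.subtract.types)  # includes e.g. 'Mm->M', 'MM->m', 'mm->m'

# Two dtypes with different units share the same type number, which
# is all a typenum-based resolver can see.
same_num = np.dtype('datetime64[s]').num == np.dtype('datetime64[ms]').num
print(same_num)  # True
```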
Basically, I think it's important to emphasize that this is a special-case rule in the comments and limit its scope as much as possible to ensure as few complications as possible.
> My concern with a ufunc taking multiple variables is that we don't have any way to distinguish between ufuncs that accept arguments with the same time units or any time units.
That makes sense, I think. It seems like the right long-term fix for this would be to provide an API for ufunc registration that provides more granularity than just the typenum (for example, by allowing registration with a full `PyArray_Descr`)?
> Basically, I think it's important to emphasize that this is a special-case rule in the comments and limit its scope as much as possible to ensure as few complications as possible.
I think if we truly just want a special case here, it's safer/clearer to just add a new type resolver instead of adding an obscure branch to the default resolver to make one datetime ufunc work. Does that seem reasonable?
> I think if we truly just want a special case here, it's safer/clearer to just add a new type resolver instead of adding an obscure branch to the default resolver to make one datetime ufunc work. Does that seem reasonable?
Yes, agreed. We could make it slightly more generic (e.g., to handle all datetime64->bool signatures instead of just `isfinite`), but this sounds good for now.
I think the long-term design we want here for handling parametrized dtypes in the ufunc system is that loops should be matched based on the typenum equivalent, and then the ufunc loop gets access to the actual `PyArray_Descr`.
AppVeyor failure looks unrelated to the changes here.
(force-pushed from 5d2d99c to f20d137)
@shoyer updated to use a custom resolver. The first build passed on every target except one. Cursory googling suggests to me that the failure wouldn't be caused by a ref-counting issue, so I'm not sure what else in the changes here would cause it.
Segfaulted on the second run. Failure appears to be real :/. It looks like the USE_CHROOT build is actually for testing 32-bit systems? How do people usually go about debugging 32-bit-specific failures?
☔ The latest upstream changes (presumably #7148) made this pull request unmergeable. Please resolve the merge conflicts.
numpy/core/src/umath/loops.c.src (Outdated)

```c
@TYPE@_isfinite(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    OUTPUT_LOOP {
        *((@type@ *)op1) = 1;
```
`@type@` seems wrong, probably the reason for your segfaults.
I am wondering whether we should do this, implement `np.isnat` instead, or do both (and if we implement isnat, do I have to do the user type resolution correctly like you did?).
From a user perspective, I would usually prefer to be able to use a function like `np.isfinite` (or maybe `np.isna`) that does not require me to know the dtype of the array ahead of time. But `np.isnat` could also be useful in some cases.
Having a `np.isna` function that handles both NaNs and NaTs might be good. I'm thinking of the future when we might have "missing value" dtypes. Also note that all the `nan*` functions could easily be modified to handle NA, or even finite values. Makes me wonder if those functions are too specific.
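A minimal sketch of the idea being discussed, assuming `np.isnat` is available; the name `is_na` and the dispatch-on-dtype-kind rules are assumptions for illustration, not a real NumPy API.

```python
import numpy as np

def is_na(arr):
    """Hypothetical np.isna: True where a value is NaN or NaT.

    Illustrative only -- dispatches on dtype kind, which is one
    possible design, not NumPy's actual one.
    """
    arr = np.asarray(arr)
    if arr.dtype.kind in 'mM':      # timedelta64 / datetime64
        return np.isnat(arr)
    if arr.dtype.kind in 'fc':      # floating / complex
        return np.isnan(arr)
    # Integers, booleans, etc. have no missing-value sentinel.
    return np.zeros(arr.shape, dtype=bool)

dates = np.array(['2016-01-01', 'NaT'], dtype='datetime64[D]')
floats = np.array([1.0, np.nan])
print(is_na(dates))    # [False  True]
print(is_na(floats))   # [False  True]
```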
So in that future, I guess we would have `np.isna` as well as `np.ismissing` or so, and `np.isna` would be something like `np.ismissing | np.isnan`?
(force-pushed from 8da67f9 to 3f8068d)
- Adds new ufunc definitions for isfinite for datetime/timedelta.
- Adds a new ufunc type resolver, ``PyUFunc_UnaryPredicateResolver``, which handles ufuncs that take a single argument and always produce a bool. This resolver is currently only used for ``isfinite``.
- Adds explicit loop definitions for ``isfinite`` on integral and boolean types. Previously, ``isfinite`` worked because integer and boolean arrays would be (comparatively expensively) coerced to float and then dispatched to float loops. We now just fill the output array with 1s, since all integral values are finite. This yields a modest speedup, albeit for a somewhat silly operation.
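The integer/boolean behavior described above can be checked directly from Python: every integral value is finite, so the output is always True.

```python
import numpy as np

# Every integer and boolean value is finite, so isfinite can simply
# fill its output with True instead of round-tripping through float.
ints = np.arange(5)
print(np.isfinite(ints))        # [ True  True  True  True  True]
print(np.isfinite(ints).all())  # True
```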
In order to handle the difference between 0.0 and -0.0 on scalar values, assert_array_equal used to compare scalar inputs against 0. This triggers a DeprecationWarning when np.datetime64 and np.timedelta64 objects are provided, because comparisons between integers and datetime types currently work but are deprecated. By chance, we used to avoid this warning because we would call ``gisfinite`` in the previous block, which would raise a TypeError on datetime types, causing us to skip to direct equality comparison. Now that ``isfinite`` works for datetimes, we continue on to the next block, which triggers the warning.

The fix is to avoid comparison against 0 entirely and just unconditionally check that signbits are equal. This is a harmless no-op for integer-based types, and still correctly handles floating-point types.
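The signbit reasoning above can be demonstrated in a few lines: `-0.0` and `0.0` compare equal, so only the signbit distinguishes them, and signbit is uniformly False for integers.

```python
import numpy as np

# -0.0 and 0.0 compare equal, so == alone cannot tell them apart;
# the signbit is what distinguishes them.
print(-0.0 == 0.0)        # True
print(np.signbit(-0.0))   # True
print(np.signbit(0.0))    # False

# For integer inputs the signbit check is a harmless no-op: it is
# False everywhere, so comparing signbits never spuriously fails.
print(np.signbit(np.arange(3)))   # [False False False]
```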
(force-pushed from 3f8068d to 571b597)
@shoyer I finally got around to fixing the issues on this PR. Segfaults on 32-bit are now fixed after applying @seberg's comment about the incorrect cast in the integer specializations. It looks like there's been some discussion in the interim on whether this is still a desirable change. Should I post a question to the mailing list?
I don't think we need to debate this on the mailing list: this seems like a very clear win to me, regardless of whether or not we choose to add a more specialized function like
@ssanderson: Yes and no. Yes, it's the right place, but no, the tests are not currently there. You want
☔ The latest upstream changes (presumably #8967) made this pull request unmergeable. Please resolve the merge conflicts.
This was actually implemented in NumPy 1.17 in a separate PR.
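For reference, the behavior this PR was after is observable on NumPy >= 1.17 (the version assumption comes from the comment above): `isfinite` accepts datetime64/timedelta64 and reports NaT as non-finite.

```python
import numpy as np

# NaT is the only non-finite datetime/timedelta value.
dates = np.array(['2016-01-01', 'NaT'], dtype='datetime64[D]')
deltas = np.array([1, np.timedelta64('NaT')], dtype='timedelta64[s]')
print(np.isfinite(dates))    # [ True False]
print(np.isfinite(deltas))   # [ True False]
```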
Adds new ufunc definitions for isfinite for datetime/timedelta.

Fixes an issue where `ufunc_loop_matches` would always return False for datetime inputs with units, because it used `PyArray_CanCast{Type,Array}To` to check if the input array was compatible with a target dtype, but the target dtype is created with `PyArray_DescrFromType`, which means it will never have any unit metadata.

The cleanest fix for this, I think, would be to copy the unit metadata from the dtype of the input array to the temporary dtype used to determine casting safety, but that requires the ability to call `get_datetime_metadata_from_dtype`, which is defined in `multiarray`, from `ufunc_loop_matches`, which is defined in `umath`, and I couldn't figure out how to make that happen correctly (I think I need to create a public wrapper and put it in `numpy_api.py`?).

The somewhat hacky fix is to short-circuit if the type number being checked matches the type of the input being checked. This avoids calling `PyArray_CanCastTo` in the case where we have exactly-equal types. It's not totally clear to me whether this is safe, though.
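The unit-metadata problem described above has a Python-level analogue (this snippet is illustrative; the mapping to `PyArray_DescrFromType` is an assumption based on the description): a dtype built from just the type number is the generic, unit-less datetime64.

```python
import numpy as np

# np.dtype('datetime64') is the generic, unit-less datetime dtype.
# It shares a type number with every unit-specific datetime dtype,
# but carries no unit metadata -- so unit-aware cast checks against
# it behave differently than against e.g. datetime64[s].
generic = np.dtype('datetime64')
seconds = np.dtype('datetime64[s]')
print(generic.num == seconds.num)   # True: same typenum
print(generic)                      # datetime64  (no unit)
print(seconds)                      # datetime64[s]
```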