-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: add _is_numeric
attribute for DType classes
#23190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
What is the significance of a dtype being numeric? Is there an explicitly defined API or set of behaviors associated with being numeric? For someone creating custom dtypes, that would be useful to have documented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this should be public or not, but if pandas is parsing dtype.kind
letter codes I'm betting there are users who would like this also - so I vote to make it public.
The main motivator here is pandas, which keys a number of behaviors mainly related to data display based on whether or not a dtype is considered numeric.
I think that just means that it behaves like a number and can be used interchangeably with e.g. a float or integer. So, for example, even though datetimes are just a number under the hood, since adding two dates isn't defined I didn't include it as a numeric dtype. I don't have a concrete suggestion for exactly where to draw the dividing line though. |
Python has a numbers heirarchy but it doesn't seem like the base class has any distinctive attributes. |
I think we should probably have the full type hierarchy we have for scalars, we probably need it anyway eventually to replace I would say even timedelta's are probably better of not being numbers, but that may just be something for downstream to figure out best practices for... |
As a use-case, it would be nice to know exactly how the Pandas code would treat numeric types differently from non-numeric types. What assumptions will Pandas make about the dtype when it is numeric? The exact meaning seems vague. Currently it feels like "We can't define it, but we know it when we see it."
Can you give a concrete example of how Pandas would treat numeric and non-numeric types differently? Is it really only relevant for the display of data?
I would expect user-defined dtypes such as quad precision, rational, quaternion, integer mod N, and integer with nan to be numeric. Would Pandas be able to handle these numeric types? |
My gut feeling is: We should have something. And yes, what pandas does may well not align perfectly with I would be happy to say that this should be implemented using AbstractDTypes (i.e. add a full runtime abstract DType hierarchy to |
That said, knowing what pandas needs is interesting. I just feel like we will need to establish a check for "numeric" (and the other things in our abstract scalar type hierarchy) sooner or later, even if the pandas use-case turns out moot. |
Pandas has a check here to determine the correct block storage format for a given numpy dtype. The concrete issue this PR solves for me is adding a way for
I'm no pandas expert, but grepping the codebase leads to the following places where this information is used inside pandas:
I expect so, yes. I haven't actually tried though. Ultimately it would be up to the author of the dtype to decide whether or not they should opt-in to the flag. I think in most cases it will be pretty clear-cut (strings? nope). For all the built-in numpy dtypes it's pretty clear-cut to me, see the test I added in this PR. Agreed that timedeltas are the least clear-cut case, although I didn't include both because pandas doesn't consider timedeltas to be numeric and because I don't think it makes sense to use a timedelta interchangeably with other numbers. A unit dtype would be numeric in my opinion, since it is intended to be used interchangeably with floats and ints (where those floats and ints currently represent data with units). |
I guess it won't be 100% clear cut, whether quantities are numbers, and timedelta is a bit odd (but I guess it is a somewhat limited quantity as it cannot change unit). If nothing comes up, I would probably just push on here though. Pandas also having this attribute seems like an argument, and I don't think we have a bad rabbit hole of supporting this forever. |
Wouldn't it be easier to add some code to pandas, rather than to add a this to NumPy specifically for, and only for, pandas? |
Especially since the only criteria for the truthiness of the attribute is for pandas |
It’s true that right now that pandas would be the only consumer, but it at least partially solves a problem that all downstream libraries are going to run into if they want to consume the new user dtypes: how do we introspect on the dtypes if they all share the same If I’m missing something, can you share a little more what sort of code you were thinking about that could work inside pandas, in a way that won’t require pandas to add any extra dependencies? Would you be OK with Sebastian’s more complex abstract dtype registration system rather than this flag? |
Hey all, let me make a few points regarding this PR and discussion, which became quite involved with also a bunch of back-and-forth on Slack.
In terms of way forward and integration strategy I propose the following:
|
_is_numeric
attribute for DType classes
Sure, but it needs a documented definition to be useful generally (sorry if I'm sounding like a broken record). @rgommers, your proposal sounds fine, except I would suggest a name change. Since the current PR is specifically designed to make it easier to collaborate with Pandas in developing dtypes, and it sounds like it is a temporary quick-fix that we know will change eventually, I would be more comfortable if the attribute was called Ideally, what we would then get from this collaboration with Pandas is a better idea of exactly which behaviors of a type make it numeric from their point of view. Closed under some or all of the basic arithmetic operations? Castable to a NumPy int, float or complex? Anything except string-like types? Something else (probably)? I don't know what it would look like, but it should be precise enough that if I create a dtype of my own that satisfies the conditions of the numeric specification, and I set the |
@ngoldbaum, are you sure the boolean type should not be numeric? In Pandas 1.5.3, if I create a DataFrame with a boolean field, it looks like Pandas considers it to be numeric. For example, here
I don't know if the development version of Pandas has the same behavior. Also, the Python |
Good call, you're right. This is fixed now. |
Hehe, and I would have leaned towards bools not being numeric ;) (we can still tweak if we want to, I don't think it matters too much; both what we have here since pandas can work around, and whether or not we consider bool "numeric" to begin with). I merged the headers, so there is a pretty big merge conflict now. |
Agreed with both - not numeric and doesn't matter much right now. Python language design is just bad here, and we already don't follow it: >>> True + True
2
>>> np.array([True]) + np.array([True])
array([ True]) |
1646ec6
to
b3df02c
Compare
b3df02c
to
39d88df
Compare
#define NPY_DT_LEGACY 1 << 0 | ||
#define NPY_DT_ABSTRACT 1 << 1 | ||
#define NPY_DT_PARAMETRIC 1 << 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted these two since they already appear in _dtype_api.h
, I left NPY_DT_LEGACY
since it's a private flag.
Fixed the conflict, thanks for that refactor! Much better without all the duplication. @seberg I left a comment where I deleted two redundant |
Thanks, I would be 👍 on putting this in. I do suspect we may want to make this at least But I want to be sure that this doesn't slow down Nathan for yak shaving or "too much complexity at once" reasons, so I am 👍 and would be happy if we merge it as is. In other words, I am planning to merge this in few days unless anyone minds seriously or Nathan thinks it isn't the right path himself. We will be able to trivially revert/change this for a while at least. |
I’d be very happy to see this go in as-is. |
If it helps the development of the new dtypes and their adoption in Pandas, then go for it. My comments have been to raise a red flag about what I see as an ill-defined attribute that should not remain in NumPy as currently implemented in this PR. I worry that it is providing a quick band-aid for Pandas without having a clear definition and without being useful within NumPy itself or in any other third-party library. There are problems already: there is disagreement between NumPy and Pandas about whether a boolean type should be numeric, and Pandas has no clear reason for not considering I understand this is all experimental, and there will almost certainly be iterations over which these issues are worked out. My hope is that, before this becomes part of the public API, a concrete definition of |
Can we at least change the name to |
I see that this PR has the triage review label applied. I'm going to try to make it to the triage meeting to discuss this "in person". I'd like to ask if there is agreement among participants in this discussion about the following goals for the new dtypes:
If we decide that the
I disagree that this is a hack specifically for pandas. Right now it's most immediately useful for pandas, but any other downstream library could use this attribute and might find it useful. For example, here is a place in dask that assumes all numpy dtypes have a statically knowable I also want to emphasize that without the experimental flag, this PR just exposes |
That would be great! |
This all makes perfect sense to me. One thing to add here is the difference between new dtypes inside NumPy and user-defined dtypes. For the latter, the current system with
Thanks for digging up all those concrete examples.
The full design of this should be well out of scope for this PR, but the basis for the public-facing API for using this should be |
The upshot of the triage discussion was to put this in. EDIT: @WarrenWeckesser Your points are good ones, but we decided that the need to move forward justified including this and that it would entail little maintenance. We also agreed that a more general design needs to be settled on going forward. |
Thanks Nathan. |
This adds a new
_is_numeric
attribute to dtype classes which is true for numeric dtypes and false otherwise.Concretely, adding this will make it easier for me to add support in pandas for new dtypes. Currently pandas checks if
dtype.kind
is one of the letter codes corresponding to numeric types, but that won't be easily extendible for new custom dtypes. This will be. See #23180 for more information about the motivation for this feature.Currently this only handles builtin dtypes and new-style custom dtypes, I'm not sure how legacy custom dtypes should be handled (if at all).
I don't have a strong opinion about how this should be spelled. Perhaps it makes sense to make it a "public" attribute without the underscore.