Skip to content

ENH: add _is_numeric attribute for DType classes #23190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 23, 2023

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Feb 9, 2023

This adds a new _is_numeric attribute to dtype classes which is true for numeric dtypes and false otherwise.

Concretely, adding this will make it easier for me to add support in pandas for new dtypes. Currently pandas checks if dtype.kind is one of the letter codes corresponding to numeric types, but that won't be easily extendible for new custom dtypes. This will be. See #23180 for more information about the motivation for this feature.

Currently this only handles builtin dtypes and new-style custom dtypes, I'm not sure how legacy custom dtypes should be handled (if at all).

I don't have a strong opinion about how this should be spelled. Perhaps it makes sense to make it a "public" attribute without the underscore.

@WarrenWeckesser
Copy link
Member

What is the significance of a dtype being numeric? Is there an explicitly defined API or set of behaviors associated with being numeric? For someone creating custom dtypes, that would be useful to have documented.

Copy link
Contributor

@peytondmurray peytondmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this should be public or not, but if pandas is parsing dtype.kind letter codes I'm betting there are users who would like this also - so I vote to make it public.

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Feb 9, 2023

What is the significance of a dtype being numeric?

The main motivator here is pandas, which keys a number of behaviors mainly related to data display based on whether or not a dtype is considered numeric.

Is there an explicitly defined API or set of behaviors associated with being numeric?

I think that just means that it behaves like a number and can be used interchangeably with e.g. a float or integer. So, for example, even though datetimes are just a number under the hood, since adding two dates isn't defined I didn't include it as a numeric dtype.

I don't have a concrete suggestion for exactly where to draw the dividing line though.

@mattip
Copy link
Member

mattip commented Feb 9, 2023

Python has a numbers heirarchy but it doesn't seem like the base class has any distinctive attributes.

@seberg
Copy link
Member

seberg commented Feb 10, 2023

I think we should probably have the full type hierarchy we have for scalars, we probably need it anyway eventually to replace issubdtype. Having a flag (for now/forever) seems OK for this though, because it doesn't preclude an isinstance/issubclass(dtype, np.NumericDType) for example.

I would say even timedelta's are probably better of not being numbers, but that may just be something for downstream to figure out best practices for...

@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Feb 10, 2023

As a use-case, it would be nice to know exactly how the Pandas code would treat numeric types differently from non-numeric types. What assumptions will Pandas make about the dtype when it is numeric? The exact meaning seems vague. Currently it feels like "We can't define it, but we know it when we see it."

The main motivator here is pandas, which keys a number of behaviors mainly related to data display based on whether or not a dtype is considered numeric.

Can you give a concrete example of how Pandas would treat numeric and non-numeric types differently? Is it really only relevant for the display of data?

So, for example, even though datetimes are just a number under the hood, since adding two dates isn't defined I didn't include it as a numeric dtype.

By this reasoning, any type with units would not be numeric. E.g. if x is 10 meters and y is 20 seconds, x and y cannot be added. Is that consistent with the meaning of numeric that is proposed here? I guess instances of the specific dtype (something like float64[meters]) can be added and subtracted. But this is also true for timedelta64--which I guess is what @seberg's comment is about. Couldn't timedelta64 be considered numeric?

I would expect user-defined dtypes such as quad precision, rational, quaternion, integer mod N, and integer with nan to be numeric. Would Pandas be able to handle these numeric types?

@seberg
Copy link
Member

seberg commented Feb 10, 2023

My gut feeling is: We should have something. And yes, what pandas does may well not align perfectly with is_number. Maybe they should rather have a "quantity" (which a unit/time might be), or maybe they actually want is_number and is_builtin because they want the NumPy numeric dtypes. Maybe they even need their own abstract DType that is just pre-propulated with NumPy's numeric DTypes.

I would be happy to say that this should be implemented using AbstractDTypes (i.e. add a full runtime abstract DType hierarchy to numpy.types or so, any use of flags would be private/secondary API for speed).

@seberg
Copy link
Member

seberg commented Feb 10, 2023

That said, knowing what pandas needs is interesting. I just feel like we will need to establish a check for "numeric" (and the other things in our abstract scalar type hierarchy) sooner or later, even if the pandas use-case turns out moot.

@ngoldbaum
Copy link
Member Author

Pandas has a check here to determine the correct block storage format for a given numpy dtype. The concrete issue this PR solves for me is adding a way for stringdtype to signal how pandas should handle the dtype. Since new dtypes don't have a meaningful kind attribute, I'd like a way for pandas to introspect on the dtype and determine whether or not it makes sense to represent it as numeric data or not. Specifically for me, I want a way for pandas to know whether it should use NumericBlock or NumpyBlock for the storage. Currently NumpyBlock isn't used directly anywhere in pandas because string arrays are converted to object arrays, but with stringdtype numpy has grown a native variable-width string type, and NumpyBlock is the natural storage choice. New DTypes that represent numeric data would want _is_numeric to be True and to use NumericBlock. Currently there's no way to do that, so right now all NEP 42 dtypes would end up as object arrays using the ObjectBlock storage.

As a use-case, it would be nice to know exactly how the Pandas code would treat numeric types differently from non-numeric types. What assumptions will Pandas make about the dtype when it is numeric?

I'm no pandas expert, but grepping the codebase leads to the following places where this information is used inside pandas:

  • Many methods of core pandas data types accept a numeric_only keyword argument, which is keyed off of whether or not pandas considers the dtype for the column to be numeric. For example, dataframe, groupby.
  • Many of pandas' built-in plotting routines do a check on whether or not data is numeric and does different display strategies in those cases (see e.g. this test).
  • Formatting tables for display has different behavior for numeric and non-numeric columns.
  • The _is_numeric attribute is already a required attribute for Pandas' ExtensionDtype. In principle we could also add support for NEP 42 dtypes via ExtensionDtype, but that adds unnecessary performance overhead and requires a substantial amount of boilerplate for each dtype, when in reality pandas can natively support any numpy dtype so long as its able to get the metadata it needs about the dtype.

I would expect user-defined dtypes such as quad precision, rational, quaternion, integer mod N, and integer with nan to be numeric. Would Pandas be able to handle these numeric types?

I expect so, yes. I haven't actually tried though.

Ultimately it would be up to the author of the dtype to decide whether or not they should opt-in to the flag. I think in most cases it will be pretty clear-cut (strings? nope). For all the built-in numpy dtypes it's pretty clear-cut to me, see the test I added in this PR. Agreed that timedeltas are the least clear-cut case, although I didn't include both because pandas doesn't consider timedeltas to be numeric and because I don't think it makes sense to use a timedelta interchangeably with other numbers. A unit dtype would be numeric in my opinion, since it is intended to be used interchangeably with floats and ints (where those floats and ints currently represent data with units).

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Feb 13, 2023
@seberg
Copy link
Member

seberg commented Feb 13, 2023

I guess it won't be 100% clear cut, whether quantities are numbers, and timedelta is a bit odd (but I guess it is a somewhat limited quantity as it cannot change unit).

If nothing comes up, I would probably just push on here though. Pandas also having this attribute seems like an argument, and I don't think we have a bad rabbit hole of supporting this forever.
We maybe should have that abstract DType hierarchy instead, but supporting the flag additionally (for a while) shouldn't be a big hassle.

@mattip
Copy link
Member

mattip commented Feb 14, 2023

Wouldn't it be easier to add some code to pandas, rather than to add a this to NumPy specifically for, and only for, pandas?

@mattip
Copy link
Member

mattip commented Feb 14, 2023

Especially since the only criteria for the truthiness of the attribute is for pandas

@ngoldbaum
Copy link
Member Author

Wouldn't it be easier to add some code to pandas, rather than to add a this to NumPy specifically for, and only for, pandas?

It’s true that right now that pandas would be the only consumer, but it at least partially solves a problem that all downstream libraries are going to run into if they want to consume the new user dtypes: how do we introspect on the dtypes if they all share the same kind attribute? Sure pandas could e.g. add a dependency on stringdtype and do an isinstance call for my particular usecase, but what about other custom dtypes that want to work inside pandas (or any other third-party library that wants to have special handling for different dtypes?). I don’t think having special per-dtype calls will scale well.

If I’m missing something, can you share a little more what sort of code you were thinking about that could work inside pandas, in a way that won’t require pandas to add any extra dependencies?

Would you be OK with Sebastian’s more complex abstract dtype registration system rather than this flag?

@rgommers
Copy link
Member

Hey all, let me make a few points regarding this PR and discussion, which became quite involved with also a bunch of back-and-forth on Slack.

  • The bigger-picture context: @ngoldbaum is working on custom dtypes, and on implementing a variable-length string dtype to be upstreamed into NumPy in particular (and then used in Pandas and beyond).
    • This is very valuable, finally using the new dtype and ufunc infrastructure that @seberg has been leading the work on for years and @mattip, @WarrenWeckesser and others have also spent a ton of effort on (code review, brainstorming, prototyping, etc.),
    • It is also very challenging, and it's probably not going to be first-time right. Which is why it's being implemented as a third-party dtype first, in accordance with our general strategy for adopting new dtypes.
  • Something like this _is_numeric is clearly needed.
    • As @seberg pointed out, we already have functionality like "is numeric" via checking with issubdtype on np.floating, np.inexact & co.
    • "is numeric" is an obvious generalization of that kind of thing, it's not only needed for Pandas, that's just the one use at hand right now that demonstrates the need.
    • It is pretty difficult to get this right in a formal hierarchy, and there's always going to be some level of ambiguity for user-defined dtypes that are kinda-like-numbers-but-not-in-all-aspects. That's unavoidable, and insisting on 100% watertight formal semantics will not work.
    • The future replacement for issubdtype (isdtype in the array API standard) also has a numeric category.
  • The new user-defined dtypes API is still in flux, and we're discovering gaps as we go. This is one, I'm sure there'll be a few more.

In terms of way forward and integration strategy I propose the following:

  • Keep this as private API right now (_is_numeric is fine). That way it can be used for a while before finalizing on a public API design and very clearly defined semantics,
  • Aim to revisit the overall picture of user-defined dtypes in a document at some point (after things start working). With a new string dtype, a units dtype, as well as what we new previously about quaternions etc., we will have a more complete picture of the needs,
  • Merge this PR with a new _is_numeric method, linking its docs to ENH: Fleshing out the dtype type hierarchy #23180 and keeping that issue open as a feature request for public API.

@rgommers rgommers changed the title ENH: add _is_numeric attribute for DType classes ENH: add _is_numeric attribute for DType classes Feb 14, 2023
@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Feb 14, 2023

... "is numeric" is an obvious generalization of that kind of thing,

Sure, but it needs a documented definition to be useful generally (sorry if I'm sounding like a broken record).

@rgommers, your proposal sounds fine, except I would suggest a name change. Since the current PR is specifically designed to make it easier to collaborate with Pandas in developing dtypes, and it sounds like it is a temporary quick-fix that we know will change eventually, I would be more comfortable if the attribute was called _is_pandas_numeric. It is private, so the name doesn't really matter, and that name is quite explicit about the purpose of the attribute. It will be a reminder to devs that it is something we want to change later. Then we really don't care at the moment why Pandas wants boolean to be numeric and timedelta64 to not be numeric.

Ideally, what we would then get from this collaboration with Pandas is a better idea of exactly which behaviors of a type make it numeric from their point of view. Closed under some or all of the basic arithmetic operations? Castable to a NumPy int, float or complex? Anything except string-like types? Something else (probably)? I don't know what it would look like, but it should be precise enough that if I create a dtype of my own that satisfies the conditions of the numeric specification, and I set the numeric flag, I can be sure that it will just work with Pandas (or any other library that claims to accept numeric input). But if, for example, my dtype doesn't define subtraction, and addition results in a new dtype within a parametric family (but not exactly the same dtype as the operands), and the specification for numeric says the dtype must be closed under addition and subtraction, then I know that my dtype shouldn't set the numeric attribute.

@WarrenWeckesser
Copy link
Member

@ngoldbaum, are you sure the boolean type should not be numeric? In Pandas 1.5.3, if I create a DataFrame with a boolean field, it looks like Pandas considers it to be numeric.

For example, here df has three fields, with types bool, int64, and string. The private _data attribute includes information about the "blocks" of the DataFrame. The data for the bool field is a NumericBlock, so apparently in Pandas 1.5.3, boolean is numeric:

In [115]: import pandas as pd

In [116]: pd.__version__
Out[116]: '1.5.3'

In [117]: df = pd.DataFrame([(True, 10, "abc"), (False, 20, "def"), (True, 30, "xyz")])

In [118]: df._data
Out[118]: 
BlockManager
Items: RangeIndex(start=0, stop=3, step=1)
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 1, 1), 1 x 3, dtype: bool
NumericBlock: slice(1, 2, 1), 1 x 3, dtype: int64
ObjectBlock: slice(2, 3, 1), 1 x 3, dtype: object

I don't know if the development version of Pandas has the same behavior.

Also, the Python bool type is a subclass of int, which also suggests that a boolean type should be numeric.

@ngoldbaum
Copy link
Member Author

if I create a DataFrame with a boolean field, it looks like Pandas considers it to be numeric.

Good call, you're right. This is fixed now.

@seberg
Copy link
Member

seberg commented Feb 15, 2023

Hehe, and I would have leaned towards bools not being numeric ;) (we can still tweak if we want to, I don't think it matters too much; both what we have here since pandas can work around, and whether or not we consider bool "numeric" to begin with).

I merged the headers, so there is a pretty big merge conflict now.

@rgommers
Copy link
Member

Hehe, and I would have leaned towards bools not being numeric ;) (we can still tweak if we want to, I don't think it matters too much; both what we have here since pandas can work around, and whether or not we consider bool "numeric" to begin with).

Agreed with both - not numeric and doesn't matter much right now. Python language design is just bad here, and we already don't follow it:

>>> True + True
2
>>> np.array([True]) + np.array([True])
array([ True])

#define NPY_DT_LEGACY 1 << 0
#define NPY_DT_ABSTRACT 1 << 1
#define NPY_DT_PARAMETRIC 1 << 2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted these two since they already appear in _dtype_api.h, I left NPY_DT_LEGACY since it's a private flag.

@ngoldbaum
Copy link
Member Author

I merged the headers, so there is a pretty big merge conflict now.

Fixed the conflict, thanks for that refactor! Much better without all the duplication.

@seberg I left a comment where I deleted two redundant #defines that I think you missed, let me know if those were left there on purpose.

@seberg
Copy link
Member

seberg commented Feb 18, 2023

Thanks, I would be 👍 on putting this in. I do suspect we may want to make this at least np.types.NumericDType.register(), and yes, it all isn't quite ideal.

But I want to be sure that this doesn't slow down Nathan for yak shaving or "too much complexity at once" reasons, so I am 👍 and would be happy if we merge it as is.

In other words, I am planning to merge this in few days unless anyone minds seriously or Nathan thinks it isn't the right path himself. We will be able to trivially revert/change this for a while at least.

@ngoldbaum
Copy link
Member Author

I’d be very happy to see this go in as-is.

@WarrenWeckesser
Copy link
Member

WarrenWeckesser commented Feb 18, 2023

If it helps the development of the new dtypes and their adoption in Pandas, then go for it.

My comments have been to raise a red flag about what I see as an ill-defined attribute that should not remain in NumPy as currently implemented in this PR. I worry that it is providing a quick band-aid for Pandas without having a clear definition and without being useful within NumPy itself or in any other third-party library. There are problems already: there is disagreement between NumPy and Pandas about whether a boolean type should be numeric, and Pandas has no clear reason for not considering timedelta64 to be numeric (other than implementation details). If timedelta64 is not numeric, how can any author of user-defined dtypes know whether or not their dtype should be numeric?

I understand this is all experimental, and there will almost certainly be iterations over which these issues are worked out. My hope is that, before this becomes part of the public API, a concrete definition of is_numeric is documented, so developers of user-defined dtypes can unambiguously answer the question "Should I enable the is_numeric flag?".

@mattip
Copy link
Member

mattip commented Feb 18, 2023

Can we at least change the name to is_numeric_for_pandas to indicate this is a hack added to NumPy specifically for pandas, and that dtype-designers should check out this flag's use inside pandas to figure out if they need to enable it?

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Feb 20, 2023

I see that this PR has the triage review label applied. I'm going to try to make it to the triage meeting to discuss this "in person".

I'd like to ask if there is agreement among participants in this discussion about the following goals for the new dtypes:

  • The old system, where dtypes all got a unique type number which corresponds to the publicly visible dtype.kind is not going to be extended for new dtypes.
  • We want new dtypes to be broadly useful in the scientific python ecosystem, with projects being able to consume them without adding explicit support for them or needing to import them directly for e.g. isinstance calls.
  • Therefore, numpy needs some system to discriminate between dtypes that generalizes the current system based on type numbers so that people can write code that treats new dtypes and legacy dtypes on an equal footing.

If we decide that the _is_numeric flag isn't how this should work and we want e.g. Sebastian's dtype hierarchy and ability to declare a dtype as a subdtype of an abstract dtype in numpy, that's fine, but I want to make sure that we all are aligned in having support in numpy for some kind of system for dtypes to register themselves to be described in a certain way.

Can we at least change the name to is_numeric_for_pandas to indicate this is a hack added to NumPy specifically for pandas, and that dtype-designers should check out this flag's use inside pandas to figure out if they need to enable it?

I disagree that this is a hack specifically for pandas. Right now it's most immediately useful for pandas, but any other downstream library could use this attribute and might find it useful.

For example, here is a place in dask that assumes all numpy dtypes have a statically knowable dtype.kind that is a unique letter code. Here is a place in astropy and a spot in xarray where all string dtypes are assumed to have dtype.kind in ('U', 'S'). Here's a github search for the string dtype.kind in which returns many places in open source codebases where people are doing this kind of checking and the implicit assumption is made that there are only a limited number of valid dtypes and they all have a statically knowable dtype.kind letter code that can be hard-coded. This has also come up before in numpy, see #17325.

I also want to emphasize that without the experimental flag, this PR just exposes PyTypeNum_ISNUMBER, which is useful enough inside Numpy to be used in several places. Without something like this PR, downstream users will continue to need to deal with dtype.kind for descriminating between numeric and non-numeric dtypes. I personally think it would be worthwhile to expose more of these PyTypeNum macros for the legacy dtypes and hopefully do so in a way that new dtype authors can hook into that, as is done in this PR.

@mattip
Copy link
Member

mattip commented Feb 21, 2023

I see that this PR has the triage review label applied. I'm going to try to make it to the triage meeting to discuss this "in person".

That would be great!

@rgommers
Copy link
Member

I'd like to ask if there is agreement among participants in this discussion about the following goals for the new dtypes:

  • The old system, where dtypes all got a unique type number which corresponds to the publicly visible dtype.kind is not going to be extended for new dtypes.

  • We want new dtypes to be broadly useful in the scientific python ecosystem, with projects being able to consume them without adding explicit support for them or needing to import them directly for e.g. isinstance calls.

  • Therefore, numpy needs some system to discriminate between dtypes that generalizes the current system based on type numbers so that people can write code that treats new dtypes and legacy dtypes on an equal footing.

This all makes perfect sense to me. One thing to add here is the difference between new dtypes inside NumPy and user-defined dtypes. For the latter, the current system with .kind and numbers doesn't work at all, for the former it's merely bad design but it does work.

I disagree that this is a hack specifically for pandas. Right now it's most immediately useful for pandas, but any other downstream library could use this attribute and might find it useful.

Thanks for digging up all those concrete examples.

I want to make sure that we all are aligned in having support in numpy for some kind of system for dtypes to register themselves to be described in a certain way

The full design of this should be well out of scope for this PR, but the basis for the public-facing API for using this should be array_api.isdtype (xref gh-17325 for that). Other dtypes like strings and datetimes can fit into that pattern too. User-defined dtypes require some additional "registration" logic which is clearly going to be numpy-specific. I don't have an opinion about what the machinery for that should be; what was said about it sounds very reasonable I think.

@charris
Copy link
Member

charris commented Feb 23, 2023

The upshot of the triage discussion was to put this in.

EDIT: @WarrenWeckesser Your points are good ones, but we decided that the need to move forward justified including this and that it would entail little maintenance. We also agreed that a more general design needs to be settled on going forward.

@charris charris added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Feb 23, 2023
@charris charris merged commit 2448636 into numpy:main Feb 23, 2023
@charris
Copy link
Member

charris commented Feb 23, 2023

Thanks Nathan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement triaged Issue/PR that was discussed in a triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants