
TYP,BUG: Complete type stubs for numpy.dtypes #27008


Merged
merged 2 commits into numpy:main on Aug 5, 2024

Conversation

jorenham
Member

@jorenham jorenham commented Jul 22, 2024

The numpy.dtype subtypes in numpy.dtypes were annotated as type aliases of dtype.
But their constructor signatures are incompatible (the ones in numpy.dtypes accept either 0 or 1 positional-only arguments), so the numpy.dtypes stubs were incorrect.

This PR fixes this by adding specialized dtype subtypes in the numpy.dtypes stubs.

The previously missing type hints for StringDType were also added, fixing #26747 (for the most part, see the note) and superseding #26528.

Note

The StringDType annotations aren't technically valid at the moment, since numpy.dtype[str] is invalid because str is not a subtype of np.generic.
But once its scalar type is ready (I believe @ngoldbaum was working on this), this won't be a problem anymore.


FYI, this concludes my numpy-typing PR streak (for now) of 36 PRs in 18 days - I gotta get back to my day jobs. But since my todo-list has grown more than it shrank, I have a feeling that a couple more will follow, albeit less frequently.

@ngoldbaum
Member

Awesome! Thanks so much for working on this, I definitely wouldn't have been able to figure this out. This makes adding the scalar much less pressing than it was without this fix. Unfortunately that also means I probably don't know enough about Python typing to be able to review this at more than a surface level.

I guess as a followup we can modify the np.strings type stubs to mark that they accept StringDType arrays too?

@jorenham
Member Author

jorenham commented Jul 22, 2024

I guess as a followup we can modify the np.strings type stubs to mark that they accept StringDType arrays too?

Yes exactly. But without a scalar type you won't be able to use np.typing.NDArray like is currently done.

So e.g. the add function in numpy/_core/_strings.pyi could be rewritten as something like this:

StringArray: TypeAlias = np.ndarray[Any, np.dtypes.StringDType]
T_co: TypeAlias = StringArray | U_co

@overload
def add(x1: T_co, x2: T_co) -> StringArray: ...
@overload
def add(x1: U_co, x2: U_co) -> NDArray[np.str_]: ...
@overload
def add(x1: S_co, x2: S_co) -> NDArray[np.bytes_]: ...

@jorenham
Member Author

@seberg

numpy/dtypes.pyi Outdated

@final
class Int64DType(
_TypeCodes[L["i"], L["l"], L[7]],
Member

This isn't right unfortunately (e.g. on Windows this is q). The literal code for Int64 is l or q and the type number 7 or 9 (I guess). Type numbers are fixed for the C names, not the names here.
The mypy plugin is supposed to sort this out I think: https://github.com/numpy/numpy/blob/main/numpy/typing/mypy_plugin.py

Now, maybe that plugin is the wrong approach here, though. In principle, the build-time information we need is:

  • Is int32 an int or a long?
  • Is int64 a long or a longlong?
  • What is the size of longdouble?

Additionally, even if inferable, one of those integers is always equivalent to both. One day, I would like to encode that by making the C-named version a subclass of the bit-sized one. Or by making it an alias only, so that the type number is an instance property only (i.e. the Int64 dtype instances can still be either long or longlong with their old type numbers, but they behave identically, so they only have a single type; without a subclass, the small oddity one would have to accept is that LongLong() might not return the type code we want).


My gut feeling now is that this would be much better than what we have. So if we have some cycles/will to clean it up, I might suggest:

  • Creating a new templated stub file with the crucial information (might be this file itself).
  • Ask @ngoldbaum to help with the meson tooling to fill in the blanks at compile time (need to use compile time constants/sizes to correctly cross-compile).
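The platform dependence described above is easy to observe at runtime; a small sketch:

```python
import numpy as np

# On most Unix platforms int64 is C long (char 'l', type number 7);
# on Windows it is C long long (char 'q', type number 9). The bit-sized
# Python name hides this platform difference.
d64 = np.dtype(np.int64)
print(d64.char, d64.num)
```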

Member Author

This isn't right unfortunately (i.e. on windows this is q). The literal code for Int64 is l or q and the type number 7 or 9 (I guess).
Type numbers are fixed for the C names, not the names here.

Ah ok, that makes sense. The reason I was confused is that e.g. numpy.dtypes.ByteDType is a direct alias for Int8DType, instead of the other way around. But in hindsight, the fact that Int8DType().char is 'b' instead of 'i1' should've been a giveaway.


The mypy plugin is supposed to sort this out I think

I don't consider the mypy plugin a solution myself, since there are more (and imo better) fish in the typing sea, e.g. pyright.
It's not that I'm against a mypy plugin per se, but I think that prioritizing a type-checker-agnostic solution should be our aim.

And I'm also not really sure if platform-dependent types are a good idea in general, actually.
Having your type stubs depend on e.g. the Python version or NumPy version is no problem I think, as these are explicitly known beforehand and are within the control of the user.
But platform-dependent type stubs are a different story, and could lead to those "but it works on my machine" scenarios, which could lead to 3rd-party libraries that are only type-valid on x86/64 win32 machines that compiled numpy with the default compilation options (unless it's the 3rd full moon of a prime-numbered year, obviously).

So that's why I want to at least cover all the possible typecodes and sizes here, so that it's platform-agnostic.
But on the other hand, I don't think it's a good idea to account for all theoretically possible situations. So I'm assuming that the system is either 32 or 64 bits, that the C99 standards apply, and that byte, short, and intc are 8, 16, and 32 bits (I was told that theoretically they can be bigger, but that in practice this almost never is the case; correct me if I'm wrong, I don't have much C experience).

Having platform-agnostic stubs could also help developers find platform-dependent bugs more easily. Type-checkers would then be able to detect incorrect assumptions, such as being able to cast a long to int32 in safe mode.

And while I think that generating stubs at build time is definitely a better approach than the mypy-plugin one, the platform-agnostic stubs would still have my preference.
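The kind of platform-dependent assumption mentioned above can be demonstrated at runtime; a hedged sketch:

```python
import numpy as np

# Whether a C long can be safely cast to int32 is platform-dependent:
# long is 32 bits under LLP64 (e.g. 64-bit Windows) but 64 bits under LP64
# (most 64-bit Unix), so this check gives different answers per platform.
long_dt = np.dtype('l')  # the platform's C long
ok = np.can_cast(long_dt, np.dtype(np.int32), casting='safe')
print(long_dt.itemsize * 8, ok)
```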

What do you think @seberg?

Member

🤷, dunno. You will lose some round-tripping: the ability to distinguish int64 and int32 when it comes to typecodes and chars. And that means in some cases things cannot round-trip (like np.dtype(int_dtype.char), or with struct.pack).

OTOH, yes, maybe it helps someone somewhere to clarify that if you write long you may get a 32-bit or 64-bit integer.
I suppose you can only try and see how many things get annoyed. Avoiding a plugin seems easy, but removing runtime sizes, I don't know.
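The round-tripping in question can be sketched like this (assuming NumPy >= 1.25 for np.dtypes):

```python
import struct
import numpy as np

# Because the char code matches the platform's C name ('l' or 'q' for
# int64 here), it round-trips through np.dtype and is understood by the
# struct module; a purely bit-sized code would not round-trip this way.
d = np.dtype(np.int64)
assert np.dtype(d.char) == d          # round-trip via the char code
packed = struct.pack(d.char, 42)      # struct accepts the same code
print(len(packed))  # 8
```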

@jorenham jorenham force-pushed the typing/numpy_dtypes branch from 814b032 to 48f03da Compare July 26, 2024 03:48
@jorenham jorenham requested a review from seberg July 26, 2024 03:48
@jorenham
Member Author

I've changed the widths and aliases so that they should match the C99 spec, given an LP32, ILP32, LLP64, or LP64 data model (https://en.cppreference.com/w/c/language/arithmetic_types).

So apparently a C int can be 16 bits 🤷🏻
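The widths on the current platform can be inspected via the char codes; a sketch (the comment states the C99 ranges under the data models linked above):

```python
import numpy as np

# Bit widths of the C integer types on this platform. Under the
# LP32/ILP32/LLP64/LP64 models: byte = 8, short = 16, int = 16 or 32,
# long = 32 or 64, long long = 64.
for code, cname in [('b', 'byte'), ('h', 'short'), ('i', 'intc'),
                    ('l', 'long'), ('q', 'longlong')]:
    print(cname, np.dtype(code).itemsize * 8)
```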

Member

@seberg seberg left a comment


Couple of comments. Would it be possible to encode that np.dtype[int8] actually is Int8DType? Although, that seems like it is probably some future steps.

Things like metadata of course can be None or any dict also. And the Void dtype (in theory all of them, but let's ignore that) can have any result for isalignedstruct, fields, etc., of course.

"datetime64[ns]",
"datetime64[ps]",
"datetime64[fs]",
"datetime64[as]",
Member

Can't the unit be part of the class? Just like the byteorder.

Also, for the byte-order shouldn't that refer to the order in the class definition? (But maybe the overloads become too long for it to be useful right now.)

Member Author

Can't the unit be part of the class? Just like the byteorder.

Yea that's probably better. It's not really doing anything useful at the moment, and is mostly there for aesthetics.

Also, for the byte-order shouldn't that refer to the order in the class definition?

Perhaps it should, but the only way I can think of to make that work would be to add a byteorder type parameter, and add 28+ overloads in __new__ (at least 2 for each unit).
But that might be a bit too messy compared to the advantages.
... or isn't that what you mean?

Member

That was what I meant... Anyway, since this is just about the DType classes as exposed here, let's not worry about it too much. All of this might become more relevant when the classes are properly merged...

@jorenham
Member Author

Couple of comments. Would it be possible to encode that np.dtype[int8] actually is Int8DType?

Well in theory, yes; by overloading __new__ of numpy.dtype.
But because of python/mypy#15182, this won't work in mypy, although it'll work with (at least) Pyright.

... and use a more consistent ``TypeVar`` naming scheme.

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg
Member

seberg commented Aug 5, 2024

I'll just put this in, thanks. If I understand correctly, there is still a disassociation at least in the sense that the instances returned by np.dtype are not DType instances.
It seems not a big deal to wait with it, and that also means that the integer size issue might not matter too much in practice I guess.

@seberg seberg merged commit bf822ea into numpy:main Aug 5, 2024
66 checks passed
@jorenham jorenham deleted the typing/numpy_dtypes branch August 5, 2024 14:02
@jorenham
Member Author

jorenham commented Aug 5, 2024

If I understand correctly, there is still a disassociation at least in the sense that the instances returned by np.dtype are not DType instances.

They are, but they're (strict) subtypes of dtype. So technically the current return type annotations for dtype.__new__ are correct, but (currently) not as specific as they could be.
And like I mentioned before in #27008 (comment), it's not difficult to "fix" that. But if we do, mypy users won't be able to benefit from that fix, at least not until python/mypy#15182 is fixed.

It seems not a big deal to wait with it, and that also means that the integer size issue might not matter too much in practice I guess.

I'm not sure what you mean here; didn't we cover all possible integer sizes here 🤔?

@seberg
Member

seberg commented Aug 5, 2024

Yes we did; I was wondering if the approach we took here is right or not. But I don't think it matters right now, at least in the sense that we won't break anyone.

jorenham added a commit to jorenham/numpy that referenced this pull request Aug 11, 2024
charris added a commit that referenced this pull request Aug 12, 2024