TYP,BUG: Complete type stubs for numpy.dtypes #27008
Conversation
Awesome! Thanks so much for working on this, I definitely wouldn't have been able to figure this out. This makes adding the scalar much less pressing than it was without this fix. Unfortunately that also means I probably don't know enough about Python typing to be able to review this at more than a surface level. I guess as a followup we can modify the […]
Yes exactly. But without a scalar type you won't be able to use […]. So e.g.:

```python
StringArray: TypeAlias = np.ndarray[Any, np.dtypes.StringDType]
T_co: TypeAlias = StringArray | U_co

@overload
def add(x1: T_co, x2: T_co) -> StringArray: ...
@overload
def add(x1: U_co, x2: U_co) -> NDArray[np.str_]: ...
@overload
def add(x1: S_co, x2: S_co) -> NDArray[np.bytes_]: ...
```
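A hypothetical usage of such overloads (this assumes a future `StringDType` scalar type, which is exactly what is described above as still missing; the inference shown in the comment is an assumption, not current behaviour):

```python
import numpy as np

a = np.array(["spam", "ham"], dtype=np.dtypes.StringDType())
b = np.array(["x", "y"], dtype=np.dtypes.StringDType())

# With overloads like the ones sketched above, np.add(a, b) could be inferred
# as a StringDType array instead of falling back to Any.
np.add(a, b)
```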
numpy/dtypes.pyi (outdated):

```python
@final
class Int64DType(
    _TypeCodes[L["i"], L["l"], L[7]],
```
This isn't right unfortunately (i.e. on windows this is `q`). The literal code for Int64 is `l` or `q`, and the type number 7 or 9 (I guess). Type numbers are fixed for the C names, not the names here.
The mypy plugin is supposed to sort this out I think: https://github.com/numpy/numpy/blob/main/numpy/typing/mypy_plugin.py
Now, maybe that plugin is the wrong approach here, though. In principle, the build-time information we need is:
- Is `int32` an `int` or a `long`?
- Is `int64` a `long` or a `longlong`?
- What is the size of `longdouble`?
Additionally, even if inferable, one of those integers is always equivalent to both. One day, I would like to encode that by making the C-named version a subclass of the bit-sized one. Or, by making it an alias only, so that the type number is an instance property only (i.e. the `Int64` dtype instances can still be either `long` or `longlong` with their old type numbers, but they behave identically, so only have a single type. Without a subclass, the small oddity one would have to accept is that `LongLong()` might not return the type code we want).
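A rough `.pyi`-style sketch of the subclass idea described above (this is an assumption about a possible future layout, not what this PR implements; `_TypeCodes`, `L`, and `Int64DType` are the helpers and class from the stub fragment quoted earlier, assumed to be in scope):

```python
@final
class LongLongDType(                   # C-named class keeps its fixed codes:
    _TypeCodes[L["i"], L["q"], L[9]],  # kind "i", char "q", type number 9
    Int64DType,                        # ...but is also a subtype of the
):                                     # bit-sized Int64DType
    ...
```

The alternative mentioned above would instead make `LongLongDType` a plain alias of `Int64DType`, with `.char` and `.num` typed as instance-level unions rather than class-level literals.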
My gut feeling now is that this would be much better than what we have. So if we have some cycles/will to clean it up, I might suggest:
- Creating a new templated stub file with the crucial information (might be this file itself); see the sketch below.
- Asking @ngoldbaum to help with the meson tooling to fill in the blanks at compile time (we need to use compile-time constants/sizes to correctly cross-compile).
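A hypothetical fragment of such a templated stub (the file name and placeholder names are assumptions; meson's `configure_file()` can substitute `@VAR@`-style placeholders like these at build time):

```python
# numpy/dtypes.pyi.in -- hypothetical template; @INT64_CHAR@ and @INT64_NUM@
# would be filled in from compile-time information about the platform's C types.
@final
class Int64DType(
    _TypeCodes[L["i"], L["@INT64_CHAR@"], L[@INT64_NUM@]],
): ...
```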
> This isn't right unfortunately (i.e. on windows this is `q`). The literal code for Int64 is `l` or `q`, and the type number 7 or 9 (I guess). Type numbers are fixed for the C names, not the names here.
Ah ok, that makes sense. The reason I was confused is because e.g. `numpy.dtypes.ByteDType` is a direct alias for `Int8DType`, instead of the other way around. But in hindsight, the fact that `Int8DType().char` is `'b'`, instead of `'i1'`, should've been a giveaway.
> The mypy plugin is supposed to sort this out I think
I don't consider the mypy plugin a solution myself, since there are more (and imo better) fish in the typing sea, e.g. pyright.
It's not that I'm against a mypy plugin per se, but I think that prioritizing a type-checker-agnostic solution should be our aim.
And I'm also not really sure whether platform-dependent types are a good idea in general, actually.
Having your type stubs depend on e.g. the Python version or NumPy version is no problem, I think, as these are explicitly known beforehand and are within the control of the user.
But platform-dependent type stubs are a different story, and could lead to those "but it works on my machine" scenarios, which could result in third-party libraries that are only type-valid on x86/64 win32 machines that compiled numpy with the default compilation options (unless it's the 3rd full moon of a prime-numbered year, obviously).
So that's why I want to at least cover all the possible typecodes and sizes here, so that it's platform-agnostic.
But on the other hand, I don't think it's a good idea to account for all theoretically possible situations, so I assume that the system is either 32 or 64 bits, that the C99 standard applies, and that `byte`, `short`, and `intc` are 8, 16, and 32 bits (I was told that theoretically they can be bigger, but that in practice this almost never is the case; correct me if I'm wrong, I don't have much C experience).
Having platform-agnostic stubs could also help developers find platform-dependent bugs more easily. Type-checkers will then be able to detect incorrect assumptions, such as being able to cast a `long` to `int32` in safe mode.
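A hypothetical illustration of the kind of mistake such stubs could catch (whether a checker actually flags this depends on the stubs and the checker, so this is a sketch rather than a guarantee):

```python
import numpy as np

def consume_int32(dt: np.dtypes.Int32DType) -> None: ...

# On platforms where the C `long` is 64 bits, a LongDType is not an Int32DType;
# with platform-agnostic stubs a type checker could flag this assumption on
# every platform, rather than only where it happens to be wrong.
consume_int32(np.dtypes.LongDType())
```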
And while I think that generating stubs at build time is definitely a better approach than the mypy-plugin one, the platform-agnostic stubs would still have my preference.
What do you think @seberg?
🤷, dunno. You will lose some round-tripping, since you will lose the ability to distinguish `int64` and `int32` when it comes to typecodes and chars. And that means in some cases things cannot round-trip (like `np.dtype(int_dtype.char)`, or with the `struct.pack` module).
OTOH, yes, maybe it helps someone somewhere to clarify that if you write `long` you may get a 32bit or 64bit integer.
I suppose you can only try and see how many things get annoyed. Avoiding a plugin seems easy, but removing runtime sizes, I don't know.
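A small runnable illustration of the round-tripping concern (the static part is an assumption about how union-typed chars would be handled; the runtime behaviour is standard NumPy):

```python
import struct

import numpy as np

dt = np.dtype(np.int64)
dt2 = np.dtype(dt.char)              # round-trips at runtime: dt2 == dt
packed = struct.pack(dt.char, 123)   # struct also keys off the exact char

# Statically, if Int64DType.char were typed as a union like Literal["l", "q"],
# np.dtype(dt.char) could no longer be narrowed back to the same 64-bit dtype.
assert dt2 == dt
```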
Force-pushed from 814b032 to 48f03da.
I've changed the widths and aliases so that they should match the C99 spec, given a […]. So apparently a C int can be 16 bits 🤷🏻
Couple of comments. Would it be possible to encode that `np.dtype[int8]` actually is `Int8DType`? Although, that is probably some future step.
Things like `metadata` of course can be `None` or any dict also. And the void dtype (in theory all, but let's ignore that) can have any result for `isalignedstruct`, `fields`, etc. of course.
"datetime64[ns]", | ||
"datetime64[ps]", | ||
"datetime64[fs]", | ||
"datetime64[as]", |
Can't the unit be part of the class? Just like the byteorder.
Also, for the byte-order shouldn't that refer to the order in the class definition? (But maybe the overloads become too long for it to be useful right now.)
> Can't the unit be part of the class? Just like the byteorder.
Yea that's probably better. It's not really doing anything useful at the moment, and is mostly there for aesthetics.
> Also, for the byte-order shouldn't that refer to the order in the class definition?
Perhaps it should, but the only way I can think of to make that work would be to add a byteorder type parameter, and add 28+ overloads in `__new__` (at least 2 for each unit).
But that might be a bit too messy, compared to the advantages.
... or isn't that what you mean?
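A rough stub-style sketch of what parametrizing on the unit could look like (entirely hypothetical; the type-parameter name, the unit literals, and the overload shape are assumptions, and only two of the many required overloads are shown):

```python
from typing import Generic, Literal as L, TypeVar, overload

_UnitT = TypeVar("_UnitT", bound=str)

class DateTime64DType(Generic[_UnitT]):
    @overload
    def __new__(cls, unit: L["ns"]) -> "DateTime64DType[L['ns']]": ...
    @overload
    def __new__(cls, unit: L["us"]) -> "DateTime64DType[L['us']]": ...
    # ...one overload per supported unit; encoding the byte order as a second
    # type parameter would at least double this, which is where the 28+
    # overloads mentioned above come from.
```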
That was what I meant... Anyway, since this is just about the DType classes as exposed here, let's not worry about it too much. All of this might become more relevant when the classes are properly merged...
Well in theory, yes; by overloading […]
... and use a more consistent ``TypeVar`` naming scheme.

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
I'll just put this in, thanks. If I understand correctly, there is still a disassociation, at least in the sense that the instances returned by […]
They are, but they're (strict) subtypes of […]
I'm not sure what you mean here; didn't we cover all possible integer sizes here 🤔?
Yes we did; I was wondering whether the approach we took here is right or not. But I don't think it matters right now, at least in the sense that we won't break anyone.
DOC: Add release notes for #27008
The `numpy.dtype` subtypes in `numpy.dtypes` were annotated as type aliases of `dtype`. But their constructor signatures are incompatible (the ones in `numpy.dtypes` accept either 0 or 1 positional-only arguments), so the `numpy.dtypes` stubs were incorrect. This PR fixes this by adding specialized `dtype` subtypes in the `numpy.dtypes` stubs.

The previously missing type hints for `StringDType` were also added, fixing #26747 (for the most part, see the note) and superseding #26528.
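For illustration, a `.pyi`-style sketch of the kind of specialized subtype described above (the exact shape is an assumption based on this description; see `numpy/dtypes.pyi` in this PR for the real definitions, and note that such a subclass only exists in the stubs, not as runnable Python):

```python
@final
class Float64DType(np.dtype[np.float64]):
    def __new__(cls, /) -> Float64DType: ...  # accepts no positional arguments
```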
Note

The `StringDType` annotations aren't technically valid at the moment, since `numpy.dtype[str]` is invalid because `str` is not a subtype of `np.generic`. But once its scalar type is ready (I believe @ngoldbaum was working on this), this won't be a problem anymore.
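A minimal runnable illustration of the limitation described in the note (what a type checker infers here is an assumption; the runtime behaviour is standard NumPy 2.x):

```python
import numpy as np

dt = np.dtypes.StringDType()
arr = np.array(["spam", "ham"], dtype=dt)  # works fine at runtime

# Statically, arr's dtype cannot yet be spelled as np.dtype[str], because str
# is not a subclass of np.generic; a dedicated StringDType scalar type would
# make this expressible.
print(arr.dtype)  # StringDType()
```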
FYI, this concludes my numpy-typing PR streak (for now) of 36 PRs in 18 days - I gotta get back to my day jobs. But since my todo-list has grown more than it shrank, I have a feeling that a couple more will follow, albeit less frequently.