TYP: dtypes.pyi is missing StringDType #26747

Closed

bersbersbers opened this issue Jun 18, 2024 · 19 comments

@bersbersbers
Contributor

Describe the issue:

The below code runs fine, but does not type check:

Reproduce the code example:

import numpy as np

arr = np.array("abc", dtype=np.dtypes.StringDType)
arr += "def"
print(arr)

Error message:

(.venv) C:\Code\project>mypy bug.py
bug.py:3: error: Module has no attribute "StringDType"  [attr-defined]
Found 1 error in 1 file (checked 1 source file)
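
Until the stubs include StringDType, one stopgap (a sketch, not an official fix) is to silence the specific mypy error code on the offending line, since the attribute does exist at runtime:

import numpy as np

# StringDType exists at runtime but not yet in the type stubs, so a targeted
# ignore of this one error code keeps the rest of the file type-checked:
arr = np.array("abc", dtype=np.dtypes.StringDType)  # type: ignore[attr-defined]
arr += "def"
print(arr)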

Python and NumPy Versions:

NumPy 2.0.0
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]

Runtime Environment:

(not relevant)

Context for the issue:

No response

@bersbersbers
Contributor Author

Related question: what is the "scalar counterpart" of StringDType?

@bersbersbers
Contributor Author

Related issues:

@ngoldbaum
Member

I tried looking at this in #26528 and unfortunately hit the issue you asked about:

Related question: what is the "scalar counterpart" of StringDType?

In some sense there isn't one. In another sense it's Python's str type. It would almost work to write the type stubs with str as the scalar, but that would break places where str is used for the fixed-width unicode dtype elsewhere in the type stubs.

We've since decided this was probably a bad design, and the plan now is to add a scalar that interns a single string entry. I didn't want to do that initially because it would require effectively rewriting str, but now that we have string ufunc implementations it's not so bad. Unfortunately, this issue with the type stubs was only discovered relatively late in the development cycle; otherwise I would have devoted cycles to fixing it sooner.

I personally think it's a bug that the type stubs assume all dtypes must have a corresponding scalar type. That's not strictly true anymore with the new DType API.

ngoldbaum added the "component: numpy.strings" (String dtypes and functions) label on Jun 18, 2024
rgommers changed the title from "TYPING: dtypes.pyi is missing StringDType" to "dtypes.pyi is missing StringDType" on Jun 19, 2024
@seberg
Member

seberg commented Jun 19, 2024

Since we discussed it, and I think there was a lot of confusion about it, let me recap a bit.

DTypes indeed don't need a "scalar type" in principle, but in practice they have one. However, there is absolutely nothing that requires that scalar type to be defined by NumPy. It is a duality: DType <-> scalar_type, and NEP 41 is practically all about why that makes sense.

We could, and I always assumed we eventually will, also add a new abstract "NumPy scalar" type, for which issubclass(str, numpy_scalar) can be true! (That is, scalars don't need to subclass np.generic, which comes with too many weird ABI ideas.)

But, back to the issue at hand:

  • Currently, reveal_type(np.asarray(["a", "b"])) gives Revealed type is "numpy.ndarray[Any, numpy.dtype[Any]]". That is because we don't pass in the right scalar; of course the reveal fails!
  • np.dtype[str] is the best way to spell things, and it is correct for when you index an array. But it is incorrect for np.asarray(), which is implemented via _ArrayLike in the stubs.

So the point is that str is the correct element type in many ways, but coercion has special rules (as it also does for np.array([1., 2.]))!


So in other words: rather than introducing a new scalar type just for the sake of fixing typing, let's maybe take a step back and consider that the thing that is fundamentally broken is not np.dtype[str].

Instead, _ArrayLike just needs a couple of special cases, because it already needs exactly the same special cases to get the DType right in those cases!

EDIT: That is:

np.array(["a"])
np.array([b"a"])
np.array([1.])
np.array([1])  # admittedly, strictly speaking still value dependent, so maybe not possible.
np.array([1j])

All fail to resolve correctly.
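
For illustration, a rough sketch of those reveals against the NumPy 2.0-era stubs (exact output varies by version; the comments contrast the stubs with what coercion actually produces at runtime):

from typing import reveal_type  # Python 3.11+; use typing_extensions on older versions

import numpy as np

reveal_type(np.array(["a"]))   # stubs: ndarray[Any, dtype[Any]]; runtime dtype is str_
reveal_type(np.array([b"a"]))  # stubs: ndarray[Any, dtype[Any]]; runtime dtype is bytes_
reveal_type(np.array([1.0]))   # stubs: ndarray[Any, dtype[Any]]; runtime dtype is float64
reveal_type(np.array([1j]))    # stubs: ndarray[Any, dtype[Any]]; runtime dtype is complex128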

@seberg
Member

seberg commented Jun 19, 2024

It's not that I absolutely hate having a dedicated scalar, but I am not yet convinced that typing is a serious argument for it. Not having a scalar means that the scalar cannot look like an array, which some scalars do (although strings do a poor job of it!).

I personally don't care for the notion of scalars looking too much like arrays, but I agree it will be a user surprise. For some types it would be cool to not have to deal with a second scalar type just for the sake of the DType! But I dunno if that holds for strings (because of the np.array(["a"]) oddity, for one thing).

@jorenham
Member

jorenham commented Jul 3, 2024

At the moment, the type attribute of dtype[S: np.generic] is annotated as dtype[S].type: type[S], i.e. as the type of its scalar type (a subtype of np.generic).
But StringDType (a subtype of np.dtype) violates this rule:

>>> import numpy as np
>>> np.dtypes.StringDType.type
<class 'str'>

So it's currently impossible to annotate dtypes.StringDType, unless the type parameter T of dtype[T] has its upper bound changed from T: np.generic to e.g. T: np.generic | str (or T: object).
But I strongly suspect that such a change will cause more problems than it solves.

For example, consider the following statement, given some arbitrary dt: np.dtype:

np.dtype(dt.type).type is dt.type

At first glance, this seems reasonable.
I wouldn't be surprised if somewhere in the numpy codebase there is some code that implicitly rests on this assumption, e.g. type coercion/casting with multiple array-like inputs of different scalar types.
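
As a quick runtime spot-check of that assumption for a few ordinary dtypes (an illustrative sketch, not a proof):

import numpy as np

# Constructing a dtype from its own scalar type round-trips back to a dtype
# with that same scalar type, for the classic dtypes at least:
for dt in (np.dtype(np.float64), np.dtype(np.int32), np.dtype(np.str_)):
    assert np.dtype(dt.type).type is dt.type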

In fact, from a typing perspective, this can even be proved to always hold, by considering the following facts:

  1. all T: dtype[S: np.generic] have attribute T.type: type[S] (defined at __init__.pyi:891)
  2. all S: generic have attribute S.dtype: np.dtype[S] (__init__.pyi:2742)
  3. S: np.generic implements _SupportsDType[S] (defined at _typing/_dtype_like.py:84)
  4. dtype(_SupportsDType[S]) -> dtype[S] for all S: np.generic (through a __new__ overload; __init__.pyi:778-784)
  5. dtype(type[S]) -> dtype[S] (constructor call) for all S: np.generic (substitution of 3. into 4.)
  6. dtype(T.type) -> dtype[S] for all T: dtype[S: generic] (substitution of 1. into 5.)

So that's a case of "QED" I suppose.

But however sound that logic is, StringDType is in its rebellious phase (we've all been there), and lives by its own rules:

>>> np.dtype(np.dtypes.StringDType.type)
dtype('<U')
>>> _.type
<class 'numpy.str_'>
>>> np.dtypes.StringDType.type
<class 'str'>

Let's hope it won't send its numpy.dtypes friends down the wrong path too...

Anyway, my point with all this is that the current dtype.type annotations should stay in place:
they play a fundamental role in the already complicated scalar-/data-type sandcastle, and we don't want it to crumble.

Instead, I believe that StringDType is in need of a scalar type (which the other np.dtypes all have).
This would also make it possible to use npt.NDArray[np.string], which currently won't work.
And perhaps some might also consider it a good thing that fundamental universal logic won't have to be broken by changing dtype.type, but personally I care more about being able to use npt.NDArray to annotate my string arrays 🤷🏻.

@seberg
Member

seberg commented Jul 3, 2024

Isn't the one critical point in those bullets that _SupportsDType[S] cannot capture things unless it is a subclass of something with a .dtype attribute?

This part I challenge a bit: I don't understand why that should be paramount. Internally, NumPy does this via a mapping, and it must do so because we already have broader behavior anyway even with existing types:

  • np.dtype(bool).type == numpy.bool
  • np.dtype(str).type == numpy.str_ (oops)
  • np.dtype(float).type == numpy.float64
  • (Integers are complicated, since the value is usually inspected.)
  • Python datetimes go to NumPy datetimes.

All of the above are already not captured by the _SupportsDType implementation, and I think _SupportsDType should maybe be expanded to support them properly.
For _ArrayLike there may be more tricky parts, but those are tricky because the actual runtime behavior is tricky (none of which the typing stubs capture currently).
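
Those mappings are easy to confirm at runtime (a quick sketch; scalar names as of NumPy 2.0):

import numpy as np

# The builtin -> dtype mapping lives inside np.dtype's constructor, not on
# a .dtype attribute of the builtin types themselves:
assert np.dtype(bool).type is np.bool_
assert np.dtype(str).type is np.str_      # not builtins.str!
assert np.dtype(float).type is np.float64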

This would also make it possible to use npt.NDArray[np.string]

Sure, npt.NDArray[str] would be somewhat ambiguous to allow. I do think that npt.NDArray[np.dtypes.StringDType] is more correct, with npt.NDArray[scalar_type] being a shorthand (this is what the NumPy core does, after all).
But I mostly want to challenge that _SupportsDType step. I would hate to treat its limitations as a given, because it is already wrong, and because I really think the problem with strings isn't this. I would also hate to move toward cementing the assumption that this style cannot work for e.g. decimal.Decimal, where the problem doesn't exist.

@jorenham
Member

jorenham commented Jul 4, 2024

Isn't the one critical point in those bullets that _SupportsDType[S] cannot capture things unless it is a subclass of something with a .dtype attribute?

Not really. The main issue is that StringDType.type: str is fully incompatible with np.dtype.type: np.generic. This makes it impossible to write type stubs for StringDType.
There are two possible solutions:

  1. Loosen np.dtype.type: np.generic to np.dtype.type: type[Any]. But as I thoroughly explained before, this is at the very least backwards incompatible, and is likely to cause a chain-reaction of typing issues (because it'll break at least one fundamental assumption that is provably true).
  2. Have StringDType.type fall in line with every other np.dtype, i.e. give it a corresponding scalar type, e.g. np.string, so that np.dtype(np.string) -> StringDType and StringDType.type: np.string.

Clearly, option 2 is the path of least resistance.

I really don't like exceptions (even though I'm Dutch). But currently, StringDType is the dtype exception in at least two ways, both of which can be solved by introducing an np.string scalar type.


I do think that npt.NDArray[np.dtypes.StringDType] is more correct

That won't work: NDArray is defined (in PEP 695 syntax) as:

type NDArray[ST: np.generic] = np.ndarray[Any, np.dtype[ST]]

So npt.NDArray[np.dtypes.StringDType] resolves to np.ndarray[Any, np.dtype[np.dtypes.StringDType]].
But it's not possible to bind StringDType to ST in np.dtype[ST: np.generic], since
StringDType is not a subtype of np.generic.
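
The same constraint shows up with a standalone type variable; a minimal sketch of the binding rule (the names here are illustrative, not NumPy API):

from typing import Any, TypeVar

import numpy as np

ST = TypeVar("ST", bound=np.generic)  # the same bound NDArray uses

def first_item(a: np.ndarray[Any, np.dtype[ST]]) -> ST:
    # Illustrative only: return the first element as the bound scalar type.
    return a.ravel()[0]

# A checker can solve ST = np.str_ (an np.generic subclass), but it cannot
# bind np.dtypes.StringDType, which subclasses np.dtype rather than np.generic.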

@ngoldbaum
Member

For what it's worth, I'm planning on working on a scalar implementation next week during the sprints at the SciPy conference.

@jorenham
Member

jorenham commented Jul 4, 2024

@ngoldbaum Good to hear! Solving this issue will be a piece of cake once there's a string scalar type :)

@seberg
Member

seberg commented Jul 4, 2024

@jorenham except that all of the things you are complaining about seem already wrong for object arrays as well, as far as I can tell. The typing stubs get away with things that are just not true, because they don't capture all of the runtime and never really tried to.

You may be able to side-step that for StringDType (possibly at the cost of other incompatibilities, since we are not unicode subclasses anymore, at least if you use generic and not a new abstract base class).
But you cannot fix it for the object DType, and IMO there are very good reasons why you don't want to fix this in general for future DTypes either.

@jorenham
Member

jorenham commented Jul 4, 2024

@seberg I'm afraid I don't quite follow.

except that all of the things you are complaining about [...]

What am I complaining about?
Could it be that you're confusing my analysis of the problem, its two possible solutions, and the conclusion that one solution is the better one, for something else?

[...] seem already wrong for object arrays as well as far as I can tell

What exactly is wrong with np.object_ arrays? If there's a bug or a typing issue with it, it's best to open a new issue for it. I'd be happy to help solve it.

The typing stubs get away with things that are just not true because they don't capture all of the runtime and never really tried to.

What makes you think that? Isn't this very issue a prime example that the community is working hard to close the gap between the numpy type annotations and the runtime?

You may be able to side-step that for StringDType

I don't see how introducing a scalar type for StringDType (i.e. the solution I proposed, which will apparently already be implemented soon) is "side-stepping" the problem.

But you cannot fix it for object DType, and IMO there are very good reasons why you don't want to fix this in general for future DTypes either.

Fix what exactly? And what reasons are you talking about?

@seberg
Member

seberg commented Jul 4, 2024

What exactly is wrong with np.object_ arrays?

Their dtype.type is obviously not an np.generic.

EDIT: OK, fair, it actually is... But np.object_ isn't an actual type, it is just some weird placeholder that doesn't obey any reasonable type expectations.

@jorenham
Member

jorenham commented Jul 4, 2024

it is just some weird placeholder that doesn't obey any reasonable type expectations.

>>> import numpy as np
>>> np.object_
<class 'numpy.object_'>
>>> type(np.object_)
<class 'type'>

🤷🏻

But this is all rather out-of-scope for this issue.

@seberg
Member

seberg commented Jul 4, 2024

Yes, but it clearly is a meaningless abstract class. (It actually is a concrete class in C, just like np.generic, but it doesn't implement __instancecheck__, so it gets away with that.)

So there are no object instances with a .dtype attribute. There is no object that actually passes isinstance(object(), np.object_) (and that part is impossible, since np.generic comes with protocol/ABI expectations!).

And since that is the case, you can't possibly type something like:

def item(self: NDArray[scalar_type], ...) -> scalar_type: ...

or even np.generic. Now, you can argue that object is special because it maps to Any, but it would also be nice to be able to type an array of Python decimal.Decimal objects in principle, and for those the "scalar type" can also not be an np.generic.
(Again, it could be some np.numpy_scalar abstract instance, but it wouldn't come with things like .dtype being defined.)

So, you can side-step this problem for strings by defining a concrete string scalar type. But I don't see which problem here is unique to strings: the problems are shared in some form or another, either by the object dtype or by [1.2, 3.4] not being a valid _ArrayLike[np.float64].

@jorenham
Member

jorenham commented Jul 5, 2024

@seberg I didn't know that about np.object_; it's indeed awkward that np.object_() is None. But even so, this issue isn't really the place to be discussing it.

I understand that StringDType is more closely related to ObjectDType than it is to e.g. StrDType or BytesDType.
But that doesn't necessarily mean that the current issues with np.object_ will also be present in the future np.string scalar type, as it could use the same "trick" that many other scalar types have (successfully) used: use an additional builtins type as a base type. So in this case, np.string could inherit from both np.generic and builtins.str.
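
The existing scalar types already use that dual-inheritance trick, which is easy to check (a small sketch):

import numpy as np

# np.str_ and np.float64 are NumPy scalars and subclasses of a builtin, so
# their instances work in plain-Python contexts as well as in NumPy ones:
assert issubclass(np.str_, np.generic) and issubclass(np.str_, str)
assert issubclass(np.float64, np.generic) and issubclass(np.float64, float)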

But I'm sure that @ngoldbaum already has a plan in mind for implementing the StringDType scalar type.
Speculating about its possible hypothetical problems probably isn't the most constructive.

jorenham changed the title from "dtypes.pyi is missing StringDType" to "TYP: dtypes.pyi is missing StringDType" on Aug 11, 2024
@jorenham
Member

Solved through #27008

@bersbersbers
Contributor Author

Thanks, @jorenham! Am I understanding #27008 (comment) correctly in that with numpy==2.1.0, the following code is not yet supposed to type-check:

import numpy as np
import numpy.typing as npt

arr: npt.NDArray[np.dtypes.StringDType] = np.array("abc", dtype=np.dtypes.StringDType)
arr += "def"
print(arr)

Or is my type annotation not correct? I tried npt.NDArray[np.str_] but that is failing, too.

@jorenham
Member

@bersbersbers numpy.str_ and numpy.dtypes.StringDType are unrelated; StringDType has no scalar type at the moment, which is what's causing the current type-checker errors.
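
A short runtime check of that distinction (a sketch):

import numpy as np

# np.str_ is the scalar of the fixed-width Unicode dtype (kind "U"), while
# StringDType's .type is plain builtins.str, as shown earlier in this thread:
assert np.dtype(np.str_).kind == "U"
assert np.dtypes.StringDType.type is str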
