
NEP: Make NEP 51 to propose changing the scalar representation #22261


Merged (9 commits) on Oct 28, 2022

Conversation

@seberg (Member) commented Sep 14, 2022

As mentioned on the mailing list, I have been looking into changing the representation of scalars. This is also because the distinction between NumPy scalars and Python scalars would become larger with NEP 50.

This is an early draft for broad feedback for now. The current (not quite settled!) angle is that everything should print as:

  • np.float32(3.0), etc.
  • longdouble would get additional quotes (to round trip) and always use longdouble as the name (the same applies to the type name).
  • Bool (as suggested in ENH: Changed repr of np.bool_ #17592) should be np.True_ (although that PR spells out numpy rather than np).
  • Fun fact: on 64-bit Linux, np.longlong prints as numpy.longlong, but np.longlong(3) would show as np.int64(3). I think that is the best way to do it (a short illustrative session follows this list).
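
To illustrate (this is not NEP text, and the exact output depends on what is finally settled), the change would look roughly like this in an interactive session:

>>> np.float32(3.0)      # currently prints as 3.0
np.float32(3.0)
>>> np.bool_(True)       # currently prints as True
np.True_
>>> np.longdouble(3.0)   # currently prints as 3.0; quoted so it round-trips
np.longdouble('3.0')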

In any case, there are some details that can be discussed, and I am not sure whether we need a better way to spell decimal.Decimal(repr(numpy_scalar)) (which is used in our tests).
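
To make that concern concrete, here is a minimal sketch of the pattern under current NumPy, with one possible alternative spelling (an illustration only, not a settled recommendation):

>>> import numpy as np
>>> from decimal import Decimal
>>> x = np.float64(0.1)
>>> Decimal(repr(x))          # works today because repr(x) is just '0.1'
Decimal('0.1')
>>> # under the proposed repr ('np.float64(0.1)') this would fail, so tests
>>> # might need a different spelling, e.g. going through float first:
>>> Decimal(repr(float(x)))
Decimal('0.1')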

EDIT: The implementation of this is in gh-22449 (besides quite a bit of janitorial work, e.g. actually updating all of the docs; it seems refguide-check mostly executes the repr and doesn't actually notice the change).

@WarrenWeckesser (Member)

Is there any way for the repr of a bytes or Unicode scalar to include the actual size?

@seberg (Member, Author) commented Sep 15, 2022

> Is there any way for the repr of a bytes or Unicode scalar to include the actual size?

There is no point right now:

a = np.array(["1"], dtype="S100")
a[()].dtype

is still "S1", the length is not currently preserved when creating a scalar. If we had, that I guess we would also have np.bytes_("a", length=100).

@WarrenWeckesser (Member)

Good point, thanks.

One can directly create instances padded with trailing null characters:

In [23]: s = np.str_('abc\0\0')

In [24]: s.dtype
Out[24]: dtype('<U5')

In [25]: s
Out[25]: 'abc'

In that case, it would be nice for the repr to show the actual storage size, but I don't know if that possibility is enough incentive to do so. Perhaps it could be displayed only when there are padded zeros. E.g. repr(s) is "np.str_('abc', length=5)", but for a = np.str_('abc'), repr(a) is "np.str_('abc')". (Of course, then it would also be nice for str_ to actually accept a length argument.)

@WarrenWeckesser (Member)

Will this affect the repr of a structured data type?

E.g.

In [66]: dt = np.dtype([('x', np.float32), ('code', 'S4'), ('flag', np.bool_)])

In [67]: a = np.array((1.25, 'ABC', False), dtype=dt)[()]

In [68]: a
Out[68]: (1.25, b'ABC', False)

Would repr(a) change to "(np.float32(1.25), np.bytes_(b'ABC'), np.False_)"? I'm not implying that it should; I'm just interested in understanding the extent of the proposed changes.

@mhvk (Contributor) commented Sep 15, 2022

Like @WarrenWeckesser, curious about np.record -- ideally, it no longer pretends to be a tuple!

I like the NEP otherwise; do include the strings too.

@seberg (Member, Author) commented Sep 19, 2022

Hmm, I had forgotten about structured scalars after thinking about how we print np.void. My first thought would be to print them as:

np.void((x, z, y), dtype=...)

since that also gives the field names. Do you have a different suggestion? We would have to add that as a valid constructor, but that should not be hard.

About strings, I had not thought about the fact that it is possible to create larger strings at all. I suppose that might be something to change: if the string is longer, include the length as length=8 or dtype="S8"; users would practically never see that right now.
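
For concreteness, a rough sketch of the direction (illustrative only; np.void cannot currently be constructed this way, and the exact spelling of the dtype in the repr is not settled):

>>> dt = np.dtype([('x', np.float32), ('flag', np.bool_)])
>>> scalar = np.array((1.25, False), dtype=dt)[()]
>>> scalar   # today this prints like a tuple
(1.25, False)
>>> # under the proposal, something along the lines of:
>>> # np.void((1.25, False), dtype=[('x', '<f4'), ('flag', '?')])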

@mhvk (Contributor) commented Sep 19, 2022

Yes, I like the suggested np.void((x, z, y), dtype=...). And it would be good to be able to construct those directly. For reference, right now the docstring is:

Init signature: np.void(self, /, *args, **kwargs)
Docstring:     
Either an opaque sequence of bytes, or a structure.

>>> np.void(b'abcd')
void(b'\x61\x62\x63\x64')

Structured `void` scalars can only be constructed via extraction from :ref:`structured_arrays`:

>>> arr = np.array((1, 2), dtype=[('x', np.int8), ('y', np.int8)])
>>> arr[()]
(1, 2)  # looks like a tuple, but is `np.void`

:Character code: ``'V'``

@WarrenWeckesser (Member) commented Sep 19, 2022

Is there any way the name void can be changed to something more descriptive? Maybe struct, or record (at least for the case where the type actually is a structured data type). void has always seemed like a bad name. Perhaps split void into two distinct things: blob (or something like that) for the opaque bytes case, and struct for the structured data type. (I haven't searched, but I wouldn't be surprised if this has been discussed before.)

@seberg (Member, Author) commented Sep 19, 2022

This is getting a bit unwieldy. I think right now things are fine, because I believe Python always prints str and repr the same for its numbers (so we can use str instead of repr in many cases).

I am hoping to discuss this a bit in the call on Wednesday; maybe you can come?

For the question of structured dtypes, my current angle would be that I am not sure how salvageable np.void is. Rather, the real solution to me is probably to create a new "structured dtype" (in NumPy proper). That could fix some things with a clean slate and also consider whether structured_arr == structured_arr should (for example) not return a boolean array with the structure preserved.
But, given that, the question is whether tweaking the representation a bit is even worthwhile if we have to think about it a lot...

@mhvk (Contributor) commented Sep 19, 2022

@seberg - indeed, let the perfect not be the enemy of the good. For the NEP, maybe the most important thing is to mention structured types and postpone a decision on those...

@seberg (Member, Author) commented Sep 20, 2022

I thought this would be a relatively straightforward nut to start cracking on 🤦... But in some cases we just use the repr to print arrays, and if we bloat the repr then we bloat array printing and need new special cases.

This is half thinking out loud about how to extend the NEP: I have to extend it to include a new way to do repr for arrays. There are simply three ways of printing:

  • scalar.__str__: the raw string of the scalar.
  • scalar.__repr__: an (ideally) round-trippable representation of the scalar that carries some indication of the type (the exact representation is not settled yet).
  • item_repr(): the representation used by the array. This repr does not need to include type information, because we already have the dtype information. The naive approach would print np.array([np.str_("a")], dtype="S1").
    • Normally (or maybe always) the scalar.__repr__ would just be the same as the array __repr__ with the additional type information.

Now, that new item_repr is something we could special-case for all our dtypes. But I would rather move it onto the DType, because otherwise we probably just make things worse for new user DTypes...

And, unfortunately, that means designing that API first. Probably not too bad, but I suspect we should, for example, move the print options to a contextvar while at it.
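
As a concrete illustration of the layers involved (the outputs are current NumPy behavior; the comment about the proposal is only a sketch of the direction above):

>>> x = np.float32(3.0)
>>> str(x)         # the raw string of the scalar
'3.0'
>>> repr(x)        # currently carries no type information
'3.0'
>>> # under the proposal, repr(x) would become 'np.float32(3.0)';
>>> # inside an array, the dtype is already shown, so elements do not
>>> # need per-item type information:
>>> np.array([x])
array([3.], dtype=float32)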

The current choice is `np.float64(nan)` and not one of the alternatives...
@seberg marked this pull request as ready for review on October 25, 2022 13:59
@rossbar (Contributor) left a comment

Generally looks good, thanks for writing this up @seberg .

I took the liberty of making some minor wording/formatting fixups. There were just a few places where I wasn't quite sure what was being said. I think these instances could use just one more pass to really nail down the intended meaning (see comments below).

Another thing that I think would be really useful would be a table that compares the current repr with the proposed repr. I'd vote that this wait until the NEP is merged with draft status, however, as that might spur more instance-specific discussion.

@mhvk (Contributor) commented Oct 27, 2022

Had a look just to see what the status was -- it looks good to me!

@rossbar (Contributor) left a comment

LGTM, thanks for the writeup @seberg ! Let's get this in so it can hit the mailing list for further discussion

@rossbar merged commit 0694704 into numpy:main on Oct 28, 2022
@seberg deleted the nep-scalar-repr branch on October 28, 2022 06:30
@seberg (Member, Author) commented Oct 28, 2022

Thanks for the reviews!
