
NEP: Make NEP 51 to propose changing the scalar representation #22261


Merged (9 commits) on Oct 28, 2022

Conversation

@seberg (Member) commented Sep 14, 2022

As mentioned on the mailing list, I have been looking into changing the representation of scalars. This is also because the distinction between NumPy scalars and Python scalars would become larger with NEP 50.

This is an early draft for broad feedback for now. The current (not quite settled!) angle is that everything should print as:

  • np.float32(3.0), etc.
  • longdouble would get additional quotes (to round trip) and always use longdouble as the name (the same applies to the type name).
  • Bool (as suggested in ENH: Changed repr of np.bool_ #17592) should be np.True_ (although that PR spells out numpy rather than np).
  • Fun fact: on 64-bit Linux, np.longlong prints as numpy.longlong, but np.longlong(3) would show as np.int64(3). I think that is the best way to do it (a short illustrative session follows this list).
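
To illustrate (this is not NEP text, and the exact output depends on what is finally settled), the change would look roughly like this in an interactive session:

>>> np.float32(3.0)      # currently prints as 3.0
np.float32(3.0)
>>> np.bool_(True)       # currently prints as True
np.True_
>>> np.longdouble(3.0)   # currently prints as 3.0; quoted so it round-trips
np.longdouble('3.0')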

In any case, there are some details that can be discussed, and I am not sure whether we need a better way to spell decimal.Decimal(repr(numpy_scalar)) (which is used in our tests).
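
To make that concern concrete, here is a minimal sketch of the pattern under current NumPy, with one possible alternative spelling (an illustration only, not a settled recommendation):

>>> import numpy as np
>>> from decimal import Decimal
>>> x = np.float64(0.1)
>>> Decimal(repr(x))          # works today because repr(x) is just '0.1'
Decimal('0.1')
>>> # under the proposed repr ('np.float64(0.1)') this would fail, so tests
>>> # might need a different spelling, e.g. going through float first:
>>> Decimal(repr(float(x)))
Decimal('0.1')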

EDIT: The implementation of this is in gh-22449 (besides quite a bit of janitorial work, e.g. actually updating all of the docs; it seems refguide-check mostly executes the repr and doesn't actually notice the change).

@WarrenWeckesser (Member)

Is there any way for the repr of a bytes or Unicode scalar to include the actual size?

@seberg (Member, Author) commented Sep 15, 2022

> Is there any way for the repr of a bytes or Unicode scalar to include the actual size?

There is no point right now:

a = np.array(["1"], dtype="S100")
a[()].dtype

is still "S1", the length is not currently preserved when creating a scalar. If we had, that I guess we would also have np.bytes_("a", length=100).

@WarrenWeckesser (Member)

Good point, thanks.

One can directly create instances padded with trailing null characters:

In [23]: s = np.str_('abc\0\0')

In [24]: s.dtype
Out[24]: dtype('<U5')

In [25]: s
Out[25]: 'abc'

In that case, it would be nice for the repr to show the actual storage size, but I don't know if that possibility is enough incentive to do so. Perhaps it could be displayed only when there are padded zeros. E.g. repr(s) is "np.str_('abc', length=5)", but for a = np.str_('abc'), repr(a) is "np.str_('abc')". (Of course, then it would also be nice for str_ to actually accept a length argument.)

@WarrenWeckesser (Member)

Will this affect the repr of a structured data type?

E.g.

In [66]: dt = np.dtype([('x', np.float32), ('code', 'S4'), ('flag', np.bool_)])

In [67]: a = np.array((1.25, 'ABC', False), dtype=dt)[()]

In [68]: a
Out[68]: (1.25, b'ABC', False)

Would repr(a) change to "(np.float32(1.25), np.bytes_(b'ABC'), np.False_)"? I'm not implying that it should; I'm just interested in understanding the extent of the proposed changes.

@mhvk (Contributor) commented Sep 15, 2022

Like @WarrenWeckesser, curious about np.record -- ideally, it no longer pretends to be a tuple!

I like the NEP otherwise; do include the strings too.

@seberg (Member, Author) commented Sep 19, 2022

Hmm, I had forgotten about structured scalars after thinking about how we print np.void. My first thought would be to print them as:

np.void((x, z, y), dtype=...)

since that also gives the field names. Do you have a different suggestion? We would have to add that as a valid constructor, but that should not be hard.

About strings, I had not thought about the fact that it is possible to create larger strings at all. I suppose that might be something to change: if the string is longer, include the length as length=8 or dtype="S8"; users would practically never see that right now.
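
For concreteness, a rough sketch of the direction (illustrative only; np.void cannot currently be constructed this way, and the exact spelling of the dtype in the repr is not settled):

>>> dt = np.dtype([('x', np.float32), ('flag', np.bool_)])
>>> scalar = np.array((1.25, False), dtype=dt)[()]
>>> scalar   # today this prints like a tuple
(1.25, False)
>>> # under the proposal, something along the lines of:
>>> # np.void((1.25, False), dtype=[('x', '<f4'), ('flag', '?')])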

@mhvk (Contributor) commented Sep 19, 2022

Yes, I like the suggested np.void((x, z, y), dtype=...). And it would be good to be able to construct those directly. For reference, right now the docstring is:

Init signature: np.void(self, /, *args, **kwargs)
Docstring:     
Either an opaque sequence of bytes, or a structure.

>>> np.void(b'abcd')
void(b'\x61\x62\x63\x64')

Structured `void` scalars can only be constructed via extraction from :ref:`structured_arrays`:

>>> arr = np.array((1, 2), dtype=[('x', np.int8), ('y', np.int8)])
>>> arr[()]
(1, 2)  # looks like a tuple, but is `np.void`

:Character code: ``'V'``

@WarrenWeckesser (Member) commented Sep 19, 2022

Is there any way the name void can be changed to something more descriptive? Maybe struct, or record (at least for the case where the type actually is a structured data type). void has always seemed like a bad name. Perhaps split void into two distinct things: blob (or something like that) for the opaque bytes case, and struct for the structured data type. (I haven't searched, but I wouldn't be surprised if this has been discussed before.)

@seberg (Member, Author) commented Sep 19, 2022

This is getting a bit unwieldy. I think right now things are fine, because I believe Python always prints str and repr the same for its numbers (so we can use str instead of repr in many cases).

I am hoping to discuss this a bit in the call on Wednesday; maybe you can come?

For the question of structured dtypes, my current angle would be that I am not sure how salvageable np.void is. Rather, the real solution to me is probably to create a new "structured dtype" (in NumPy proper). That could fix some things with a clean slate and also consider whether structured_arr == structured_arr should (for example) not return a boolean array with the structure preserved.
But, given that, the question is whether tweaking the representation a bit is even worthwhile if we have to think about it a lot...

@mhvk (Contributor) commented Sep 19, 2022

@seberg - indeed, let the perfect not be the enemy of the good. For the NEP, maybe the most important thing is to mention structured types and postpone a decision on those...

@seberg (Member, Author) commented Sep 20, 2022

I thought this would be a relatively straightforward nut to start cracking on 🤦... But in some cases we just use the repr to print arrays, and if we bloat the repr then we bloat array printing and need new special cases.

This is half thinking out loud about how to extend the NEP: I have to extend it to include a new way to do repr for arrays. There are simply three ways of printing:

  • scalar.__str__: the raw string of the scalar.
  • scalar.__repr__: an (ideally) round-trippable representation of the scalar that carries some indication of the type (the exact representation is not settled yet).
  • item_repr(): the representation used by the array. This repr does not need to include type information, because we already have the dtype information. The naive approach would print np.array([np.str_("a")], dtype="S1").
    • Normally (or maybe always) the scalar.__repr__ would just be the same as the array __repr__ with the additional type information.

Now, that new item_repr is something we could special-case for all our dtypes. But I would rather move it onto the DType, because otherwise we probably just make things worse for new user DTypes...

And, unfortunately, that means designing that API first. Probably not too bad, but I suspect we should, for example, move the print options to a contextvar while at it.
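
As a concrete illustration of the layers involved (the outputs are current NumPy behavior; the comment about the proposal is only a sketch of the direction above):

>>> x = np.float32(3.0)
>>> str(x)         # the raw string of the scalar
'3.0'
>>> repr(x)        # currently carries no type information
'3.0'
>>> # under the proposal, repr(x) would become 'np.float32(3.0)';
>>> # inside an array, the dtype is already shown, so elements do not
>>> # need per-item type information:
>>> np.array([x])
array([3.], dtype=float32)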

The current choice is `np.float64(nan)` and not one of the alternatives...
@seberg marked this pull request as ready for review on October 25, 2022 13:59
@rossbar (Contributor) left a comment

Generally looks good, thanks for writing this up @seberg .

I took the liberty of making some minor wording/formatting fixups. There were just a few places where I wasn't quite sure what was being said. I think these instances could use just one more pass to really nail down the intended meaning (see comments below).

Another thing that I think would be really useful would be a table that compares the current repr with the proposed repr. I'd vote that this wait until the NEP is merged with draft status, however, as that might spur more instance-specific discussion.

@mhvk (Contributor) commented Oct 27, 2022

Had a look just to see what the status was -- it looks good to me!

@rossbar (Contributor) left a comment

LGTM, thanks for the writeup @seberg ! Let's get this in so it can hit the mailing list for further discussion

@rossbar merged commit 0694704 into numpy:main on Oct 28, 2022
@seberg deleted the nep-scalar-repr branch on October 28, 2022 06:30
@seberg (Member, Author) commented Oct 28, 2022

Thanks for the reviews!
