gh-74756: support precision field for integer formatting types #131926


Closed
wants to merge 16 commits

Conversation

@skirpichev (Member) commented Mar 31, 2025

For integer presentation types (excluding 'c'), the precision gives the minimal number of digits to appear, expanded with an appropriate number of leading zeros.

If the 'z' option is specified for non-decimal presentation types, the integer value is interpreted as two's complement and the precision gives its minimum size precision*k in bits, where k=1, 3, 4 for the 'b', 'o' and 'x'/'X' types, respectively.

A precision of 0 is treated as equivalent to a precision of 1.

Examples:

>>> f"{-12:z.8b}"
'11110100'
>>> f"{-12:#.8b}"
'-0b00001100'
>>> f"{200:z.8b}"
'011001000'
>>> f"{200:.8b}"
'11001000'
>>> f"{123:.8d}"
'00000123'
>>> f"{-12:.8d}"
'-00000012'
>>> f"{-129:z#.2x}"
'0xf7f'
>>> f"{-129:z#.3x}"
'0xf7f'
>>> f"{-129:z#.4x}"
'0xff7f'
>>> f"{383 :z#.2x}"
'0x17f'
>>> f"{383 :z#.3x}"
'0x17f'
>>> f"{383 :z#.4x}"
'0x017f'
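For illustration, here is a rough pure-Python model of the rules above (a sketch under the semantics described in this description; `zformat` is a hypothetical helper, not the C implementation in this PR):

```python
def zformat(value: int, precision: int, typ: str) -> str:
    """Hypothetical model of the proposed 'z' + precision semantics."""
    k = {'b': 1, 'o': 3, 'x': 4, 'X': 4}[typ]
    precision = max(precision, 1)                 # precision 0 acts as 1
    # minimal two's-complement width in bits, sign bit included
    if value >= 0:
        min_bits = value.bit_length() + 1         # room for a leading 0 bit
    else:
        min_bits = (value + 1).bit_length() + 1   # room for a leading 1 bit
    digits = max(precision, -(-min_bits // k))    # ceil-divide to whole digits
    rep = value % (1 << (digits * k))             # two's complement, read unsigned
    return format(rep, f'0{digits}{typ}')

assert zformat(-12, 8, 'b') == '11110100'   # f"{-12:z.8b}"
assert zformat(200, 8, 'b') == '011001000'  # f"{200:z.8b}"
assert zformat(-129, 4, 'x') == 'ff7f'      # f"{-129:z.4x}"
assert zformat(383, 4, 'x') == '017f'       # f"{383:z.4x}"
```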

📚 Documentation preview 📚: https://cpython-previews--131926.org.readthedocs.build/

```pycon
>>> f"{-12:.8b}"
'11110100'
>>> f"{200:.8b}"
Traceback (most recent call last):
  File "<python-input-5>", line 1, in <module>
    f"{200:.8b}"
      ^^^^^^^^^
OverflowError: Expected integer in range [-2**7, 2**7)
>>> f"{123:.8d}"
'00000123'
>>> f"{-12:.8d}"
'-00000012'
```
rhettinger previously approved these changes Mar 31, 2025

@rhettinger (Contributor) commented Mar 31, 2025

> OverflowError: Expected integer in range [-2**7, 2**7)

Perhaps write: Expected integer in range(-2**7, 2**7). People are already familiar with the range() built-in function and its half-open interval. That would also be consistent with the error message for bytes([300]) which displays ValueError: bytes must be in range(0, 256). Also, I have a vague memory that we're now preferring ValueError instead of OverflowError.

If the format coding is still open for discussion, it would be nice to have a new sign option alongside '+' and '-'. In my mind, a '.8' format code would be strongly associated with digits-after-the-decimal-place in floats.

@ericvsmith (Member):

I agree with @mdickinson on d.p.o. that it would be nice to be able to format 200 as an unsigned 8 bit value.

@skirpichev (Member, Author):

(JFR, discussion thread, starting from Mark's comment: https://discuss.python.org/t/80760/11)

> If the format coding is still open for discussion, it would be nice to have a new sign option alongside '+' and '-'. In my mind, a '.8' format code would be strongly associated with digits-after-the-decimal-place in floats.

It's definitely open. In the discussion thread it was suggested that supporting the precision field for integers is worth a PEP. So, the issue might be solved with a different syntax.

Though, I'm not sure if expanding the format mini-language with new codes is a good idea. The precision option is already used in C for integer types (and in old '%'-style string formatting). And it's supported for string formatting to limit the number of characters:

```pycon
>>> format("length", '.3s')
'len'
```

> I agree with @mdickinson on d.p.o. that it would be nice to be able to format 200 as an unsigned 8 bit value.

We can adopt a different meaning (more C-like): precision being a minimum number of digits. That's easy for non-negative integers. But what to do for negative integers? Let's consider the 'b' format. What if x can't be represented as a two's complement of size precision? Should we choose precision+1? precision+8?

@jb2170 (Contributor) commented Apr 1, 2025

I don't think this is the right approach. I have some comments to make on this; I'll post them in the discussion thread (sorry, I've been really busy!).

tl;dr: I'd recommend that precision work the same way for f-strings as it does for %-strings, for compatibility, for sanity, and for the machine-width-free approach of Python integers; i.e. f"{-5:.4}" should be the same as "%.4i" % -5, which produces '-0005'.

For formatting an integer x to exactly n bits / hex digits etc. (in general n digits in base b), i.e. taking $\text{mod}(x, b^n)$, this should use a new format specifier ! called 'exact precision'.

e.g. f"{-1:!8b}" is the same as f"{255:!8b}", that is, '11111111'

e.g. f"{260:!8b}" is the same as f"{4:!8b}", that is, '00000100'

I chose ! because it's related to ., but . is only a suggested minimum number of digits to which an integer should be formatted, whereas ! is much more imperative.
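A minimal sketch of this proposal in today's Python (`exact_precision` is a hypothetical helper; the `!` specifier itself doesn't exist):

```python
def exact_precision(value: int, n: int, typ: str = 'b') -> str:
    """Hypothetical model of the proposed '!' specifier: exactly n digits."""
    base = {'b': 2, 'o': 8, 'x': 16, 'X': 16}[typ]
    return format(value % base ** n, f'0{n}{typ}')   # value taken mod base**n

assert exact_precision(-1, 8) == exact_precision(255, 8) == '11111111'
assert exact_precision(260, 8) == exact_precision(4, 8) == '00000100'
```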

@skirpichev (Member, Author):

> I'd recommend that precision works the same way for f-strings as it does for %-strings, both for compatibility

Just a quick note: I think the compatibility argument alone is not enough. %-formatting is simply incompatible with format().

@jb2170 (Contributor) commented Apr 2, 2025

| | C 'precision' (`.`) | Existing Python %-strings 'precision' (`.`) | Proposed Python f-string and str.format 'precision' (`.`) | Proposed Python f-string and str.format 'exact precision' (`!`) |
| --- | --- | --- | --- | --- |
| Code | `printf("0x%.2x\n", -19);` | `"%#.2x" % -19` | `f"{-19:#.2x}"` | `f"{-19:#!2x}"` |
| Result | `'0xffffffed'` | `'-0x13'` | `'-0x13'` | `'0xed'` |
| Comments | Would go on infinitely if it could; limited / influenced by machine width | Instead of going infinitely, sanely chooses a negative sign with +19 formatted; Python doesn't use machine width (or if it did, it would be infinite!) | Same as %-strings, for internal Python consistency | Like C's precision, two's-complement-like, but without the machine-width annoyances |
| | Precision is only a minimum | Precision is only a minimum | Precision is only a minimum | Exact precision is exact; this is what the PR wants to achieve, but with `.` instead of `!` |

Okay, once again, laying things out in a table makes it clear that implementing the 'exact precision' behavior above as . for str.format and f-strings (which this PR does) instead of as a new ! specifier would move Python's 'new' formatting further away from 'old' %-formatting. So there are two questions to ask:

  • Is it worth moving further away from %-formatting?

    • This makes precision not a minimum number of digits, but an exact number
    • % is 'old-style' for a reason, and I don't just mean old-bad-new-good; I mean new standards come along for a reason: to improve the language
    • Besides, there are already some existing compatibility differences between %-style and str.format / f-strings, so it's not like this is the first difference
  • Would anyone want the '-0x13' signed-negative behavior that 'exact precision' would prevent?

    • '±0xNN', i.e. two hex digits, can represent -255 to 255, a strange range
    • Re-viewing the discussion, the primary reason I was anxious to implement this behavior as well as the new one is only compatibility

Mulling over this, I think I could be convinced that there isn't too much demand for '-0x13', and since this wouldn't be the first time new-formatting breaks away from old, I'm fine with that too.

So my only final concern if this PR goes ahead would be the following:

The implementation given in this PR seems a bit too harsh in raising a ValueError for an integer outside the signed range(-2 ** (n - 1), 2 ** (n - 1)); shouldn't the formatting just quietly take the value modulo 2 ** n, moving it into range(0, 2 ** n)? (You've already implemented this as a bitshift / bitmask for bases 2, 8, and 16.)

I would expect f"{255:#.2x}" is f"{-1:#.2x}" is '0xff' 🙂

@skirpichev (Member, Author):

Here is an alternative implementation (no exceptions), on top of this PR:

```diff
diff --git a/Lib/test/test_long.py b/Lib/test/test_long.py
index b8155319b6..f15b780dd8 100644
--- a/Lib/test/test_long.py
+++ b/Lib/test/test_long.py
@@ -708,8 +708,8 @@ def test__format__(self):
         self.assertEqual(format(1234567890, '_x'), '4996_02d2')
         self.assertEqual(format(1234567890, '_X'), '4996_02D2')
         self.assertEqual(format(8086, '#.8x'), '0x00001f96')
-        self.assertRaises(ValueError, format, 2048, '.3x')
-        self.assertRaises(ValueError, format, -2049, '.3x')
+        self.assertEqual(format(2048, '.3x'), '0800')
+        self.assertEqual(format(-2049, '.3x'), '17ff')
 
         # octal
         self.assertEqual(format(3, "o"), "3")
@@ -725,8 +725,8 @@ def test__format__(self):
         self.assertRaises(ValueError, format, 1234567890, ',o')
         self.assertEqual(format(1234567890, '_o'), '111_4540_1322')
         self.assertEqual(format(18, '#.3o'), '0o022')
-        self.assertRaises(ValueError, format, 256, '.3o')
-        self.assertRaises(ValueError, format, -257, '.3o')
+        self.assertEqual(format(256, '.3o'), '0400')
+        self.assertEqual(format(-257, '.3o'), '1377')
 
         # binary
         self.assertEqual(format(3, "b"), "11")
@@ -744,10 +744,10 @@ def test__format__(self):
         self.assertEqual(format(-12, '.8b'), '11110100')
         self.assertEqual(format(73, '.8b'), '01001001')
         self.assertEqual(format(73, '#.8b'), '0b01001001')
-        self.assertRaises(ValueError, format, 300, '.8b')
-        self.assertRaises(ValueError, format, -200, '.8b')
-        self.assertRaises(ValueError, format, 128, '.8b')
-        self.assertRaises(ValueError, format, -129, '.8b')
+        self.assertEqual(format(300, '.8b'), '100101100')
+        self.assertEqual(format(-200, '.8b'), '100111000')
+        self.assertEqual(format(128, '.8b'), '010000000')
+        self.assertEqual(format(-129, '.8b'), '101111111')
 
         # make sure these are errors
         self.assertRaises(ValueError, format, 3, "1.3c")  # precision disallowed with 'c',
diff --git a/Python/formatter_unicode.c b/Python/formatter_unicode.c
index da605415ad..e663a0c33d 100644
--- a/Python/formatter_unicode.c
+++ b/Python/formatter_unicode.c
@@ -1081,11 +1081,14 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
 
         /* Do the hard part, converting to a string in a given base */
         if (format->precision != -1) {
+            int64_t precision = Py_MAX(1, format->precision);
+
             /* Use two's complement for 'b', 'o' and 'x' formatting types */
             if (format->type == 'b' || format->type == 'x'
                 || format->type == 'o' || format->type == 'X')
             {
-                int64_t shift = Py_MAX(1, format->precision);
+                int64_t shift = precision;
+                int incr = 1;
 
                 if (format->type == 'x' || format->type == 'X') {
                     shift *= 4;
@@ -1093,8 +1096,11 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
                 else if (format->type == 'o') {
                     shift *= 3;
                 }
-                shift--;  /* expected value in range(-2**shift, 2**shift) */
+                shift = Py_MAX(shift, _PyLong_NumBits(value));
+                shift--;
 
+                /* expected value in range(-2**n, 2**n), where n=shift
+                   or n=shift+1 */
                 PyObject *mod = _PyLong_Lshift(PyLong_FromLong(1), shift);
 
                 if (mod == NULL) {
@@ -1106,9 +1112,9 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
                         goto done;
                     }
                     if (PyObject_RichCompareBool(value, mod, Py_LT)) {
-                        goto range;
+                        incr++;
                     }
-                    Py_SETREF(mod, _PyLong_Lshift(mod, 1));
+                    Py_SETREF(mod, _PyLong_Lshift(mod, incr));
                     tmp = PyNumber_Subtract(value, mod);
                     Py_DECREF(mod);
                     if (tmp == NULL) {
@@ -1118,16 +1124,12 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
                 }
                 else {
                     if (PyObject_RichCompareBool(value, mod, Py_GE)) {
-range:
-                        Py_DECREF(mod);
-                        PyErr_Format(PyExc_ValueError,
-                                     "Expected integer in range(-2**%ld, 2**%ld)",
-                                     shift, shift);
-                        goto done;
+                        incr++;
                     }
                     Py_DECREF(mod);
                     tmp = _PyLong_Format(value, base);
                 }
+                precision += (incr - 1);
             }
             else {
                 tmp = _PyLong_Format(value, base);
@@ -1139,7 +1141,7 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
             /* Prepend enough leading zeros (after the sign) */
 
             int sign = PyUnicode_READ_CHAR(tmp, leading_chars_to_skip) == '-';
-            Py_ssize_t tmp2_len = format->precision + leading_chars_to_skip + sign;
+            Py_ssize_t tmp2_len = precision + leading_chars_to_skip + sign;
             Py_ssize_t tmp_len = PyUnicode_GET_LENGTH(tmp);
             Py_ssize_t gap = tmp2_len - tmp_len;
```
I.e. if the value is outside of the specified range, we just enlarge the range.

```pycon
>>> format(-128, '.8b')
'10000000'
>>> format(-129, '.8b')
'101111111'
>>> format(127, '.8b')
'01111111'
>>> format(128, '.8b')
'010000000'
>>> format(-129, '.2o')  # maybe better 0o1577
'577'
```

@jb2170 (Contributor) commented Apr 3, 2025

Discussion updated. I've rightly flip-flopped back to my original proposal, but using z. instead of !. I only conceded to the breaking/differing change of str.format and f-strings vs %-formatting if we had to choose only one behaviour. Being able to implement both 'precision' as . and 'exact precision' as z. is most desirable, and I was correct in my conclusions were we to implement both.

When the syntax + behavior is settled I’ll re-write up the discussion OP + resolution for negative numbers (this message but with z. instead of !) as one PEP more cleanly: to document the behaviour, rejected alternatives, and something more than a mere what’s-new section. The z. for ints genuinely is new. I've worked hard on this!

I'll open a PR myself. There's no way this one can go ahead with the broken behaviour around f"{255:#.2x}" raising a ValueError:

```python
print([f"{c:#.2x}" for c in "Hello".encode()])  # ['0x48', '0x65', '0x6c', '0x6c', '0x6f']
print([f"{c:#.2x}" for c in "привет".encode()]) # ValueError: Expected integer in range(-2**7, 2**7)
```

@skirpichev (Member, Author):

> There's no way this one can go ahead with the broken behaviour around f"{255:#.2x}" raising a ValueError

I'm not sure it's broken. Though, an alternative was presented above. I'll commit it.

With this, we have:

>>> f"{200:.8b}"
'011001000'
>>> [f"{c:#.2x}" for c in "привет".encode()]
['0x0d0', '0x0bf', '0x0d1', '0x080', '0x0d0', '0x0b8', '0x0d0', '0x0b2', '0x0d0', '0x0b5', '0x0d1', '0x082']

Maybe there should be an option to interpret the integer value as unsigned. Then the leading 0's would not appear in the above examples. The 'z' option could be used for this, or we could add an optional prefix (e.g. 'u') before the type specifier.

@skirpichev (Member, Author):

Now the implementation is aligned with my proposal in the d.p.o. thread. Sorry for the mess :-(

I think Mark's concern has been addressed. @ericvsmith, please review.

Some remarks on the decisions I made:

  1. By default, the precision field specifies the minimal number of digits in the printed magnitude of the integer. This is consistent with the current no-precision behavior and allows printing positive integers (e.g. 200 in 8 bits) in a wider range, without leading zeros.
  2. The existing 'z' flag is used as a switch to a two's complement interpretation of the integer value. It's supported only for base-2 integer presentation types. I'm not sure it's worth it for 'd'.
  3. The precision field specified together with the 'z' flag means that the integer is interpreted as a max(k*precision, number.bit_length())-bit two's complement, where k=1, 3, 4 for the 'b', 'o' and 'x'/'X' types, respectively. I.e. no exception is raised if the value doesn't fit into k*precision bits.
  4. A simpler alternative to 3) could be just taking the remainder, i.e. number % 2**(k*precision). But that's not a bijection (see the check below).
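For instance, the plain-modulo alternative in 4) maps distinct integers to the same string, as this quick check in today's Python shows:

```pycon
>>> k, precision = 1, 8
>>> -1 % 2**(k*precision) == 255 % 2**(k*precision)
True
>>> format(-1 % 256, '08b') == format(255 % 256, '08b') == '11111111'
True
```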

@jb2170 (Contributor) commented Apr 4, 2025

Discussion updated. Same as before.


> . shall format the same as %-formatting: the number of digits, and using a negative sign for negative numbers
> z. shall pass divmod(x, base ** n)[1] to .; it shall be allowed only for binary, octal, and hex. This is what two's complement is.


As for

> ['0x0d0', '0x0bf', ...

Nooo that's the variable-width implementation which we're not using 😭

Even this implementation (of variable width, which we don't want) is broken?

```python
f"{127:z#.8b}" # '0b01111111'
f"{128:z#.8b}" # '0b010000000'
f"{255:z#.8b}" # '0b011111111'
f"{256:z#.8b}" # '0b100000000'
```

That's what's making me nervous about this PR: only one review is left pending, for a broken and incorrect implementation.

@ericvsmith please don't merge / review this. I'm writing a PEP and this PR doesn't contain the intended implementation.

@skirpichev (Member, Author):

> shall pass divmod(x, base ** n)[1] to .

That's a possible option, yes.

But what if your input actually doesn't fit into the specified range (say, n bits for the 'b' formatting type)? E.g. you made a typo. The output will be silently truncated.

> Even this implementation (of variable width which we don't want) is broken?
>
> ```python
> f"{127:z#.8b}" # '0b01111111'
> f"{128:z#.8b}" # '0b010000000'
> f"{255:z#.8b}" # '0b011111111'
> ```

Why do you think it's "broken"? The first integer can be interpreted as an 8-bit value in two's complement. The others can't: the minimal precision value to do this is 9. That's why we get an extra digit.

f"{256:z#.8b}" # '0b100000000'

Thanks, this should be '0b0100000000', of course; that's a real issue. Fixed.

> please don't merge / review this

I think the implementation is complete and now works as specified; remnants of the original version (with exceptions) are fixed. Review does make sense to me.

> I'm writing a PEP and this PR doesn't contain the intended implementation.

I'm not sure a PEP is required. But remember that a PEP needs a sponsor in your case.

I think your proposal is clear enough. So, I suggest you first ask whether someone among the core developers in the d.p.o. thread is interested in sponsoring it.

@skirpichev dismissed rhettinger’s stale review April 4, 2025 13:08

The implementation was changed (ValueError is no longer raised if the precision is too small) to address Mark's feedback.

@skirpichev requested a review from rhettinger April 4, 2025 13:08
@jb2170 (Contributor) commented Apr 5, 2025

> > ...
> > f"{255:z#.8b}" # '0b011111111'
>
> Why do you think it's "broken"?

Sorry, I should've been clearer: it was the f"{256:z#.8b}" # '0b100000000', which you've now fixed. But even so, two issues:

I've tried the latest commits but I'm still not happy with how the unsigned range is treated by z.

f"{126:z#.2x}" # '0x7e'
f"{127:z#.2x}" # '0x7f'
f"{128:z#.2x}" # '0x080'
f"{129:z#.2x}" # '0x081'

It's super ugly 😅

There is also an ambiguity / canonical-representation problem that I thought of today with how hex digits represent negative numbers. The current implementation of z. yields the following:

```python
f"{-129:z#.2x}" # '0x17f'
```

If we inspect the binary representation of 0x17f via f"{0x17f:#.12b}" we see it as '0b000101111111'. If we ignore the upper three zeros, and treat the fourth column as a -256 column, this is -256 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = -129. We could however have equivalently written -129 as '0b111101111111', and so 0xf7f is another representation of -129. This isn't surprising; we know there are infinitely many overlong encodings of positive and negative numbers using variable-width two's complement, as per one of my discussion posts, but my point is the following:

For binary two's complement the leading digit is always a 0 for positive numbers and always a 1 for negative, e.g. f"{127:z#.8b}" # '0b01111111', f"{128:z#.8b}" # '0b010000000', so there is no ambiguity. But for hex it is not clear whether '0x17f' represents 383 (because the highest bit within the first digit is zero: the 1 within 17f as a nibble is 0001, with its highest bit a 0) or whether '0x17f' represents -129, the minimal-width representation of -129:

f"{-129:z#.2x}"  # '0x17f'
f"{383 :z#.2x}"  # '0x17f'
f"{-129:z#.12b}" # '0b111101111111'
f"{383 :z#.12b}" # '0b000101111111'

Mutatis mutandis, the same problem applies to octal.
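To make the ambiguity concrete, a small decoder (an illustrative helper, not part of this PR) reads a bit string as a two's complement number of its own width:

```python
def from_twos_complement(bits: str) -> int:
    """Read a bit string as a two's complement number of its own width."""
    u = int(bits, 2)
    return u - (1 << len(bits)) if bits[0] == '1' else u

assert from_twos_complement('111101111111') == -129  # 0xf7f, 12 bits
assert from_twos_complement('000101111111') == 383   # 0x17f, 12 bits
assert from_twos_complement('101111111') == -129     # 0x17f again, read as 9 bits
```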

One way to solve this would be to establish a convention that, for representing a negative number, the leading digit must always be base - 1, and for non-negatives the leading digit must be 0; that is, the canonical representation of -129 is 0xf7f, which has its highest bit set. (I'm literally getting confused typing this / defining the standard; this is so not going to be useful to the end user in any capacity.) This is a very synthetic way of doing things that seems useful to no one unless they were willing to learn this ad-hoc convention.

Thus alternatively:

> > shall pass divmod(x, base ** n)[1] to .
>
> That's a possible option, yes.

I'm still super in favour of this behaviour: precision truncating the value, like a CPU register / C fixed-width variable, whereby integer overflow is more of a feature (modular arithmetic) than a bug. Ultimately signed and unsigned fixed-width variables are the same modulo 2 ** width in two's complement hardware (https://godbolt.org/z/YM84x1Mja), and so in my eyes f"{127:z#.2x}" and f"{-129:z#.2x}" should both be '0x7f'.
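Under that modulo-precision reading, both values collapse to the same byte; this is checkable today with an explicit modulo (the z. spelling itself is still only proposed):

```pycon
>>> format(127 % 2**8, '#04x')
'0x7f'
>>> format(-129 % 2**8, '#04x')
'0x7f'
```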

I'm ~2/3 of the way through the PEP draft, and the fact we're still debating the behaviour (because we care 😅) tells me it's a good idea, as a compendium of considered ideas, rejected alternatives, syntax, etc.

@skirpichev (Member, Author):

>>> [f"{_:z#.2x}" for _ in range(126, 130)]
['0x7e', '0x7f', '0x080', '0x081']

It's super ugly

Sorry, this is not a rational argument for me.

f"{-129:z#.2x}" # '0x17f'
If we inspect the binary representation of 0x17f via f"{0x17f:#.12b}" we see it as '0b000101111111'.

Sorry, but why 12? (0x17f).bit_length() is 9.

> We could however have equivalently written -129 as '0b111101111111', and so 0xf7f is another representation of -129.

It's true. We can also choose the bit size to be a multiple of 4 (or 3 for the octal formatting type). That will choose 0xf7f instead. Also, negative values will start with the highest bit set:

>>> f"{-129:z#.3x}"
'0xf7f'
>>> f"{-229:z#.3x}"
'0xf1b'
>>> f"{383:z#.3x}"
'0x17f'

I was thinking about that. Using the same minimal size for all base-2 formatting types has some pros: it's simpler to explain, and it's a property of the value (not affected by the formatting type). Cons: not invariant w.r.t. increasing the precision (f"{-129:z#.2x}" == '0x17f' vs f"{-129:z#.3x}" == '0xf7f').

f"{-129:z#.2x}" # '0x17f'
f"{383 :z#.2x}" # '0x17f'

Note that this problem (which might be solved as described above) is only valid iff specified precision is not enough. So, "more digits got than specified" = "insufficient precision". Alternatively, we could raise an exception in such case.
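For example, overflow detection could then look like this (hypothetical: assumes the 'z' + precision spec from this PR is available in format()):

```python
# hypothetical: 'z.8b' is the spec proposed in this PR, not released Python
value = 300
s = format(value, 'z.8b')
if len(s) > 8:  # extra digits signal that the value didn't fit into 8 bits
    print(f"{value} does not fit into 8 bits as two's complement: {s}")
```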

> integer overflow is more of a feature (modular arithmetic) than a bug.

The problem is that you have no chance to communicate this overflow to the user.

Now we choose the minimal two's complement size as being >= k*precision AND with the leading bit set to 1 for negatives.

Increasing the precision will add 0's or (base-1)'s as needed:

>>> f"{-129:z#.2x}"
'0xf7f'
>>> f"{-129:z#.3x}"
'0xf7f'
>>> f"{-129:z#.4x}"
'0xff7f'
>>> f"{383 :z#.2x}"
'0x17f'
>>> f"{383 :z#.3x}"
'0x17f'
>>> f"{383 :z#.4x}"
'0x017f'
@python-cla-bot (bot) commented Apr 6, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@jb2170 (Contributor) commented Apr 7, 2025

> > It's super ugly
>
> Sorry, this is not a rational argument for me.

Completely the opposite, it is the most rational argument. This is string formatting, which is printed to the user, and thus has to convey useful information.

f"{128:z#.2x}"  # '0x080'
f"{-128:z#.2x}" # '0x80'

Who is this useful for?? In the useful-examples section of the PEP I'm documenting useful examples for precision (.), like a hexdump or a Unicode dump:

s = b"GET /\r\n\r\n"
print(" ".join(f"{c:#.2x}" for c in s))
'0x47 0x45 0x54 0x20 0x2f 0x0d 0x0a 0x0d 0x0a'
# observe the CR and LF bytes padded to precision 2

s = "USA 🦅"
print(" ".join(f"U+{ord(c):.4X}" for c in s))
'U+0055 U+0053 U+0041 U+0020 U+1F985'
# observe the last character's Unicode representation has 5 digits;
# precision is only the minimum number of digits

And for precision-modulo (z.), consistent, predictable two's complements of signed vs unsigned ints:

```python
import struct

my_struct = b"\xff"
(t,) = struct.unpack('b', my_struct) # signed
print(t, f"{t:#.2x}", f"{t:z#.2x}")  # -1 -0x01 0xff
(t,) = struct.unpack('B', my_struct) # unsigned
print(t, f"{t:#.2x}", f"{t:z#.2x}")  # 255 0xff 0xff

# observe that in both the signed and unsigned unpacking the modulo-precision
# flag 'z' produces a predictable two's-complement formatting
```

Give me a single example of the alternate implementation's formatting of -128 as '0x80' and +128 as '0x080' being useful and ergonomic. One has to juggle in their head the leading hex digit being in 0-7 meaning positive and 8-f meaning negative (and for octal, 0-3 and 4-7). If ever one wanted to distinguish between positive and negative, one would use precision without z, rendering '0x80' for +128 and '-0x80' for -128, not the perverted '0x80' for -128 that you're proposing. If a user, unfamiliar with how a program is written, reads 0x80 where sign is critical, there is a 0% chance they're reading that as -128 without knowing the convention established here. Get real. Is this not clearly a really, really bad idea?

> The problem that you had no chance to communicate this overflow to user.

In reasonable contexts of using z, both as the programmer and the end-user, -128 and 128 mean the same thing, as do -1 and 255, as do 181 and -75, etc.; the user wants to see them as two equivalent representations of the same stored byte in two's complement hardware, modulo being irrelevant. Formally, the formatting should be a well-defined mapping $\mathbb{Z}/\text{base}^{\text{precision}}\mathbb{Z} \to \text{strings}, \quad x \mapsto$ f"{x:z.{precision}{base}}", with any two representatives of an equivalence class mapped to the same string of length exactly precision. This is what z for modulo-precision achieves.
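A quick well-definedness check of that mapping, using an explicit modulo in today's Python (z. itself is still only proposed):

```pycon
>>> base, precision = 16, 2
>>> {format(x % base**precision, f'0{precision}x') for x in (-1, 255, 511, -257)}
{'ff'}
```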

As for the worry over truncation outside of range(-base**(precision-1), base**(precision)), e.g. range(-128, 256): either 1) the programmer wouldn't use z because they don't want the truncation, conveying the full untruncated string to the end user by just using . and not z., or 2) if there is a truncation of a larger int, then that int shouldn't have been there in the first place: z is pretty much designed for the purpose of working with bytes, modular arithmetic, and two's complement, base 2 or hex, sometimes oct. As I've hypothesized before, if a library sets an int to 257 and prints out '0x01' instead of '0x101', that's not our problem or doing, and an exception shouldn't be raised by us, because ValueError: bytes must be in range(0, 256) will be raised by bytes when trying to serialize that int via bytes([257]). That is more indicative of a defect in their library, not of our formatting.

I've tried to tread lightly in these conversations lest this look like a 'student arguing with the professor who has decades more experience' situation, but I genuinely think variable-width two's complement is the wrong implementation, and modulo-precision is the obvious right one. You've been eager to merge / review this when there's still no consensus, with plenty of defects in the pushed commits that have only been spotted by my ad-hoc diligence, e.g. minimal vs 'canonical' hex/octal two's complement needing 4/3 extra leading bits, not just 1 as binary does, which you only hesitantly mention as "We can also choose bitsize being proportional to 4...". (I thought your problem with the truncation of digits was the loss of information/uniqueness, but you're fine with both -129 and 383 being mapped to '0x17f'? You don't seem sure.)

At this point it seems like you're trying to rush through a broken, wrong, alternate implementation, and fix it later if necessary, prematurely quashing a PEP every time I bring it up, one which would properly address all concerns and could be debated by others, who I'm 99% sure will side with the modulo-precision implementation. Just to clear the air: for me this is not about trying to win an argument or egos; it's about what is useful to the end user. It'd be a real pain if the wrong behavior gets merged now and users start relying on a defective de facto standard, and we can't fix it later because a user relies upon f"{-129:z#.2x}" == '0x17f', which then gets changed to f"{-129:z#.2x}" == '0xf7f' (neither of which is useful in context compared to '0x7f', in my opinion!). Please wait until I finish the draft PEP; it'll only be a couple of days. I want second opinions from other devs on the formalized alternative implementations and their verdicts, lists of examples showcasing explicit behavior, refined relevant parts of the discussion thread, etc., not just a review of the PR in its current state. It's making me anxious.

@skirpichev (Member, Author) commented Apr 7, 2025

> Who is this useful for?

It's useful for detecting errors. If you see that the printed value has more digits than requested, the given integer doesn't fit into the specified range. This is an alternative to an exception.

print(" ".join(f"U+{ord(c):.4X}" for c in "USA 🦅"))

And why you not cut off all entries to 4 digits?

> And for precision-modulo (z.) consistent predictable two's complements of signed vs unsigned ints

I'm not sure you realize what two's-complement representation means. Here is a corrected version of your example:

```pycon
>>> import struct
>>> my_struct = b"\xff"
>>> (t,) = struct.unpack('b', my_struct) # signed
>>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
-1 -0x01 0xff
>>> (t,) = struct.unpack('B', my_struct) # unsigned
>>> print(t, f"{t:#.2x}", f"{t:z#.2x}")  # note the last output
255 0xff 0x0ff
```

You asked to interpret the Python integer +255 (+ sign and magnitude 255) as two's complement with just 2 hexadecimal digits. It's impossible. No such representation exists. We have the option either to raise a ValueError in the last example, or to print the two's complement value in some bigger range.
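A quick arithmetic check of why no 2-hex-digit (8-bit) two's complement of +255 exists:

```pycon
>>> (255).bit_length() + 1   # sign bit included: 9 bits minimum
9
>>> 0xff - (1 << 8)          # the 8-bit pattern 0xff denotes -1, not 255
-1
```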

In Raymond's original proposal the first option was chosen. But if we allow the precision option in a no-'z' context (your original proposal), I think it's better not to raise an exception, interpreting the precision value as the minimal range for the two's complement representation.

(BTW, to print an unsigned value in 2 digits, you should use f"{t:#.2x}" formatting in the above example.)

> In reasonable contexts of using z, both as the programmer and the end-user, -128 and 128 mean the same thing

This PR solves the issue for the cases where they don't.

> I've tried to tread lightly in these conversations lest this look like a 'student arguing with the professor who has decades more experience' situation

Then, probably, it's not worth your time, is it?
