gh-74756: support precision field for integer formatting types #131926
Conversation
```pycon
>>> f"{-12:.8b}"
'11110100'
>>> f"{200:.8b}"
Traceback (most recent call last):
  File "<python-input-5>", line 1, in <module>
    f"{200:.8b}"
    ^^^^^^^^^
OverflowError: Expected integer in range [-2**7, 2**7)
>>> f"{123:.8d}"
'00000123'
>>> f"{-12:.8d}"
'-00000012'
```
Perhaps write: If the format coding is still open for discussion, it would be nice to have a new sign option alongside
I agree with @mdickinson on d.p.o. that it would be nice to be able to format 200 as an unsigned 8-bit value.
(Just for reference, the discussion thread, starting from Mark's comment: https://discuss.python.org/t/80760/11)
It's definitely open. In the discussion thread it was suggested that supporting a precision field for integers is worth a PEP, so the issue might be solved with a different syntax. Though I'm not sure if expanding the format mini-language with new codes is a good idea. The precision option is already used in C for integer types (and in old '%'-style string formatting). And it's supported for string formatting to limit the number of characters:

```pycon
>>> format("length", '.3s')
'len'
```
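For context, here is a short runnable check of today's behavior (any maintained CPython release): %-style formatting already accepts a precision for integers as a minimum digit count, while `format()` has historically rejected it, which is what this PR proposes to change.

```python
# '%'-style formatting treats precision as a minimum digit count for
# integers; the sign is kept outside the zero padding.
print('%.8d' % 123)       # '00000123'
print('%.8d' % -12)       # '-00000012'
print('%.3s' % 'length')  # 'len' -- for strings, precision truncates instead

# format() with a precision on 'd' raises ValueError on current releases;
# this PR proposes to accept it.
try:
    print(format(123, '.8d'))
except ValueError as exc:
    print('format() raised:', exc)
```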
We can adopt a different meaning (more C-like):
I don't think this is the right approach; I have some comments to make on this, and I'll post them in the discussion thread (sorry, I've been really busy!). tl;dr: I'd recommend that precision work the same way for f-strings as it does for %-strings, for compatibility, sanity, and the machine-width-free approach of Python integers, ie for formatting an integer eg I chose
Just a quick note: I think the compatibility argument alone is not enough. %-formatting just is incompatible with format().
Okay, once again laying things out in a table, it is clear that implementing
Mulling this over, I think I could be convinced that there isn't too much demand for
So my only final concern, if this PR goes ahead, would be the following: the implementation given in this PR seems a bit too harsh in raising a
I would expect
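The expected "precision = minimum number of digits, never an error" semantics can be sketched in pure Python today (`fmt_min_digits` is a hypothetical helper name used only for illustration, not part of this PR):

```python
def fmt_min_digits(value, type_char, precision):
    # Hypothetical helper mimicking %-style precision semantics:
    # at least `precision` digits, sign kept outside the zero padding,
    # and values needing more digits simply use more digits.
    digits = format(abs(value), type_char)
    sign = '-' if value < 0 else ''
    return sign + digits.rjust(precision, '0')

print(fmt_min_digits(123, 'd', 8))   # '00000123'
print(fmt_min_digits(-12, 'd', 8))   # '-00000012'
print(fmt_min_digits(2048, 'x', 3))  # '800' -- no error despite 2048 > 0xfff/2
```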
Here is an alternative implementation (no exceptions), on top of this PR.

diff --git a/Lib/test/test_long.py b/Lib/test/test_long.py
index b8155319b6..f15b780dd8 100644
--- a/Lib/test/test_long.py
+++ b/Lib/test/test_long.py
@@ -708,8 +708,8 @@ def test__format__(self):
self.assertEqual(format(1234567890, '_x'), '4996_02d2')
self.assertEqual(format(1234567890, '_X'), '4996_02D2')
self.assertEqual(format(8086, '#.8x'), '0x00001f96')
- self.assertRaises(ValueError, format, 2048, '.3x')
- self.assertRaises(ValueError, format, -2049, '.3x')
+ self.assertEqual(format(2048, '.3x'), '0800')
+ self.assertEqual(format(-2049, '.3x'), '17ff')
# octal
self.assertEqual(format(3, "o"), "3")
@@ -725,8 +725,8 @@ def test__format__(self):
self.assertRaises(ValueError, format, 1234567890, ',o')
self.assertEqual(format(1234567890, '_o'), '111_4540_1322')
self.assertEqual(format(18, '#.3o'), '0o022')
- self.assertRaises(ValueError, format, 256, '.3o')
- self.assertRaises(ValueError, format, -257, '.3o')
+ self.assertEqual(format(256, '.3o'), '0400')
+ self.assertEqual(format(-257, '.3o'), '1377')
# binary
self.assertEqual(format(3, "b"), "11")
@@ -744,10 +744,10 @@ def test__format__(self):
self.assertEqual(format(-12, '.8b'), '11110100')
self.assertEqual(format(73, '.8b'), '01001001')
self.assertEqual(format(73, '#.8b'), '0b01001001')
- self.assertRaises(ValueError, format, 300, '.8b')
- self.assertRaises(ValueError, format, -200, '.8b')
- self.assertRaises(ValueError, format, 128, '.8b')
- self.assertRaises(ValueError, format, -129, '.8b')
+ self.assertEqual(format(300, '.8b'), '100101100')
+ self.assertEqual(format(-200, '.8b'), '100111000')
+ self.assertEqual(format(128, '.8b'), '010000000')
+ self.assertEqual(format(-129, '.8b'), '101111111')
# make sure these are errors
self.assertRaises(ValueError, format, 3, "1.3c") # precision disallowed with 'c',
diff --git a/Python/formatter_unicode.c b/Python/formatter_unicode.c
index da605415ad..e663a0c33d 100644
--- a/Python/formatter_unicode.c
+++ b/Python/formatter_unicode.c
@@ -1081,11 +1081,14 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
/* Do the hard part, converting to a string in a given base */
if (format->precision != -1) {
+ int64_t precision = Py_MAX(1, format->precision);
+
/* Use two's complement for 'b', 'o' and 'x' formatting types */
if (format->type == 'b' || format->type == 'x'
|| format->type == 'o' || format->type == 'X')
{
- int64_t shift = Py_MAX(1, format->precision);
+ int64_t shift = precision;
+ int incr = 1;
if (format->type == 'x' || format->type == 'X') {
shift *= 4;
@@ -1093,8 +1096,11 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
else if (format->type == 'o') {
shift *= 3;
}
- shift--; /* expected value in range(-2**shift, 2**shift) */
+ shift = Py_MAX(shift, _PyLong_NumBits(value));
+ shift--;
+ /* expected value in range(-2**n, 2**n), where n=shift
+ or n=shift+1 */
PyObject *mod = _PyLong_Lshift(PyLong_FromLong(1), shift);
if (mod == NULL) {
@@ -1106,9 +1112,9 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
goto done;
}
if (PyObject_RichCompareBool(value, mod, Py_LT)) {
- goto range;
+ incr++;
}
- Py_SETREF(mod, _PyLong_Lshift(mod, 1));
+ Py_SETREF(mod, _PyLong_Lshift(mod, incr));
tmp = PyNumber_Subtract(value, mod);
Py_DECREF(mod);
if (tmp == NULL) {
@@ -1118,16 +1124,12 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
}
else {
if (PyObject_RichCompareBool(value, mod, Py_GE)) {
-range:
- Py_DECREF(mod);
- PyErr_Format(PyExc_ValueError,
- "Expected integer in range(-2**%ld, 2**%ld)",
- shift, shift);
- goto done;
+ incr++;
}
Py_DECREF(mod);
tmp = _PyLong_Format(value, base);
}
+ precision += (incr - 1);
}
else {
tmp = _PyLong_Format(value, base);
@@ -1139,7 +1141,7 @@ format_long_internal(PyObject *value, const InternalFormatSpec *format,
/* Prepend enough leading zeros (after the sign) */
int sign = PyUnicode_READ_CHAR(tmp, leading_chars_to_skip) == '-';
- Py_ssize_t tmp2_len = format->precision + leading_chars_to_skip + sign;
+ Py_ssize_t tmp2_len = precision + leading_chars_to_skip + sign;
Py_ssize_t tmp_len = PyUnicode_GET_LENGTH(tmp);
Py_ssize_t gap = tmp2_len - tmp_len;
I.e. if the value is outside of the specified range, we just enlarge it.

```pycon
>>> format(-128, '.8b')
'10000000'
>>> format(-129, '.8b')
'101111111'
>>> format(127, '.8b')
'01111111'
>>> format(128, '.8b')
'010000000'
>>> format(-129, '.2o')  # maybe better 0o1577
'577'
```
Discussion updated. I've rightly flip-flopped back to my original proposal, but using
When the syntax + behavior is settled I'll re-write up the discussion OP + resolution for negative numbers (this message but with
I'll open a PR myself. There's no way this one can go ahead with the broken behaviour around:

```python
print([f"{c:#.2x}" for c in "Hello".encode()])   # ['0x48', '0x65', '0x6c', '0x6c', '0x6f']
print([f"{c:#.2x}" for c in "привет".encode()])  # ValueError: Expected integer in range(-2**7, 2**7)
```
I'm not sure if it's broken. Though, an alternative was presented above. I'll commit it. With this, we have:

```pycon
>>> f"{200:.8b}"
'011001000'
>>> [f"{c:#.2x}" for c in "привет".encode()]
['0x0d0', '0x0bf', '0x0d1', '0x080', '0x0d0', '0x0b8', '0x0d0', '0x0b2', '0x0d0', '0x0b5', '0x0d1', '0x082']
```

Maybe there should be an option to interpret the integer value as unsigned. Then, the leading
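For comparison, the byte-dump use case is achievable in released Python with zero-filled *width* (a different mechanism: it pads the whole field rather than the digits, so `#04x` must account for the `0x` prefix):

```python
# Width 4 with '0' fill covers the '#' prefix plus two hex digits per byte.
print([f"{c:#04x}" for c in "Hello".encode()])
print([f"{c:#04x}" for c in "привет".encode()])
```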
Now the implementation is aligned with my proposal in the d.p.o thread. Sorry for the mess :-( I think that Mark's concern was addressed. @ericvsmith, please review. Some remarks on decisions I made:
Discussion updated. Same as before.
As for
Nooo, that's the variable-width implementation which we're not using 😭 Even this implementation (of variable width, which we don't want) is broken?
That's what's making me nervous about this PR: only one of the reviews is left pending for a broken + incorrect implementation. @ericvsmith please don't merge / review this. I'm writing a PEP and this PR doesn't contain the intended implementation.
That's a possible option, yes. But what if your input actually doesn't fit into the specified range (say,
Why do you think it's "broken"? The first integer can be interpreted as an 8-bit value in two's complement. The others can't: the minimal precision value to do this is
Thanks, this should be
I think that the implementation is complete and now this should work as specified; remnants from the original version (with exceptions) are fixed. Review does make sense to me.
I'm not sure if a PEP is required. But remember that a PEP needs a sponsor in your case. I think your proposal is clear enough. So, I suggest you first ask in the d.p.o thread whether someone among the core developers is interested in sponsoring it.
The implementation was changed (ValueError is no longer raised if the precision is too small) to address Mark's feedback.
Sorry, I should've been more clear, it was the
I've tried the latest commits but I'm still not happy with how the unsigned range is treated by:

```python
f"{126:z#.2x}"  # '0x7e'
f"{127:z#.2x}"  # '0x7f'
f"{128:z#.2x}"  # '0x080'
f"{129:z#.2x}"  # '0x081'
```

It's super ugly 😅

There is also an ambiguity / canonical-representation problem that I thought of today with how hex digits represent negative numbers. The current implementation of

```python
f"{-129:z#.2x}"  # '0x17f'
```

If we inspect the binary representation of
For binary two's complement the leading digit is always a 0 for positive numbers, and always a 1 for negative, eg:

```python
f"{-129:z#.2x}"   # '0x17f'
f"{383 :z#.2x}"   # '0x17f'
f"{-129:z#.12b}"  # '0b111101111111'
f"{383 :z#.12b}"  # '0b000101111111'
```

Mutatis mutandis, the same problem applies for octal. One way to solve this would be to establish a convention that for representing a negative number the leading digit must always
Thus alternatively:
I'm still super in favour of this behaviour: precision truncating the value, like a CPU register / C fixed-width variable, whereby integer overflow is more of a feature (modular arithmetic) than a bug. Ultimately signed and unsigned fixed-width variables are the same modulo

I'm ~2/3 of the way through the PEP draft, and the fact we're still debating the behaviour (because we care 😅) tells me it's a good idea, as a compendium of considered ideas, rejected alternatives, syntax, etc.
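The modulo-precision behaviour being argued for here can be sketched in pure Python (`mod_precision_hex` is a hypothetical helper name, not an agreed API):

```python
def mod_precision_hex(value, precision):
    # Truncate to exactly `precision` hex digits, like a fixed-width
    # register: out-of-range values wrap via modular arithmetic.
    return format(value % (1 << (4 * precision)), 'x').zfill(precision)

print(mod_precision_hex(-1, 2))    # 'ff'
print(mod_precision_hex(128, 2))   # '80'
print(mod_precision_hex(-128, 2))  # '80' -- same register bits as 128
print(mod_precision_hex(-129, 2))  # '7f' -- wraps modulo 256
```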
Sorry, this is not a rational argument for me.
Sorry, but why 12?
It's true. We can also choose the bitsize to be proportional to 4 (or 3 for the octal formatting type). That will choose:

```pycon
>>> f"{-129:z#.3x}"
'0xf7f'
>>> f"{-229:z#.3x}"
'0xf1b'
>>> f"{383:z#.3x}"
'0x17f'
```

I was thinking about that. Using the same minimal size for all base-2 formatting types has some pros: it's simpler to explain, and it's a property of the value (not affected by the formatting type). Cons: not invariant wrt increasing precision (
Note that this problem (which might be solved as described above) only arises if the specified precision is not enough. So, "more digits than specified" = "insufficient precision". Alternatively, we could raise an exception in such a case.
The problem is that you have no chance to communicate this overflow to the user.
Now we choose the minimal two's complement size as being >= k*precision AND with the leading bit set to 1 for negatives. Increasing the precision adds 0's or (base-1)'s as needed:

```pycon
>>> f"{-129:z#.2x}"
'0xf7f'
>>> f"{-129:z#.3x}"
'0xf7f'
>>> f"{-129:z#.4x}"
'0xff7f'
>>> f"{383 :z#.2x}"
'0x17f'
>>> f"{383 :z#.3x}"
'0x17f'
>>> f"{383 :z#.4x}"
'0x017f'
```
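That rule can be sketched in pure Python for all three base-2 types (`tc_format` is a hypothetical name; the sketch assumes the width grows in whole digits, so the leading bit always encodes the sign):

```python
def tc_format(value, precision, type_char):
    k = {'b': 1, 'o': 3, 'x': 4}[type_char]  # bits per digit
    bits = max(1, precision) * k
    # Widen by whole digits until the value fits; the range check also
    # guarantees the leading bit is 0 for positives and 1 for negatives.
    while not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        bits += k
    return format(value & ((1 << bits) - 1), type_char).zfill(bits // k)

print(tc_format(-129, 2, 'x'))   # 'f7f'
print(tc_format(383, 2, 'x'))    # '17f'
print(tc_format(-129, 12, 'b'))  # '111101111111'
print(tc_format(383, 12, 'b'))   # '000101111111'
```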
Completely the opposite: it is the most rational argument. This is string formatting, which is printed to the user, and thus has to convey useful information.

```python
f"{128:z#.2x}"   # '0x080'
f"{-128:z#.2x}"  # '0x80'
```

Who is this useful for?? In the useful-examples section of the PEP I'm documenting useful examples for precision (

```python
s = b"GET /\r\n\r\n"
print(" ".join(f"{c:#.2x}" for c in s))
# '0x47 0x45 0x54 0x20 0x2f 0x0d 0x0a 0x0d 0x0a'
# observe the CR and LF bytes padded to precision 2

s = "USA 🦅"
print(" ".join(f"U+{ord(c):.4X}" for c in s))
# 'U+0055 U+0053 U+0041 U+0020 U+1F985'
# observe the last character's Unicode representation has 5 digits;
# precision is only the minimum number of digits
```

And for precision-modulo (

```python
import struct
my_struct = b"\xff"
(t,) = struct.unpack('b', my_struct)  # signed
print(t, f"{t:#.2x}", f"{t:z#.2x}")   # -1 -0x01 0xff
(t,) = struct.unpack('B', my_struct)  # unsigned
print(t, f"{t:#.2x}", f"{t:z#.2x}")   # 255 0xff 0xff
# observe in both the signed and unsigned unpacking the modulo-precision
# flag 'z' produces a predictable two's-complement formatting
```

Give me a single example of the alternate implementation's formatting of -128 as
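Until any proposal lands, equivalent output for the two precision examples above is obtainable in released Python with zero-filled width (note that width pads the whole field, so `#04x` must account for the `0x` prefix, and width is likewise only a minimum):

```python
s = b"GET /\r\n\r\n"
print(" ".join(f"{c:#04x}" for c in s))        # CR and LF padded to two digits

t = "USA 🦅"
print(" ".join(f"U+{ord(c):04X}" for c in t))  # longer codepoints keep all digits
```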
In reasonable contexts of using
As for worry over truncation outside of
I've tried to tread lightly in these conversations lest this look like a 'student arguing with the professor who has decades more experience' situation, but I genuinely think variable-width two's complement is the wrong implementation, and modulo-precision is the obvious right one. You've been eager to merge / review this when there's still no consensus, with plenty of defects in the pushed commits that have only been spotted by my ad-hoc diligence, eg minimal vs 'canonical' hex/octal two's complement needing 4/3 extra leading bits, not just 1 as binary does, which you only hesitantly mention as "We can also choose bitsize being proportional to 4...". (I thought your problem with the truncation of digits was the loss of information/uniqueness, but you're fine with both -129 and 383 being mapped to

At this point it seems like you're trying to rush through a broken, wrong, alternate implementation, to fix it later if necessary, prematurely quashing a PEP every time I bring it up, when a PEP would properly address all concerns and could be debated by others, who I'm 99% sure will side with the modulo-precision implementation.

Just to clear the air: for me this is not about trying to win an argument or egos; it's about what is useful to the end user. It'd be a real pain if the wrong behavior gets merged now and users start relying on a defective de facto standard, and we can't fix it later because a user relies upon
It's useful for detecting errors: if you see that the printed value has more digits than requested, the given integer doesn't fit into the specified range. This is an alternative to an exception.
And why do you not cut off all entries to 4 digits?
I'm not sure you realize what two's-complement representation means. Here is a corrected version of your example:

```pycon
>>> import struct
>>> my_struct = b"\xff"
>>> (t,) = struct.unpack('b', my_struct)  # signed
>>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
-1 -0x01 0xff
>>> (t,) = struct.unpack('B', my_struct)  # unsigned
>>> print(t, f"{t:#.2x}", f"{t:z#.2x}")  # note the last output
255 0xff 0x0ff
```

You asked to interpret the Python integer
In Raymond's original proposal the first option was chosen. But if we allow the precision option in no-
(BTW, to print an unsigned value in 2 digits you should use
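The signed-byte case in the example above can already be rendered as an unsigned two's-complement byte in released Python by masking before formatting:

```python
import struct

raw = b"\xff"
(t,) = struct.unpack('b', raw)  # signed: t == -1
# Mask to the register width first; the result is the unsigned byte value.
print(f"{t & 0xFF:#04x}")       # '0xff'
print(f"{-129 & 0xFF:#04x}")    # '0x7f' -- wraps modulo 256
```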
This PR solves the issue where it's not.
Then, probably, it's not worth your time, is it?
For integer presentation types (excluding `'c'`), the precision gives the minimal number of digits to appear, expanded with an appropriate number of leading zeros. If the `'z'` option is specified for non-decimal presentation types, the integer value is interpreted as two's complement and the precision gives its minimum size `precision*k` in bits, where `k=1,3,4` for the `'b'`, `'o'` and `'x'`/`'X'` types, respectively. A precision of `0` is treated as equivalent to a precision of `1`.

Examples:
📚 Documentation preview 📚: https://cpython-previews--131926.org.readthedocs.build/