Should we support unicode in width/precision formatting fields? #135025

Open
skirpichev opened this issue Jun 2, 2025 · 6 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@skirpichev
Contributor

skirpichev commented Jun 2, 2025

Bug report

Bug description:

Currently, the format specification allows only [0-9] digits. However, the actual implementation permits Unicode digits for floats and Decimals, but not for Fractions:

>>> f"{decimal.Decimal('123'):.١١f}"  # arabic 11 in precision
'123.00000000000'
>>> f"{fractions.Fraction('123'):.١١f}"
Traceback (most recent call last):
  File "<python-input-9>", line 1, in <module>
    f"{fractions.Fraction('123'):.١١f}"
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sk/src/cpython/Lib/fractions.py", line 600, in __format__
    raise ValueError(
    ...<2 lines>...
    )
ValueError: Invalid format specifier '.١١f' for object of type 'Fraction'
>>> f"{float(fractions.Fraction('123')):.١١f}"
'123.00000000000'

Quick tests show no measurable performance penalty with Unicode support:

$ python -m timeit -s 'from fractions import Fraction as F' 'format(F(123), ".11f")'
10000 loops, best of 5: 39.2 usec per loop
$ python -m timeit -s 'from fractions import Fraction as F' 'format(F(123), ".١١f")'  # with patch
5000 loops, best of 5: 40.2 usec per loop
A patch:
diff --git a/Lib/fractions.py b/Lib/fractions.py
index 063f28478c..b4120b2beb 100644
--- a/Lib/fractions.py
+++ b/Lib/fractions.py
@@ -170,7 +170,7 @@ def _round_to_figures(n, d, figures):
     (?P<zeropad>0(?=[0-9]))?
     (?P<minimumwidth>0|[1-9][0-9]*)?
     (?P<thousands_sep>[,_])?
-    (?:\.(?P<precision>0|[1-9][0-9]*))?
+    (?:\.(?P<precision>0|\d*))?
     (?P<presentation_type>[eEfFgG%])
 """, re.DOTALL | re.VERBOSE).fullmatch
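
To see what the one-line regex change does in isolation, here is a minimal sketch. The two patterns are trimmed-down stand-ins for the precision part only, not the full pattern from Lib/fractions.py, and the names `ascii_only` and `unicode_ok` are made up for illustration.

```python
import re

# Simplified stand-ins for the precision part of the format-spec pattern;
# the real pattern in Lib/fractions.py has more fields.
ascii_only = re.compile(r"\.(?P<precision>0|[1-9][0-9]*)f").fullmatch
unicode_ok = re.compile(r"\.(?P<precision>0|\d*)f").fullmatch

spec = ".\u0661\u0661f"  # ".١١f": two ARABIC-INDIC DIGIT ONE characters

# [0-9] matches ASCII digits only, so the original pattern rejects the spec.
rejected = ascii_only(spec) is None

# For str patterns, \d matches any Unicode decimal digit (category Nd),
# and int() converts such digits to their numeric value.
precision = int(unicode_ok(spec)["precision"])

# Side effect of \d*: an empty precision now matches as well.
empty_precision = unicode_ok(".f")["precision"]

print(rejected, precision, repr(empty_precision))
```

Note that, in this simplified form, `\d*` is also looser than the original in two ways unrelated to Unicode: it accepts an empty precision and leading zeros such as ".05f".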

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

@skirpichev skirpichev added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Jun 2, 2025
@serhiy-storchaka
Member

It was perhaps overlooked in the Python 2 to Python 3 transition.

I think that non-ASCII digits should be deprecated here.

@skirpichev
Contributor Author

It was perhaps overlooked in the Python 2 to Python 3 transition.

Maybe. My archaeological research traced this back to the PEP 3101 implementation. Note that the PEP text is vague about the width/precision fields. It says:

‘width’ is a decimal integer defining the minimum field width.

The ‘precision’ is a decimal number indicating how many digits should be displayed after the decimal point in a floating point conversion.

However, I see no tests for this "feature".

I think that non-ASCII digits should be deprecated here.

I think so.

On the other hand, supporting it doesn't look too costly, and it should be easy to adjust the documentation and the fractions module code.

CC @ericvsmith

@skirpichev skirpichev self-assigned this Jun 3, 2025
@ericvsmith
Member

If we were doing it all over again, I'd argue that we should accept only ASCII numbers in format strings (for precision and width). At this point, as much as I'd like to deprecate it, I think we should just document it and move on. And I guess if we do that, we should allow it for Fractions, too.

@skirpichev
Contributor Author

we should allow it for Fractions, too.

That makes sense. But,

we should just document

maybe we can leave things as they are here? That would complicate things for alternative implementations for no good reason. (I also guess that this "feature" was introduced unintentionally.) It seems that PyPy 3.11 doesn't support it:

Python 3.11.11 (0253c85bf5f8, Feb 26 2025, 10:42:42)
[PyPy 7.3.19 with GCC 10.2.1 20210130 (Red Hat 10.2.1-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>> f"{float('123'):.١١f}"
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    f"{float('123'):.١١f}"
ValueError: no precision given

@serhiy-storchaka
Member

Accepting non-ASCII digits creates security risks, because some non-ASCII digits look like ASCII digits with a different value.

>>> format(1/7, '.੪')
'0.1429'
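
One way to make such confusable digits visible is to list every non-ASCII decimal digit in a format spec together with its Unicode name and value. This is only a sketch: `non_ascii_digits` is a hypothetical helper written for illustration, not an existing stdlib function.

```python
import unicodedata

def non_ascii_digits(spec):
    """Return (char, Unicode name, numeric value) for each non-ASCII decimal digit.

    Hypothetical helper for illustration; not part of any stdlib API.
    """
    return [
        (ch, unicodedata.name(ch), unicodedata.decimal(ch))
        for ch in spec
        if ch.isdecimal() and not ch.isascii()
    ]

found = non_ascii_digits(".\u0a6a")  # the '.੪' spec from the example above
print(found)
```

The character in the example resembles an ASCII 8 in many fonts, but its name and value show that it is the Gurmukhi digit four, which is why the output has four significant digits.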

Note that while non-ASCII digits are accepted in string-to-number conversion, they are not accepted in Python numeric literals.

>>> float('੪.੫')
4.5
>>> ੪.੫
  File "<python-input-1>", line 1
    ੪.੫
    ^
SyntaxError: invalid character '੪' (U+0A6A)
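
The same contrast can be checked from a script rather than the REPL. The sketch below uses the built-in `compile()` to test whether the text is valid as an expression; the variable names are made up for illustration.

```python
text = "\u0a6a.\u0a6b"  # '੪.੫': GURMUKHI DIGIT FOUR, '.', GURMUKHI DIGIT FIVE

# String-to-number conversion accepts Unicode decimal digits.
value = float(text)

# The same characters are not valid source code: compiling them as an
# expression raises SyntaxError.
try:
    compile(text, "<example>", "eval")
except SyntaxError:
    literal_ok = False
else:
    literal_ok = True

print(value, literal_ok)
```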

Support for non-ASCII digits in numerical group references in regular expressions was deprecated in 3.11 and removed in 3.12 (see #91760).

We are currently in the process of making strptime() behavior more strict and consistent: non-ASCII digits will only be allowed in locales that use them, and only for fields for which they are used.

See also https://peps.python.org/pep-0672/#confusable-digits . cc @encukou

@encukou
Member

encukou commented Jun 4, 2025

I'd be +1 for deprecating these everywhere, including the int constructor, with a long deprecation period. But I can't commit the time to push the PEP through.

We can't support all numeral systems anyway. From that point of view, supporting ones that use decimal digits is a rather arbitrary choice.
Things like int('੪٦', 20) are just nonsense.

ASCII only is consistent, predictable, and ultimately more secure.
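
As a rough illustration of what "ASCII only" could look like at the call site today, here is a sketch of a strict wrapper around `int()`. The name `ascii_int` is hypothetical; nothing like it is proposed in this thread or present in the stdlib.

```python
def ascii_int(text, base=10):
    """Like int(text, base), but reject any non-ASCII character.

    Hypothetical helper for illustration; not a stdlib function.
    """
    if not text.isascii():
        raise ValueError(f"non-ASCII characters in integer string: {text!r}")
    return int(text, base)

mixed = "\u0a6a\u0666"  # '੪٦': GURMUKHI DIGIT FOUR + ARABIC-INDIC DIGIT SIX

# The built-in accepts digits from two different scripts, even in base 20.
builtin_result = int(mixed, 20)  # 4 * 20 + 6

# The strict wrapper refuses the same input.
try:
    ascii_int(mixed, 20)
except ValueError:
    strict_rejected = True
else:
    strict_rejected = False

print(builtin_result, strict_rejected)
```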

But I'm -1 for only deprecating this in less-important places, like formatting fields. That just feels like a way to avoid the discussion. If we, as the CPython project, have an opinion on this, it should be clear and consistent.


For documentation, I think this should be pointed out as a CPython implementation detail. Other implementations should be free to not support it, if they don't mind the incompatibility.
