Python 3.12 tokenize generates invalid locations for f'\N{unicode}' #115154

smartbomb · 2024-02-08T06:59:37Z

Bug report

Bug description:

from tokenize import untokenize, generate_tokens
from io import StringIO

untokenize(generate_tokens(StringIO("f'\\N{EXCLAMATION MARK}'").readline))

ValueError: start (1,22) precedes previous end (1,24)

CPython versions tested on:

3.12

Operating systems tested on:

Windows

Linked PRs

Eclips4 · 2024-02-08T07:12:22Z

cc @pablogsal

terryjreedy · 2024-02-08T09:30:24Z

The tokens produced by tokenizing f'\\N{EXCLAMATION MARK}' (collected in the list toks) and the traceback from untokenizing that list are:

[TokenInfo(type=59 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=60 (FSTRING_MIDDLE), string='\\N{EXCLAMATION MARK}', start=(1, 2), end=(1, 22), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=61 (FSTRING_END), string="'", start=(1, 22), end=(1, 23), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 23), end=(1, 24), line="f'\\N{EXCLAMATION MARK}'"),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
#
Traceback (most recent call last):
  File "F:\dev\tem\tem2.py", line 6, in <module>
    print(untokenize(toks))
  File "C:\Programs\Python313\Lib\tokenize.py", line 294, in untokenize
    out = ut.untokenize(iterable)
  File "C:\Programs\Python313\Lib\tokenize.py", line 223, in untokenize
    self.add_whitespace(start)
  File "C:\Programs\Python313\Lib\tokenize.py", line 176, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,22) precedes previous end (1,24)
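
As a quick cross-check (a sketch, not part of the original report), the positions in the token list above can be compared the way the error message does: each start against the previous end. The helper name `first_overlap` is hypothetical. On the listed positions it finds no overlap, which suggests the tokenizer output is self-consistent and the conflict arises inside untokenize.

```python
# (start, end) positions copied from the token list above.
positions = [
    ((1, 0), (1, 2)),    # FSTRING_START
    ((1, 2), (1, 22)),   # FSTRING_MIDDLE
    ((1, 22), (1, 23)),  # FSTRING_END
    ((1, 23), (1, 24)),  # NEWLINE
    ((2, 0), (2, 0)),    # ENDMARKER
]


def first_overlap(spans):
    """Return the index of the first span that starts before the
    previous span ended, or None if every span is in order.

    Hypothetical helper for illustration; it only mimics the
    comparison named in the ValueError message.
    """
    prev_end = (1, 0)
    for index, (start, end) in enumerate(spans):
        if start < prev_end:  # tuples compare as (row, column)
            return index
        prev_end = end
    return None


print(first_overlap(positions))  # None: no token starts before the previous end
```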

The problem is in Untokenizer.untokenize (line 215): when it sees an FSTRING_MIDDLE token, it replaces '{' and '}' with '{{' and '}}' and bumps the end position by 2, making the recorded end column (24) 2 more than the next token's start column (22). That adjustment is correct when the curly brackets in the token result from the reverse process, where the tokenizer collapses '{{' and '}}', but not when the lexer recognizes \N{name} as a Unicode named literal and leaves it in the token without replacing it with the indicated character. Other escapes are resolved, as with '\ueeee' being tokenized as a single character. Unless the tokenizer replaces \N{name} with a character, the untokenizer must also recognize \N{name} and skip the replacement.
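
To make the column arithmetic concrete, here is a minimal sketch of the two behaviors described above. It is not the code merged in CPython; `naive_escape`, `escape_braces`, and the regular expression are hypothetical names and choices made for illustration only. The first function doubles every brace, the second leaves braces that belong to a \N{name} escape alone.

```python
import re

# Hypothetical sketch, not the code merged in CPython.
# Alternatives are tried left to right at each position:
#   1. a \N{NAME} escape, returned untouched
#   2. a backslash followed by a non-brace character, returned untouched
#      (two consecutive backslashes are consumed as a pair, so an escaped
#      backslash before N is not mistaken for the start of a named escape)
#   3. a lone brace, which gets doubled
_PIECE = re.compile(r"\\N\{[^{}]*\}|\\[^{}]|[{}]", re.DOTALL)


def naive_escape(text):
    """Double every brace, as the comment above describes for 3.12."""
    return text.replace("{", "{{").replace("}", "}}")


def escape_braces(text):
    """Double braces except those belonging to a \\N{NAME} escape."""
    def repl(match):
        piece = match.group(0)
        return piece * 2 if piece in ("{", "}") else piece
    return _PIECE.sub(repl, text)


middle = r"\N{EXCLAMATION MARK}"  # the FSTRING_MIDDLE string from the report
print(len(naive_escape(middle)) - len(middle))   # 2 extra columns
print(len(escape_braces(middle)) - len(middle))  # 0 extra columns
print(escape_braces("a{b}"))                     # a{{b}}
```

With the naive version, the 2 extra columns push the end of FSTRING_MIDDLE from (1, 22) to (1, 24), which is past the (1, 22) start of FSTRING_END and matches the reported ValueError.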

pablogsal · 2024-02-08T15:19:18Z

CC: @isidentical

smartbomb added the type-bug label ("An unexpected behavior, bug, or error") Feb 8, 2024

Eclips4 added the topic-parser label Feb 8, 2024

pablogsal added a commit to pablogsal/cpython that referenced this issue Feb 8, 2024

pythongh-115154: Fix untokenize handling of unicode named literals

3ce1233

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

bedevere-app bot mentioned this issue Feb 8, 2024

gh-115154: Fix untokenize handling of unicode named literals #115171

Merged

pablogsal added a commit to pablogsal/cpython that referenced this issue Feb 11, 2024

fixup! pythongh-115154: Fix untokenize handling of unicode named literals

d212f92

pablogsal added a commit that referenced this issue Feb 19, 2024

gh-115154: Fix untokenize handling of unicode named literals (#115171)

ecf16ee

bedevere-app bot mentioned this issue Feb 19, 2024

[3.12] gh-115154: Fix untokenize handling of unicode named literals (GH-115171) #115662

Merged

pablogsal closed this as completed Feb 19, 2024

woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024

pythongh-115154: Fix untokenize handling of unicode named literals (python#115171)

aa8b0d7

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024

pythongh-115154: Fix untokenize handling of unicode named literals (python#115171)

c56230b

LukasWoodtli pushed a commit to LukasWoodtli/cpython that referenced this issue Jan 22, 2025

pythongh-115154: Fix untokenize handling of unicode named literals (python#115171)

d4d53ef
