
Surprising tokenization of f-strings #135251

Open
nedbat opened this issue Jun 8, 2025 · 1 comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-parser, type-bug (An unexpected behavior, bug, or error)

Comments

@nedbat
Member

nedbat commented Jun 8, 2025

Bug report

Bug description:

Tokenizing an f-string with double braces produces tokens with single braces:

import tokenize, token

TEXT = b"f'{hello:.23f} this: {{braces}} done'"
# tokenize.tokenize() expects a readline-style callable that returns bytes.
f = iter([TEXT]).__next__

for ty, st, _, _, _ in tokenize.tokenize(f):
    print(f"{token.tok_name[ty]}, {st!r}")

Running this with 3.12 shows:

ENCODING, 'utf-8'
FSTRING_START, "f'"
OP, '{'
NAME, 'hello'
OP, ':'
FSTRING_MIDDLE, '.23f'
OP, '}'
FSTRING_MIDDLE, ' this: {'
FSTRING_MIDDLE, 'braces}'
FSTRING_MIDDLE, ' done'
FSTRING_END, "'"
NEWLINE, ''
ENDMARKER, ''

Should the FSTRING_MIDDLE tokens have single braces? Will it stay this way? Are they guaranteed to be split at the braces as shown here, or might they become one FSTRING_MIDDLE token ' this: {braces} done'? To recreate the original source, is it safe to always double the braces found in an FSTRING_MIDDLE token, or are there edge cases I haven't thought of?
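
For what it's worth, here is a minimal round-trip sketch built on exactly the assumption being asked about (that doubling every brace found in an FSTRING_MIDDLE token recreates the original source); it is not a documented guarantee:

import io, token, tokenize

def recreate_source(text: bytes) -> str:
    # Rebuild the f-string source from its tokens, re-doubling the braces
    # that the tokenizer reports singly in FSTRING_MIDDLE tokens.
    pieces = []
    for tok in tokenize.tokenize(io.BytesIO(text).readline):
        if tok.type == token.ENCODING:
            continue
        if tok.type == token.FSTRING_MIDDLE:
            # The assumption under test: every literal brace shows up once
            # here and must be doubled to get the original text back.
            pieces.append(tok.string.replace("{", "{{").replace("}", "}}"))
        else:
            pieces.append(tok.string)
    return "".join(pieces)

print(recreate_source(b"f'{hello:.23f} this: {{braces}} done'"))
# f'{hello:.23f} this: {{braces}} done'

The naive join only works here because the whole input is a single f-string; for general source you would splice tokens back together using their positions.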

Related to nedbat/coveragepy#1980

CPython versions tested on:

3.12, 3.13, 3.14, CPython main branch

Operating systems tested on:

No response

@nedbat nedbat added the type-bug An unexpected behavior, bug, or error label Jun 8, 2025
@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-parser labels Jun 8, 2025
@ericvsmith
Member

As far as the "splitting at the braces" thing goes: I would not rely on this not changing. Back when I wrote the original f-string tokenizer, the splitting was just an optimization so I could play C games with pointers to null-terminated strings. I'd temporarily change "this: {{braces}} done" into these strings, in turn:

"this: {\0"
"braces}\0"
" done\0"

After I was done, I'd replace the '\0' with whatever was originally there. Doing it this way, I didn't have to allocate space for a new string without the doubled braces. I can easily imagine a future where this tradeoff changes, or where the trick is only used for strings longer than some fixed-size temporary buffer, or something like that.

I don't know if the PEP 701 tokenizer kept this behavior deliberately for compatibility, or if it was just easier for them, too.
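
If it helps, one way for a consumer to stay robust to that changing (a defensive sketch, not a behavior the tokenizer promises) is to coalesce runs of adjacent FSTRING_MIDDLE tokens before looking at them, so it doesn't matter whether the literal text arrives as one token or several:

import io, token, tokenize

def coalesced_tokens(text: bytes):
    # Yield (type, string) pairs, merging adjacent FSTRING_MIDDLE tokens
    # into one so the split points chosen by the tokenizer don't matter.
    pending = None
    for tok in tokenize.tokenize(io.BytesIO(text).readline):
        if tok.type == token.FSTRING_MIDDLE:
            pending = tok.string if pending is None else pending + tok.string
            continue
        if pending is not None:
            yield token.FSTRING_MIDDLE, pending
            pending = None
        yield tok.type, tok.string

for ty, st in coalesced_tokens(b"f'{hello:.23f} this: {{braces}} done'"):
    print(f"{token.tok_name[ty]}, {st!r}")

With the example from the report, the three middle pieces come out as the single token ' this: {braces} done'.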

For the rest of your questions: @pablogsal
