Surprising tokenization of f-strings #135251

Closed
nedbat opened this issue Jun 8, 2025 · 5 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-parser type-bug An unexpected behavior, bug, or error

Comments

@nedbat (Member) commented Jun 8, 2025

Bug report

Bug description:

Tokenizing an f-string with double braces produces tokens with single braces:

import tokenize, token

TEXT = b"f'{hello:.23f} this: {{braces}} done'"
f = iter([TEXT]).__next__

for ty, st, _, _, _ in tokenize.tokenize(f):
    print(f"{token.tok_name[ty]}, {st!r}")

Running this with 3.12 shows:

ENCODING, 'utf-8'
FSTRING_START, "f'"
OP, '{'
NAME, 'hello'
OP, ':'
FSTRING_MIDDLE, '.23f'
OP, '}'
FSTRING_MIDDLE, ' this: {'
FSTRING_MIDDLE, 'braces}'
FSTRING_MIDDLE, ' done'
FSTRING_END, "'"
NEWLINE, ''
ENDMARKER, ''

Should the FSTRING_MIDDLE tokens have single braces? Will it stay this way? Are they guaranteed to be split at the braces as shown here, or might they become one FSTRING_MIDDLE token ' this: {braces} done'? To recreate the original source, is it safe to always double the braces found in an FSTRING_MIDDLE token, or are there edge cases I haven't thought of?

Related to nedbat/coveragepy#1980

CPython versions tested on:

3.12, 3.13, 3.14, CPython main branch

Operating systems tested on:

No response

@nedbat nedbat added the type-bug An unexpected behavior, bug, or error label Jun 8, 2025
@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-parser labels Jun 8, 2025
@ericvsmith (Member) commented:
As far as the "splitting at the braces" thing goes: I would not rely on this not changing. Back when I wrote the original f-string tokenizer, this was just an optimization so I could play C games with pointers to null-terminated strings. I'd temporarily change "this: {{braces}} done" to be these strings, in turn:

"this: {\0"
"braces}\0"
" done\0"

After I was done, I'd replace the '\0' with whatever was originally there. Doing it this way, I didn't have to allocate space for a new string without the doubled braces. I can easily imagine a future where this tradeoff changes, or where the trick is only used for strings longer than some fixed-size temporary buffer, or something like that.

I don't know if the PEP 701 tokenizer kept this behavior deliberately for compatibility, or if it was just easier for them, too.

For the rest of your questions: @pablogsal

@terryjreedy (Member) commented:
Ned, you can sign up in .github/CODEOWNERS to be notified (emailed) when a PR is submitted that changes particular files (at least those that are tracked).

@pablogsal (Member) commented Jun 9, 2025

We kept this behavior for compatibility. We also had to deal with this in the untokenizer:

cpython/Lib/tokenize.py, lines 255 to 260 at a58026a:

if '{' in token or '}' in token:
    token = self.escape_brackets(token)
    last_line = token.splitlines()[-1]
    end_line, end_col = end
    extra_chars = last_line.count("{{") + last_line.count("}}")
    end = (end_line, end_col + extra_chars)
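A quick round-trip check (assuming a Python with the escape_brackets fix, 3.12+) showing that tokenize.untokenize re-doubles the braces when reconstructing the source:

```python
import io
import tokenize

SRC = b"f'{hello:.23f} this: {{braces}} done'\n"

# Tokenize, then untokenize with full 5-tuples so exact
# positions are used for reconstruction.
toks = list(tokenize.tokenize(io.BytesIO(SRC).readline))
round_tripped = tokenize.untokenize(toks)

# escape_brackets has turned the single braces in the
# FSTRING_MIDDLE tokens back into doubled braces.
print(round_tripped)
```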

We could explore changing this if everyone agrees as this also was kind of a problem in the REPL:

if (token.type in {T.FSTRING_MIDDLE, T.TSTRING_MIDDLE}
        and token.string.endswith(("{", "}"))):
    # gh-134158: a visible trailing brace comes from a double brace in input
    end_offset += 1

On the other hand, the change would be backwards incompatible, so I am not sure what's the best thing to do here.

@nedbat (Member, Author) commented Jun 9, 2025

Thanks for all the details. I've adjusted coverage.py for the tokenization as it is now, and I don't depend on the tokens breaking at the braces, so there's no need for a change on my account. If you do make a change, my tests should alert me!

@pablogsal (Member) commented:
Are you OK if we close this issue? What are your thoughts, @ericvsmith?

@terryjreedy closed this as not planned, Jun 10, 2025