Skip to content

gh-66428 Stop including all bidirectional "B" characters in line breakers #132369

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

LoicGrobol
Copy link

@LoicGrobol LoicGrobol commented Apr 10, 2025

This is not a direct fix to the issue (as it is not clear how/if it should be fixed at this point), but to this comment:

I couldn't find a trace of bidirectional B characters being non-tailorable line breaks (classes BK, CR, LF, NL), neither in the line breaking not in the bidirectional algorithm specification. What I think the latter means is that B characters actually only “terminate paragraphs” in the sense of “terminating the effect of directional formatting characters”. Now it is true that \N{PARAGRAPH SEPARATOR} is both B and BK, but does it means that all B characters should be line breaks? In particular what about \N{FILE SEPARATOR}, \N{GROUP SEPARATOR} and \N{RECORD SEPARATOR}, who are all caught by str.splitlines, but have the Combining Mark property (see e.g. the list of line breaks), which should in fact forbid a line break?

And the answer by @malemburg

At the time this code was written (in 2000), Unicode 3.0 was the current version and in those days, bidirectional B code points were listed as "Paragraph Separator", in particular the important ones CR, LF. See https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html and https://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt for details.

I don't remember, but it's well possible that I wasn't aware of the CM LineBreak property at the time.

I guess the _PyUnicode_IsLinebreak() function should really use the LineBreak.txt file as basis and only include code points which are listed as "Non-tailorable" in that file: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt

Perhaps it's time to fix this after 25 years 😄

Patches are welcome !

It turns out that LineBreak.txt is already used to gather the line break categories, but that B was still kept as indicating a line breaker. Looking at the current state of Unicode (16.0)

  • All the characters in B that are non-tailorable line breakers are in the BK, CR, LF or NL categories, so they were already captured.
  • The other characters in B (U+001C, U+001D, U+001E) are all combining marks and shouldn't break lines.

So I just removed the bidirectional == "B" condition in the code generating the list of line breakers and that should be it.

I think from an API perspective, this only affects str.splitlines(), which at this point is only tested for behaviour against CR, LF and CR+LF and no other line breaker, so I didn't add any test, but I can if it seems useful.

In general, I don't expect this to be a huge compatility-breaking change given the conversation in #66428, but don't really know how to check for that apart from searching for the codepoints (in U+ and \x forms) on Github, which didn't return any Python code that would be broken.

This is my first PR, so it's very likely I missed something, please let me know!


📚 Documentation preview 📚: https://cpython-previews--132369.org.readthedocs.build/

L. Grobol added 3 commits April 10, 2025 15:17
…rectional B but don't have the line break property

This is exactly three characters: U+001C "FILE SEPARATOR", U+001D "GROUP SEPARATOR" and U+001E "RECORD SEPARATOR", all of which have the Combining Mark line breaking property, meaning that they should *not* be present at a line break
@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot
Copy link

python-cla-bot bot commented Apr 10, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Apr 10, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant