
GH-96172 fix unicodedata.east_asian_width being wrong on unassigned code points #96207


cfbolz
Contributor

@cfbolz cfbolz commented Aug 23, 2022

unicodedata.east_asian_width returned the wrong value in two situations:

  • for unassigned code points in general (should return 'N', returned 'F')
  • for unassigned code points in reserved ranges (should return 'W', returned 'F')

This is fixed by reordering the list EASTASIANWIDTH_NAMES, which changes the default return value, and by adding records for the unassigned but reserved entries.
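
To illustrate why the order of the names list matters, here is a toy model of an index-based lookup, not CPython's actual data structures. The two list orders, the helper names `build_records` and `east_asian_width`, and the sample code points are all assumptions made for the sketch; the only thing taken from the description above is that a code point without a record falls back to the first entry.

```python
# Toy model of an index-based property table; not CPython's real tables.
OLD_NAMES = ["F", "H", "W", "Na", "A", "N"]  # assumed old order: "F" at index 0
NEW_NAMES = ["N", "F", "H", "W", "Na", "A"]  # assumed new order: "N" at index 0


def build_records(names, assigned):
    """Map each code point that has explicit data to the index of its width name."""
    return {cp: names.index(width) for cp, width in assigned.items()}


def east_asian_width(names, records, cp):
    """Look up a width; code points without a record fall back to index 0."""
    return names[records.get(cp, 0)]


# Hypothetical sample data: two code points with records, one without.
ASSIGNED = {0x0041: "Na", 0x4E00: "W"}
NO_RECORD = 0x0378

old = east_asian_width(OLD_NAMES, build_records(OLD_NAMES, ASSIGNED), NO_RECORD)
new = east_asian_width(NEW_NAMES, build_records(NEW_NAMES, ASSIGNED), NO_RECORD)
print(old, new)  # F N
```

Code points that do have a record are unaffected by the reordering, because their stored index is derived from whichever list is in use.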

cfbolz added 2 commits August 23, 2022 13:26
also return the correct width for unassigned but reserved characters
according to EastAsianWidth.txt
- this guards against accidentally introducing changes in the future
- if east_asian_width had been part of the checksum from the beginning,
  the bug would have been found much earlier by PyPy
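
A checksum of this kind could look roughly like the sketch below. It is not the test from this PR; the function name `width_checksum`, its parameters, and the choice of SHA-1 with a comma separator are assumptions for illustration. The idea is that hashing the property for every code point turns any unintended change in the table into a changed digest.

```python
import hashlib
import unicodedata


def width_checksum(code_points=range(0x110000), width=unicodedata.east_asian_width):
    """Hash the East Asian Width of every code point in `code_points`."""
    h = hashlib.sha1()
    for cp in code_points:
        # A separator keeps adjacent values from running together.
        h.update(width(chr(cp)).encode("ascii") + b",")
    return h.hexdigest()


# Checksum over the ASCII range only, to keep the example fast.
print(width_checksum(range(0x80)))
```

A test would compare the digest over the full range against a value recorded for the bundled Unicode version, so the expected digest has to be updated whenever the Unicode data is upgraded.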
@isidentical isidentical self-requested a review August 23, 2022 13:35
@isidentical
Member

@cfbolz do you know whether we handle the first case mentioned in "6.1 Unassigned and Private-Use Characters" (Wide for CJK):

Unassigned code points in ranges intended for CJK ideographs are classified as Wide. Those ranges are:

the CJK Unified Ideographs block, 4E00..9FFF
the CJK Unified Ideographs Extension A block, 3400..4DBF
the CJK Compatibility Ideographs block, F900..FAFF
the Supplementary Ideographic Plane, 20000..2FFFF
the Tertiary Ideographic Plane, 30000..3FFFF

All other unassigned code points are by default classified as Neutral.
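
The rule quoted above can be written down as a small reference function. This is only a sketch of the quoted text, not code from the PR; the names `WIDE_BY_DEFAULT` and `default_width_for_unassigned` are made up for the example, and the function says nothing about assigned code points.

```python
# Ranges quoted from section 6.1, with inclusive bounds.
WIDE_BY_DEFAULT = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2FFFF),  # Supplementary Ideographic Plane
    (0x30000, 0x3FFFF),  # Tertiary Ideographic Plane
]


def default_width_for_unassigned(cp: int) -> str:
    """Return the default East Asian Width for an unassigned code point."""
    for first, last in WIDE_BY_DEFAULT:
        if first <= cp <= last:
            return "W"
    return "N"


print(default_width_for_unassigned(0x3FFFD))  # W
print(default_width_for_unassigned(0x0378))   # N
```

A test could compare this against `unicodedata.east_asian_width` for code points whose `unicodedata.category` is `'Cn'`, with the caveat that EastAsianWidth.txt remains the authoritative source.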

@cfbolz
Contributor Author

cfbolz commented Aug 24, 2022

@isidentical oops, I definitely had an explicit test, seems that got lost somehow. will retrieve it tomorrow. and yes, it definitely had characters from the CJK ranges that you list

@cfbolz
Contributor Author

cfbolz commented Aug 24, 2022

it's in my first commit, but then in the third one I removed it again for some reason :-/. 5688e6a#diff-c31ff7b8fca97de6b4fdaca3e14da27ab3cac411653e9c510a5378b189f909eaR223

fixing tomorrow.

@isidentical
Member

Amazing, thanks!
