unicodedata.east_asian_width returns wrong result for unassigned Unicode code points #96172
You're both right and wrong. Some unassigned codepoints should be 'N', others should be 'W', and there are some assigned codepoints that are also wrong, e.g. U+061D (Other_Punctuation) and a whole sequence starting at U+0870 (Other_Letter) should be 'N', amongst others. I have a package "uniprop" on PyPI that you can use for crosschecking.
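(A quick way to see one of these mismatches, assuming you run it on CPython 3.10, whose `unicodedata` ships Unicode 13.0 data, so U+061D is still unassigned there:)

```python
import unicodedata

# U+061D (ARABIC END OF TEXT MARK, added in Unicode 14.0) should be 'N'
# per EastAsianWidth.txt; on CPython 3.10 it is unassigned, so the lookup
# falls back to the database's default record instead of reporting 'N'.
print(unicodedata.east_asian_width('\u061D'))
```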
@mrabarnett interesting, did you report these issues? how are the tables in the `uniprop` package generated?
I didn't know where to report them. I didn't use `unicodedata`; the tables in `uniprop` are generated directly from the Unicode data files.
Thanks for the info! yeah, you're right, basically. I think the U+061D and U+0870... problems you reported might be because those characters were added in Unicode 14.0.0, right? did you compare against CPython 3.10 by chance (which uses 13.0.0)?
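(For what it's worth, the Unicode data version a given CPython ships can be checked directly via the standard `unicodedata` API:)

```python
import unicodedata

# CPython 3.10 ships Unicode 13.0.0 data; 3.11 ships 14.0.0.
print(unicodedata.unidata_version)
```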
Yes, you're correct. Anyway, I count 830_672 incorrect values out of 1_114_112, a mere 75%. :-)
heh :-). feel like sharing your comparison script?
OK, here it is, tidied up a little:

```python
#!python3.10
# -*- encoding: utf-8 -*-
import unicodedata
from os.path import dirname, join

# EastAsianWidth.txt is expected to sit next to this script.
east_asian_width_path = join(dirname(__file__), 'EastAsianWidth.txt')

NUM_CODEPOINTS = 0x110000

# Build the expected width for every code point from the data file:
# the '# @missing:' line gives the default, explicit entries override it.
with open(east_asian_width_path) as file:
    for line in file:
        if line.startswith('# @missing:'):
            expected = [line.split()[-1]] * NUM_CODEPOINTS
        else:
            fields = line.partition('#')[0].split(';')

            if len(fields) == 2:
                codepoints = [int(field, 16) for field in fields[0].split('..')]
                lower, upper = codepoints[0], codepoints[-1]
                expected[lower : upper + 1] = [fields[1].strip()] * (upper - lower + 1)

differences = []

for c in range(NUM_CODEPOINTS):
    actual = unicodedata.east_asian_width(chr(c))

    if actual != expected[c]:
        differences.append((c, expected[c], actual))

print(f'{len(differences):_} differences found')
print()

# Collapse consecutive code points with the same (expected, actual) pair
# into ranges so the report stays readable.
lower, expected, actual = differences[0]
upper = lower
rows = [('Range', 'Expected', 'Actual')]

for c, e, a in differences[1:]:
    if (c, e, a) == (upper + 1, expected, actual):
        upper = c
    else:
        if lower == upper:
            rows.append((f'{lower:04X}', expected, actual))
        else:
            rows.append((f'{lower:04X}..{upper:04X}', expected, actual))

        lower, expected, actual = c, e, a
        upper = lower

if lower == upper:
    rows.append((f'{lower:04X}', expected, actual))
else:
    rows.append((f'{lower:04X}..{upper:04X}', expected, actual))

# Print the rows as an aligned table.
widths = tuple(max(len(cell) for cell in column) for column in zip(*rows))
fmt = ' | '.join(f'%-{width}s' for width in widths)

print(fmt % rows.pop(0))
print('-+-'.join('-' * width for width in widths))

for row in rows:
    print(fmt % row)
```
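(If you want to reproduce this, the data file can be fetched from unicode.org; a sketch, assuming you want the version matching your interpreter's `unicodedata.unidata_version`:)

```python
import unicodedata
import urllib.request

# Download the EastAsianWidth.txt matching this interpreter's UCD version,
# e.g. 13.0.0 for CPython 3.10 or 14.0.0 for CPython 3.11.
version = unicodedata.unidata_version
url = f'https://www.unicode.org/Public/{version}/ucd/EastAsianWidth.txt'
urllib.request.urlretrieve(url, 'EastAsianWidth.txt')
```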
Linked pull request: "also return the correct width for unassigned but reserved characters according to EastAsianWidth.txt"
This is really a corner case, but I ran across the problem today. The Unicode data file for East Asian widths, EastAsianWidth.txt, states that every code point not explicitly listed defaults to 'N':

> `# @missing: 0000..10FFFF; N`

However, that doesn't seem to be true in the `unicodedata` module.
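For instance (a minimal illustration, not the original snippet; U+0378 is an unassigned code point, so per the default above its width should be 'N'):

```python
import unicodedata

# U+0378 is unassigned in every Unicode version to date; CPython reports
# the database's default record's width ('F') here instead of 'N'.
print(unicodedata.east_asian_width('\u0378'))
```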
I'd be happy to fix this, if people agree that it should be fixed. FWIW, PyPy has always returned 'N' in this situation. For assigned code points everything is fine.