
unicodedata.east_asian_width returns wrong result for unassigned Unicode code points #96172


Closed
cfbolz opened this issue Aug 22, 2022 · 7 comments
Labels
topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@cfbolz
Contributor

cfbolz commented Aug 22, 2022

This is really a corner case, but I ran across the problem today. The Unicode data file for East Asian widths (EastAsianWidth.txt) states:

#  - All code points, assigned or unassigned, that are not listed
#      explicitly are given the value "N".

However, that does not seem to hold in the unicodedata module, e.g.:

$ python3
Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> char = chr(0xfe75) # arbitrary unassigned code point
>>> unicodedata.name(char)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.east_asian_width(char)
'F'

I'd be happy to fix this, if people agree that it should be fixed. FWIW, PyPy has always returned 'N' in this situation. For assigned code points everything is fine.
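A quick sanity check (mine, not from the report) showing that for assigned code points east_asian_width() does return the documented property values:

```python
import unicodedata

# Assigned code points report their East_Asian_Width property correctly:
print(unicodedata.east_asian_width('A'))       # 'Na' (narrow: ASCII letter)
print(unicodedata.east_asian_width('\u4e00'))  # 'W'  (wide: CJK ideograph)
print(unicodedata.east_asian_width('\uff21'))  # 'F'  (fullwidth: FULLWIDTH A)
```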

@cfbolz added the type-bug label Aug 22, 2022
@mrabarnett

You're both right and wrong.

Some unassigned codepoints should be 'N', others should be 'W', and there are some assigned codepoints that are also wrong, e.g. U+061D (Other_Punctuation) and a whole sequence starting at U+0870 (Other_Letter) should be 'N', amongst others.

I have a package "uniprop" on PyPI that you can use for crosschecking.
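As a side note, whether a code point is assigned at all in a given Python's bundled Unicode database can be read off its general category ('Cn' means "Other, not assigned"). A small helper of my own (not part of uniprop or unicodedata):

```python
import unicodedata

def is_assigned(cp):
    """True if the code point is assigned a character in this Python's
    bundled Unicode Character Database ('Cn' = Other, not assigned)."""
    return unicodedata.category(chr(cp)) != 'Cn'

print(is_assigned(0x0041))  # True:  LATIN CAPITAL LETTER A
print(is_assigned(0xFE75))  # False: reserved code point
```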

@cfbolz
Contributor Author

cfbolz commented Aug 22, 2022

@mrabarnett interesting, did you report these issues? How are the tables in uniprop generated? By this script, I assume: https://bitbucket.org/mrabarnett/mrab-regex/src/hg/tools/build_regex_unicode.py

@mrabarnett

I didn't know that unicodedata had a problem until you reported it here.

I didn't use build_regex_unicode.py for uniprop, I used a related script, because the files generated by the former script don't contain all the stuff that uniprop needs. I really need to finish consolidating them so that I can use the one script for both.

@cfbolz
Contributor Author

cfbolz commented Aug 22, 2022

Thanks for the info! Yeah, you're right: unicodedata currently seems to ignore all the "reserved" ranges from EastAsianWidth.txt, in addition to using the wrong default, e.g.:

9FFD..9FFF;W     # Cn     [3] <reserved-9FFD>..<reserved-9FFF>
...
FA6E..FA6F;W     # Cn     [2] <reserved-FA6E>..<reserved-FA6F>
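
Data lines of that shape can be parsed mechanically; a minimal sketch (field layout per the UCD file format, the function name is mine):

```python
def parse_eaw_line(line):
    """Parse one EastAsianWidth.txt data line into (lower, upper, width),
    or None for comment and blank lines."""
    data = line.partition('#')[0].strip()
    if not data:
        return None
    cp_range, _, width = data.partition(';')
    endpoints = cp_range.split('..')  # single code point or LOWER..UPPER
    return int(endpoints[0], 16), int(endpoints[-1], 16), width.strip()

print(parse_eaw_line('9FFD..9FFF;W     # Cn     [3] <reserved-9FFD>..<reserved-9FFF>'))
# (40957, 40959, 'W')
```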

I think the U+061D and U+0870... problems you reported might be because those characters were added in 14.0.0, right? Did you compare against CPython 3.10 by chance (which uses 13.0.0)?
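Which UCD version a given CPython bundles can be checked directly:

```python
import unicodedata

# Version of the Unicode Character Database this Python was built with;
# e.g. CPython 3.10 reports '13.0.0'.
print(unicodedata.unidata_version)
```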

@mrabarnett

Yes, you're correct. Anyway, I count 830_672 incorrect values out of 1_114_112, a mere 75%. :-)

@cfbolz
Contributor Author

cfbolz commented Aug 22, 2022

Heh :-). Feel like sharing your comparison script?

@mrabarnett

OK, here it is, tidied up a little:

#!python3.10
# -*- encoding: utf-8 -*-
import unicodedata
from os.path import dirname, join

# Path to a local copy of EastAsianWidth.txt, assumed to sit next to this script.
east_asian_width_path = join(dirname(__file__), 'EastAsianWidth.txt')

NUM_CODEPOINTS = 0x110000

with open(east_asian_width_path) as file:
    for line in file:
        if line.startswith('# @missing:'):
            expected = [line.split()[-1]] * NUM_CODEPOINTS
        else:
            fields = line.partition('#')[0].split(';')

            if len(fields) == 2:
                codepoints = [int(field, 16) for field in fields[0].split('..')]
                lower, upper = codepoints[0], codepoints[-1]
                expected[lower : upper + 1] = [fields[1].strip()] * (upper - lower + 1)

differences = []

for c in range(NUM_CODEPOINTS):
    actual = unicodedata.east_asian_width(chr(c))

    if actual != expected[c]:
        differences.append((c, expected[c], actual))

print(f'{len(differences):_} differences found')
print()

lower, expected, actual = differences[0]
upper = lower

rows = [('Range', 'Expected', 'Actual')]

for c, e, a in differences[1:]:
    if (c, e, a) == (upper + 1, expected, actual):
        upper = c
    else:
        if lower == upper:
            rows.append((f'{lower:04X}', expected, actual))
        else:
            rows.append((f'{lower:04X}..{upper:04X}', expected, actual))

        lower, expected, actual = c, e, a
        upper = lower

if lower == upper:
    rows.append((f'{lower:04X}', expected, actual))
else:
    rows.append((f'{lower:04X}..{upper:04X}', expected, actual))

widths = tuple(max(len(cell) for cell in column) for column in zip(*rows))
fmt = ' | '.join(f'%-{width}s' for width in widths)

print(fmt % rows.pop(0))
print('-+-'.join('-' * width for width in widths))

for row in rows:
    print(fmt % row)

cfbolz added a commit to cfbolz/cpython that referenced this issue Aug 23, 2022
also return the correct width for unassigned but reserved characters
according to EastAsianWidth.txt
4 participants