Skip to content

Fix [:C:] getting interpreted as the entire code space #1076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 27, 2025

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Mar 26, 2025

@josh-hadley found that after #1057, everything gets U+-escaped in the comments of https://github.com/unicode-org/unicodetools/blob/main/unicodetools/data/security/dev/data/review.txt.

The rule governing what gets escaped is

private static final String EXCAPE_FUNNY_RULE =
":: [[:C:]-[:cn:][:Z:][:whitespace:][:Default_Ignorable_Code_Point:]] hex/unicode ; ";

and the underlying symbol table is the one from TUP:

static final UnicodeProperty.Factory ups =
ToolUnicodePropertySource.make(version); // ICUPropertyFactory.make();
static {
// USE the tool unicode set instead of ICU, which may not be using the latest version.
UnicodeSet.setDefaultXSymbolTable(ups.getXSymbolTable());
UnicodeTransform.setFactory(TOOL_FACTORY);
}

ICU UnicodeSet does not give us the possibility of distinguishing \p{X} from \p{X=} in applyPropertyAlias, thus [:C:] is [:C=:].

Recall:

  • from UAX44, that ISO_Comment is deprecated and stabilized;
  • from Issue 639 Check diffs between TUP and IUP #640 (comment), that TUP has the empty string, rather than null, for its ISO_Comment values;
  • from PropertyAliases.txt, that ISO_Comment has the alias ISC;
  • from PropertyValueAliases.txt, that the General_Category value Other has the alias C.

Thus [:C=:] is the set of code points with the empty string as their ISO_Comment value (the entire code space according to TUP), rather than the set of code points whose General_Category is a value in the Other grouping (which is [:C:]).

This change fixes the immediate issue, but as noted in the comment, the symbol table used here is dangerous (it caused #1072), and will need to go away as part of #1073 and #1074. I also filed ICU-23087 on the ICU side.

josh-hadley
josh-hadley previously approved these changes Mar 26, 2025
Copy link
Collaborator

@josh-hadley josh-hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I temporarily patched my branch where I'm working on Confusables with this updated code and it fixes the issue of characters in comments getting replaced with "U+<code>" escapes. So: LGTM, 👍, etc.

@eggrobin eggrobin merged commit 39874e1 into unicode-org:main Mar 27, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants