Fix [:C:] getting interpreted as the entire code space #1076
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@josh-hadley found that after #1057, everything gets U+-escaped in the comments of https://github.com/unicode-org/unicodetools/blob/main/unicodetools/data/security/dev/data/review.txt.
The rule governing what gets escaped is
unicodetools/unicodetools/src/main/java/org/unicode/text/UCD/GenerateConfusables.java
Lines 177 to 178 in 83971c4
and the underlying symbol table is the one from TUP:
unicodetools/unicodetools/src/main/java/org/unicode/text/UCD/GenerateConfusables.java
Lines 130 to 137 in 83971c4
ICU UnicodeSet does not give us the possibility of distinguishing \p{X} from \p{X=} in applyPropertyAlias, thus [:C:] is [:C=:].
Recall:
Thus [:C=:] is the set of code points with the empty string as their ISO_Comment value (the entire code space according to TUP), rather than the set of code points whose General_Category is a value in the Other grouping (which is [:C:]).
This change fixes the immediate issue, but as noted in the comment, the symbol table used here is dangerous (it caused #1072), and will need to go away as part of #1073 and #1074. I also filed ICU-23087 on the ICU side.