Supply the field indices and correct the types of NormalizationCorrections data #1087

Merged

Conversation

eggrobin
Member

@eggrobin eggrobin commented Apr 9, 2025

Before this change, all three normalization_correction_* pseudoproperties had the value 96FB (that is, a four-character string) for U+F951.

This was not a problem for UCDXML as that one parses the file independently,

private static AttributesImpl getNCAttributes(String namespace, UcdLineParser.UcdLine line) {
    String[] parts = line.getParts();
    AttributesImpl attributes = new AttributesImpl();
    attributes.addAttribute(namespace, "cp", "cp", "CDATA", parts[0]);
    attributes.addAttribute(namespace, "old", "old", "CDATA", parts[1]);
    attributes.addAttribute(namespace, "new", "new", "CDATA", parts[2]);
    attributes.addAttribute(namespace, "version", "version", "CDATA", parts[3]);
    return attributes;
}
.
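
To illustrate the distinction between the hex text 96FB and the string it denotes, here is a minimal sketch, not the code from this change. It assumes the four-field layout that getNCAttributes reads (cp, old, new, version) and a sample line in the style of NormalizationCorrections.txt; the class NormalizationCorrectionLine and its methods are hypothetical names, not part of unicodetools.

```java
/** Hypothetical holder for one data line; not part of unicodetools. */
final class NormalizationCorrectionLine {
    final int codePoint;            // field 0, parsed as a hex code point
    final String oldDecomposition;  // field 1, decoded from hex to the string it denotes
    final String newDecomposition;  // field 2, decoded the same way
    final String version;           // field 3, kept as text

    NormalizationCorrectionLine(int codePoint, String oldDecomposition,
                                String newDecomposition, String version) {
        this.codePoint = codePoint;
        this.oldDecomposition = oldDecomposition;
        this.newDecomposition = newDecomposition;
        this.version = version;
    }

    /** Parses a semicolon-separated line, ignoring a trailing '#' comment. */
    static NormalizationCorrectionLine parse(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] parts = data.split("\\s*;\\s*");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        // Each field gets its own index; reusing one index for every
        // field would give all of them the same value.
        return new NormalizationCorrectionLine(
                Integer.parseInt(parts[0], 16),
                decode(parts[1]),
                decode(parts[2]),
                parts[3]);
    }

    /** Turns space-separated hex code points such as "96FB" into a string. */
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (String cp : hex.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(cp, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample line assumed for illustration.
        NormalizationCorrectionLine c = parse("F951;96FB;964B;3.2.0 # Corrigendum 3");
        System.out.println("96FB".length());                 // length of the raw hex text
        System.out.println(c.oldDecomposition.codePointCount(
                0, c.oldDecomposition.length()));            // code points in the decoded value
        System.out.println(c.version);
    }
}
```

Keeping the version as text and decoding only the two decomposition fields is a choice made for this sketch; the actual types chosen in the change may differ.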

Also fix UnicodeProperty getSet on string-valued or miscellaneous properties of strings (although UnicodeProperty still leaves a lot to be desired for properties of strings; in particular, it has no way to get the value for a string!).

@eggrobin eggrobin requested review from markusicu and jowilco April 9, 2025 09:49
@markusicu
Member

> This was not a problem for UCDXML as that one parses the file independently,

Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?

@eggrobin
Member Author

eggrobin commented Apr 9, 2025

> Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?

That is a question for @jowilco, but of course I would like to minimize the number of UCD parsers we spawn here…

@eggrobin eggrobin merged commit 3743816 into unicode-org:main Apr 9, 2025
20 checks passed
@jowilco
Contributor

jowilco commented Apr 11, 2025

> > Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?
>
> That is a question for @jowilco, but of course I would like to minimize the number of UCD parsers we spawn here…

@markusicu, @eggrobin
In theory, yes, I could read the values using IndexUnicodeProperties.getResolvedValue(UcdProperty.normalization_correction_*, codepoint); however:

  1. There are only 6 code points, and I don't "know" what these are without a source of truth (like parsing NormalizationCorrections.txt). Parsing NormalizationCorrections.txt would put me back where I am now. Obviously, I could run the full 0x0-0x10FFFF range, but...
  2. The <normalization-corrections> section of UCDXML is separated from the main <repertoire> section. There are comparable sections for <blocks>, <named-sequences>, etc. All of these sections are implemented by parsing the corresponding source file. Most of them are not single code point -> value(s). I'm therefore not sure it makes too much sense in implementing a different path for <normalization-corrections> just so I can use IndexUnicodeProperties.
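
The full-range scan mentioned in point 1 could look roughly like the sketch below. It is only an illustration: the lookup parameter is a stand-in for a per-code-point getter such as IndexUnicodeProperties.getResolvedValue, and the assumption that code points without a correction yield null is mine, not something confirmed in this thread. The class name CorrectedCodePointScan is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical sketch; not part of unicodetools. */
final class CorrectedCodePointScan {
    /**
     * Visits every code point from U+0000 to U+10FFFF and collects those for
     * which the lookup returns a non-null value (assumed to mean "has a value").
     */
    static List<Integer> codePointsWithValue(IntFunction<String> lookup) {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= 0x10FFFF; ++cp) {
            if (lookup.apply(cp) != null) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy lookup standing in for the real property getter.
        IntFunction<String> toyLookup = cp -> cp == 0xF951 ? "\u964B" : null;
        for (int cp : codePointsWithValue(toyLookup)) {
            System.out.println(Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

This makes over a million lookups to recover a handful of entries, which is the cost the comment alludes to.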

@eggrobin
Member Author

> I'm therefore not sure it makes too much sense in implementing a different path for <normalization-corrections> just so I can use IndexUnicodeProperties.

Fair point. Effectively, it would mean removing one parser, but adding something equivalent to MakeUnicodeFiles (the UCD file generator), and that way madness also lies.

3 participants