Skip to content

Correctly parse Numeric_Value since 1.0 #1123

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
May 5, 2025

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented May 2, 2025

Fix #1118.

When it exists, use DerivedNumericValues to tell us:

  1. that there is a numeric value,
  2. what the numeric value is, if it does not come from UnicodeData.

But use UnicodeData as the source of truth if available. This is preferable because of the lack of star in https://www.unicode.org/reports/tr44/#Derived_Extracted, but also because it is the only thing that works in the past.

Before DerivedNumericValues, use UnicodeData alone (at that time, Unihan did not contribute to Numeric_Value, so this is correct).

In addition, this allows us to read the data we would want to write back (such as 2/12 for U+109F7), which seems like the right way to go in the future. Right now we are reverse-engineering that from the name when generating UnicodeData, which is very silly.

final double value = Double.parseDouble(toolValue);
Map<Integer, String> names = new TreeMap<>();
names.put(1, "n/a");
names.put(2, "HALF");
names.put(3, "THIRD");
names.put(4, "FOURTH");
names.put(5, "FIFTH");
names.put(6, "SIXTH");
names.put(7, "SEVENTH");
names.put(8, "EIGHTH");
names.put(9, "NINTH");
names.put(10, "TENTH");
names.put(12, "TWELFTH");
names.put(16, "SIXTEENTH");
names.put(20, "TWENTIETH");
names.put(40, "FORTIETH");
names.put(32, "THIRTY-SECOND");
names.put(64, "SIXTY-FOURTH");
names.put(80, "EIGHTIETH");
names.put(160, "ONE-HUNDRED-AND-SIXTIETH");
names.put(320, "THREE-HUNDRED-AND-TWENTIETH");
List<Integer> denominators = new ArrayList<>(names.keySet());
// Prefer denominators that are in the name, and among those prefer
// those with the longest name (so that we use sixty-fourths, not
// fourths, when both work). Otherwise prefer smaller denominators.
denominators.sort(
(m, n) -> {
final boolean m_in_name = name.contains(names.get(m));
final boolean n_in_name = name.contains(names.get(n));
if (m_in_name != n_in_name) {
return Boolean.compare(n_in_name, m_in_name);
}
if (m_in_name) {
return Integer.compare(names.get(n).length(), names.get(m).length());
} else {
return m.compareTo(n);
}
});

@eggrobin eggrobin requested a review from markusicu May 2, 2025 23:22
@@ -227,6 +238,17 @@ public String getFileName(VersionInfo ucdVersionRequested) {
}
}

public int getFieldNumber(VersionInfo ucdVersionRequested) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we optimize this a little?

For example, if fieldNumbers.size() == 1 then just return the one and only value.

Maybe even better, once the parsing info is all built, populate an int singleFieldNumber with the one and only value if there is exactly one. If there are two or more, then set that to -1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if fieldNumbers.size() == 1 then just return the one and only value.

Sure. Done.

Maybe even better, once the parsing info is all built, populate an int singleFieldNumber with the one and only value if there is exactly one. If there are two or more, then set that to -1.

That is probably not the direction this wants to go in. This is clearly in dire need of refactoring, because this logic of version-dependent chunk of parsing info is now triplicated (we do this for field number, file, default value). This should just get transposed into a type that has (field number, file, default value, whatever else we want to make version-dependent), and then a map keyed on version. I don’t think we have much to gain from making the map sparse (the maps are sparse here because they match contents of the source files, which is rarely a good guide for data structures).

But that refactoring isn’t going to happen today.

@eggrobin eggrobin requested a review from markusicu May 2, 2025 23:55
markusicu
markusicu previously approved these changes May 3, 2025
@eggrobin eggrobin merged commit 5118852 into unicode-org:main May 5, 2025
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IndexUnicodeProperties fails to load Numeric_Value prior to Unicode 5.1
2 participants