Correctly parse Numeric_Value since 1.0 #1123

eggrobin · 2025-05-02T20:29:19Z

When it exists, use DerivedNumericValues to tell us:

that there is a numeric value,
what the numeric value is, if it does not come from UnicodeData.

But use UnicodeData as the source of truth if available. This is preferable because of the lack of star in https://www.unicode.org/reports/tr44/#Derived_Extracted, but also because it is the only thing that works in the past.

Before DerivedNumericValues, use UnicodeData alone (at that time, Unihan did not contribute to Numeric_Value, so this is correct).

In addition, this allows us to read the data we would want to write back (such as 2/12 for U+109F7), which seems like the right way to go in the future. Right now we are reverse-engineering that from the name when generating UnicodeData, which is very silly.

unicodetools/unicodetools/src/main/java/org/unicode/text/UCD/MakeUnicodeFiles.java

Lines 2220 to 2257 in c2c5ba9

    
           final double value = Double.parseDouble(toolValue); 
        
           Map<Integer, String> names = new TreeMap<>(); 
        
           names.put(1, "n/a"); 
        
           names.put(2, "HALF"); 
        
           names.put(3, "THIRD"); 
        
           names.put(4, "FOURTH"); 
        
           names.put(5, "FIFTH"); 
        
           names.put(6, "SIXTH"); 
        
           names.put(7, "SEVENTH"); 
        
           names.put(8, "EIGHTH"); 
        
           names.put(9, "NINTH"); 
        
           names.put(10, "TENTH"); 
        
           names.put(12, "TWELFTH"); 
        
           names.put(16, "SIXTEENTH"); 
        
           names.put(20, "TWENTIETH"); 
        
           names.put(40, "FORTIETH"); 
        
           names.put(32, "THIRTY-SECOND"); 
        
           names.put(64, "SIXTY-FOURTH"); 
        
           names.put(80, "EIGHTIETH"); 
        
           names.put(160, "ONE-HUNDRED-AND-SIXTIETH"); 
        
           names.put(320, "THREE-HUNDRED-AND-TWENTIETH"); 
        
           List<Integer> denominators = new ArrayList<>(names.keySet()); 
        
           // Prefer denominators that are in the name, and among those prefer 
        
           // those with the longest name (so that we use sixty-fourths, not 
        
           // fourths, when both work).  Otherwise prefer smaller denominators. 
        
           denominators.sort( 
        
                   (m, n) -> { 
        
                       final boolean m_in_name = name.contains(names.get(m)); 
        
                       final boolean n_in_name = name.contains(names.get(n)); 
        
                       if (m_in_name != n_in_name) { 
        
                           return Boolean.compare(n_in_name, m_in_name); 
        
                       } 
        
                       if (m_in_name) { 
        
                           return Integer.compare(names.get(n).length(), names.get(m).length()); 
        
                       } else { 
        
                           return m.compareTo(n); 
        
                       } 
        
                   });

…ing the history display code. Do not show version ranges where we do not have version data.

markusicu · 2025-05-02T23:40:50Z

unicodetools/src/main/java/org/unicode/props/PropertyParsingInfo.java

@@ -227,6 +238,17 @@ public String getFileName(VersionInfo ucdVersionRequested) {
        }
    }

+    public int getFieldNumber(VersionInfo ucdVersionRequested) {


Could we optimize this a little?

For example, if fieldNumbers.size() == 1 then just return the one and only value.

Maybe even better, once the parsing info is all built, populate an int singleFieldNumber with the one and only value if there is exactly one. If there are two or more, then set that to -1.

For example, if fieldNumbers.size() == 1 then just return the one and only value.

Sure. Done.

Maybe even better, once the parsing info is all built, populate an int singleFieldNumber with the one and only value if there is exactly one. If there are two or more, then set that to -1.

That is probably not the direction this wants to go in. This is clearly in dire need of refactoring, because this logic of version-dependent chunk of parsing info is now triplicated (we do this for field number, file, default value). This should just get transposed into a type that has (field number, file, default value, whatever else we want to make version-dependent), and then a map keyed on version. I don’t think we have much to gain from making the map sparse (the maps are sparse here because they match contents of the source files, which is rarely a good guide for data structures).

But that refactoring isn’t going to happen today.

… into numeric-hell

eggrobin added 9 commits May 2, 2025 14:15

Allow for properties to pop in and out of existence, thereby simplify…

089ec4e

…ing the history display code. Do not show version ranges where we do not have version data.

!= on String I hate you

22bcca6

Fix (recent) versioned nv queries

c10b800

Properly apply the default class

9cf21ff

Hack around numeric value and see where that leads us

63b2906

misc

d5c5a3a

This might even be going in a reasonable direction

4fa56af

Merge remote-tracking branch 'la-vache/main' into numeric-hell

b5999e9

Bad missing lines

6bec34c

eggrobin requested a review from markusicu May 2, 2025 23:22

markusicu reviewed May 2, 2025

View reviewed changes

optimization of sorts

4c02b60

eggrobin requested a review from markusicu May 2, 2025 23:55

markusicu previously approved these changes May 3, 2025

View reviewed changes

eggrobin added 2 commits May 5, 2025 11:58

fix comments

91ceb5e

Merge branch 'numeric-hell' of https://github.com/eggrobin/unicodetools…

55c2e38

… into numeric-hell

eggrobin dismissed markusicu’s stale review via 55c2e38 May 5, 2025 09:58

eggrobin requested a review from markusicu May 5, 2025 09:58

markusicu approved these changes May 5, 2025

View reviewed changes

eggrobin merged commit 5118852 into unicode-org:main May 5, 2025
14 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correctly parse Numeric_Value since 1.0 #1123

Correctly parse Numeric_Value since 1.0 #1123

eggrobin commented May 2, 2025 •

edited

Loading

markusicu May 2, 2025

eggrobin May 2, 2025

	final double value = Double.parseDouble(toolValue);
	Map<Integer, String> names = new TreeMap<>();
	names.put(1, "n/a");
	names.put(2, "HALF");
	names.put(3, "THIRD");
	names.put(4, "FOURTH");
	names.put(5, "FIFTH");
	names.put(6, "SIXTH");
	names.put(7, "SEVENTH");
	names.put(8, "EIGHTH");
	names.put(9, "NINTH");
	names.put(10, "TENTH");
	names.put(12, "TWELFTH");
	names.put(16, "SIXTEENTH");
	names.put(20, "TWENTIETH");
	names.put(40, "FORTIETH");
	names.put(32, "THIRTY-SECOND");
	names.put(64, "SIXTY-FOURTH");
	names.put(80, "EIGHTIETH");
	names.put(160, "ONE-HUNDRED-AND-SIXTIETH");
	names.put(320, "THREE-HUNDRED-AND-TWENTIETH");
	List<Integer> denominators = new ArrayList<>(names.keySet());
	// Prefer denominators that are in the name, and among those prefer
	// those with the longest name (so that we use sixty-fourths, not
	// fourths, when both work). Otherwise prefer smaller denominators.
	denominators.sort(
	(m, n) -> {
	final boolean m_in_name = name.contains(names.get(m));
	final boolean n_in_name = name.contains(names.get(n));
	if (m_in_name != n_in_name) {
	return Boolean.compare(n_in_name, m_in_name);
	}
	if (m_in_name) {
	return Integer.compare(names.get(n).length(), names.get(m).length());
	} else {
	return m.compareTo(n);
	}
	});

Correctly parse Numeric_Value since 1.0 #1123

Correctly parse Numeric_Value since 1.0 #1123

Conversation

eggrobin commented May 2, 2025 • edited Loading

markusicu May 2, 2025

Choose a reason for hiding this comment

eggrobin May 2, 2025

Choose a reason for hiding this comment

eggrobin commented May 2, 2025 •

edited

Loading