Supply the field indices and correct the types of NormalizationCorrections data #1087

eggrobin · 2025-04-09T09:49:26Z

Before this change, all three normalization_correction_* pseudoproperties have the value 96FB (that is, a four-character string) for U+F951.
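To illustrate the distinction between the hex text `96FB` and the character it denotes, here is a minimal sketch of reading one data line with an explicit index per field. The class `NormalizationCorrectionLine` and its field names are hypothetical and are not part of Unicode Tools; the field order (code point, old, new, version) mirrors the `parts[0]` to `parts[3]` usage in the UCDXML snippet quoted in this description, and the sample line is assumed to match the U+F951 entry of NormalizationCorrections.txt.

```java
/** Hypothetical holder for one data line of NormalizationCorrections.txt. */
final class NormalizationCorrectionLine {
    final int codePoint;    // field 0: the affected code point
    final String original;  // field 1: old value, as a string of characters
    final String corrected; // field 2: new value, as a string of characters
    final String version;   // field 3: version in which the correction was made

    NormalizationCorrectionLine(int codePoint, String original, String corrected, String version) {
        this.codePoint = codePoint;
        this.original = original;
        this.corrected = corrected;
        this.version = version;
    }

    /** Splits a data line on ';' after dropping any trailing '#' comment. */
    static NormalizationCorrectionLine parse(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] parts = data.split(";");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new NormalizationCorrectionLine(
                Integer.parseInt(parts[0].trim(), 16),
                fromHex(parts[1]),
                fromHex(parts[2]),
                parts[3].trim());
    }

    /** Converts hex text such as "96FB" into the one-code-point string it denotes. */
    static String fromHex(String hex) {
        return new String(Character.toChars(Integer.parseInt(hex.trim(), 16)));
    }

    public static void main(String[] args) {
        // Sample line; assumed to match the U+F951 entry of the data file.
        NormalizationCorrectionLine nc = parse("F951;96FB;964B;3.2.0 # Corrigendum 3");
        System.out.println("raw field length: " + "96FB".length());
        System.out.println("typed value length: " + nc.original.length());
    }
}
```

With typed fields, the old value is a string holding the single character U+96FB, not the four characters `9`, `6`, `F`, `B`.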

This was not a problem for UCDXML as that one parses the file independently,

unicodetools/unicodetools/src/main/java/org/unicode/xml/UCDDataResolver.java

Lines 189 to 197 in 7628438

    private static AttributesImpl getNCAttributes(String namespace, UcdLineParser.UcdLine line) {
        String[] parts = line.getParts();
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute(namespace, "cp", "cp", "CDATA", parts[0]);
        attributes.addAttribute(namespace, "old", "old", "CDATA", parts[1]);
        attributes.addAttribute(namespace, "new", "new", "CDATA", parts[2]);
        attributes.addAttribute(namespace, "version", "version", "CDATA", parts[3]);
        return attributes;
    }

.

Also fix UnicodeProperty getSet on string-valued or miscellaneous properties of strings (although UnicodeProperty still leaves a lot to be desired for properties of strings; in particular, it has no way to get the value for a string!).

…tions data

markusicu · 2025-04-09T16:37:40Z

This was not a problem for UCDXML as that one parses the file independently,

Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?

eggrobin · 2025-04-09T16:45:55Z

Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?

That is a question for @jowilco, but of course I would like to minimize the number of UCD parsers we spawn here…

jowilco · 2025-04-11T15:23:00Z

Will UCDXML be able to call the regular Unicode Tools functions after this, rather than parsing it independently?

That is a question for @jowilco, but of course I would like to minimize the number of UCD parsers we spawn here…

@markusicu, @eggrobin
In theory, yes, I could read the values using IndexUnicodeProperties.getResolvedValue(UcdProperty.normalization_correction_*, codepoint); however:

There are only 6 code points, and I don't "know" what these are without a source of truth (like parsing NormalizationCorrections.txt). Parsing NormalizationCorrections.txt would put me back where I am now. Obviously, I could run the full 0x0-0x10FFFF range, but...
The <normalization-corrections> section of UCDXML is separated from the main <repertoire> section. There are comparable sections for <blocks>, <named-sequences>, etc. All of these sections are implemented by parsing the corresponding source file. Most of them are not single code point -> value(s). I'm therefore not sure it makes too much sense in implementing a different path for <normalization-corrections> just so I can use IndexUnicodeProperties.
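The first point above can be sketched as follows: without a list of affected code points, a per-code-point lookup has to be probed across the whole code space. This is only an illustration. `FullRangeScan` and the `lookup` parameter are hypothetical stand-ins for a call such as `getResolvedValue`, and the convention that `null` means "no correction" is an assumption, not the documented behaviour of Unicode Tools.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

final class FullRangeScan {
    /** Probes every code point and collects those for which the lookup returns a value. */
    static List<Integer> codePointsWithValue(IntFunction<String> lookup) {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= 0x10FFFF; ++cp) {
            // Assumption: null means the property has no value for this code point.
            if (lookup.apply(cp) != null) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real property lookup, with a single entry.
        Map<Integer, String> corrections = Map.of(0xF951, "\u96FB");
        List<Integer> found = codePointsWithValue(cp -> corrections.get(cp));
        for (int cp : found) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

The loop makes over a million lookups to recover a handful of entries, whereas parsing the source file yields them directly.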

eggrobin · 2025-04-11T15:25:36Z

I'm therefore not sure it makes too much sense in implementing a different path for <normalization-corrections> just so I can use IndexUnicodeProperties.

Fair point. Effectively, it would mean removing one parser, but adding something equivalent to MakeUnicodeFiles (the UCD file generator), and that way madness also lies.

eggrobin added 2 commits April 9, 2025 11:34

Supply the field indices and correct the types of NormalizationCorrec…

ab969ee

…tions data

A test.

61a8547

eggrobin requested review from markusicu and jowilco April 9, 2025 09:49

eggrobin added 4 commits April 9, 2025 11:53

Better test and comment

63777f5

@missing

0b2d5f2

Minimal fix to emoji_variation_sequence getSet

ff240a8

spots

d4ff8b0

markusicu approved these changes Apr 9, 2025

eggrobin merged commit 3743816 into unicode-org:main Apr 9, 2025
20 checks passed
