Ucdxml 17v1 #1104

Draft: wants to merge 26 commits into base: main

Commits (26)
c46e7ef
Initial checkin to support Unikemet
jowilco Mar 31, 2025
230f01e
Added note about grouping Unihan attributes, and more work for Unikemet
jowilco Apr 2, 2025
e19c289
Merge branch 'main' into ucdxml_17v1
jowilco Apr 4, 2025
9592302
Fixed merge conflict
jowilco Apr 11, 2025
4d3d207
Merge branch 'main' into ucdxml_17v1
jowilco Apr 24, 2025
dbec10c
Removed Deprecated properties, removed sections that only contained h…
jowilco Apr 24, 2025
c96704b
Updated version information
jowilco Apr 24, 2025
83224c0
Renamed index to tr42 to make it easier to copy to unicode-reports
jowilco Apr 25, 2025
09e5ce9
Added removed/changed support
jowilco Apr 29, 2025
3a6fa48
Initial checkin to support Unikemet
jowilco Mar 31, 2025
f067ad8
Added note about grouping Unihan attributes, and more work for Unikemet
jowilco Apr 2, 2025
8f968de
Removed Deprecated properties, removed sections that only contained h…
jowilco Apr 24, 2025
65041e5
Updated version information
jowilco Apr 24, 2025
e725fe4
Renamed index to tr42 to make it easier to copy to unicode-reports
jowilco Apr 25, 2025
16ba5f0
Added removed/changed support
jowilco Apr 29, 2025
a122cc6
Merge conflict
jowilco Apr 29, 2025
48fc787
Removed commented lines
jowilco Apr 29, 2025
9ed79b4
Removed commented lines
jowilco Apr 29, 2025
0fb10c5
Set maxVersion for kGB7
jowilco Apr 29, 2025
2037df8
kGB7 is a removed property
jowilco Apr 29, 2025
f75893e
kGB7 is a removed property
jowilco Apr 29, 2025
8dfe8d2
GenerateEnums and mvn spotless
jowilco Apr 29, 2025
599b253
Updated with latest UCD changes
jowilco Apr 29, 2025
f1e6058
Latest TR38 changes
jowilco Apr 29, 2025
902ba5c
Merged main
jowilco May 2, 2025
7d24164
Latest Unihan, Tangut, and Nushu changes
jowilco May 2, 2025
8 changes: 7 additions & 1 deletion docs/ucdxml.md
@@ -64,6 +64,12 @@ We'll use [jing-trang](https://github.com/relaxng/jing-trang) in this example.
1. Clone and build [jing-trang](https://github.com/relaxng/jing-trang)
2. Run the following:
```
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\index.rnc <path to UAX xml file>
java -jar C:\_git\jing-trang\build\jing.jar -c UNICODETOOLS_REPO_DIR\uax\uax42\output\tr42.rnc <path to UAX xml file>
```
Note that the UAX XML file must be saved in NFD, because the Unihan syntax regular expressions expect NFD input.

To convert to NFD, use ICU's uconv.exe:
```
uconv.exe -f utf8 -t utf8 -x nfd -o {outputfile} {originalfile}
```
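
The NFD requirement above can also be checked from Java before running the validator. The following is a minimal sketch, not part of this PR: the class `NfdSketch` and its methods are hypothetical names, and it uses the JDK's `java.text.Normalizer` rather than ICU. The JDK's normalization data follows the Unicode version that the JDK itself supports, so it may lag behind the UCD version being validated.

```java
import java.text.Normalizer;

/** Hypothetical helper for illustration only; not part of unicodetools. */
final class NfdSketch {

    /** Returns true if the text is already in Normalization Form D. */
    static boolean isNfd(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFD);
    }

    /** Returns the canonical decomposition (NFD) of the text. */
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+00E9 is a single precomposed code point, so it is not NFD.
        String precomposed = "\u00E9";
        String decomposed = toNfd(precomposed);

        System.out.println(isNfd(precomposed));   // false
        System.out.println(decomposed.length());  // 2 (U+0065 followed by U+0301)
    }
}
```

For a whole XML file, the `uconv` command shown above is likely the simpler route; a check like this is mainly useful for spotting which attribute values are not yet decomposed.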

35,844 changes: 17,771 additions & 18,073 deletions unicodetools/data/ucdxml/dev/ucd.nounihan.grouped.xml

Large diffs are not rendered by default.

201,716 changes: 103,025 additions & 98,691 deletions unicodetools/data/ucdxml/dev/ucd.unihan.grouped.xml

Large diffs are not rendered by default.

7 changes: 3 additions & 4 deletions unicodetools/data/ucdxml/dev/ucdxml.readme.txt
@@ -1,19 +1,18 @@
XML Representation of Unicode 16.0.0 UCD
XML Representation of Unicode 17.0.0 UCD


© 2024 Unicode®, Inc.
© 2025 Unicode®, Inc.
For terms of use, see https://www.unicode.org/terms_of_use.html


This directory contains the representation in XML of Version 16.0.0 of
This directory contains the representation in XML of Version 17.0.0 of
the UCD, using the schema defined by UAX #42: Unicode Character
Database in XML, at https://www.unicode.org/reports/tr42/

While every effort has been made to ensure consistency of the
XML representation with the UCD files, there may be some errors;
the UCD files are authoritative.


There are six files, available in zip/jar format:
- flat vs. grouped
- no Unihan data vs. Unihan data only vs. complete UCD.
35 changes: 30 additions & 5 deletions unicodetools/src/main/java/org/unicode/props/UcdProperty.java
@@ -184,11 +184,36 @@ public enum UcdProperty {
kEACC(PropertyType.Miscellaneous, DerivedPropertyStatus.Provisional, "cjkEACC"),
kEH_Cat(PropertyType.Miscellaneous, DerivedPropertyStatus.Approved, "kEH_Cat"),
kEH_Desc(PropertyType.Miscellaneous, DerivedPropertyStatus.Approved, "kEH_Desc"),
kEH_FVal(PropertyType.Miscellaneous, DerivedPropertyStatus.Provisional, "kEH_FVal"),
kEH_Func(PropertyType.Miscellaneous, DerivedPropertyStatus.Provisional, "kEH_Func"),
kEH_HG(PropertyType.Miscellaneous, DerivedPropertyStatus.Approved, "kEH_HG"),
kEH_IFAO(PropertyType.Miscellaneous, DerivedPropertyStatus.Approved, "kEH_IFAO"),
kEH_JSesh(PropertyType.Miscellaneous, DerivedPropertyStatus.Approved, "kEH_JSesh"),
kEH_FVal(
PropertyType.Miscellaneous,
DerivedPropertyStatus.Provisional,
null,
ValueCardinality.Unordered,
"kEH_FVal"),
kEH_Func(
PropertyType.Miscellaneous,
DerivedPropertyStatus.Provisional,
null,
ValueCardinality.Unordered,
"kEH_Func"),
kEH_HG(
PropertyType.Miscellaneous,
DerivedPropertyStatus.Approved,
null,
ValueCardinality.Unordered,
"kEH_HG"),
kEH_IFAO(
PropertyType.Miscellaneous,
DerivedPropertyStatus.Approved,
null,
ValueCardinality.Unordered,
"kEH_IFAO"),
kEH_JSesh(
PropertyType.Miscellaneous,
DerivedPropertyStatus.Approved,
null,
ValueCardinality.Unordered,
"kEH_JSesh"),
kEH_UniK(PropertyType.Miscellaneous, DerivedPropertyStatus.Provisional, "kEH_UniK"),
kFanqie(
PropertyType.Miscellaneous,
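
The hunk above moves five Unikemet properties from the short constructor form to a longer form that passes an explicit `ValueCardinality.Unordered`. A simplified sketch of that pattern follows. It is not the real `UcdProperty` code: `Cardinality`, `PropertySketch`, and `isMultiValued` are hypothetical names, and the assumption that the short form defaults to a single-valued cardinality is not confirmed by this diff.

```java
/** Hypothetical stand-in for a value-cardinality enum; the constant names are assumptions. */
enum Cardinality {
    Singleton,
    Unordered,
    Ordered
}

/** Hypothetical, simplified stand-in for the enum-with-overloaded-constructors pattern. */
enum PropertySketch {
    // Short form: cardinality is not given, so the assumed default applies.
    kEH_Cat("kEH_Cat"),
    // Long form: the property is explicitly declared as an unordered set of values.
    kEH_FVal(Cardinality.Unordered, "kEH_FVal");

    private final Cardinality cardinality;
    private final String xmlName;

    PropertySketch(String xmlName) {
        this(Cardinality.Singleton, xmlName); // assumed default, for illustration
    }

    PropertySketch(Cardinality cardinality, String xmlName) {
        this.cardinality = cardinality;
        this.xmlName = xmlName;
    }

    /** True when a code point may carry more than one value for this property. */
    boolean isMultiValued() {
        return cardinality != Cardinality.Singleton;
    }

    String xmlName() {
        return xmlName;
    }
}
```

Under these assumptions, `PropertySketch.kEH_FVal.isMultiValued()` is true while `PropertySketch.kEH_Cat.isMultiValued()` is false, which is the distinction a serializer would need when deciding whether a value may contain a set separator.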
13 changes: 11 additions & 2 deletions unicodetools/src/main/java/org/unicode/xml/AttributeResolver.java
@@ -22,6 +22,7 @@
*/
public class AttributeResolver {

static final String SET_SEPARATOR = "|";
private final IndexUnicodeProperties indexUnicodeProperties;
private final UnicodeMap<UcdPropertyValues.Age_Values> map_age;
private final UnicodeMap<UcdPropertyValues.Block_Values> map_block;
@@ -147,6 +148,9 @@ public String getAttributeValue(UcdProperty prop, int codepoint) {
case kOtherNumeric:
case kPrimaryNumeric:
case kAccountingNumeric:
if (resolvedValue != null) {
resolvedValue = resolvedValue.replaceAll("\\" + SET_SEPARATOR, " ");
}
return (resolvedValue.equals("NaN")) ? null : resolvedValue;
default:
return Optional.ofNullable(resolvedValue).orElse("NaN");
@@ -211,7 +215,7 @@ public String getAttributeValue(UcdProperty prop, int codepoint) {
return resolvedValue;
default:
if (resolvedValue != null) {
return resolvedValue.replaceAll("\\|", " ");
return resolvedValue.replaceAll("\\" + SET_SEPARATOR, " ");
}
return "";
}
@@ -226,7 +230,8 @@ public String getAttributeValue(UcdProperty prop, int codepoint) {
return map_script.get(codepoint).getShortName();
case Script_Extensions:
StringBuilder extensionBuilder = new StringBuilder();
String[] extensions = map_script_extensions.get(codepoint).split("\\|", 0);
String[] extensions =
map_script_extensions.get(codepoint).split("\\" + SET_SEPARATOR, 0);
for (String extension : extensions) {
extensionBuilder.append(
UcdPropertyValues.Script_Values.valueOf(extension)
@@ -348,4 +353,8 @@ public boolean isUnifiedIdeograph(int codepoint) {
return getAttributeValue(UcdProperty.Unified_Ideograph, codepoint).equals("Y")
&& getAttributeValue(UcdProperty.Name, codepoint).equals("CJK UNIFIED IDEOGRAPH-#");
}

public boolean isUnikemetAttributeRange(int codepoint) {
return !getAttributeValue(UcdProperty.kEH_Cat, codepoint).isEmpty();
}
}
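
The `AttributeResolver` hunks above replace the literal `"\\|"` with `"\\" + SET_SEPARATOR`. The backslash is needed because `|` is the alternation operator in Java regular expressions, and both `String.replaceAll` and `String.split` interpret their first argument as a regex. The sketch below illustrates this under stated assumptions: `SetSeparatorSketch` and its methods are hypothetical, and `Pattern.quote` is shown as an alternative way to escape the separator, not as what the PR does.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of escaping a "|" separator; not part of unicodetools. */
final class SetSeparatorSketch {

    static final String SET_SEPARATOR = "|";

    /** Same escaping style as the diff: prefix the separator with a regex backslash. */
    static String toSpaceSeparated(String value) {
        if (value == null) {
            return "";
        }
        return value.replaceAll("\\" + SET_SEPARATOR, " ");
    }

    /** Alternative: Pattern.quote escapes the separator whatever characters it contains. */
    static String[] splitValues(String value) {
        return value.split(Pattern.quote(SET_SEPARATOR), 0);
    }

    public static void main(String[] args) {
        System.out.println(toSpaceSeparated("Latn|Grek"));   // Latn Grek
        System.out.println(splitValues("Latn|Grek").length); // 2

        // Without escaping, "|" is an empty alternation that matches the empty
        // string, so the literal bar is never replaced.
        System.out.println("Latn|Grek".replaceAll(SET_SEPARATOR, " ").contains("|")); // true
    }
}
```

Where no regex behavior is wanted at all, `String.replace(CharSequence, CharSequence)` treats both arguments literally and avoids the escaping question.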