[Feature #19908] Update to Unicode 15.1.0 #12798
1b0a49a to b3ec885
NEWS.md
Outdated
@@ -31,7 +36,6 @@ The following bundled gems are promoted from default gems.
* reline 0.6.0
* readline 0.0.4
* fiddle 1.1.6
Could you revert a needless change?
I have reverted it.
tool/enc-unicode.rb
Outdated
elsif /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/ =~ line
$2 ? cps.concat(($1.to_i(16)..$2.to_i(16)).to_a) : cps.push($1.to_i(16))
current = $3.gsub(/\W+/, '_')
Could you add a comment showing an example of a matched line and the resulting $1/$2/$3, to make this case easier to understand?
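As a rough illustration of what such a comment could show, the sketch below applies the regexp from the diff to two sample lines. The sample lines follow the DerivedCoreProperties.txt layout but are typed in here for illustration, and `parse_property_line` is a hypothetical helper, not part of tool/enc-unicode.rb.

```ruby
# The regexp quoted from tool/enc-unicode.rb in the diff above.
PROPERTY_LINE = /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/

# Sample lines in the DerivedCoreProperties.txt layout (assumed for illustration).
SAMPLE_LINES = [
  "094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA",
  "0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA",
]

# Hypothetical helper: returns [code_points, property_name], or nil when the
# line does not match (for example a comment line starting with "#").
def parse_property_line(line)
  m = PROPERTY_LINE.match(line)
  return nil unless m
  first = m[1].to_i(16)                    # $1: first (or only) code point
  last  = m[2] ? m[2].to_i(16) : first     # $2: end of the range, nil for a single code point
  [(first..last).to_a, m[3].gsub(/\W+/, '_')] # $3: "InCB; Linker" becomes "InCB_Linker"
end

SAMPLE_LINES.each do |line|
  cps, name = parse_property_line(line)
  puts "#{name}: #{cps.size} code point(s) starting at U+#{format('%04X', cps.first)}"
end
# InCB_Linker: 1 code point(s) starting at U+094D
# InCB_Consonant: 37 code point(s) starting at U+0915
```

For the first sample line, $1 is "094D", $2 is nil, and $3 is "InCB; Linker"; for the second, $1 is "0915", $2 is "0939", and $3 is "InCB; Consonant".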
b3ec885 to 162d060
regparse.c
Outdated
/* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
{
  Node **CC_list = core_alts + 6; /* size: 2 */
  R_ERR(create_property_node(CC_list+0, env, "InCB_Consonant"));

  {
    Node **CC_list1 = CC_list + 1; /* size: 4 */
    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt1 = CC_list1 + 1;
        R_ERR(create_property_node(CC_alt1+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt1+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt1, CC_alt1+1));
      }

      R_ERR(quantify_node(CC_list1+1, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Linker}
    R_ERR(create_property_node(CC_list1+2, env, "InCB_Linker"));

    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt2 = CC_list1 + 3;
        R_ERR(create_property_node(CC_alt2+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt2+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt2, CC_alt2+1));
      }

      R_ERR(quantify_node(CC_list1+3, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Consonant}
    R_ERR(create_property_node(CC_list1+4, env, "InCB_Consonant"));

    R_ERR(create_node_from_array(LIST, CC_list1, CC_list1+1));
    R_ERR(quantify_node(CC_list1, 1, REPEAT_INFINITE));
  }
  R_ERR(create_node_from_array(LIST, core_alts+5, CC_list));
}

/* [^Control CR LF] */
core_alts[5] = node_new_cclass();
if (IS_NULL(core_alts[5])) goto err;
cc = NCCLASS(core_alts[5]);
core_alts[6] = node_new_cclass();
if (IS_NULL(core_alts[6])) goto err;
cc = NCCLASS(core_alts[6]);
This part is covered by an automatically generated test in this file based on GraphemeBreakTest.txt.
https://github.com/ruby/ruby/blob/dfc25204235079e23eadf9e0ba860c1ebcb14325/test/ruby/enc/test_grapheme_breaks.rb
https://www.unicode.org/Public/15.1.0/ucd/auxiliary/GraphemeBreakTest.txt
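For readers unfamiliar with the test data, here is a minimal sketch of how one line in the GraphemeBreakTest.txt notation can be turned into expected clusters. In that notation "÷" marks a break and "×" marks a position where no break is allowed. `expected_clusters` is a hypothetical helper and the sample lines are typed in for illustration; this is not the code in test_grapheme_breaks.rb.

```ruby
# Hypothetical helper: convert one GraphemeBreakTest.txt-style line into the
# list of grapheme clusters it expects.
def expected_clusters(line)
  spec = line.sub(/#.*/, "").strip       # drop the trailing "# ..." comment
  clusters = []
  spec.scan(/([÷×])\s*(\h+)/) do |mark, hex|
    clusters << +"" if mark == "÷"       # a break starts a new cluster
    clusters.last << hex.to_i(16)        # String#<< with an Integer appends that code point
  end
  clusters
end

# Illustrative lines in the file's notation (not copied from the file).
joined = expected_clusters("÷ 0915 × 094D × 0924 ÷\t# KA, VIRAMA, TA kept together")
split  = expected_clusters("÷ 0915 ÷ 0924 ÷\t# KA and TA as separate clusters")

p joined.map(&:codepoints) # => [[2325, 2381, 2340]]
p split.map(&:codepoints)  # => [[2325], [2340]]

# The expectation can then be compared with what the running Ruby produces.
# The result depends on the Unicode version that Ruby was built with.
puts joined.join.grapheme_clusters == joined
```

Keeping the parser separate from the comparison means the same expectations can be checked against any segmentation implementation.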
162d060 to 4817ab0
bcd7641 to d07beeb
40d6ec9 to c648189
49775b7 to a66a4d3
…failure with Ruby 3.5.0dev

This commit addresses the following failure with Ruby 3.5.0dev since ruby/ruby@6670926

```ruby
% ruby -v
ruby 3.5.0dev (2025-03-21T06:17:15Z master d868922ea8) +PRISM [arm64-darwin24]
% bin/test test/multibyte_chars_test.rb -n test_should_compute_grapheme_length
Running 90 tests in parallel using 10 processes
Run options: -n test_should_compute_grapheme_length --seed 52859

F

Failure:
MultibyteCharsExtrasTest#test_should_compute_grapheme_length [test/multibyte_chars_test.rb:512]:
"त्र".
Expected: 2
  Actual: 1

bin/test test/multibyte_chars_test.rb:595

Finished in 0.209643s, 4.7700 runs/s, 38.1601 assertions/s.
1 runs, 8 assertions, 1 failures, 0 errors, 0 skips
%
```

According to ruby/ruby#12798, this is an expected change since Unicode 15.1.0.

> As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA)
> is now treated as a single extended grapheme cluster rather than split into two.

Fix rails#54794
https://bugs.ruby-lang.org/issues/19908
Summary
Unicode 15.1.0 introduced a new grapheme cluster rule to handle Indic conjuncts. In UAX #29, rule GB9c was added to prevent grapheme breaks within certain Indic consonant+virama sequences. As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA) is now treated as a single extended grapheme cluster rather than split into two. This aligns the default segmentation with Indic writing system expectations.
UAX #29: Unicode Text Segmentation
New Enumerated Property
To support this, the Unicode Character Database (UCD) added a new enumerated property, Indic_Conjunct_Break (InCB), with values Consonant, Linker, Extend, and None, derived from existing properties. For example, virama characters in certain Brahmic scripts are classified as InCB=Linker, and base consonants as InCB=Consonant. These property values are listed in the UCD data (e.g. in DerivedCoreProperties.txt).
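To make the role of these values concrete, here is a minimal sketch that rebuilds the `conjunctCluster` pattern from the regparse.c comment as an ordinary Ruby regexp. The `INCB` table is a tiny hand-copied subset assumed for illustration (the authoritative values are in DerivedCoreProperties.txt), and `char_class` and `CONJUNCT_CLUSTER` are hypothetical names, not part of Ruby or Onigmo.

```ruby
# Assumed subset of Indic_Conjunct_Break values, for illustration only.
INCB = {
  consonant: [0x0915, 0x0924, 0x092F, 0x0930], # KA, TA, YA, RA
  linker:    [0x094D],                         # DEVANAGARI SIGN VIRAMA
  extend:    [0x200D],                         # ZERO WIDTH JOINER
}

# Build a character class source such as [\u{915}\u{924}] from the table.
def char_class(*values)
  cps = values.flat_map { |v| INCB.fetch(v) }
  "[" + cps.map { |cp| format('\u{%X}', cp) }.join + "]"
end

c  = char_class(:consonant)
l  = char_class(:linker)
el = char_class(:extend, :linker)

# Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER = Regexp.new("#{c}(?:#{el}*#{l}#{el}*#{c})+")

tra = "\u0924\u094D\u0930"        # TA + VIRAMA + RA
kya = "\u0915\u094D\u092F\u093E"  # KA + VIRAMA + YA + VOWEL SIGN AA
ky  = "\u0915\u092F"              # KA + YA, no linker

p CONJUNCT_CLUSTER.match(tra)[0] == tra                        # => true
p CONJUNCT_CLUSTER.match(kya)[0].codepoints == kya.codepoints.first(3) # => true
p CONJUNCT_CLUSTER.match?(ky)                                  # => false
```

In the second case the pattern covers only KA + VIRAMA + YA; the trailing vowel sign is not part of the conjunct pattern itself and would be attached to the cluster by other grapheme rules.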
Impact on Ruby
This change affected Ruby’s implementation of Unicode support in several ways: