[Feature #19908] Update to Unicode 15.1.0 #12798

Merged: 3 commits, Mar 18, 2025

@ima1zumi (Member) commented Feb 24, 2025

https://bugs.ruby-lang.org/issues/19908

Summary

Unicode 15.1.0 introduced a new grapheme cluster rule to handle Indic conjuncts. In UAX #29, rule GB9c was added to prevent grapheme breaks within certain Indic consonant+virama sequences. As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA + vowel sign AA) is now treated as a single extended grapheme cluster rather than split into two. This aligns the default segmentation with Indic writing system expectations.

UAX #29: Unicode Text Segmentation

New Enumerated Property

To support this, the Unicode Character Database (UCD) added a new enumerated property, Indic_Conjunct_Break (InCB), with values Consonant, Linker, Extend, and None, derived from existing properties. For example, virama characters in certain Brahmic scripts are classified as InCB=Linker, and base consonants as InCB=Consonant. These property values are listed in the UCD data (e.g. in DerivedCoreProperties.txt).

Impact on Ruby

This change affected Ruby’s implementation of Unicode support in several ways:

  • enc-unicode.rb
    • Ruby’s enc-unicode.rb script, which processes Unicode data, had to be updated to handle the new enumerated property. Previously, it expected only binary properties.
  • Regex Engine
    • Support for InCB property values was added to Ruby’s regex engine (exposing them as InCB_Linker, InCB_Consonant, etc.).
  • Grapheme Cluster Logic
    • Ruby’s grapheme cluster regex (\X) logic was updated to incorporate GB9c, ensuring that Indic conjuncts are not split in string operations.
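
As a rough illustration of the shape GB9c protects, the sketch below classifies code points with a tiny hand-picked InCB table and checks for a consonant, then one or more runs of Extend/Linker characters containing at least one Linker, each followed by another consonant. The table `INCB_SAMPLE` and the helpers `incb_symbol` and `conjunct_length` are hypothetical names invented for this example; the real values come from DerivedCoreProperties.txt, and the real logic lives in the regex engine, not in Ruby code like this.

```ruby
# Illustrative subset only (assumption): a few Devanagari code points,
# not the full Indic_Conjunct_Break data from the UCD.
INCB_SAMPLE = {
  "C" => (0x0915..0x0939).to_a, # InCB=Consonant: Devanagari KA..HA
  "L" => [0x094D],              # InCB=Linker: Devanagari sign virama
  "E" => [0x093C, 0x200D],      # InCB=Extend: Devanagari nukta, ZERO WIDTH JOINER
}.freeze

# Map a code point to a one-letter class symbol; "N" stands for InCB=None.
def incb_symbol(codepoint)
  found = INCB_SAMPLE.find { |_symbol, cps| cps.include?(codepoint) }
  found ? found.first : "N"
end

# Number of leading code points that form a conjunct of the shape
#   Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
# or 0 when the string does not start with such a sequence.
def conjunct_length(str)
  symbols = str.codepoints.map { |cp| incb_symbol(cp) }.join
  match = symbols[/\AC(?:[EL]*L[EL]*C)+/]
  match ? match.length : 0
end

conjunct_length("\u0915\u094D\u092F\u093E") # KA + VIRAMA + YA + AA => 3
conjunct_length("\u0915\u092F")             # KA + YA, no linker    => 0
```

Encoding each class as a single letter lets an ordinary regular expression stand in for the property-based pattern, which keeps the sketch independent of the Unicode version built into the running Ruby.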

@ima1zumi ima1zumi changed the title Update to Unicode 15.1.0 [Feature #19908] Update to Unicode 15.1.0 Feb 24, 2025
NEWS.md Outdated
@@ -31,7 +36,6 @@ The following bundled gems are promoted from default gems.
* reline 0.6.0
* readline 0.0.4
* fiddle 1.1.6

Member

Could you revert a needless change?

Member Author

I have reverted.

Comment on lines 169 to 181
elsif /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/ =~ line
$2 ? cps.concat(($1.to_i(16)..$2.to_i(16)).to_a) : cps.push($1.to_i(16))
current = $3.gsub(/\W+/, '_')

Member

Could you add a comment that shows an example matched line and $1/$2/$3 for the case for easy to understand?
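
For readers following the thread, here is a minimal sketch of what the regex in the excerpt captures. The sample lines follow the general layout of DerivedCoreProperties.txt but are written out by hand for illustration, and `PROPERTY_LINE` and `parse_property_line` are hypothetical names, not part of enc-unicode.rb.

```ruby
# Same pattern as in the reviewed excerpt, bound to a constant for reuse.
PROPERTY_LINE = /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/

# Returns the code points and the normalized property name for one data
# line, or nil when the line does not look like a property entry.
def parse_property_line(line)
  match = PROPERTY_LINE.match(line)
  return nil unless match

  first = match[1].to_i(16)                     # $1: first (or only) code point
  last  = match[2] ? match[2].to_i(16) : first  # $2: end of range, if present
  { codepoints: (first..last).to_a,
    property: match[3].gsub(/\W+/, "_") }       # $3: "InCB; Linker" -> "InCB_Linker"
end

parse_property_line("094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA")
# $1 = "094D", $2 = nil, $3 = "InCB; Linker"
# => { codepoints: [0x094D], property: "InCB_Linker" }

parse_property_line("0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA")
# $1 = "0915", $2 = "0939", $3 = "InCB; Consonant"
# => 37 code points, property "InCB_Consonant"
```

The third group stops before the `#` comment because `#` is outside `[\w\s;]`, and the trailing `\w` trims the whitespace that precedes it.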

regparse.c Outdated
Comment on lines 6092 to 6134
/* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
{
  Node **CC_list = core_alts + 6; /* size: 2 */
  R_ERR(create_property_node(CC_list+0, env, "InCB_Consonant"));

  {
    Node **CC_list1 = CC_list + 1; /* size: 4 */
    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt1 = CC_list1 + 1;
        R_ERR(create_property_node(CC_alt1+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt1+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt1, CC_alt1+1));
      }

      R_ERR(quantify_node(CC_list1+1, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Linker}
    R_ERR(create_property_node(CC_list1+2, env, "InCB_Linker"));

    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt2 = CC_list1 + 3;
        R_ERR(create_property_node(CC_alt2+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt2+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt2, CC_alt2+1));
      }

      R_ERR(quantify_node(CC_list1+3, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Consonant}
    R_ERR(create_property_node(CC_list1+4, env, "InCB_Consonant"));

    R_ERR(create_node_from_array(LIST, CC_list1, CC_list1+1));
    R_ERR(quantify_node(CC_list1, 1, REPEAT_INFINITE));
  }
  R_ERR(create_node_from_array(LIST, core_alts+5, CC_list));
}

/* [^Control CR LF] */
core_alts[5] = node_new_cclass();
if (IS_NULL(core_alts[5])) goto err;
cc = NCCLASS(core_alts[5]);
core_alts[6] = node_new_cclass();
if (IS_NULL(core_alts[6])) goto err;
cc = NCCLASS(core_alts[6]);
@ima1zumi ima1zumi marked this pull request as ready for review February 24, 2025 15:05
@ima1zumi ima1zumi marked this pull request as draft February 25, 2025 00:21
@ima1zumi ima1zumi force-pushed the unicode-15.1.0 branch 2 times, most recently from bcd7641 to d07beeb Compare March 4, 2025 11:16
@ima1zumi ima1zumi marked this pull request as ready for review March 4, 2025 11:16

@ima1zumi ima1zumi force-pushed the unicode-15.1.0 branch 2 times, most recently from 40d6ec9 to c648189 Compare March 9, 2025 11:48
@nurse nurse self-assigned this Mar 13, 2025
@ima1zumi ima1zumi merged commit 6670926 into ruby:master Mar 18, 2025
80 checks passed
yahonda added a commit to yahonda/rails that referenced this pull request Mar 23, 2025
…failure with Ruby 3.5.0dev

This commit addresses the following failure with Ruby 3.5.0dev
since ruby/ruby@6670926

```ruby
% ruby -v
ruby 3.5.0dev (2025-03-21T06:17:15Z master d868922ea8) +PRISM [arm64-darwin24]
% bin/test test/multibyte_chars_test.rb -n test_should_compute_grapheme_length
Running 90 tests in parallel using 10 processes
Run options: -n test_should_compute_grapheme_length --seed 52859

F

Failure:
MultibyteCharsExtrasTest#test_should_compute_grapheme_length [test/multibyte_chars_test.rb:512]:
"त्र".
Expected: 2
  Actual: 1

bin/test test/multibyte_chars_test.rb:595

Finished in 0.209643s, 4.7700 runs/s, 38.1601 assertions/s.
1 runs, 8 assertions, 1 failures, 0 errors, 0 skips
%
```

According to ruby/ruby#12798, this is an expected change since Unicode 15.1.0.

> As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA)
> is now treated as a single extended grapheme cluster rather than split into two.

Fix rails#54794
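
To see which behavior a given Ruby build has, a small check along these lines may help. It is only a sketch: `grapheme_report` is a hypothetical helper, and the result depends on the Unicode version the interpreter was built with. Going by the failure above, a Ruby without this change would be expected to report 2 clusters for "त्र" and a Ruby with it 1.

```ruby
require "rbconfig"

# Hypothetical helper: reports how the running Ruby segments a string,
# together with the Unicode version recorded in RbConfig.
def grapheme_report(str)
  clusters = str.grapheme_clusters
  {
    unicode_version: RbConfig::CONFIG["UNICODE_VERSION"],
    count: clusters.size,
    clusters: clusters,
  }
end

report = grapheme_report("\u0924\u094D\u0930") # "त्र": TA + VIRAMA + RA
puts "Unicode #{report[:unicode_version]}: #{report[:count]} grapheme cluster(s)"
```

A test suite that must pass on both old and new Rubies could branch on the reported count or version instead of hard-coding a single expected value.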
@ima1zumi ima1zumi deleted the unicode-15.1.0 branch April 5, 2025 16:20
4 participants