Fix segmentation tests #1075

Merged
merged 6 commits into from
Mar 24, 2025
12 changes: 6 additions & 6 deletions unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.html
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Grapheme_Cluster_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2025-02-14, 00:14:44 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:45:55 GMT</p>
<p>This page illustrates the application of the Grapheme_Cluster_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The × symbol indicates no break, while the ÷ symbol indicates a break. The cells with × are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p>After the heavy blue line in the table are additional rows, either with different sample characters or for sequences. </p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='GraphemeBreakTest.txt'>GraphemeBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -24,7 +24,7 @@ <h2>Grapheme_Cluster_Break Chart</h2>
<li>
ExtPict
=
-\p{Extended_Pictographic}
+\p{Extended_Pictographic=True}
</li>
<li>
LinkingConsonant
@@ -232,15 +232,15 @@ <h3><a href='#samples' name='samples'>Sample Strings</a></h3>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s23' name='s23'>23</a></th><td><font size='5'>
-<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='11.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s24' name='s24'>24</a></th><td><font size='5'>
<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+0061 LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict)'>a</span><span title='9.0'><span>&nbsp;</span>&nbsp;</span>
<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s25' name='s25'>25</a></th><td><font size='5'>
6 changes: 3 additions & 3 deletions unicodetools/data/ucd/dev/auxiliary/GraphemeBreakTest.txt
@@ -1,5 +1,5 @@
# GraphemeBreakTest-17.0.0.txt
-# Date: 2025-02-14, 00:14:44 GMT
+# Date: 2025-03-24, 14:45:55 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -768,8 +768,8 @@
÷ 1F476 × 1F3FF × 0308 × 200D × 1F476 × 1F3FF ÷ # ÷ [0.2] BABY (ExtPict) × [9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend_ConjunctExtendermConjunctLinker) × [9.0] COMBINING DIAERESIS (Extend_ConjunctExtendermConjunctLinker) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] BABY (ExtPict) × [9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend_ConjunctExtendermConjunctLinker) ÷ [0.3]
÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
÷ 0061 × 200D ÷ 1F6D1 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
-÷ 2701 × 200D × 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (ExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) × [11.0] UPPER BLADE SCISSORS (ExtPict) ÷ [0.3]
-÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (ExtPict) ÷ [0.3]
+÷ 2701 × 200D ÷ 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) ÷ [0.3]
+÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (XXmLinkingConsonantmExtPict) × [9.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmLinkingConsonantmExtPict) ÷ [0.3]
÷ 0915 ÷ 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) ÷ [999.0] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.3] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
÷ 0915 × 094D × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinker) × [9.3] DEVANAGARI LETTER TA (LinkingConsonant) ÷ [0.3]
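
For reviewers who want to check a changed expectation by hand: each data line of the test file lists hexadecimal code points separated by ÷ (break) or × (no break), with everything after `#` being a comment. The following is a minimal sketch of reading one such line, not the tooling used in this repository; the function names `parse_break_test_line` and `clusters` are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

BREAK = "\u00f7"     # ÷ : a break is expected at this position
NO_BREAK = "\u00d7"  # × : no break is expected at this position


def parse_break_test_line(line: str) -> Tuple[str, List[int]]:
    """Return (text, break_offsets) for one data line of a *BreakTest.txt file.

    Offsets are counted in code points from the start of the text.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        raise ValueError("no test data on this line")
    chars: List[str] = []
    breaks: List[int] = []
    for token in data.split():
        if token == BREAK:
            breaks.append(len(chars))
        elif token == NO_BREAK:
            continue
        else:
            chars.append(chr(int(token, 16)))  # hexadecimal code point
    return "".join(chars), breaks


def clusters(text: str, breaks: List[int]) -> List[str]:
    """Split text at consecutive break offsets."""
    return [text[a:b] for a, b in zip(breaks, breaks[1:])]


if __name__ == "__main__":
    old = "÷ 2701 × 200D × 2701 ÷ # expectation before this change"
    new = "÷ 2701 × 200D ÷ 2701 ÷ # expectation after this change"
    for line in (old, new):
        text, offsets = parse_break_test_line(line)
        print(offsets, [hex(ord(c)) for c in text])
```

Comparing the two offset lists shows the effect of the changed line: the old expectation keeps the sequence in one segment, while the new one expects a break after the ZWJ.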
4 changes: 2 additions & 2 deletions unicodetools/data/ucd/dev/auxiliary/LineBreakTest.html
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Line_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2025-02-14, 17:30:27 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:45:57 GMT</p>
<p>This page illustrates the application of the Line_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The symbol × indicates a prohibited break, even with intervening spaces; the ÷ symbol indicates a (direct) break; the symbol ∻ indicates a break only in the presence of an intervening space (an indirect break).The cells with × or ∻ are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p></p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='LineBreakTest.txt'>LineBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -49,7 +49,7 @@ <h2>Line_Break Chart</h2>
<li>
ExtPictUnassigned
=
-[\p{Extended_Pictographic}&\p{gc=Cn}]
+[\p{Extended_Pictographic=True}&\p{gc=Cn}]
</li>
<li>
NS
14 changes: 7 additions & 7 deletions unicodetools/data/ucd/dev/auxiliary/WordBreakTest.html
@@ -7,7 +7,7 @@
<body bgcolor='#FFFFFF'>
<h2>Word_Break Chart</h2>
<p><b>Unicode Version:</b> 17.0.0</p>
-<p><b>Date:</b> 2024-11-27, 17:44:59 GMT</p>
+<p><b>Date:</b> 2025-03-24, 14:46:35 GMT</p>
<p>This page illustrates the application of the Word_Break specification. The material here is informative, not normative.</p> <p>The first chart shows where breaks would appear between different sample characters or strings. The sample characters are chosen mechanically to represent the different properties used by the specification.</p><p>Each cell shows the break-status for the position between the character(s) in its row header and the character(s) in its column header. The × symbol indicates no break, while the ÷ symbol indicates a break. The cells with × are also shaded to make it easier to scan the table. For example, in the cell at the intersection of the row headed by “CR” and the column headed by “LF”, there is a × symbol, indicating that there is no break between CR and LF.</p>
<p>After the heavy blue line in the table are additional rows, either with different sample characters or for sequences, such as “ALetter MidLetter”. </p><p>In the row and column headers of the <a href='#table'>Table</a>, in the <a href='#rules'>Rules</a>, when hovering over characters in the <a href='#samples'>Samples</a>, and in the comments in the associated list of test cases <a href='WordBreakTest.txt'>WordBreakTest.txt</a>:</p>
<ol><li>The following sets are used:<ul>
@@ -19,7 +19,7 @@ <h2>Word_Break Chart</h2>
<li>
ExtPict
=
-\p{Extended_Pictographic}
+\p{Extended_Pictographic=True}
</li>
<li>
MidNumLetQ
@@ -292,15 +292,15 @@ <h3><a href='#samples' name='samples'>Sample Strings</a></h3>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s27' name='s27'>27</a></th><td><font size='5'>
-<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='3.3'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s28' name='s28'>28</a></th><td><font size='5'>
<span title='0.2'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span><span title='U+0061 LATIN SMALL LETTER A (ALettermExtPict)'>a</span><span title='4.0'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='3.3'><span>&nbsp;</span>&nbsp;</span>
-<span title='U+2701 UPPER BLADE SCISSORS (ExtPictmALetter)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+200D ZERO WIDTH JOINER (ZWJ)'>&#x25A1;</span><span title='999.0'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>
+<span title='U+2701 UPPER BLADE SCISSORS (XXmExtPict)'>&#x2701;</span><span title='0.3'><span style='border-right: 1px solid blue'>&nbsp;</span>&nbsp;</span>

</font></td></tr>
<tr><th style='text-align:right'><a href='#s29' name='s29'>29</a></th><td><font size='5'>
6 changes: 3 additions & 3 deletions unicodetools/data/ucd/dev/auxiliary/WordBreakTest.txt
@@ -1,5 +1,5 @@
# WordBreakTest-17.0.0.txt
-# Date: 2025-01-27, 18:09:43 GMT
+# Date: 2025-03-24, 14:46:35 GMT
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
@@ -1850,8 +1850,8 @@
÷ 1F476 × 1F3FF ÷ 1F476 ÷ # ÷ [0.2] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [999.0] BABY (ExtPictmALetter) ÷ [0.3]
÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) ÷ [0.3]
÷ 0061 × 200D × 1F6D1 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) ÷ [0.3]
-÷ 2701 × 200D × 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (ExtPictmALetter) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] UPPER BLADE SCISSORS (ExtPictmALetter) ÷ [0.3]
-÷ 0061 × 200D × 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] UPPER BLADE SCISSORS (ExtPictmALetter) ÷ [0.3]
+÷ 2701 × 200D ÷ 2701 ÷ # ÷ [0.2] UPPER BLADE SCISSORS (XXmExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmExtPict) ÷ [0.3]
+÷ 0061 × 200D ÷ 2701 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALettermExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ) ÷ [999.0] UPPER BLADE SCISSORS (XXmExtPict) ÷ [0.3]
÷ 1F476 × 1F3FF × 0308 × 200D × 1F476 × 1F3FF ÷ # ÷ [0.2] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) × [4.0] COMBINING DIAERESIS (Extend) × [4.0] ZERO WIDTH JOINER (ZWJ) × [3.3] BABY (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
÷ 1F6D1 × 1F3FF ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
÷ 200D × 1F6D1 × 1F3FF ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ) × [3.3] OCTAGONAL SIGN (ExtPictmALetter) × [4.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend) ÷ [0.3]
Original file line number Diff line number Diff line change
@@ -20,7 +20,7 @@ $ConjunctLinker=\p{Indic_Conjunct_Break=Linker}
$LinkingConsonant=\p{Indic_Conjunct_Break=Consonant}
## $E_Base=\p{Grapheme_Cluster_Break=E_Base}
## $E_Modifier=\p{Grapheme_Cluster_Break=E_Modifier}
-$ExtPict=\p{Extended_Pictographic}
+$ExtPict=\p{Extended_Pictographic=True}
$ConjunctExtender=[\p{Indic_Conjunct_Break=Linker}\p{Indic_Conjunct_Break=Extend}]
## $EBG=\p{Grapheme_Cluster_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Grapheme_Cluster_Break=Glue_After_Zwj}
@@ -124,7 +124,7 @@ $DottedCircle = [◌]
$CPmEastAsian=[$CP-$EastAsian]
$OPmEastAsian=[$OP-$EastAsian]

-$ExtPictUnassigned=[\p{Extended_Pictographic}&\p{gc=Cn}]
+$ExtPictUnassigned=[\p{Extended_Pictographic=True}&\p{gc=Cn}]

# Some rules refer to the start and end of text. We could just use a literal ^ for sot, but naming
# it as in the spec makes it easier to compare. The parser will eat (and choke on) $, so we play a
@@ -364,7 +364,7 @@ $Single_Quote=\p{Word_Break=Single_Quote}
## $E_Modifier=\p{Word_Break=E_Modifier}
$ZWJ=\p{Word_Break=ZWJ}
# Note: The following may overlap with the above
-$ExtPict=\p{Extended_Pictographic}
+$ExtPict=\p{Extended_Pictographic=True}
## $EBG=\p{Word_Break=E_Base_GAZ}
## $Glue_After_Zwj=\p{Word_Break=Glue_After_Zwj}
$WSegSpace=\p{Word_Break=WSegSpace}
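
The variable definitions in these hunks follow a `$Name=expression` shape, with `#` starting a comment and `##` marking disabled definitions. As a rough aid for comparing two versions of such a file, the sketch below collects active definitions into a dictionary. It assumes one definition per line and handles only the shapes visible in this diff; the real rule syntax may include constructs this does not cover, and `parse_definitions` is a hypothetical helper, not part of the repository.

```python
import re
from typing import Dict

# "$Name=expression" or "$Name = expression"; the name ends at the first "=".
_DEFINITION = re.compile(r"^\$(\w+)\s*=\s*(.+?)\s*$")


def parse_definitions(text: str) -> Dict[str, str]:
    """Map variable names to their defining expressions.

    Lines starting with '#' (including '##'-disabled definitions) are skipped.
    """
    definitions: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _DEFINITION.match(line)
        if match:
            definitions[match.group(1)] = match.group(2)
    return definitions


if __name__ == "__main__":
    before = parse_definitions("$ExtPict=\\p{Extended_Pictographic}")
    after = parse_definitions("$ExtPict=\\p{Extended_Pictographic=True}")
    changed = {k for k in after if before.get(k) != after[k]}
    print(sorted(changed))
```

Diffing the two dictionaries lists the variables whose expressions changed, which here would be the definitions that now spell out `=True`.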