Skip to content

Conversation

hsivonen
Copy link
Member

@hsivonen hsivonen commented Aug 7, 2025

See #6811.

This change does not fix the issue for sort keys.

Includes the test changeset from #6812.

@hsivonen hsivonen requested a review from echeran as a code owner August 7, 2025 08:47
Copy link

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

return Decomposition::new(left_tail.$left_to_iter(), self.decompositions, self.tables).cmp(
Decomposition::new(right_tail.$right_to_iter(), self.decompositions, self.tables),
return Decomposition::new(left_tail.$left_to_iter(), self.decompositions, self.tables).map(|c| if c != '\u{FFFE}' { c as i32 } else { -1i32 }).cmp(
Decomposition::new(right_tail.$right_to_iter(), self.decompositions, self.tables).map(|c| if c != '\u{FFFE}' { c as i32 } else { -1i32 }),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if condition is backwards due to the assumption that absent other information, LLVM treats the first branch as more likely and we still after a decade aren't allowed to say unlikely on non-nightly Rust.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the best place to put this logic? It seems un-elegant to put this directly into the main compare function macro, which to this point has just delegated to other functions and hasn't had any code-point-specific logic?

Are we sure that U+FFFE is the only code point with this special case? It is the one that the conformance test discovered, but I wonder if it was pointing at a bigger class of issues?

@hsivonen
Copy link
Member Author

hsivonen commented Aug 8, 2025

Is this the best place to put this logic?

Yes. It needs to go into a .map() where the identical level is implemented.

It seems un-elegant to put this directly into the main compare function macro, which to this point has just delegated to other functions and hasn't had any code-point-specific logic?

It makes sense for the relevant iterators to be inside the macro, since the concrete types end up differing per macro instantiation. And, as noted above, the .map() needs to go where the iterators are. The .map() is tiny.

Are we sure that U+FFFE is the only code point with this special case? It is the one that the conformance test discovered, but I wonder if it was pointing at a bigger class of issues?

It looks a lot like this isn't a bigger class of issues but a very special case.

In the case of ICU4C sort key generation, only U+FFFE is special-cased: https://searchfox.org/mozilla-central/rev/f6a806c38c459e0e0d797d264ca0e8ad46005105/intl/icu/source/i18n/bocsu.cpp#132

@sffc
Copy link
Member

sffc commented Aug 9, 2025

I do see U+FFFE specifically called out in the CLDR collator spec, so maybe this really is just a single isolated special case:

https://www.unicode.org/reports/tr35/dev/tr35-collation.html#CLDR_Collation_Algorithm

@sffc
Copy link
Member

sffc commented Aug 12, 2025

Why is this a 4 line change but #6823 a much bigger change?

How come there aren't tests failing right now given an inconsistency in behavior between sort keys and compare?

@hsivonen
Copy link
Member Author

Why is this a 4 line change but #6823 a much bigger change?

Signed 32-bit integers offer a way to easily switch from char to something that expands the value space below U+0000, so the fix here is simple.

In the sort key case, the requirement is that 0u8 is reserved for C string compat, 1u8 is reserved for level separator, and 2u8 is the merge separator, so there's a need to put everything else at 3u8 or above. Instead of implementing a new scheme to shift UTF-8 like that, the exact solution from ICU4C was implemented. That solution also compresses the length of the resulting byte sequence, which is where the complexity comes from.

How come there aren't tests failing right now given an inconsistency in behavior between sort keys and compare?

Because we in general lack test coverage to show consistency between sort keys and compare. It would make sense to fuzz them.

@sffc sffc self-requested a review August 12, 2025 19:49
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve but I'd like @echeran to also take a look.

@hsivonen hsivonen merged commit 80a001c into unicode-org:main Aug 13, 2025
32 checks passed
@hsivonen
Copy link
Member Author

Thanks. I'm assuming that @echeran is doing a post-landing review. Landing now to avoid merge conflicts showing up and to avoid keeping the tree in a state where regular comparison and sort keys don't agree.

@hsivonen hsivonen deleted the mergeseparator branch August 13, 2025 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants