expr is failing with multibyte chars #3132

Open
sylvestre opened this issue Feb 13, 2022 · 6 comments · Fixed by #3133
Labels
J - Locale locale related issue U - expr

Comments

@sylvestre
Contributor

This causes the GNU test https://github.com/coreutils/coreutils/blob/master/tests/misc/expr-multibyte.pl to fail.

$ ./target/debug/coreutils expr length αbcdef
7

GNU:

$ expr length αbcdef
6

Reproducing this requires a UTF-8 locale to be generated on the system, for example:

sudo locale-gen fr_FR.UTF-8
@sylvestre
Contributor Author

Of course, this comes from Rust: String::len() returns the length in bytes, not in characters. See https://doc.rust-lang.org/book/ch08-02-strings.html#internal-representation

Simple testcase:

fn main() {
    let s = String::from("αbcdef");
    assert_eq!(s.len(), 6);
}

Output:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `7`,
 right: `6`', src/main.rs:3:5
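
The mismatch above can be made explicit by counting both ways. This is a minimal sketch, not the code used in expr; the helper names `byte_len` and `char_len` are hypothetical and exist only for illustration.

```rust
/// Length in bytes of the UTF-8 encoding, which is what `str::len` reports.
/// (Hypothetical helper name, for illustration only.)
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Length in Unicode scalar values (Rust `char`s).
/// (Hypothetical helper name, for illustration only.)
fn char_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let s = "αbcdef";
    // 'α' (U+03B1) takes two bytes in UTF-8; the other five letters take one each.
    println!("bytes: {}", byte_len(s));
    println!("chars: {}", char_len(s));
}
```

In a UTF-8 locale, GNU expr reports the second number (6) for this input, as shown in the first comment.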

@tertsdiepraam
Member

I did some extra testing to check whether we need Unicode segmentation here, and we don't. GNU expr outputs a length of 2 for this flag emoji, which matches the code point count from chars().count() rather than the grapheme count:

[src/main.rs:4] "🇳🇱".len() = 8
[src/main.rs:5] "🇳🇱".chars().count() = 2
[src/main.rs:6] UnicodeSegmentation::graphemes("🇳🇱", true).count() = 1

Playground link
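
To see why the count is 2, it helps to list the code points in the flag. The sketch below uses only the standard library; `code_points` is a hypothetical helper name, and the grapheme count is left out because it needs the external unicode-segmentation crate.

```rust
/// Formats each Unicode scalar value of `s` as "U+XXXX".
/// (Hypothetical helper name, for illustration only.)
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let flag = "🇳🇱";
    // The flag is a pair of regional indicator symbols (N and L),
    // each encoded as four bytes in UTF-8.
    println!("code points: {:?}", code_points(flag));
    println!("bytes: {}", flag.len());
    println!("chars: {}", flag.chars().count());
}
```

The two regional indicators render as one flag, so a grapheme-based count gives 1, while a code point count gives the 2 that GNU expr reports.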

@sylvestre
Contributor Author

Yeah, I am working on a fix :)

@sylvestre
Contributor Author

To reproduce:
bash util/run-gnu-test.sh tests/misc/expr-multibyte

@sylvestre
Contributor Author

Actually, my patch was wrong: it should take the locale into account.

$ LANG=C expr length αbcdef
7
$ LANG=fr_FR.UTF-8 expr length αbcdef
6

It seems that we should use MB_CUR_MAX to find out how many bytes a character can take in the current locale.
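
A rough sketch of locale-dependent counting, under loud assumptions: instead of calling MB_CUR_MAX through libc, it inspects the locale name from the environment (precedence LC_ALL, then LC_CTYPE, then LANG) and treats any name with a UTF-8 codeset as multibyte. The functions `effective_ctype`, `is_utf8_locale`, and `expr_length` are hypothetical, this is not the fix that was merged, and it ignores other multibyte encodings as well as arguments that are not valid UTF-8.

```rust
use std::env;

/// Picks the locale name governing character handling, following the usual
/// precedence LC_ALL > LC_CTYPE > LANG. Empty values count as unset.
/// (Hypothetical helper, for illustration only.)
fn effective_ctype<'a>(
    lc_all: Option<&'a str>,
    lc_ctype: Option<&'a str>,
    lang: Option<&'a str>,
) -> &'a str {
    for value in [lc_all, lc_ctype, lang].iter().copied().flatten() {
        if !value.is_empty() {
            return value;
        }
    }
    "C"
}

/// Assumption: a locale is multibyte-aware only if its codeset is UTF-8,
/// e.g. "fr_FR.UTF-8" or "en_US.utf8@modifier".
fn is_utf8_locale(name: &str) -> bool {
    let codeset = name.split('.').nth(1).unwrap_or("");
    let codeset = codeset.split('@').next().unwrap_or("");
    codeset.to_ascii_lowercase().replace('-', "") == "utf8"
}

/// Counts characters in a UTF-8 locale and bytes otherwise.
fn expr_length(s: &str, locale: &str) -> usize {
    if is_utf8_locale(locale) {
        s.chars().count()
    } else {
        s.len()
    }
}

fn main() {
    let lc_all = env::var("LC_ALL").ok();
    let lc_ctype = env::var("LC_CTYPE").ok();
    let lang = env::var("LANG").ok();
    let locale = effective_ctype(lc_all.as_deref(), lc_ctype.as_deref(), lang.as_deref());
    println!("{}", expr_length("αbcdef", locale));
}
```

Keeping the locale name as a plain parameter makes the two cases from the shell session above easy to check without changing the process environment.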

@chrisdebian

Hi all, is this still an issue?

Chris

@RenjiSann RenjiSann added the J - Locale locale related issue label Feb 26, 2025

4 participants