[Fix] UTF-8 handling on macOS in php_replace_controlchars() #19529
I think this is caused by the locale, combined with iscntrl() being given values > 127 because of the unsigned char cast. Program to test this:
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
// isolated from url.c
static void php_replace_controlchars(char *str, size_t len)
{
    unsigned char *s = (unsigned char *)str;
    unsigned char *e = (unsigned char *)str + len;
    while (s < e) {
        printf("unsigned char '%c'/%d control char? %d\n", *s, *s, iscntrl(*s));
        if (iscntrl(*s)) {
            *s='_';
        }
        s++;
    }
}
int main(int argc, char **argv)
{
    char x[128];
    strcpy(x, "http://ουτοπία.δπθ.gr/"); // i know
    // Set the locale so it isn't just "C"
    setlocale(LC_CTYPE, "");
    printf("Locale: %s\n\n", setlocale(LC_CTYPE, NULL));
    // Call iscntrl with char values like the typical string,
    // but this messes up 8-bit single-byte locales.
    for (int i = 0; i < strlen(x); i++) {
        printf("char '%c'/%d control char? %d\n", x[i], x[i], iscntrl(x[i]));
    }
    puts("");
    // This will use unsigned chars instead, print the result
    php_replace_controlchars(x, strlen(x));
    printf("%s\n", x);
    return 0;
}
Output:
The problematic bytes are 0x80, 0x84, and 0x85. It's not continuation bytes in general (there are 0xBx bytes that are not control chars, for instance), but rather the ones whose values are control characters in Unicode (that is, U+0080, U+0084, and U+0085 are padding character, index, and next line (NEL), respectively). That said, I'm not sure if this is the right fix. Should it just use the C locale (i.e. with |
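To make the byte-level picture concrete, here is a minimal sketch that scans a string for bytes in the 0x80..0x9F range, i.e. the byte values that coincide with the Unicode C1 control code points. The helper name count_c1_bytes is hypothetical and not part of url.c. It assumes, as the observation above suggests, that the affected locale classifies a byte by the code point of the same number. In the test URL those bytes are the second UTF-8 bytes of υ (CF 85), τ (CF 84), and π (CF 80).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not in url.c): reports and counts bytes whose value
 * falls in 0x80..0x9F, the range that matches the Unicode C1 control code
 * points U+0080..U+009F. The check is plain arithmetic, so it does not
 * depend on the current locale. */
static size_t count_c1_bytes(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80 && s[i] <= 0x9F) {
            printf("byte 0x%02X at offset %zu\n", s[i], i);
            count++;
        }
    }
    return count;
}
```

For the UTF-8 encoding of the URL in the test program this flags four bytes: 0x85, 0x84, and 0x80 twice (π appears twice in the host). Bytes such as 0xBF or 0xB1 from ο and α fall outside the range and are left alone.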
Fix php_replace_controlchars() on macOS so UTF-8 characters in URLs are preserved.
Previously, non-ASCII characters in hosts were replaced with _ or became � due to improper handling of multibyte UTF-8 characters.
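One locale-independent way to get this behavior is to test for ASCII control characters directly instead of calling iscntrl(). The sketch below illustrates that idea only; it is not necessarily the change made in this PR, and the names is_ascii_cntrl and replace_ascii_controlchars are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: true only for ASCII control characters
 * (0x00..0x1F and 0x7F). Bytes >= 0x80 are never treated as control
 * characters, whatever the current locale is. */
static int is_ascii_cntrl(unsigned char c)
{
    return c < 0x20 || c == 0x7F;
}

/* Sketch of a locale-independent variant of php_replace_controlchars():
 * replaces ASCII control characters with '_' and leaves every other byte,
 * including the bytes of multibyte UTF-8 sequences, untouched. */
static void replace_ascii_controlchars(char *str, size_t len)
{
    unsigned char *s = (unsigned char *)str;
    unsigned char *e = s + len;

    while (s < e) {
        if (is_ascii_cntrl(*s)) {
            *s = '_';
        }
        s++;
    }
}
```

With this variant a tab or DEL byte in the input still becomes _, while a sequence such as CF 80 (π) passes through unchanged.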