
[Fix] UTF-8 handling on macOS in php_replace_controlchars() #19529


Open
wants to merge 1 commit into
base: master

arshidkv12

Fix php_replace_controlchars() on macOS so UTF-8 characters in URLs are preserved.

Previously, non-ASCII characters in hosts were replaced with _ or became � due to improper handling of multibyte UTF-8 characters.

<?php
$parsed = parse_url('http://ουτοπία.δπθ.gr/');
var_dump($parsed['host']);
?>
--EXPECT--
string(24) "ουτοπία.δπθ.gr"

@NattyNarwhal
Member

I think this is caused by the locale: because the bytes are cast to unsigned char, iscntrl() is given values > 127, which a non-C locale can classify as control characters.

Program to test this:
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

// isolated from url.c
static void php_replace_controlchars(char *str, size_t len)
{
	unsigned char *s = (unsigned char *)str;
	unsigned char *e = (unsigned char *)str + len;
				
	while (s < e) {
		printf("unsigned char '%c'/%d control char? %d\n", *s, *s, iscntrl(*s));
		if (iscntrl(*s)) {
			*s='_';
		}       
		s++;
	}	       
}

int main(int argc, char **argv)
{
	char x[128];
	strcpy(x, "http://ουτοπία.δπθ.gr/"); // i know
	
	// Set the locale so it isn't just "C"
	setlocale(LC_CTYPE, "");
	printf("Locale: %s\n\n", setlocale(LC_CTYPE, NULL));

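	// Call iscntrl with plain char values, as a typical string loop would.
	// Where char is signed, bytes >= 0x80 are passed as negative values,
	// which is undefined behavior for iscntrl() and misclassifies
	// characters in 8-bit single-byte locales.
	for (size_t i = 0; i < strlen(x); i++) {
		printf("char '%c'/%d control char? %d\n", x[i], x[i], iscntrl(x[i]));
	}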
	puts("");

	// This will use unsigned chars instead, print the result
	php_replace_controlchars(x, strlen(x));
	printf("%s\n", x);
	return 0;
}
Output:
calvin@anika-5 src % ./controlchars 
Locale: en_CA.UTF-8

char 'h'/104 control char? 0
char 't'/116 control char? 0
char 't'/116 control char? 0
char 'p'/112 control char? 0
char ':'/58 control char? 0
char '/'/47 control char? 0
char '/'/47 control char? 0
char '?'/-50 control char? 0
char '?'/-65 control char? 0
char '?'/-49 control char? 0
char '?'/-123 control char? 0
char '?'/-49 control char? 0
char '?'/-124 control char? 0
char '?'/-50 control char? 0
char '?'/-65 control char? 0
char '?'/-49 control char? 0
char '?'/-128 control char? 0
char '?'/-50 control char? 0
char '?'/-81 control char? 0
char '?'/-50 control char? 0
char '?'/-79 control char? 0
char '.'/46 control char? 0
char '?'/-50 control char? 0
char '?'/-76 control char? 0
char '?'/-49 control char? 0
char '?'/-128 control char? 0
char '?'/-50 control char? 0
char '?'/-72 control char? 0
char '.'/46 control char? 0
char 'g'/103 control char? 0
char 'r'/114 control char? 0
char '/'/47 control char? 0

unsigned char 'h'/104 control char? 0
unsigned char 't'/116 control char? 0
unsigned char 't'/116 control char? 0
unsigned char 'p'/112 control char? 0
unsigned char ':'/58 control char? 0
unsigned char '/'/47 control char? 0
unsigned char '/'/47 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/191 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/133 control char? 1
unsigned char '?'/207 control char? 0
unsigned char '?'/132 control char? 1
unsigned char '?'/206 control char? 0
unsigned char '?'/191 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/128 control char? 1
unsigned char '?'/206 control char? 0
unsigned char '?'/175 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/177 control char? 0
unsigned char '.'/46 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/180 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/128 control char? 1
unsigned char '?'/206 control char? 0
unsigned char '?'/184 control char? 0
unsigned char '.'/46 control char? 0
unsigned char 'g'/103 control char? 0
unsigned char 'r'/114 control char? 0
unsigned char '/'/47 control char? 0
http://ο?_?_ο?_ία.δ?_θ.gr/
calvin@anika-5 src % LC_CTYPE=C ./controlchars
Locale: C

char 'h'/104 control char? 0
char 't'/116 control char? 0
char 't'/116 control char? 0
char 'p'/112 control char? 0
char ':'/58 control char? 0
char '/'/47 control char? 0
char '/'/47 control char? 0
char '?'/-50 control char? 0
char '?'/-65 control char? 0
char '?'/-49 control char? 0
char '?'/-123 control char? 0
char '?'/-49 control char? 0
char '?'/-124 control char? 0
char '?'/-50 control char? 0
char '?'/-65 control char? 0
char '?'/-49 control char? 0
char '?'/-128 control char? 0
char '?'/-50 control char? 0
char '?'/-81 control char? 0
char '?'/-50 control char? 0
char '?'/-79 control char? 0
char '.'/46 control char? 0
char '?'/-50 control char? 0
char '?'/-76 control char? 0
char '?'/-49 control char? 0
char '?'/-128 control char? 0
char '?'/-50 control char? 0
char '?'/-72 control char? 0
char '.'/46 control char? 0
char 'g'/103 control char? 0
char 'r'/114 control char? 0
char '/'/47 control char? 0

unsigned char 'h'/104 control char? 0
unsigned char 't'/116 control char? 0
unsigned char 't'/116 control char? 0
unsigned char 'p'/112 control char? 0
unsigned char ':'/58 control char? 0
unsigned char '/'/47 control char? 0
unsigned char '/'/47 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/191 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/133 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/132 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/191 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/128 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/175 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/177 control char? 0
unsigned char '.'/46 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/180 control char? 0
unsigned char '?'/207 control char? 0
unsigned char '?'/128 control char? 0
unsigned char '?'/206 control char? 0
unsigned char '?'/184 control char? 0
unsigned char '.'/46 control char? 0
unsigned char 'g'/103 control char? 0
unsigned char 'r'/114 control char? 0
unsigned char '/'/47 control char? 0
http://ουτοπία.δπθ.gr/

The problematic bytes are 0x80, 0x84, and 0x85. The issue is not continuation bytes in general (the 0xBx bytes, for instance, are not flagged as control characters); rather, these byte values coincide with Unicode control characters: U+0080, U+0084, and U+0085 are PAD (padding character), IND (index), and NEL (next line), respectively.
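To make the overlap concrete, here is a minimal sketch. It assumes, as the observation above suggests, that the UTF-8 locale's iscntrl() classifies a lone byte as if it were the code point with the same value. The helper names (is_utf8_continuation, in_c1_range, count_c1_collisions) are hypothetical and are not part of url.c.

```c
#include <stddef.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF). */
static int is_utf8_continuation(unsigned char c)
{
	return (c & 0xC0) == 0x80;
}

/* The C1 control code points are U+0080-U+009F. */
static int in_c1_range(unsigned char c)
{
	return c >= 0x80 && c <= 0x9F;
}

/*
 * Count the bytes that are UTF-8 continuation bytes and whose value also
 * falls in the C1 range. Under the assumption stated above, these are the
 * bytes a byte-wise iscntrl() would flag in a UTF-8 locale.
 */
static size_t count_c1_collisions(const char *str, size_t len)
{
	size_t n = 0;

	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)str[i];

		if (is_utf8_continuation(c) && in_c1_range(c)) {
			n++;
		}
	}
	return n;
}
```

For the host in the test case, the continuation bytes of υ (0xCF 0x85), τ (0xCF 0x84), and both occurrences of π (0xCF 0x80) fall in that range, which matches the four bytes flagged in the en_CA.UTF-8 run. Continuation bytes from 0xA0 to 0xBF, such as the 0xBF in ο, do not.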

That said, I'm not sure if this is the right fix. Should it just use the C locale (i.e. with iscntrl_l) and thus not convert any high-byte characters? This function predates IDN URLs, so I'm not sure how it should be handled.
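One way to get C-locale behavior without depending on iscntrl_l is to test the ASCII control ranges directly. The following is only a sketch of that alternative, not necessarily what this PR does. The names is_ascii_cntrl and replace_ascii_controlchars are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: true only for the ASCII control bytes
 * (0x00-0x1F and 0x7F), which is what iscntrl() reports in the C locale.
 */
static int is_ascii_cntrl(unsigned char c)
{
	return c < 0x20 || c == 0x7F;
}

/*
 * Sketch of a locale-independent variant of php_replace_controlchars():
 * bytes >= 0x80 are never touched, so multibyte UTF-8 sequences survive.
 */
static void replace_ascii_controlchars(char *str, size_t len)
{
	unsigned char *s = (unsigned char *)str;
	unsigned char *e = s + len;

	while (s < e) {
		if (is_ascii_cntrl(*s)) {
			*s = '_';
		}
		s++;
	}
}
```

With this variant the result no longer depends on LC_CTYPE. The trade-off is that C1 control characters encoded as UTF-8 (for example U+0085 as 0xC2 0x85) would pass through unchanged, so whether that is acceptable depends on how IDN hosts are meant to be handled.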
