Major overhaul of mbstring (part 8) #7177
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ and ~} escapes) are GB2312, not CP936. (CP936 is an extended version from Microsoft, but the RFC does not state that this extended version of GB should be used.)
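
The escape rules listed above can be sketched in plain PHP. This is only an illustration of RFC 1843 as I read it, not mbstring's actual implementation: `hz_is_well_formed` is a hypothetical helper, and rejecting text that ends while still in GB mode is an assumption of this sketch.

```php
<?php
// Hypothetical helper (not part of mbstring): checks the escape structure
// of HZ-encoded bytes according to RFC 1843.
function hz_is_well_formed(string $bytes): bool
{
    $gbMode = false;               // HZ text starts in ASCII mode
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if (!$gbMode) {
            if ($b >= 0x80) {
                return false;      // HZ is a 7-bit encoding
            }
            if ($b !== 0x7E) {
                continue;          // ordinary ASCII byte
            }
            if ($i + 1 >= $len) {
                return false;      // dangling '~' at end of input
            }
            $next = $bytes[++$i];
            if ($next === '{') {
                $gbMode = true;    // "~{" switches to GB mode
            } elseif ($next !== '~' && $next !== "\n") {
                return false;      // unrecognized ~ escape
            }
            continue;              // "~~" is a literal '~', "~\n" a line continuation
        }
        // GB mode: bytes are consumed in pairs
        if ($i + 1 >= $len) {
            return false;          // truncated multi-byte character
        }
        $b2 = ord($bytes[++$i]);
        if ($b === 0x7E && $b2 === 0x7D) {
            $gbMode = false;       // "~}" switches back to ASCII mode
            continue;
        }
        // First byte 0x21-0x77, second byte 0x21-0x7E; this also rejects
        // ASCII control characters inside a multi-byte character.
        if ($b < 0x21 || $b > 0x77 || $b2 < 0x21 || $b2 > 0x7E) {
            return false;
        }
    }
    return !$gbMode;               // assumption: text must end in ASCII mode
}
```

For example, `hz_is_well_formed("~{<:Ky~}")` accepts the sample string from the RFC, while `"~x"` (unrecognized escape) and `"~{<"` (truncated character) are rejected.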

- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a multi-byte character.
- Adjust some mappings to match recommendations in the conversion table from the Unicode Consortium.

- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints would be passed through to the output unchanged when converting Unicode to EUC-KR.

- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be passed through unchanged to the output when converting Unicode to EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended version invented by MS).

Looks like this is failing on 32-bit:
Yeah, I agree that those should be deprecated. They aren't really encodings in the sense of encoding Unicode code points, and need special casing for that reason.
Thanks for pointing that out. I'll work on it.
Hmm. The string which is showing in that message is definitely not valid EUC-TW text.
Force-pushed: 7bb282b to e600b2c
OK, just trying something...
- Treat text which ends abruptly in the middle of a multi-byte character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character, go back to the initial state rather than trying to finish the multi-byte character.
- There was a bug in the file with the conversion tables, which set the 'maximum codepoint which can be converted using table A2' using the size of table A1, not table A2. This meant that several hundred Unicode codepoints which should have been able to be converted to EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid multi-byte character is found, immediately flag it as an error, rather than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte sequences (although they can also be encoded as 2-byte sequences). This is allowed by the standard for EUC-TW text.
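
As a rough illustration of the last two points, here is a sketch of how an EUC-TW lead byte determines the sequence length, and how a 4-byte plane 1 sequence relates to its 2-byte form. Both functions are hypothetical helpers written for this example, not mbstring code, and they follow my understanding of the EUC-TW layout (0x8E, then a plane byte 0xA1-0xB0, then two bytes).

```php
<?php
// Hypothetical helpers for illustration; not mbstring's API.

// Expected length of an EUC-TW sequence given its first byte,
// or 0 if the byte cannot start any valid sequence.
function euctw_expected_length(int $lead): int
{
    if ($lead < 0x80) {
        return 1;                  // ASCII
    }
    if ($lead === 0x8E) {
        return 4;                  // 0x8E + plane byte + two bytes
    }
    if ($lead >= 0xA1 && $lead <= 0xFE) {
        return 2;                  // CNS 11643 plane 1, short form
    }
    return 0;                      // cannot be a prefix: flag immediately
}

// A 4-byte sequence for plane 1 (plane byte 0xA1) names the same character
// as the 2-byte sequence made of its last two bytes.
function euctw_shortest_form(string $seq): string
{
    if (strlen($seq) === 4 && $seq[0] === "\x8E" && $seq[1] === "\xA1") {
        return substr($seq, 2);
    }
    return $seq;
}
```

A lead byte such as 0x80 yields 0 here, which corresponds to flagging the error right away rather than reading more bytes.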
Force-pushed: e600b2c to 4dc7d5e
OK, it's better now.
As far as I can tell, While it's certainly possible to implement that in pure PHP, I'm not aware of any intrinsic function that supports that behavior.

To anyone else who may happen upon this issue via Google:

mb_convert_encoding('🙃', 'HTML-ENTITIES') === mb_convert_encoding('🙃', 'HTML-ENTITIES')

In order to use it safely, you would have had to, at minimum, manually replaced

If you currently use HTML-ENTITIES in your code base, you should probably examine it to determine whether that results in any security vulnerabilities for your use case.

Edit: I think this should be roughly equivalent:

mb_encode_numericentity($input, [0x80, 0x10ffff, 0, 0x1fffff], mb_internal_encoding());

Keep in mind that it won't escape ampersands, so the input must already be HTML/XML-encoded, just like with HTML-ENTITIES. If you only need to support PHP 8.0+, you can omit the third parameter to use the default internal encoding.
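
A minimal, self-contained version of that suggestion might look like the sketch below. The wrapper name `encode_non_ascii` is made up for this example, and the code assumes the mbstring extension is loaded and the input is UTF-8.

```php
<?php
// Hypothetical wrapper around mb_encode_numericentity(); the name is
// illustrative only. Assumes ext/mbstring is available.
function encode_non_ascii(string $html, string $encoding = 'UTF-8'): string
{
    // Conversion map: [start, end, offset, mask].
    // Encode every codepoint from U+0080 to U+10FFFF, with no offset.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, $encoding);
}

// Non-ASCII characters become decimal numeric entities.
$encoded = encode_non_ascii("\u{1F643}");

// ASCII, including '&', '<' and '>', is left untouched, so the input must
// already be HTML-escaped before it reaches this function.
$unchanged = encode_non_ascii('a & b');
```

Since ampersands pass through unchanged, running `htmlspecialchars()` on untrusted text first remains the caller's job.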
@Zenexer You are very right that mbstring's By no means was that the only bug in mbstring's You are also very right that the preferred approach is to use
@alexdowad Thanks for confirming.
From looking through libraries I use, it seems as though people were using this behavior to their advantage, but it's quite unintuitive;
Newbie question, years later (forgive me)... why isn't https://www.php.net/manual/en/function.mb-convert-encoding.php
@cdevroe Thank you for your question. I suggest opening an issue at https://github.com/php/doc-en/issues so that we can fix the documentation.
@youkidearitai TY. Done.
FYA @nikic
The remaining legacy text encodings supported by mbstring which do not have adequate tests are: EUC-JP-2004, ISO-2022-KR, GB18030, and Big5.
Then we have Base64, HTML entities, QPrint, and UUEncode. I would like to suggest that we deprecate the use of mbstring to convert to/from HTML entities or UUEncode... I think there are other built-in functions to handle those.
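
For reference, a sketch of the non-mbstring built-ins that cover these formats. The functions called are standard PHP; the wrapper `roundtrip_with_builtins` and the sample string are only illustrative, and the HTML flags shown are one reasonable choice rather than a recommendation from this PR.

```php
<?php
// Hypothetical demo function: round-trips a string through PHP's dedicated
// built-ins for Base64, UUEncode, quoted-printable and HTML entities.
function roundtrip_with_builtins(string $text): array
{
    $flags = ENT_QUOTES | ENT_HTML5;

    return [
        'base64' => base64_decode(base64_encode($text), true) === $text,
        'uuencode' => convert_uudecode(convert_uuencode($text)) === $text,
        'qprint' => quoted_printable_decode(quoted_printable_encode($text)) === $text,
        // htmlentities() escapes '&', '<', '>' and quotes as well as
        // characters that have named entities.
        'html' => html_entity_decode(htmlentities($text, $flags, 'UTF-8'), $flags, 'UTF-8') === $text,
    ];
}

$results = roundtrip_with_builtins("Grüße <b>& co</b>");
```

Each entry in `$results` is `true` when the decode function restores the original string.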