Major overhaul of mbstring (part 8) #7177

alexdowad · 2021-06-20T12:03:38Z

The remaining legacy text encodings supported by mbstring which do not have adequate tests are: EUC-JP-2004, ISO-2022-KR, GB18030, and Big5.

Then we have Base64, HTML entities, QPrint, and UUEncode. I would like to suggest that we deprecate the use of mbstring to convert to/from HTML entities or UUEncode... I think there are other built-in functions to handle those.

- Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - Handle ~ escapes according to the HZ standard (RFC 1843). - Treat unrecognized ~ escapes as an error. - Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936. (CP936 is an extended version from MicroSoft, but the RFC does not state that this extended version of GB should be used.)

- Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - Adjust some mappings to match recommendations in conversion table from Unicode Consortium.

- Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - There was also a bug whereby some unrecognized Unicode codepoints would be passed through to the output unchanged when converting Unicode to EUC-KR.

- Flag truncated multi-byte characters as erroneous. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - There was a bug whereby some unrecognized Unicode codepoints would be passed through unchanged to the output when converting Unicode to EUC-CN. - Stick to the original EUC-CN standard, rather than CP936 (an extended version invented by MS).

nikic · 2021-06-28T10:21:17Z

Looks like this is failing on 32-bit:

001+ mb_check_encoding failed on good EUC-TW string: (7fffffffa5bbe4bfe9a2d2a9e1fd)
001- Tested EUC-TW -> UTF-16BE
002- Tested UTF-16BE -> EUC-TW

nikic · 2021-06-28T11:02:12Z

Then we have Base64, HTML entities, QPrint, and UUEncode. I would like to suggest that we deprecate the use of mbstring to convert to/from HTML entities or UUEncode... I think there are other built-in functions to handle those.

Yeah, I agree that those should be deprecated. They aren't really encodings in the sense of encoding unicode code points, and need special casing for that reason.

alexdowad · 2021-06-28T17:41:00Z

Looks like this is failing on 32-bit:

001+ mb_check_encoding failed on good EUC-TW string: (7fffffffa5bbe4bfe9a2d2a9e1fd)
001- Tested EUC-TW -> UTF-16BE
002- Tested UTF-16BE -> EUC-TW

Thanks for pointing that out. I'll work on it.

alexdowad · 2021-06-28T18:02:22Z

Hmm.

The string which is showing in that message is definitely not valid EUC-TW text.
And neither 0x7FFF, 0xFFFF, nor 0xFF should be in the array of "good characters" which it generates test cases from.

alexdowad · 2021-06-28T18:20:34Z

OK, just trying something...

- Treat text which ends abruptly in the middle of a multi-byte character as erroneous. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - If an illegal byte appears in the middle of a multi-byte character, go back to the initial state rather than trying to finish the multi-byte character. - There was a bug in the file with the conversion tables, which set the 'maximum codepoint which can be converted using table A2' using the size of table A1, not table A2. This meant that several hundred Unicode codepoints which should have been able to be converted to EUC-TW were flagged as erroneous instead. - When a sequence which cannot possibly be a prefix of a valid multi-byte character is found, immediately flag it as an error, rather than waiting to read more bytes first. - Allow characters in CNS-11643 plane 1 to be encoded as 4-byte sequences (although they can also be encoded as 2-byte sequences). This is allowed by the standard for EUC-TW text.

alexdowad · 2021-06-28T19:45:36Z

OK, it's better now.

Zenexer · 2022-11-16T16:26:48Z

@nikic

They aren't really encodings in the sense of encoding unicode code points, and need special casing for that reason.

As far as I can tell, HTML-ENTITIES is actually an encoding mechanism for unicode code points, and ~~there doesn't seem to be an equivalent replacement~~ (edit: this was incorrect). It seems to just take any code point above U+007F and encode it as an HTML entity, making it 7-bit clean. It won't encode special HTML characters such as < or &--the latter being quite curious, since it could technically result in invalid output unless the input is already HTML-encoded.

While it's certainly possible to implement that in pure PHP, I'm not aware of any intrinsic function that supports that behavior.

To anyone else who may happen upon this issue via Google: HTML-ENTITIES is broken and doesn't result in valid output. It won't encode &, which means the following is true:

mb_convert_encoding('🙃', 'HTML-ENTITIES') === mb_convert_encoding('&#128579;', 'HTML-ENTITIES')

In order to use it safely, you would have had to, at minimum, manually replaced & with & prior to running the text through mb_convert_encoding, or be passing in text that is already valid HTML/XML.

If you currently use HTML-ENTITIES in your code base, you should probably examine it to determine whether that results in any security vulnerabilities for your use case.

Edit: I think this should be roughly equivalent:

mb_encode_numericentity($input, [0x80, 0x10fffff, 0, 0x1fffff], mb_internal_encoding());

Keep in mind that it won't escape ampersands, so the input must already be HTML/XML-encoded, just like with HTML-ENTITIES.

If you only need to support PHP 8.0+, you can omit the third parameter to use the default internal encoding.

alexdowad · 2022-11-16T17:39:52Z

@Zenexer You are very right that mbstring's HTML-ENTITIES encoding support is broken, and has been from the beginning (around 20 years ago). Interestingly, if you study the implementation code, it is clear that the author did intend for it to encode &, but it never worked... and nobody ever complained.

By no means was that the only bug in mbstring's HTML-ENTITIES. It was very, very buggy indeed.

You are also very right that the preferred approach is to use mb_encode_numericentity.

Zenexer · 2022-11-16T18:48:43Z

@alexdowad Thanks for confirming.

but it never worked... and nobody ever complained.

From looking through libraries I use, it seems as though people were using this behavior to their advantage, but it's quite unintuitive; mb_encoding_numericentity is much clearer. All it takes is for one person to misunderstand the behavior of HTML-ENTITIES, and suddenly there's a subtle vulnerability.

cdevroe · 2024-01-31T13:03:49Z

Newbie question, years later (forgive me)... why isn't mb_convert_encoding() marked as deprecated in 8.2 yet? Shouldn't the documentation note that? Or, am I looking in the wrong place or is it still too early for that? TY.

https://www.php.net/manual/en/function.mb-convert-encoding.php

youkidearitai · 2024-01-31T13:24:27Z

@cdevroe Thank you for your question. Probably I suggest write an issue to https://github.com/php/doc-en/issues . So we fix the document.

cdevroe · 2024-01-31T13:47:29Z

@youkidearitai TY. Done.

alexdowad added 5 commits June 20, 2021 13:15

Fix conversion of CP936 text (and add test suite)

30af960

- Treat truncated multi-byte characters as an error. - Don't allow ASCII control characters to appear in the middle of a multi-byte character. - Adjust some mappings to match recommendations in conversion table from Unicode Consortium.

Remove table generation scripts which have not been used for years

dcf3a09

alexdowad force-pushed the cleanup-mbstring-8 branch from 7bb282b to e600b2c Compare June 28, 2021 18:20

alexdowad force-pushed the cleanup-mbstring-8 branch from e600b2c to 4dc7d5e Compare June 28, 2021 19:03

nikic approved these changes Jun 29, 2021

View reviewed changes

alexdowad merged commit ff85ed8 into php:master Jun 29, 2021

This was referenced Nov 25, 2021

PHP 8.2: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated symfony/symfony#44281

Closed

PHP 8.2: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated erusev/parsedown-extra#173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Major overhaul of mbstring (part 8) #7177

Major overhaul of mbstring (part 8) #7177

Uh oh!

alexdowad commented Jun 20, 2021

Uh oh!

nikic commented Jun 28, 2021

Uh oh!

nikic commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

Zenexer commented Nov 16, 2022 •

edited

Loading

Uh oh!

alexdowad commented Nov 16, 2022

Uh oh!

Zenexer commented Nov 16, 2022 •

edited

Loading

Uh oh!

cdevroe commented Jan 31, 2024 •

edited

Loading

Uh oh!

youkidearitai commented Jan 31, 2024

Uh oh!

cdevroe commented Jan 31, 2024

Uh oh!

Uh oh!

Major overhaul of mbstring (part 8) #7177

Major overhaul of mbstring (part 8) #7177

Uh oh!

Conversation

alexdowad commented Jun 20, 2021

Uh oh!

nikic commented Jun 28, 2021

Uh oh!

nikic commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

alexdowad commented Jun 28, 2021

Uh oh!

Zenexer commented Nov 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexdowad commented Nov 16, 2022

Uh oh!

Zenexer commented Nov 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cdevroe commented Jan 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

youkidearitai commented Jan 31, 2024

Uh oh!

cdevroe commented Jan 31, 2024

Uh oh!

Uh oh!

Zenexer commented Nov 16, 2022 •

edited

Loading

Zenexer commented Nov 16, 2022 •

edited

Loading

cdevroe commented Jan 31, 2024 •

edited

Loading