Skip to content

Major overhaul of mbstring (part 8) #7177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 29, 2021
Merged

Conversation

alexdowad
Copy link
Contributor

FYA @nikic

The remaining legacy text encodings supported by mbstring which do not have adequate tests are: EUC-JP-2004, ISO-2022-KR, GB18030, and Big5.

Then we have Base64, HTML entities, QPrint, and UUEncode. I would like to suggest that we deprecate the use of mbstring to convert to/from HTML entities or UUEncode... I think there are other built-in functions to handle those.

- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Handle ~ escapes according to the HZ standard (RFC 1843).
- Treat unrecognized ~ escapes as an error.
- Multi-byte characters (between ~{ ~} escapes) are GB2312, not CP936.
  (CP936 is an extended version from MicroSoft, but the RFC does not
  state that this extended version of GB should be used.)
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- Adjust some mappings to match recommendations in conversion table
  from Unicode Consortium.
- Treat truncated multi-byte characters as an error.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was also a bug whereby some unrecognized Unicode codepoints
  would be passed through to the output unchanged when converting
  Unicode to EUC-KR.
- Flag truncated multi-byte characters as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- There was a bug whereby some unrecognized Unicode codepoints would be
  passed through unchanged to the output when converting Unicode to
  EUC-CN.
- Stick to the original EUC-CN standard, rather than CP936 (an extended
  version invented by MS).
@nikic
Copy link
Member

nikic commented Jun 28, 2021

Looks like this is failing on 32-bit:

001+ mb_check_encoding failed on good EUC-TW string: (7fffffffa5bbe4bfe9a2d2a9e1fd)
001- Tested EUC-TW -> UTF-16BE
002- Tested UTF-16BE -> EUC-TW

@nikic
Copy link
Member

nikic commented Jun 28, 2021

Then we have Base64, HTML entities, QPrint, and UUEncode. I would like to suggest that we deprecate the use of mbstring to convert to/from HTML entities or UUEncode... I think there are other built-in functions to handle those.

Yeah, I agree that those should be deprecated. They aren't really encodings in the sense of encoding unicode code points, and need special casing for that reason.

@alexdowad
Copy link
Contributor Author

Looks like this is failing on 32-bit:

001+ mb_check_encoding failed on good EUC-TW string: (7fffffffa5bbe4bfe9a2d2a9e1fd)
001- Tested EUC-TW -> UTF-16BE
002- Tested UTF-16BE -> EUC-TW

Thanks for pointing that out. I'll work on it.

@alexdowad
Copy link
Contributor Author

Hmm.

The string which is showing in that message is definitely not valid EUC-TW text.
And neither 0x7FFF, 0xFFFF, nor 0xFF should be in the array of "good characters" which it generates test cases from.

@alexdowad alexdowad force-pushed the cleanup-mbstring-8 branch from 7bb282b to e600b2c Compare June 28, 2021 18:20
@alexdowad
Copy link
Contributor Author

OK, just trying something...

- Treat text which ends abruptly in the middle of a multi-byte
  character as erroneous.
- Don't allow ASCII control characters to appear in the middle of a
  multi-byte character.
- If an illegal byte appears in the middle of a multi-byte character,
  go back to the initial state rather than trying to finish the
  multi-byte character.
- There was a bug in the file with the conversion tables, which set the
  'maximum codepoint which can be converted using table A2' using the
  size of table A1, not table A2. This meant that several hundred
  Unicode codepoints which should have been able to be converted to
  EUC-TW were flagged as erroneous instead.
- When a sequence which cannot possibly be a prefix of a valid
  multi-byte character is found, immediately flag it as an error, rather
  than waiting to read more bytes first.
- Allow characters in CNS-11643 plane 1 to be encoded as 4-byte
  sequences (although they can also be encoded as 2-byte sequences).
  This is allowed by the standard for EUC-TW text.
@alexdowad alexdowad force-pushed the cleanup-mbstring-8 branch from e600b2c to 4dc7d5e Compare June 28, 2021 19:03
@alexdowad
Copy link
Contributor Author

OK, it's better now.

@Zenexer
Copy link

Zenexer commented Nov 16, 2022

@nikic

They aren't really encodings in the sense of encoding unicode code points, and need special casing for that reason.

As far as I can tell, HTML-ENTITIES is actually an encoding mechanism for unicode code points, and there doesn't seem to be an equivalent replacement (edit: this was incorrect). It seems to just take any code point above U+007F and encode it as an HTML entity, making it 7-bit clean. It won't encode special HTML characters such as < or &--the latter being quite curious, since it could technically result in invalid output unless the input is already HTML-encoded.

While it's certainly possible to implement that in pure PHP, I'm not aware of any intrinsic function that supports that behavior.

To anyone else who may happen upon this issue via Google: HTML-ENTITIES is broken and doesn't result in valid output. It won't encode &, which means the following is true:

mb_convert_encoding('🙃', 'HTML-ENTITIES') === mb_convert_encoding('&#128579;', 'HTML-ENTITIES')

In order to use it safely, you would have had to, at minimum, manually replaced & with &amp; prior to running the text through mb_convert_encoding, or be passing in text that is already valid HTML/XML.

If you currently use HTML-ENTITIES in your code base, you should probably examine it to determine whether that results in any security vulnerabilities for your use case.

Edit: I think this should be roughly equivalent:

mb_encode_numericentity($input, [0x80, 0x10fffff, 0, 0x1fffff], mb_internal_encoding());

Keep in mind that it won't escape ampersands, so the input must already be HTML/XML-encoded, just like with HTML-ENTITIES.

If you only need to support PHP 8.0+, you can omit the third parameter to use the default internal encoding.

@alexdowad
Copy link
Contributor Author

@Zenexer You are very right that mbstring's HTML-ENTITIES encoding support is broken, and has been from the beginning (around 20 years ago). Interestingly, if you study the implementation code, it is clear that the author did intend for it to encode &, but it never worked... and nobody ever complained.

By no means was that the only bug in mbstring's HTML-ENTITIES. It was very, very buggy indeed.

You are also very right that the preferred approach is to use mb_encode_numericentity.

@Zenexer
Copy link

Zenexer commented Nov 16, 2022

@alexdowad Thanks for confirming.

but it never worked... and nobody ever complained.

From looking through libraries I use, it seems as though people were using this behavior to their advantage, but it's quite unintuitive; mb_encoding_numericentity is much clearer. All it takes is for one person to misunderstand the behavior of HTML-ENTITIES, and suddenly there's a subtle vulnerability.

@cdevroe
Copy link

cdevroe commented Jan 31, 2024

Newbie question, years later (forgive me)... why isn't mb_convert_encoding() marked as deprecated in 8.2 yet? Shouldn't the documentation note that? Or, am I looking in the wrong place or is it still too early for that? TY.

https://www.php.net/manual/en/function.mb-convert-encoding.php

@youkidearitai
Copy link
Contributor

@cdevroe Thank you for your question. Probably I suggest write an issue to https://github.com/php/doc-en/issues . So we fix the document.

@cdevroe
Copy link

cdevroe commented Jan 31, 2024

@youkidearitai TY. Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants