-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
Fix binary check in binary dumper #18241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@axel-rutz thanks for reporting this issue. Let me ask you if I've understood you right. You want us to change this: private static function isBinaryString($value)
{
return preg_match('/[^\x09-\x0d\x20-\xff]/', $value);
} By something like this? private static function isBinaryString($value)
{
$isValidUTF8 = preg_match('//u', $string);
return $isValidUTF8 && preg_match('/[^\x09-\x0d\x20-\xff]/', $value);
} |
The intent is to have a rock-solid unicode check, so that i can throw whatever in and can be sure i get no invalid YAML. And, otoh, to throw whatever unicode in and get no unreadable binary. It's even simpler, see above SO link (which i did only basically test though).
|
The thing is that the YAML dumper right now is able to also dump strings that are no UTF-8 strings (for example, the ISO-8859-1 charset used in Western Europe). However, as dumping binary strings must be explicitly enabled by the user I think we can deprecate dumping non UTF-8 encoded strings as YAML strings and dump them as binary data if the flag is passed (see #18294 for a patch). |
I don't see how this can work in general, as you can easily construct valid ISO-8859-1 strings that are invalid as UTF-8. |
Let's elaborate it a bit more: The claim is that it is important that the binary-check is specific in the following sense. (For the sake of clarity i'll replace "binary" with "invalid UTF-8".)
This is why i think it is important to use a proper utf-8-check. Given another check it will always be possible to create a string that will violate a) or b). |
…xabbuh) This PR was merged into the 3.1-dev branch. Discussion ---------- [Yaml] dump non UTF-8 encoded strings as binary data | Q | A | ------------- | --- | Branch? | master | Bug fix? | yes | New feature? | no | BC breaks? | no | Deprecations? | yes | Tests pass? | yes | Fixed tickets | #18241 | License | MIT | Doc PR | Commits ------- 86e4a6f dump non UTF-8 encoded strings as binary data
In #17863 a binary check is introduced like
preg_match('/[^\x09-\x0d\x20-\xff]/', $value)
This is not specific, in the sense that some invalid utf8 won't be recognized as binary.
The recognition is nontrivial iterating a byte sequence, but PHP has some simple one-liners to do it like
$isUTF8 = preg_match('//u', $string);
The text was updated successfully, but these errors were encountered: