Skip to content

Fix binary check in binary dumper #18241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
geek-merlin opened this issue Mar 20, 2016 · 5 comments
Closed

Fix binary check in binary dumper #18241

geek-merlin opened this issue Mar 20, 2016 · 5 comments

Comments

@geek-merlin
Copy link

In #17863 a binary check is introduced like
preg_match('/[^\x09-\x0d\x20-\xff]/', $value)

This is not specific, in the sense that some invalid utf8 won't be recognized as binary.

The recognition is nontrivial iterating a byte sequence, but PHP has some simple one-liners to do it like
$isUTF8 = preg_match('//u', $string);

@javiereguiluz
Copy link
Member

@axel-rutz thanks for reporting this issue. Let me ask you if I've understood you right. You want us to change this:

private static function isBinaryString($value)
{
    return preg_match('/[^\x09-\x0d\x20-\xff]/', $value);
}

By something like this?

private static function isBinaryString($value)
{
    $isValidUTF8 = preg_match('//u', $string);

    return $isValidUTF8 && preg_match('/[^\x09-\x0d\x20-\xff]/', $value);
}

@geek-merlin
Copy link
Author

The intent is to have a rock-solid unicode check, so that i can throw whatever in and can be sure i get no invalid YAML. And, otoh, to throw whatever unicode in and get no unreadable binary.

It's even simpler, see above SO link (which i did only basically test though).

private static function isBinaryString($value)
{
    return FALSE === preg_match('//u', $value);
}

@xabbuh
Copy link
Member

xabbuh commented Mar 24, 2016

The thing is that the YAML dumper right now is able to also dump strings that are no UTF-8 strings (for example, the ISO-8859-1 charset used in Western Europe). However, as dumping binary strings must be explicitly enabled by the user I think we can deprecate dumping non UTF-8 encoded strings as YAML strings and dump them as binary data if the flag is passed (see #18294 for a patch).

@geek-merlin
Copy link
Author

The thing is that the YAML dumper right now is able to also dump strings that are no UTF-8 strings (for example, the ISO-8859-1 charset used in Western Europe).

I don't see how this can work in general, as you can easily construct valid ISO-8859-1 strings that are invalid as UTF-8.

@geek-merlin
Copy link
Author

Let's elaborate it a bit more: The claim is that it is important that the binary-check is specific in the following sense. (For the sake of clarity i'll replace "binary" with "invalid UTF-8".)

  • a) There MUST NOT be any false negatives (i.e. invalid UTF-8 that is considered valid, thus not encoded)
    Reason: Non-encoded invalid UTF-8 is not allowed in valid JSON.
  • b) There SHOULD NOT be any false positives (i.e. valid UTF-8 that is considered invalid, thus encoded)
    Reason: DX. Developers might expect to be able to look at string data in an unencoded form. (I have some projects where this is important.)

This is why i think it is important to use a proper utf-8-check. Given another check it will always be possible to create a string that will violate a) or b).

fabpot added a commit that referenced this issue Mar 30, 2016
…xabbuh)

This PR was merged into the 3.1-dev branch.

Discussion
----------

[Yaml] dump non UTF-8 encoded strings as binary data

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | yes
| Tests pass?   | yes
| Fixed tickets | #18241
| License       | MIT
| Doc PR        |

Commits
-------

86e4a6f dump non UTF-8 encoded strings as binary data
@fabpot fabpot closed this as completed Mar 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants