-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-8 corruption in \Dom\HTMLDocument
#17481
Labels
Comments
FYI this is reproducible with 2-byte utf-8 sequences too, but not ascii. |
I'll try to have a look at the encoding code tomorrow. |
nielsdos
added a commit
to nielsdos/php-src
that referenced
this issue
Jan 16, 2025
We need to properly handle the case when we return from having too few bytes, this needs to be handled separately because the while loop otherwise just performs a partial byte copy.
nielsdos
added a commit
to nielsdos/php-src
that referenced
this issue
Jan 16, 2025
We need to properly handle the case when we return from having too few bytes, this needs to be handled separately because the while loop otherwise just performs a partial byte copy.
nielsdos
added a commit
to nielsdos/php-src
that referenced
this issue
Jan 16, 2025
We need to properly handle the case when we return from having too few bytes, this needs to be handled separately because the while loop otherwise just performs a partial byte copy.
nielsdos
added a commit
to nielsdos/php-src
that referenced
this issue
Jan 29, 2025
The fix for phpGH-17481 introduced a regression that can cause the read of uninitialized padding data when going over a chunk boundary during HTML parsing of UTF-8. The wrong offset was computed with respect to the input buffer, the length of the error-corrected UTF-8 code point is not necessarily the same as the input code point length. This was not noticed because no CI jobs run with Valgrind nor I do it regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal
pushed a commit
that referenced
this issue
Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of uninitialized padding data when going over a chunk boundary during HTML parsing of UTF-8. The wrong offset was computed with respect to the input buffer, the length of the error-corrected UTF-8 code point is not necessarily the same as the input code point length. This was not noticed because no CI jobs run with Valgrind nor I do it regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal
pushed a commit
that referenced
this issue
Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of uninitialized padding data when going over a chunk boundary during HTML parsing of UTF-8. The wrong offset was computed with respect to the input buffer, the length of the error-corrected UTF-8 code point is not necessarily the same as the input code point length. This was not noticed because no CI jobs run with Valgrind nor I do it regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal
pushed a commit
that referenced
this issue
Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of uninitialized padding data when going over a chunk boundary during HTML parsing of UTF-8. The wrong offset was computed with respect to the input buffer, the length of the error-corrected UTF-8 code point is not necessarily the same as the input code point length. This was not noticed because no CI jobs run with Valgrind nor I do it regularly, and ASAN doesn't catch uninitialized accesses.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Description
The following code:
The resulting string contains random invalid UTF-8 sequences like with the οΏ½ character. With the repeated emoji, emojis become corrupted. If you repeat the string for longer, there are more corrupted bytes in random places.
I initially spotted this bug when parsing a real HTML document and used
textContent
(innerHTML produces the same issue) on an element I found with xpath.PHP Version
PHP 8.4.2
Operating System
Windows 11 and Debian 12
The text was updated successfully, but these errors were encountered: