Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 corruption in \Dom\HTMLDocument #17481

Closed
xPaw opened this issue Jan 15, 2025 · 2 comments
Closed

UTF-8 corruption in \Dom\HTMLDocument #17481

xPaw opened this issue Jan 15, 2025 · 2 comments

Comments

@xPaw
Copy link
Contributor

xPaw commented Jan 15, 2025

Description

The following code:

<?php
$Repeated = str_repeat( '–', 4096 );
//$Repeated = str_repeat( '😏', 4096 );
$Data = '<!DOCTYPE HTML><html>' . $Repeated . '</html>';
$Document = \Dom\HTMLDocument::createFromString( $Data, 0, 'UTF-8' );

echo $Document->saveHTML();
// var_dump($Document->body->textContent);

The resulting string contains random invalid UTF-8 sequences like with the οΏ½ character. With the repeated emoji, emojis become corrupted. If you repeat the string for longer, there are more corrupted bytes in random places.

I initially spotted this bug when parsing a real HTML document and used textContent (innerHTML produces the same issue) on an element I found with xpath.

image

PHP Version

PHP 8.4.2

Operating System

Windows 11 and Debian 12

@xPaw
Copy link
Contributor Author

xPaw commented Jan 15, 2025

FYI this is reproducible with 2-byte utf-8 sequences too, but not ascii.

@nielsdos
Copy link
Member

I'll try to have a look at the encoding code tomorrow.

nielsdos added a commit to nielsdos/php-src that referenced this issue Jan 16, 2025
We need to properly handle the case when we return from having too few
bytes, this needs to be handled separately because the while loop
otherwise just performs a partial byte copy.
nielsdos added a commit to nielsdos/php-src that referenced this issue Jan 16, 2025
We need to properly handle the case when we return from having too few
bytes, this needs to be handled separately because the while loop
otherwise just performs a partial byte copy.
@nielsdos nielsdos linked a pull request Jan 16, 2025 that will close this issue
nielsdos added a commit to nielsdos/php-src that referenced this issue Jan 16, 2025
We need to properly handle the case when we return from having too few
bytes, this needs to be handled separately because the while loop
otherwise just performs a partial byte copy.
nielsdos added a commit that referenced this issue Jan 17, 2025
* PHP-8.4:
  Fix GH-17481: UTF-8 corruption in \Dom\HTMLDocument
  Fix GH-17486: Incorrect error line numbers reported in Dom\HTMLDocument::createFromString
nielsdos added a commit to nielsdos/php-src that referenced this issue Jan 29, 2025
The fix for phpGH-17481 introduced a regression that can cause the read of
uninitialized padding data when going over a chunk boundary during HTML
parsing of UTF-8.
The wrong offset was computed with respect to the input buffer, the
length of the error-corrected UTF-8 code point is not necessarily the
same as the input code point length.
This was not noticed because no CI jobs run with Valgrind nor I do it
regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal pushed a commit that referenced this issue Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of
uninitialized padding data when going over a chunk boundary during HTML
parsing of UTF-8.
The wrong offset was computed with respect to the input buffer, the
length of the error-corrected UTF-8 code point is not necessarily the
same as the input code point length.
This was not noticed because no CI jobs run with Valgrind nor I do it
regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal pushed a commit that referenced this issue Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of
uninitialized padding data when going over a chunk boundary during HTML
parsing of UTF-8.
The wrong offset was computed with respect to the input buffer, the
length of the error-corrected UTF-8 code point is not necessarily the
same as the input code point length.
This was not noticed because no CI jobs run with Valgrind nor I do it
regularly, and ASAN doesn't catch uninitialized accesses.
NattyNarwhal pushed a commit that referenced this issue Jan 30, 2025
The fix for GH-17481 introduced a regression that can cause the read of
uninitialized padding data when going over a chunk boundary during HTML
parsing of UTF-8.
The wrong offset was computed with respect to the input buffer, the
length of the error-corrected UTF-8 code point is not necessarily the
same as the input code point length.
This was not noticed because no CI jobs run with Valgrind nor I do it
regularly, and ASAN doesn't catch uninitialized accesses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants