UTF-8 corruption in `\Dom\HTMLDocument` #17481

xPaw · 2025-01-15T21:23:48Z

Description

The following code:

```php
<?php
$Repeated = str_repeat( '–', 4096 );
//$Repeated = str_repeat( '😏', 4096 );
$Data = '<!DOCTYPE HTML><html>' . $Repeated . '</html>';
$Document = \Dom\HTMLDocument::createFromString( $Data, 0, 'UTF-8' );

echo $Document->saveHTML();
// var_dump($Document->body->textContent);
```

The resulting string contains invalid UTF-8 sequences at seemingly random positions, which show up as the � character. With the repeated emoji, the emojis become corrupted. The more times the string is repeated, the more corrupted bytes appear in random places.
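To make the corruption measurable rather than eyeballing the output, a check along these lines may help. This is only a sketch: `summarizeUtf8Damage` is a hypothetical helper name, not something from the report, and the usage part assumes PHP 8.4+ with the DOM extension loaded.

```php
<?php
// Hypothetical helper (not part of the report): summarizes UTF-8 damage in a string.
function summarizeUtf8Damage(string $bytes): array
{
    return [
        // preg_match() with the /u modifier returns false for a malformed UTF-8 subject.
        'valid' => preg_match('//u', $bytes) === 1,
        // U+FFFD is the character a decoder emits in place of a malformed sequence.
        'replacementChars' => substr_count($bytes, "\u{FFFD}"),
    ];
}

// Usage against the reproducer; skipped when \Dom\HTMLDocument is unavailable.
if (class_exists(\Dom\HTMLDocument::class)) {
    // "\u{2013}" is the en dash used in the report, written as an escape so the
    // snippet does not depend on the encoding of the source file.
    $data = '<!DOCTYPE HTML><html>' . str_repeat("\u{2013}", 4096) . '</html>';
    $document = \Dom\HTMLDocument::createFromString($data, 0, 'UTF-8');
    var_dump(summarizeUtf8Damage($document->saveHTML()));
}
```

Since the input contains no U+FFFD at all, any non-zero `replacementChars` count (or `valid` being false) in the output would point at the parser rather than at the input.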

I initially spotted this bug when parsing a real HTML document and reading `textContent` (`innerHTML` shows the same issue) on an element I found with XPath.

PHP Version

PHP 8.4.2

Operating System

Windows 11 and Debian 12

xPaw · 2025-01-15T21:46:07Z

FYI, this is reproducible with 2-byte UTF-8 sequences too, but not with ASCII.
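One plausible reading of why ASCII is unaffected: if the input is processed in fixed-size chunks, only multi-byte sequences can straddle a chunk boundary. The sketch below illustrates that idea. The chunk size of 4096 and the helper name `cutSplitsCodePoint` are assumptions for illustration, not values taken from the parser.

```php
<?php
// Hypothetical helper: true when cutting $bytes at byte $offset would split a
// multi-byte UTF-8 sequence. Assumes $bytes is valid UTF-8.
function cutSplitsCodePoint(string $bytes, int $offset): bool
{
    if ($offset <= 0 || $offset >= strlen($bytes)) {
        return false;
    }
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (ord($bytes[$offset]) & 0xC0) === 0x80;
}

$chunkSize = 4096; // assumed for illustration only

$ascii   = str_repeat('a', 8192);                // 1 byte per character
$enDash  = str_repeat("\u{2013}", 4096);         // 3 bytes per character, as in the report
$twoByte = 'x' . str_repeat("\u{00E9}", 4096);   // 2 bytes per character, shifted by one byte

var_dump(cutSplitsCodePoint($ascii, $chunkSize));   // bool(false)
var_dump(cutSplitsCodePoint($enDash, $chunkSize));  // bool(true)
var_dump(cutSplitsCodePoint($twoByte, $chunkSize)); // bool(true)
```

Whether a given boundary actually lands inside a character depends on what precedes it, which would fit the observation that the corrupted positions look random.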

nielsdos · 2025-01-15T21:47:13Z

I'll try to have a look at the encoding code tomorrow.

We need to properly handle the case where the decoder returns because it has too few bytes. This needs to be handled separately, because otherwise the while loop just performs a partial byte copy.
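The general technique for the "too few bytes" case can be sketched as holding back an incomplete trailing sequence and prepending it to the next chunk. This is not the actual patch from php-src (which is C code). It is a PHP illustration under the assumption that the input is otherwise valid UTF-8, and both function names are hypothetical.

```php
<?php
// Hypothetical helper: number of trailing bytes in $chunk that belong to a
// UTF-8 sequence whose remaining bytes have not arrived yet.
function incompleteTailLength(string $chunk): int
{
    $length = strlen($chunk);
    // A UTF-8 sequence is at most 4 bytes, so at most 3 trailing bytes can be incomplete.
    for ($back = 1; $back <= 3 && $back <= $length; $back++) {
        $byte = ord($chunk[$length - $back]);
        if (($byte & 0xC0) === 0x80) {
            continue; // continuation byte: keep looking for the lead byte
        }
        if ($byte >= 0xF0) {
            $needed = 4;
        } elseif ($byte >= 0xE0) {
            $needed = 3;
        } elseif ($byte >= 0xC0) {
            $needed = 2;
        } else {
            $needed = 1; // ASCII
        }
        return $needed > $back ? $back : 0;
    }
    return 0;
}

// Hypothetical helper: turns arbitrary byte chunks into pieces that each end
// on a code point boundary, carrying incomplete tails over to the next chunk.
function reassembleChunks(array $chunks): array
{
    $carry = '';
    $pieces = [];
    foreach ($chunks as $chunk) {
        $buffer = $carry . $chunk;
        $tail = incompleteTailLength($buffer);
        $pieces[] = substr($buffer, 0, strlen($buffer) - $tail);
        // Guard against substr($buffer, -0), which would return the whole buffer.
        $carry = $tail > 0 ? substr($buffer, -$tail) : '';
    }
    if ($carry !== '') {
        $pieces[] = "\u{FFFD}"; // the input ended in the middle of a sequence
    }
    return $pieces;
}

$input  = str_repeat("\u{2013}", 10);               // 30 bytes
$pieces = reassembleChunks(str_split($input, 4));   // 4 is not a multiple of 3
var_dump(implode('', $pieces) === $input);          // bool(true)
```

The point of the design is that no piece ever contains a partial byte copy of a character: the incomplete bytes wait in `$carry` until the rest arrives.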

The fix for GH-17481 introduced a regression that can cause a read of uninitialized padding data when crossing a chunk boundary during HTML parsing of UTF-8 input. The wrong offset was computed with respect to the input buffer: the length of the error-corrected UTF-8 code point is not necessarily the same as the length of the input code point. This went unnoticed because no CI jobs run with Valgrind, nor do I run it regularly, and ASAN doesn't catch uninitialized accesses.
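The length mismatch is easy to see in isolation. In the sketch below, `replacementLengths` is a hypothetical helper, and it assumes the whole malformed input is replaced by a single U+FFFD, which is how decoders commonly error-correct a lone invalid byte.

```php
<?php
// Hypothetical helper: compares the bytes consumed from the input with the
// bytes emitted when a malformed sequence is replaced by one U+FFFD.
function replacementLengths(string $malformed): array
{
    return [
        'inputBytes'  => strlen($malformed),
        'outputBytes' => strlen("\u{FFFD}"), // U+FFFD encodes as EF BF BD
    ];
}

var_dump(replacementLengths("\xFF")); // inputBytes: 1, outputBytes: 3
```

An offset into the input buffer that is advanced by the output length would therefore overshoot by two bytes in this case.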

xPaw added Bug Status: Needs Triage labels Jan 15, 2025

nielsdos added Extension: dom Status: Verified and removed Status: Needs Triage labels Jan 15, 2025

nielsdos self-assigned this Jan 15, 2025

nielsdos linked a pull request Jan 16, 2025 that will close this issue

Fix GH-17481: UTF-8 corruption in \Dom\HTMLDocument #17489

Closed

nielsdos closed this as completed in 2952e16 Jan 17, 2025

nielsdos added a commit that referenced this issue Jan 17, 2025

Merge branch 'PHP-8.4'

72708f2

* PHP-8.4:
  Fix GH-17481: UTF-8 corruption in \Dom\HTMLDocument
  Fix GH-17486: Incorrect error line numbers reported in Dom\HTMLDocument::createFromString

nielsdos mentioned this issue Jan 29, 2025

Fix potential read of uninitialized padding data in DOM #17628

Merged
