[DomCrawler] Add `DomCrawler` to leverage PHP 8.4's HTML5-compliant DOM parser #61356

nicolas-grekas · 2025-08-07T15:29:55Z

Q	A
Branch?	7.4
Bug fix?	no
New feature?	yes
Deprecations?	yes
Issues	#53666
License	MIT

This PR replaces #54383

It takes the BC breaking approach by widening some property and method types to accept / return classes from both Dom\* and DOM* sets.

This widening is a hard BC break for child classes that re-declare the corresponding properties/methods.
I expect this situation to be rather uncommon, so I don't think this will be a big-bang BC break.
Fortunately, this change in signature will be easy to spot since the PHP engine will fail early.
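
To make the "fails early" point concrete, here is a minimal sketch of how PHP's variance rules react to a widened parent signature. `LegacyNode`, `ModernNode`, `BaseElement` and `CompatibleElement` are hypothetical stand-ins for the DOM* / Dom\* sets and for a Symfony class; none of them are the real types touched by this PR.

```php
<?php

// Hypothetical stand-ins for the two node families (not ext-dom or Symfony classes).
class LegacyNode {}
class ModernNode {}

// A base class after widening: it accepts and returns either family.
class BaseElement
{
    protected LegacyNode|ModernNode $node;

    public function __construct(LegacyNode|ModernNode $node)
    {
        $this->setNode($node);
    }

    public function setNode(LegacyNode|ModernNode $node): void
    {
        $this->node = $node;
    }

    public function getNode(): LegacyNode|ModernNode
    {
        return $this->node;
    }
}

// A child that re-declares members with the widened types keeps compiling.
class CompatibleElement extends BaseElement
{
    protected LegacyNode|ModernNode $node;

    public function setNode(LegacyNode|ModernNode $node): void
    {
        parent::setNode($node);
    }
}

// A child still written against the narrow types is rejected at class
// declaration time, before any of its code runs:
//
// class OutdatedElement extends BaseElement
// {
//     protected LegacyNode $node;                       // property types are invariant
//     public function setNode(LegacyNode $node): void   // parameter types cannot be narrowed
//     {
//     }
// }

$element = new CompatibleElement(new ModernNode());
echo get_class($element->getNode()), "\n";
```

Uncommenting `OutdatedElement` should trigger a fatal error when the class is declared, which is why an affected child class is hard to miss.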

In addition, the Crawler class is made generic: once an instance is created, it's going to stick to one set, either Dom\* or DOM* classes.
To further limit the impact, Crawler::__construct() allows building only Crawler<DOMNode> instances and a DomCrawler class is introduced to create Crawler<Dom\Node> ones.

As a reminder, the goal of this work is to be able to remove the dependency on masterminds/html5 in Symfony 8.
It is NOT to force moving to the new Dom\* API.

Still, in order to achieve this goal and because we want HTML5-parsing to be the default, BrowserKit will leverage the new DomCrawler class when running on PHP 8.4+.

The new Dom\* API has some behavioral changes, all legit and documented at https://wiki.php.net/rfc/opt_in_dom_spec_compliance

The Crawler implementation ensures some fundamental behaviors remain, such as node names being returned in lowercase, empty values/texts being returned as the empty string instead of null, and void tags not having closing tags. I don't think we'd gain much by propagating these changes brought by the Dom\* API to the Crawler one. It'd actually make adopting the new Dom\* implementation harder for little to no benefit.
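
As an illustration of one such difference, the sketch below parses the same markup with both native parsers and normalizes the node name. `normalizedNodeName()` is a hypothetical helper written for this example, not the Crawler's actual code. The uppercase `nodeName` for HTML elements under `Dom\HTMLDocument` is my reading of the RFC linked above.

```php
<?php

// Hypothetical helper, not part of Symfony: returns a lowercase node name
// regardless of which DOM family the node comes from.
function normalizedNodeName(\DOMNode|\Dom\Node $node): string
{
    return strtolower($node->nodeName);
}

$html = '<!DOCTYPE html><html><body><p>Hello</p></body></html>';

// Legacy parser (DOMDocument).
$legacy = new \DOMDocument();
$legacy->loadHTML($html, LIBXML_NOERROR);
$legacyP = $legacy->getElementsByTagName('p')->item(0);
echo normalizedNodeName($legacyP), "\n";

// HTML5-compliant parser, only available on PHP 8.4+.
if (class_exists(\Dom\HTMLDocument::class)) {
    $modern = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    $modernP = $modern->querySelector('p');
    // Print the raw name next to the normalized one to compare them.
    echo $modernP->nodeName, ' -> ', normalizedNodeName($modernP), "\n";
}
```

Normalizing at the Crawler boundary like this is what lets existing assertions on node names keep passing whichever parser produced the document.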

The BC break is a decision we have to make. I think it's the best possible approach because in practice, the impact will be limited. The alternative taken in #54383 is to create two independent type hierarchies. This would create a situation where suddenly, all the existing crawler-related code would need to be updated. Better to avoid that when there's a less costly path, even though it'll break some edge cases.

nicolas-grekas · 2025-08-07T16:01:16Z

Looks like psalm doesn't understand my annotations. Any proposals to make them understandable to the tool?
Maybe phpstan would do a better job?

alexandre-daubois · 2025-08-07T16:45:58Z

I'll have a deeper look at this PR soon, but the BC break seems definitely the best option. As you said, I also expect that very few projects will be impacted. Thank you for taking over this topic!

… PHP 8.4+ (nicolas-grekas)

This PR was merged into the 7.4 branch.

Discussion
----------

[HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | yes
| Issues        | #53666
| License       | MIT

Together with #61356, this PR allows removing any dependency on masterminds/html5 in favor of the native HTML5 capabilities of PHP 8.4 on Symfony 8.

In order to do so, we:

 * Use the native HTML5 parser when using PHP 8.4+
 * Deprecate `MastermindsParser`; use `NativeParser` instead
 * [BC BREAK] `ParserInterface::parse()` can now return `\Dom\Node|\DOMNode|null` instead of just `\DOMNode|null`
 * Add argument `$context` to `ParserInterface::parse()`

Note that `DomVisitor` is internal so no BC breaks there. And `StringSanitizer::htmlLower()` can leverage `strtolower()` since PHP 8.2 thanks to https://wiki.php.net/rfc/strtolower-ascii

Commits
-------

d0f98ad [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+

alexandre-daubois

Looks good! About PHPStan, I'm not sure... To me, the syntax looks correct but I never saw the T is ... used elsewhere than with @return. Usage with @return is the only one documented, so I guess @var may be unsupported unfortunately.

nicolas-grekas · 2025-08-19T06:44:41Z

At least phpstan understands the type alias and the type conditions.
PR is good to go. Votes pending ;)

stof · 2025-08-20T11:46:33Z

src/Symfony/Component/DomCrawler/Crawler.php

@@ -614,9 +634,17 @@ public function html(?string $default = null): string
            $html .= $owner->saveHTML($child);
        }

+        if ($this instanceof DomCrawler) {
+            // remove all void elements as defined by HTML5


Be careful. A DomCrawler might hold XML content, not just HTML content. You need to also check whether $this->document is a Dom\HtmlDocument, or check $this->isHtml
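
A rough sketch of the kind of check suggested here, using only native classes. `holdsHtmlDocument()` is a hypothetical helper for illustration; the Crawler's real property names and the final fix are not shown in this thread.

```php
<?php

// Hypothetical helper: true only when the node belongs to (or is) an HTML5
// document from the new Dom\* family, false for XML and legacy documents.
function holdsHtmlDocument(\DOMNode|\Dom\Node $node): bool
{
    // A document node has no owner document, so use the node itself.
    $document = $node instanceof \Dom\Document ? $node : $node->ownerDocument;

    return $document instanceof \Dom\HTMLDocument;
}

// The Dom\* classes only exist on PHP 8.4+.
if (class_exists(\Dom\HTMLDocument::class)) {
    $html = \Dom\HTMLDocument::createFromString('<!DOCTYPE html><p>a<br>b</p>', LIBXML_NOERROR);
    $xml = \Dom\XMLDocument::createFromString('<root><br/></root>');

    var_dump(holdsHtmlDocument($html->querySelector('br'))); // node inside an HTML document
    var_dump(holdsHtmlDocument($xml->documentElement));      // node inside an XML document
}
```

With a guard like this, HTML5-specific handling of void elements would apply only when the underlying document is HTML.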

stof · 2025-08-20T11:49:51Z

src/Symfony/Component/DomCrawler/Crawler.php

+        }
+
+        $domxpath = new \Dom\XPath($document);
+        foreach ($this->namespaces as $prefix => $namespace) {


why isn't this respecting the $prefixes argument like the old logic ?

stof · 2025-08-20T11:54:51Z

UPGRADE-7.4.md

+   * properties `FormField::$document`, `$xpath`, `$node` and method `getLabel()`
+   * methods `Form::getFormNode()` and `addField()`
+   * property `AbstractUriElement::$node`, and methods `getNode()` and `setNode()`
+   * methods `Crawler::add()`, `addDocument()`, `addNodeList()`, `addNode()`, `getNode()` and `sibling()`


type widening for getNode is a BC break impacting caller code rather than child classes, because it widens a return type instead of a parameter type (same for other getters in other classes). This should be highlighted IMO.

And this is not just theoretical. https://github.com/minkphp/MinkBrowserKitDriver will be totally broken by this BC break affecting consumers of the library (and immediately because of the BC break in BrowserKit switching to the BC-breaking implementation)
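
A minimal sketch of that caller-side breakage, with hypothetical stand-ins (`LegacyNode`, `ModernNode`, `getNode()`, `describeLegacy()`, `describe()`) rather than the real Symfony or Mink code:

```php
<?php

// Hypothetical stand-ins for the DOM* and Dom\* node families.
class LegacyNode {}
class ModernNode {}

// A getter whose return type was widened from LegacyNode to the union.
function getNode(bool $modern): LegacyNode|ModernNode
{
    return $modern ? new ModernNode() : new LegacyNode();
}

// Caller code written before the widening: it only accepts the legacy family.
function describeLegacy(LegacyNode $node): string
{
    return 'legacy node';
}

// Caller code updated to cope with both families.
function describe(LegacyNode|ModernNode $node): string
{
    return $node instanceof ModernNode ? 'modern node' : 'legacy node';
}

$failure = null;
try {
    // No child class is involved, yet the old caller breaks at runtime.
    describeLegacy(getNode(true));
} catch (\TypeError $e) {
    $failure = $e->getMessage();
}

echo null === $failure ? 'no failure' : 'TypeError raised', "\n";
echo describe(getNode(true)), "\n";
```

Unlike the child-class case, nothing fails at class declaration here: the error only surfaces when the narrower-typed code path actually runs.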

stof · 2025-08-20T12:00:55Z

UPGRADE-7.4.md

+BrowserKit
+----------
+
+ * Leverage the native HTML5 parser when using PHP 8.4+


this is a BC break for code interacting with DomCrawler methods returning a raw node as the type has changed. this should be documented as such.

carsonbot added Status: Needs Review Deprecation DomCrawler Feature labels Aug 7, 2025

carsonbot added this to the 7.4 milestone Aug 7, 2025

nicolas-grekas force-pushed the dom-crawler-84 branch 4 times, most recently from 2e9f204 to 8b72b64 Compare August 7, 2025 15:59

nicolas-grekas mentioned this pull request Aug 7, 2025

[DomCrawler] Support classes from the new DOM extension #54383

Closed

nicolas-grekas mentioned this pull request Aug 8, 2025

[HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+ #61366

Merged

alexandre-daubois approved these changes Aug 12, 2025

carsonbot added Status: Reviewed and removed Status: Needs Review labels Aug 12, 2025

nicolas-grekas force-pushed the dom-crawler-84 branch from 8b72b64 to b3e62b3 Compare August 19, 2025 06:44

nicolas-grekas force-pushed the dom-crawler-84 branch 4 times, most recently from f89dec7 to c7d6231 Compare August 20, 2025 11:08

[DomCrawler] Add DomCrawler to leverage PHP 8.4's HTML5-compliant D…

9f6bc5a

…OM parser

nicolas-grekas force-pushed the dom-crawler-84 branch from c7d6231 to 9f6bc5a Compare August 20, 2025 12:04

stof reviewed Aug 20, 2025
