Skip to content

[DomCrawler] Add DomCrawler to leverage PHP 8.4's HTML5-compliant DOM parser #61356

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: 7.4
Choose a base branch
from

Conversation

nicolas-grekas
Copy link
Member

@nicolas-grekas nicolas-grekas commented Aug 7, 2025

Q A
Branch? 7.4
Bug fix? no
New feature? yes
Deprecations? yes
Issues #53666
License MIT

This PR replaces #54383

It takes the BC breaking approach by widening some property and method types to accept / return classes from both Dom\* and DOM* sets.

This widening is a hard BC break for child classes the re-declare the corresponding properties/methods.
I expect this situation to be rather uncommon, so I don't think this will be a big-bang BC break.
Fortunately, this change in signature will be easy to spot since the PHP engine will fail early.

In addition, the Crawler class is made generic: once an instance is created, it's going to stick to one set, either Dom\* or DOM* classes.
To further limit the impact, Crawler::__construct() allows building only Crawler<DOMNode> instances and a DomCrawler class is introduced to create Crawler<Dom\Node> ones.

As a reminder, the goal of this work is to be able to remove the dependency on masterminds/html5 in Symfony 8.
It is NOT to force moving to the new Dom\* API.

Still, in order to achieve this goal and because we want HTML5-parsing to be the default, BrowserKit will leverage the new DomCrawler class when running on PHP 8.4+.

The new Dom\* API has some behavioral changes, all legit and documented at https://wiki.php.net/rfc/opt_in_dom_spec_compliance

The Crawler implementation ensures some fundamental behaviors remain, such as node names being returned in lowercase, empty values/texts being returned as the empty string instead of null, or void tags to not have closing tags. I don't think we'd gain much by propagating these changes brought by the Dom\* API to the Crawler one. It'd actually make adopting the new Dom\* implementation harder for little to no benefits.

The BC break is a decision we have to make. I think it's the best possible approach because in practice, the impact will be limited. The alternative taken in #54383 is to create two independent type hierarchies. This will create a situation where suddenly, all the existing crawler-related code will need to be updated. Better not when there's a less costly path - even though it'll break some edge-cases.

@nicolas-grekas
Copy link
Member Author

Looks like psalm doesn't undertand my annotations. Any proposals to make them understandable to the tool?
Maybe phpstan would do a better job?

@alexandre-daubois
Copy link
Member

I'll have a deeper look at this PR soon, but the BC break seems definitely the best option. As you said, I also expect that very few projects will be impacted. Thank you for taking over this topic!

nicolas-grekas added a commit that referenced this pull request Aug 12, 2025
… PHP 8.4+ (nicolas-grekas)

This PR was merged into the 7.4 branch.

Discussion
----------

[HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | yes
| Issues        | #53666
| License       | MIT

Together with #61356, this PR allows removing any dependency on masterminds/html5 in favor of the native HTML5 capabilities of PHP 8.4 on Symfony 8

In order to do so, this we:
 * Use the native HTML5 parser when using PHP 8.4+
 * Deprecate `MastermindsParser`; use `NativeParser` instead
 * [BC BREAK] `ParserInterface::parse()` can now return `\Dom\Node|\DOMNode|null` instead of just `\DOMNode|null`
 * Add argument `$context` to `ParserInterface::parse()`

Note that `DomVisitor` is internal so no BC breaks there.
And `StringSanitizer::htmlLower()` can leverage `strtolower()` since PHP 8.2 thanks to https://wiki.php.net/rfc/strtolower-ascii

Commits
-------

d0f98ad [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+
Copy link
Member

@alexandre-daubois alexandre-daubois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! About PHPStan, I'm not sure... To me, the syntax looks correct but I never saw the T is ... used elsewhere than with @return. Usage with @return is the only one documented, so I guess @var may be unsupported unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants