-
-
Notifications
You must be signed in to change notification settings - Fork 9.7k
[DomCrawler] Add DomCrawler
to leverage PHP 8.4's HTML5-compliant DOM parser
#61356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 7.4
Are you sure you want to change the base?
Conversation
2e9f204
to
8b72b64
Compare
Looks like psalm doesn't undertand my annotations. Any proposals to make them understandable to the tool? |
I'll have a deeper look at this PR soon, but the BC break seems definitely the best option. As you said, I also expect that very few projects will be impacted. Thank you for taking over this topic! |
… PHP 8.4+ (nicolas-grekas) This PR was merged into the 7.4 branch. Discussion ---------- [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+ | Q | A | ------------- | --- | Branch? | 7.4 | Bug fix? | no | New feature? | yes | Deprecations? | yes | Issues | #53666 | License | MIT Together with #61356, this PR allows removing any dependency on masterminds/html5 in favor of the native HTML5 capabilities of PHP 8.4 on Symfony 8 In order to do so, this we: * Use the native HTML5 parser when using PHP 8.4+ * Deprecate `MastermindsParser`; use `NativeParser` instead * [BC BREAK] `ParserInterface::parse()` can now return `\Dom\Node|\DOMNode|null` instead of just `\DOMNode|null` * Add argument `$context` to `ParserInterface::parse()` Note that `DomVisitor` is internal so no BC breaks there. And `StringSanitizer::htmlLower()` can leverage `strtolower()` since PHP 8.2 thanks to https://wiki.php.net/rfc/strtolower-ascii Commits ------- d0f98ad [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! About PHPStan, I'm not sure... To me, the syntax looks correct but I never saw the T is ...
used elsewhere than with @return
. Usage with @return
is the only one documented, so I guess @var
may be unsupported unfortunately.
8b72b64
to
b3e62b3
Compare
At least phpstan understands the type alias and the type conditions. |
f89dec7
to
c7d6231
Compare
c7d6231
to
9f6bc5a
Compare
@@ -614,9 +634,17 @@ public function html(?string $default = null): string | |||
$html .= $owner->saveHTML($child); | |||
} | |||
|
|||
if ($this instanceof DomCrawler) { | |||
// remove all void elements as defined by HTML5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful. A DomCrawler might hold XML content, not just HTML content. You need to also check whether $this->document
is a Dom\HtmlDocument
, or check $this->isHtml
} | ||
|
||
$domxpath = new \Dom\XPath($document); | ||
foreach ($this->namespaces as $prefix => $namespace) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why isn't this respecting the $prefixes
argument like the old logic ?
* properties `FormField::$document`, `$xpath`, `$node` and method `getLabel()` | ||
* methods `Form::getFormNode()` and `addField()` | ||
* property `AbstractUriElement::$node`, and methods `getNode()` and `setNode()` | ||
* methods `Crawler::add()`, `addDocument()`, `addNodeList()`, `addNode()`, `getNode()` and `sibling()` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type widening for getNode
is a BC impacting caller code rather than child classes due to widening a return type instead of a parameter type (same for other getters in other classes). This should be highlighted IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is not just theoretical. https://github.com/minkphp/MinkBrowserKitDriver will be totally broken by this BC break affecting consumers of the library (and immediately because of the BC break in BrowserKit switching to the BC-breaking implementation)
BrowserKit | ||
---------- | ||
|
||
* Leverage the native HTML5 parser when using PHP 8.4+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a BC break for code interacting with DomCrawler methods returning a raw node as the type has changed. this should be documented as such.
This PR replaces #54383
It takes the BC breaking approach by widening some property and method types to accept / return classes from both
Dom\*
andDOM*
sets.This widening is a hard BC break for child classes the re-declare the corresponding properties/methods.
I expect this situation to be rather uncommon, so I don't think this will be a big-bang BC break.
Fortunately, this change in signature will be easy to spot since the PHP engine will fail early.
In addition, the
Crawler
class is made generic: once an instance is created, it's going to stick to one set, eitherDom\*
orDOM*
classes.To further limit the impact,
Crawler::__construct()
allows building onlyCrawler<DOMNode>
instances and aDomCrawler
class is introduced to createCrawler<Dom\Node>
ones.As a reminder, the goal of this work is to be able to remove the dependency on
masterminds/html5
in Symfony 8.It is NOT to force moving to the new
Dom\*
API.Still, in order to achieve this goal and because we want HTML5-parsing to be the default,
BrowserKit
will leverage the newDomCrawler
class when running on PHP 8.4+.The new
Dom\*
API has some behavioral changes, all legit and documented at https://wiki.php.net/rfc/opt_in_dom_spec_complianceThe
Crawler
implementation ensures some fundamental behaviors remain, such as node names being returned in lowercase, empty values/texts being returned as the empty string instead of null, or void tags to not have closing tags. I don't think we'd gain much by propagating these changes brought by theDom\*
API to theCrawler
one. It'd actually make adopting the newDom\*
implementation harder for little to no benefits.The BC break is a decision we have to make. I think it's the best possible approach because in practice, the impact will be limited. The alternative taken in #54383 is to create two independent type hierarchies. This will create a situation where suddenly, all the existing crawler-related code will need to be updated. Better not when there's a less costly path - even though it'll break some edge-cases.