Skip to content

Leverage PHP 8.4's native parsing of HTML5 #53666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nicolas-grekas opened this issue Jan 29, 2024 · 22 comments · May be fixed by #54383
Open

Leverage PHP 8.4's native parsing of HTML5 #53666

nicolas-grekas opened this issue Jan 29, 2024 · 22 comments · May be fixed by #54383
Labels
DomCrawler Feature Help wanted Issues and PRs which are looking for volunteers to complete them.

Comments

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Jan 29, 2024

Description

DomCrawler currently relies on html

https://wiki.php.net/rfc/domdocument_html5_parser

Example

No response

@nicolas-grekas nicolas-grekas added the Help wanted Issues and PRs which are looking for volunteers to complete them. label Jan 29, 2024
@MatTheCat
Copy link
Contributor

https://wiki.php.net/rfc/opt_in_dom_spec_compliance is now to take into consideration as well.

@alexandre-daubois
Copy link
Member

I tried things but I keep getting blocked. The problem I'm facing is the following: Crawler class is not final, thus public methods cannot see their signature change. This is problematic because some central methods like addNode() cannot see their signature changed to addNode(\DOMNode|\DOM\Node).

Indeed, when parsing HTML with the new HTMLDocument::createFromString() method, it returns a \DOM\Document, composed of \DOM\Node and all other new classes. This prevents the existing methods to be called, because there's no way we convert the new \DOM\Node to legacy \DOMNode.

I'm still thinking of the best way to leverage this new parsing method. One solution I could think of is creating a new Crawler, extending the existing one. Something like NativeCrawler. This would allow to widen types of public methods. But before starting this work, I'd love to hear your opinion on this!

@derrabus
Copy link
Member

How do the new namespaced DOM classes relate to the old ones?

@alexandre-daubois
Copy link
Member

alexandre-daubois commented Mar 19, 2024

They doesn't relate to the old ones actually. Here's the stub file: https://github.com/php/php-src/blob/cc0260e0140f42fb98a4da52e0fbc715de6c5cad/ext/dom/php_dom.stub.php. This is the pain point. The new extension is completely opt-in and separated from the old implementation.

Thanks to a recent addition, the new DOM\Document::importLegacyNode() allows to create an instance a one of the new classes from an old class. But this doesn't work with all classes and it would involve a lot of type juggling, and I'm not even sure it's really possible to do something fully working in the end. 🤔

@nicolas-grekas
Copy link
Member Author

Since this is WIP on php-src's side, maybe we can try to hint them about what would help? /cc @nielsdos FYI

@derrabus
Copy link
Member

They doesn't relate to the old ones actually.

I see. In that case, a new Crawler class would be our best option, I'm afraid. 😟

@MatTheCat
Copy link
Contributor

FormFields also are impacted by these new classes, and by behavior changes (like nodeNames being uppercase).

@alexandre-daubois
Copy link
Member

Also Link, Image... Not sure what's the best way to update here. Creating a new crawler : maybe, but creating new classes used by it because we can't change the signature doesn't seem the best thing to do. Feels like a huge rewriting of the component is necessary at this point to support the new parser without breaking BC. 🤔

@nielsdos
Copy link

Since this is WIP on php-src's side, maybe we can try to hint them about what would help? /cc @nielsdos FYI

The class structure is pretty much finished. I only have one more RFC planned to add some more (minor) methods and fields to the new DOM classes. This might also come with a specialisation (i.e. optional-to-use subclass, although that shouldn't cause issues as "old DOM" has no equivalent) of the Element class called HTMLElement, but that hasn't been decided yet.

I see that the crawler has signatures and fields tightly coupled to the old DOM classes, so indeed as these types have nothing in common this is a problem that propagates through the entire component. As noted above, this is not limited to types but impacts fields (and methods) too.

The reason the old and new classes have nothing in common is because their behaviour is incompatible and the types of methods and fields are slightly different to fix things like nullability etc. On the ML people seemed happy to not have aliases but have real classes with the correct types instead.

I originally included an importModernNode analogue of importLegacyNode, but dropped it from the proposal because I found out it is rather easy to create documents that are representable in "new DOM" but unrepresentable in "old DOM". As such, it seemed better to me to not give the false impression that documents from "new DOM" could be converted unconditionally to "old DOM". The inverse is however always possible: you can always convert "old DOM" documents to "new DOM" documents.

@alexandre-daubois
Copy link
Member

Thank you for the complete explanation @nielsdos!

This comfort me in the idea that we may rewrite part of the component to support both old and new DOM classes. A viable solution would be to create a new sub namespace of classes extending the existing ones. We would then widen types of required method and properties (and also change some private visibilities to protected).

I think this solution is nice because it would avoid double maintenance. The current Crawler could then be deprecated when PHP 8.4 becomes the minimal version for Symfony (in version 8.0 I guess).

What do you think?

@nielsdos
Copy link

I can't really comment much about this symfony component, as I don't really have experience with Symfony to be honest.
But from a high-level PoV it sounds like a plan that should work.

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Mar 20, 2024

What I would love is a plan that involves polyfilling classes in the new DOM namespace. It'd be fine to not implement the HTML5 parsing (although that could maybe be done with mastermind/html5). Allowing to use the new classes without having to wait for PHP 8.4 would be a major trigger to foster their adoption.

PR welcome on https://github.com/symfony/polyfill-php84

@nielsdos
Copy link

What I would love is a plan that involves polyfilling classes in the new DOM namespace.

That's not possible in the general case, if you want your polyfills to be based off the old DOM classes.
Fixing the typing issues is possible, but what's not possible is fixing the behavioural differences, because the old DOM classes lack the necessary primitives to implement the new behaviours. In particular I'm thinking about namespace handling.

@nicolas-grekas
Copy link
Member Author

Would it be possible to implement importLegacyNode in a polyfill? That might be enough as a transition point.

@nielsdos
Copy link

Would it be possible to implement importLegacyNode in a polyfill? That might be enough as a transition point.

To wrap the old classes into your own implementations? Sure. I don't know how you envision implementing the new DOM behaviour though?

@nicolas-grekas
Copy link
Member Author

You mean what's described in https://wiki.php.net/rfc/opt_in_dom_spec_compliance#methods right? Is that exhaustive?
I don't have the answer, but finding one would be of great interest to the wider PHP community. What could we do about it?

@nielsdos
Copy link

Yes, the RFC is exhaustive in its issue descriptions. The problems are not limited to the methods section though, there's also a section about namespace bugs.
For example, see: https://wiki.php.net/rfc/opt_in_dom_spec_compliance#namespace_bug_examples

I realize that I've been a bit vague in this thread about concrete issues that might arise when trying to polyfill DOM\Node.
I'll give an example.

Let's say you create a polyfill for DOM\Node that wraps DOMNode. At some point, you need to make XML serialization work, i.e. DOM\Document::saveXML(?DOM\Node $node=null, int $options=0). It's tempting to reuse DOMDocument::saveXML(...), but that will, for example, lead to incorrect serialization for the following code: https://3v4l.org/W0gqk

Notice how the serialization erroneously placed the <no-namespace/> element inside the urn:a namespace instead of the empty namespace. The correct serialization should've been <no-namespace xmlns=""/>. The new DOM classes solve this by providing a new XML serialization implementation. But if you reuse the old XML serialization to implement your polyfill, you'll have to take issues like these into account.

That's something you can overcome, although I wonder whether you really want to build and maintain an XML serializer (it's no fun).

Also, here is something that is impossible to implement using the old DOM classes: https://3v4l.org/75RiC
This outputs the following:

string(4) "foo1"
string(7) "default"
string(4) "foo2"
string(7) "default"
<?xml version="1.0"?>
<root xmlns="urn:a" xmlns:default="urn:a" default:foo1="" default:foo2=""/>

Notice how the namespace prefix of the attributes is not kept (as seen in the first 4 lines). This happens because with the old DOM classes you can't have conflicting declared namespace prefixes on the same element.
The output should've been this instead (per spec):

string(6) "a:foo1"
string(1) "a"
string(6) "b:foo2"
string(1) "b"
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="urn:a" xmlns:a="urn:a" a:foo1="" a:foo2=""/>

The attributes retain their prefixes (as you can see in the first 4 lines), and the serialization resolves the conflicting namespaces to use the same prefix.
I don't see how you can polyfill behaviours like this.

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Mar 20, 2024

Got it, thanks for the explanations. Then I'm wondering if there's a subset that could be made to work when namespaces are not in use?
And last but not least: could it be worth to ship the new DOM classes as a standalone extension for PHP 8.2+?

On our side, we need to widen the types to accept both the new and the old classes. DomCrawler extends Crawler and we'd deprecate Crawler I suppose...

And we need to deprecate ParserInterface in HtmlSanitizer, which can be replaced by a simple closure instead.

FTR, here are the APIs we have that are related to `\DOM*`.
src/Symfony/Component/Translation/Util/XliffUtils.php:    public static function getVersionNumber(\DOMDocument $dom): string
src/Symfony/Component/Translation/Util/XliffUtils.php:    public static function validateSchema(\DOMDocument $dom): array
src/Symfony/Component/Config/Util/XmlUtils.php:    public static function parse(string $content, string|callable|null $schemaOrCallable = null): \DOMDocument
src/Symfony/Component/Config/Util/XmlUtils.php:    public static function loadFile(string $file, string|callable|null $schemaOrCallable = null): \DOMDocument
src/Symfony/Component/Config/Util/XmlUtils.php:    public static function convertDomElementToArray(\DOMElement $element, bool $checkPrefix = true): mixed
src/Symfony/Component/Console/Descriptor/XmlDescriptor.php:    public function getInputDefinitionDocument(InputDefinition $definition): \DOMDocument
src/Symfony/Component/Console/Descriptor/XmlDescriptor.php:    public function getCommandDocument(Command $command, bool $short = false): \DOMDocument
src/Symfony/Component/Console/Descriptor/XmlDescriptor.php:    public function getApplicationDocument(Application $application, ?string $namespace = null, bool $short = false): \DOMDocument
src/Symfony/Component/DomCrawler/AbstractUriElement.php:    protected \DOMElement $node;
src/Symfony/Component/DomCrawler/AbstractUriElement.php:    public function __construct(\DOMElement $node, ?string $currentUri = null, ?string $method = 'GET')
src/Symfony/Component/DomCrawler/AbstractUriElement.php:    public function getNode(): \DOMElement
src/Symfony/Component/DomCrawler/AbstractUriElement.php:    abstract protected function setNode(\DOMElement $node): void;
src/Symfony/Component/DomCrawler/Form.php:    public function __construct(\DOMElement $node, ?string $currentUri = null, ?string $method = null, ?string $baseHref = null)
src/Symfony/Component/DomCrawler/Form.php:    public function getFormNode(): \DOMElement
src/Symfony/Component/DomCrawler/Form.php:    protected function setNode(\DOMElement $node): void
src/Symfony/Component/DomCrawler/Image.php:    public function __construct(\DOMElement $node, ?string $currentUri = null)
src/Symfony/Component/DomCrawler/Image.php:    protected function setNode(\DOMElement $node): void
src/Symfony/Component/DomCrawler/Link.php:    protected function setNode(\DOMElement $node): void
src/Symfony/Component/DomCrawler/Crawler.php:    public function __construct(\DOMNodeList|\DOMNode|array|string|null $node = null, ?string $uri = null, ?string $baseHref = null, bool $useHtml5Parser = true)
src/Symfony/Component/DomCrawler/Crawler.php:    public function add(\DOMNodeList|\DOMNode|array|string|null $node): void
src/Symfony/Component/DomCrawler/Crawler.php:    public function addDocument(\DOMDocument $dom): void
src/Symfony/Component/DomCrawler/Crawler.php:    public function addNodeList(\DOMNodeList $nodes): void
src/Symfony/Component/DomCrawler/Crawler.php:    public function addNode(\DOMNode $node): void
src/Symfony/Component/DomCrawler/Crawler.php:    public function getNode(int $position): ?\DOMNode
src/Symfony/Component/DomCrawler/Crawler.php:    protected function sibling(\DOMNode $node, string $siblingDir = 'nextSibling'): array
src/Symfony/Component/DomCrawler/Field/ChoiceFormField.php:    public function addChoice(\DOMElement $node): void
src/Symfony/Component/DomCrawler/Field/FormField.php:    protected \DOMElement $node;
src/Symfony/Component/DomCrawler/Field/FormField.php:    protected \DOMDocument $document;
src/Symfony/Component/DomCrawler/Field/FormField.php:    protected \DOMXPath $xpath;
src/Symfony/Component/DomCrawler/Field/FormField.php:    public function __construct(\DOMElement $node)
src/Symfony/Component/DomCrawler/Field/FormField.php:    public function getLabel(): ?\DOMElement
src/Symfony/Component/HtmlSanitizer/Visitor/DomVisitor.php:    public function visit(\DOMDocumentFragment $domNode): ?NodeInterface
src/Symfony/Component/HtmlSanitizer/Parser/ParserInterface.php:    public function parse(string $html): ?\DOMNode;
src/Symfony/Component/HtmlSanitizer/Parser/MastermindsParser.php:    public function parse(string $html): ?\DOMNode
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castException(\DOMException $e, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castImplementation(\DOMImplementation $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castNode(\DOMNode $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castNameSpaceNode(\DOMNameSpaceNode $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castDocument(\DOMDocument $dom, array $a, Stub $stub, bool $isNested, int $filter = 0): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castCharacterData(\DOMCharacterData $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castAttr(\DOMAttr $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castElement(\DOMElement $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castText(\DOMText $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castDocumentType(\DOMDocumentType $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castNotation(\DOMNotation $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castEntity(\DOMEntity $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castProcessingInstruction(\DOMProcessingInstruction $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/VarDumper/Caster/DOMCaster.php:    public static function castXPath(\DOMXPath $dom, array $a, Stub $stub, bool $isNested): array
src/Symfony/Component/DependencyInjection/Loader/XmlFileLoader.php:    public function validateSchema(\DOMDocument $dom): bool
src/Symfony/Component/DependencyInjection/Loader/XmlFileLoader.php:    public static function convertDomElementToArray(\DOMElement $element): mixed
src/Symfony/Component/Routing/Loader/XmlFileLoader.php:    protected function parseNode(RouteCollection $collection, \DOMElement $node, string $path, string $file): void
src/Symfony/Component/Routing/Loader/XmlFileLoader.php:    protected function parseRoute(RouteCollection $collection, \DOMElement $node, string $path): void
src/Symfony/Component/Routing/Loader/XmlFileLoader.php:    protected function parseImport(RouteCollection $collection, \DOMElement $node, string $path, string $file): void
src/Symfony/Component/Routing/Loader/XmlFileLoader.php:    protected function loadFile(string $file): \DOMDocument

@nielsdos
Copy link

Then I'm wondering if there's a subset that could be made to work when namespaces are not in use?

If namespaces are not used, things become easier. Unfortunately, even HTML5 documents can use multiple namespaces, although HTML5 documents that come straight out of the parser should be fine. There's still many other issues of course that need to be worked around (as noted by the RFC).

And last but not least: could it be worth to ship the new DOM classes as a standalone extension for PHP 8.2+?

That would be nice, but also hard to do. To implement the new DOM classes I refactored some internals of the DOM extension such that the old and new classes can share some core functionality. Having an extension just providing the new classes in PECL would mean that I have to decouple the implementation from the current DOM extension. This would be a lot of effort and I'd rather not do this. All the development work already took hundreds of hours of spare time and I don't feel like doing that again to be honest.

On our side, we need to widen the types to accept both the new and the old classes. DomCrawler extends Crawler and we'd deprecate Crawler I suppose...
And we need to deprecate ParserInterface in HtmlSanitizer, which can be replaced by a simple closure instead.

All sounds reasonable.

@alexandre-daubois
Copy link
Member

alexandre-daubois commented Mar 20, 2024

On our side, we need to widen the types to accept both the new and the old classes. DomCrawler extends Crawler and we'd deprecate Crawler I suppose

Seems like the best thing to do. I'll have a look at this, I already played a few hours with the new classes in the current Crawler. I think having a new DomCrawler will solve most (all?) the blockers I faced.

Thanks for taking that much time to clarify the situation!

@stof
Copy link
Member

stof commented Mar 21, 2024

Replacing ParserInterface with a closure would not help much IMO. We still need to define the signature of that closure.

@alexandre-daubois
Copy link
Member

After some back and forth on the approach to take, I think we've come up with a solution that seems to fit the bill.

Thanks to Niels and the various amendments made to the new DOM extension during development, all the features of the current Crawler have been transposed to the new native HTML parser.

Many questions have arisen concerning the public API for using the new classes. Alexander proposed the solution of duplicating existing classes with the introduction of traits to mutualize the unchanged parts between the current and new Crawler. This allows us to have new classes that accept modern nodes, and to be free in terms of parameter typing with regard to the BC promise.

The new classes are located in the NativeCrawler sub namespace. These classes only support modern nodes, not legacy ones. As a result, there are no new deprecations in this code.

The resulting PR is therefore quite substantial, and this is also due to the fact that the tests have been repeated for the new classes in particular. If I can help with the review in any way, please let me know. I know it can be quite a challenge, as there's a lot of stuff all over the place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DomCrawler Feature Help wanted Issues and PRs which are looking for volunteers to complete them.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants