-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
Leverage PHP 8.4's native parsing of HTML5 #53666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
https://wiki.php.net/rfc/opt_in_dom_spec_compliance is now to take into consideration as well. |
I tried things but I keep getting blocked. The problem I'm facing is the following: Indeed, when parsing HTML with the new I'm still thinking of the best way to leverage this new parsing method. One solution I could think of is creating a new Crawler, extending the existing one. Something like |
How do the new namespaced DOM classes relate to the old ones? |
They doesn't relate to the old ones actually. Here's the stub file: https://github.com/php/php-src/blob/cc0260e0140f42fb98a4da52e0fbc715de6c5cad/ext/dom/php_dom.stub.php. This is the pain point. The new extension is completely opt-in and separated from the old implementation. Thanks to a recent addition, the new |
Since this is WIP on php-src's side, maybe we can try to hint them about what would help? /cc @nielsdos FYI |
I see. In that case, a new Crawler class would be our best option, I'm afraid. 😟 |
|
Also |
The class structure is pretty much finished. I only have one more RFC planned to add some more (minor) methods and fields to the new DOM classes. This might also come with a specialisation (i.e. optional-to-use subclass, although that shouldn't cause issues as "old DOM" has no equivalent) of the Element class called HTMLElement, but that hasn't been decided yet. I see that the crawler has signatures and fields tightly coupled to the old DOM classes, so indeed as these types have nothing in common this is a problem that propagates through the entire component. As noted above, this is not limited to types but impacts fields (and methods) too. The reason the old and new classes have nothing in common is because their behaviour is incompatible and the types of methods and fields are slightly different to fix things like nullability etc. On the ML people seemed happy to not have aliases but have real classes with the correct types instead. I originally included an importModernNode analogue of importLegacyNode, but dropped it from the proposal because I found out it is rather easy to create documents that are representable in "new DOM" but unrepresentable in "old DOM". As such, it seemed better to me to not give the false impression that documents from "new DOM" could be converted unconditionally to "old DOM". The inverse is however always possible: you can always convert "old DOM" documents to "new DOM" documents. |
Thank you for the complete explanation @nielsdos! This comfort me in the idea that we may rewrite part of the component to support both old and new DOM classes. A viable solution would be to create a new sub namespace of classes extending the existing ones. We would then widen types of required method and properties (and also change some private visibilities to I think this solution is nice because it would avoid double maintenance. The current Crawler could then be deprecated when PHP 8.4 becomes the minimal version for Symfony (in version 8.0 I guess). What do you think? |
I can't really comment much about this symfony component, as I don't really have experience with Symfony to be honest. |
What I would love is a plan that involves polyfilling classes in the new DOM namespace. It'd be fine to not implement the HTML5 parsing (although that could maybe be done with mastermind/html5). Allowing to use the new classes without having to wait for PHP 8.4 would be a major trigger to foster their adoption. PR welcome on https://github.com/symfony/polyfill-php84 |
That's not possible in the general case, if you want your polyfills to be based off the old DOM classes. |
Would it be possible to implement importLegacyNode in a polyfill? That might be enough as a transition point. |
To wrap the old classes into your own implementations? Sure. I don't know how you envision implementing the new DOM behaviour though? |
You mean what's described in https://wiki.php.net/rfc/opt_in_dom_spec_compliance#methods right? Is that exhaustive? |
Yes, the RFC is exhaustive in its issue descriptions. The problems are not limited to the I realize that I've been a bit vague in this thread about concrete issues that might arise when trying to polyfill Let's say you create a polyfill for Notice how the serialization erroneously placed the That's something you can overcome, although I wonder whether you really want to build and maintain an XML serializer (it's no fun). Also, here is something that is impossible to implement using the old DOM classes: https://3v4l.org/75RiC string(4) "foo1"
string(7) "default"
string(4) "foo2"
string(7) "default"
<?xml version="1.0"?>
<root xmlns="urn:a" xmlns:default="urn:a" default:foo1="" default:foo2=""/> Notice how the namespace prefix of the attributes is not kept (as seen in the first 4 lines). This happens because with the old DOM classes you can't have conflicting declared namespace prefixes on the same element. string(6) "a:foo1"
string(1) "a"
string(6) "b:foo2"
string(1) "b"
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="urn:a" xmlns:a="urn:a" a:foo1="" a:foo2=""/> The attributes retain their prefixes (as you can see in the first 4 lines), and the serialization resolves the conflicting namespaces to use the same prefix. |
Got it, thanks for the explanations. Then I'm wondering if there's a subset that could be made to work when namespaces are not in use? On our side, we need to widen the types to accept both the new and the old classes. And we need to deprecate ParserInterface in HtmlSanitizer, which can be replaced by a simple closure instead. FTR, here are the APIs we have that are related to `\DOM*`.
|
If namespaces are not used, things become easier. Unfortunately, even HTML5 documents can use multiple namespaces, although HTML5 documents that come straight out of the parser should be fine. There's still many other issues of course that need to be worked around (as noted by the RFC).
That would be nice, but also hard to do. To implement the new DOM classes I refactored some internals of the DOM extension such that the old and new classes can share some core functionality. Having an extension just providing the new classes in PECL would mean that I have to decouple the implementation from the current DOM extension. This would be a lot of effort and I'd rather not do this. All the development work already took hundreds of hours of spare time and I don't feel like doing that again to be honest.
All sounds reasonable. |
Seems like the best thing to do. I'll have a look at this, I already played a few hours with the new classes in the current Crawler. I think having a new Thanks for taking that much time to clarify the situation! |
Replacing ParserInterface with a closure would not help much IMO. We still need to define the signature of that closure. |
After some back and forth on the approach to take, I think we've come up with a solution that seems to fit the bill. Thanks to Niels and the various amendments made to the new DOM extension during development, all the features of the current Crawler have been transposed to the new native HTML parser. Many questions have arisen concerning the public API for using the new classes. Alexander proposed the solution of duplicating existing classes with the introduction of traits to mutualize the unchanged parts between the current and new Crawler. This allows us to have new classes that accept modern nodes, and to be free in terms of parameter typing with regard to the BC promise. The new classes are located in the The resulting PR is therefore quite substantial, and this is also due to the fact that the tests have been repeated for the new classes in particular. If I can help with the review in any way, please let me know. I know it can be quite a challenge, as there's a lot of stuff all over the place. |
Uh oh!
There was an error while loading. Please reload this page.
Description
DomCrawler currently relies on html
https://wiki.php.net/rfc/domdocument_html5_parser
Example
No response
The text was updated successfully, but these errors were encountered: