Optionally use html5-php to parse HTML in DomCrawler

We have received many complaints in the past that the Crawler does not properly parse HTML5 documents when the HTML5 shortcut features are being used.
The root cause is that `DomDocument::loadHtml` does not rely on an HTML5 parser, and so does not understand these parts of the spec.

A solution could be to add an optional dependency on https://github.com/Masterminds/html5-php, a userland implementation of an HTML5 parser. It still returns a DomDocument object after parsing, so everything else would keep working after that.

I'm not sure whether the usage of html5-php should be turned on just by the presence of that library, or should require an explicit opt-in:

- the explicit opt-in might be painful for users of the BrowserKit component, because BrowserKit handles the creation of the Crawler (so BrowserKit might need its own opt-in to control the DomCrawler one)
- automatically using html5-php might be bad for people *not* needing the HTML5 features for the page they load, as a userland implementation is probably slower than `DomDocument::loadHtml`, which is implemented in C (that's only a guess; I haven't done any benchmarking)
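
To make the two options concrete, here is a rough sketch of how a single switch could cover both of them. The `parseHtml()` function and its `$useHtml5Parser` argument are hypothetical names made up for this illustration, not an existing DomCrawler API. The only library calls assumed are `Masterminds\HTML5::loadHTML()`, which returns a `DOMDocument`, and the native `DOMDocument::loadHTML()`.

```php
<?php

/**
 * Hypothetical helper (not part of DomCrawler) illustrating both strategies:
 *   - null  => auto-detect: use html5-php whenever the library is installed
 *   - true  => explicit opt-in: require html5-php
 *   - false => always use the native parser
 */
function parseHtml(string $html, ?bool $useHtml5Parser = null): \DOMDocument
{
    $available = class_exists(\Masterminds\HTML5::class);

    if (null === $useHtml5Parser) {
        // Auto-detection: the mere presence of the library turns it on.
        $useHtml5Parser = $available;
    }

    if ($useHtml5Parser) {
        if (!$available) {
            throw new \LogicException('The HTML5 parser was requested but masterminds/html5 is not installed.');
        }

        // Userland HTML5 parser; still hands back a DOMDocument.
        return (new \Masterminds\HTML5())->loadHTML($html);
    }

    // Native parser (not HTML5-aware).
    $document = new \DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}
```

With a nullable flag like this, BrowserKit would only need to forward one value to control the behavior, and the default (`null` versus `false`) is exactly the open question above.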


Optionally use html5-php to parse HTML in DomCrawler #29280
