Description
We have received many complaints in the past that the Crawler does not properly parse HTML5 documents when the HTML5 shortcut features are used.
The root cause is that DomDocument::loadHtml does not rely on an HTML5 parser, and so does not understand all these specificities of the spec.
A solution could be to add an optional dependency on https://github.com/Masterminds/html5-php, a userland implementation of an HTML5 parser that still returns a DomDocument object after parsing (so everything else would keep working after that).
I'm not sure whether the usage of html5-php should be turned on just by the presence of that library, or should require an explicit opt-in:
- an explicit opt-in might be painful for users of the BrowserKit component, because BrowserKit handles the creation of the Crawler (so BrowserKit might need its own opt-in as well to control the DomCrawler opt-in)
- automatically using html5-php might be bad for people not needing the HTML5 features for the page they load, as a userland implementation is probably slower than DomDocument::loadHtml, which is implemented in C (that's only a guess; I haven't done any benchmarking)
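
To make the two options concrete, here is a rough sketch of how the parser selection could look. It is not a proposed patch: the `parseHtml()` helper and its `$useHtml5Parser` flag are hypothetical names invented for illustration, and passing `disable_html_ns` to html5-php is an assumption on my side about what would keep existing XPath queries working unchanged.

```php
<?php

// Hypothetical helper, not part of DomCrawler: parse HTML into a DOMDocument.
// $useHtml5Parser = null  -> auto-detect (use html5-php when it is installed)
// $useHtml5Parser = true  -> explicit opt-in (fails if the library is missing)
// $useHtml5Parser = false -> always use the native parser
function parseHtml(string $html, ?bool $useHtml5Parser = null): \DOMDocument
{
    $available = class_exists(\Masterminds\HTML5::class);

    // Fall back to auto-detection when the caller did not decide.
    $useHtml5Parser = $useHtml5Parser ?? $available;

    if ($useHtml5Parser) {
        if (!$available) {
            throw new \LogicException('masterminds/html5 is not installed.');
        }

        // Assumption: disabling the HTML namespace keeps the resulting
        // DOMDocument close to what DomDocument::loadHtml produces.
        $parser = new \Masterminds\HTML5(['disable_html_ns' => true]);

        return $parser->loadHTML($html);
    }

    // Native libxml-based parsing; collect parse warnings internally
    // instead of emitting PHP warnings for unknown HTML5 tags.
    $document = new \DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}
```

Calling `parseHtml($html)` corresponds to the "turned on just by the presence of the library" option, while `parseHtml($html, true)` or `parseHtml($html, false)` corresponds to the explicit opt-in. With the second variant, BrowserKit would need a way to forward that flag when it creates the Crawler.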