Optionally use html5-php to parse HTML in DomCrawler #29280

Closed
@stof


We have received many complaints in the past that the Crawler does not parse HTML5 documents properly when HTML5 shorthand features are used.
The root cause is that DOMDocument::loadHTML does not rely on an HTML5 parser, so it does not understand these parts of the spec.

A solution could be to add an optional dependency on https://github.com/Masterminds/html5-php, a userland implementation of an HTML5 parser. It still returns a DOMDocument object after parsing, so everything else would keep working unchanged after that.

I'm not sure whether the usage of html5-php should be turned on just by the presence of that library, or should require an explicit opt-in:

  • an explicit opt-in might be painful for users of the BrowserKit component, because BrowserKit handles the creation of the Crawler (so BrowserKit might need its own opt-in to control the DomCrawler opt-in)
  • automatically using html5-php might be bad for people who do not need HTML5 features for the pages they load, as a userland implementation is probably slower than DOMDocument::loadHTML, which is implemented in C (that's only a guess; I haven't done any benchmarking)
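
To make the two options concrete, here is a minimal sketch of how the parser selection could look. It is not an actual DomCrawler API: the `parseHtml()` function and its `$useHtml5Parser` flag are hypothetical names introduced for illustration. Passing `null` models the "enabled by the presence of the library" behaviour, while `true`/`false` model the explicit opt-in. The only library calls assumed are `Masterminds\HTML5::loadHTML()` and the native `DOMDocument::loadHTML()`.

```php
<?php

declare(strict_types=1);

use Masterminds\HTML5;

/**
 * Hypothetical helper (not part of DomCrawler) illustrating both strategies.
 *
 * @param bool|null $useHtml5Parser null  = auto-detect html5-php,
 *                                  true  = explicit opt-in,
 *                                  false = always use the native parser
 */
function parseHtml(string $html, ?bool $useHtml5Parser = null): DOMDocument
{
    // class_exists() triggers autoloading, so this is false when
    // masterminds/html5 is not installed.
    $available = class_exists(HTML5::class);

    // Auto-detection: fall back to "is the library present?".
    $useHtml5Parser ??= $available;

    if ($useHtml5Parser) {
        if (!$available) {
            throw new LogicException('The HTML5 parser was requested but masterminds/html5 is not installed.');
        }

        // html5-php returns a regular DOMDocument.
        return (new HTML5())->loadHTML($html);
    }

    // Native libxml-based parser; silence its warnings about unknown HTML5 tags.
    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    try {
        $document->loadHTML($html);
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    return $document;
}

// Usage: force the native parser regardless of what is installed.
$document = parseHtml('<!DOCTYPE html><html><body><p>Hello</p></body></html>', false);
echo $document->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

With a tri-state flag like this, BrowserKit would only need to forward a single value to whatever creates the Crawler, and the default (`null`) would decide whether installing the library alone changes behaviour.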

Labels: DomCrawler, RFC (Request For Comments: proposals about features that you want to be discussed)
