[DomCrawler] HTML5 not recognized when document starts with a comment #37681

Closed
@bert-w

Symfony version(s) affected: 4.4.11 (symfony/dom-crawler)

Description
Symfony's DomCrawler can use an HTML5 parser when the corresponding package (masterminds/html5) is installed. However, the crawler only selects this parser when an HTML5 doctype is the first non-whitespace content in the document. The following case therefore does not work (see reproduction):

How to reproduce
Consider the following file sample.html:

<!--
    This is a comment
-->
<!DOCTYPE html>
<html lang="en">
<body>
    <h1>Hello</h1>
</body>
</html>

Next, we create a crawler with this content:

$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('sample.html'), 'https://example.com');

The file above is parsed with the regular non-HTML5 parser.

As seen on this line https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186,
the condition selects parseXhtml instead of the expected parseHtml5:

$dom = null !== $this->html5Parser && strspn($content, " \t\r\n") === stripos($content, '<!doctype html>') ? $this->parseHtml5($content, $charset) : $this->parseXhtml($content, $charset);
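
To see why the condition fails for the sample, its two sides can be evaluated separately. This is a minimal sketch with the sample inlined as a string; the variable names are illustrative and are not taken from Crawler.php:

```php
<?php
// Sample document from the report, inlined so the snippet runs standalone.
$content = <<<'HTML'
<!--
    This is a comment
-->
<!DOCTYPE html>
<html lang="en">
<body>
    <h1>Hello</h1>
</body>
</html>
HTML;

// Length of the leading run of whitespace: 0, because the content starts with "<".
$leadingWhitespace = strspn($content, " \t\r\n");

// Offset of the first case-insensitive doctype match: it sits after the comment, so it is > 0.
$doctypeOffset = stripos($content, '<!doctype html>');

// The quoted condition picks the HTML5 parser only when both values are identical.
$wouldUseHtml5 = $leadingWhitespace === $doctypeOffset;

var_dump($leadingWhitespace, $doctypeOffset, $wouldUseHtml5);
```

The two values can only match when nothing but whitespace precedes the doctype. A document with no doctype at all also fails the check, because stripos then returns false.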

This causes problems, since the document is in fact HTML5.

P.S. I don't know whether the HTML sample above conforms to the spec.

Possible Solution
1)
A dirty fix I'm using is to discard all HTML comments with a regex:

$content = preg_replace('/<!--.*?-->/s', '', file_get_contents('sample.html'));
$crawler = new Crawler($content, ...);

This is unlikely to be a complete solution. I can imagine websites that have <script> tags or even other HTML elements before the <!DOCTYPE html> declaration. Again, I do not know whether this violates the HTML5 spec.
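
A narrower variant of the same workaround removes only the comments that precede the doctype and leaves comments elsewhere in the document untouched. This is a rough sketch; stripLeadingComments is a hypothetical helper and not part of symfony/dom-crawler, and it does not address the case of other elements appearing before the doctype:

```php
<?php
/**
 * Hypothetical helper: drop comments and whitespace that appear before
 * an HTML5 doctype. Content without such a prefix is returned unchanged.
 */
function stripLeadingComments(string $html): string
{
    // \A anchors at the start of the string. Each comment body is matched with
    // a tempered dot so that one match can never run across a "-->" terminator.
    // The lookahead requires the doctype to follow but does not consume it.
    $pattern = '/\A(?:\s*<!--(?:(?!-->).)*-->)+\s*(?=<!doctype html>)/is';

    // preg_replace returns null on error; fall back to the original input.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = "<!-- header -->\n<!DOCTYPE html>\n<html><body><!-- keep me --><h1>Hello</h1></body></html>";
$clean = stripLeadingComments($html);

echo $clean, PHP_EOL;
```

The result can then be passed to the Crawler constructor in place of the raw file contents, so that the doctype check sees the doctype at the start of the string.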

2)
Add an option to force the HTML5 parser for any content passed to the crawler. I do not know what the implications are, since non-HTML5 content would then also be parsed by the HTML5 parser.

Labels: Bug, DomCrawler, Good first issue, Help wanted