Symfony version(s) affected: 4.4.11 (symfony/dom-crawler)
Description
Symfony's DOM crawler can use an HTML5 parser when the respective package (masterminds/html5) is installed. However, the crawler only selects this parser when an HTML5 doctype is the first content in the HTML, ignoring leading whitespace. The following situation therefore does not work (see reproduction):
How to reproduce
Consider the following file sample.html:
<!--
This is a comment
-->
<!DOCTYPE html>
<html lang="en">
<body>
<h1>Hello</h1>
</body>
</html>
Next, we create a crawler with this content:
$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('sample.html'), 'https://example.com');
The file above is now parsed using the regular non-HTML5 parser.
As seen on this line, https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186, the condition evaluates to parseXhtml instead of the expected parseHtml5:
$dom = null !== $this->html5Parser && strspn($content, " \t\r\n") === stripos($content, '<!doctype html>') ? $this->parseHtml5($content, $charset) : $this->parseXhtml($content, $charset);
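To make the failure concrete, here is a minimal sketch that isolates the string comparison from the line above and runs it against two inputs. The function name startsWithHtml5Doctype is hypothetical and not part of Symfony; the real condition additionally requires that the HTML5 parser is available.

```php
<?php
// Hypothetical helper mirroring the string check quoted above; not part of Symfony.
function startsWithHtml5Doctype(string $content): bool
{
    // strspn() gives the length of the leading whitespace run;
    // stripos() gives the offset of the first doctype match, or false if absent.
    return strspn($content, " \t\r\n") === stripos($content, '<!doctype html>');
}

// Doctype preceded only by whitespace: both sides are 3.
$plain = "\n  <!DOCTYPE html>\n<html><body><h1>Hello</h1></body></html>";

// Doctype preceded by a comment: strspn() is 0, stripos() is 27.
$commented = "<!--\nThis is a comment\n-->\n<!DOCTYPE html>\n<html><body><h1>Hello</h1></body></html>";

var_dump(startsWithHtml5Doctype($plain));     // bool(true)
var_dump(startsWithHtml5Doctype($commented)); // bool(false)
```

Because the comment occupies offset 0, the leading-whitespace length can never equal the doctype offset, so the commented document falls through to parseXhtml.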
This causes issues, since the file is actually an HTML5 document.
P.S. I don't know whether the HTML sample above conforms to the spec.
Possible Solution
1) A dirty fix I'm using is simply discarding any HTML comments using a regex:
$content = preg_replace('/<!--.*?-->/s', '', file_get_contents('sample.html'));
$crawler = new Crawler($content, ...);
This is unlikely to be a complete solution. I can imagine there being websites that have <script> tags or even other HTML elements before the <!DOCTYPE html> declaration. Again, I do not know if this is against the HTML5 spec.
2) Add a feature so the HTML5 parser can be forced for any content you pass. I have no clue what implications this has, because it would cause non-HTML5 content to be parsed by the HTML5 parser as well.
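For option 1, a slightly narrower variant of the regex workaround might look like the sketch below. It removes only comments that appear at the very start of the content and leaves comments inside the document untouched. The function name stripLeadingComments is hypothetical, and this is still a workaround under the same caveats: it does nothing for <script> tags or other elements placed before the doctype.

```php
<?php
// Hypothetical helper, not part of Symfony: drop only comments (and the
// whitespace before them) that appear at the very start of the content.
function stripLeadingComments(string $content): string
{
    // \A anchors at the start of the string; the "s" flag lets "." span newlines.
    // preg_replace() returns null on error, in which case the input is kept.
    return preg_replace('/\A(?:\s*<!--.*?-->)+/s', '', $content) ?? $content;
}

$html = "<!--\nThis is a comment\n-->\n<!DOCTYPE html>\n"
      . "<html lang=\"en\"><body><!-- kept --><h1>Hello</h1></body></html>";

$cleaned = stripLeadingComments($html);

// Same comparison the crawler performs: leading whitespace length vs. doctype offset.
$passesCheck = strspn($cleaned, " \t\r\n") === stripos($cleaned, '<!doctype html>');

var_dump($passesCheck);                              // bool(true)
var_dump(strpos($cleaned, '<!-- kept -->') !== false); // bool(true)
```

The cleaned string can then be passed to the Crawler constructor in place of the raw file contents, as in the snippet under option 1.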