DomCrawler encodes html entities inside script tags #54437

Closed
@ausi

Symfony version(s) affected

5.4.37

Description

Before parsing, the Crawler converts non-ASCII characters to HTML entities in

private function convertToHtmlEntities(string $htmlContent, string $charset = 'UTF-8'): string

which corrupts the contents of script tags (because HTML entities have no meaning between <script> and </script>, so the parser does not decode them there).
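
To illustrate the effect without the Crawler, here is a minimal sketch that assumes the conversion boils down to a call to `mb_encode_numericentity()` covering every code point from U+0080 upward. The helper name `encodeNonAsciiAsEntities()` is hypothetical, and the conversion map is an assumption about what `convertToHtmlEntities()` does, not a copy of it. The sketch needs the mbstring extension and a UTF-8 encoded source file.

```php
<?php

// Hypothetical stand-in for the entity conversion described above.
// Assumption: every code point >= U+0080 becomes a decimal numeric entity.
function encodeNonAsciiAsEntities(string $html, string $charset = 'UTF-8'): string
{
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], $charset);
}

$html = '<script>var foo = "bär";</script>';

// The conversion is applied to the whole string, so it cannot tell
// script content apart from regular markup.
echo encodeNonAsciiAsEntities($html), "\n";
// prints: <script>var foo = "b&#228;r";</script>
```

Because the conversion runs on the raw string before any parsing happens, the entity ends up inside the script element, where it is kept as literal text.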

How to reproduce

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addContent('<!doctype html><html><script>var foo = "bär";</script></html>', 'text/html; charset=UTF-8');
// Read the text content of the <script> element back from the parsed document.
echo $crawler->filterXPath('//script')->text();
// output: var foo = "b&#228;r";
// expected: var foo = "bär";

Possible Solution

I’m not sure what the best fix is, because convertToHtmlEntities() cannot distinguish between content outside and inside <script>. But since convertToHtmlEntities() seems to exist only to work around issues with the libxml parser, maybe it can be skipped when Masterminds\HTML5 is used?
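
The idea could be sketched roughly as follows. This is not the Crawler's actual code: `prepareHtmlForParser()`, `toNumericEntities()`, and the `$usesHtml5Parser` flag are hypothetical names introduced only for illustration, and the conversion map is an assumption about what `convertToHtmlEntities()` does.

```php
<?php

// Hypothetical stand-in for convertToHtmlEntities().
// Assumption: non-ASCII code points become decimal numeric entities.
function toNumericEntities(string $html, string $charset = 'UTF-8'): string
{
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], $charset);
}

// Hypothetical helper: only apply the libxml workaround when libxml
// is the parser that will actually receive the markup.
function prepareHtmlForParser(string $html, bool $usesHtml5Parser, string $charset = 'UTF-8'): string
{
    if ($usesHtml5Parser) {
        // Pass the markup through untouched, so script content stays intact.
        return $html;
    }

    return toNumericEntities($html, $charset);
}

$html = '<script>var foo = "bär";</script>';

echo prepareHtmlForParser($html, true), "\n";  // markup is left as-is
echo prepareHtmlForParser($html, false), "\n"; // non-ASCII characters are encoded
```

This keeps the workaround for the libxml path, so the script corruption would still occur there. It only avoids the problem when the HTML5 parser handles the document.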

Additional Context

No response
