Skip to content

[DomCrawler] text() method mangles UTF8 text #46822

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rvock opened this issue Jul 1, 2022 · 15 comments
Closed

[DomCrawler] text() method mangles UTF8 text #46822

rvock opened this issue Jul 1, 2022 · 15 comments

Comments

@rvock
Copy link

rvock commented Jul 1, 2022

Symfony version(s) affected

6.1.0

Description

The text() method mangles some UTF8 content, if the normalizeWhitespace option is used. This happens, because the preg_replace does not set the utf-8 modifier for preg_replace.

How to reproduce

Example XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<node>mą</node>

Example PHP Code:

$document = new \Symfony\Component\DomCrawler\Crawler(file_get_contents($sourcePath));
$text = $document->filter('node')->text();
$correctText = $document->filter('node')->getNode(0)->nodeValue;
print_r(array_map('dechex', array_map('ord', preg_split('//', $text))));
print_r(array_map('dechex', array_map('ord', preg_split('//', $correctText))));

Output (prints the hexcode for the content)

Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 0
)
Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 85
    [4] => 0
)

Possible Solution

The solution is simple: Set the utf8 modifier for the text() method:
https://github.com/symfony/dom-crawler/blob/6.1/Crawler.php#L558

return trim(preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text));

Additional Context

No response

@OskarStark
Copy link
Contributor

Open for a PR including a testcase?

@OskarStark OskarStark changed the title DomCrawler text() method mangles UTF8 text [DomCrawler] text() method mangles UTF8 text Jul 14, 2022
@nicolas-grekas
Copy link
Member

nicolas-grekas commented Jul 27, 2022

I tried this test on the code and tests still pass. Could this be a locale-sensitive behavior?
Can you provide us with a reproducer?

--- a/src/Symfony/Component/DomCrawler/Tests/AbstractCrawlerTest.php
+++ b/src/Symfony/Component/DomCrawler/Tests/AbstractCrawlerTest.php
@@ -265,6 +265,9 @@ abstract class AbstractCrawlerTest extends TestCase
         $crawler = $this->createTestCrawler()->filterXPath('//p');
         $this->assertSame('Elsa <3', $crawler->text(null, true), '->text(null, true) returns the text with normalized whitespace');
         $this->assertNotSame('Elsa <3', $crawler->text(null, false));
+
+        $document = $this->createCrawler('<?xml version="1.0" encoding="utf-8" standalone="yes"?><node>mą</node>');
+        $this->assertSame('mą', $document->text(null, true));
     }
 

@rvock
Copy link
Author

rvock commented Jul 28, 2022

Hmm... You're right. I cannot reproduce the error within a testcase in this repository.
I'll investigate and try to create a working test case.

@rvock
Copy link
Author

rvock commented Jul 28, 2022

@nicolas-grekas Yes, it's a behavior, that appears when setlocale is called before creating the Crawler:

setlocale(LC_CTYPE, null);
$document = $this->createCrawler('<?xml version="1.0" encoding="utf-8" standalone="yes"?><node>mą</node>');
$this->assertSame('', $document->text(null, true));

I know it's incorrect to call setlocale with null as second parameter 😉 The call is somewhere deep within the CMS used (TYPO3) and only happens in a specific context, when a method is called before the configuration is available:
https://github.com/TYPO3/typo3/blob/10.4/typo3/sysext/core/Classes/Utility/PathUtility.php#L165-L172

The u-modifier within the regular expression in the text() method fixes this bug.

@nicolas-grekas
Copy link
Member

What's your system locale? Eg what does setlocale(LC_CTYPE, '0'); return for you?

@rvock
Copy link
Author

rvock commented Jul 28, 2022

setlocale(LC_CTYPE, '0') returns the string C.

@nicolas-grekas
Copy link
Member

I'm sorry I'm unable to reproduce.
We cannot blindly add the u modifier as that might 1. break parsing non-UTF8 html 2. trim more "spaces" than required.

@rvock
Copy link
Author

rvock commented Aug 2, 2022

Hm, very strange. I am currently using the following test-file (without symfony or any other dependencies):

<?php

setlocale(LC_CTYPE, null);
$text = '';
$replaced = preg_replace('/(?:\s{2,}+|[^\S ])/', ' ', $text);
var_dump($replaced);
print_r(array_map('dechex', array_map('ord', preg_split('//', $replaced))));

I am using macOS 12.4 with PHP 8.1.5 (installed using Homebrew). The following command returns the incorrect string:

$ LC_ALL=en_US.UTF-8 php -f test.php

string(3) "m� "
Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 20
    [4] => 0
)

The following command returns the correct code:

$ LC_ALL=C php -f test.php
(or even invalid LC)
$ LC_ALL=invalid php -f test.php

string(3) "mą"
Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 85
    [4] => 0
)

I was not able to reproduce the buggy behavior on another system (I tried a couple linux/unix servers which all worked fine). But I was able to reproduce on another Mac. Which system are you using?

@nicolas-grekas
Copy link
Member

I'm on Linux. We could try replacing those \s by an explicit character class.
But this is still strange and this creates a doubt for the other \s in the code 😮‍💨

@rvock
Copy link
Author

rvock commented Aug 3, 2022

It's not the \s character class, but the \S class:

<?php

setlocale(LC_CTYPE, null);
$text = 'ą';
echo 'Original: ' . bin2hex($text) . "\n";

$replaced = preg_replace('/[^\S ]/', ' ', $text);

echo 'Replaced: ' . bin2hex($replaced) . "\n";

This outputs

Original: c485
Replaced: c420

This means, the "character" 85 get's replaced by 20 (a space). Maybe preg_replace sees the 85 as two chars. One \S and one space.

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Aug 3, 2022

Looks like pcre is using ctype to decide what is a space. Out of curiosity, what's the output of this script for you?
Also, what's the output of locale on your machine? I'm trying to figure in which locale x85 is a space.

nicolas-grekas added a commit that referenced this issue Aug 3, 2022
…ation (nicolas-grekas)

This PR was merged into the 4.4 branch.

Discussion
----------

[DowCrawler] Fix locale-sensitivity of whitespace normalization

| Q             | A
| ------------- | ---
| Branch?       | 4.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Tickets       | Fix #46822
| License       | MIT
| Doc PR        | -

Also aligning with https://infra.spec.whatwg.org/#ascii-whitespace

Commits
-------

a632fe2 [DowCrawler] Fix locale-sensitivity of whitespace normalization
@nicolas-grekas
Copy link
Member

Anyway, fixed in #47175

@rvock
Copy link
Author

rvock commented Aug 3, 2022

The script outputs the following:

9
a
b
c
d
20
85
a0

locale on Terminal outputs the following:

LANG="de_DE.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

@rvock
Copy link
Author

rvock commented Aug 3, 2022

I tested your commit and it works fine 👍 thanks :)

I rechecked my PHP version. Initially I said that I am using PHP 8.1.5. But the CLI used PHP 7.4.30.
I've got three different PHP versions installed:
PHP 8.1.8 and 8.0.21 work fine. (I've installed updates, that's why it's no longer 8.1.5)
PHP 7.4.30 is still buggy.

Maybe this is a strange PHP macOS Bug?

Anyways: Your commit fixes it for 7.4.30 and does not break anything in 8.1 or 8.0 for me :)

@nicolas-grekas
Copy link
Member

Thanks for all the insights. Seeing 85 there is unexpected to me but that goes beyond Symfony's responsibility. At least here we're done :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants