[DomCrawler] text() method mangles UTF8 text #46822
Open for a PR including a testcase? |
I tried this test on the code and tests still pass. Could this be a locale-sensitive behavior?
--- a/src/Symfony/Component/DomCrawler/Tests/AbstractCrawlerTest.php
+++ b/src/Symfony/Component/DomCrawler/Tests/AbstractCrawlerTest.php
@@ -265,6 +265,9 @@ abstract class AbstractCrawlerTest extends TestCase
$crawler = $this->createTestCrawler()->filterXPath('//p');
$this->assertSame('Elsa <3', $crawler->text(null, true), '->text(null, true) returns the text with normalized whitespace');
$this->assertNotSame('Elsa <3', $crawler->text(null, false));
+
+ $document = $this->createCrawler('<?xml version="1.0" encoding="utf-8" standalone="yes"?><node>mą</node>');
+ $this->assertSame('mą', $document->text(null, true));
}
|
Hmm... You're right. I cannot reproduce the error within a testcase in this repository. |
@nicolas-grekas Yes, it's a behavior that appears when the following is run:
setlocale(LC_CTYPE, null);
$document = $this->createCrawler('<?xml version="1.0" encoding="utf-8" standalone="yes"?><node>mą</node>');
$this->assertSame('mą', $document->text(null, true));
I know it's incorrect to call setlocale with null as second parameter 😉 The call is somewhere deep within the CMS used (TYPO3) and only happens in a specific context, when a method is called before the configuration is available: The |
What's your system locale? Eg what does |
|
I'm sorry I'm unable to reproduce. |
Hm, very strange. I am currently using the following test file (without Symfony or any other dependencies):
<?php
setlocale(LC_CTYPE, null);
$text = 'mą';
$replaced = preg_replace('/(?:\s{2,}+|[^\S ])/', ' ', $text);
var_dump($replaced);
print_r(array_map('dechex', array_map('ord', preg_split('//', $replaced))));
I am using macOS 12.4 with PHP 8.1.5 (installed using Homebrew). The following command returns the incorrect string:
The following command returns the correct code:
I was not able to reproduce the buggy behavior on another system (I tried a couple linux/unix servers which all worked fine). But I was able to reproduce on another Mac. Which system are you using? |
I'm on Linux. We could try replacing those |
It's not the
<?php
setlocale(LC_CTYPE, null);
$text = 'ą';
echo 'Original: ' . bin2hex($text) . "\n";
$replaced = preg_replace('/[^\S ]/', ' ', $text);
echo 'Replaced: ' . bin2hex($replaced) . "\n";
This outputs
This means the "character" 85 gets replaced by 20 (a space). Maybe |
Looks like pcre is using ctype to decide what is a space. Out of curiosity, what's the output of this script for you? |
…ation (nicolas-grekas)
This PR was merged into the 4.4 branch.
Discussion
----------
[DowCrawler] Fix locale-sensitivity of whitespace normalization
| Q | A
| ------------- | ---
| Branch? | 4.4
| Bug fix? | yes
| New feature? | no
| Deprecations? | no
| Tickets | Fix #46822
| License | MIT
| Doc PR | -
Also aligning with https://infra.spec.whatwg.org/#ascii-whitespace
Commits
-------
a632fe2 [DowCrawler] Fix locale-sensitivity of whitespace normalization
Anyway, fixed in #47175 |
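For anyone following along, here is a minimal sketch of a locale-independent normalization. It assumes the approach suggested by the WHATWG link: spell out ASCII whitespace (tab, LF, FF, CR, space) explicitly instead of relying on \s, whose meaning can depend on LC_CTYPE. The function name normalizeAsciiWhitespace is hypothetical, and this is not necessarily the exact merged patch.

```php
<?php

// Hypothetical helper for illustration; not part of Symfony's API.
function normalizeAsciiWhitespace(string $text): string
{
    // Collapse runs of ASCII whitespace, and replace any single
    // non-space ASCII whitespace character, with one space.
    // The class lists the characters explicitly, so the locale's
    // ctype tables never decide what counts as whitespace.
    $pattern = '/(?:[ \n\r\t\x0C]{2,}+|[\n\r\t\x0C])/';

    // preg_replace() can return null on error; fall back to the input.
    return preg_replace($pattern, ' ', $text) ?? $text;
}

// "ą" is U+0105, encoded as the bytes c4 85 in UTF-8.
echo bin2hex(normalizeAsciiWhitespace("m\u{0105}")), "\n"; // 6dc485
echo normalizeAsciiWhitespace("Elsa \n\t <3"), "\n";       // Elsa <3
```

Because the byte 0x85 is not in the explicit class, the second byte of "ą" should be left alone whatever setlocale() was called with.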
The script outputs the following:
|
I tested your commit and it works fine 👍 thanks :) I rechecked my PHP version. Initially I said that I am using PHP 8.1.5. But the CLI used PHP 7.4.30. Maybe this is a strange PHP macOS Bug? Anyways: Your commit fixes it for 7.4.30 and does not break anything in 8.1 or 8.0 for me :) |
Thanks for all the insights. Seeing 85 there is unexpected to me but that goes beyond Symfony's responsibility. At least here we're done :) |
Symfony version(s) affected
6.1.0
Description
The text() method mangles some UTF8 content if the normalizeWhitespace option is used. This happens because the preg_replace call does not set the utf-8 modifier.
How to reproduce
Example XML:
Example PHP Code:
Output (prints the hexcode for the content)
Possible Solution
The solution is simple: Set the utf8 modifier for the text() method:
https://github.com/symfony/dom-crawler/blob/6.1/Crawler.php#L558
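A minimal sketch of what this suggestion could look like, using the same pattern as the standalone test file in the comments with the u modifier added. The function name normalizeWhitespaceUtf8 is hypothetical; this illustrates the idea only and is not the code in Crawler.php.

```php
<?php

// Hypothetical helper for illustration; not part of Symfony's API.
function normalizeWhitespaceUtf8(string $text): string
{
    // With the "u" modifier, pattern and subject are treated as UTF-8,
    // so matching works on code points rather than on single bytes.
    // preg_replace() returns null if the subject is not valid UTF-8,
    // so fall back to the unmodified input in that case.
    return preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text) ?? $text;
}

// "ą" is U+0105, encoded as the bytes c4 85 in UTF-8.
echo bin2hex(normalizeWhitespaceUtf8("m\u{0105}")), "\n"; // 6dc485
```

One trade-off to keep in mind: with the u modifier, preg_replace() rejects subjects that are not valid UTF-8, which is why the sketch falls back to the original string.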
Additional Context
No response