DomCrawler not getting text #8105

Closed
flip111 opened this issue May 20, 2013 · 11 comments

Comments

@flip111
Contributor

flip111 commented May 20, 2013

As far as I can tell from the documentation, ->text() should return everything between the tags, with inner tags stripped. So:
input = <p><b>hello</b> world</p>
A) $crawler->filter('p')->text() = "hello world"
B) $crawler->filter('p')->html() = "<b>hello</b> world"

Assuming this is in fact how it is supposed to work, there is a bug here:

require_once __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">
          <p>Hello</p>
          World!
        </p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

$test = $crawler->filter('p.message')->text();
var_dump($test);

Returns: "\r\n "
Expected: "Hello World!" (with some whitespace on the left and right)

If this is not a bug, then please regard it as:

  1. A request for better documentation of the html() and text() functions
  2. A feature request for A) "plaintext" and B) "innertext" similar to this parser: http://simplehtmldom.sourceforge.net/manual.htm

@lazyhammer
Contributor

@flip111: It's not a bug. The spec does not allow a p inside a p, so libxml fixes the markup for you. By the time you call text(), your DOM tree actually looks like this:

<!DOCTYPE html>
<html>
    <body>
        <p class="message">
          </p><p>Hello</p>
          World!

        <p>Hello Crawler!</p>
    </body>
</html>
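
A minimal sketch of the same effect, assuming only PHP's built-in DOM extension (which is backed by libxml) rather than DomCrawler. The compact markup and the variable names are illustrative, not taken from the report:

```php
<?php
// Sketch: reproduce the markup repair with PHP's built-in DOM extension only.
$html = '<!DOCTYPE html><html><body>'
      . '<p class="message"><p>Hello</p> World!</p>'
      . '</body></html>';

// Collect parser warnings (e.g. for the stray closing tag) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$message = $xpath->query('//p[@class="message"]')->item(0);

// The outer <p> is closed as soon as the inner <p> starts,
// so "Hello" and "World!" end up outside of it.
var_dump($message->textContent);
```

The dumped text should contain neither "Hello" nor "World!", which matches the whitespace-only result from text() in the original report.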

@flip111
Contributor Author

flip111 commented May 20, 2013

Thanks for your answer, lazyhammer, that clears up a lot. The source being crawled is not always under the control of the programmer using the crawler, so I'm relieved that the crawler doesn't break on markup that violates the HTML spec but tries to fix it instead.

To get the "fixed" HTML:

$html = '';
foreach ($crawler as $domElement) {
    // saveHTML() without an argument serializes the whole owner document
    $html .= $domElement->ownerDocument->saveHTML();
}
echo $html;

from: http://stackoverflow.com/a/9567835
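
As a hedged variation on the snippet above: `DOMDocument::saveHTML()` also accepts a node, which limits the output to that subtree. The sketch below assumes plain `DOMDocument` instead of the Crawler, and the markup and variable names are illustrative only:

```php
<?php
$html = '<!DOCTYPE html><html><body>'
      . '<p class="message"><p>Hello</p> World!</p>'
      . '</body></html>';

// Keep parser warnings out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Whole document, as repaired by the parser.
$fixedDocument = $doc->saveHTML();

// Only the first <p>; passing a node limits the output to that subtree.
$firstParagraph = $doc->getElementsByTagName('p')->item(0);
$fixedFragment = $doc->saveHTML($firstParagraph);

echo $fixedDocument, "\n", $fixedFragment, "\n";
```

Inside the Crawler loop, the equivalent call would be `$domElement->ownerDocument->saveHTML($domElement)`.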

I won't close this issue, as I think it is important that this behavior gets documented better, even though it comes from libxml and not from DomCrawler itself. The Symfony docs sometimes reference "3rd party" functionality with a short explanation, which would certainly be useful here.

@stof
Member

stof commented May 21, 2013

@flip111 Note that fixing the HTML this way is even part of the HTML5 spec.

@jakzal
Contributor

jakzal commented May 21, 2013

@flip111 Would you mind sending a PR to https://github.com/symfony/symfony-docs to document this behaviour?

@flip111
Contributor Author

flip111 commented May 21, 2013

@jakzal I tried (as you can see). I don't really get why it says: "flip111 wants to merge 1,614 commits into symfony:2.0 from flip111:patch-1"

1,614 commits! I just made a small change ...

@jakzal
Contributor

jakzal commented May 21, 2013

@flip111 Looks like you created a branch from master but you're trying to send a PR against 2.0.

@stloyd
Contributor

stloyd commented May 21, 2013

@flip111 You made the PR from master but tried to merge it into 2.0. You should create the patch on a branch based on 2.1 and send the PR against 2.1.

@flip111
Contributor Author

flip111 commented May 21, 2013

OK, I tried again with 2.1.

@stof
Member

stof commented May 21, 2013

@flip111 Your PR against 2.1 is still messed up because you haven't changed your branch, which is still based on master.

@flip111
Contributor Author

flip111 commented May 21, 2013

OK, I'll try again ... hopefully everything is fine this time.

@fabpot
Member

fabpot commented Jun 13, 2013

Closing as there is an issue for the docs now.

fabpot closed this as completed Jun 13, 2013