[DomCrawler] Terrible performance #12298


Closed
icaine opened this issue Oct 23, 2014 · 24 comments

Comments

@icaine

icaine commented Oct 23, 2014

When trying to parse info from a website, I discovered that DomCrawler is terribly slow and unoptimized. With a few little changes I improved performance by more than 150x (11500 ms -> 65 ms).

The main culprit is
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L862
because creation of DOMXPath is damn expensive.

I quick-fixed it with this (I know it's very, very dirty):

    private function createDOMXPath(\DOMDocument $document, array $prefixes = [])
    {
        // Dirty quick fix: keep a single DOMXPath in a static variable so it is
        // created only once (note: this ignores the $document argument afterwards)
        static $domxpath;
        if (empty($domxpath)) {
            $domxpath = new \DOMXPath($document);
            foreach ($prefixes as $prefix) {
                $namespace = $this->discoverNamespace($domxpath, $prefix);
                if (null !== $namespace) {
                    $domxpath->registerNamespace($prefix, $namespace);
                }
            }
        }

        return $domxpath;
    }

Another performance improvement (about 200 ms in my environment) can be gained if the results of CssSelector::toXPath($selector) at
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L675
are cached.
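
Caching those translations is straightforward because CssSelector::toXPath() is a pure string-to-string translation. A minimal sketch of the idea, as a hypothetical helper method inside the Crawler class (not existing API):

private static $cssToXPathCache = [];

private static function cachedCssToXPath($selector)
{
    // CssSelector::toXPath() always returns the same XPath expression for the
    // same selector, so memoizing the result per selector string is safe
    if (!isset(self::$cssToXPathCache[$selector])) {
        self::$cssToXPathCache[$selector] = \Symfony\Component\CssSelector\CssSelector::toXPath($selector);
    }

    return self::$cssToXPathCache[$selector];
}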

@stof
Member

stof commented Oct 23, 2014

I see a big drawback in your proposal: it does not recreate the DOMXPath when the DOMDocument is different.
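
One way to keep the cache while avoiding that problem would be to key it by the document's identity: a rough sketch building on the snippet above (not a proposed patch):

private function createDOMXPath(\DOMDocument $document, array $prefixes = [])
{
    static $cache = [];

    // Key the cache by document, so a different DOMDocument gets its own DOMXPath
    $key = spl_object_hash($document);
    if (!isset($cache[$key])) {
        $domxpath = new \DOMXPath($document);
        foreach ($prefixes as $prefix) {
            $namespace = $this->discoverNamespace($domxpath, $prefix);
            if (null !== $namespace) {
                $domxpath->registerNamespace($prefix, $namespace);
            }
        }
        // Caveat: entries are never evicted, so a long-running process would
        // keep every document (and its DOMXPath) alive forever
        $cache[$key] = $domxpath;
    }

    return $cache[$key];
}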

@icaine
Author

icaine commented Oct 23, 2014

I know (as I said, it's a quick fix and very, very dirty); it's only a demonstration of how the performance can be greatly improved.

@icaine
Author

icaine commented Oct 23, 2014

Moreover, Crawler has many methods (mostly namespace related) that seem to do the same job repeatedly in every subsequent Crawler instance; if those calls could be eliminated, it would improve performance even more.

Maybe building Crawlers as a tree with some shareable properties (DOMXPath, cssToXPath cache, etc.) would help?
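
For illustration, the shareable state could live in one small context object that every sub-crawler keeps a reference to (hypothetical names, just to make the idea concrete):

class CrawlerSharedContext
{
    /** @var \DOMXPath[] one DOMXPath per document, keyed by spl_object_hash() */
    public $domxpaths = [];

    /** @var string[] CSS selector => translated XPath expression */
    public $cssToXPath = [];

    /** @var string[] namespace prefix => namespace URI */
    public $namespaces = [];
}

Since PHP objects are shared by handle, every crawler in the tree would read and write the same caches.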

@icaine
Author

icaine commented Oct 23, 2014

Something like this: icaine@c4cf128?diff=unified

@jakzal
Contributor

jakzal commented Nov 2, 2014

@icaine are there any XML namespaces in your XPath queries?

@icaine
Author

icaine commented Nov 3, 2014

No, it's an HTML page. The problem is that the DOMXPath is recreated for every call of the filter() method.

The code below, run on an HTML page of about 240 kB, takes about 15.5 seconds, but with the changes from my commit it takes only 350 ms:

$channels = $crawler->filter('div.channel_program');
$this->result = $channels->each(function (Crawler $node, $i) use ($url) {
    $data = [];
    $channelName = $node->filter('div.head span.title')->first()->text();
    $channel = Channels::key($channelName);

    if (!$channel) {
        throw new InvalidStateException("Channel name '$channelName' could not be recognized.");
    }
    $data['channel'] = $channel;
    $data['channelName'] = $channelName;

    $data['programs'] = $node->filter('div.item')->each(function (Crawler $node, $i) use ($url) {
        $program['time'] = $this->parseDates($node->filter('span.time')->text());
        $program['title'] = $node->filter('span.title')->text();
        $program['desc'] = $node->filter('p')->text();
        $program['actions'] = [];
        $node->filter('a')->each(function (Crawler $node) use (&$program, $url) {
            $href = $node->attr('href');
            $action = Actions::classToAction(trim(Strings::replace($node->attr('class'), '~\s*button\s*~', '')));
            $program['actions'][$action] = Strings::startsWith($href, 'http') ? $href : $this->parseId($href);
        });
        return $program;
    });

    return $data;
});

@jrmyio
Contributor

jrmyio commented Jan 15, 2015

Came across the same issue. I am waiting 15 minutes to update ~500 Doctrine records from an HTML table. I don't know what test data DomCrawler was tested with, but it has horrible performance on any sort of HTML table.

It seems more people have the same problem: http://stackoverflow.com/questions/22330008/should-symfony-domcrawler-goutte-be-this-slow

@icaine thanks for the hack. I see about a 5x improvement here, though it's still far too slow for what it has to do.

I also hacked in the CSS-to-XPath caching @icaine proposed:

    public function filter($selector)
    {
        if (!class_exists('Symfony\\Component\\CssSelector\\CssSelector')) {
            throw new \RuntimeException('Unable to filter with a CSS selector as the Symfony CssSelector is not installed (you can use filterXPath instead).');
        }

        // The CssSelector already prefixes the selector with descendant-or-self::
        // Memoize the translated XPath expression per CSS selector
        static $cssToXPathMap = [];
        if (isset($cssToXPathMap[$selector])) {
            $xPath = $cssToXPathMap[$selector];
        } else {
            $cssToXPathMap[$selector] = $xPath = CssSelector::toXPath($selector);
        }

        return $this->filterRelativeXPath($xPath);
    }

That also seems to reduce the time.

@jrmyio
Contributor

jrmyio commented Mar 19, 2015

Any update on this? With the changes I mentioned above, I increased speed from minutes to seconds.

@javiereguiluz
Member

@ConneXNL apparently @stof and @jakzal raised some valid concerns that haven't been solved yet.

@GromNaN
Member

GromNaN commented Mar 19, 2015

The DOMDocument should be wrapped into an object that also contains the DOMXPath.
Something like this: https://github.com/LExpress/Welldom/blob/master/src/Welldom/Document.php#L112
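
The gist of that wrapper, reduced to the DOMXPath part (a sketch with made-up names, not the actual Welldom code):

class Document
{
    private $dom;
    private $xpath;

    public function __construct(\DOMDocument $dom)
    {
        $this->dom = $dom;
    }

    public function getXPath()
    {
        // Lazily create the DOMXPath once, then reuse it for every
        // query on this document
        if (null === $this->xpath) {
            $this->xpath = new \DOMXPath($this->dom);
        }

        return $this->xpath;
    }
}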

@jakzal
Contributor

jakzal commented Mar 19, 2015

@icaine would you mind publishing the HTML page you're parsing somewhere? I'd love to profile your script.

@icaine
Author

icaine commented Mar 19, 2015

@jakzal unfortunately I abandoned the project. But it's not hard to simulate a similar use case: just find any website with some structured data and hundreds of records (TV program, e-shop, etc.) and try to extract some info.

@stof
Member

stof commented Mar 20, 2015

The hard part is that the Crawler class is currently mutable, allowing more content to be added to it. This means it is possible to have nodes belonging to multiple documents in the same crawler, which makes any optimisation much harder.
And the fact that it extends SplObjectStorage instead of wrapping it means that the Crawler exposes methods which allow messing with the internal state directly, which makes things even worse.

I see 2 main use cases for the crawler:

  • creating it for an HTML (or XML) content
  • creating it from a list of nodes coming from an existing crawler (with any of the DOM traversal methods).

The following API for the crawler would make things much simpler.

class Crawler
{
    public function __construct(\DomDocument $document, array $nodes = array())
    {
        $this->document = $document;

        // Validate that all DOMElement nodes belong to this DOMDocument
        // and store them in the internal storage
    }

    public static function fromHtml($html, $charset = 'UTF-8')
    {
        // Load the HTML into a DOMDocument (see the existing code for charset handling)
        $document = new \DOMDocument('1.0', $charset);
        $document->loadHTML($html);

        return new self($document, array($document->documentElement));
    }
}

Looking at the existing constructor, it may even be possible to preserve BC for most cases if we deprecate the possibility of using nodes from multiple documents (which would make such optimizations much easier).

@salaman

salaman commented Mar 21, 2015

I have a huge (~24 MB) HTML file of horribly nested tables to parse. Everything was working acceptably during dev runs with much smaller test files, so I assumed the processing time would scale linearly when I eventually fed it the entire file, but it ended up being unacceptably slow.

Using @icaine's edit, the processing time decreased a completely insane amount and actually made it usable.

I profiled both runs with xhprof (using the huge data file but limiting parsing to a few tables, since the slow run would probably take hours):

| Code               | Function calls | Wall time (s) | CPU time (s) | Memory (b) | Peak memory (b) |
| ------------------ | -------------- | ------------- | ------------ | ---------- | --------------- |
| Normal DomCrawler  | 53345          | 505.799       | 503.594      | 16,272,720 | 89,225,312      |
| Patched DomCrawler | 52873          | 3.502         | 3.426        | 16,272,992 | 89,225,928      |
| Difference         | -472           | -502.296      | -500.167     | 272        | 616             |

Most of the time is spent in Symfony\Component\DomCrawler\Crawler::discoverNamespace on the slow run.

As you can see, for my data, the edit makes my parser run in about 0.7% of the time, which is stellar.

I'd love to see this (at least partially) addressed in master.

@jakzal
Contributor

jakzal commented Mar 21, 2015

@salaman I'll be looking into this on Monday. Would you mind sending me the file and crawler queries you used to jakub AT zalas DOT pl? It would save me some time.

Btw, since namespace discovery is not always needed, we could make it possible to disable this feature if it's causing the issue.

@salaman

salaman commented Mar 21, 2015

@jakzal Sure thing, I'll clean up the file a bit since it's sort of integrated into a larger parser package and send it over so it can be used by itself.

@jakzal
Contributor

jakzal commented Mar 21, 2015

@salaman thanks. It was really helpful.

fabpot added a commit that referenced this issue Mar 23, 2015
This PR was merged into the 2.6 branch.

Discussion
----------

[DomCrawler] Improve namespace discovery performance

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #12298
| License       | MIT
| Doc PR        | -

Before this quick fix, XPath axes like `child::*` or `descendant-or-self::*` were considered namespace prefixes. As a result, far too many XPath queries were being executed during namespace discovery.
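
Conceptually, prefix discovery just needs to reject axis names when scanning the expression; something along these lines (a simplified illustration, not the exact patch):

private function findNamespacePrefixes($xpath)
{
    // Collect "prefix:" tokens, requiring a non-colon character after the
    // colon so that axes such as child:: or descendant-or-self:: are skipped
    if (preg_match_all('/(?P<prefix>[a-z_][a-z_0-9\-\.]*):[^:]/i', $xpath, $matches)) {
        return array_unique($matches['prefix']);
    }

    return [];
}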

Here's a full blackfire before/after comparison of @salaman's script: https://blackfire.io/profiles/compare/a80b9e77-8e55-45d3-a348-7d34a51053b6/graph

![domcrawler-xpath-query](https://cloud.githubusercontent.com/assets/190447/6767384/ba93c57a-d024-11e4-84e1-e58dd7527f03.png)

Commits
-------

b6af002 [DomCrawler] Improve namespace discovery performance
fabpot closed this as completed Mar 23, 2015
@jrmyio
Contributor

jrmyio commented Apr 23, 2015

In what way was this fixed? I still have terrible performance without the changes listed above.

@jakzal
Contributor

jakzal commented Apr 23, 2015

@ConneXNL if you provide a script reproducing your issue, I'll be happy to look into it. The current issue was related to the namespace discovery logic (see #14012).

@scheb
Contributor

scheb commented Jul 24, 2016

Just ran into this issue when iterating over a list of XML nodes and doing a bunch of XPath queries on each list item. The sample document was ~2.5 MB.

filterRelativeXPath() creates a sub-crawler for each XPath query. When the query is then executed on the sub-crawler, it has to run the very slow discoverNamespace() method, because the namespace cache (the $namespaces attribute of the Crawler class) is not copied over to the sub-crawler. By changing the createSubCrawler() method as follows, I was able to boost performance significantly:

private function createSubCrawler($nodes)
{
    $crawler = new static($nodes, $this->uri, $this->baseHref);
    $crawler->namespaces = $this->namespaces; // Added this line

    return $crawler;
}

It works for my use-case, but I don't know what implications this could have for other use-cases.

@jakzal
Contributor

jakzal commented Jul 24, 2016

@scheb could you please provide the script and the document that exhibit the issue?

@scheb
Contributor

scheb commented Jul 25, 2016

Here's a script to reproduce:

<?php
include __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$xmlRaw = file_get_contents('http://api.elsevier.com/content/article/PII:S0370269312002560?httpAccept=text/xml');

$crawler = new Crawler();
$crawler->addXmlContent($xmlRaw);
$crawler->registerNamespace("ce", "http://www.elsevier.com/xml/common/dtd");

$nodes = $crawler->filterXPath('//ce:author-group/ce:affiliation');
$nodes->each(function (Crawler $crawler) {
    $startTime = microtime(true);
    $crawler->filterXPath('//ce:textfn')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";

    $startTime = microtime(true);
    $crawler->filterXPath('//ce:label')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";
});

Also, I want to mention that all our tests using the DomCrawler still pass, and they're nearly twice as fast.

@javiereguiluz
Member

@scheb I've tested the change that you propose:

    private function createSubCrawler($nodes)
    {
        $crawler = new static($nodes, $this->uri, $this->baseHref);
        $crawler->isHtml = $this->isHtml;
        $crawler->document = $this->document;
+       $crawler->namespaces = $this->namespaces;

        return $crawler;
    }

And the performance improvement is astonishing:

(blackfire comparison screenshot)

Now @jakzal and others will evaluate if this change is appropriate. Thanks!

@stof
Member

stof commented Jul 25, 2016

@javiereguiluz the change is totally legitimate. I opened a PR already to apply it.

nicolas-grekas added a commit that referenced this issue Jul 25, 2016
[DomCrawler] Inherit the namespace cache in subcrawlers (stof)

This PR was merged into the 2.8 branch.

Discussion
----------

[DomCrawler] Inherit the namespace cache in subcrawlers

| Q             | A
| ------------- | ---
| Branch?       | 2.8
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #12298
| License       | MIT
| Doc PR        | n/a

This makes subcrawlers inherit any already discovered/registered namespaces, improving performance when using namespaces.

I submitted to 2.8 rather than 2.7 because the namespace mapping feature was actually buggy in 2.x: nodes could belong to different documents in the same Crawler while the namespace map was shared. The fact that the map was not inherited by subcrawlers mitigated this issue (by reducing the chances of having multiple documents in the same subcrawler). 2.8 deprecated the possibility of having multiple documents, so I'm fine with applying this here.

Note that the subcrawler inherits the namespace cache at the time it is created, but the cache is not shared between instances (so if a subcrawler discovers an additional namespace of the document, it will not be available to the parent crawler or to other subcrawlers of the parent). Sharing the cache would be entirely possible (as they share the same document anyway) and would make the experience even better, removing the need to ensure that the root crawler discovers namespaces before filtering. But it would require moving from an array to an object, and I'm not sure we want to do this in a patch release. What do you think @symfony/deciders?
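
For reference, the array-to-object move could be as small as a tiny holder shared by reference (a hypothetical sketch, names made up):

class NamespaceCache
{
    /** @var string[] namespace prefix => namespace URI */
    private $namespaces = [];

    public function get($prefix)
    {
        return isset($this->namespaces[$prefix]) ? $this->namespaces[$prefix] : null;
    }

    public function set($prefix, $namespace)
    {
        $this->namespaces[$prefix] = $namespace;
    }
}

Because all crawlers for a document would hold the same instance, a namespace discovered by any subcrawler would immediately be visible to the parent and its siblings.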

Commits
-------

e89c758 [DomCrawler] Inherit the namespace cache in subcrawlers