[DomCrawler] Terrible performance #12298


Closed
icaine opened this issue Oct 23, 2014 · 24 comments

Comments

@icaine

icaine commented Oct 23, 2014

When trying to parse info from a website, I discovered that DomCrawler is terribly slow and unoptimized. With a few little changes I improved performance by more than 150x (11500 ms -> 65 ms).

The main culprit is
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L862
because creation of DOMXPath is damn expensive.

I quick-fixed it with this (I know it's very, very dirty):

    private function createDOMXPath(\DOMDocument $document, array $prefixes = [])
    {
        // Dirty quick fix: keep a single DOMXPath in a static variable so it is
        // created only once (note: this ignores the $document argument afterwards)
        static $domxpath;
        if (empty($domxpath)) {
            $domxpath = new \DOMXPath($document);
            foreach ($prefixes as $prefix) {
                $namespace = $this->discoverNamespace($domxpath, $prefix);
                if (null !== $namespace) {
                    $domxpath->registerNamespace($prefix, $namespace);
                }
            }
        }

        return $domxpath;
    }

Another performance improvement (about 200 ms in my environment) can be gained if the results of CssSelector::toXPath($selector) at
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L675
are cached.
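
Caching those translations is straightforward because CssSelector::toXPath() is a pure string-to-string translation. A minimal sketch of the idea, as a hypothetical helper method inside the Crawler class (not existing API):

private static $cssToXPathCache = [];

private static function cachedCssToXPath($selector)
{
    // CssSelector::toXPath() always returns the same XPath expression for the
    // same selector, so memoizing the result per selector string is safe
    if (!isset(self::$cssToXPathCache[$selector])) {
        self::$cssToXPathCache[$selector] = \Symfony\Component\CssSelector\CssSelector::toXPath($selector);
    }

    return self::$cssToXPathCache[$selector];
}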

@stof
Member

stof commented Oct 23, 2014

I see a big drawback in your proposal: it does not recreate the DOMXPath when the DOMDocument is different.
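
One way to keep the cache while avoiding that problem would be to key it by the document's identity: a rough sketch building on the snippet above (not a proposed patch):

private function createDOMXPath(\DOMDocument $document, array $prefixes = [])
{
    static $cache = [];

    // Key the cache by document, so a different DOMDocument gets its own DOMXPath
    $key = spl_object_hash($document);
    if (!isset($cache[$key])) {
        $domxpath = new \DOMXPath($document);
        foreach ($prefixes as $prefix) {
            $namespace = $this->discoverNamespace($domxpath, $prefix);
            if (null !== $namespace) {
                $domxpath->registerNamespace($prefix, $namespace);
            }
        }
        // Caveat: entries are never evicted, so a long-running process would
        // keep every document (and its DOMXPath) alive forever
        $cache[$key] = $domxpath;
    }

    return $cache[$key];
}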

@icaine
Author

icaine commented Oct 23, 2014

I know (as I said, it's a quick fix and very, very dirty); it's only a demonstration of how the performance can be greatly improved.

@icaine
Author

icaine commented Oct 23, 2014

Moreover, Crawler has many methods (mostly namespace related) that seem to do the same job repeatedly in every subsequent Crawler instance; if those calls could be eliminated, it would improve performance even more.

Maybe building Crawlers as a tree with some shareable properties (DOMXPath, cssToXPath cache, etc.) would help?
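
For illustration, the shareable state could live in one small context object that every sub-crawler keeps a reference to (hypothetical names, just to make the idea concrete):

class CrawlerSharedContext
{
    /** @var \DOMXPath[] one DOMXPath per document, keyed by spl_object_hash() */
    public $domxpaths = [];

    /** @var string[] CSS selector => translated XPath expression */
    public $cssToXPath = [];

    /** @var string[] namespace prefix => namespace URI */
    public $namespaces = [];
}

Since PHP objects are shared by handle, every crawler in the tree would read and write the same caches.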

@icaine
Author

icaine commented Oct 23, 2014

Something like this: icaine@c4cf128?diff=unified

@jakzal
Contributor

jakzal commented Nov 2, 2014

@icaine are there any XML namespaces in your XPath queries?

@icaine
Author

icaine commented Nov 3, 2014

No, it's an HTML page. The problem is that the DOMXPath is recreated for every call of the filter() method.

The code below, run on an HTML page of about 240 kB, takes about 15.5 seconds, but with the changes from my commit it takes only 350 ms:

$channels = $crawler->filter('div.channel_program');
$this->result = $channels->each(function (Crawler $node, $i) use ($url) {
    $data = [];
    $channelName = $node->filter('div.head span.title')->first()->text();
    $channel = Channels::key($channelName);

    if (!$channel) {
        throw new InvalidStateException("Channel name '$channelName' could not be recognized.");
    }
    $data['channel'] = $channel;
    $data['channelName'] = $channelName;

    $data['programs'] = $node->filter('div.item')->each(function (Crawler $node, $i) use ($url) {
        $program['time'] = $this->parseDates($node->filter('span.time')->text());
        $program['title'] = $node->filter('span.title')->text();
        $program['desc'] = $node->filter('p')->text();
        $program['actions'] = [];
        $node->filter('a')->each(function (Crawler $node) use (&$program, $url) {
            $href = $node->attr('href');
            $action = Actions::classToAction(trim(Strings::replace($node->attr('class'), '~\s*button\s*~', '')));
            $program['actions'][$action] = Strings::startsWith($href, 'http') ? $href : $this->parseId($href);
        });
        return $program;
    });

    return $data;
});

@jrmyio
Contributor

jrmyio commented Jan 15, 2015

Came across the same issue. I am waiting 15 minutes to update ~500 Doctrine records from an HTML table. I don't know what test data DomCrawler was tested with, but it has horrible performance on any sort of HTML table.

It seems more people have the same problem: http://stackoverflow.com/questions/22330008/should-symfony-domcrawler-goutte-be-this-slow

@icaine thanks for the hack. I see about a 5x improvement here, though it's still far too slow for what it has to do.

I also hacked in the CSS-to-XPath caching @icaine proposed:

    public function filter($selector)
    {
        if (!class_exists('Symfony\\Component\\CssSelector\\CssSelector')) {
            throw new \RuntimeException('Unable to filter with a CSS selector as the Symfony CssSelector is not installed (you can use filterXPath instead).');
        }

        // The CssSelector already prefixes the selector with descendant-or-self::
        // Memoize the translated XPath expression per CSS selector
        static $cssToXPathMap = [];
        if (isset($cssToXPathMap[$selector])) {
            $xPath = $cssToXPathMap[$selector];
        } else {
            $cssToXPathMap[$selector] = $xPath = CssSelector::toXPath($selector);
        }

        return $this->filterRelativeXPath($xPath);
    }

That also seems to reduce the time.

@jrmyio
Contributor

jrmyio commented Mar 19, 2015

Any update on this? With the changes I mentioned above, I increased speed from minutes to seconds.

@javiereguiluz
Member

@ConneXNL apparently @stof and @jakzal raised some valid concerns that haven't been solved yet.

@GromNaN
Member

GromNaN commented Mar 19, 2015

The DOMDocument should be wrapped into an object that also contains the DOMXPath.
Something like this: https://github.com/LExpress/Welldom/blob/master/src/Welldom/Document.php#L112
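
The gist of that wrapper, reduced to the DOMXPath part (a sketch with made-up names, not the actual Welldom code):

class Document
{
    private $dom;
    private $xpath;

    public function __construct(\DOMDocument $dom)
    {
        $this->dom = $dom;
    }

    public function getXPath()
    {
        // Lazily create the DOMXPath once, then reuse it for every
        // query on this document
        if (null === $this->xpath) {
            $this->xpath = new \DOMXPath($this->dom);
        }

        return $this->xpath;
    }
}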

@jakzal
Contributor

jakzal commented Mar 19, 2015

@icaine would you mind publishing the HTML page you're parsing somewhere? I'd love to profile your script.

@icaine
Author

icaine commented Mar 19, 2015

@jakzal unfortunately I abandoned the project. But it's not hard to simulate a similar use case: just find any website with some structured data and hundreds of records (TV program, e-shop, etc.) and try to extract some info.

@stof
Member

stof commented Mar 20, 2015

The hard part is that the Crawler class is currently mutable, allowing more content to be added to it. This means it is possible to have nodes belonging to multiple documents in the same crawler, which makes any optimisation much harder.
And the fact that it extends SplObjectStorage instead of wrapping it means that the Crawler exposes methods which allow messing with the internal state directly, which makes things even worse.

I see 2 main use cases for the crawler:

  • creating it for an HTML (or XML) content
  • creating it from a list of nodes coming from an existing crawler (with any of the DOM traversal methods).

The following API for the crawler would make things much simpler.

class Crawler
{
    public function __construct(\DomDocument $document, array $nodes = array())
    {
        $this->document = $document;

        // Validate that all DOMElement nodes belong to this DOMDocument
        // and store them in the internal storage
    }

    public static function fromHtml($html, $charset = 'UTF-8')
    {
        // Load the HTML into a DOMDocument (see the existing code for charset handling)
        $document = new \DOMDocument('1.0', $charset);
        $document->loadHTML($html);

        return new self($document, array($document->documentElement));
    }
}

Looking at the existing constructor, it may even be possible to preserve BC for most cases if we deprecate the possibility of using nodes from multiple documents (which would make such optimizations much easier).

@salaman

salaman commented Mar 21, 2015

I have a huge (~24 MB) HTML file of horribly nested tables to parse. Everything was working acceptably during dev runs with much smaller test files, so I assumed the processing time would scale linearly when I eventually fed it the entire file, but it ended up being unacceptably slow.

Using @icaine's edit, the processing time decreased a completely insane amount and actually made it usable.

I profiled both runs with xhprof (using the huge data file but limiting parsing to a few tables, since the slow run would probably take hours):

| Code               | Function calls | Wall time (s) | CPU time (s) | Memory (b) | Peak memory (b) |
| ------------------ | -------------- | ------------- | ------------ | ---------- | --------------- |
| Normal DomCrawler  | 53345          | 505.799       | 503.594      | 16,272,720 | 89,225,312      |
| Patched DomCrawler | 52873          | 3.502         | 3.426        | 16,272,992 | 89,225,928      |
| Difference         | -472           | -502.296      | -500.167     | 272        | 616             |

Most of the time is spent in Symfony\Component\DomCrawler\Crawler::discoverNamespace on the slow run.

As you can see, for my data, the edit makes my parser run in about 0.7% of the time, which is stellar.

I'd love to see this (at least partially) addressed in master.

@jakzal
Contributor

jakzal commented Mar 21, 2015

@salaman I'll be looking into this on Monday. Would you mind sending me the file and crawler queries you used to jakub AT zalas DOT pl? It would save me some time.

Btw, since namespace discovery is not always needed, we could make it possible to disable this feature if it's causing the issue.

@salaman

salaman commented Mar 21, 2015

@jakzal Sure thing, I'll clean up the file a bit since it's sort of integrated into a larger parser package and send it over so it can be used by itself.

@jakzal
Contributor

jakzal commented Mar 21, 2015

@salaman thanks. It was really helpful.

fabpot added a commit that referenced this issue Mar 23, 2015
This PR was merged into the 2.6 branch.

Discussion
----------

[DomCrawler] Improve namespace discovery performance

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #12298
| License       | MIT
| Doc PR        | -

Before this quick fix, XPath axes like `child::*` or `descendant-or-self::*` were considered namespace prefixes. As a result, far too many XPath queries were being executed during namespace discovery.
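
Conceptually, prefix discovery just needs to reject axis names when scanning the expression; something along these lines (a simplified illustration, not the exact patch):

private function findNamespacePrefixes($xpath)
{
    // Collect "prefix:" tokens, requiring a non-colon character after the
    // colon so that axes such as child:: or descendant-or-self:: are skipped
    if (preg_match_all('/(?P<prefix>[a-z_][a-z_0-9\-\.]*):[^:]/i', $xpath, $matches)) {
        return array_unique($matches['prefix']);
    }

    return [];
}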

Here's a full blackfire before/after comparison of @salaman's script: https://blackfire.io/profiles/compare/a80b9e77-8e55-45d3-a348-7d34a51053b6/graph

![domcrawler-xpath-query](https://cloud.githubusercontent.com/assets/190447/6767384/ba93c57a-d024-11e4-84e1-e58dd7527f03.png)

Commits
-------

b6af002 [DomCrawler] Improve namespace discovery performance
fabpot closed this as completed Mar 23, 2015
@jrmyio
Contributor

jrmyio commented Apr 23, 2015

In what way was this fixed? I still have terrible performance without the changes listed above.

@jakzal
Contributor

jakzal commented Apr 23, 2015

@ConneXNL if you provide a script reproducing your issue, I'll be happy to look into it. The current issue was related to the namespace discovery logic (see #14012).

@scheb
Contributor

scheb commented Jul 24, 2016

Just ran into this issue when iterating over a list of XML nodes and doing a bunch of XPath queries on each list item. The sample document was ~2.5 MB.

filterRelativeXPath() creates a sub-crawler for each XPath query. When the query is then executed on the sub-crawler, it has to run the very slow discoverNamespace() method, because the namespace cache (the $namespaces attribute of the Crawler class) is not copied over to the sub-crawler. By changing the createSubCrawler() method as follows, I was able to boost performance significantly:

private function createSubCrawler($nodes)
{
    $crawler = new static($nodes, $this->uri, $this->baseHref);
    $crawler->namespaces = $this->namespaces; // Added this line

    return $crawler;
}

It works for my use-case, but I don't know what implications this could have for other use-cases.

@jakzal
Contributor

jakzal commented Jul 24, 2016

@scheb could you please provide the script and the document that exhibit the issue?

@scheb
Contributor

scheb commented Jul 25, 2016

Here's a script to reproduce:

<?php
include __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$xmlRaw = file_get_contents('http://api.elsevier.com/content/article/PII:S0370269312002560?httpAccept=text/xml');

$crawler = new Crawler();
$crawler->addXmlContent($xmlRaw);
$crawler->registerNamespace("ce", "http://www.elsevier.com/xml/common/dtd");

$nodes = $crawler->filterXPath('//ce:author-group/ce:affiliation');
$nodes->each(function (Crawler $crawler) {
    $startTime = microtime(true);
    $crawler->filterXPath('//ce:textfn')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";

    $startTime = microtime(true);
    $crawler->filterXPath('//ce:label')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";
});

Also, I want to mention that all our tests using the DomCrawler still pass, and they're nearly twice as fast.

@javiereguiluz
Member

@scheb I've tested the change that you propose:

    private function createSubCrawler($nodes)
    {
        $crawler = new static($nodes, $this->uri, $this->baseHref);
        $crawler->isHtml = $this->isHtml;
        $crawler->document = $this->document;
+       $crawler->namespaces = $this->namespaces;

        return $crawler;
    }

And the performance improvement is astonishing:

(blackfire comparison screenshot)

Now @jakzal and others will evaluate if this change is appropriate. Thanks!

@stof
Member

stof commented Jul 25, 2016

@javiereguiluz the change is totally legitimate. I opened a PR already to apply it.

nicolas-grekas added a commit that referenced this issue Jul 25, 2016
[DomCrawler] Inherit the namespace cache in subcrawlers (stof)

This PR was merged into the 2.8 branch.

Discussion
----------

[DomCrawler] Inherit the namespace cache in subcrawlers

| Q             | A
| ------------- | ---
| Branch?       | 2.8
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #12298
| License       | MIT
| Doc PR        | n/a

This makes subcrawlers inherit any already discovered/registered namespaces, improving performance when using namespaces.

I submitted to 2.8 rather than 2.7 because the namespace mapping feature was actually buggy in 2.x: nodes could belong to different documents in the same Crawler while the namespace map was shared. The fact that the map was not inherited by subcrawlers mitigated this issue (by reducing the chances of having multiple documents in the same subcrawler). 2.8 deprecated the possibility of having multiple documents, so I'm fine with applying this here.

Note that the subcrawler inherits the namespace cache at the time it is created, but the cache is not shared between instances (so if a subcrawler discovers an additional namespace of the document, it will not be available to the parent crawler or to other subcrawlers of the parent). Sharing the cache would be entirely possible (as they share the same document anyway) and would make the experience even better, removing the need to ensure that the root crawler discovers namespaces before filtering. But it would require moving from an array to an object, and I'm not sure we want to do this in a patch release. What do you think @symfony/deciders?
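
For reference, the array-to-object move could be as small as a tiny holder shared by reference (a hypothetical sketch, names made up):

class NamespaceCache
{
    /** @var string[] namespace prefix => namespace URI */
    private $namespaces = [];

    public function get($prefix)
    {
        return isset($this->namespaces[$prefix]) ? $this->namespaces[$prefix] : null;
    }

    public function set($prefix, $namespace)
    {
        $this->namespaces[$prefix] = $namespace;
    }
}

Because all crawlers for a document would hold the same instance, a namespace discovered by any subcrawler would immediately be visible to the parent and its siblings.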

Commits
-------

e89c758 [DomCrawler] Inherit the namespace cache in subcrawlers