[DomCrawler] Terrible performance #12298
I see a big drawback in your proposal: it does not recreate the XPath when the DomDocument is different.
I know (as I said, it's a quick fix and very, very dirty); it's only a demonstration of how the performance can be greatly improved.
Moreover, Crawler has many methods (mostly namespace related) that seem to do the same job repeatedly in every subsequent Crawler instance, and if those calls could be eliminated it would improve performance even more. Maybe building Crawlers as a tree with some shareable properties (DOMXPath, cssToXPath cache, etc.) would help?
Something like this: icaine@c4cf128?diff=unified
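For illustration, a rough sketch of such a tree with shared state (hypothetical class and property names, not actual DomCrawler code): the parent and every subcrawler hold a reference to one context object, so the DOMXPath and the CSS-to-XPath cache are built once per document instead of once per crawler.

// Illustrative only; all names here are made up for this sketch.
class SharedContext
{
    public $xpath;                // \DOMXPath, created once per document
    public $cssToXPath = array(); // cache of CSS selector => XPath expression

    public function __construct(\DOMDocument $document)
    {
        $this->xpath = new \DOMXPath($document);
    }
}

class TreeCrawler
{
    private $context;
    private $nodes;

    public function __construct(SharedContext $context, array $nodes)
    {
        $this->context = $context;
        $this->nodes = $nodes;
    }

    public function filterXPath($query)
    {
        $result = array();
        foreach ($this->nodes as $node) {
            foreach ($this->context->xpath->query($query, $node) as $found) {
                $result[] = $found;
            }
        }

        // The subcrawler reuses the parent's context instead of rebuilding it.
        return new static($this->context, $result);
    }
}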
@icaine are there any XML namespaces in your XPath queries?
No, it's an HTML page. The problem is that the DOMXPath is recreated for every call of the filter() method. The code below, run on an HTML page of about 240 KB, takes about 15.5 seconds, but with the changes from my commit it takes only 350 ms:

$channels = $crawler->filter('div.channel_program');
$this->result = $channels->each(function (Crawler $node, $i) use ($url) {
    $data = [];
    $channelName = $node->filter('div.head span.title')->first()->text();
    $channel = Channels::key($channelName);
    if (!$channel) {
        throw new InvalidStateException("Channel name '$channelName' could not be recognized.");
    }
    $data['channel'] = $channel;
    $data['channelName'] = $channelName;
    $data['programs'] = $node->filter('div.item')->each(function (Crawler $node, $i) use ($url) {
        $program['time'] = $this->parseDates($node->filter('span.time')->text());
        $program['title'] = $node->filter('span.title')->text();
        $program['desc'] = $node->filter('p')->text();
        $program['actions'] = [];
        $node->filter('a')->each(function (Crawler $node) use (&$program, $url) {
            $href = $node->attr('href');
            $action = Actions::classToAction(trim(Strings::replace($node->attr('class'), '~\s*button\s*~', '')));
            $program['actions'][$action] = Strings::startsWith($href, 'http') ? $href : $this->parseId($href);
        });

        return $program;
    });

    return $data;
});
Came across the same issue. I am waiting 15 minutes to update ~500 Doctrine records from an HTML table. I don't know with what test data this DomCrawler was tested, but it has horrible performance on any sort of HTML table. It seems more people have the problem: http://stackoverflow.com/questions/22330008/should-symfony-domcrawler-goutte-be-this-slow

@icaine thanks for the hack. I think I see a 5 times improvement here, though it's still far too slow for what it has to do. I also hacked in the CSS-to-XPath caching @icaine proposed:

public function filter($selector)
{
    if (!class_exists('Symfony\\Component\\CssSelector\\CssSelector')) {
        throw new \RuntimeException('Unable to filter with a CSS selector as the Symfony CssSelector is not installed (you can use filterXPath instead).');
    }

    // The CssSelector already prefixes the selector with descendant-or-self::
    // Memoize the CSS-to-XPath translation; this static local persists across
    // all Crawler instances for the lifetime of the process.
    static $cssToXPathMap;
    if (isset($cssToXPathMap[$selector])) {
        $xPath = $cssToXPathMap[$selector];
    } else {
        $cssToXPathMap[$selector] = $xPath = CssSelector::toXPath($selector);
    }

    return $this->filterRelativeXPath($xPath);
}

That also seems to reduce the time.
Any update on this? With the changes I mentioned above, I increased speed from minutes to seconds.
The DomDocument should be wrapped into an object that also contains the DomXPath.
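A minimal sketch of that wrapper, assuming a hypothetical class name (this is not part of DomCrawler): the expensive \DOMXPath is created lazily, at most once per document, and reused for every query.

// Illustrative only.
class DocumentContext
{
    private $document;
    private $xpath;

    public function __construct(\DOMDocument $document)
    {
        $this->document = $document;
    }

    public function getXPath()
    {
        // Build the expensive \DOMXPath once and keep it with the document.
        if (null === $this->xpath) {
            $this->xpath = new \DOMXPath($this->document);
        }

        return $this->xpath;
    }
}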
@icaine would you mind publishing the HTML page you're parsing somewhere? I'd love to profile your script.
@jakzal unfortunately I abandoned the project. But it's not hard to simulate a similar use case: just find any website with some structured data and hundreds of records (TV program, e-shop, etc.) and try to extract some info.
The hard part is that the Crawler class is currently mutable, allowing more content to be added to it. This means it is possible to have nodes belonging to multiple documents in the same crawler, which makes any optimisation much harder. I see 2 main use cases for the crawler:

The following API for the crawler would make things much simpler:

class Crawler
{
    public function __construct(\DomDocument $document, array $nodes = array())
    {
        $this->document = $document;
        // Validate that all DomElement are for this DomDocument and store them in the internal storage
    }

    public static function fromHtml($html, $charset = 'UTF-8')
    {
        // ... load the HTML into $document (see existing code)

        return new self($document, array($document->documentElement));
    }
}

Looking at the existing constructor, it may even be possible to preserve BC for most cases if we deprecate the possibility to use nodes from multiple documents (which would make such optimizations much easier).
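Hypothetical usage of that proposed API (it does not exist in DomCrawler at this point), showing why it simplifies things:

// All nodes in the crawler are guaranteed to belong to this one document.
$crawler = Crawler::fromHtml('<html><body><p>Hello</p></body></html>');

// A filtering method could then create subcrawlers internally as
// new self($this->document, $matchedNodes),
// carrying per-document state (such as a single DOMXPath) along for free.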
I have a huge (~24 MB) HTML file of horribly nested tables to parse. Everything was working acceptably during dev runs with much smaller test files, so I assumed the processing time would scale linearly when I eventually fed it the entire file, but it ended up being unacceptably slow. Using @icaine's edit, the processing time decreased a completely insane amount and actually made it usable. I profiled both runs with xhprof (using the huge data file but limiting parsing to a few tables, since the slow run would probably take hours). Most of the time is spent in namespace discovery. As you can see, for my data, the edit makes my parser run in about 0.7% of the time, which is stellar. I'd love to see this (at least partially) addressed in master.
@salaman I'll be looking into this on Monday. Would you mind sending me the file and crawler queries you used to jakub AT zalas DOT pl? It would save me some time. Btw, since namespace discovery is not always needed, we could make it possible to disable this feature if it's causing the issue.
@jakzal Sure thing. I'll clean up the file a bit, since it's sort of integrated into a larger parser package, and send it over so it can be used by itself.
@salaman thanks. It was really helpful.
This PR was merged into the 2.6 branch.

Discussion
----------

[DomCrawler] Improve namespace discovery performance

| Q             | A      |
| ------------- | ------ |
| Bug fix?      | yes    |
| New feature?  | no     |
| BC breaks?    | no     |
| Deprecations? | no     |
| Tests pass?   | yes    |
| Fixed tickets | #12298 |
| License       | MIT    |
| Doc PR        | -      |

Before this quick fix, XPath axes like `child::*` or `descendant-or-self::*` were considered namespaces. As a result, far too many XPath queries were being executed during namespace discovery. Here's a full Blackfire before/after comparison of @salaman's script: https://blackfire.io/profiles/compare/a80b9e77-8e55-45d3-a348-7d34a51053b6/graph

Commits
-------

b6af002 [DomCrawler] Improve namespace discovery performance
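For context, a rough sketch of the kind of prefix detection the fix concerns (an illustrative regex, not the exact code that was merged): a name followed by a single colon, as in //ce:label, is a namespace prefix candidate, while axes use a double colon and must be skipped.

// Illustrative only: collect namespace prefix candidates from an XPath
// expression, skipping axis specifiers such as descendant-or-self::.
function findPrefixes($xpath)
{
    $prefixes = array();

    // A single ':' after a name marks a prefix; '::' marks an axis.
    if (preg_match_all('/([a-z_][a-z_0-9\-\.]*):(?!:)/i', $xpath, $matches)) {
        $prefixes = array_unique($matches[1]);
    }

    return $prefixes;
}

var_dump(findPrefixes('descendant-or-self::ce:label')); // array('ce'); the axis is skipped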
In what way was this fixed? I still have terrible performance without the changes listed above.
@ConneXNL if you provide a script reproducing your issue, I'll be happy to look into it. The current issue was related to the namespace discovery logic (see #14012).
Just ran into this issue when iterating over a list of XML nodes and doing a bunch of XPath queries on each list item. The sample document was ~2.5 MB. filterRelativeXPath() creates a sub-crawler for each XPath query. Then, when the query is executed on the sub-crawler, it has to execute the very slow discoverNamespace() method, because the namespace cache (the $namespaces attribute of the Crawler class) is not copied over to the sub-crawler. By changing the createSubCrawler() method to this, I was able to boost performance significantly:

private function createSubCrawler($nodes)
{
    $crawler = new static($nodes, $this->uri, $this->baseHref);
    $crawler->namespaces = $this->namespaces; // Added this line

    return $crawler;
}

It works for my use case, but I don't know what implications this could have for other use cases.
@scheb could you please provide the script and the document that exhibit the issue?
Here's a script to reproduce:

<?php

include __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$xmlRaw = file_get_contents('http://api.elsevier.com/content/article/PII:S0370269312002560?httpAccept=text/xml');

$crawler = new Crawler();
$crawler->addXmlContent($xmlRaw);
$crawler->registerNamespace("ce", "http://www.elsevier.com/xml/common/dtd");

$nodes = $crawler->filterXPath('//ce:author-group/ce:affiliation');
$nodes->each(function (Crawler $crawler) {
    $startTime = microtime(true);
    $crawler->filterXPath('//ce:textfn')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";

    $startTime = microtime(true);
    $crawler->filterXPath('//ce:label')->extract(array('_text'));
    echo microtime(true) - $startTime . " sec\n";
});

Also, I want to mention that all our tests using the DomCrawler still work, and they're nearly twice as fast.
@scheb I've tested the change that you propose:

private function createSubCrawler($nodes)
{
    $crawler = new static($nodes, $this->uri, $this->baseHref);
    $crawler->isHtml = $this->isHtml;
    $crawler->document = $this->document;
+   $crawler->namespaces = $this->namespaces;

    return $crawler;
}

And the performance improvement is astonishing. Now @jakzal and others will evaluate if this change is appropriate. Thanks!
@javiereguiluz the change is totally legitimate. I opened a PR already to apply it.
This PR was merged into the 2.8 branch.

Discussion
----------

[DomCrawler] Inherit the namespace cache in subcrawlers

| Q             | A      |
| ------------- | ------ |
| Branch?       | 2.8    |
| Bug fix?      | no     |
| New feature?  | no     |
| BC breaks?    | no     |
| Deprecations? | no     |
| Tests pass?   | yes    |
| Fixed tickets | #12298 |
| License       | MIT    |
| Doc PR        | n/a    |

This inherits any already discovered/registered namespace with subcrawlers, improving performance when using namespaces. I submitted to 2.8 rather than 2.7 because the namespace mapping feature was actually buggy in 2.x: nodes could belong to different documents in the same Crawler while the namespace map was shared. The fact that the map was not inherited in subcrawlers mitigated this issue (by reducing the chances of having multiple documents in the same subcrawler). 2.8 deprecated the possibility to have multiple documents, so I'm fine with applying this here.

Note that the subcrawler inherits the namespace cache at the time it is created, but the cache is not shared between instances (so if a subcrawler discovers an additional namespace of the document, it will not be available for the parent crawler or other subcrawlers of the parent). Sharing the cache would be totally possible (as they share the same document anyway) and would make the experience even better (removing the need to ensure that the root crawler discovers namespaces before filtering), but it would require moving from an array to an object. I'm not sure we want to do this in a patch release. What do you think @symfony/deciders?

Commits
-------

e89c758 [DomCrawler] Inherit the namespace cache in subcrawlers
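A sketch of the "array to object" idea mentioned above (hypothetical class, not what was merged): because PHP objects are passed by handle while arrays are copied on assignment, a cache object held by parent and subcrawlers alike makes every discovered namespace visible everywhere.

// Illustrative only.
class NamespaceCache
{
    private $namespaces = array();

    public function get($prefix)
    {
        return isset($this->namespaces[$prefix]) ? $this->namespaces[$prefix] : null;
    }

    public function set($prefix, $namespaceUri)
    {
        $this->namespaces[$prefix] = $namespaceUri;
    }
}

// Parent and subcrawler would both hold $cache; a write in one is
// immediately observable in the other, unlike with a copied array.
$cache = new NamespaceCache();
$cache->set('ce', 'http://www.elsevier.com/xml/common/dtd');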
When trying to parse info from some website, I discovered that DomCrawler is terribly slow and unoptimized. With a few little changes I improved performance more than 150 times (11,500 ms -> 65 ms).
The main culprit is
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L862
because creation of DOMXPath is damn expensive.
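To put a number on that claim, here is a small standalone benchmark sketch (page.html is a placeholder for any large HTML file; absolute timings will vary, the gap between the two loops is the point):

<?php

// Load the document once.
$document = new \DOMDocument();
@$document->loadHTMLFile('page.html');

// Variant 1: a fresh \DOMXPath for every query, as the Crawler code does.
$start = microtime(true);
for ($i = 0; $i < 1000; ++$i) {
    $xpath = new \DOMXPath($document);
    $xpath->query('//div');
}
printf("fresh DOMXPath per query: %.3f sec\n", microtime(true) - $start);

// Variant 2: one \DOMXPath, reused for every query.
$xpath = new \DOMXPath($document);
$start = microtime(true);
for ($i = 0; $i < 1000; ++$i) {
    $xpath->query('//div');
}
printf("reused DOMXPath: %.3f sec\n", microtime(true) - $start);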
I quick fixed it (I know it's very, very dirty).
Another performance improvement (about 200 ms in my environment) can be gained if the results of CssSelector::toXPath($selector) at
https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L675
are cached.