Skip to content

[DomCrawler] Cache discovered namespaces #39097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 27, 2020

Conversation

simonberger
Copy link
Contributor

@simonberger simonberger commented Nov 16, 2020

Q A
Branch? 5.x
Bug fix? no
New feature? yes
Deprecations? no
Tickets Address #39067
License MIT

Discovering namespaces is by far the most expensive task when filtering for nodes and the xpath contains prefixes.
When Crawler::filterRelativeXPath is called multiple times with (identical) prefixes the slowdown is huge.

This fix brings the runtime of the linked ticket down from 27 seconds to 9 seconds in my test. Compared to a pure PHP version which takes < 0.5 seconds the design of the crawler API is the limiting factor. There are still many repeated namespace queries caused by new Crawler instances. Ideas to solve this are discussed in the ticket.

@carsonbot
Copy link

Hey!

I see that this is your first PR. That is great! Welcome!

Symfony has a contribution guide which I suggest you to read.

In short:

  • Always add tests
  • Keep backward compatibility (see https://symfony.com/bc).
  • Bug fixes must be submitted against the lowest maintained branch where they apply (see https://symfony.com/releases)
  • Features and deprecations must be submitted against the 5.x branch.

Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change.

When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor!
If this PR is merged in a lower version branch, it will be merged up to all maintained branches within a few days.

I am going to sit back now and wait for the reviews.

Cheers!

Carsonbot

@carsonbot carsonbot changed the title Cache discovered namespaces in Crawler [DomCrawler] Cache discovered namespaces in Crawler Nov 16, 2020
@simonberger simonberger changed the title [DomCrawler] Cache discovered namespaces in Crawler [DomCrawler] Cache discovered namespaces Nov 16, 2020
@simonberger simonberger force-pushed the crawler_cache_namespaces branch from d02e765 to d256b2f Compare November 17, 2020 23:55
@@ -1034,6 +1041,8 @@ private function filterRelativeXPath(string $xpath)
$crawler->add($domxpath->query($xpath, $node));
}

$crawler->cachedNamespaces = $this->cachedNamespaces;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than copying the cached namespaced a second time to account for changes done in createDOMXPath, I'd rather ensure that the changes are done before creating the subcrawler.
This can be done by creating a single $domxpath thanks to the invariant that $node->ownerDocument === $this->document in Symfony 4+ (that was not enforced in Symfony 3), which could allow us to create a single $domxpath before the createSubCrawler call and use it in the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonberger why is this marked as resolved ?

@simonberger simonberger force-pushed the crawler_cache_namespaces branch from c73a09f to a26870a Compare November 18, 2020 19:18
@simonberger simonberger force-pushed the crawler_cache_namespaces branch from a26870a to b4af1ba Compare November 18, 2020 19:20
@simonberger
Copy link
Contributor Author

Is the ticket (still) auto-closed by this PR? If so can I write it differently? Otherwise I just remove it.

@stof
Copy link
Member

stof commented Nov 18, 2020

@simonberger don't put fix just before the issue number

@simonberger simonberger force-pushed the crawler_cache_namespaces branch 2 times, most recently from 4624d83 to 840fe47 Compare November 19, 2020 12:36
@stof
Copy link
Member

stof commented Nov 19, 2020

this PR looks good to me.

the last potential improvement is the one I suggested in #39067 (comment) to make all subcrawlers share their namespace cache with the main crawler (instead of copying the parent cache at instantiation time), which helps when the subcrawlers are the one making the discovery.

@simonberger
Copy link
Contributor Author

Yeah I tested to use an ArrayIterator instead of an array which seems to solve that shared namespace issue.
I'll push a solution like this tomorrow.

@@ -73,6 +73,7 @@ public function __construct($node = null, string $uri = null, string $baseHref =
$this->uri = $uri;
$this->baseHref = $baseHref ?: $uri;
$this->html5Parser = class_exists(HTML5::class) ? new HTML5(['disable_html_ns' => true]) : null;
$this->cachedNamespaces = new \ArrayIterator();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ArrayObject rather than ArrayIterator IMO

}

// ask for one namespace, otherwise we'd get a collection with an item for each node
$namespaces = $domxpath->query(sprintf('(//namespace::*[name()="%s"])[last()]', $this->defaultNamespacePrefix === $prefix ? '' : $prefix));

$namespace = ($node = $namespaces->item(0)) ? $node->nodeValue : null;
$this->cachedNamespaces[$prefix] = $namespace;
$this->cachedNamespaces->offsetSet($prefix, $namespace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using this->cachedNamespaces[$prefix] = $namespace; is fine. that's what ArrayAccess is for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thankful for your valuable improvement ideas but please let us not discuss every unimportant detail of my implementation. I know I have an ArrayIterator here in opposite to most public ArrayIterator|array|... usages so I decided to directly take its functions.

@fabpot fabpot force-pushed the crawler_cache_namespaces branch from e1728a5 to b9fe3ba Compare November 27, 2020 07:00
@fabpot fabpot changed the base branch from 4.4 to 5.x November 27, 2020 07:00
@fabpot fabpot force-pushed the crawler_cache_namespaces branch from b9fe3ba to c31d7ef Compare November 27, 2020 07:03
@fabpot fabpot force-pushed the crawler_cache_namespaces branch from c31d7ef to a8e85ec Compare November 27, 2020 07:04
@fabpot
Copy link
Member

fabpot commented Nov 27, 2020

Rebase on 5.x (we don't introduce new features in 4.4, and caching is a new feature). I've also made some cosmetic changes.

@fabpot
Copy link
Member

fabpot commented Nov 27, 2020

Thank you @simonberger.

@simonberger
Copy link
Contributor Author

@fabpot Not a really big thing but this change a8e85ec#diff-940b51ffa31dedac17aa49cd8e04e44aa8b3747782c7f9f27457b0510587c05dR1201 does not find the also cached null results of prefixes. One of the reasons I preferred to stick to the method usage everywhere.

@simonberger simonberger deleted the crawler_cache_namespaces branch November 27, 2020 09:31
derrabus added a commit that referenced this pull request Dec 11, 2020
This PR was merged into the 5.3-dev branch.

Discussion
----------

[DomCrawler] Fix null namespace issue in Crawler

| Q             | A
| ------------- | ---
| Branch?       | 5.x <!-- see below -->
| Bug fix?      | yes
| New feature?  | no <!-- please update src/**/CHANGELOG.md files -->
| Deprecations? | no <!-- please update UPGRADE-*.md and src/**/CHANGELOG.md files -->
| Tickets       | Fix #39277 <!-- prefix each issue number with "Fix #", no need to create an issue if none exist, explain below instead -->
| License       | MIT
| Doc PR        |  <!-- required for new features -->

In this case `isset` may cause an issue if null value would be stored in `$this->namespaces` or `$this->cachedNamespaces`, but never fetched, as if condition would return false. This PR is for 5.x, because namespace caching was introduced recently (#39097).

Commits
-------

7b0a183 [DomCrawler] Fix null namespace issue in Crawler
@fabpot fabpot mentioned this pull request Apr 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants