-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
[DomCrawler] Cache discovered namespaces #39097
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hey! I see that this is your first PR. That is great! Welcome! Symfony has a contribution guide which I suggest you to read. In short:
Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change. When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor! I am going to sit back now and wait for the reviews. Cheers! Carsonbot |
d02e765
to
d256b2f
Compare
@@ -1034,6 +1041,8 @@ private function filterRelativeXPath(string $xpath) | |||
$crawler->add($domxpath->query($xpath, $node)); | |||
} | |||
|
|||
$crawler->cachedNamespaces = $this->cachedNamespaces; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than copying the cached namespaced a second time to account for changes done in createDOMXPath
, I'd rather ensure that the changes are done before creating the subcrawler.
This can be done by creating a single $domxpath
thanks to the invariant that $node->ownerDocument === $this->document
in Symfony 4+ (that was not enforced in Symfony 3), which could allow us to create a single $domxpath
before the createSubCrawler
call and use it in the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonberger why is this marked as resolved ?
c73a09f
to
a26870a
Compare
a26870a
to
b4af1ba
Compare
Is the ticket (still) auto-closed by this PR? If so can I write it differently? Otherwise I just remove it. |
@simonberger don't put |
4624d83
to
840fe47
Compare
this PR looks good to me. the last potential improvement is the one I suggested in #39067 (comment) to make all subcrawlers share their namespace cache with the main crawler (instead of copying the parent cache at instantiation time), which helps when the subcrawlers are the one making the discovery. |
Yeah I tested to use an ArrayIterator instead of an array which seems to solve that shared namespace issue. |
@@ -73,6 +73,7 @@ public function __construct($node = null, string $uri = null, string $baseHref = | |||
$this->uri = $uri; | |||
$this->baseHref = $baseHref ?: $uri; | |||
$this->html5Parser = class_exists(HTML5::class) ? new HTML5(['disable_html_ns' => true]) : null; | |||
$this->cachedNamespaces = new \ArrayIterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be ArrayObject
rather than ArrayIterator
IMO
} | ||
|
||
// ask for one namespace, otherwise we'd get a collection with an item for each node | ||
$namespaces = $domxpath->query(sprintf('(//namespace::*[name()="%s"])[last()]', $this->defaultNamespacePrefix === $prefix ? '' : $prefix)); | ||
|
||
$namespace = ($node = $namespaces->item(0)) ? $node->nodeValue : null; | ||
$this->cachedNamespaces[$prefix] = $namespace; | ||
$this->cachedNamespaces->offsetSet($prefix, $namespace); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using this->cachedNamespaces[$prefix] = $namespace;
is fine. that's what ArrayAccess
is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thankful for your valuable improvement ideas but please let us not discuss every unimportant detail of my implementation. I know I have an ArrayIterator here in opposite to most public ArrayIterator|array|... usages so I decided to directly take its functions.
e1728a5
to
b9fe3ba
Compare
b9fe3ba
to
c31d7ef
Compare
c31d7ef
to
a8e85ec
Compare
Rebase on 5.x (we don't introduce new features in 4.4, and caching is a new feature). I've also made some cosmetic changes. |
Thank you @simonberger. |
@fabpot Not a really big thing but this change a8e85ec#diff-940b51ffa31dedac17aa49cd8e04e44aa8b3747782c7f9f27457b0510587c05dR1201 does not find the also cached null results of prefixes. One of the reasons I preferred to stick to the method usage everywhere. |
This PR was merged into the 5.3-dev branch. Discussion ---------- [DomCrawler] Fix null namespace issue in Crawler | Q | A | ------------- | --- | Branch? | 5.x <!-- see below --> | Bug fix? | yes | New feature? | no <!-- please update src/**/CHANGELOG.md files --> | Deprecations? | no <!-- please update UPGRADE-*.md and src/**/CHANGELOG.md files --> | Tickets | Fix #39277 <!-- prefix each issue number with "Fix #", no need to create an issue if none exist, explain below instead --> | License | MIT | Doc PR | <!-- required for new features --> In this case `isset` may cause an issue if null value would be stored in `$this->namespaces` or `$this->cachedNamespaces`, but never fetched, as if condition would return false. This PR is for 5.x, because namespace caching was introduced recently (#39097). Commits ------- 7b0a183 [DomCrawler] Fix null namespace issue in Crawler
Discovering namespaces is by far the most expensive task when filtering for nodes and the xpath contains prefixes.
When
Crawler::filterRelativeXPath
is called multiple times with (identical) prefixes the slowdown is huge.This fix brings the runtime of the linked ticket down from 27 seconds to 9 seconds in my test. Compared to a pure PHP version which takes < 0.5 seconds the design of the crawler API is the limiting factor. There are still many repeated namespace queries caused by new Crawler instances. Ideas to solve this are discussed in the ticket.