-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650
Conversation
@jakzal Any news on this PR. I would love to have this in 2.3. |
@fabpot I stumbled across few issues, some of them already reported by @arryon. Let me do a brain dump here. Note that we'll have similar problems with #5886. XML content is loaded in different ways depending on a method we use. Consider simple XML below:
Most of the time XML is loaded as html, since this is a default type in the addContent() method. When loading the XML with one of the mentioned methods:
we end up with the following result:
Notice that namespace attributes remain unchanged on the entry tag, but nodes are not namespaced anymore. Our XML is embedded in an html now. I'm not sure if this was an intended behaviour. If so, it might be worth documenting and advising loading XMLs with addXmlContent(). At the moment only way to load XML with
It indeed makes xpath expressions simpler, since we no longer have to use prefixes. However, it might cause problems with complex XMLs where we might get the same node in multiple namespaces. Namespaces are there to prevent name collisions, by removing them we're exposing ourselves to this issue. Only way to keep the original XML is to use DOMDocument:
My solution only works if we load the XML with DOMDocument. In other cases either namespace definitions or nodes are modified. Furthermore, at the moment my solution doesn't work with a default namespace (oh irony). Just because we have a default namespace, doesn't mean we can drop it from xpath expressions. That's probably the reason why code in In other words, even if we fixed issues with variate of ways we can load XML, we'd have to register a default namespace in some way (either by convention, or explicitly). Imho we have to problems to solve here:
|
This PR was merged into the master branch. Discussion ---------- [CssSelector] Updated parsers to support namespaces (fix for ClassParser included) ClassParser was passing improper parameters to `ElementNode`, as well as namespaces simply not being supported in the various parsers. This is a natural extension of #6650, by properly parsing the requested CSS filter if supplied. | Q | A | ------------- | --- | Bug fix? | yes | New feature? | yes | BC breaks? | no | Deprecations? | no | Tests pass? | yes | Fixed tickets | NA | License | MIT | Doc PR | NA Commits ------- 3c015d5 Updated parsers to support namespaces (fix for ClassParser included)
@fabpot I can't find a good solution for autodiscovery. Main reason is that even the default namespace has to be registered before querying with xpath. I'm not sure what would be a good prefix for a default namespace (or if we should be doing this at all). #5886 is not a serious problem, since we can actually work around it (i've posted an update there, it's possible to skip the namespace in xpath). It could become a problem with complex XMLs, with multiple namespaces containing the same nodes. If we decided that solution for #5886 is good enough, than we could close #4845 as well. Otherwise we'd probably need to make a BC break, to avoid playing with namespaces when loading XMLs: // remove the default namespace to make XPath expressions simpler
@$dom->loadXML(str_replace('xmlns', 'ns', $content), LIBXML_NONET); I'm not sure what was the intention, but it actually removes ALL the namespaces (which is not what comment says). We definitely have to document it. |
@jakzal Would it be possible to only remove the default namespace and keep the other ones? That would be the best option I think. |
Removing the default namespace would actually work. We'd only need to make sure it's removed consistently (no matter how you load an XML). I'll continue working on this. |
Filtering with CSS expressions fails because @jfsimon any reason why node names are lowercased? |
or even better: http://www.w3.org/TR/selectors/#casesens |
@fabpot while that's perfectly valid for css selectors, XPath expressions are case sensitive. That's why calling |
@jakzal I think we can just safely remove the |
It seems that documents loaded as HTML are case-lowered (http://fr2.php.net/manual/fr/domxpath.query.php#77048) so in this case xpath must be case-lowered too. In the case of XML (which is the one here), they're not. Case-insensitive search can be performed
|
Note that the |
@jfsimon Calling |
@jakzal exactly. The |
Any news on this one? |
For my own use I 'solved' this problem by naming the default namespace 'default', and keeping a registry of all other namespaces. Xpath queries using the default namespace must then begin with 'default', or any other sensible name that you could come up with. I haven't followed this discussing since, but here's my own commit implementing adding namespaces, could be a starting point for someone to pick this issue up: |
@jakzal I need to take a decision about this one: can we rely on auto-detection or do we allow people to register namespaces? |
Auto-detection seems to be working fine and there's no need for special treatment of a default namespace. See I'd appreciate if someone reviewed it to make sure I'm not missing anything (ping @jfsimon @stof). Notice I had to manually disable the html extension for css selector with |
@jakzal I think it makes sense to be able to use CssSelector for XML documents (this is exactly why we allow disabling the HTML extension of the CssSelector component btw). |
The CssSelector and the DomCrawler components must support both XML and HTML. |
@jakzal Can you make a PR to update the docs? |
👍 |
There's an inconsistency in the way XMLs are loaded to the Crawler, I'm not sure if it should be addressed here or as a separate issue. If an XML is loaded directly from string, the Implication for the current PR is that if we load Options we have:
|
For the inconsistency you found, I would go with option 1. If that's not too complex, I would like to get the fix in this PR, if not, let's create an issue. Also, can you add a note in the component CHANGELOG file about this new feature? |
Updated this PR with changes described in the option 1. Note that the outcome is we HAVE to use prefixes once there's at least one non-default namespace in the document (that's how it works in php). We could still improve this PR by:
|
The 2 suggestions make sense. |
…and Crawler::filterByXPath(). Improved content type guessing.
…ding an XML content.
…ading documents with addXmlContent().
@fabpot both suggestions are now implemented. |
/** | ||
* @covers Symfony\Component\DomCrawler\Crawler::filter | ||
*/ | ||
public function testFilter() | ||
{ | ||
$this->markSkippedIfCssSelectorNotPresent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the test for CSS selector as we have removed all such checks everywhere as composer must have been run to execute the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This PR was merged into the master branch. Discussion ---------- [DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() | Q | A | --- | --- |Bug fix: | no |Feature addition: |yes |Backwards compatibility break: | yes, default namespace is no longer removed in the `addContent` method |Symfony2 tests pass: | yes| |Fixes the following tickets: | #4845 |Todo: | - |License of the code:| MIT |Documentation PR: | symfony/symfony-docs#2979 * added support for automatic discovery and explicit registration of document namespaces for `Crawler::filterXPath()` and `Crawler::filter()` * improved content type guessing in `Crawler::addContent()` * [BC BREAK] `Crawler::addXmlContent()` no longer removes the default document namespace I mentioned in #4845 it would probably be possible to use [DOMNode::lookupNamespaceURI()](http://www.php.net/manual/en/domnode.lookupnamespaceuri.php) to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for. Current implementation makes the following query to find a namespace: ```php $domxpath->query('(//namespace::*[name()="media"])[last()]') ``` Commits ------- 77e2fa5 [DomCrawler] Removed checks if CssSelector is present. 9110468 [DomCrawler] Enabled manual namespace registration. be1e4e6 [DomCrawler] Enabled default namespace prefix overloading. 943d446 [DomCrawler] Updated the CHANGELOG with namespace auto-registration details. c6fbb13 [DomCrawler] Added support for an automatic default namespace registration. 587e2dd [DomCrawler] Made that default namespace is no longer removed when loading documents with addXmlContent(). c905bba [DomCrawler] Added more tests for namespaced filtering. 6e717a3 [DomCrawler] Made sure only the default namespace is removed when loading an XML content. e5b8abb [DomCrawler] Added auto-discovery of namespaces in Crawler::filter() and Crawler::filterByXPath().
This PR was squashed before being merged into the 2.4 branch (closes #9771). Discussion ---------- Crawler default namespace fix | Q | A | ------------- | --- | Bug fix? | no | New feature? | no | BC breaks? | yes | Deprecations? | no | Tests pass? | yes | Fixed tickets | #9732, #6650 | License | MIT | Doc PR | symfony/symfony-docs/pull/2979 Fix backwards compatibility of xml namespaces for having only one default namespace. Commits ------- cfff054 Crawler default namespace fix
addContent
methodCrawler::filterXPath()
andCrawler::filter()
Crawler::addContent()
Crawler::addXmlContent()
no longer removes the default document namespaceI mentioned in #4845 it would probably be possible to use DOMNode::lookupNamespaceURI() to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for.
Current implementation makes the following query to find a namespace: