[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

jakzal · 2013-01-10T10:00:13Z

Q	A
Bug fix:	no
Feature addition:	yes
Backwards compatibility break:	yes, default namespace is no longer removed in the `addContent` method
Symfony2 tests pass:	yes
Fixes the following tickets:	#4845
Todo:	-
License of the code:	MIT
Documentation PR:	symfony/symfony-docs#2979

added support for automatic discovery and explicit registration of document namespaces for Crawler::filterXPath() and Crawler::filter()
improved content type guessing in Crawler::addContent()
[BC BREAK] Crawler::addXmlContent() no longer removes the default document namespace

I mentioned in #4845 it would probably be possible to use DOMNode::lookupNamespaceURI() to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for.

Current implementation makes the following query to find a namespace:

$domxpath->query('(//namespace::*[name()="media"])[last()]')

fabpot · 2013-03-23T13:23:19Z

@jakzal Any news on this PR. I would love to have this in 2.3.

jakzal · 2013-03-26T10:38:38Z

@fabpot I stumbled across few issues, some of them already reported by @arryon. Let me do a brain dump here. Note that we'll have similar problems with #5886.

XML content is loaded in different ways depending on a method we use.

Consider simple XML below:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007">
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <yt:accessControl action="comment" permission="allowed"/>
    <yt:accessControl action="videoRespond" permission="moderated"/>
    <media:group>
        <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
        <yt:aspectRatio>widescreen</yt:aspectRatio>
    </media:group>
</entry>

Most of the time XML is loaded as html, since this is a default type in the addContent() method. addContent is also used when calling add() or using the constructor.

When loading the XML with one of the mentioned methods:

$crawler = new Crawler(file_get_contents('entry.xml'));

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

we end up with the following result:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8"??>
<html>
    <body>
        <entry xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007">
            <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
            <accesscontrol action="comment" permission="allowed"/>
            <accesscontrol action="videoRespond" permission="moderated"/>
            <group>
                <title type="plain">Chordates - CrashCourse Biology #24</title>
                <aspectratio>widescreen</aspectratio>
            </group>
        </entry>
    </body>
</html>

Notice that namespace attributes remain unchanged on the entry tag, but nodes are not namespaced anymore. Our XML is embedded in an html now. I'm not sure if this was an intended behaviour. If so, it might be worth documenting and advising loading XMLs with addXmlContent().

At the moment only way to load XML with addXmlContent is to call it directly:

$crawler = new Crawler();
$crawler->addXmlContent(file_get_contents('entry.xml'));

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

addXmlContent removes all the namespaces to simplify xpath expressions, so we end up with:

<?xml version="1.0" encoding="UTF-8"?>
<entry ns="http://www.w3.org/2005/Atom" media="http://search.yahoo.com/mrss/" yt="http://gdata.youtube.com/schemas/2007">
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <accessControl action="comment" permission="allowed"/>
    <accessControl action="videoRespond" permission="moderated"/>
    <group>
        <title type="plain">Chordates - CrashCourse Biology #24</title>
        <aspectRatio>widescreen</aspectRatio>
    </group>
</entry>

It indeed makes xpath expressions simpler, since we no longer have to use prefixes. However, it might cause problems with complex XMLs where we might get the same node in multiple namespaces. Namespaces are there to prevent name collisions, by removing them we're exposing ourselves to this issue.

Only way to keep the original XML is to use DOMDocument:

$dom = new \DOMDocument();
$dom->loadXML(file_get_contents('entry.xml'));

$crawler = new Crawler($dom);

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

My solution only works if we load the XML with DOMDocument. In other cases either namespace definitions or nodes are modified.

Furthermore, at the moment my solution doesn't work with a default namespace (oh irony). Just because we have a default namespace, doesn't mean we can drop it from xpath expressions. That's probably the reason why code in addXmlContent is removing it.

In other words, even if we fixed issues with variate of ways we can load XML, we'd have to register a default namespace in some way (either by convention, or explicitly).

Imho we have to problems to solve here:

consistency in loading XMLs
providing users with a way to use namespace prefixes in xpath expressions

This PR was merged into the master branch. Discussion ---------- [CssSelector] Updated parsers to support namespaces (fix for ClassParser included) ClassParser was passing improper parameters to `ElementNode`, as well as namespaces simply not being supported in the various parsers. This is a natural extension of #6650, by properly parsing the requested CSS filter if supplied. | Q | A | ------------- | --- | Bug fix? | yes | New feature? | yes | BC breaks? | no | Deprecations? | no | Tests pass? | yes | Fixed tickets | NA | License | MIT | Doc PR | NA Commits ------- 3c015d5 Updated parsers to support namespaces (fix for ClassParser included)

fabpot · 2013-04-21T07:04:41Z

@jakzal Are you closing because you think there are no good solutions? If that's the case, can we also close #5886 and #4845?

jakzal · 2013-04-21T07:45:30Z

@fabpot I can't find a good solution for autodiscovery. Main reason is that even the default namespace has to be registered before querying with xpath. I'm not sure what would be a good prefix for a default namespace (or if we should be doing this at all).

#5886 is not a serious problem, since we can actually work around it (i've posted an update there, it's possible to skip the namespace in xpath). It could become a problem with complex XMLs, with multiple namespaces containing the same nodes.

If we decided that solution for #5886 is good enough, than we could close #4845 as well.

Otherwise we'd probably need to make a BC break, to avoid playing with namespaces when loading XMLs:

// remove the default namespace to make XPath expressions simpler
@$dom->loadXML(str_replace('xmlns', 'ns', $content), LIBXML_NONET);

I'm not sure what was the intention, but it actually removes ALL the namespaces (which is not what comment says). We definitely have to document it.

fabpot · 2013-04-25T12:43:15Z

@jakzal Would it be possible to only remove the default namespace and keep the other ones? That would be the best option I think.

jakzal · 2013-04-25T15:41:26Z

Removing the default namespace would actually work. We'd only need to make sure it's removed consistently (no matter how you load an XML).

I'll continue working on this.

jakzal · 2013-04-25T16:29:29Z

Filtering with CSS expressions fails because CssSelector::toXPath() converts nodes to lowercase. Instead of descendant-or-self::yt:accessControl it generates descendant-or-self::yt:accesscontrol.

@jfsimon any reason why node names are lowercased?

fabpot · 2013-04-25T16:47:11Z

@jakzal see scrapy/cssselect@c192fcb

fabpot · 2013-04-25T16:47:34Z

or even better: http://www.w3.org/TR/selectors/#casesens

jakzal · 2013-04-25T20:17:22Z

@fabpot while that's perfectly valid for css selectors, XPath expressions are case sensitive. That's why calling CssSelectors::toXPath() yields invalid results in some cases (in my case original accessControl becomes accesscontrol).

fabpot · 2013-04-25T20:39:16Z

@jakzal I think we can just safely remove the strtolower call then, no?

jfsimon · 2013-04-25T20:41:20Z

It seems that documents loaded as HTML are case-lowered (http://fr2.php.net/manual/fr/domxpath.query.php#77048) so in this case xpath must be case-lowered too. In the case of XML (which is the one here), they're not.

Case-insensitive search can be performed

with XPATH 2.0 using th i flag with the match() function, or the lower-case() function (http://stackoverflow.com/questions/2893551/case-insensitive-matching-in-xpath);
with XPATH 1.0 (which seems to be used by DOMXPath), we have have to use the translate() function (http://stackoverflow.com/questions/625986/how-can-i-use-xpath-to-perform-a-case-insensitive-search-and-support-non-english).

jfsimon · 2013-04-25T20:50:34Z

Note that the CssSelector component already supports some differences between HTML and XML.
The case could be lowered for HTML documents only.
That said, the DomCrawler component would have to be aware of the loaded document type (HTML or XML).

jakzal · 2013-04-25T21:00:34Z

@jfsimon Calling CssSelector::disableHtmlExtension() before CssSelector::toXPath() makes that my example works. However, it's probably not what we want with html.

jfsimon · 2013-04-25T21:08:54Z

@jakzal exactly. The HtmlExtension must be enabled for HTML queries and disabled for XML ones.
I guess this wont be easy to implement in the DomCrawler component.
That's why I talked about case-insensitive search, which could be a "cross-type" solution (but not a perfect one).

fabpot · 2013-08-09T06:46:59Z

Any news on this one?

arryon · 2013-08-09T08:44:45Z

For my own use I 'solved' this problem by naming the default namespace 'default', and keeping a registry of all other namespaces. Xpath queries using the default namespace must then begin with 'default', or any other sensible name that you could come up with.

I haven't followed this discussing since, but here's my own commit implementing adding namespaces, could be a starting point for someone to pick this issue up:
https://github.com/arryon/DomCrawler/commit/7c4a774b531110ad8516d3bcdb5052d3f439418e

fabpot · 2013-09-13T10:35:48Z

@jakzal I need to take a decision about this one: can we rely on auto-detection or do we allow people to register namespaces?

jakzal · 2013-09-13T11:13:03Z

Auto-detection seems to be working fine and there's no need for special treatment of a default namespace. See

I'd appreciate if someone reviewed it to make sure I'm not missing anything (ping @jfsimon @stof).

Notice I had to manually disable the html extension for css selector with CssSelector::disableHtmlExtension() to make filter() work. Does it make sense to query XML with a css selector? I could update the filter() tests to query an HTML document instead.

stof · 2013-09-13T11:19:48Z

@jakzal I think it makes sense to be able to use CssSelector for XML documents (this is exactly why we allow disabling the HTML extension of the CssSelector component btw).

fabpot · 2013-09-13T13:16:05Z

The CssSelector and the DomCrawler components must support both XML and HTML.

fabpot · 2013-09-13T13:19:56Z

@jakzal Can you make a PR to update the docs?

jakzal · 2013-09-13T13:20:55Z

👍

jakzal · 2013-09-13T21:16:06Z

There's an inconsistency in the way XMLs are loaded to the Crawler, I'm not sure if it should be addressed here or as a separate issue.

If an XML is loaded directly from string, the Crawler::addXmlContent is used, which removes the default namespace. If an XML is loaded from a DOMDocument, the default namespace is not removed.

Implication for the current PR is that if we load DOMDocument filtering by default namespace doesn't work atm. I think inconsistency should be solved first, otherwise we'd have two behaviours depending on method used to load the document.

Options we have:

Make that addXmlContent doesn't remove the default namespace and register the namespace under a fixed alias (default?) when calling filterXPath
Make that addDocument removes the default namespace just like addXmlContent (probably would require dumping to string and loading it again or $dom->documentElement->removeAttributeNS($dom->documentElement->getAttributeNode('xmlns')->nodeValue, '');). This wouldn't involve further changes in filterXPath.
Document differences in behaviour between addXmlContent and addDocument

fabpot · 2013-09-15T10:44:16Z

For the inconsistency you found, I would go with option 1. If that's not too complex, I would like to get the fix in this PR, if not, let's create an issue.

Also, can you add a note in the component CHANGELOG file about this new feature?

jakzal · 2013-09-17T17:30:47Z

Updated this PR with changes described in the option 1.

Note that the outcome is we HAVE to use prefixes once there's at least one non-default namespace in the document (that's how it works in php).

We could still improve this PR by:

letting users change the default namespace prefix (Crawler::setDefaultNamespacePrefix()) - this would be very easy
allowing users to register namespaces manually - not much harder

fabpot · 2013-09-17T19:52:02Z

The 2 suggestions make sense.

…and Crawler::filterByXPath(). Improved content type guessing.

…ding an XML content.

…ading documents with addXmlContent().

…ation.

…etails.

jakzal · 2013-09-22T22:48:02Z

@fabpot both suggestions are now implemented.

fabpot · 2013-09-24T20:04:50Z

src/Symfony/Component/DomCrawler/Tests/CrawlerTest.php

    /**
     * @covers Symfony\Component\DomCrawler\Crawler::filter
     */
    public function testFilter()
    {
+        $this->markSkippedIfCssSelectorNotPresent();


Can you remove the test for CSS selector as we have removed all such checks everywhere as composer must have been run to execute the tests.

This PR was merged into the master branch. Discussion ---------- [DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() | Q | A | --- | --- |Bug fix: | no |Feature addition: |yes |Backwards compatibility break: | yes, default namespace is no longer removed in the `addContent` method |Symfony2 tests pass: | yes| |Fixes the following tickets: | #4845 |Todo: | - |License of the code:| MIT |Documentation PR: | symfony/symfony-docs#2979 * added support for automatic discovery and explicit registration of document namespaces for `Crawler::filterXPath()` and `Crawler::filter()` * improved content type guessing in `Crawler::addContent()` * [BC BREAK] `Crawler::addXmlContent()` no longer removes the default document namespace I mentioned in #4845 it would probably be possible to use [DOMNode::lookupNamespaceURI()](http://www.php.net/manual/en/domnode.lookupnamespaceuri.php) to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for. Current implementation makes the following query to find a namespace: ```php $domxpath->query('(//namespace::*[name()="media"])[last()]') ``` Commits ------- 77e2fa5 [DomCrawler] Removed checks if CssSelector is present. 9110468 [DomCrawler] Enabled manual namespace registration. be1e4e6 [DomCrawler] Enabled default namespace prefix overloading. 943d446 [DomCrawler] Updated the CHANGELOG with namespace auto-registration details. c6fbb13 [DomCrawler] Added support for an automatic default namespace registration. 587e2dd [DomCrawler] Made that default namespace is no longer removed when loading documents with addXmlContent(). c905bba [DomCrawler] Added more tests for namespaced filtering. 6e717a3 [DomCrawler] Made sure only the default namespace is removed when loading an XML content. e5b8abb [DomCrawler] Added auto-discovery of namespaces in Crawler::filter() and Crawler::filterByXPath().

This PR was squashed before being merged into the 2.4 branch (closes #9771). Discussion ---------- Crawler default namespace fix | Q | A | ------------- | --- | Bug fix? | no | New feature? | no | BC breaks? | yes | Deprecations? | no | Tests pass? | yes | Fixed tickets | #9732, #6650 | License | MIT | Doc PR | symfony/symfony-docs/pull/2979 Fix backwards compatibility of xml namespaces for having only one default namespace. Commits ------- cfff054 Crawler default namespace fix

troymccabe mentioned this pull request Apr 6, 2013

[CssSelector] Updated parsers to support namespaces (fix for ClassParser included) #7585

Merged

jakzal closed this Apr 20, 2013

jakzal reopened this Apr 25, 2013

fabpot mentioned this pull request Sep 13, 2013

[DomCrawler/CssSelector] Fixed namespace using in filter by xpath #5886

Closed

jakzal mentioned this pull request Sep 13, 2013

[DomCrawler] Documented xml namespace autodiscovery symfony/symfony-docs#2979

Closed

jakzal added 8 commits September 22, 2013 23:05

[DomCrawler] Added auto-discovery of namespaces in Crawler::filter() …

e5b8abb

…and Crawler::filterByXPath(). Improved content type guessing.

[DomCrawler] Made sure only the default namespace is removed when loa…

6e717a3

…ding an XML content.

[DomCrawler] Added more tests for namespaced filtering.

c905bba

[DomCrawler] Made that default namespace is no longer removed when lo…

587e2dd

…ading documents with addXmlContent().

[DomCrawler] Added support for an automatic default namespace registr…

c6fbb13

…ation.

[DomCrawler] Updated the CHANGELOG with namespace auto-registration d…

943d446

…etails.

[DomCrawler] Enabled default namespace prefix overloading.

be1e4e6

[DomCrawler] Enabled manual namespace registration.

9110468

fabpot reviewed Sep 24, 2013
View reviewed changes

[DomCrawler] Removed checks if CssSelector is present.

77e2fa5

fabpot merged commit 77e2fa5 into symfony:master Sep 25, 2013

jakzal deleted the domcrawler-namespace-autodiscovery branch September 25, 2013 07:22

crudecki mentioned this pull request Dec 14, 2013

Crawler default namespace fix #9771

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

jakzal commented Jan 10, 2013

fabpot commented Mar 23, 2013

jakzal commented Mar 26, 2013

fabpot commented Apr 21, 2013

jakzal commented Apr 21, 2013

fabpot commented Apr 25, 2013

jakzal commented Apr 25, 2013

jakzal commented Apr 25, 2013

fabpot commented Apr 25, 2013

fabpot commented Apr 25, 2013

jakzal commented Apr 25, 2013

fabpot commented Apr 25, 2013

jfsimon commented Apr 25, 2013

jfsimon commented Apr 25, 2013

jakzal commented Apr 25, 2013

jfsimon commented Apr 25, 2013

fabpot commented Aug 9, 2013

arryon commented Aug 9, 2013

fabpot commented Sep 13, 2013

jakzal commented Sep 13, 2013

stof commented Sep 13, 2013

fabpot commented Sep 13, 2013

fabpot commented Sep 13, 2013

jakzal commented Sep 13, 2013

jakzal commented Sep 13, 2013

fabpot commented Sep 15, 2013

jakzal commented Sep 17, 2013

fabpot commented Sep 17, 2013

jakzal commented Sep 22, 2013

fabpot Sep 24, 2013

jakzal Sep 24, 2013

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

Conversation

jakzal commented Jan 10, 2013

fabpot commented Mar 23, 2013

jakzal commented Mar 26, 2013

fabpot commented Apr 21, 2013

jakzal commented Apr 21, 2013

fabpot commented Apr 25, 2013

jakzal commented Apr 25, 2013

jakzal commented Apr 25, 2013

fabpot commented Apr 25, 2013

fabpot commented Apr 25, 2013

jakzal commented Apr 25, 2013

fabpot commented Apr 25, 2013

jfsimon commented Apr 25, 2013

jfsimon commented Apr 25, 2013

jakzal commented Apr 25, 2013

jfsimon commented Apr 25, 2013

fabpot commented Aug 9, 2013

arryon commented Aug 9, 2013

fabpot commented Sep 13, 2013

jakzal commented Sep 13, 2013

stof commented Sep 13, 2013

fabpot commented Sep 13, 2013

fabpot commented Sep 13, 2013

jakzal commented Sep 13, 2013

jakzal commented Sep 13, 2013

fabpot commented Sep 15, 2013

jakzal commented Sep 17, 2013

fabpot commented Sep 17, 2013

jakzal commented Sep 22, 2013

fabpot Sep 24, 2013

Choose a reason for hiding this comment

jakzal Sep 24, 2013

Choose a reason for hiding this comment