Skip to content

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath() #6650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 25, 2013

Conversation

jakzal
Copy link
Contributor

@jakzal jakzal commented Jan 10, 2013

Q A
Bug fix: no
Feature addition: yes
Backwards compatibility break: yes, default namespace is no longer removed in the addContent method
Symfony2 tests pass: yes
Fixes the following tickets: #4845
Todo: -
License of the code: MIT
Documentation PR: symfony/symfony-docs#2979
  • added support for automatic discovery and explicit registration of document namespaces for Crawler::filterXPath() and Crawler::filter()
  • improved content type guessing in Crawler::addContent()
  • [BC BREAK] Crawler::addXmlContent() no longer removes the default document namespace

I mentioned in #4845 it would probably be possible to use DOMNode::lookupNamespaceURI() to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for.

Current implementation makes the following query to find a namespace:

$domxpath->query('(//namespace::*[name()="media"])[last()]')

@fabpot
Copy link
Member

fabpot commented Mar 23, 2013

@jakzal Any news on this PR. I would love to have this in 2.3.

@jakzal
Copy link
Contributor Author

jakzal commented Mar 26, 2013

@fabpot I stumbled across few issues, some of them already reported by @arryon. Let me do a brain dump here. Note that we'll have similar problems with #5886.

XML content is loaded in different ways depending on a method we use.

Consider simple XML below:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007">
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <yt:accessControl action="comment" permission="allowed"/>
    <yt:accessControl action="videoRespond" permission="moderated"/>
    <media:group>
        <media:title type="plain">Chordates - CrashCourse Biology #24</media:title>
        <yt:aspectRatio>widescreen</yt:aspectRatio>
    </media:group>
</entry>

Most of the time XML is loaded as html, since this is a default type in the addContent() method. addContent is also used when calling add() or using the constructor.

When loading the XML with one of the mentioned methods:

$crawler = new Crawler(file_get_contents('entry.xml'));

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

we end up with the following result:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8"??>
<html>
    <body>
        <entry xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007">
            <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
            <accesscontrol action="comment" permission="allowed"/>
            <accesscontrol action="videoRespond" permission="moderated"/>
            <group>
                <title type="plain">Chordates - CrashCourse Biology #24</title>
                <aspectratio>widescreen</aspectratio>
            </group>
        </entry>
    </body>
</html>

Notice that namespace attributes remain unchanged on the entry tag, but nodes are not namespaced anymore. Our XML is embedded in an html now. I'm not sure if this was an intended behaviour. If so, it might be worth documenting and advising loading XMLs with addXmlContent().

At the moment only way to load XML with addXmlContent is to call it directly:

$crawler = new Crawler();
$crawler->addXmlContent(file_get_contents('entry.xml'));

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

addXmlContent removes all the namespaces to simplify xpath expressions, so we end up with:

<?xml version="1.0" encoding="UTF-8"?>
<entry ns="http://www.w3.org/2005/Atom" media="http://search.yahoo.com/mrss/" yt="http://gdata.youtube.com/schemas/2007">
    <id>tag:youtube.com,2008:video:kgZRZmEc9j4</id>
    <accessControl action="comment" permission="allowed"/>
    <accessControl action="videoRespond" permission="moderated"/>
    <group>
        <title type="plain">Chordates - CrashCourse Biology #24</title>
        <aspectRatio>widescreen</aspectRatio>
    </group>
</entry>

It indeed makes xpath expressions simpler, since we no longer have to use prefixes. However, it might cause problems with complex XMLs where we might get the same node in multiple namespaces. Namespaces are there to prevent name collisions, by removing them we're exposing ourselves to this issue.

Only way to keep the original XML is to use DOMDocument:

$dom = new \DOMDocument();
$dom->loadXML(file_get_contents('entry.xml'));

$crawler = new Crawler($dom);

foreach ($crawler as $domElement) {
    echo $domElement->ownerDocument->saveXML();
}

My solution only works if we load the XML with DOMDocument. In other cases either namespace definitions or nodes are modified.

Furthermore, at the moment my solution doesn't work with a default namespace (oh irony). Just because we have a default namespace, doesn't mean we can drop it from xpath expressions. That's probably the reason why code in addXmlContent is removing it.

In other words, even if we fixed issues with variate of ways we can load XML, we'd have to register a default namespace in some way (either by convention, or explicitly).

Imho we have to problems to solve here:

  • consistency in loading XMLs
  • providing users with a way to use namespace prefixes in xpath expressions

fabpot added a commit that referenced this pull request Apr 9, 2013
This PR was merged into the master branch.

Discussion
----------

[CssSelector] Updated parsers to support namespaces (fix for ClassParser included)

ClassParser was passing improper parameters to `ElementNode`, as well as namespaces simply not being supported in the various parsers. This is a natural extension of #6650, by properly parsing the requested CSS filter if supplied.

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | NA
| License       | MIT
| Doc PR        | NA

Commits
-------

3c015d5 Updated parsers to support namespaces (fix for ClassParser included)
@jakzal jakzal closed this Apr 20, 2013
@fabpot
Copy link
Member

fabpot commented Apr 21, 2013

@jakzal Are you closing because you think there are no good solutions? If that's the case, can we also close #5886 and #4845?

@jakzal
Copy link
Contributor Author

jakzal commented Apr 21, 2013

@fabpot I can't find a good solution for autodiscovery. Main reason is that even the default namespace has to be registered before querying with xpath. I'm not sure what would be a good prefix for a default namespace (or if we should be doing this at all).

#5886 is not a serious problem, since we can actually work around it (i've posted an update there, it's possible to skip the namespace in xpath). It could become a problem with complex XMLs, with multiple namespaces containing the same nodes.

If we decided that solution for #5886 is good enough, than we could close #4845 as well.

Otherwise we'd probably need to make a BC break, to avoid playing with namespaces when loading XMLs:

// remove the default namespace to make XPath expressions simpler
@$dom->loadXML(str_replace('xmlns', 'ns', $content), LIBXML_NONET);

I'm not sure what was the intention, but it actually removes ALL the namespaces (which is not what comment says). We definitely have to document it.

@fabpot
Copy link
Member

fabpot commented Apr 25, 2013

@jakzal Would it be possible to only remove the default namespace and keep the other ones? That would be the best option I think.

@jakzal
Copy link
Contributor Author

jakzal commented Apr 25, 2013

Removing the default namespace would actually work. We'd only need to make sure it's removed consistently (no matter how you load an XML).

I'll continue working on this.

@jakzal jakzal reopened this Apr 25, 2013
@jakzal
Copy link
Contributor Author

jakzal commented Apr 25, 2013

Filtering with CSS expressions fails because CssSelector::toXPath() converts nodes to lowercase. Instead of descendant-or-self::yt:accessControl it generates descendant-or-self::yt:accesscontrol.

@jfsimon any reason why node names are lowercased?

@fabpot
Copy link
Member

fabpot commented Apr 25, 2013

@jakzal see scrapy/cssselect@c192fcb

@fabpot
Copy link
Member

fabpot commented Apr 25, 2013

or even better: http://www.w3.org/TR/selectors/#casesens

@jakzal
Copy link
Contributor Author

jakzal commented Apr 25, 2013

@fabpot while that's perfectly valid for css selectors, XPath expressions are case sensitive. That's why calling CssSelectors::toXPath() yields invalid results in some cases (in my case original accessControl becomes accesscontrol).

@fabpot
Copy link
Member

fabpot commented Apr 25, 2013

@jakzal I think we can just safely remove the strtolower call then, no?

@jfsimon
Copy link
Contributor

jfsimon commented Apr 25, 2013

It seems that documents loaded as HTML are case-lowered (http://fr2.php.net/manual/fr/domxpath.query.php#77048) so in this case xpath must be case-lowered too. In the case of XML (which is the one here), they're not.

Case-insensitive search can be performed

@jfsimon
Copy link
Contributor

jfsimon commented Apr 25, 2013

Note that the CssSelector component already supports some differences between HTML and XML.
The case could be lowered for HTML documents only.
That said, the DomCrawler component would have to be aware of the loaded document type (HTML or XML).

@jakzal
Copy link
Contributor Author

jakzal commented Apr 25, 2013

@jfsimon Calling CssSelector::disableHtmlExtension() before CssSelector::toXPath() makes that my example works. However, it's probably not what we want with html.

@jfsimon
Copy link
Contributor

jfsimon commented Apr 25, 2013

@jakzal exactly. The HtmlExtension must be enabled for HTML queries and disabled for XML ones.
I guess this wont be easy to implement in the DomCrawler component.
That's why I talked about case-insensitive search, which could be a "cross-type" solution (but not a perfect one).

@fabpot
Copy link
Member

fabpot commented Aug 9, 2013

Any news on this one?

@arryon
Copy link

arryon commented Aug 9, 2013

For my own use I 'solved' this problem by naming the default namespace 'default', and keeping a registry of all other namespaces. Xpath queries using the default namespace must then begin with 'default', or any other sensible name that you could come up with.

I haven't followed this discussing since, but here's my own commit implementing adding namespaces, could be a starting point for someone to pick this issue up:
https://github.com/arryon/DomCrawler/commit/7c4a774b531110ad8516d3bcdb5052d3f439418e

@fabpot
Copy link
Member

fabpot commented Sep 13, 2013

@jakzal I need to take a decision about this one: can we rely on auto-detection or do we allow people to register namespaces?

@jakzal
Copy link
Contributor Author

jakzal commented Sep 13, 2013

Auto-detection seems to be working fine and there's no need for special treatment of a default namespace. See

I'd appreciate if someone reviewed it to make sure I'm not missing anything (ping @jfsimon @stof).

Notice I had to manually disable the html extension for css selector with CssSelector::disableHtmlExtension() to make filter() work. Does it make sense to query XML with a css selector? I could update the filter() tests to query an HTML document instead.

@stof
Copy link
Member

stof commented Sep 13, 2013

@jakzal I think it makes sense to be able to use CssSelector for XML documents (this is exactly why we allow disabling the HTML extension of the CssSelector component btw).

@fabpot
Copy link
Member

fabpot commented Sep 13, 2013

The CssSelector and the DomCrawler components must support both XML and HTML.

@fabpot
Copy link
Member

fabpot commented Sep 13, 2013

@jakzal Can you make a PR to update the docs?

@jakzal
Copy link
Contributor Author

jakzal commented Sep 13, 2013

👍

@jakzal
Copy link
Contributor Author

jakzal commented Sep 13, 2013

There's an inconsistency in the way XMLs are loaded to the Crawler, I'm not sure if it should be addressed here or as a separate issue.

If an XML is loaded directly from string, the Crawler::addXmlContent is used, which removes the default namespace. If an XML is loaded from a DOMDocument, the default namespace is not removed.

Implication for the current PR is that if we load DOMDocument filtering by default namespace doesn't work atm. I think inconsistency should be solved first, otherwise we'd have two behaviours depending on method used to load the document.

Options we have:

  • Make that addXmlContent doesn't remove the default namespace and register the namespace under a fixed alias (default?) when calling filterXPath
  • Make that addDocument removes the default namespace just like addXmlContent (probably would require dumping to string and loading it again or $dom->documentElement->removeAttributeNS($dom->documentElement->getAttributeNode('xmlns')->nodeValue, '');). This wouldn't involve further changes in filterXPath.
  • Document differences in behaviour between addXmlContent and addDocument

@fabpot
Copy link
Member

fabpot commented Sep 15, 2013

For the inconsistency you found, I would go with option 1. If that's not too complex, I would like to get the fix in this PR, if not, let's create an issue.

Also, can you add a note in the component CHANGELOG file about this new feature?

@jakzal
Copy link
Contributor Author

jakzal commented Sep 17, 2013

Updated this PR with changes described in the option 1.

Note that the outcome is we HAVE to use prefixes once there's at least one non-default namespace in the document (that's how it works in php).

We could still improve this PR by:

  • letting users change the default namespace prefix (Crawler::setDefaultNamespacePrefix()) - this would be very easy
  • allowing users to register namespaces manually - not much harder

@fabpot
Copy link
Member

fabpot commented Sep 17, 2013

The 2 suggestions make sense.

@jakzal
Copy link
Contributor Author

jakzal commented Sep 22, 2013

@fabpot both suggestions are now implemented.

/**
* @covers Symfony\Component\DomCrawler\Crawler::filter
*/
public function testFilter()
{
$this->markSkippedIfCssSelectorNotPresent();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the test for CSS selector as we have removed all such checks everywhere as composer must have been run to execute the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fabpot added a commit that referenced this pull request Sep 25, 2013
This PR was merged into the master branch.

Discussion
----------

[DomCrawler] Added auto-discovery and explicit registration of namespaces in filter() and filterByXPath()

| Q | A
| --- | ---
|Bug fix: | no
|Feature addition: |yes
|Backwards compatibility break: | yes, default namespace is no longer removed in the `addContent` method
|Symfony2 tests pass: | yes|
|Fixes the following tickets: | #4845
|Todo: | -
|License of the code:| MIT
|Documentation PR: | symfony/symfony-docs#2979

* added support for automatic discovery and explicit registration of document namespaces for `Crawler::filterXPath()` and `Crawler::filter()`
* improved content type guessing in `Crawler::addContent()`
* [BC BREAK] `Crawler::addXmlContent()` no longer removes the default document namespace

I mentioned in #4845 it would probably be possible to use [DOMNode::lookupNamespaceURI()](http://www.php.net/manual/en/domnode.lookupnamespaceuri.php) to find a namespace URI by given prefix. Unfortunately we cannot use it here since we'd have to call it on a node in the namespace we're looking for.

Current implementation makes the following query to find a namespace:
```php
$domxpath->query('(//namespace::*[name()="media"])[last()]')
```

Commits
-------

77e2fa5 [DomCrawler] Removed checks if CssSelector is present.
9110468 [DomCrawler] Enabled manual namespace registration.
be1e4e6 [DomCrawler] Enabled default namespace prefix overloading.
943d446 [DomCrawler] Updated the CHANGELOG with namespace auto-registration details.
c6fbb13 [DomCrawler] Added support for an automatic default namespace registration.
587e2dd [DomCrawler] Made that default namespace is no longer removed when loading documents with addXmlContent().
c905bba [DomCrawler] Added more tests for namespaced filtering.
6e717a3 [DomCrawler] Made sure only the default namespace is removed when loading an XML content.
e5b8abb [DomCrawler] Added auto-discovery of namespaces in Crawler::filter() and Crawler::filterByXPath().
@fabpot fabpot merged commit 77e2fa5 into symfony:master Sep 25, 2013
@jakzal jakzal deleted the domcrawler-namespace-autodiscovery branch September 25, 2013 07:22
fabpot added a commit that referenced this pull request Dec 17, 2013
This PR was squashed before being merged into the 2.4 branch (closes #9771).

Discussion
----------

Crawler default namespace fix

| Q             | A
| ------------- | ---
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | yes
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | #9732, #6650
| License       | MIT
| Doc PR        | symfony/symfony-docs/pull/2979

Fix backwards compatibility of xml namespaces for having only one default namespace.

Commits
-------

cfff054 Crawler default namespace fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants