-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declara… #11428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The extra checks and allocations caused a slowdown for an artificial benchmark I created. |
…aration for null namespace, creating incorrect xml representation of the DOM The NULL namespace is only correct when there is no default namespace override. When there is, we need to manually set it to the empty string namespace.
I did some artificial micro benchmarks. The worst case is the following: $dom = new DOMDocument;
$dom->loadXML('<?xml version="1.0"?><div xmlns="http://php.net/" />');
for ($i = 0; $i < 50000; $i++) {
$dom->documentElement->appendChild($dom->createElement('my_element'));
} This is the absolute worst case, because we always have to explicitly add the empty namespace because the default namespace is "http://php.net". Most of the performance loss in this case is because of the additional namespace memory allocation. Unfortunately, libxml2 does not provide us with a way to avoid the allocation. However, this case is very rare: this is a complete misuse of the "default namespace" feature because nothing is in the default namespace. This code yields these results:
where benchmark 1 is the code with this patch applied. Benchmark 2 is the old code which gave the wrong results. A more realistic benchmark case is the following code: $dom = new DOMDocument;
$dom->loadXML('<?xml version="1.0"?><div xmlns="http://php.net/" />');
$de = $dom->documentElement;
$de->appendChild($dom->createElementNS(null, 'newcontainer'));
for ($i = 0; $i < 50000; $i++) {
$de->firstElementChild->appendChild($dom->createElement('my_element'));
}
echo $dom->saveXML(), "\n"; In this case the newly added elements are all in the empty namespace. Only one additional empty namespace needs to be created in this case. This yields the following results:
Again: benchmark 1 is the new code, while benchmark 2 is the old code (which gave incorrect results). So this has minimal performance loss in normal use cases. Note these are micro-benchmarks, so there's a high chance we won't even feel this in real-world use cases. I guess this is just the price we pay for correctness (that we can't avoid). |
NULL | ||
string(7) "some:ns" | ||
<?xml version="1.0"?> | ||
<with xmlns="some:ns"><a/><b/><c xmlns=""/><d xmlns:x="some:ns" x:attrib="val"/><e attrib1="val" attrib2="val"/></with> | ||
<with xmlns="some:ns"><a xmlns=""/><b xmlns=""/><c xmlns=""/><d xmlns:x="some:ns" xmlns="" x:attrib="val"/><e attrib1="val" attrib2="val"/></with> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it normal that the <d>
tag here has 2 xmlns
attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's xmlns:x
and xmlns
. xmlns:x
defines the namespace for prefix x
which is used for x:attrib
. The reason it also has xmlns
is for the default namespace because d
itself is in the empty namespace. If it were not to define xmlns=""
then d
would instead be in some:ns
(inherited from <with>
).
It's a tricky test case :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the impact seems really low for the massive benefit that is correctness. :)
This changes the behavior of The following code: $sitemap = new \DOMDocument();
$sitemap->loadXML('<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>');
$sitemap->documentElement->appendChild($sitemap->createElement('url')); now adds an I’m afraid this would affect a huge number of usages out there. |
@ausi Adding parser=new DOMParser;
dom=parser.parseFromString('<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>', 'text/xml');
dom.documentElement.appendChild(dom.createElement('url'));
(new XMLSerializer).serializeToString(dom) I see however, that this is too dangerous for a bugfix release at least. I'll revert this for now (I can do that tonight), reopen this PR and then discuss what we can do... |
@nielsdos Agree, I know (now
Thank you! I suspect that many people are using |
Yeah probably... There's a couple of spec compliance problems with the DOM extension in PHP. I was wondering whether it makes sense to add an opt-in mode for spec compliance. This would allow existing PHP code to keep working, and new code could opt-in to the spec-compliant behaviour. |
I also found a small bug in this implementation: $dom = new \DOMDocument();
$root = $dom->createElementNS('foo', 'root');
$dom->appendChild($root);
$a1 = $dom->createElementNS('foo', 'a');
$a1->appendChild($dom->createElementNS('foo', 'b'));
$root->appendChild($a1);
$a2 = $dom->createElementNS('foo', 'a');
$a2->appendChild($dom->createElementNS('foo', 'b'));
$root->appendChild($a2);
echo $dom->saveXML(); This adds
While PHP 8.2 correctly outputs:
|
I'll check why this is, and add your test case as a regression test. Thanks for checking. |
… declaration for null namespace, creating incorrect xml representation of the DOM" This reverts commit 7eb3e9c. Although the fix follows the spec, it causes issues because a lot of old code assumes the incorrect behaviour PHP had since a long time. We cannot do this yet, especially not in a stable release. We revert this for the time being. See GH-11428.
@ausi Revert pushed. |
I disagree with this being a problematic BC break - but I might be biased as I reported that problem this change addresses. The current behavior is clearly wrong and there is no workaround for the broken behavior in userland except writing your own serializer using The aforementioned example using I'm also not sure how The example by @ausi is technically invalid as well by the way: |
I, for one, very much like the idea of having a spec compliant implementation ;-) and would be perfectly fine with being required to activate that explicitly for the time being. One has to manually setup things for proper error handling already anyway, adding another option here sounds okay. In case you don't plan to maintain the broken implementation for ever next to the spec compliant one, I'd recommend not naming it 'spec compliance mode' but more like 'legacy mode', allowing you to emit deprecation notices at some / in PHP 8.3+. |
The current behaviour has been in place for about 20 years. Lots of code potentially, unfortunately, depends on the broken behaviour. I dislike this situation too. In fact, there's a couple of spec compliance issues which we can't just fix without causing small or big BC breaks. I'm also annoyed about the ext/dom situation.
I don't have a good answer for this. Technically yes any semantic bugfix can cause a BC break.
Well, the WHATWG spec defines the behaviour that should be followed (which PHP doesn't correctly follow as we know...), so I wouldn't call it broken.
Yes, it's an invalid namespace. The revert is not because of the bug, but because of the BC break (see his linked issue).
Yeah. For now I think this would be the best path forward. I don't know if I can introduce it without an RFC though... It at the very least needs a discussion topic on the internals list. It's likely too late in the 8.3 release given how close the feature freeze is though...
I agree with your name recommendation. |
At least I'm not alone ;-)
I didn't expect any. There probably isn't one.
Point taken.
Not trying to be a smart ass here but the WHATWG deals with HTML. This is an XML issue, something the WHATWG clearly choose to almost ignore given how HTML 5 is defined and how broken things like webcomponents are with regards to XML. But that's all besides the point here. PHP currently doesn't support HTML 5. If one wants to use HTML 5, only the XML serialization format works - unless LibXML 2 suddenly learned HTML 5 and I somehow missed that. As far as my knowledge goes - and that might be of course wrong -, using a DOM Level 1 API with a document that contains XML Namespaces is at the very least highly discouraged as it leads to the problems I described. Even if the WHATWG promotes the use of DOM Level 1 APIs - which I didn't check ;) -, it would probably be consistent to their dislike of XML and the fact HTML 5 doesn't have a namespace - even in XML syntax.
I know and understood that. I was trying to build support for a spec compliant mode of operation that would actually fail this. Actually believe I had to fix some tests in one of my libraries because I got failures for invalid IRI definitions in them.
If you need any help here? Not that I have very much time at hand but this is kind of important for me ;)
I wouldn't know about the effort. I could imagine though that quite some changes might become bigger and thus more annoying to handle. But I'm not a C developer ;) Point taken regarding heat, though. |
Yes, I just shortened the code as far as I could to illustrate the bug. Using a valid namespace has the same effect.
Please don’t trust my judgment here, I just made a wild guess. I don’t know much about XML, only suspected that many other people have the same knowledge level as me 😅 |
To move forward, I'll need to gather a list of all the spec compliance issues. |
Just FYI, we can land and make such changes/add new features post feature freeze, it just has more stingy requirements and ABI is usually only fixed on RC1 so there isn't even an ABI compatibility requirement until then. |
I can confirm that @nielsdos Should I create a seperate bug report for that? |
Sure. It looks like the namespace reuse code is triggering an issue somewhere, might be libxml2 related. |
This example changed results after the dom changes in 8.2.8: <?php
$template = <<< EOB
<wsse:Security xmlns:wsse=".">
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
</ds:Signature>
</wsse:Security>
EOB;
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($template);
$parent_doc = new DOMDocument('1.0', 'UTF-8');
$parent_doc->loadXML('<xml><child/></xml>');
$wsse = $parent_doc->importNode($doc->documentElement, true);
$parent_doc->firstChild->firstChild->appendChild($wsse);
echo $parent_doc->saveXML(); 8.2.7: <?xml version="1.0"?>
<xml><child><wsse:Security xmlns:wsse="." xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
</ds:Signature>
</wsse:Security></child></xml> 8.2.8: <?xml version="1.0"?>
<xml><child><wsse:Security xmlns:wsse=".">
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
</ds:Signature>
</wsse:Security></child></xml> |
@rlerdorf Do you consider that a bug? Because the only difference I see is the removal of the redundant ds-prefix definition on |
I’d also consider the result from 8.2.8 to be correct as the |
I consider it a bug to change this behaviour in a point release, yes. This needs to be done in a major release with a blurb in the announcement explaining the behaviour change so people don't spend a lot of time trying to track down why their unit tests suddenly started failing on a point release upgrade. Something this subtle can be especially hard to find. |
The removal of the redundant namespace declaration is a consequence of a bugfix. If every observable detail is considered a breaking change, then many bugfixes become impossible to do in minor releases IMHO. |
In particular given that it's technically correct and not breaking change from an XML perspective. The actual serialized string might be different but if code relies on that, I'd say somebody did not understand XML ;) |
Will be obsoleted by my opt-in spec compliance project. |
…tion for null namespace, creating incorrect xml representation of the DOM
The NULL namespace is only correct when there is no default namespace override. When there is, we need to manually set it to the empty string namespace.
Marking as a draft because even though the results are correct, this now runs into https://bugs.php.net/bug.php?id=47530 ...EDIT: this one is fixed now