Skip to content

Fix GH-11404: DOMDocument::savexml and friends ommit xmlns="" declara… #11428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

nielsdos
Copy link
Member

@nielsdos nielsdos commented Jun 11, 2023

…tion for null namespace, creating incorrect xml representation of the DOM

The NULL namespace is only correct when there is no default namespace override. When there is, we need to manually set it to the empty string namespace.

Marking as a draft because even though the results are correct, this now runs into https://bugs.php.net/bug.php?id=47530 ... EDIT: this one is fixed now

@nielsdos
Copy link
Member Author

The extra checks and allocations caused a slowdown for an artificial benchmark I created.
I did some fix in the last commit to improve this. The slowdown for the normal cases is now fairly small, but I need some more time to think on how to avoid the extra work in more cases.

nielsdos added 2 commits June 17, 2023 00:37
…aration for null namespace, creating incorrect xml representation of the DOM

The NULL namespace is only correct when there is no default namespace
override. When there is, we need to manually set it to the empty string
namespace.
@nielsdos
Copy link
Member Author

I did some artificial micro benchmarks.

The worst case is the following:

$dom = new DOMDocument;
$dom->loadXML('<?xml version="1.0"?><div xmlns="http://php.net/" />');

for ($i = 0; $i < 50000; $i++) {
    $dom->documentElement->appendChild($dom->createElement('my_element'));
}

This is the absolute worst case, because we always have to explicitly add the empty namespace because the default namespace is "http://php.net". Most of the performance loss in this case is because of the additional namespace memory allocation. Unfortunately, libxml2 does not provide us with a way to avoid the allocation. However, this case is very rare: this is a complete misuse of the "default namespace" feature because nothing is in the default namespace.

This code yields these results:

Benchmark 1: ./sapi/cli/php dombench.php
  Time (mean ± σ):      33.8 ms ±   0.8 ms    [User: 27.9 ms, System: 5.7 ms]
  Range (min … max):    32.2 ms …  36.4 ms    80 runs
 
Benchmark 2: ./sapi/cli/php_old dombench.php
  Time (mean ± σ):      28.4 ms ±   1.0 ms    [User: 23.0 ms, System: 5.3 ms]
  Range (min … max):    26.4 ms …  31.3 ms    96 runs
 
Summary
  ./sapi/cli/php_old dombench.php ran
    1.19 ± 0.05 times faster than ./sapi/cli/php dombench.php

where benchmark 1 is the code with this patch applied. Benchmark 2 is the old code which gave the wrong results.

A more realistic benchmark case is the following code:

$dom = new DOMDocument;
$dom->loadXML('<?xml version="1.0"?><div xmlns="http://php.net/" />');

$de = $dom->documentElement;
$de->appendChild($dom->createElementNS(null, 'newcontainer'));

for ($i = 0; $i < 50000; $i++) {
    $de->firstElementChild->appendChild($dom->createElement('my_element'));
}

echo $dom->saveXML(), "\n";

In this case the newly added elements are all in the empty namespace. Only one additional empty namespace needs to be created in this case.

This yields the following results:

Benchmark 1: ./sapi/cli/php dombench.php
  Time (mean ± σ):      31.1 ms ±   1.0 ms    [User: 26.5 ms, System: 4.5 ms]
  Range (min … max):    29.5 ms …  33.8 ms    86 runs
 
Benchmark 2: ./sapi/cli/php_old dombench.php
  Time (mean ± σ):      30.6 ms ±   0.8 ms    [User: 25.6 ms, System: 4.9 ms]
  Range (min … max):    29.2 ms …  33.1 ms    90 runs
 
Summary
  ./sapi/cli/php_old dombench.php ran
    1.02 ± 0.04 times faster than ./sapi/cli/php dombench.php

Again: benchmark 1 is the new code, while benchmark 2 is the old code (which gave incorrect results). So this has minimal performance loss in normal use cases.

Note these are micro-benchmarks, so there's a high chance we won't even feel this in real-world use cases. I guess this is just the price we pay for correctness (that we can't avoid).

@nielsdos nielsdos marked this pull request as ready for review June 16, 2023 22:55
@nielsdos nielsdos requested a review from Girgias June 16, 2023 22:55
NULL
string(7) "some:ns"
<?xml version="1.0"?>
<with xmlns="some:ns"><a/><b/><c xmlns=""/><d xmlns:x="some:ns" x:attrib="val"/><e attrib1="val" attrib2="val"/></with>
<with xmlns="some:ns"><a xmlns=""/><b xmlns=""/><c xmlns=""/><d xmlns:x="some:ns" xmlns="" x:attrib="val"/><e attrib1="val" attrib2="val"/></with>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it normal that the <d> tag here has 2 xmlns attributes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's xmlns:x and xmlns. xmlns:x defines the namespace for prefix x which is used for x:attrib. The reason it also has xmlns is for the default namespace because d itself is in the empty namespace. If it were not to define xmlns="" then d would instead be in some:ns (inherited from <with>).
It's a tricky test case :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks for the explanation!

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the impact seems really low for the massive benefit that is correctness. :)

@ausi
Copy link

ausi commented Jun 19, 2023

This changes the behavior of createElement() in documents that use namespaces quite drastically I think.

The following code:

$sitemap = new \DOMDocument();
$sitemap->loadXML('<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>');
$sitemap->documentElement->appendChild($sitemap->createElement('url'));

now adds an <url xmlns=""/> instead of the expected <url/>. (Tested with version PHP 8.3.0-dev a37ad64)

I’m afraid this would affect a huge number of usages out there.

@nielsdos
Copy link
Member Author

nielsdos commented Jun 19, 2023

@ausi Adding xmlns="" to url is actually the expected behaviour required by the spec. You should be using createElementNS to place it in a namespace. You can cross-check this with the JS console in the browser:

parser=new DOMParser;
dom=parser.parseFromString('<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>https://example.com/</loc></url></urlset>', 'text/xml');
dom.documentElement.appendChild(dom.createElement('url'));
(new XMLSerializer).serializeToString(dom)

I see however, that this is too dangerous for a bugfix release at least. I'll revert this for now (I can do that tonight), reopen this PR and then discuss what we can do...
cc @Girgias

@ausi
Copy link

ausi commented Jun 19, 2023

@nielsdos Agree, I know (now ☺️) that this usage of createElement() is wrong.

I see however, that this is too dangerous for a bugfix release at least. I'll revert this for now (I can do that tonight), reopen this PR and then discuss what we can do...

Thank you! I suspect that many people are using createElement() wrong and would be affected by this.

@nielsdos
Copy link
Member Author

I suspect that many people are using createElement() wrong and would be affected by this.

Yeah probably... There's a couple of spec compliance problems with the DOM extension in PHP. I was wondering whether it makes sense to add an opt-in mode for spec compliance. This would allow existing PHP code to keep working, and new code could opt-in to the spec-compliant behaviour.
It's just a quick thought right now. I'll try to find some time to discuss this on the mailing list sometime soon hopefully.

@ausi
Copy link

ausi commented Jun 19, 2023

I also found a small bug in this implementation:

$dom = new \DOMDocument();
$root = $dom->createElementNS('foo', 'root');
$dom->appendChild($root);

$a1 = $dom->createElementNS('foo', 'a');
$a1->appendChild($dom->createElementNS('foo', 'b'));
$root->appendChild($a1);

$a2 = $dom->createElementNS('foo', 'a');
$a2->appendChild($dom->createElementNS('foo', 'b'));
$root->appendChild($a2);

echo $dom->saveXML();

This adds xmlns="foo" to the second <b> element:

<root xmlns="foo"><a><b/></a><a><b xmlns="foo"/></a></root>

While PHP 8.2 correctly outputs:

<root xmlns="foo"><a><b/></a><a><b/></a></root>

@nielsdos
Copy link
Member Author

I'll check why this is, and add your test case as a regression test. Thanks for checking.

nielsdos added a commit that referenced this pull request Jun 19, 2023
… declaration for null namespace, creating incorrect xml representation of the DOM"

This reverts commit 7eb3e9c.

Although the fix follows the spec, it causes issues because a lot of old
code assumes the incorrect behaviour PHP had since a long time.
We cannot do this yet, especially not in a stable release.
We revert this for the time being.
See GH-11428.
@nielsdos nielsdos reopened this Jun 19, 2023
@nielsdos
Copy link
Member Author

@ausi Revert pushed.

@theseer
Copy link
Contributor

theseer commented Jun 19, 2023

I disagree with this being a problematic BC break - but I might be biased as I reported that problem this change addresses.

The current behavior is clearly wrong and there is no workaround for the broken behavior in userland except writing your own serializer using XMLWriter. If fixing this would qualify as a BC break, then any bug fix is one given the current behavior changes and there might be code out there that will break because it assumed the broken behavior to be (still) in place.

The aforementioned example using createElement rather than createElementNS is broken in itself: You cannot expect to use a DOM Level 1 API to work on a document that uses namespaces. More over, it will only work by accident if you never actually work with the DOM created other than serializing it to XML. Any attempt for instance to use DOMXPath will lead to unexpected results or requires quirky workarounds like the use of local-name() to "fix" the inconsistent namespacing.

I'm also not sure how getElementsByTagName would work with such a mixed DOM.

The example by @ausi is technically invalid as well by the way: foo is not a valid namespace IRI according to the specs as I understand it - but that's of course besides the point of the example. (See https://www.w3.org/TR/xml-names11/#sec-namespaces and https://www.rfc-editor.org/rfc/rfc3987.html#section-2.2 in case you care anyway. I believe the minimum requirement would be to have a : in the string - making it an URI - or even :// for full IRI compliance.)

@theseer
Copy link
Contributor

theseer commented Jun 19, 2023

I suspect that many people are using createElement() wrong and would be affected by this.

Yeah probably... There's a couple of spec compliance problems with the DOM extension in PHP. I was wondering whether it makes sense to add an opt-in mode for spec compliance. This would allow existing PHP code to keep working, and new code could opt-in to the spec-compliant behaviour. It's just a quick thought right now. I'll try to find some time to discuss this on the mailing list sometime soon hopefully.

I, for one, very much like the idea of having a spec compliant implementation ;-) and would be perfectly fine with being required to activate that explicitly for the time being. One has to manually setup things for proper error handling already anyway, adding another option here sounds okay.

In case you don't plan to maintain the broken implementation for ever next to the spec compliant one, I'd recommend not naming it 'spec compliance mode' but more like 'legacy mode', allowing you to emit deprecation notices at some / in PHP 8.3+.

@nielsdos
Copy link
Member Author

I disagree with this being a problematic BC break - but I might be biased as I reported that problem this change addresses.

The current behaviour has been in place for about 20 years. Lots of code potentially, unfortunately, depends on the broken behaviour. I dislike this situation too. In fact, there's a couple of spec compliance issues which we can't just fix without causing small or big BC breaks. I'm also annoyed about the ext/dom situation.

The current behavior is clearly wrong and there is no workaround for the broken behavior in userland except writing your own serializer using XMLWriter. If fixing this would qualify as a BC break, then any bug fix is one given the current behavior changes and there might be code out there that will break because it assumed the broken behavior to be (still) in place.

I don't have a good answer for this. Technically yes any semantic bugfix can cause a BC break.
I guess it depends on how big the BC break is, and it needs to be handled case by case.
For some bug fixes which are considered risky, they're only done for the master branch.
For this case, I think given the amount of code potentially out there, it's risky even for master. Unless we could somehow prove that the impact would be very small.

The aforementioned example using createElement rather than createElementNS is broken in itself: You cannot expect to use a DOM Level 1 API to work on a document that uses namespaces. More over, it will only work by accident if you never actually work with the DOM created other than serializing it to XML. Any attempt for instance to use DOMXPath will lead to unexpected results or requires quirky workarounds like the use of local-name() to "fix" the inconsistent namespacing.

Well, the WHATWG spec defines the behaviour that should be followed (which PHP doesn't correctly follow as we know...), so I wouldn't call it broken.
But that doesn't change the fact it's a risky BC break for at least stable branches.

The example by @ausi is technically invalid as well by the way: foo is not a valid namespace IRI according to the specs as I understand it - but that's of course besides the point of the example. (See https://www.w3.org/TR/xml-names11/#sec-namespaces and https://www.rfc-editor.org/rfc/rfc3987.html#section-2.2 in case you care anyway. I believe the minimum requirement would be to have a : in the string - making it an URI - or even :// for full IRI compliance.)

Yes, it's an invalid namespace. The revert is not because of the bug, but because of the BC break (see his linked issue).

I, for one, very much like the idea of having a spec compliant implementation ;-) and would be perfectly fine with being required to activate that explicitly for the time being. One has to manually setup things for proper error handling already anyway, adding another option here sounds okay.

Yeah. For now I think this would be the best path forward. I don't know if I can introduce it without an RFC though... It at the very least needs a discussion topic on the internals list. It's likely too late in the 8.3 release given how close the feature freeze is though...

In case you don't plan to maintain the broken implementation for ever next to the spec compliant one, I'd recommend not naming it 'spec compliance mode' but more like 'legacy mode', allowing you to emit deprecation notices at some / in PHP 8.3+.

I agree with your name recommendation.
I wouldn't deprecate it anytime soon. The overhead of such a legacy/non-legacy mode would likely be small, and PHP already gets quite some heat about the amount of BC breaks & deprecations.

@theseer
Copy link
Contributor

theseer commented Jun 19, 2023

I disagree with this being a problematic BC break - but I might be biased as I reported that problem this change addresses.

The current behaviour has been in place for about 20 years. Lots of code potentially, unfortunately, depends on the broken behaviour. I dislike this situation too. In fact, there's a couple of spec compliance issues which we can't just fix without causing small or big BC breaks. I'm also annoyed about the ext/dom situation.

At least I'm not alone ;-)

The current behavior is clearly wrong and there is no workaround for the broken behavior in userland except writing your own serializer using XMLWriter. If fixing this would qualify as a BC break, then any bug fix is one given the current behavior changes and there might be code out there that will break because it assumed the broken behavior to be (still) in place.

I don't have a good answer for this.

I didn't expect any. There probably isn't one.

Technically yes any semantic bugfix can cause a BC break. I guess it depends on how big the BC break is, and it needs to be handled case by case. For some bug fixes which are considered risky, they're only done for the master branch. For this case, I think given the amount of code potentially out there, it's risky even for master. Unless we could somehow prove that the impact would be very small.

Point taken.

The aforementioned example using createElement rather than createElementNS is broken in itself: You cannot expect to use a DOM Level 1 API to work on a document that uses namespaces. More over, it will only work by accident if you never actually work with the DOM created other than serializing it to XML. Any attempt for instance to use DOMXPath will lead to unexpected results or requires quirky workarounds like the use of local-name() to "fix" the inconsistent namespacing.

Well, the WHATWG spec defines the behaviour that should be followed (which PHP doesn't correctly follow as we know...), so I wouldn't call it broken. But that doesn't change the fact it's a risky BC break for at least stable branches.

Not trying to be a smart ass here but the WHATWG deals with HTML. This is an XML issue, something the WHATWG clearly choose to almost ignore given how HTML 5 is defined and how broken things like webcomponents are with regards to XML. But that's all besides the point here. PHP currently doesn't support HTML 5. If one wants to use HTML 5, only the XML serialization format works - unless LibXML 2 suddenly learned HTML 5 and I somehow missed that.

As far as my knowledge goes - and that might be of course wrong -, using a DOM Level 1 API with a document that contains XML Namespaces is at the very least highly discouraged as it leads to the problems I described. Even if the WHATWG promotes the use of DOM Level 1 APIs - which I didn't check ;) -, it would probably be consistent to their dislike of XML and the fact HTML 5 doesn't have a namespace - even in XML syntax.

The example by @ausi is technically invalid as well by the way: foo is not a valid namespace IRI according to the specs as I understand it - but that's of course besides the point of the example. (See https://www.w3.org/TR/xml-names11/#sec-namespaces and https://www.rfc-editor.org/rfc/rfc3987.html#section-2.2 in case you care anyway. I believe the minimum requirement would be to have a : in the string - making it an URI - or even :// for full IRI compliance.)

Yes, it's an invalid namespace. The revert is not because of the bug, but because of the BC break (see his linked issue).

I know and understood that. I was trying to build support for a spec compliant mode of operation that would actually fail this. Actually believe I had to fix some tests in one of my libraries because I got failures for invalid IRI definitions in them.

I, for one, very much like the idea of having a spec compliant implementation ;-) and would be perfectly fine with being required to activate that explicitly for the time being. One has to manually setup things for proper error handling already anyway, adding another option here sounds okay.

Yeah. For now I think this would be the best path forward. I don't know if I can introduce it without an RFC though... It at the very least needs a discussion topic on the internals list. It's likely too late in the 8.3 release given how close the feature freeze is though...

If you need any help here? Not that I have very much time at hand but this is kind of important for me ;)

In case you don't plan to maintain the broken implementation for ever next to the spec compliant one, I'd recommend not naming it 'spec compliance mode' but more like 'legacy mode', allowing you to emit deprecation notices at some / in PHP 8.3+.

I agree with your name recommendation. I wouldn't deprecate it anytime soon. The overhead of such a legacy/non-legacy mode would likely be small, and PHP already gets quite some heat about the amount of BC breaks & deprecations.

I wouldn't know about the effort. I could imagine though that quite some changes might become bigger and thus more annoying to handle. But I'm not a C developer ;)

Point taken regarding heat, though.

@ausi
Copy link

ausi commented Jun 19, 2023

The example by @ausi is technically invalid as well by the way: foo is not a valid namespace IRI

Yes, I just shortened the code as far as I could to illustrate the bug. Using a valid namespace has the same effect.
But yeah, if this is invalid, triggering a notice/warning would probably make sense 👍

I disagree with this being a problematic BC break - but I might be biased as I reported that problem this change addresses.

Please don’t trust my judgment here, I just made a wild guess. I don’t know much about XML, only suspected that many other people have the same knowledge level as me 😅

@nielsdos
Copy link
Member Author

To move forward, I'll need to gather a list of all the spec compliance issues.
Only then I can know for sure whether adding a flag to opt-in would suffice, and what other alternative solutions could look like.
In general I want to avoid breaking BC too hard for fundamental functionality like this, especially given the age of the bugs, so an opt-in solution would be my preference.

@Girgias
Copy link
Member

Girgias commented Jun 20, 2023

Just FYI, we can land and make such changes/add new features post feature freeze, it just has more stingy requirements and ABI is usually only fixed on RC1 so there isn't even an ABI compatibility requirement until then.

@ausi
Copy link

ausi commented Jun 21, 2023

@ausi Revert pushed.

I can confirm that createElement() works now as before, but the bug described in #11428 (comment) seems to be present still (on master).

@nielsdos Should I create a seperate bug report for that?

@nielsdos
Copy link
Member Author

@ausi Revert pushed.

I can confirm that createElement() works now as before, but the bug described in #11428 (comment) seems to be present still (on master).

@nielsdos Should I create a seperate bug report for that?

Sure. It looks like the namespace reuse code is triggering an issue somewhere, might be libxml2 related.

@rlerdorf
Copy link
Member

rlerdorf commented Aug 4, 2023

This example changed results after the dom changes in 8.2.8:

<?php
$template = <<< EOB
<wsse:Security xmlns:wsse=".">
    <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    </ds:Signature>
</wsse:Security>
EOB;

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($template);
$parent_doc = new DOMDocument('1.0', 'UTF-8');
$parent_doc->loadXML('<xml><child/></xml>');
$wsse = $parent_doc->importNode($doc->documentElement, true);
$parent_doc->firstChild->firstChild->appendChild($wsse);
echo $parent_doc->saveXML();

8.2.7:

<?xml version="1.0"?>
<xml><child><wsse:Security xmlns:wsse="." xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    </ds:Signature>
</wsse:Security></child></xml>

8.2.8:

<?xml version="1.0"?>
<xml><child><wsse:Security xmlns:wsse=".">
    <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    </ds:Signature>
</wsse:Security></child></xml>

@theseer
Copy link
Contributor

theseer commented Aug 4, 2023

@rlerdorf Do you consider that a bug? Because the only difference I see is the removal of the redundant ds-prefix definition on <wsse:Security> - and I'd say technically both variants are correct.

@ausi
Copy link

ausi commented Aug 4, 2023

I’d also consider the result from 8.2.8 to be correct as the xmlns:ds prefix definition is not needed on the <wsse:Security> node.

@rlerdorf
Copy link
Member

rlerdorf commented Aug 4, 2023

@rlerdorf Do you consider that a bug? Because the only difference I see is the removal of the redundant ds-prefix definition on <wsse:Security> - and I'd say technically both variants are correct.

I consider it a bug to change this behaviour in a point release, yes. This needs to be done in a major release with a blurb in the announcement explaining the behaviour change so people don't spend a lot of time trying to track down why their unit tests suddenly started failing on a point release upgrade. Something this subtle can be especially hard to find.

@nielsdos
Copy link
Member Author

nielsdos commented Aug 4, 2023

I consider it a bug to change this behaviour in a point release, yes. This needs to be done in a major release with a blurb in the announcement explaining the behaviour change so people don't spend a lot of time trying to track down why their unit tests suddenly started failing on a point release upgrade. Something this subtle can be especially hard to find.

The removal of the redundant namespace declaration is a consequence of a bugfix. If every observable detail is considered a breaking change, then many bugfixes become impossible to do in minor releases IMHO.

@theseer
Copy link
Contributor

theseer commented Aug 5, 2023

The removal of the redundant namespace declaration is a consequence of a bugfix. If every observable detail is considered a breaking change, then many bugfixes become impossible to do in minor releases IMHO.

In particular given that it's technically correct and not breaking change from an XML perspective. The actual serialized string might be different but if code relies on that, I'd say somebody did not understand XML ;)

@nielsdos
Copy link
Member Author

Will be obsoleted by my opt-in spec compliance project.

@nielsdos nielsdos closed this Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOMDocument::savexml and friends ommit xmlns="" declaration for null namespace, creating incorrect xml representation of the DOM
5 participants