Skip to content

Set DOMAttr::$value and DOMAttr::$nodeValue without expanding entities #10132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

tstarling
Copy link
Contributor

The manual refers to the DOM Level 3 Core spec which says:

"On setting, this creates a Text node with the unparsed contents of the
string. I.e. any characters that an XML processor would recognize as
markup are instead treated as literal text."

PHP is expanding entities when DOMAttr::value is set, which is
non-compliant and is a difference in behaviour compared to browser DOM
implementations.

So, when value is set, remove all children of the attribute node. Then
create a single text node and insert that as the only child of the
attribute.

Add tests.

@tstarling
Copy link
Contributor Author

I ran all the ext/dom tests under valgrind. There were three existing failures against master: ext/dom/tests/dom003.phpt ext/dom/tests/dom_set_attr_node.phpt ext/dom/tests/domelement.phpt . No new failures.

The unlink loop is tedious, but at least it modifies the tree using a stable API, and doesn't crash. We can't do exactly what xmlNodeSetContentLen() is doing, because the UPDATE_LAST_CHILD_AND_PARENT() macro is private.

Downstream bug: https://phabricator.wikimedia.org/T324408

@beberlei
Copy link
Contributor

While DomAttr::$value behavior is not compliant to DOM Spec, I don't consider this a bug at this point, but changing it is a BC break and people might be relying on the current behavior, which is not documented in detail.

As such I would prefer if we introduce a new method instead that reaches the spec behavior, DOMAttr::setValue for example.

If we consider changing DOMAttr::$value write behavior at the very least we should first emit a warning that the behavior changes in PHP 8 and then make the change in 9.

@tstarling
Copy link
Contributor Author

Note that the property does not round-trip at present, it reads back a different value to what you put in.

php > $attrNode = new DOMAttr('test');
php > $attrNode->value = '&';
php > print $attrNode->value;
&

That seems like a bug to me.

It may lead to security vulnerabilities in applications, because if you sanitize JS or CSS and then put it in an attribute this way, and then serialize the result, user input may escape from a quoted string inside the JS or CSS attribute.

My suspicion is that approximately nobody uses this, that's why nobody has reported this before.

But if you're really sure it's not a bug, and you're offering to merge DOMAttr::setValue(), then I will do that instead.

@beberlei
Copy link
Contributor

I'd like to get @cmb69 input on what to do

@cmb69
Copy link
Member

cmb69 commented Dec 21, 2022

I tend to agree that the current behavior is a bug, although we probably should not refer to a working draft of a W3C specification, but rather to the WHATWG living standard (but I have no idea how the latter expects to handle that). Anyhow, even the libxml2 documentation on xmlNodeSetContentLen() is pretty clear:

@content is supposed to be a piece of XML CDATA, so it allows entity references, but XML special chars need to be escaped first by using xmlEncodeEntitiesReentrant() resp. xmlEncodeSpecialChars().

Actually, using xmlEncodeSpecialChars() would show the same behavior wrt. the test case as the current implementation of this PR. However, there is no variant which would allow to pass the length of the string, so NUL bytes would terminate the string. That might not be an issue, though; consider:

<?php
$doc = new DOMDocument;
$elt = $doc->createElement('elt');
$doc->appendChild($elt);
$elt->setAttribute('a','&');
$attr = $elt->getAttributeNode('a');
$attr->value = "1\x002";
var_dump($doc->saveXML($elt));

outputs:

string(12) "<elt a="1"/>"

So the NUL byte truncated the attribute's value anyway.

Note that DOMDocument::createElement() has the same behavior:

<?php

$doc = new DOMDocument;
$elt = $doc->createElement('elt', 'foo & bar & &amp;');
var_dump($doc->saveXML($elt));

outputs:

string(28) "<elt>foo & bar & &amp;</elt>"

This has been reported ages ago, and finally fixed as doc bug only. Maybe we should handle this issue in the same way, or apply one of @beberlei's suggestions for BC.

@ju1ius
Copy link
Contributor

ju1ius commented Dec 21, 2022

we probably should not refer to a working draft of a W3C specification, but rather to the WHATWG living standard (but I have no idea how the latter expects to handle that).

I've read the whatwg HTML and DOM specs several times and entities should:

  1. be expanded when tokenizing (see the various character-reference related states of the tokenization algorithm)
  2. be used for escaping while serializing

In the spec for replacing data of CharacterData nodes, you won't find anything about entities. In the spec for setting an attribute's value, you won't find anything either.

So the current behaviour is clearly a bug.

@cscott
Copy link
Contributor

cscott commented Dec 21, 2022

Agreed that the behavior is a bug according to DOM spec. Using an up-to-date browser console as an executable version of "the latest WHATWG spec":

 <foo>
 > a = document.createAttribute("local")
 local=""
 > a.value = "&amp;"
 "&amp;"
 > e.setAttributeNodeNS(a)
 null
 > e.outerHTML
 "<foo local=\"&amp;amp;\"></foo>" 

I don't think setting a value has ever decoded entities in any version of the DOM spec. Entities are only decoded in the .innerHtml setters.

@tstarling
Copy link
Contributor Author

@content is supposed to be a piece of XML CDATA, so it allows entity references, but XML special chars need to be escaped first by using xmlEncodeEntitiesReentrant() resp. xmlEncodeSpecialChars().

Actually, using xmlEncodeSpecialChars() would show the same behavior wrt. the test case as the current implementation of this PR. However, there is no variant which would allow to pass the length of the string, so NUL bytes would terminate the string.

I did read that doc comment. Using xmlEncodeSpecialChars() is a bad idea for the following additional reasons:

  • Performance: xmlNodeSetContentLen() is parsing an XML-like input string and producing a node list. If you use xmlNodeSetContent(xmlEncodeSpecialChars()), every ampersand becomes a separately allocated entity node.
  • Spec compliance: I need to go off on a tangent for a couple of paragraphs to explain this.

In the old DOM specs up to DOM Level 3 Core, an Attr node in an XML document has an arbitrary number of children, whereas an Attr node in an HTML document just has a string value. The WHATWG DOM favours the HTML data model. It specifies that the value of an Attr is a string, and removes the old discussion of XML Attr children.

In the old DOM specs, assigning to the value of an Attr node creates a single Text node and replaces all children of the Attr with that Text node. In the WHATWG DOM spec, assigning to the value of an Attr node just updates the string associated with the Attr.

Never in any DOM spec is it suggested that assigning to the value of an Attr will create an Entity node child for every ampersand contained in the value. But that is what xmlNodeSetContent(xmlEncodeSpecialChars()) would do.

In theory it would be nice if PHP followed the WHATWG data model, but in practice it is tied to libxml2 in which an Attr node has an arbitrary number of children. The nearest available approximation to the WHATWG spec is to remove all children of the Attr and insert a single Text node child with the specified value string. That is what my patch does.

@tstarling
Copy link
Contributor Author

tstarling commented Dec 21, 2022

To assess the need for bugwards compatibility, I did the following searches:

  • PHPStorm usages of DOMAttr::$value in MediaWiki, vendor, some extensions
  • DOMAttr in MediaWiki code search
  • DOMAttr in Symfony
  • DOMAttr in *.php files in GitHub. I looked through the first 100 hits. I opened relevant files and looked for nearby ->value.
  • "attr->value" in *.php files in GitHub. I looked through the first 100 hits.

Findings:

  • An assignment to nodeValue in Vanilla, although it doesn't seem to be called, it's only there for completeness. Assignment to nodeValue is similarly broken and a fix could be added to this PR. There is no awareness of the bug in that file.
  • A couple of instances of code that reads from DOMAttr::$value but then writes with DOMElement::setAttribute(). So it looks like bug awareness, but not bug utilization, i.e. a workaround anticipating a fix.
  • My own code showed up, the motivating downstream bug.
  • A possibly buggy assignment which would be fixed by this PR.
  • A possibly buggy assignment. At least, tracing callers back a few steps shows no awareness or htmlspecialchars() calls. Google translate on this doc comment shows no awareness of the fact that entity expansion will be done.

In summary, the code review supports my intuition that fixing this bug in PHP will be beneficial for users, implicitly fixing bugs in applications which depend on the broken feature.

@tstarling
Copy link
Contributor Author

The situation with DOMAttr::$nodeValue is very much the same, so I added the fix to the PR.

@tstarling tstarling changed the title Set DOMAttr::$value without expanding entities Set DOMAttr::$value and DOMAttr::$nodeValue without expanding entities Dec 22, 2022
@beberlei
Copy link
Contributor

Ok, then lets consider it a bug and fix it. Thank you for the changes @tstarling

@cmb69
Copy link
Member

cmb69 commented Dec 27, 2022

I wonder why AppVeyorCI has not been triggered.

By the way, the missing documentation is lying around for a long time: php/doc-en#923. Still, it makes sense to fix it in the next minor version.

@ThomasWeinert
Copy link
Contributor

It seems that DOMAttr::$textContent is broken, too (since PHP 5.6.14). Writing the property results in an empty value. Should this be addressed in this issue or to a separate one?

@cmb69
Copy link
Member

cmb69 commented Jan 4, 2023

@ThomasWeinert, oh, that is likely a regression due to b2954c6. Could you please open another ticket?

@nielsdos
Copy link
Member

nielsdos commented Jun 3, 2023

Related to GH-8388.
I also agree that fixing this will likely fix hidden bugs and prevent security issues.

@tstarling Are you willing to rebase this? The code looks good.

tstarling added 6 commits June 5, 2023 09:10
The manual refers to the DOM Level 3 Core spec which says:

"On setting, this creates a Text node with the unparsed contents of the
string. I.e. any characters that an XML processor would recognize as
markup are instead treated as literal text."

PHP is expanding entities when DOMAttr::value is set, which is
non-compliant and is a difference in behaviour compared to browser DOM
implementations.

So, when value is set, remove all children of the attribute node. Then
create a single text node and insert that as the only child of the
attribute.

Add tests.
A few callers remove all children of a node. The way it was done in
node.c was unsafe, because it left nodep->last dangling. It just happens
to not crash if xmlNodeSetContent() is called immediately afterwards.
The length is passed to xmlStrndup(), which also adds 1, and adds a null
terminator past the end. It worked because the length is not actually
stored. Strings in libxml2 are null terminated. Passing the length just
avoids a call to strlen().
@tstarling
Copy link
Contributor Author

@tstarling Are you willing to rebase this? The code looks good.

Done.

@nielsdos
Copy link
Member

nielsdos commented Jun 5, 2023

Merged manually via 50fdad8..ab77485. Thanks.

@nielsdos nielsdos closed this Jun 5, 2023
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 14, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 14, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 16, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 16, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 17, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 22, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 25, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 26, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Dec 28, 2023
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Jan 17, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Jan 24, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Jan 27, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 7, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 8, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 9, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 10, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 17, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 19, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 23, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 23, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
nielsdos added a commit to nielsdos/php-src that referenced this pull request Feb 29, 2024
This is an adapted re-introduction of phpGH-10132.

Co-authored-by: Tim Starling <tstarling@wikimedia.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants