gh-135661: Fix CDATA section parsing in HTMLParser #135665
"] ]>" and "]] >" no longer end the CDATA section.
Lib/html/parser.py
Outdated
    j = rawdata.find(']]>')
    if j < 0:
        return -1
    self.unknown_decl(rawdata[i+3: j])
    return j + 3
According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), it should be either data or a bogus comment (which ends with ">", not "]]>"), depending on the context. I may be misunderstanding the HTML5 standard, because this part is difficult to implement.
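The two outcomes named in this comment can be sketched as follows. This is a simplified illustration, not the code in this PR: the function name `parse_cdata_like` and the `cdata_allowed` flag are hypothetical, and every other tokenizer state is ignored.

```python
def parse_cdata_like(rawdata, i, cdata_allowed):
    """Split off the "<![CDATA[" construct that starts at index i.

    Returns (kind, payload, next_index), or None when the terminator
    has not been seen yet. Hypothetical helper for illustration only.
    """
    assert rawdata.startswith("<![CDATA[", i)
    if cdata_allowed:
        # Foreign content: a real CDATA section, ended only by "]]>".
        j = rawdata.find("]]>", i + 9)
        if j < 0:
            return None
        return ("cdata", rawdata[i + 9:j], j + 3)
    # HTML content: a bogus comment that starts after "<!" and is
    # ended by the first ">".
    j = rawdata.find(">", i + 2)
    if j < 0:
        return None
    return ("comment", rawdata[i + 2:j], j + 1)


sample = "<![CDATA[I have a > in the middle]]><hr>"
print(parse_cdata_like(sample, 0, True))
print(parse_cdata_like(sample, 0, False))
```

With `cdata_allowed=False` the construct stops at the first ">", so the rest of the intended section leaks out as ordinary markup.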
I tried copying the content of the tests in the following file:
<!DOCTYPE html>
<html>
<body>
<![CDATA[just some plain text]]><hr>
<![CDATA[<!-- not a comment -->]]><hr>
<![CDATA[&not-an-entity-ref;]]><hr>
<![CDATA[<not a='start tag'>]]><hr>
<![CDATA[]]><hr>
<![CDATA[[[I have many brackets]]]]><hr>
<![CDATA[I have a > in the middle]]><hr>
<![CDATA[I have a ]] in the middle]]><hr>
<![CDATA[] ]>]]><hr>
<![CDATA[]] >]]><hr>
<![CDATA[
if (a < b && a > b) {
printf("[<marquee>How?</marquee>]");
}
]]><hr>
</body>
</html>
and this was the result on Firefox:
<html><head></head><body>
<!--[CDATA[just some plain text]]--><hr>
<!--[CDATA[<!-- not a comment ---->]]><hr>
<!--[CDATA[&not-an-entity-ref;]]--><hr>
<!--[CDATA[<not a='start tag'-->]]><hr>
<!--[CDATA[]]--><hr>
<!--[CDATA[[[I have many brackets]]]]--><hr>
<!--[CDATA[I have a --> in the middle]]><hr>
<!--[CDATA[I have a ]] in the middle]]--><hr>
<!--[CDATA[] ]-->]]><hr>
<!--[CDATA[]] -->]]><hr>
<!--[CDATA[
if (a < b && a --> b) {
printf("[<marquee>How?</marquee>]");
}
]]><hr>
</body></html>

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that the content between <![CDATA[ and ]]> is interpreted as raw data.
This is context dependent.
HTMLParser is actually just a tokenizer. To determine the context automatically, it needs to support the stack of open elements and to know which elements are in the HTML namespace. This is all in the specification, and we will implement it in the future. But this is a different level of complexity. So I solved the issue by letting the user determine the context. The new method support_cdata() sets how HTMLParser will parse CDATA. This is not good, but perhaps better than the current state.
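To make the "just a tokenizer" point concrete, here is a small sketch that records which callbacks HTMLParser fires for the SVG example. The class name `EventRecorder` is illustrative. Which callback receives the CDATA construct depends on the Python version and on the CDATA setting discussed in this PR, so no expected output is shown.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collect (callback, payload) pairs in the order they are fired."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))


parser = EventRecorder()
parser.feed('<svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>')
parser.close()
for event in parser.events:
    print(event)
```

The parser reports the svg and text tags as plain start and end tags. Nothing in the event stream says that the content in between is foreign content, which is why the context has to come from somewhere else.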
Lib/html/parser.py
Outdated
    j = rawdata.find(']]>')
    if j < 0:
        return -1
    self.unknown_decl(rawdata[i+3: j])
- self.unknown_decl(rawdata[i+3: j])
+ self.unknown_decl(rawdata[i+3:j])
* Add HTMLParser.support_cdata().
I don't understand HTML well enough to judge whether this change is the right one.
I'm adding docs/changelog suggestions to clarify the behaviour, as I understand it.
Doc/library/html.parser.rst
Outdated
   If *flag* is false, then the :meth:`handle_comment` method will be called
   for ``<![CDATA[...>``.
This should mention the default behaviour.
-   If *flag* is false, then the :meth:`handle_comment` method will be called
-   for ``<![CDATA[...>``.
+   If *flag* is false, or if :meth:`!support_cdata` has not been called yet,
+   then the :meth:`handle_comment` method will be called for ``<![CDATA[...>``.
This is actually a weak point of this approach. It should be true by default so that valid HTML (where <![CDATA[...]]> is only used in foreign content) can be parsed by default. But secure parsing needs to set it to false at the beginning and then set it to true or false after every open or close tag, depending on a complex algorithm.
So we should set it to true by default if we keep this approach.
Doc/library/html.parser.rst
Outdated
   If *flag* is false, then the :meth:`handle_comment` method will be called
   for ``<![CDATA[...>``.

   .. versionadded:: 3.13.6
This should mention the previous behaviour.
-   .. versionadded:: 3.13.6
+   .. versionadded:: 3.13.6
+      Previously, :meth:`unknown_decl` was called for ``<![CDATA[...>``.
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
- Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
- the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
+ Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
+ the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
+ By default, :meth:`~HTMLParser.handle_comment` is called for CDATA.
+ The old behavior (calling :meth:`~HTMLParser.unknown_decl`) can be restored
+ using a new method, :meth:`~HTMLParser.support_cdata`.
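The first sentence of this entry can be illustrated with a minimal sketch. It assumes the earlier behaviour came from a whitespace-tolerant terminator pattern of the form `]\s*]\s*>`. The pattern constant and both helper names are illustrative and are not taken from this PR.

```python
import re

# Assumed shape of the old, whitespace-tolerant terminator (illustrative).
LENIENT_CLOSE = re.compile(r"]\s*]\s*>")


def lenient_cdata_end(rawdata, start=0):
    """Return the index just past a lenient terminator, or -1."""
    m = LENIENT_CLOSE.search(rawdata, start)
    return m.end() if m else -1


def strict_cdata_end(rawdata, start=0):
    """Return the index just past a literal "]]>", or -1."""
    j = rawdata.find("]]>", start)
    return j + 3 if j >= 0 else -1


sample = "<![CDATA[] ]>]]><hr>"
# Both searches start after the 9-character "<![CDATA[" opener.
print(sample[:lenient_cdata_end(sample, 9)])
print(sample[:strict_cdata_end(sample, 9)])
```

The lenient search stops at "] ]>", while the strict search only accepts the literal "]]>" that follows it.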
Lib/html/parser.py
Outdated
    def support_cdata(self, flag=True):
        self._support_cdata = flag
I'm not convinced this is the best way to handle the issue.
Since this solves a security issue, it also needs to be added to a bug fix release and backported to several other releases. Making the method private should be enough to solve the security issue in the short term without adding to the public API. This will also give us time to think about alternative (and possibly better) solutions.
Private means that users shouldn't call it. Given that nothing in html calls it, making it private is the same as not adding it at all.
It looks like the issue can't be solved in CPython alone -- users need to adapt their code?
This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. By providing a method to alter the behavior, we pass the ball to the user side. This is not good, but otherwise the problem will be left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless.
I will try another approach, but can't guarantee anything.
Here's an idea (though it might be naive): what if the parser called a […]
I haven't been able to implement automatic detection of this flag yet; the algorithm is too complex. I made the new setter method private. It is now hardly discoverable, but we need a lever that can be used in principle. Anyway, it will not be used in most user code.
This does not provide additional flexibility in comparison with using […]
Please review this PR, so we will have a chance to get it into today's releases.
Misc/NEWS.d/next/Security/2025-06-18-13-34-55.gh-issue-135661.NZlpWf.rst
Outdated
    def _set_support_cdata(self, flag=True):
        """Enable or disable support of the CDATA sections.
        If enabled, "<[CDATA[" starts a CDATA section which ends with "]]>".
        If disabled, "<[CDATA[" starts a bogus comments which ends with ">".

        This method is not called by default. Its purpose is to be called
        in custom handle_starttag() and handle_endtag() methods, with
        value that depends on the adjusted current node.
        See https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
        for details.
        """
        self._support_cdata = flag
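The usage described in the docstring might look roughly like the sketch below. It is heavily simplified: it treats everything inside svg or math as foreign content and ignores HTML integration points, so it is not the specification's "adjusted current node" algorithm. `ForeignAwareParser`, `FOREIGN_ROOTS`, `foreign_depth` and `_sync_cdata_support` are hypothetical names. The call to the private `_set_support_cdata()` is guarded because the method only exists on versions that include this change.

```python
from html.parser import HTMLParser

# Simplification: only these roots switch to foreign content.
FOREIGN_ROOTS = {"svg", "math"}


class ForeignAwareParser(HTMLParser):
    """Track whether the parser is inside svg/math and sync the CDATA flag."""

    def __init__(self):
        super().__init__()
        self.foreign_depth = 0
        self._sync_cdata_support()  # start in HTML content

    @property
    def in_foreign_content(self):
        return self.foreign_depth > 0

    def _sync_cdata_support(self):
        # Private API from this PR; absent on older Python versions.
        setter = getattr(self, "_set_support_cdata", None)
        if setter is not None:
            setter(self.in_foreign_content)

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.foreign_depth += 1
        self._sync_cdata_support()

    def handle_endtag(self, tag):
        if tag in FOREIGN_ROOTS and self.foreign_depth:
            self.foreign_depth -= 1
        self._sync_cdata_support()


parser = ForeignAwareParser()
parser.feed("<p><svg><text>")
print(parser.in_foreign_content)
parser.feed("</text></svg>")
print(parser.in_foreign_content)
```

A depth counter is used instead of a full stack of open elements to keep the sketch short. A real implementation would need the stack and the namespace rules mentioned earlier in the thread.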
Is there any reason to have a setter method, rather than changing the value of _support_cdata directly (other than having a place for the docstring 🙃)?
We can change it in future if we need more complex logic. Of course, in Python we can just add a property.
And yes, docstring.
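The property remark can be illustrated with a generic sketch: a plain attribute can later be turned into a property so that extra logic runs on assignment without changing callers. `Tokenizer`, `support_cdata` and the `changes` counter are hypothetical stand-ins, not part of HTMLParser.

```python
class Tokenizer:
    """Hypothetical stand-in, not HTMLParser itself."""

    def __init__(self):
        self._support_cdata = True
        self.changes = 0  # illustrative bookkeeping only

    @property
    def support_cdata(self):
        return self._support_cdata

    @support_cdata.setter
    def support_cdata(self, flag):
        # More complex logic can be added here later; callers still
        # write a plain assignment.
        self._support_cdata = bool(flag)
        self.changes += 1


tokenizer = Tokenizer()
tokenizer.support_cdata = False
print(tokenizer.support_cdata, tokenizer.changes)
```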
…NZlpWf.rst Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
…5665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc462119b9107b373c24d2bda5a1271bed36) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
…5665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
GH-137772 is a backport of this pull request to the 3.14 branch.
Sorry, @serhiy-storchaka, I could not cleanly backport this to
|
Sorry, @serhiy-storchaka, I could not cleanly backport this to
|
GH-137773 is a backport of this pull request to the 3.13 branch.
Sorry, @serhiy-storchaka, I could not cleanly backport this to
|
Sorry, @serhiy-storchaka, I could not cleanly backport this to
|
…onGH-135665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
GH-137774 is a backport of this pull request to the 3.12 branch.
…GH-137773) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
"] ]>" and "]] >" no longer end the CDATA section.