gh-135661: Fix CDATA section parsing in HTMLParser #135665

serhiy-storchaka · 2025-06-18T10:49:00Z

"] ]>" and "]] >" no longer end the CDATA section.

Issue: HTMLParser differences from the HTML5 specification #135661
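As an illustration of the description above, here is a minimal sketch that compares a whitespace-tolerant terminator search with an exact "]]>" search. The pattern LENIENT_CLOSE and both helper names are assumptions made for this example; they are not code taken from this PR.

```python
import re

# Assumed stand-in for a lenient matcher that tolerates whitespace
# between the closing brackets (hypothetical, for illustration only).
LENIENT_CLOSE = re.compile(r"\]\s*\]\s*>")


def lenient_end(rawdata, start=0):
    """Return the index just past a whitespace-tolerant terminator, or -1."""
    m = LENIENT_CLOSE.search(rawdata, start)
    return m.end() if m else -1


def strict_end(rawdata, start=0):
    """Return the index just past an exact ']]>' terminator, or -1."""
    j = rawdata.find("]]>", start)
    return j + 3 if j >= 0 else -1


text = "<![CDATA[] ]>]]><hr>"
body_start = len("<![CDATA[")
# The lenient search stops early, at "] ]>".
print(text[body_start:lenient_end(text, body_start)])
# The strict search runs on to the real "]]>".
print(text[body_start:strict_end(text, body_start)])
```

With the strict search, "] ]>" and "]] >" are treated as ordinary section content.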

serhiy-storchaka · 2025-07-04T06:20:47Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])
+            return j + 3


According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), it should be either data or a bogus comment (which ends with >, not ]]>), depending on the context. I may be misunderstanding the HTML5 standard, because this part is difficult to implement.

I tried copying the content of the tests in the following file:

<!DOCTYPE html> <html> <body> <![CDATA[just some plain text]]><hr> <![CDATA[]]><hr> <![CDATA[&not-an-entity-ref;]]><hr> <![CDATA[<not a='start tag'>]]><hr> <![CDATA[]]><hr> <![CDATA[[[I have many brackets]]]]><hr> <![CDATA[I have a > in the middle]]><hr> <![CDATA[I have a ]] in the middle]]><hr> <![CDATA[] ]>]]><hr> <![CDATA[]] >]]><hr> <![CDATA[ if (a < b && a > b) { printf("[<marquee>How?</marquee>]"); } ]]><hr> </body> </html>

and this was the result on Firefox:

<html><head></head><body> <hr> ]]><hr> <hr> ]]><hr> <hr> <hr>  in the middle]]><hr> <hr> ]]><hr> ]]><hr>  b) { printf("[<marquee>How?</marquee>]"); } ]]><hr> </body></html>

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that the content between <![CDATA[ and ]]> is interpreted as raw data.

This is context dependent.

HTMLParser is actually just a tokenizer. To determine the context automatically, it needs to support the stack of open elements and to know which elements are in the HTML namespace. This is all in the specification, and we will implement it in the future, but that is a different level of complexity. So I solved the issue by letting the user determine the context. The new method support_cdata() sets how HTMLParser will parse CDATA. This is not good, but perhaps better than the current state.
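The context dependence discussed above can be sketched as a standalone scanner. This is a simplified illustration, not the code in Lib/html/parser.py: the function name scan_cdata_like and its return shape are assumptions made for this example.

```python
def scan_cdata_like(rawdata, i, cdata_allowed):
    """Scan the construct that starts with '<![CDATA[' at index i.

    Returns (kind, content, next_index), or None if the terminator
    has not been seen yet.
    """
    assert rawdata.startswith("<![CDATA[", i)
    if cdata_allowed:
        # Foreign content (for example inside <svg>): a real CDATA
        # section, terminated only by the exact sequence "]]>".
        j = rawdata.find("]]>", i + 9)
        if j < 0:
            return None
        return ("cdata", rawdata[i + 9:j], j + 3)
    # HTML content: a bogus comment, terminated by the first ">".
    j = rawdata.find(">", i + 9)
    if j < 0:
        return None
    return ("bogus-comment", rawdata[i + 2:j], j + 1)


src = "<![CDATA[I have a > in the middle]]><hr>"
for allowed in (True, False):
    kind, content, nxt = scan_cdata_like(src, 0, allowed)
    print(kind, repr(content), repr(src[nxt:]))
```

In the bogus-comment case the remainder is " in the middle]]><hr>", which matches the text Firefox leaves in the body in the test above.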

ezio-melotti · 2025-07-05T00:06:41Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])


Suggested change

self.unknown_decl(rawdata[i+3: j])

self.unknown_decl(rawdata[i+3:j])

ezio-melotti · 2025-07-05T00:19:18Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])
+            return j + 3


I tried copying the content of the tests in the following file:

<!DOCTYPE html> <html> <body> <![CDATA[just some plain text]]><hr> <![CDATA[]]><hr> <![CDATA[&not-an-entity-ref;]]><hr> <![CDATA[<not a='start tag'>]]><hr> <![CDATA[]]><hr> <![CDATA[[[I have many brackets]]]]><hr> <![CDATA[I have a > in the middle]]><hr> <![CDATA[I have a ]] in the middle]]><hr> <![CDATA[] ]>]]><hr> <![CDATA[]] >]]><hr> <![CDATA[ if (a < b && a > b) { printf("[<marquee>How?</marquee>]"); } ]]><hr> </body> </html>

and this was the result on Firefox:

<html><head></head><body> <hr> ]]><hr> <hr> ]]><hr> <hr> <hr>  in the middle]]><hr> <hr> ]]><hr> ]]><hr>  b) { printf("[<marquee>How?</marquee>]"); } ]]><hr> </body></html>

* Add HTMLParser.support_cdata().

encukou

I don't understand HTML enough to judge if this change is the right one.
I'm adding docs/changelog suggestions to clarify the behaviour, as I understand it.

encukou · 2025-07-21T14:48:04Z

Doc/library/html.parser.rst

+   If *flag* is false, then the :meth:`handle_comment` method will be called
+   for ``<![CDATA[...>``.


This should mention the default behaviour.

Suggested change

If *flag* is false, then the :meth:`handle_comment` method will be called

for ``<![CDATA[...>``.

If *flag* is false, or if :meth:`!support_cdata` has not been called yet,

then the :meth:`handle_comment` method will be called for ``<![CDATA[...>``.

This is actually a weak point of this approach. The flag should be true by default, so that valid HTML (where <![CDATA[...]]> is only used in foreign content) can be parsed without extra setup. But secure parsing needs to set it to false at the beginning and then set it to true or false after every open or close tag, depending on the complex algorithm.

So we should set it to true by default if we keep this approach.
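A minimal sketch of the "secure parsing" pattern described above: start with CDATA support off, then update the flag after every start or end tag. The class name, the flag_log attribute and the depth counter are assumptions made for this example. The setter name _set_support_cdata is the private name used in the final revision of this PR, and the sketch only calls it if the running interpreter provides it. Treating only <svg> and <math> as foreign content is a deliberate simplification of the "adjusted current node" rules.

```python
from html.parser import HTMLParser

# Simplifying assumption: only <svg> and <math> open foreign content.
# The HTML5 rules also cover integration points such as
# <foreignObject>, which this sketch ignores.
FOREIGN_ROOTS = {"svg", "math"}


class ForeignAwareParser(HTMLParser):
    """Hypothetical subclass that toggles CDATA support per context."""

    def __init__(self):
        super().__init__()
        self.foreign_depth = 0
        self.flag_log = []   # every flag value chosen, for inspection
        self._apply_flag()   # start in HTML content: CDATA off

    def _apply_flag(self):
        flag = self.foreign_depth > 0
        self.flag_log.append(flag)
        # Private hook from this PR; absent on unpatched interpreters.
        setter = getattr(self, "_set_support_cdata", None)
        if setter is not None:
            setter(flag)

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.foreign_depth += 1
        self._apply_flag()

    def handle_endtag(self, tag):
        if tag in FOREIGN_ROOTS and self.foreign_depth:
            self.foreign_depth -= 1
        self._apply_flag()


parser = ForeignAwareParser()
parser.feed("<p><svg><text>x</text></svg></p>")
parser.close()
print(parser.flag_log)
```

The log shows the flag turning on at <svg> and off again at </svg>. A real implementation would need the full stack of open elements rather than a depth counter.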

encukou · 2025-07-21T15:15:20Z

Doc/library/html.parser.rst

+   If *flag* is false, then the :meth:`handle_comment` method will be called
+   for ``<![CDATA[...>``.
+
+   .. versionadded:: 3.13.6


This should mention the previous behaviour.

Suggested change

.. versionadded:: 3.13.6

.. versionadded:: 3.13.6

Previously, :meth:`unknown_decl` was called for ``<![CDATA[...>``.

encukou · 2025-07-21T16:51:50Z

Misc/NEWS.d/next/Security/2025-06-18-13-34-55.gh-issue-135661.NZlpWf.rst

+Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
+the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.


Suggested change

Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to

the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.

Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to

the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.

By default, :meth:`~HTMLParser.handle_comment` is called for CDATA.

The old behavior (calling :meth:`~HTMLParser.unknown_decl`) can be restored

using a new method, :meth:`~HTMLParser.support_cdata`.
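To check which callback actually receives a CDATA section on a given interpreter, a small probe like the following can help. The class name CDATAProbe is an assumption made for this example. Which handler fires depends on the Python version, the patch level and the default that was finally chosen, so no expected output is shown.

```python
from html.parser import HTMLParser


class CDATAProbe(HTMLParser):
    """Hypothetical probe that records which callback sees a CDATA section."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))

    def handle_data(self, data):
        self.events.append(("data", data))


probe = CDATAProbe()
probe.feed("<![CDATA[x]]>")
probe.close()
# The recorded events vary with the interpreter version and patch level.
print(probe.events)
```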

ezio-melotti · 2025-07-21T18:09:08Z

Lib/html/parser.py

+    def support_cdata(self, flag=True):
+        self._support_cdata = flag


I'm not convinced this is the best way to handle the issue.
Since this solves a security issue, it also needs to be added to a bug fix release and backported to several other releases. Making the method private should be enough to solve the security issue in the short term without adding to the public API. This will also give us time to think about alternative (and possibly better) solution.

Private means that users shouldn't call it. Given that nothing in html calls it, making it private is the same as not adding it at all.

It looks like the issue can't be solved in CPython alone -- users need to adapt their code?

This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. By providing a method to alter the behavior, we pass the ball to the user side. This is not good, but otherwise the problem will be left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless.

I will try another approach, but can't guarantee anything.

encukou · 2025-07-22T11:26:05Z

Here's an idea (though it might be naive): what if the parser called a handle_cdata_start method if it exists, and used its return value to steer the tokenization?

serhiy-storchaka · 2025-08-03T16:16:25Z

I haven't been able to implement automatic detection of this flag yet; the algorithm is too complex.

I made the new setter method private. It is now hardly discoverable, but we need a lever that can be used in principle. In any case, it will not be used in most user code.

Here's an idea (though it might be naive): what if the parser called a handle_cdata_start method if it exists, and used its return value to steer the tokenization?

This does not provide additional flexibility in comparison with using support_cdata() in handle_starttag() and handle_endtag(). The user needs to override handle_starttag() and handle_endtag() to maintain the stack of open elements.

serhiy-storchaka · 2025-08-14T12:03:12Z

Please review this PR, so we will have a chance to get it into today's releases.

Misc/NEWS.d/next/Security/2025-06-18-13-34-55.gh-issue-135661.NZlpWf.rst

ezio-melotti · 2025-08-14T15:01:00Z

Lib/html/parser.py

+    def _set_support_cdata(self, flag=True):
+        """Enable or disable support of the CDATA sections.
+        If enabled, "<[CDATA[" starts a CDATA section which ends with "]]>".
+        If disabled, "<[CDATA[" starts a bogus comments which ends with ">".
+
+        This method is not called by default. Its purpose is to be called
+        in custom handle_starttag() and handle_endtag() methods, with
+        value that depends on the adjusted current node.
+        See https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
+        for details.
+        """
+        self._support_cdata = flag


Is there any reason to have a setter method, rather than changing the value of _support_cdata directly (other than having a place for the docstring 🙃)?

We can change it in the future if we need more complex logic. Of course, in Python we can just add a property.

And yes, docstring.
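The trade-off mentioned above can be sketched with two stand-in classes. Both class names and the changes counter are hypothetical and exist only for this example; neither class is the real HTMLParser.

```python
class Tokenizer:
    """Hypothetical stand-in using a setter method, as in the hunk above."""

    def __init__(self):
        self._support_cdata = True

    def _set_support_cdata(self, flag=True):
        self._support_cdata = flag


class TokenizerWithProperty:
    """Same idea, but the attribute is upgraded to a property."""

    def __init__(self):
        self._flag = True
        self.changes = 0  # extra logic added later without touching callers

    @property
    def _support_cdata(self):
        return self._flag

    @_support_cdata.setter
    def _support_cdata(self, flag):
        self._flag = bool(flag)
        self.changes += 1


a = Tokenizer()
a._set_support_cdata(False)

b = TokenizerWithProperty()
b._support_cdata = False  # plain assignment now runs the setter
print(a._support_cdata, b._support_cdata, b.changes)
```

Either style leaves room for more complex logic later. The setter method additionally gives the behavior a place to be documented.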

…NZlpWf.rst Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

miss-islington-app · 2025-08-14T18:13:26Z

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11, 3.12, 3.13, 3.14.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

…5665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc462119b9107b373c24d2bda5a1271bed36) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

…5665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

bedevere-app · 2025-08-14T18:13:38Z

GH-137772 is a backport of this pull request to the 3.14 branch.

miss-islington-app · 2025-08-14T18:13:39Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0cbbfc462119b9107b373c24d2bda5a1271bed36 3.12

miss-islington-app · 2025-08-14T18:13:42Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.11 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0cbbfc462119b9107b373c24d2bda5a1271bed36 3.11

bedevere-app · 2025-08-14T18:13:44Z

GH-137773 is a backport of this pull request to the 3.13 branch.

miss-islington-app · 2025-08-14T18:13:45Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0cbbfc462119b9107b373c24d2bda5a1271bed36 3.10

miss-islington-app · 2025-08-14T18:13:48Z

Sorry, @serhiy-storchaka, I could not cleanly backport this to 3.9 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 0cbbfc462119b9107b373c24d2bda5a1271bed36 3.9

…onGH-135665) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

bedevere-app · 2025-08-14T18:38:08Z

GH-137774 is a backport of this pull request to the 3.12 branch.

…GH-137773) "] ]>" and "]] >" no longer end the CDATA section. Make CDATA section parsing context depending. Add private method HTMLParser._set_support_cdata() to change the context. If called with True, "<[CDATA[" starts a CDATA section which ends with "]]>". If called with False, "<[CDATA[" starts a bogus comments which ends with ">". (cherry picked from commit 0cbbfc4) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

pythongh-135661: Fix CDATA section parsing in HTMLParser

f7f9f56

"] ]>" and "]] >" no longer end the CDATA section.

serhiy-storchaka requested a review from ezio-melotti as a code owner June 18, 2025 10:49

serhiy-storchaka added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 18, 2025

bedevere-app bot added the awaiting core review label Jun 18, 2025

bedevere-app bot mentioned this pull request Jun 18, 2025

HTMLParser differences from the HTML5 specification #135661

Open

serhiy-storchaka added 4 commits July 3, 2025 18:16

Merge branch 'main' into htmlparser-cdata

816f34e

Move to Security.

cf918e3

Update 2025-06-18-13-34-55.gh-issue-135661.NZlpWf.rst

d346c10

Merge branch 'main' into htmlparser-cdata

9e1ae33

serhiy-storchaka added type-security A security issue needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes labels Jul 4, 2025

serhiy-storchaka commented Jul 4, 2025

ezio-melotti reviewed Jul 5, 2025

* Make CDATA section parsing context depending.

524cac5

* Add HTMLParser.support_cdata().

encukou reviewed Jul 21, 2025

ezio-melotti reviewed Jul 21, 2025

serhiy-storchaka added 2 commits July 22, 2025 17:32

Merge branch 'main' into htmlparser-cdata

2a1bb46

Enable support of CDATA sections by default.

8cdbc95

serhiy-storchaka added the DO-NOT-MERGE label Jul 22, 2025

serhiy-storchaka added 2 commits August 3, 2025 18:38

Merge branch 'main' into htmlparser-cdata

e4f13a8

Make the setter method private.

50fd4b3

serhiy-storchaka removed the DO-NOT-MERGE label Aug 3, 2025

Update NEWS.

165fd1e

ezio-melotti approved these changes Aug 14, 2025

bedevere-app bot added awaiting merge and removed awaiting core review labels Aug 14, 2025

Update Misc/NEWS.d/next/Security/2025-06-18-13-34-55.gh-issue-135661.…

a5f45b8

…NZlpWf.rst Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

serhiy-storchaka enabled auto-merge (squash) August 14, 2025 17:33

serhiy-storchaka merged commit 0cbbfc4 into python:main Aug 14, 2025
79 of 81 checks passed

bedevere-app bot removed the awaiting merge label Aug 14, 2025

bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Aug 14, 2025

bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Aug 14, 2025

serhiy-storchaka deleted the htmlparser-cdata branch August 14, 2025 18:15

bedevere-app bot removed the needs backport to 3.12 only security fixes label Aug 14, 2025

serhiy-storchaka removed needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes labels Aug 14, 2025
