Skip to content

gh-135661: Fix CDATA section parsing in HTMLParser #135665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open
Prev Previous commit
Next Next commit
* Make CDATA section parsing context depending.
* Add HTMLParser.support_cdata().
  • Loading branch information
serhiy-storchaka committed Jul 5, 2025
commit 524cac599dc5554650e6f1a8c81d808fa8ef54d6
11 changes: 11 additions & 0 deletions Doc/library/html.parser.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,17 @@ The output will then be:
attributes can be preserved, etc.).


.. method:: HTMLParser.support_cdata(flag)

Sets how the parser will parse CDATA declarations.
If *flag* is true, then the :meth:`unknown_decl` method will be called
for the CDATA section ``<![CDATA[...]]>``.
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the default behaviour.

Suggested change
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
If *flag* is false, or if :meth:`!support_cdata` has not been called yet,
then the :meth:`handle_comment` method will be called for ``<![CDATA[...>``.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a weak point of such approach. It should be true by default to be able to parse valid HTML (when <![CDATA[...]]> is only used in foreign content) by default. But secure parsing needs to set it to false at the beginning and the set to true or false after every open or close tag, depending on the complex algorithm.

So we should set it to true by default if we keep this approach.


.. versionadded:: 3.13.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the previous behaviour.

Suggested change
.. versionadded:: 3.13.6
.. versionadded:: 3.13.6
Previously, :meth:`unknown_decl` was called for ``<![CDATA[...>``.



The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
Expand Down
22 changes: 16 additions & 6 deletions Lib/html/parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,7 @@ def reset(self):
self.lasttag = '???'
self.interesting = interesting_normal
self.cdata_elem = None
self._support_cdata = False
super().reset()

def feed(self, data):
Expand Down Expand Up @@ -174,6 +175,9 @@ def clear_cdata_mode(self):
self.interesting = interesting_normal
self.cdata_elem = None

def support_cdata(self, flag=True):
self._support_cdata = flag
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is the best way to handle the issue.
Since this solves a security issue, it also needs to be added to a bug fix release and backported to several other releases. Making the method private should be enough to solve the security issue in the short term without adding to the public API. This will also give us time to think about alternative (and possibly better) solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private means that users shouldn't call it. Given that nothing in html calls it, making it private is the same as not adding it at all.

It looks like the issue can't be solved in CPython alone -- users need to adapt their code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. Providing a method to alter behavior we pass the ball to user side. This is not good, but otherwise the problem will left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless.

I will try other approach, but can't guarantee anything.


# Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
Expand Down Expand Up @@ -249,7 +253,10 @@ def goahead(self, end):
break
self.handle_comment(rawdata[i+4:j])
elif startswith("<![CDATA[", i):
self.unknown_decl(rawdata[i+3:])
if self._support_cdata:
self.unknown_decl(rawdata[i+3:])
else:
self.handle_comment(rawdata[i+1:])
elif rawdata[i:i+9].lower() == '<!doctype':
self.handle_decl(rawdata[i+2:])
elif startswith("<!", i):
Expand Down Expand Up @@ -325,11 +332,14 @@ def parse_html_declaration(self, i):
# this case is actually already handled in goahead()
return self.parse_comment(i)
elif rawdata[i:i+9] == '<![CDATA[':
j = rawdata.find(']]>')
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
return j + 3
if self._support_cdata:
j = rawdata.find(']]>', i+9)
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
return j + 3
else:
return self.parse_bogus_comment(i)
elif rawdata[i:i+9].lower() == '<!doctype':
# find the closing >
gtpos = rawdata.find('>', i+9)
Expand Down
54 changes: 47 additions & 7 deletions Lib/test/test_htmlparser.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,16 @@ def get_events(self):

def handle_starttag(self, tag, attrs):
self.append(("starttag", tag, attrs))
if tag == 'svg':
self.support_cdata(True)

def handle_startendtag(self, tag, attrs):
self.append(("startendtag", tag, attrs))

def handle_endtag(self, tag):
self.append(("endtag", tag))
if tag == 'svg':
self.support_cdata(False)

# all other markup

Expand Down Expand Up @@ -643,10 +647,22 @@ def test_eof_in_declarations(self):
('<!', [('comment', '')]),
('<!-', [('comment', '-')]),
('<![', [('comment', '[')]),
('<![CDATA[', [('unknown decl', 'CDATA[')]),
('<![CDATA[x', [('unknown decl', 'CDATA[x')]),
('<![CDATA[x]', [('unknown decl', 'CDATA[x]')]),
('<![CDATA[x]]', [('unknown decl', 'CDATA[x]]')]),
('<![CDATA[', [('comment', '![CDATA[')]),
('<![CDATA[x', [('comment', '![CDATA[x')]),
('<![CDATA[x]', [('comment', '![CDATA[x]')]),
('<![CDATA[x]]', [('comment', '![CDATA[x]]')]),
('<svg><text y="100"><![CDATA[',
[('starttag', 'svg', []), ('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[')]),
('<svg><text y="100"><![CDATA[x',
[('starttag', 'svg', []), ('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[x')]),
('<svg><text y="100"><![CDATA[x]',
[('starttag', 'svg', []), ('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[x]')]),
('<svg><text y="100"><![CDATA[x]]',
[('starttag', 'svg', []), ('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[x]]')]),
('<!DOCTYPE', [('decl', 'DOCTYPE')]),
('<!DOCTYPE ', [('decl', 'DOCTYPE ')]),
('<!DOCTYPE html', [('decl', 'DOCTYPE html')]),
Expand Down Expand Up @@ -737,11 +753,35 @@ def test_broken_condcoms(self):
' printf("[<marquee>How?</marquee>]");\n'
' }\n'),
])
def test_cdata_section(self, content):
def test_cdata_section_content(self, content):
# See "13.2.5.42 Markup declaration open state",
# "13.2.5.69 CDATA section state", and issue bpo-32876.
html = f'<![CDATA[{content}]]>'
expected = [('unknown decl', 'CDATA[' + content)]
html = f'<svg><text y="100"><![CDATA[{content}]]></text></svg>'
expected = [
('starttag', 'svg', []),
('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[' + content),
('endtag', 'text'),
('endtag', 'svg'),
]
self._run_check(html, expected)

def test_cdata_section(self):
# See "13.2.5.42 Markup declaration open state".
html = ('<![CDATA[foo<br>bar]]>'
'<svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>'
'<![CDATA[foo<br>bar]]>')
expected = [
('comment', '[CDATA[foo<br'),
('data', 'bar]]>'),
('starttag', 'svg', []),
('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[foo<br>bar'),
('endtag', 'text'),
('endtag', 'svg'),
('comment', '[CDATA[foo<br'),
('data', 'bar]]>'),
]
self._run_check(html, expected)

def test_convert_charrefs_dropped_text(self):
Expand Down
Loading