gh-135661: Fix CDATA section parsing in HTMLParser #135665

Open: wants to merge 11 commits into base: main
gh-135661: Fix CDATA section parsing in HTMLParser
"] ]>" and "]] >" no longer end the CDATA section.
serhiy-storchaka committed Jun 18, 2025
commit f7f9f562f1b31c2130e26269cf4f196f378d80f2
6 changes: 5 additions & 1 deletion Lib/html/parser.py
@@ -298,7 +298,11 @@ def parse_html_declaration(self, i):
             # this case is actually already handled in goahead()
             return self.parse_comment(i)
         elif rawdata[i:i+9] == '<![CDATA[':
-            return self.parse_marked_section(i)
+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])
Member

Suggested change
-            self.unknown_decl(rawdata[i+3: j])
+            self.unknown_decl(rawdata[i+3:j])

+            return j + 3
Member Author

According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), a CDATA section should be parsed either as data or as a bogus comment (which ends with >, not ]]>), depending on the context. I may be misunderstanding the HTML5 standard here, because this part is difficult to implement.

Member @ezio-melotti, Jul 5, 2025

I copied the content of the tests into the following file:

<!DOCTYPE html>
<html>
<body>
<![CDATA[just some plain text]]><hr>
<![CDATA[<!-- not a comment -->]]><hr>
<![CDATA[&not-an-entity-ref;]]><hr>
<![CDATA[<not a='start tag'>]]><hr>
<![CDATA[]]><hr>
<![CDATA[[[I have many brackets]]]]><hr>
<![CDATA[I have a > in the middle]]><hr>
<![CDATA[I have a ]] in the middle]]><hr>
<![CDATA[] ]>]]><hr>
<![CDATA[]] >]]><hr>
<![CDATA[
    if (a < b && a > b) {
        printf("[<marquee>How?</marquee>]");
    }
]]><hr>

</body>
</html>

and this was the result in Firefox:

<html><head></head><body>
<!--[CDATA[just some plain text]]--><hr>
<!--[CDATA[<!-- not a comment ---->]]&gt;<hr>
<!--[CDATA[&not-an-entity-ref;]]--><hr>
<!--[CDATA[<not a='start tag'-->]]&gt;<hr>
<!--[CDATA[]]--><hr>
<!--[CDATA[[[I have many brackets]]]]--><hr>
<!--[CDATA[I have a --> in the middle]]&gt;<hr>
<!--[CDATA[I have a ]] in the middle]]--><hr>
<!--[CDATA[] ]-->]]&gt;<hr>
<!--[CDATA[]] -->]]&gt;<hr>
<!--[CDATA[
    if (a < b && a --> b) {
        printf("[<marquee>How?</marquee>]");
    }
]]&gt;<hr>



</body></html>
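
The Firefox output above is consistent with the bogus comment rule: outside SVG or MathML content, the comment starts right after `<!` and ends at the first `>`, and whatever follows is parsed normally. A minimal sketch of that rule, assuming a hypothetical helper `parse_bogus_comment` (it is not part of `html.parser`) and the simplification that incomplete input yields `None` rather than a comment that runs to the end of the input:

```python
def parse_bogus_comment(rawdata, i):
    """Sketch of the bogus comment rule for text starting with '<!' at i.

    Returns (comment_text, next_index), or None if no '>' has been
    seen yet. Hypothetical helper, not part of html.parser.
    """
    if not rawdata.startswith('<!', i):
        raise ValueError("expected '<!' at index %d" % i)
    j = rawdata.find('>', i + 2)
    if j < 0:
        return None  # simplification: wait for more input
    # The comment text is everything between '<!' and the first '>'.
    return rawdata[i + 2:j], j + 1


html = '<![CDATA[I have a > in the middle]]><hr>'
comment, end = parse_bogus_comment(html, 0)
print(repr(comment))     # '[CDATA[I have a '
print(repr(html[end:]))  # ' in the middle]]><hr>'
```

The leftover ` in the middle]]>` is then ordinary text, which is why it shows up after the comment with `&gt;` escaped.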

Member Author

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that the content between <![CDATA[ and ]]> is interpreted as raw data.

This is context dependent.

HTMLParser is actually just a tokenizer. To determine the context automatically, it would need to maintain the stack of open elements and to know which elements are in the HTML namespace. This is all in the specification, and we will implement it in the future, but that is a different level of complexity. So I solved the issue by letting the user determine the context: the new method support_cdata() sets how HTMLParser will parse CDATA. This is not good, but perhaps better than the current state.
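
To see which interpretation a given Python version picks, one can record the events the tokenizer emits. This is a minimal sketch: the `EventRecorder` class and the `record_events` function are hypothetical names, while `handle_starttag`, `handle_endtag`, `handle_data`, `handle_comment` and `unknown_decl` are the documented `HTMLParser` hooks. No expected output is shown, because it depends on the Python version and on how CDATA support is configured.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every event the tokenizer emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('endtag', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def unknown_decl(self, data):
        self.events.append(('unknown decl', data))


def record_events(html):
    """Feed a complete document and return the recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


events = record_events(
    '<svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>')
for event in events:
    print(event)
```

The recorder receives a flat stream of events and nothing in it tracks whether the parser is currently inside `<svg>`, which is the gap the comment above describes.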

         elif rawdata[i:i+9].lower() == '<!doctype':
             # find the closing >
             gtpos = rawdata.find('>', i+9)
42 changes: 21 additions & 21 deletions Lib/test/test_htmlparser.py
@@ -686,27 +686,27 @@ def test_broken_condcoms(self):
         ]
         self._run_check(html, expected)

-    def test_cdata_declarations(self):
-        # More tests should be added. See also "8.2.4.42. Markup
-        # declaration open state", "8.2.4.69. CDATA section state",
-        # and issue 32876
-        html = ('<![CDATA[just some plain text]]>')
-        expected = [('unknown decl', 'CDATA[just some plain text')]
-        self._run_check(html, expected)
-
-    def test_cdata_declarations_multiline(self):
-        html = ('<code><![CDATA['
-                ' if (a < b && a > b) {'
-                ' printf("[<marquee>How?</marquee>]");'
-                ' }'
-                ']]></code>')
-        expected = [
-            ('starttag', 'code', []),
-            ('unknown decl',
-             'CDATA[ if (a < b && a > b) { '
-             'printf("[<marquee>How?</marquee>]"); }'),
-            ('endtag', 'code')
-        ]
+    @support.subTests('content', [
+        'just some plain text',
+        '<!-- not a comment -->',
+        '&not-an-entity-ref;',
+        "<not a='start tag'>",
+        '',
+        '[[I have many brackets]]',
+        'I have a > in the middle',
+        'I have a ]] in the middle',
+        '] ]>',
+        ']] >',
+        ('\n'
+         ' if (a < b && a > b) {\n'
+         ' printf("[<marquee>How?</marquee>]");\n'
+         ' }\n'),
+    ])
+    def test_cdata_section(self, content):
+        # See "13.2.5.42 Markup declaration open state",
+        # "13.2.5.69 CDATA section state", and issue bpo-32876.
+        html = f'<![CDATA[{content}]]>'
+        expected = [('unknown decl', 'CDATA[' + content)]
         self._run_check(html, expected)

     def test_convert_charrefs_dropped_text(self):
@@ -0,0 +1,2 @@
Fix CDATA section parsing in :class:`html.parser.HTMLParser`: ``] ]>`` and
``]] >`` no longer end the CDATA section.
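
The difference this entry describes can be illustrated with a small standalone sketch. It does not call `HTMLParser`. The helper `cdata_content` and its `strict` flag are hypothetical, and the lenient pattern is an assumption that models the earlier behaviour as a terminator allowing whitespace between `]`, `]` and `>`.

```python
import re

# Assumption: the earlier, lenient behaviour is modelled with a pattern
# that tolerates whitespace inside the terminator.
LENIENT_CLOSE = re.compile(r']\s*]\s*>')
CDATA_OPEN = '<![CDATA['


def cdata_content(rawdata, i, strict=True):
    """Return the text of the CDATA section starting at index i, or None
    if no terminator has been seen yet. Hypothetical helper."""
    if not rawdata.startswith(CDATA_OPEN, i):
        raise ValueError('no CDATA section at index %d' % i)
    start = i + len(CDATA_OPEN)
    if strict:
        # Only the exact three-character sequence ends the section.
        j = rawdata.find(']]>', start)
    else:
        match = LENIENT_CLOSE.search(rawdata, start)
        j = match.start() if match else -1
    if j < 0:
        return None
    return rawdata[start:j]


for html in ('<![CDATA[] ]>]]>', '<![CDATA[]] >]]>'):
    print(repr(cdata_content(html, 0, strict=False)),
          repr(cdata_content(html, 0, strict=True)))
# prints:
# '' '] ]>'
# '' ']] >'
```

With the lenient pattern both sections end immediately and come out empty. With the exact `]]>` search, `] ]>` and `]] >` stay inside the section as content.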