gh-69426: only unescape properly terminated character entities in attribute values #95215

gh-69426: only unescape properly terminated character entities in attribute values
sissbruecker committed Jul 24, 2022
commit 71a89f98c31e2f1285221568de73fbc1e09ad84d
23 changes: 22 additions & 1 deletion Lib/html/parser.py
@@ -12,6 +12,7 @@
 import _markupbase

 from html import unescape
+from html.entities import html5 as html5_entities


 __all__ = ['HTMLParser']
@@ -57,6 +58,26 @@
 # </ and the tag name, so maybe this should be fixed
 endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

+# Character reference processing logic specific to attribute values
+# See: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
+attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')
Contributor Author

This partially duplicates an existing Regex, but I was not able to reuse the existing one for this purpose.

Member

Since the issue only seems to affect named character references, is there a reason to include numeric charrefs too in this regex?

Contributor Author

Yes, the new _unescape_attrvalue is effectively a wrapper for html.unescape that only delegates to html.unescape if the attribute-specific conditions are met. Since we still want to unescape numeric and hex char refs in attributes, we need to include them in the regex.
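
To make the point about numeric references concrete, here is a small sketch of what the pattern matches. The regular expression is copied from the patch; the sample string is made up purely for illustration.

```python
import re

# Pattern copied from the patch: decimal, hex, and named references,
# each optionally followed by ';' or '='.
attr_charref = re.compile(
    r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

# Hypothetical attribute value, chosen only to exercise each alternative.
sample = 'a=&#162; b=&#xa2 c=&cent; d&cent=e &center'

matches = [m.group(0) for m in attr_charref.finditer(sample)]
print(matches)
# ['&#162;', '&#xa2', '&cent;', '&cent=', '&center']
```

Because the trailing `;` or `=` is part of the match, the replacement callback can tell a terminated reference from one that is followed by an equals sign.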

Member

It would be better to move this immediately after the definition of entityref and charref. If we change one regexp, we will not forget to change the other.

Contributor Author

Done


+def replace_attr_charref(match):
+    ref = match.group(0)
+    # Numeric / hex char refs must always be unescaped
+    if ref[1] == '#':
Member

Suggested change
-    if ref[1] == '#':
+    if ref.startswith('&#'):

I think this is clearer.

Contributor Author

Done

+        return unescape(ref)
+    # Named character / entity references must only be unescaped
+    # if they are an exact match, and they are not followed by an equals sign
+    terminates_with_equals = ref[-1:] == '='
Member

Suggested change
-    terminates_with_equals = ref[-1:] == '='
+    terminates_with_equals = ref.endswith('=')

Contributor Author

Done
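
As a quick sanity check on the two suggestions, the sketch below compares the index/slice forms with `str.startswith` and `str.endswith`. The list of references is hypothetical; it only assumes that every match begins with `&`, which the pattern in the patch guarantees.

```python
# Hypothetical matched references, shaped like what the patch's regex returns.
refs = ['&#162;', '&#xa2', '&cent;', '&cent=', '&center']

for ref in refs:
    # Numeric check: index-based form versus the suggested prefix test.
    assert (ref[1] == '#') == ref.startswith('&#')
    # Trailing '=' check: slice-based form versus the suggested suffix test.
    assert (ref[-1:] == '=') == ref.endswith('=')

numeric = [ref for ref in refs if ref.startswith('&#')]
followed_by_equals = [ref for ref in refs if ref.endswith('=')]
print(numeric, followed_by_equals)
```

For these inputs the two spellings agree, so the suggestions change readability only, not behavior.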

+    exact_match = ref.lstrip('&').rstrip('=') in html5_entities
+    if exact_match and not terminates_with_equals:
+        return unescape(ref)
+    # Otherwise do not unescape
+    return ref
+
+def unescape_attrvalue(s):
+    return attr_charref.sub(replace_attr_charref, s)
Member

Both functions should be private, and their name prefixed by an _.

Contributor Author

Done



 class HTMLParser(_markupbase.ParserBase):
@@ -322,7 +343,7 @@ def parse_starttag(self, i):
                  attrvalue[:1] == '"' == attrvalue[-1:]:
                 attrvalue = attrvalue[1:-1]
             if attrvalue:
-                attrvalue = unescape(attrvalue)
+                attrvalue = unescape_attrvalue(attrvalue)
             attrs.append((attrname.lower(), attrvalue))
             k = m.end()

52 changes: 47 additions & 5 deletions Lib/test/test_htmlparser.py
@@ -347,17 +347,17 @@ def test_convert_charrefs(self):
         self.assertTrue(collector().convert_charrefs)
         charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
         # check charrefs in the middle of the text/attributes
-        expected = [('starttag', 'a', [('href', 'foo"zar')]),
+        expected = [('starttag', 'a', [('href', 'foo " zar')]),
                     ('data', 'a"z'), ('endtag', 'a')]
         for charref in charrefs:
-            self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref),
+            self._run_check('<a href="foo {0} zar">a{0}z</a>'.format(charref),
                             expected, collector=collector())
-        # check charrefs at the beginning/end of the text/attributes
+        # check charrefs at the beginning/end of the text
         expected = [('data', '"'),
-                    ('starttag', 'a', [('x', '"'), ('y', '"X'), ('z', 'X"')]),
+                    ('starttag', 'a', []),
                     ('data', '"'), ('endtag', 'a'), ('data', '"')]
         for charref in charrefs:
-            self._run_check('{0}<a x="{0}" y="{0}X" z="X{0}">'
+            self._run_check('{0}<a>'
Contributor Author

Changed the existing tests to remove flawed assumptions about how the unescaping in attribute values should work.

Member

It might be better to remove all attribute-related checks from this test and move them to the next one.

Contributor Author

Done

                             '{0}</a>{0}'.format(charref),
                             expected, collector=collector())
         # check charrefs in <script>/<style> elements
@@ -380,6 +380,48 @@ def test_convert_charrefs(self):
         self._run_check('no charrefs here', [('data', 'no charrefs here')],
                         collector=collector())

+    def test_convert_charrefs_in_attribute_values(self):
+        # default value for convert_charrefs is now True
+        collector = lambda: EventCollectorCharrefs()
+        self.assertTrue(collector().convert_charrefs)
+
+        # do unescape numeric and hex char refs
+        expected = [('starttag', 'a',
+                     [('href', 'https://example.com?foo¢=bar¢&baz¢=bla¢')]),
+                    ('endtag', 'a')]
+        self._run_check('<a href="https://example.com?foo&#xa2;=bar&#xa2&baz&#162;=bla&#162"></a>', expected, collector=collector())
+
+        # do unescape entity matches not followed by ASCII alphanumeric
+        expected = [('starttag', 'a',
+                     [('href', 'https://example.com?foo¢¢ ¢+¢')]),
+                    ('endtag', 'a')]
+        self._run_check('<a href="https://example.com?foo&cent;&cent &cent+&cent"></a>', expected, collector=collector())
+
+        # do not unescape entity matches followed by ASCII alphanumeric
+        expected = [('starttag', 'a',
+                     [('href', 'https://example.com?foo&center&cent123')]),
+                    ('endtag', 'a')]
+        self._run_check('<a href="https://example.com?foo&center&cent123"></a>', expected, collector=collector())
+
+        # do not unescape entity matches followed by equals
+        expected = [('starttag', 'a',
+                     [('href', 'https://example.com?foo&cent=123')]),
+                    ('endtag', 'a')]
+        self._run_check('<a href="https://example.com?foo&cent=123"></a>', expected, collector=collector())
+
+        # do unescape terminated entity matches followed by equals
+        expected = [('starttag', 'a',
+                     [('href', 'https://example.com?foo¢=123')]),
+                    ('endtag', 'a')]
+        self._run_check('<a href="https://example.com?foo&cent;=123"></a>', expected, collector=collector())
Member

If possible, it would be better to match the style of the previous test, creating different lists of charrefs (e.g. valid, invalid, named, numeric, etc.) and add them in different places in the attribute (beginning, end, before an alnum/space/semicolon/equal).

Also try to keep the lines shorter than 80 chars (you can remove the initial part of the URLs, since they are not necessary).

Contributor Author

Thanks. Looking at it again, combining multiple cases in a single attribute is indeed hard to read. I restructured the test to have two scenarios:

  • terminated entity, numeric and hex char refs
  • unterminated entity char refs

Both include cases for start, middle, end, as well as followed by alphanumeric, non-alphanumeric and equals sign. I hope it's a bit clearer now.

Also updated formatting to respect the 80 char limit.
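
The restructured test itself is not shown on this page. The following is only a sketch of the table-driven style the reviewer describes, using a hypothetical `AttrCollector` helper instead of the test suite's own collector classes. It is limited to terminated named references and numeric/hex references, because the handling of unterminated named references followed by an alphanumeric or `=` depends on whether this fix is present in the running Python.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# Terminated named ref plus numeric and hex refs (with and without ';').
charrefs = ['&quot;', '&#34;', '&#x22;', '&#34', '&#x22']
# (template, expected value) pairs: alone, start, end, middle, before '='.
cases = [
    ('{0}', '"'),
    ('{0}-X', '"-X'),
    ('X-{0}', 'X-"'),
    ('X {0} Y', 'X " Y'),
    ('X{0}=Y', 'X"=Y'),
]

failures = []
for charref in charrefs:
    for template, expected in cases:
        markup = '<a v="{}">'.format(template.format(charref))
        actual = parse_attrs(markup)
        if actual != [('v', expected)]:
            failures.append((markup, actual))

print(failures or 'all cases passed')
```

Keeping the references and the positions in separate lists means a new position or a new kind of reference is a one-line addition rather than another hand-written URL.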


+        # do unescape char refs at beginning and end of text attributes
+        charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
+        expected = [('starttag', 'a', [('x', '"'), ('y', '"-X'), ('z', 'X-"')]), ('endtag', 'a')]
+        for charref in charrefs:
+            self._run_check('<a x="{0}" y="{0}-X" z="X-{0}"></a>'.format(charref),
+                            expected, collector=collector())
Contributor Author

Extracted this from test_convert_charrefs


     # the remaining tests were for the "tolerant" parser (which is now
     # the default), and check various kind of broken markup
     def test_tolerant_parsing(self):
@@ -0,0 +1,2 @@
Fix :class:`HTMLParser` to not unescape character entities in attribute
values if they are followed by an ASCII alphanumeric or an equals sign.