Differentiate bogus and normal comments in HTMLParser #137877

@serhiy-storchaka

The HTML specs define three kinds of bogus comments, distinguished by the character that follows the opening <: <!...>, <?...> and </...>.

And, of course, handle_comment() is called for normal comments <!--...--> and <!--...--!>. This includes the abnormal cases <!--> and <!--->, which are treated as the empty comment <!---->.

It is currently impossible to distinguish <![if !(IE)]> from </[if !(IE)]> and <!--[if !(IE)]-->. The distinction may matter to users, even though all three are the same comment from the point of view of the HTML specs.
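To see which handler fires for each form on a given Python version, a small recording subclass can help. This is only a sketch: the Recorder class and the record() helper are illustrative names, not part of html.parser. The result for <![if !(IE)]> differs between Python versions, so no expected output is shown.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Illustrative helper: records (handler name, value) for each callback."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(markup):
    """Feed markup to a fresh Recorder and return the recorded events."""
    parser = Recorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Print the events for the three forms discussed above.
for markup in ("<!--[if !(IE)]-->", "</[if !(IE)]>", "<![if !(IE)]>"):
    print(repr(markup), record(markup))
```

Whenever two of these inputs produce the same ("comment", "[if !(IE)]") event, the handler has no way of telling which syntax appeared in the source.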

It was proposed in #70197 to add a new handler, handle_bogus_comment(), which calls handle_comment() by default, so that bogus comments can be told apart from normal comments. Besides the comment value, it would need additional information to distinguish the different kinds of bogus comments, for example the character preceding the comment value (?, ! or /). But since handle_pi() is already called for <?...> and unknown_decl() used to be called for <!...>, we can just restore the use of unknown_decl() and add a new handler for </...>.
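The first proposal could look roughly like the sketch below. Everything in it is hypothetical: handle_bogus_comment() does not exist in html.parser, the prefix parameter and the class names are assumptions, and the default is assumed to delegate to handle_comment(). The stdlib parser never calls the hook, so the sketch invokes it by hand to show the dispatch only.

```python
from html.parser import HTMLParser


class BogusAwareParser(HTMLParser):
    """Hypothetical sketch; handle_bogus_comment() is NOT a real HTMLParser hook."""

    def handle_bogus_comment(self, data, prefix):
        # Assumed default: behave exactly like a normal comment, so
        # existing subclasses keep working unchanged.
        self.handle_comment(data)


class LegacyCollector(BogusAwareParser):
    """Existing-style subclass that only knows about handle_comment()."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_comment(self, data):
        self.seen.append(("comment", data))


class StrictCollector(LegacyCollector):
    """Subclass that opts in to telling bogus comments apart."""

    def handle_bogus_comment(self, data, prefix):
        # prefix is the character before the value: "?", "!" or "/".
        self.seen.append(("bogus", prefix, data))


# The real parser does not call the hypothetical hook, so call it manually.
legacy = LegacyCollector()
legacy.handle_bogus_comment("[if !(IE)]", "!")

strict = StrictCollector()
strict.handle_bogus_comment("[if !(IE)]", "!")

print(legacy.seen)
print(strict.seen)
```

With a delegating default, code that overrides only handle_comment() sees no change, while code that overrides the new hook gets the prefix as well.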

The second approach would partially revert #9295. The difference is that a bogus comment (unknown declaration) starting with <![ will be terminated by the first > instead of ]> or ]]>, in accordance with the HTML specs.

The problem is that unknown_decl() is also called for a valid CDATA section (with the trailing ]] omitted). According to the HTML specs, its content should be treated as normal text, so we could simply call handle_data() (as for resolved character references), but for flexibility we can call a special method.

cc @ezio-melotti

Labels: stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
