Differentiate bogus and normal comments in HTMLParser

The HTML specs support three kinds of bogus comments:

* `<?...>`. `HTMLParser` calls `handle_pi()` for it.
* `<!...>`. `HTMLParser` used to call `unknown_decl()` for it, but now (after #9295) it calls `handle_comment()`.
* `</...>` if no ASCII letter follows `/`. `HTMLParser` calls `handle_comment()` for it.

And, of course, `handle_comment()` is called for normal comments such as `<!--...-->`.

It is now impossible to differentiate `<![if !(IE)]>` from `</[if !(IE)]>` and `<!--[if !(IE)]-->`. This may be important, even though they are the same comment from the point of view of the HTML specs.
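To see the ambiguity concretely, here is a minimal sketch that records which handler fires for each form. `EventRecorder` and `record()` are hypothetical helpers, not part of `html.parser`. The handler chosen for `<![if !(IE)]>` depends on the Python version, so no output is stated for it.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: record which handler is called and with what data."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def unknown_decl(self, data):
        self.events.append(("unknown_decl", data))


def record(markup):
    """Parse one snippet with a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


for markup in ("<?if !(IE)>", "<![if !(IE)]>", "</[if !(IE)]>", "<!--[if !(IE)]-->"):
    # The last two snippets produce identical events, so a subclass
    # cannot tell the bogus comment from the normal one.
    print(markup, record(markup))
```

The `</...>` and `<!--...-->` forms both arrive as `handle_comment("[if !(IE)]")`, which is the information loss described above.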

It was proposed in #70197 to add a new handler, `handle_bogus_comment()`, which calls `handle_comment()` by default, so that bogus comments can be told apart from normal comments. Besides the comment value, it would need additional information to distinguish the different kinds of bogus comments, for example the character preceding the comment value (`?`, `!` or `/`).

But since `handle_pi()` is already called for `<?...>` and `unknown_decl()` used to be called for `<!...>`, we can instead restore the use of `unknown_decl()` and add a new handler only for `</...>`.
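As a rough sketch of the first proposal, a subclass could route bogus comments through a separate hook. Everything here is an assumption for illustration: `handle_bogus_comment(data, prefix)` does not exist in `html.parser`, its signature is invented, and the sketch overrides the undocumented internal method `parse_bogus_comment()`, whose behaviour may differ between Python versions. Only the `</...>` case is exercised because it is the one that reaches this method consistently.

```python
from html.parser import HTMLParser


class BogusAwareParser(HTMLParser):
    """Sketch only: adds a hypothetical handle_bogus_comment() hook."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_bogus_comment(self, data, prefix):
        # Hypothetical hook. By default it falls back to handle_comment(),
        # so existing subclasses would keep their current behaviour.
        self.handle_comment(data)

    def parse_bogus_comment(self, i, report=1):
        # Assumption: overrides an undocumented internal method.
        # The comment ends at the first '>' after the two-character opener.
        rawdata = self.rawdata
        pos = rawdata.find(">", i + 2)
        if pos == -1:
            return -1
        if report:
            # rawdata[i + 1] is the character after '<' ('!' or '/').
            self.handle_bogus_comment(rawdata[i + 2:pos], rawdata[i + 1])
        return pos + 1


class TaggingParser(BogusAwareParser):
    """Example subclass that keeps bogus comments separate."""

    def handle_bogus_comment(self, data, prefix):
        self.events.append(("bogus", prefix, data))


def parse(parser_class, markup):
    parser = parser_class()
    parser.feed(markup)
    parser.close()
    return parser.events


print(parse(BogusAwareParser, "</[if !(IE)]>"))
print(parse(TaggingParser, "</[if !(IE)]>"))
print(parse(TaggingParser, "<!--[if !(IE)]-->"))
```

With the default hook the events are unchanged. Overriding it lets a subclass see the prefix character while normal comments still go to `handle_comment()`.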

The second way will partially revert #9295. The difference is that a bogus comment (unknown declaration) starting with `<![` will be terminated by the first `>` instead of `]>` or `]]>`, in accordance with the HTML specs.
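The termination rule can be illustrated with a small standalone function. `bogus_comment_end()` is a hypothetical helper written for this example, not code from `html.parser`.

```python
def bogus_comment_end(markup, start):
    """Return the index just past a bogus comment that opens at `start`.

    The comment ends at the first '>' after the two-character opener
    ('<!' or '</'). Returns -1 if the comment is unterminated.
    """
    pos = markup.find(">", start + 2)
    return -1 if pos == -1 else pos + 1


markup = "<![if a > b]>text"
end = bogus_comment_end(markup, 0)
# The value stops at the first '>', not at the later ']>'.
print(repr(markup[2:end - 1]))  # comment value
print(repr(markup[end:]))       # what the parser would see next
```

Here the comment value is `[if a ` and the remaining ` b]>text` is parsed as ordinary content. Under the older `]>` rule the value would have run up to the closing `]>`.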

The problem is that `unknown_decl()` is also called for a valid CDATA section (with the trailing `]]` omitted). According to the HTML specs, its content should be treated as normal text, so we could simply call `handle_data()` (as for resolved character references), but for flexibility we can call a special method.
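A minimal sketch of that idea, assuming `unknown_decl()` receives the section as `CDATA[...` with the trailing `]]` already removed, as described above. `handle_cdata()` is a hypothetical name for the special method. The handler is called directly here, with illustrative input strings, because whether the parser itself reports a CDATA section this way depends on the Python version.

```python
from html.parser import HTMLParser


class CDataAwareParser(HTMLParser):
    """Sketch only: split CDATA sections out of unknown_decl()."""

    CDATA_PREFIX = "CDATA["

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_cdata(self, data):
        # Hypothetical hook. By default the content is treated as normal text.
        self.handle_data(data)

    def unknown_decl(self, data):
        if data.startswith(self.CDATA_PREFIX):
            self.handle_cdata(data[len(self.CDATA_PREFIX):])
        else:
            self.events.append(("unknown_decl", data))


parser = CDataAwareParser()
# Illustrative inputs, passed directly to the handler.
parser.unknown_decl("CDATA[x < y")
parser.unknown_decl("if !(IE)")
print(parser.events)
```

A subclass that wants to keep CDATA sections distinct from text would override `handle_cdata()` and leave `handle_data()` alone.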

cc @ezio-melotti

Differentiate bogus and normal comments in HTMLParser #137877