Page MenuHomePhabricator

Some linebreak chars in <ref>'s name attribute (and possibly other attributes as well) not handled correctly by Parsoid's Cite implementation
Closed, ResolvedPublic

Description

Spinning off a new task from T263698#6492289

Parsoid's Cite impl doesn't seem to have the right regexp for handling whitespace chars in ref names

In this snippet below, even the non-visible line break chars should be replaced with a space char. Check out output in a wiki sandbox to see how the core Cite extension handles it.

test<ref name="a
b">asdf</ref>

blah<ref name="a
b" />

test<ref name="a
b">qwer</ref>

blah<ref name="a
b" />

<references />

Event Timeline

ssastry triaged this task as Medium priority.Nov 16 2020, 9:29 PM
ssastry moved this task from Needs Triage to Current & Upcoming Work on the Parsoid board.
ssastry renamed this task from Some linebreak chars in <ref> not handled correctly by Parsoid's Cite implementation to Some linebreak chars in <ref>'s name attribute (and possibly other attributes as well) not handled correctly by Parsoid's Cite implementation.Nov 16 2020, 9:31 PM

This is done in core as part of the general extension handling, by Sanitizer::decodeTagAttributes() (called from Parser::extensionSubstitution). The code in question is:

		foreach ( $pairs as $set ) {
			$attribute = strtolower( $set[1] );

			// Filter attribute names with unacceptable characters
			if ( !preg_match( self::getAttribNameRegex(), $attribute ) ) {
				continue;
			}

			$value = self::getTagAttributeCallback( $set );

			// Normalize whitespace
			$value = preg_replace( '/[\t\r\n ]+/', ' ', $value );
			$value = trim( $value );

			// Decode character references
			$attribs[$attribute] = self::decodeCharReferences( $value );
		}
		return $attribs;

I don't like this code in core as a general principle, because it makes it hard for extensions to access newlines/tabs/etc even if they want them; like with other stripping issues, the pragmatic solution in existing code is to have the end-user wrap the argument with a <nowiki>....</nowiki> tag and then fetch the unstripped text from the strip state.

Here, close examination reveals that you can sneak a tab/newline into the attribute value by HTML-entity-encoding it, since the entity decoding is done *after* the whitespace normalization. Eg:

<myExtension name="  whitespace  stripped  here  ">

gives the attribute value whitespace stripped here (no leading/trailing, all single spaces). If you wanted the literal value typed, you'd need to write:

<myExtension name="&#32; whitespace&#32; stripped&#32; here &#32;">

Is this wikitext behavior intuitive/discoverable/documented anywhere?

Similar to the recommendation in T204307: Parser Functions should support named parameters, I'd suggest we make whitespace stripping an opt-in/opt-out feature of extension registration, so that extensions which want access to the raw text can easily get it -- without resorting to <nowiki>/strip state workarounds.

to avoid further confusion, there is a unicode \u2028 between 'a' and 'b' in the second example

test<ref name="a
b">asdf</ref>

blah<ref name="a
b" />

test invisible unicode character between a and b<ref name="a
b">qwer</ref>

blah<ref name="a
b" />

<references />

So analyzing the example WT, with current Parsoid/Cite code the following behavior for the $refName ID text occurs;

$refName as provided by tokenizer seems to have already converted the 'a', LF (0x0A), 'b' to 'a' 'space' 'b'

But if the tokenizer is handed "a\u2028b" it passes that through unchanged

And the $extApi->sanitzeHTMLId passes that back unchanged... (line 182 References.php)

No other code in Cite attempts to process $refName using regex or other techniques.

It looks like code in Sanitizer.php (line 1302) escapeIdInternal() is insufficient to handle the case of other characters
than "\t", "\n", "\f", "\r", " " being replaced as Cite refName IDs. Not sure how we ought to determine which unicode characters that some fonts my not have a glyph for should be converted to an underbar.

Change 645207 had a related patch set uploaded (by Sbailey; owner: Sbailey):
[mediawiki/services/parsoid@master] WIP Parsoid Cite does not handle multibyte whitespace characters

https://gerrit.wikimedia.org/r/645207

Change 647787 had a related patch set uploaded (by Sbailey; owner: Sbailey):
[mediawiki/services/parsoid@master] WIP Fix for Parsoid Cite whitespace character in refname disambig

https://gerrit.wikimedia.org/r/647787

Change 645207 abandoned by Sbailey:
[mediawiki/services/parsoid@master] WIP Parsoid Cite does not handle multibyte whitespace characters

Reason:
Replaced this patch with a different approach, and fixes a bug at the same time. => 647787

https://gerrit.wikimedia.org/r/645207

Change 647787 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix for Parsoid Cite refname whitespace handling

https://gerrit.wikimedia.org/r/647787

Change 655118 had a related patch set uploaded (by Sbailey; owner: Sbailey):
[mediawiki/services/parsoid@master] WIP adding additional refname sanitizing

https://gerrit.wikimedia.org/r/655118

Change 655482 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a22

https://gerrit.wikimedia.org/r/655482

Change 655482 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a22

https://gerrit.wikimedia.org/r/655482

Change 655118 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Contract multiple underbars in a row in refnames to a single underbar

https://gerrit.wikimedia.org/r/655118

Change 658462 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a23

https://gerrit.wikimedia.org/r/658462

Change 658462 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a23

https://gerrit.wikimedia.org/r/658462