
AWB can't handle categories, inserted with end RTL marks
Closed, Resolved (Public)

Description

If the wikicode contains a category written as [[Category:Categoryname ]] (with a space after the category name), AWB can't process this category (e.g. rename it). This way of inserting categories is relatively common, so AWB should handle it correctly (\s* needs to be added to the regexes).
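To illustrate the kind of change being requested, here is a minimal Python sketch of a category rename that tolerates optional whitespace around the category name. It is not AWB's actual code (AWB is written in C#); the function name `rename_category` and the exact pattern are assumptions made for illustration only.

```python
import re

def rename_category(wikitext, old, new):
    """Rename [[Category:old]] to [[Category:new]], tolerating extra spaces.

    Hypothetical helper, not AWB's implementation. An optional sort key
    (the "|key" part) is preserved.
    """
    pattern = re.compile(
        r"\[\[\s*Category\s*:\s*"   # opening brackets and namespace
        + re.escape(old)            # the literal old category name
        + r"\s*(\|[^\]]*)?\]\]"     # trailing whitespace, optional sort key
    )
    # A function replacement avoids backslash handling in the new name.
    return pattern.sub(
        lambda m: "[[Category:" + new + (m.group(1) or "") + "]]",
        wikitext,
    )
```

With this pattern, `rename_category("[[Category:Foo ]]", "Foo", "Bar")` yields `[[Category:Bar]]`, so the trailing space no longer blocks the match and is dropped in the output.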

Event Timeline

Removing such trailing whitespace should probably be added as a minor genfix, too.

Rjwilmsi changed the task status from Open to Stalled. Mar 20 2016, 11:35 AM
Rjwilmsi subscribed.

@MaxBioHazard I have used the Replace category feature on the More tab and this is working for me to replace (rename) categories that are in the wikitext with trailing whitespace. Please provide details of an example that isn't working for you.

I can't find such pages yet, but there are pages without this whitespace that AWB also can't handle. On [[ru:Гиппиус, Василий Васильевич]], the category replacement "Выпускники историко-филологического факультета Санкт-Петербургского государственного университета" -> "Выпускники историко-филологического факультета Санкт-Петербургского университета" doesn't work. There are 5 such pages, but several dozen other pages were processed with this replacement.

Looking at the example of [[ru:Гиппиус, Василий Васильевич]], the "space" there is not a space; it is the left-to-right mark, Unicode U+200E.
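For anyone wanting to confirm this kind of diagnosis on other pages, the following Python sketch lists invisible format characters (Unicode general category Cf, which includes U+200E and U+200F) found inside category links. The function name and the namespace list in the pattern are assumptions for illustration. In Python, `\s` does not match U+200E, which is why a whitespace-tolerant regex alone would not catch it there.

```python
import re
import unicodedata

# Assumed namespace names: the English one plus the Russian alias.
CATEGORY_LINK = re.compile(r"\[\[\s*(?:Category|Категория)\s*:([^\]|]*)")

def invisible_chars_in_categories(wikitext):
    """Return (code point, Unicode name) for each format character
    found in a category name. Hypothetical diagnostic helper."""
    found = []
    for match in CATEGORY_LINK.finditer(wikitext):
        for ch in match.group(1):
            # "Cf" is the Unicode "format" category (LRM, RLM, ZWJ, ...).
            if unicodedata.category(ch) == "Cf":
                found.append(("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return found
```

Running it on a link such as `"[[Категория:Пример\u200e]]"` reports the left-to-right mark, while a link with an ordinary trailing space reports nothing.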

The example at https://en.wikipedia.org/wiki/Left-to-right_mark suggests to me that on certain wikis there could be legitimate use of this character, so I'm not sure we can just always remove it from categories. If people think AWB should do this please link to a discussion confirming no expected problems for all wikipedia wikis.

Alternatively perhaps this is something CHECKWIKI could check/fix specifically on a per-wiki basis.

I would argue that even in the case of legitimate uses, invisible control characters such as LRM and RLM should always be used via templates so the uses can be tracked and it's immediately obvious to editors that there's something there (and, with an appropriate template name, exactly what that something is). Furthermore, I would push for the character entities being used instead of the actual characters, and for AWB to do one of throwing an error when it comes across one of the characters, converting the character to an entity, or simply removing it.
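The three behaviours suggested above (raise an error, convert to an entity, or remove) could be sketched roughly as follows in Python. This is only an illustration of the options, not a proposal for AWB's real code; the function name and the `mode` parameter are hypothetical. `&lrm;` and `&rlm;` are the standard HTML entities for U+200E and U+200F.

```python
# Mapping from the raw control characters to their HTML entities.
MARKS = {"\u200e": "&lrm;", "\u200f": "&rlm;"}

def handle_direction_marks(text, mode="entity"):
    """Apply one of three policies to LRM/RLM characters in text.

    mode is a hypothetical switch: "error", "entity" or "remove".
    """
    if mode == "error":
        for ch in MARKS:
            if ch in text:
                raise ValueError("found U+%04X in text" % ord(ch))
        return text
    if mode == "entity":
        for ch, entity in MARKS.items():
            text = text.replace(ch, entity)
        return text
    if mode == "remove":
        for ch in MARKS:
            text = text.replace(ch, "")
        return text
    raise ValueError("unknown mode: %r" % (mode,))
```

Whether an entity is acceptable inside a link target, as opposed to running text, would still need checking against MediaWiki's behaviour.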

Of course, I'm not a participant on any wikis that do have a legitimate use case for these characters, and I'm not an expert on them otherwise, so there could be some reason this wouldn't work in practice that I'm unaware of.

AWB could remove this character (and the right-to-left mark too) from categories on wikis that don't use RTL scripts (almost all wikis except Hebrew, Arabic, Persian and so on).
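A per-wiki version of that idea might look like the Python sketch below: direction marks are stripped only inside category links, and only when the wiki's language code is not in a list of RTL wikis. The `RTL_WIKIS` set is an assumed, incomplete example, and the function and parameter names are hypothetical.

```python
import re

# Assumption: an illustrative, incomplete set of RTL wiki language codes.
RTL_WIKIS = {"he", "ar", "fa", "ur", "yi"}

# U+200E (left-to-right mark) and U+200F (right-to-left mark).
MARKS = re.compile("[\u200e\u200f]")

def strip_marks_from_categories(wikitext, wiki_lang, namespaces=("Category",)):
    """Remove LRM/RLM from category links unless the wiki is RTL."""
    if wiki_lang in RTL_WIKIS:
        return wikitext  # leave RTL wikis untouched
    names = "|".join(re.escape(n) for n in namespaces)
    link = re.compile(r"\[\[\s*(?:" + names + r")\s*:[^\[\]]*\]\]")
    # Strip the marks only within each matched category link.
    return link.sub(lambda m: MARKS.sub("", m.group(0)), wikitext)
```

Text outside category links is left alone, so any deliberate use of the marks in running text is not affected by this sketch.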

@MaxBioHazard My bot is doing this in English Wikipedia.

Magioladitis claimed this task.

Closed this. We handle trailing spaces. We do not handle invisible characters, but we have bots removing them.

If a character is invisible in your browser, it's a control character, not a space; the examples you linked to in particular are U+200E left-to-right marks. This bug is still closed; if you think AWB should handle control characters as well, you need to open a new bug (personally I think it should, as I outlined in T130477#2137895).

I didn't say this character is a space. Why do I need to open a new bug instead of renaming this one?

MBH renamed this task from AWB can't handle categories, inserted with end spaces to AWB can't handle categories, inserted with end RTL marks. Aug 13 2017, 7:29 AM

Because, as was explained above, blind removal of control characters might have unintended effects on certain wikis that make legitimate use of them. Therefore, changing AWB to remove them is well beyond the scope of this bug and requires broader discussion to discover what, if any, such effects might happen, and whether these communities would be amenable to alternate methods of inserting these characters, such as with character entities or templates.