Page MenuHomePhabricator

Old variables are computed wrongly for old entries
Closed, ResolvedPublic

Description

While examinating this AbuseLog entry on itwikinews, the page_namespace variable has the incorrect value of null. However, in the fields below it's clearly shown as 0, and the edit was on a ns0 page, too. Trying to reproduce this error on other AbuseLog entries didn't work, and the namespace has the correct value (0 or anything else).
The first suspect could be that this is due to the edit not being saved, however other non-saved edits do have the correct value for page_namespace.
The second could be that page_namespace isn't used in any filter, and thus computed non-lazily upon examination (something similar to T176291), but the variable was actually computed at edit time.
While further investigation is coming, I have no clues for the moment.

UPDATE 1: This comes from wmf.19. In fact, page_namespace is identical to null for entries older than August 30th, 19:47 UTC, which is the deployment time.
UPDATE 2: This affects all renamed variables. Now, I also suspect the reason: currently, we only use new names for computing, and if the pattern contains an old name, the parser translates it to the new one. However, var_dumps for old entries are stored with the old names, and trying to retrieve (for instance) page_namespace from a dump where such variable is saved as article_namespace returns null because the variable doesn't exist. If this is correct, the possible solutions are:

  1. Change the mapping direction, i.e. translate new names to the old ones. This would require a big change (basically most of the patches for T173889 should be reverted) and produce the opposite problem for entries after wmf.19, plus add some inconsistency (since preferred names aren't actually used) and just delay this problem if we'll decide to completely remove old variables.
  2. Add a special-case backward mapping when examining old entries.
  3. Use a maintenance script to convert old names to the new ones in old entries. Note: We should seize the opportunity and fix T110854 and T187153 with this script.

Point 2 could be a good short term solution, while point 3 would be the long term one. If, however, we assess that any of the two could produce other problems, then we could proceed with point 1 and apply point 3 (and maybe 2) only for newer entries.

Event Timeline

Daimona renamed this task from page_namespace (and maybe other variables) are computed wrongly for some entries to Old variables are computed wrongly for old entries.Sep 13 2018, 1:38 PM
Daimona triaged this task as High priority.
Daimona updated the task description. (Show Details)

Change 460391 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Unbreak /examine for old log entries

https://gerrit.wikimedia.org/r/460391

The patch above is for point 2. Point 3, however, is quite hard, for several reasons. The main one is that storing the dump may happen (or, I should say, happened) in lots of ways. See loadVarDump to see what I mean: there are rows where it's stored directly in afl_var_dump, where in turn it may either be an array or (I think) an instance of AbuseFilterVariableHolder. Plus, it could have up to three flags (external, gzip and nativeDataArray) to determine how it's encoded. Also, it could be stored inside ExternalStorage or not. All these things mean that we'd need to handle lots of different cases, and this could require too much effort since with the above patch everything will work again. My proposal is to first find a standard way to store the dump. Having it directly inside the abuse_filter_log doesn't seem bad to me: direct access and no need to write in the text table. However, I guess that there's a reason if we do it. At any rate, the method should be simplified. Then, we should write the script to change variable names, and seize the opportunity to fix T187153 as well.

Daimona raised the priority of this task from High to Unbreak Now!.Sep 14 2018, 5:36 PM

Old entries are untestable. While for big wikis this isn't a problem, for small ones it definitely is. For instance, we're trying to test a new filter on itwikinews, but since title variables are null for old entries we're more or less paralyzed. While there's no hurry for point 3, point 2 is vital.

Change 460391 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Unbreak /examine for old log entries

https://gerrit.wikimedia.org/r/460391

Daimona lowered the priority of this task from Unbreak Now! to Low.Sep 16 2018, 12:31 PM
Daimona removed a project: Patch-For-Review.

The cleanup is low priority, since it should be addressed together with other stuff, and with care.

Change 482499 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add a maintenance script to clean afl_var_dump

https://gerrit.wikimedia.org/r/482499

Daimona moved this task from Future to Under review on the User-Daimona board.

Change 482499 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add a maintenance script to clean afl_var_dump

https://gerrit.wikimedia.org/r/482499

Change 482499 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add a maintenance script to clean afl_var_dump

https://gerrit.wikimedia.org/r/482499