User:LennardHofmann/GSoC 2022/Report 2


I can't believe it's already been three weeks since I wrote the first progress report; time really flies when I'm programming. Also, some school events have lately kept me from working as much as I did in the first two weeks.

Still, I have made good progress on rewriting the Wikidata Infobox in Lua. Only the authority control and autocat sections still need to be rewritten. The infobox needs some fine-tuning, though, before it can be released for community testing. I'm aiming to release it in mid-July.

While the rewrite from wikitext to Lua has greatly improved the infobox's performance, it also increases Lua memory usage (more code written in Lua obviously needs more Lua memory). This makes the infobox unusable for some of the biggest Wikidata items. Since Scribunto has no memory profiler and, to my knowledge, only seven category pages run out of memory, I won't spend any energy on this issue.

Technical challenges

Retrieving a sitelink from Wikidata to a particular language edition of Wikipedia is easy, right? For the English Wikipedia you simply request the "enwiki" sitelink and so on.

It's not always that easy. WikidataIB has a special case for the "be-tarask" Wikipedia because its domain was renamed while its global site ID, "be_x_oldwiki", stayed the same. I dug deep into Wikimedia documentation[1][2][3][4][5] and found 15 other exceptions (bho, cbk-zam, gsw, ike, lzh, map-bms, nan, nb, nds-nl, mo, roa-tara, rup, sgs, vro, yue). I wish these discrepancies between language codes and site IDs were documented in a single place. If you would like to see a standalone Lua module that converts language codes into site IDs, then please let me know.
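To give a rough idea of what such a converter could look like, here is a minimal sketch. The function name toSiteId and the table name exceptions are hypothetical. Only the be-tarask entry comes from this report; the other entries are my assumptions and should be checked against the pages in the footnotes. The table is deliberately incomplete. The fallback rule (replace hyphens with underscores and append "wiki") is also an assumption that happens to fit codes such as cbk-zam.

```lua
-- Hypothetical lookup table: language code -> global site ID.
-- Incomplete on purpose; verify every entry before relying on it.
local exceptions = {
	['be-tarask'] = 'be_x_oldwiki',   -- mentioned in this report
	['gsw'] = 'alswiki',              -- assumption
	['lzh'] = 'zh_classicalwiki',     -- assumption
	['nan'] = 'zh_min_nanwiki',       -- assumption
	['nb'] = 'nowiki',                -- assumption
	['yue'] = 'zh_yuewiki',           -- assumption
}

-- Returns the Wikipedia site ID for a language code.
local function toSiteId(langCode)
	local siteId = exceptions[langCode]
	if siteId then
		return siteId
	end
	-- Assumed default rule: hyphens become underscores, then append "wiki".
	-- The parentheses drop gsub's second return value (the match count).
	return (langCode:gsub('%-', '_')) .. 'wiki'
end
```

Keeping the exceptions in a plain table means a newly discovered discrepancy only needs one extra line, and the common case stays a single string operation.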

The Lua code that produces a switchable image gallery is probably the ugliest code I have written because it has to deal with many nested HTML elements, some of which are optional. I will rewrite it using the mw.html library, which is implemented similarly to my code. Using it comes with a tiny performance cost, but it hides the ugliness away.
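The builder style can be sketched as follows. So that the snippet runs outside Scribunto, it uses a tiny stand-in that mimics only three mw.html methods (tag, attr, wikitext) and does no escaping; inside a module you would start from mw.html.create instead. The function galleryItem and the class names are hypothetical and are not taken from the infobox.

```lua
-- Minimal stand-in for a small subset of mw.html (no escaping, no CSS helpers).
local Node = {}
Node.__index = Node

local function create(tagName)
	return setmetatable({ tagName = tagName, attrs = {}, children = {} }, Node)
end

function Node:attr(name, value)
	self.attrs[#self.attrs + 1] = { name, value }
	return self
end

-- Appends a child element and returns the child, so calls can be chained.
function Node:tag(tagName)
	local child = create(tagName)
	self.children[#self.children + 1] = child
	return child
end

function Node:wikitext(text)
	self.children[#self.children + 1] = text
	return self
end

function Node:__tostring()
	local parts = { '<', self.tagName }
	for _, a in ipairs(self.attrs) do
		parts[#parts + 1] = ' ' .. a[1] .. '="' .. a[2] .. '"'
	end
	parts[#parts + 1] = '>'
	for _, child in ipairs(self.children) do
		parts[#parts + 1] = tostring(child)
	end
	parts[#parts + 1] = '</' .. self.tagName .. '>'
	return table.concat(parts)
end

-- Hypothetical gallery item: the caption element is optional.
local function galleryItem(file, caption)
	local item = create('div'):attr('class', 'gallery-item')
	item:tag('span'):wikitext('[[File:' .. file .. ']]')
	if caption then
		item:tag('div'):attr('class', 'caption'):wikitext(caption)
	end
	return tostring(item)
end
```

The optional element becomes a plain if statement around one chained call, and the closing tags are generated by the builder instead of being concatenated by hand.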

I also discovered an interesting quirk in the way the output of a Lua module is parsed; see this discussion.

Tips for template editors

Add the following text to meta:Special:MyPage/global.js to use the "Preview page with this template" feature for protected templates:

// [[w:User:Jackmcbarn/advancedtemplatesandbox.js]]
mw.loader.load( 'https://en.wikipedia.org/w/index.php?title=User:Jackmcbarn/advancedtemplatesandbox.js&action=raw&ctype=text/javascript' );

It is really hard to spot subtle bugs in wikitext. In the following snippet, a WikidataIB call was missing the |name= parameter:

{{#if:{{#invoke:Wikidata Infobox|checkProplist|fwd={{{fwd}}}|name=P6507}}||{{#if:{{#invoke:WikidataIB | getValue | rank=best |qid={{{qid|}}} |P225 |qual=P405 |fwd={{{fwd|ALL}}} |osd={{{osd|no}}} |qlinkprefix=":" |qualsonly=yes}}|{{#invoke:Wikidata Infobox|formatLine | P405 | {{#invoke:WikidataIB | getValue | rank=best |qid={{{qid|}}} |P225 |qual=P405, P574 |fwd={{{fwd|ALL}}} |osd={{{osd|no}}} |qlinkprefix=":" |qualsonly=yes}} }}}}}}

This problem can be avoided in Lua by defining a function getValue that always uses the |name= parameter so that you cannot forget it. Below is an almost literal translation of the above wikitext:[6]

if not claims['P6507'] and getValue('P225', { qual='P405', qualsonly='yes' }) then
	local content = getValue('P225', { qual='P405, P574', qualsonly='yes' })
	return formatLine( 'P405', content )
end
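For illustration only, such a wrapper could be sketched like this. It is not the definition from my sandbox module: backend is a stub that simply echoes its arguments in place of the real WikidataIB call, frameArgs holds made-up template arguments, and using the property ID as the value of name is an assumption.

```lua
-- Stub standing in for the real WikidataIB call; it just returns its arguments.
local function backend(args)
	return args
end

-- Made-up template arguments for this sketch.
local frameArgs = { qid = 'Q42', fwd = 'ALL', osd = 'no' }

-- Hypothetical wrapper: fills in the arguments every call needs,
-- including name, so that none of them can be forgotten.
local function getValue(pid, opts)
	local args = {
		pid,
		name = pid,          -- assumption: the property ID doubles as the name
		rank = 'best',
		qid = frameArgs.qid,
		fwd = frameArgs.fwd,
		osd = frameArgs.osd,
		qlinkprefix = ':',
	}
	-- Per-call options are added last and override the defaults.
	for key, value in pairs(opts or {}) do
		args[key] = value
	end
	return backend(args)
end
```

A call such as getValue('P225', { qual='P405', qualsonly='yes' }) then only spells out what differs between calls.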

Since Lua is such a minimalistic language, it's easy to read code written in Lua and mastering the language only takes a few weeks. I encourage you to try it: the Scribunto reference manual contains everything you need to know about writing Lua modules.

Advice for Wikidata editors

Fetching data from large Wikidata items is slow, so please don't use Wikidata to store large amounts of numbers or strings that could be stored as Tabular Data in the Data namespace on Commons instead.

For example, COVID-19 pandemic in Luxembourg (Q87250860) has hundreds of "number of deaths" (P1120) and "number of cases" (P1603) statements, and the data mostly comes from a single source. See Data:COVID-19 (STCenter)/LU/Q32.tab for how this data should be stored.

Footnotes

  1. mw:Manual:$wgExtraLanguageCodes
  2. meta:Special language codes
  3. meta:List_of_Wikipedias#Nonstandard_language_codes
  4. meta:Template:N en/list
  5. meta:Template:Wikilangcode
  6. For efficiency this should be written without using WikidataIB. You can find the definition of getValue and getTaxonAuthor (the actual translation of the above wikitext) in my sandbox module.

Previous post: Report 1 | Next post: Report 3