Community Relations support needed for several read-only windows (s2, s3, s4 and s8)
Closed, Resolved, Public

Description

What is the problem?

Due to on-site maintenance (T226778) on some of the racks that host our primary database servers, we need to switch some of our current primary masters over to other hosts, to ensure there is no unexpected downtime as a result of this maintenance.

This means we need read-only windows to perform this maintenance.

How can we help you?

Notifying the affected wikis (below) with the scheduled maintenance.

What does success look like?

Affected wiki users will know that there will be a read-only period (30 minutes requested; expected to last just a few minutes if everything goes fine).
Users will know what the impact is: writes will be blocked, and reads will remain unaffected.

What is your deadline?

These are the windows, days, time and affected wikis:

Event Timeline

OK.

I've added this to Tech News #36 which will go out on September 2 (no issue on Monday, all the usual writers are travelling and no one else picked up the torch). I've also posted on Project Chat on Wikidata as that is fairly soon.

In general, I recommend:

  • Posting on the Village Pump, to give a heads up, at least a couple of weeks prior to the read-only period. It would be good if this could be done for all wikis.
  • Mentioning it in Tech News. (This is being done.)
    • We can follow up in Tech News later, especially for September 10, 24 and 26. The wikis on September 24 are a) few and b) not Commons and Wikidata that are used by other wikis. If we want to remind the wikis on September 17, that's probably best done individually, outside of Tech News.
  • Setting up a banner maybe 30–60 minutes before the read-only period. This can be done by asking the communities or by a CentralNotice banner (ping @Trizek-WMF).

This is routine by now, and works well, so the above should be enough.

I'll be OoO for two and a half weeks starting tomorrow, so someone else will have to take this up from here on.

@Marostegui Something that would be helpful would be if we could have a simplified technical explanation (aimed at fairly non-technical audience) explaining why we need the read-only periods, so that we could explain to the communities what's happening and why they can't edit. Would you have the time to do that?

(If you feel uncomfortable writing for a non-technical audience: we can help with editing, once we're back.)

Sure @Johan, let me know if this is good enough or you need more or less detail.

There is on-site maintenance at our primary datacenter, where the primary database masters are located. The maintenance will be specifically on those racks, and it involves unplugging servers from one power source and plugging them into another. All our servers have redundant power supplies and will most likely not lose power at any time, but accidents can occur, and if that happens the affected server can go down.
Our primary database masters are the ones that receive the edits and replicate them to other hosts (replicas), where reads happen.

If any of our primary database masters loses power, we not only lose the ability to edit until it is restored (it can take a few minutes for a server to boot up), but we could also run into data corruption (due to the abrupt crash).
To avoid those scenarios, we prefer to switch the master to a host located in a rack that won't be affected by this maintenance.
The read-only time is required because we have to change the configuration to make MediaWiki point to the new host, and that needs to be done while no edits are happening, to avoid the possibility of a split brain (an edit happening at the exact moment the master switch is being propagated to all our MediaWiki servers).
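To make the split-brain risk concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions: the `AppServer` class, the host names and the three-server setup are hypothetical and are not MediaWiki code. It models the configuration change reaching the application servers one at a time, with edits arriving in the middle of that propagation.

```python
from dataclasses import dataclass


@dataclass
class AppServer:
    """Hypothetical application server holding its own copy of the config."""
    primary: str
    read_only: bool = False


def try_edit(server, accepted, rejected, edit):
    """Send an edit to whichever primary this server currently points at."""
    if server.read_only:
        rejected.append(edit)
    else:
        accepted.setdefault(server.primary, []).append(edit)


def simulate(use_read_only):
    """Return (edits accepted per host, edits rejected) for one switchover."""
    servers = [AppServer("old-primary") for _ in range(3)]
    accepted, rejected = {}, []

    if use_read_only:
        for server in servers:
            server.read_only = True

    # The config change reaches the servers one at a time.
    servers[0].primary = "new-primary"
    try_edit(servers[0], accepted, rejected, "edit A")  # already repointed
    try_edit(servers[2], accepted, rejected, "edit B")  # still on the old host
    servers[1].primary = "new-primary"
    servers[2].primary = "new-primary"

    # Propagation is complete, so editing can resume.
    for server in servers:
        server.read_only = False
    try_edit(servers[1], accepted, rejected, "edit C")
    return accepted, rejected
```

Without the read-only flag, "edit A" lands on the new host while "edit B" still lands on the old one, which is the split brain described above. With the flag set, both are rejected and only "edit C", made after the change has reached every server, is accepted.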

@Marostegui Thanks! The one additional detail some people keep asking about where it'd be great if I could give them a good answer would be what in our setup creates this problem, when they can't remember it happening for other websites.

That is a broader discussion.
Essentially, the architecture we have at the moment (both systems and MW related) is designed to have only one active master at a time, so only one host receives writes at a time. Having more than one host accept writes is something we are discussing, but it requires lots of changes to both our systems architecture and the MW code itself.

There is a huge number of organizations out there with a similar architecture, and the key is to do this switchover as fast as possible so your users are not impacted much. We have made tremendous improvements to be able to do this in under 2 minutes (and with the latest changes we are now aiming to do it in under 1 minute).

There is one thing to keep in mind here: we do announce our read-only times, even though they are very short. Some other organizations don't, and they let their users experience an error, which is usually fixed the next time the write is attempted if the process is fast. So the fact that other websites sometimes don't seem to suffer from this problem doesn't mean it is not there; it is just handled in a different way.
In some organizations, showing errors for a minute (or less) is fine and is accepted as part of regular maintenance. We, however, prefer to announce our read-only times to make sure users and bots are aware, so we don't create unexpected inconveniences.
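As an illustration of the "error now, works on the next attempt" approach, here is a minimal sketch of how a bot or client might retry a write during a short read-only window. Everything in it is an assumption made for illustration: `TemporarilyReadOnly`, `write_with_retry` and `make_flaky_write` are hypothetical names and do not correspond to any real MediaWiki or bot-framework API.

```python
import time


class TemporarilyReadOnly(Exception):
    """Hypothetical error raised while the database is read-only."""


def write_with_retry(write, attempts=3, delay_seconds=1.0):
    """Call write(); on a read-only error, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except TemporarilyReadOnly:
            if attempt == attempts:
                raise  # give up and surface the error to the caller
            time.sleep(delay_seconds)


def make_flaky_write(failures):
    """Build a fake write that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def write():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TemporarilyReadOnly("database is read-only")
        return "saved"

    return write


if __name__ == "__main__":
    # Two failed attempts, then the third one goes through.
    print(write_with_retry(make_flaky_write(failures=2), delay_seconds=0))
```

With a window of around a minute, a handful of spaced-out attempts would normally be enough, but the attempt count and delay here are arbitrary.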

Hope this helps!

Elitre subscribed.

I'll assign this temporarily to Trizek, although Johan has already done most of the diligence for the first read-only period, and I'll make sure that he's aware of the few pending tasks for that.

I've worked on the announcement in Tech News.

Concerning the banners, it may be a bit more complicated, since we have multiple wiki families. I'll work on it tomorrow.

@Marostegui is this task only to coordinate our team support?

Yes :-)
The technical bits are on different tickets

Thanks for confirming! In this case, it'd be helpful to follow the structure suggested at https://office.wikimedia.org/wiki/Community_Relations#Public_requests_(standard) - we more or less know what it is that you may need from us, but more details for everyone else could also help - and thanks for the email headsup BTW!

Elitre renamed this task from Several read-only windows needed for: s2, s3, s4 and s8 to Community Relations support needed for several read-only windows (s2, s3, s4 and s8). Aug 29 2019, 4:10 PM

I have edited the task with that template.

Here is the Great Banners Matrix.
Each cell is a separate banner. Since we have one banner template and I don't want to multiply it, I will set up each banner one at a time.

| | 10th Sept | 17th Sept | 24th Sept | 26th Sept |
|---|---|---|---|---|
| wikimedia | | | am, be, br, ca, cn, co, dk, ec, et, fi, hi, id, il, mai, mk, mx, nl, no, nyc, nz, pa_us, pl, pt, punjabi, romd, rs, ru, se, tr, ua, wb | |
| wikipedia | | bg, cs, eo, fi, id, it, nl, no, pl, pt, sv, th, tr, zh | aa, ab, ace, ady, af, ak, als, am, ang, an, arc, arz, ast, as, atj, av, ay, azb, az, bar, bat_smg, ba, bcl, be_x_old, be, bh, bi, bjn, bm, bn, bo, bpy, br, bo, bpy, br, bs, bug, bxr, cbk_zam, cdo, ce, cho, chr, ch, chy, ckb, co, crh, cr, csb, cu, cv, cy, da, din, diq, dsb, dty, dv, dz, ee, el, elm, et, eu, ext, fdc, ff, fiu_vro, fj, fo, frp, frr, fur, fy, gag, gan, ga, gd, glk, gl, gn, gom, gor, got, gu, gv, hak, ha, haw, hif, hi, ho, hr, hsb, ht, hy, hyw, hz, ia, ie, ig, ii, ik, ilo, inh, io, is, iu, jam, jbo, jv, kaa, kab, ka, kbd, kbp, kg, ki, kj, kk, kl, km, kn, koi, krc, kr, ksh, ks, ku, kv, kw, ky, lad, la, lbe, lb, lez, lfn, lg, lij, li, lmo, ln, lo, lrc, ltg, lt, lv, mai, mdf, mg, mhr, mh, min, mi, mk, ml, mn, mrj, mr, ms, mt, mus, mwl, myv, my, mzn, nah, nap, na, nds, ne, new, ng, nn, nov, nrm, nso, nv, ny, oc, olo, om, or, os, pag, pam, pap, pa, pcd, pdc, pfl, pih, pi, pms, pnb, pnt, ps, qu, rm, rmy, rn, roa_rup, roa_tara, rue, rw, sah, sat, sa, scn, sco, sc, sd, se, sg, shn, simple, si, sk, sl, sm, sn, so, sq, srn, ss, stq, st, su, sw, szl, ta, tcy, tet, te, tg, ti, tk, tl, tn, to, tpi, ts, tt, tum, tw, tyv, ty, udm, ug, ur, uz, vec, vep, ve, vls, vo, wa, wo, wuu, xal, xh, xmf, yi, yo, za, zea, zh_classical, zh_min_nan, zh_yue, zu | |
| wikibooks | | | ak, ang, ar, ast, as, ay, az, ba, be, bg, bi, bm, bn, bo, bs, ca, ch, co, cs, cv, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gl, gn, gn, got, gu, he, hi, hr, hu, hy, ia, id, ie, is, it, ja, ka, kk, km, kn, ko, ks, ku, ky, la, lb, li, ln, lt, lv, mg, mi, mk, ml, mn, mr, ms, my, nah, na, nds, ne, nl, no, oc, pa, pl, ps, pt, qu, rm, ro, ru, sa, se, simple, si, sk, sl, sq, sr, tr, su, sv, sw, ta, te, tg, th, tk, tl, tt, ug, uk, ur, uz, vi, vo, wa, xh, yo, za, zh_min_nan, zh, zu | |
| wikinews | | | ar, bg, bs, ca, cs, de, el, en, eo, es, fa, fi, fr, he, hu, it, ja, ko, li, nl, no, pl, pt, ro, ru, sd, sq, sr, sv, ta, th, tr, uk, zh | |
| wikiquote | | en | aa, af, ang, am, ar, ast, az, be, bg, bm, br, bs, ca, co, cr, cs, cy, da, de, el, eo, es, et, eu, fa, fi, fr, ga, gl, gu, he, hi, hr, hu, hy, id, is, it, kk, kn, ko, kr, ks, ku, kw, ky, la, lb, li, lt, ml, mr, na, nds, nl, nn, no, pl, pt, qu, ro, ru, sah, sa, simple, sk, sl, sq, sr, su, sv, ta, te, th, tk, tr, tt, ug, uk, ur, uz, vi, vo, wo, za, zh_min_nan, zh | |
| wikisource | | | ang, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, gl, gu, he, hr, ht, hu, hy, id, is, it, ja, kn, ko, la, li, lt, mk, ml, mr, nap, nl, no, pa, pl, pms, pt, ro, ru, sah, sa, sk, sl, sr, sv, ta, te, th, tr, uk, vec, vi, yi, zh_min_nan, zh | |
| wikiversity | | | ar, cs, de, el, en, es, fi, fr, hi, it, ja, ko, pt, ru, sl, sv, zh | |
| wikivoyage | | | bn, de, el, es, fa, fi, fr, he, hi, it, nl, pl, ps, pt, ro, ru, sv, uk, vi, zh | |
| wiktionary | | bg, en | aa, ab, af, ak, am, ang, an, ar, ast, as, av, ay, az, be, bh, bi, bm, bn, bo, br, bs, ca, chr, ch, co, cr, csb, cs, cy, da, de, dv, dz, el, eo, es, et, eu, fa, fi, fj, fo, fy, ga, gd, gl, gn, gu, gv, ha, he, hif, hi, hr, hsb, hu, hy, ia, id, ie, ik, io, is, it, iu, ja, jbo, jv, ka, kk, kl, km, km, kn, ko, ks, ku, kw, ky, la, lb, li, ln, lo, lt, lv, mh, mi, mk, ml, mn, mr, ms, mt, my, nah, na, nds, ne, nl, nn, no, oc, om, or, pa, pi, pl, pnb, ps, pt, qu, rm, rn, roa_rup, ro, ru, rw, sa, scn, sc, sd, sg, sh, simple, si, sk, sl, sm, sn, so, sq, sr, ss, st, su, sv, sw, ta, te, tg, th, ti, tk, tl, tn, to, tpi, tr, ts, tt, tw, ug, uk, ur, uz, vec, vi, vo, wa, wo, xh, yi, yo, yue, za, zh_min_nan, zh, zu | |
| global wikis | wikidata | | mediawiki, outreach, species | commons |

Wikis that can't be covered:

  • arbcom
  • boardgovcom
  • board
  • beta
  • chairwiki
  • chapcomwiki
  • checkuserwiki
  • collabwiki
  • donate
  • electcomwiki
  • exec
  • noboard_chapters
  • nostalgia
  • foundation
  • fixcopyright
  • grants
  • id_internal
  • incubator
  • internal
  • legalteam
  • login
  • map_bms
  • movementroles
  • nds_nl
  • nyc
  • pa_us
  • office
  • otrs
  • ombudsmen
  • quality
  • searchcom
  • strategy
  • stewards
  • spcom
  • ten
  • techconduct
  • iegcom
  • projectcom
  • all test wikis
  • transitionteam
  • usability
  • vote
  • wg_en

We can skip them, since they either don't seem to have a lot of traffic or are used by people who know how to use them. So the locked database message would be enough, with no prior warning.

Wikimania wikis are read-only, except wikimania.wikimedia.org. But since Wikimania is over, we can rely on the locked database message alone.

Znotch190711 raised the priority of this task from High to Unbreak Now!. Sep 5 2019, 2:19 AM
RhinosF1 lowered the priority of this task from Unbreak Now! to High. Sep 5 2019, 5:50 AM
RhinosF1 removed a subscriber: Liuxinyu970226.

Lowering: there is no reason for UBN! on a Community Relations task where the first impact is 5 days away and they're getting on perfectly fine with what the task asks of them.

UBN is a "drop everything and fix it" priority, and no one needs to drop everything now to finish this.

Banner set for Sept 10. Will be displayed from 04:30 to 05:30 UTC.

s8 (wikidata) has been done today:
read-only start: Tue Sep 10 05:00:47 UTC 2019
read-only stop: Tue Sep 10 05:02:14 UTC 2019

Total read-only time: 1 minute 27 seconds.

Thank you for the update here @Marostegui , much appreciated!

3 banners set for September 17.

I'm warming up for the 24th! :)

Just for the record, I realised that the banner for today on itwiki is wrong:

A breve verrà svolta della manutenzione tecnica. 17 settembre - 05:00 AM UTC - 05:30 AM UTC
Durante tale intervallo potresti non riuscire a salvare alcuna modifica. (dalle 15:00 alle 15:15 UTC del 19 marzo 2019)

So the first part is correct, 17th Sept from 05:00 AM to 05:30 AM UTC, but the second part looks wrong (19 March from 15:00 to 15:15 UTC).
Just saying it here in case you get some messages from people about it as it is a past date :-)

s2 switchover is done
read-only start: 05:00:44
read-only stop: 05:01:34

Total read-only time: 50 seconds.

Thank you for noticing it! The original text does not have that "19 march from 15:00 to 15:15 UTC" part. It is an addition that passed the translation validation without being noticed. I will fix it to avoid that propagation for the next banners.

Banners set for tomorrow.

Concerning chapter and user group sites (*.wikimedia.org), the banner system automatically filters by language, so I don't know how many of those wikis will get the banner. However, most users on those wikis are experienced Wikimedians and will not be surprised by the "database locked" maintenance message.

Surprisingly, Outreach is not eligible for a banner. But it is included if you target all wikis. (This system is not optimal at all.) I've warned the wiki.

Trizek-WMF lowered the priority of this task from High to Medium. Sep 23 2019, 4:45 PM

Banner for Sept 26 ready (minus a minor tweak I'll handle tomorrow).

Lowering priority since almost everything is ready.

DannyS712 raised the priority of this task from Medium to High. Edited Sep 24 2019, 5:05 AM
DannyS712 subscribed.

Reported onwiki, but here may be faster: the current (s3, Sep 24, T230783) central notices link to the wrong task. See https://meta.wikimedia.org/wiki/Meta:Requests_for_help_from_a_sysop_or_bureaucrat#Errors_in_current_central_notices

s3 was done successfully.

read-only start: 05:10:14 UTC
read-only stop: 05:13:08 UTC

Total read-only time: 2 minutes 54 seconds.

We had a slightly longer read-only time compared to all the previous ones due to some issues with the way we set read-only; those will be followed up at T233679.

Thank you for spotting it. I've double checked the task link for the next session (goes to T230784: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC).

Previous banners have been archived. They could be reused if the list of wikis doesn't change.

On my side, everything is done.

s4 was done:

Read-only start: 05:00:51
Read-only stop: 05:01:42
Total read-only time: 51 seconds
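
For reference, the totals reported in this thread can be recomputed from the start and stop timestamps. The sketch below only uses the times posted above; the dictionary layout and the helper names `read_only_seconds` and `format_duration` are illustrative assumptions.

```python
from datetime import datetime

# Start/stop times (UTC) as reported in this task for each section.
WINDOWS = {
    "s8 (10 Sept)": ("05:00:47", "05:02:14"),
    "s2 (17 Sept)": ("05:00:44", "05:01:34"),
    "s3 (24 Sept)": ("05:10:14", "05:13:08"),
    "s4 (26 Sept)": ("05:00:51", "05:01:42"),
}


def read_only_seconds(start, stop):
    """Length of a read-only window in whole seconds (same-day times)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def format_duration(seconds):
    """Render a number of seconds as minutes and seconds."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes} min {rest} s"


durations = {name: read_only_seconds(*times) for name, times in WINDOWS.items()}

if __name__ == "__main__":
    for name, seconds in durations.items():
        print(f"{name}: {format_duration(seconds)}")
```

The results (87, 50, 174 and 51 seconds) match the totals given in the comments, and all four are well inside the 30-minute windows that were requested.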

Hi @Johan and @Marostegui,

I’m reading this a bit late, but your explanations have clarified things for me and removed my doubts about whether these read-only periods and WMF’s datacenters are managed seriously (I expressed those doubts on the translators mailing list at the end of July, in quite a rude tone; sorry for that). So if I understand correctly, the architecture model in use is such that:

  • each wiki always has a single primary database master (active master), which is the only server to receive write operations (there can be more than 100 in one minute on en.wp, counting only changes to pages) and (probably, but it might be a setting) does not handle any read operations;
  • this primary database master transfers the changes to replica hosts, which handle read operations (which are certainly much more numerous than the write operations);
  • it’s not difficult to change the active master (so I guess it’s a replica that stops communicating with readers and communicates with writers instead), but this can lead to losing some edits (so maybe this part could be fixed specifically);
  • thus, a maintenance operation on the active master is likely to imply a read-only period.

And the MediaWiki software does not allow several active masters at a time (so that one could take over for the other).
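
The model summarised in the list above can be sketched as a small toy class. This is only an illustration under assumptions: the `Cluster` class, the host names and the instant replication are hypothetical simplifications and are not how MediaWiki is actually implemented.

```python
import itertools


class Cluster:
    """Toy single-primary cluster (hypothetical names, not MediaWiki code)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.data = {host: {} for host in [primary, *self.replicas]}
        self._next_replica = itertools.cycle(self.replicas)

    def write(self, page, text):
        """Store an edit on the primary, then copy it to every replica."""
        self.data[self.primary][page] = text
        for host in self.replicas:
            # Replication is instantaneous here; a real setup has some lag.
            self.data[host][page] = text
        return self.primary

    def read(self, page):
        """Serve a read from the next replica in round-robin order."""
        host = next(self._next_replica)
        return host, self.data[host].get(page)
```

Every write goes through the one primary, while reads rotate across the replicas, which is why losing or moving the primary affects editing but not reading.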

I’m pleased to see that my work as a translator can lead me to improve my skills in computing too. Thank you!