
service implementation tracking: arclamp1001.eqiad.wmnet
Closed, ResolvedPublic

Description

The following steps are needed to replace webperf1004 with the new arclamp1001 hardware:

  • add rsync config (rsync::quickdatacopy) to allow syncing data from webperf1004 to arclamp1001
  • add xenon user/group: groupadd --system xenon --gid 1001 && useradd --system --gid xenon --uid 498 xenon && install -d -o xenon -g xenon /srv/xenon
  • do the initial rsync and validate filesystem permissions
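The preparation steps above can be scripted so they are safe to re-run. This is a minimal bash sketch, not the Puppet code that was actually used. The function names `ensure_xenon_account` and `report_wrong_owner` are hypothetical; the UID/GID values and the `/srv/xenon` path are the ones listed in the task description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: create the xenon group/user only if they are missing,
# so the step can be repeated without errors. Needs root.
# IDs and path are taken from the task description.
ensure_xenon_account() {
  getent group xenon >/dev/null || groupadd --system xenon --gid 1001
  getent passwd xenon >/dev/null || useradd --system --gid xenon --uid 498 xenon
  install -d -o xenon -g xenon /srv/xenon
}

# Hypothetical helper: print every path under DIR (DIR included) whose owner
# is not USER or whose group is not GROUP. Empty output means ownership is
# as expected after the rsync.
report_wrong_owner() {
  local dir=$1 user=$2 group=$3
  find "$dir" \( ! -user "$user" -o ! -group "$group" \) -print
}
```

After the initial rsync, `report_wrong_owner /srv/xenon xenon xenon` should print nothing; any path it lists still needs a `chown`.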

Coordinate a one-hour window with the Performance team for the actual migration:

  • disable Puppet on webperf1004
  • mask the excimer-k8s-log.service, excimer-k8s-wall-log.service, excimer-log.service, excimer-wall-log.service services on webperf1004 (this is why Puppet must be disabled first; otherwise its next run would unmask them)
  • we also need to mask some timers used for the sync: arclamp_generate_metrics.timer, arclamp_compress_logs.timer, arclamp_generate_svgs.timer
  • rsync again to catch up with the latest changes
  • apply the webperf::profiling_tools role to arclamp1001
  • Change arclamp_host in Hiera to point to arclamp1001
  • Extend scap/dsh config to also deploy to arclamp1001
  • Puppet needs to remain disabled on webperf1004 until it is decommissioned, so it is best to set a one-week downtime
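The first steps of the migration window could be reviewed as a dry run before touching webperf1004. This is a hedged bash sketch: `run`, `disable_puppet`, `mask_units` and the `DRY_RUN` variable are hypothetical names, and it uses the stock `puppet agent --disable` command rather than any site-specific wrapper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Units named in the task description that must stay stopped on webperf1004
# for the duration of the migration window.
UNITS=(
  excimer-k8s-log.service
  excimer-k8s-wall-log.service
  excimer-log.service
  excimer-wall-log.service
  arclamp_generate_metrics.timer
  arclamp_compress_logs.timer
  arclamp_generate_svgs.timer
)

# Hypothetical wrapper: with DRY_RUN=1 (the default) commands are only
# printed, prefixed with "+", so the sequence can be reviewed first.
run() {
  if [[ "${DRY_RUN:-1}" == 1 ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Disable Puppet with a reason string so its next run cannot unmask the units.
disable_puppet() {
  local reason=$1
  run puppet agent --disable "$reason"
}

# Mask all units; "--now" also stops them.
mask_units() {
  run systemctl mask --now "${UNITS[@]}"
}
```

Running `DRY_RUN=0 disable_puppet "migrating to arclamp1001 - T319434"` and then `DRY_RUN=0 mask_units` as root would execute the commands for real. Because `mask --now` also stops the units, a separate `systemctl stop` is not needed.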

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 5 2022, 3:35 PM
Dzahn changed the task status from Open to Stalled. Oct 12 2022, 7:51 PM
Krinkle renamed this task from service implementation tracking: webperf1005.eqiad.wmnet to service implementation tracking: arclamp1001.eqiad.wmnet. Oct 15 2022, 10:21 PM
Krinkle updated the task description.
LSobanski subscribed.

Doesn't look like collab, unassigning from Daniel.

per T316223#8381863 serviceops-core is taking this over

Dzahn changed the task status from Stalled to Open. Nov 9 2022, 5:57 PM

Let us know if there is anything you need from the perf team.

@aaron CCing @akosiaris. Depending on how you want to look at it, this is either a subtask that unblocks T316223 or a duplicate of it. Also see T316223#8383941, T316223#8185277. Cheers

akosiaris added a subscriber: lmata.

Actually, per 28f86674054b7 observability has taken over arclamp from serviceops. It aligns more closely with their area of expertise and focus than serviceops and we are thankful. I forgot to update T316223 to account for this, doing so now. Also adding @lmata as the best person now to take this over.

I had synced up with Leo on this: since I migrated webperf* to Bullseye, I will pair up with someone from o11y to move these to their new bare-metal home.

Indeed, we have a Q3 placeholder for this. /cc @fgiunchedi

MoritzMuehlenhoff updated the task description.

Change 879813 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] webperf: use rsync::quickdatacopy for arclamp data

https://gerrit.wikimedia.org/r/879813

Change 879814 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] arclamp: move to EnvironmentFile for generate/compress jobs

https://gerrit.wikimedia.org/r/879814

Change 879813 merged by Filippo Giunchedi:

[operations/puppet@production] webperf: use rsync::quickdatacopy for arclamp data

https://gerrit.wikimedia.org/r/879813

Change 879814 merged by Filippo Giunchedi:

[operations/puppet@production] arclamp: move to EnvironmentFile for generate/compress jobs

https://gerrit.wikimedia.org/r/879814

Change 881352 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] site: apply webperf::profiling_tools to arclamp1001

https://gerrit.wikimedia.org/r/881352

Change 881353 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Move arclamp to arclamp1001

https://gerrit.wikimedia.org/r/881353

Mentioned in SAL (#wikimedia-operations) [2023-01-18T09:49:37Z] <godog> start migration from webperf1004 to arclamp1001 - T319434

Change 881352 merged by Filippo Giunchedi:

[operations/puppet@production] site: apply webperf::profiling_tools to arclamp1001

https://gerrit.wikimedia.org/r/881352

Change 881353 merged by Filippo Giunchedi:

[operations/puppet@production] Move arclamp to arclamp1001

https://gerrit.wikimedia.org/r/881353

Change 881449 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] webperf: update rsync source/dest with the new reality

https://gerrit.wikimedia.org/r/881449

Change 881449 merged by Filippo Giunchedi:

[operations/puppet@production] webperf: update rsync source/dest with the new reality

https://gerrit.wikimedia.org/r/881449

Change 881569 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Decom webperf[12]004

https://gerrit.wikimedia.org/r/881569

Change 881569 merged by Filippo Giunchedi:

[operations/puppet@production] Decom webperf[12]004

https://gerrit.wikimedia.org/r/881569

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: webperf1004.eqiad.wmnet

  • webperf1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
fgiunchedi claimed this task.

This is completed: arclamp is hosted on arclamp1001 and webperf1004 has been decommissioned.

lmata moved this task from Backlog to Radar on the SRE board.