
ops-monitoring-bot (Operations Monitoring Bot)
User · Bot

User Details

User Since: Aug 12 2016, 1:45 PM (432 w, 4 d)
Roles: Bot
Availability: Available
LDAP User: Unknown
MediaWiki User: Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Today

ops-monitoring-bot created T380905: Degraded RAID on cp7004.
Tue, Nov 26, 6:49 PM · SRE, ops-magru
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-codfw cluster:

  • wikikube-ctrl[2001-2003].codfw.wmnet
Tue, Nov 26, 5:59 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1313-1327].eqiad.wmnet completed:

  • wikikube-worker[1313-1327].eqiad.wmnet (PASS)
    • Host wikikube-worker[1313-1327].eqiad.wmnet pooled in wikikube-eqiad
Tue, Nov 26, 5:47 PM · serviceops
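
The host expressions in these comments, such as wikikube-worker[1313-1327].eqiad.wmnet, use a bracketed numeric range to stand for a set of hosts. The following is a minimal sketch of how one such range could be expanded into individual names. The function name expand_hosts is hypothetical, it handles only a single start-end range, and it is not the parser used by the tooling that posts these comments.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'web[01-03].example' -> 3 names.

    Simplified illustration: supports a single '[start-end]' group only,
    with no comma-separated lists or nested ranges.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single host
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep any zero padding from the start value
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("wikikube-worker[1313-1327].eqiad.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

Under this reading, the short form in the task title, wikikube-worker13[13-27], expands to the same fifteen worker names without the domain suffix.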
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

pool host wikikube-worker[1313-1327].eqiad.wmnet by cgoubert@cumin1002 with reason: None

Tue, Nov 26, 5:47 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1321 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261709_cgoubert_623223_wikikube-worker1321.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:28 PM · serviceops
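
The reimage log paths quoted in these comments appear to follow the pattern timestamp_user_number_host.out. The sketch below splits such a path into its fields, assuming that layout holds. The names ReimageLog, parse_reimage_log and run_id are hypothetical, and the meaning of the numeric third field is not stated in the feed, so it is kept as an opaque integer.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed to be a YYYYMMDDHHMM timestamp
    user: str
    run_id: int        # numeric field of unstated meaning (assumption)
    host: str


def parse_reimage_log(path: str) -> ReimageLog:
    """Split '<stamp>_<user>_<number>_<host>.out' into typed fields."""
    stem = PurePosixPath(path).stem  # file name without the '.out' suffix
    # maxsplit=3 assumes the user name itself contains no underscore
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=int(run_id),
        host=host,
    )


log = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202411261709_cgoubert_623223_wikikube-worker1321.out"
)
print(log)
```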
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1325 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261704_cgoubert_621468_wikikube-worker1325.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:24 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1326 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261701_cgoubert_621524_wikikube-worker1326.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:21 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1324 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261658_cgoubert_621433_wikikube-worker1324.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:17 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1327 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261654_cgoubert_621568_wikikube-worker1327.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:13 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1322 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261651_cgoubert_621385_wikikube-worker1322.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:11 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1323 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261645_cgoubert_618650_wikikube-worker1323.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 5:04 PM · serviceops
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-eqiad cluster:

  • wikikube-ctrl[1001-1003].eqiad.wmnet
Tue, Nov 26, 5:00 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:46 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1321 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1321.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:44 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1321 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1321.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:28 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1327 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1327.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:28 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1326 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1326.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:27 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1325 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1325.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:27 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1324 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1324.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:27 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1322 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1322.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:26 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:22 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm executed with errors:

  • wikikube-worker1323 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1323.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 4:20 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:49 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:49 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:48 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:48 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:47 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:46 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:45 PM · serviceops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm completed:

  • dns7001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261511_fabfur_601294_dns7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 3:42 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
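
Each reimage comment carries a per-host outcome line of the form "• host (PASS)", "• host (FAIL)" or "• host (WARN)". A rough sketch of tallying those outcomes from the text of a feed like this one follows. The regular expression and the function name tally_outcomes are illustrative assumptions and not part of any Phabricator or cookbook API.

```python
import re
from collections import Counter

# Matches the top-level outcome bullet, e.g. "  • dns7001 (WARN)".
# Step bullets such as "• Rebooted" do not end in a status and are skipped.
STATUS_LINE = re.compile(
    r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|FAIL|WARN)\)\s*$"
)


def tally_outcomes(lines):
    """Count PASS/FAIL/WARN outcome lines in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_LINE.match(line)
        if match:
            counts[match.group("status")] += 1
    return counts


sample = [
    "  • wikikube-worker1321 (PASS)",
    "    • Removed from Debmonitor if present",
    "  • wikikube-worker1321 (FAIL)",
    "    • Host rebooted via IPMI",
    "  • dns7001 (WARN)",
]
counts = tally_outcomes(sample)
print(dict(counts))
```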
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1027 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1027.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 3:40 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1025 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1025.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 3:39 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1026 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1026.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 3:39 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1020.eqiad.wmnet of running VMs

Tue, Nov 26, 3:22 PM · Ganeti, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1020.eqiad.wmnet of running VMs

Tue, Nov 26, 3:20 PM · Ganeti, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 2:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye completed:

  • dns7001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261409_fabfur_587599_dns7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Tue, Nov 26, 2:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye completed:

  • cp7015 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261335_fabfur_580242_cp7015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Tue, Nov 26, 2:01 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye completed:

  • lvs7003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261320_fabfur_575669_lvs7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Tue, Nov 26, 1:49 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye

Tue, Nov 26, 1:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye completed:

  • cloudcephmon1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261257_dcaro_560976_cloudcephmon1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Nov 26, 1:15 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 1:11 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm executed with errors:

  • dns7001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dns7001.wikimedia.org" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 1:07 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 1:03 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye

Tue, Nov 26, 12:58 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

  • cp7015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 12:48 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 12:30 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Tue, Nov 26, 11:29 AM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephmon1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephmon1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 11:25 AM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

cookbooks.sre.hosts.decommission executed by jayme@cumin2002 for hosts: kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet

  • kubernetes2005.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Tue, Nov 26, 9:23 AM · Data-Persistence, serviceops, Prod-Kubernetes
ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[1005-1006,1015-1016].eqiad.wmnet completed:

  • kubernetes[1005-1006,1015-1016].eqiad.wmnet (PASS)
    • Host kubernetes[1005-1006,1015-1016].eqiad.wmnet depooled from wikikube-eqiad
Tue, Nov 26, 8:49 AM · Data-Persistence, serviceops, Prod-Kubernetes
ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

depool host kubernetes[1005-1006,1015-1016].eqiad.wmnet by jayme@cumin2002 with reason: None

Tue, Nov 26, 8:48 AM · Data-Persistence, serviceops, Prod-Kubernetes
ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[2005-2006,2015-2016].codfw.wmnet completed:

  • kubernetes[2005-2006,2015-2016].codfw.wmnet (PASS)
    • Host kubernetes[2005-2006,2015-2016].codfw.wmnet depooled from wikikube-codfw
Tue, Nov 26, 8:46 AM · Data-Persistence, serviceops, Prod-Kubernetes
ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

depool host kubernetes[2005-2006,2015-2016].codfw.wmnet by jayme@cumin2002 with reason: None

Tue, Nov 26, 8:46 AM · Data-Persistence, serviceops, Prod-Kubernetes
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 1:29 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

  • cp7015 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.
Tue, Nov 26, 1:04 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 12:55 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

Yesterday

ops-monitoring-bot added a comment to T380790: decommission restbase202[1-3].codfw.wmnet.

cookbooks.sre.hosts.decommission executed by eevans@cumin1002 for hosts: restbase[2021-2023].codfw.wmnet

  • restbase2021.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 9:03 PM · SRE, ops-codfw, DC-Ops, decommission-hardware
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm

Mon, Nov 25, 8:00 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm completed:

  • ganeti7003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251935_robh_3696544_ganeti7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Mon, Nov 25, 7:58 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm

Mon, Nov 25, 6:59 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7015.magru.wmnet

  • cp7015.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 6:28 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: lvs7003.magru.wmnet

  • lvs7003.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 6:17 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7008.magru.wmnet

  • cp7008.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 5:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7004.magru.wmnet

  • ganeti7004.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 5:39 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7006.magru.wmnet

  • cp7006.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 5:10 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7003.magru.wmnet

  • ganeti7003.magru.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 4:59 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops
ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Mon, Nov 25, 4:45 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: kubernetes[1009-1014].eqiad.wmnet

  • kubernetes1009.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Nov 25, 3:41 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1309 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251518_cgoubert_338099_wikikube-worker1309.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 3:38 PM · serviceops
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=5f6d9bcd-ea50-4930-ae11-44dc16f236cc) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup2011.codfw.wmnet
Mon, Nov 25, 3:37 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=27d9fbf8-1966-45be-a935-7d8411d2fcbf) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup2010.codfw.wmnet
Mon, Nov 25, 3:37 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host kubernetes[1009-1014].eqiad.wmnet completed:

  • kubernetes[1009-1014].eqiad.wmnet (PASS)
    • Host kubernetes[1009-1014].eqiad.wmnet depooled from wikikube-eqiad
Mon, Nov 25, 2:59 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops
ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

depool host kubernetes[1009-1014].eqiad.wmnet by cgoubert@cumin1002 with reason: decom

Mon, Nov 25, 2:56 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm

Mon, Nov 25, 2:54 PM · serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker1309.eqiad.wmnet completed:

  • wikikube-worker1309.eqiad.wmnet (PASS)
    • Host wikikube-worker1309.eqiad.wmnet depooled from wikikube-eqiad
Mon, Nov 25, 2:53 PM · serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

depool host wikikube-worker1309.eqiad.wmnet by cgoubert@cumin1002 with reason: None

Mon, Nov 25, 2:53 PM · serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1310-1312].eqiad.wmnet completed:

  • wikikube-worker[1310-1312].eqiad.wmnet (PASS)
    • Host wikikube-worker[1310-1312].eqiad.wmnet pooled in wikikube-eqiad
Mon, Nov 25, 2:48 PM · serviceops
ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

pool host wikikube-worker[1310-1312].eqiad.wmnet by cgoubert@cumin1002 with reason: None

Mon, Nov 25, 2:47 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1318 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251328_cgoubert_304102_wikikube-worker1318.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:47 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1320 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251324_cgoubert_304292_wikikube-worker1320.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:43 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1314 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251321_cgoubert_303773_wikikube-worker1314.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:41 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1319 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251317_cgoubert_304197_wikikube-worker1319.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:35 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1315 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251313_cgoubert_303847_wikikube-worker1315.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:33 PM · serviceops
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1317 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251311_cgoubert_304019_wikikube-worker1317.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:30 PM · serviceops
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-eqiad cluster:

  • wikikube-worker[1305-1312].eqiad.wmnet
Mon, Nov 25, 1:28 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1316 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251308_cgoubert_303917_wikikube-worker1316.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:28 PM · serviceops
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-codfw cluster:

  • wikikube-worker[2128-2170].codfw.wmnet
Mon, Nov 25, 1:27 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm completed:

  • wikikube-worker1313 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251304_cgoubert_303715_wikikube-worker1313.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Nov 25, 1:25 PM · serviceops
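
The reimage log paths in the comments above share a visible naming pattern: four underscore-separated fields before the .out suffix. A rough sketch of splitting such a path, assuming the first field is a YYYYMMDDHHMM start time and treating the third field as an opaque run identifier, since its meaning is not stated in the feed. The names ReimageLog and parse_reimage_log are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed to be the cookbook start time
    user: str
    run_id: str        # meaning not stated in the feed; kept as-is
    host: str


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a reimage log path into its underscore-separated fields."""
    stem = PurePosixPath(path).stem  # drops the directory and '.out'
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=run_id,
        host=host,
    )


if __name__ == "__main__":
    print(parse_reimage_log(
        "/var/log/spicerack/sre/hosts/reimage/"
        "202411251304_cgoubert_303715_wikikube-worker1313.out"
    ))
```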
ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=0c1a74c1-8a06-405c-a5fc-07dabe312239) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup1011.eqiad.wmnet
Mon, Nov 25, 12:47 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm

Mon, Nov 25, 12:44 PM · serviceops