ops-monitoring-bot (Operations Monitoring Bot)
UserBot

Projects: Trusted-Contributors (Group)

User Details

User Since: Aug 12 2016, 1:45 PM (432 w, 4 d)
Roles: Bot
Availability: Available
LDAP User: Unknown
MediaWiki User: Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Today

ops-monitoring-bot created T380905: Degraded RAID on cp7004.

Tue, Nov 26, 6:49 PM · SRE, ops-magru

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-codfw cluster:

wikikube-ctrl[2001-2003].codfw.wmnet

Tue, Nov 26, 5:59 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1313-1327].eqiad.wmnet completed:

wikikube-worker[1313-1327].eqiad.wmnet (PASS)
- Host wikikube-worker[1313-1327].eqiad.wmnet pooled in wikikube-eqiad

Tue, Nov 26, 5:47 PM · serviceops
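
Comments in this feed name groups of hosts with a bracketed numeric range, for example wikikube-worker[1313-1327].eqiad.wmnet. The following is a minimal sketch of how such a range could be expanded into individual host names. The function name expand_host_range is hypothetical, and the sketch handles only the single "[START-END]" form seen in these comments; the production tooling likely supports a richer host-selection syntax that is not reproduced here.

```python
import re

# Matches "prefix[START-END]suffix" with exactly one numeric range.
RANGE_PATTERN = re.compile(
    r"(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)"
)


def expand_host_range(pattern: str) -> list[str]:
    """Expand 'name[START-END].domain' into a list of host names.

    A pattern without a bracketed range is returned unchanged as a
    one-element list.
    """
    match = RANGE_PATTERN.fullmatch(pattern)
    if match is None:
        return [pattern]
    # Preserve zero padding by reusing the width of the start value.
    width = len(match["start"])
    start, end = int(match["start"]), int(match["end"])
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(start, end + 1)
    ]


hosts = expand_host_range("wikikube-worker[1313-1327].eqiad.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

For the range above this yields 15 names, from wikikube-worker1313.eqiad.wmnet through wikikube-worker1327.eqiad.wmnet.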

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

pool host wikikube-worker[1313-1327].eqiad.wmnet by cgoubert@cumin1002 with reason: None

Tue, Nov 26, 5:47 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm completed:

wikikube-worker1321 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261709_cgoubert_623223_wikikube-worker1321.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:28 PM · serviceops
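
Each reimage report in this feed has the same shape: a header line with the short host name and a status in parentheses (PASS, WARN, or FAIL all occur here), followed by one "- " line per completed step. As a rough sketch, assuming that layout holds, a report could be parsed as below. The names ReimageReport and parse_report are illustrative and are not part of the cookbook tooling.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ReimageReport:
    host: str
    status: str
    steps: list[str] = field(default_factory=list)


# Header such as "wikikube-worker1321 (PASS)".
HEADER = re.compile(r"^(?P<host>\S+) \((?P<status>PASS|WARN|FAIL)\)$")


def parse_report(text: str) -> ReimageReport:
    """Extract host, status, and step list from one bot comment body."""
    report = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER.match(line)
        if header:
            report = ReimageReport(header["host"], header["status"])
        elif line.startswith("- ") and report is not None:
            # Step lines are only collected once a header has been seen.
            report.steps.append(line[2:])
    if report is None:
        raise ValueError("no '<host> (STATUS)' header found")
    return report


sample = """wikikube-worker1321 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Icinga status is optimal
- Icinga downtime removed"""

report = parse_report(sample)
print(report.host, report.status, len(report.steps))
```

Comparing the last step of a FAIL report with the step list of a PASS report for the same host is one way to see how far a failed run got.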

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm completed:

wikikube-worker1325 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261704_cgoubert_621468_wikikube-worker1325.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:24 PM · serviceops
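
The first-Puppet-run log paths quoted in these reports share a file-name pattern that appears to be a 12-digit timestamp, the user who started the cookbook, a numeric identifier, and the short host name, joined by underscores. That reading is inferred from the paths in this feed rather than from any documented format, and the meaning of the numeric field is not stated here, so the key run_id below is only a placeholder. Under those assumptions, a small sketch for splitting such a name:

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_log_name(path: str) -> dict:
    """Split a reimage log file name into its apparent components.

    Assumed layout (inferred, not documented):
    <YYYYMMDDHHMM>_<user>_<numeric id>_<host>.out
    """
    stem = PurePosixPath(path).stem  # drops the directory and ".out"
    stamp, user, run_id, host = stem.split("_", 3)
    return {
        # Naive datetime: the time zone of the stamp is not given here.
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": run_id,  # placeholder name; actual meaning unknown
        "host": host,
    }


info = parse_log_name(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202411261709_cgoubert_623223_wikikube-worker1321.out"
)
print(info["started"].isoformat(), info["user"], info["host"])
```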

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm completed:

wikikube-worker1326 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261701_cgoubert_621524_wikikube-worker1326.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:21 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm completed:

wikikube-worker1324 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261658_cgoubert_621433_wikikube-worker1324.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:17 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm completed:

wikikube-worker1327 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261654_cgoubert_621568_wikikube-worker1327.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:13 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm completed:

wikikube-worker1322 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261651_cgoubert_621385_wikikube-worker1322.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:11 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm completed:

wikikube-worker1323 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261645_cgoubert_618650_wikikube-worker1323.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 5:04 PM · serviceops

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-eqiad cluster:

wikikube-ctrl[1001-1003].eqiad.wmnet

Tue, Nov 26, 5:00 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:46 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1321 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1321.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:44 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:30 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:29 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1321 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1321.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:28 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1327 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1327.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:28 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1326 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1326.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:27 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1325 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1325.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:27 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1324 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1324.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:27 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1322 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1322.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:26 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm

Tue, Nov 26, 4:22 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm executed with errors:

wikikube-worker1323 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker1323.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 4:20 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1327.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:49 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1326.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:49 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1325.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:48 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1324.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:48 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1323.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:47 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1322.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:46 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1321.eqiad.wmnet with OS bookworm

Tue, Nov 26, 3:45 PM · serviceops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm completed:

dns7001 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261511_fabfur_601294_dns7001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 3:42 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors:

wdqs1027 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1027.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 3:40 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:

wdqs1025 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1025.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 3:39 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors:

wdqs1026 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs1026.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 3:39 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1020.eqiad.wmnet of running VMs

Tue, Nov 26, 3:22 PM · Ganeti, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1020.eqiad.wmnet of running VMs

Tue, Nov 26, 3:20 PM · Ganeti, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 2:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye completed:

dns7001 (WARN)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261409_fabfur_587599_dns7001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Tue, Nov 26, 2:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1025.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T378030: Q2:rack/setup/install wdqs102[567].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye

Tue, Nov 26, 2:19 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), wmde-wikidata-tech, Wikidata, Wikidata-Query-Service, SRE, Discovery-Search, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye completed:

cp7015 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261335_fabfur_580242_cp7015.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Tue, Nov 26, 2:01 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye completed:

lvs7003 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261320_fabfur_575669_lvs7003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Tue, Nov 26, 1:49 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host dns7001.wikimedia.org with OS bullseye

Tue, Nov 26, 1:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye completed:

cloudcephmon1004 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411261257_dcaro_560976_cloudcephmon1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, Nov 26, 1:15 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 1:11 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm executed with errors:

dns7001 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dns7001.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 1:07 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 1:03 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host lvs7003.magru.wmnet with OS bullseye

Tue, Nov 26, 12:58 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

cp7015 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 12:48 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 12:30 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Tue, Nov 26, 11:29 AM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye executed with errors:

cloudcephmon1004 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephmon1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 11:25 AM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

cookbooks.sre.hosts.decommission executed by jayme@cumin2002 for hosts: kubernetes[2005-2006,2015-2016].codfw.wmnet,kubernetes[1005-1006,1015-1016].eqiad.wmnet

kubernetes2005.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox

Tue, Nov 26, 9:23 AM · Data-Persistence, serviceops, Prod-Kubernetes

ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[1005-1006,1015-1016].eqiad.wmnet completed:

kubernetes[1005-1006,1015-1016].eqiad.wmnet (PASS)
- Host kubernetes[1005-1006,1015-1016].eqiad.wmnet depooled from wikikube-eqiad

Tue, Nov 26, 8:49 AM · Data-Persistence, serviceops, Prod-Kubernetes

ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

depool host kubernetes[1005-1006,1015-1016].eqiad.wmnet by jayme@cumin2002 with reason: None

Tue, Nov 26, 8:48 AM · Data-Persistence, serviceops, Prod-Kubernetes

ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[2005-2006,2015-2016].codfw.wmnet completed:

kubernetes[2005-2006,2015-2016].codfw.wmnet (PASS)
- Host kubernetes[2005-2006,2015-2016].codfw.wmnet depooled from wikikube-codfw

Tue, Nov 26, 8:46 AM · Data-Persistence, serviceops, Prod-Kubernetes

ops-monitoring-bot added a comment to T379599: Reevaluate the requirement for dedicated sessionstore/kask nodes in wikikube clusters.

depool host kubernetes[2005-2006,2015-2016].codfw.wmnet by jayme@cumin2002 with reason: None

Tue, Nov 26, 8:46 AM · Data-Persistence, serviceops, Prod-Kubernetes

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm

Tue, Nov 26, 1:29 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye executed with errors:

cp7015 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp7015.magru.wmnet" to get a root shell, but depending on the failure this may not work.

Tue, Nov 26, 1:04 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye

Tue, Nov 26, 12:55 AM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

Yesterday

ops-monitoring-bot added a comment to T380790: decommission restbase202[1-3].codfw.wmnet.

cookbooks.sre.hosts.decommission executed by eevans@cumin1002 for hosts: restbase[2021-2023].codfw.wmnet

restbase2021.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 9:03 PM · SRE, ops-codfw, DC-Ops, decommission-hardware

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm

Mon, Nov 25, 8:00 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm completed:

ganeti7003 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251935_robh_3696544_ganeti7003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mon, Nov 25, 7:58 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm

Mon, Nov 25, 6:59 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7015.magru.wmnet

cp7015.magru.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 6:28 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: lvs7003.magru.wmnet

lvs7003.magru.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 6:17 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7008.magru.wmnet

cp7008.magru.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 5:43 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7004.magru.wmnet

ganeti7004.magru.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 5:39 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: cp7006.magru.wmnet

cp7006.magru.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 5:10 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T380307: installation tracking for hosts affected by magru re-shuffle.

cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: ganeti7003.magru.wmnet

ganeti7003.magru.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 4:59 PM · Infrastructure-Foundations, Traffic, ops-magru, DC-Ops

ops-monitoring-bot added a comment to T364870: Q4:rack/setup/install new cloudcephmon hosts.

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye

Mon, Nov 25, 4:45 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

cookbooks.sre.hosts.decommission executed by cgoubert@cumin1002 for hosts: kubernetes[1009-1014].eqiad.wmnet

kubernetes1009.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mon, Nov 25, 3:41 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm completed:

wikikube-worker1309 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251518_cgoubert_338099_wikikube-worker1309.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 3:38 PM · serviceops

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=5f6d9bcd-ea50-4930-ae11-44dc16f236cc) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup2011.codfw.wmnet

Mon, Nov 25, 3:37 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=27d9fbf8-1966-45be-a935-7d8411d2fcbf) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup2010.codfw.wmnet

Mon, Nov 25, 3:37 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host kubernetes[1009-1014].eqiad.wmnet completed:

kubernetes[1009-1014].eqiad.wmnet (PASS)
- Host kubernetes[1009-1014].eqiad.wmnet depooled from wikikube-eqiad

Mon, Nov 25, 2:59 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops

ops-monitoring-bot added a comment to T380027: Decommission kubernetes10[09-14].

depool host kubernetes[1009-1014].eqiad.wmnet by cgoubert@cumin1002 with reason: decom

Mon, Nov 25, 2:56 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware, serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1309.eqiad.wmnet with OS bookworm

Mon, Nov 25, 2:54 PM · serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker1309.eqiad.wmnet completed:

wikikube-worker1309.eqiad.wmnet (PASS)
- Host wikikube-worker1309.eqiad.wmnet depooled from wikikube-eqiad

Mon, Nov 25, 2:53 PM · serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

depool host wikikube-worker1309.eqiad.wmnet by cgoubert@cumin1002 with reason: None

Mon, Nov 25, 2:53 PM · serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[1310-1312].eqiad.wmnet completed:

wikikube-worker[1310-1312].eqiad.wmnet (PASS)
- Host wikikube-worker[1310-1312].eqiad.wmnet pooled in wikikube-eqiad

Mon, Nov 25, 2:48 PM · serviceops

ops-monitoring-bot added a comment to T377022: wikikube-worker13[05-12] implementation tracking.

pool host wikikube-worker[1310-1312].eqiad.wmnet by cgoubert@cumin1002 with reason: None

Mon, Nov 25, 2:47 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1318.eqiad.wmnet with OS bookworm completed:

wikikube-worker1318 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251328_cgoubert_304102_wikikube-worker1318.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:47 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm completed:

wikikube-worker1320 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251324_cgoubert_304292_wikikube-worker1320.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:43 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1314.eqiad.wmnet with OS bookworm completed:

wikikube-worker1314 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251321_cgoubert_303773_wikikube-worker1314.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:41 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1319.eqiad.wmnet with OS bookworm completed:

wikikube-worker1319 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251317_cgoubert_304197_wikikube-worker1319.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:35 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1315.eqiad.wmnet with OS bookworm completed:

wikikube-worker1315 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251313_cgoubert_303847_wikikube-worker1315.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:33 PM · serviceops

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1317.eqiad.wmnet with OS bookworm completed:

wikikube-worker1317 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251311_cgoubert_304019_wikikube-worker1317.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:30 PM · serviceops

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-eqiad cluster:

wikikube-worker[1305-1312].eqiad.wmnet

Mon, Nov 25, 1:28 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1316.eqiad.wmnet with OS bookworm completed:

wikikube-worker1316 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251308_cgoubert_303917_wikikube-worker1316.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:28 PM · serviceops

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Started rebooting nodes in wikikube-codfw cluster:

wikikube-worker[2128-2170].codfw.wmnet

Mon, Nov 25, 1:27 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker1313.eqiad.wmnet with OS bookworm completed:

wikikube-worker1313 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411251304_cgoubert_303715_wikikube-worker1313.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, Nov 25, 1:25 PM · serviceops

ops-monitoring-bot added a comment to T380731: Reboots of Bookworm systems which use 6.1.115.

Icinga downtime and Alertmanager silence (ID=0c1a74c1-8a06-405c-a5fc-07dabe312239) set by jynus@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Reboot

backup1011.eqiad.wmnet

Mon, Nov 25, 12:47 PM · Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T380350: wikikube-worker13[13-27] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker1320.eqiad.wmnet with OS bookworm

Mon, Nov 25, 12:44 PM · serviceops
