Reboot issues for mw13[77-83].eqiad.wmnet
Closed, ResolvedPublic

Description

When rebooting these nodes imaged with the kubernetes::worker role:

before reboot: [57118.905101] watchdog: watchdog0: watchdog did not stop!

after reboot, in UEFI:

UEFI0082: The system was reset due to a timeout from the watchdog timer.
Check the System Event Log (SEL) or crash dumps from Operating System to
identify the source that triggered the watchdog timer reset. Update the
firmware or driver for the identified device.
and then it presents a menu
Available Actions:
F1 to Continue and Retry Boot Order
...

Manual intervention (pressing F1 in the management console) is then required to continue booting; otherwise the host stays stuck at this menu.

This only happens when the kubernetes::worker role is applied -- the same node works fine when reimaged to insetup.

My theory is that this may affect all PowerEdge R440 servers. There aren't that many MW servers that aren't this model, so this seems like a blocker for expanding the cluster once we run out of the other nodes.

Event Timeline

Change 987958 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY role for debugging T354413 for mw1377

https://gerrit.wikimedia.org/r/987958

Change 987958 merged by Kamila Součková:

[operations/puppet@production] TEMPORARY role for debugging T354413 for mw1377

https://gerrit.wikimedia.org/r/987958

Change 987960 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY changes for debugging T354413

https://gerrit.wikimedia.org/r/987960

Change 987743 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Revert "TEMPORARY role for debugging T354413 for mw1377"

https://gerrit.wikimedia.org/r/987743

Change 987743 merged by Kamila Součková:

[operations/puppet@production] Revert "TEMPORARY role for debugging T354413 for mw1377"

https://gerrit.wikimedia.org/r/987743

Note that we have tried updating the firmware: mw1388 is on new UEFI and iDRAC and still exhibits this problem.

Pasting SEL entries for completeness here. Surprise, they aren't particularly helpful

Record:      11
Date/Time:   01/03/2024 19:44:59
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/04/2024 15:01:31
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/04/2024 15:09:40
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/04/2024 15:31:19
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   01/04/2024 15:48:28
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   01/04/2024 15:58:29
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   01/04/2024 16:44:32
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   01/04/2024 17:14:58
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
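For anyone who wants to tally these entries rather than read them, here is a minimal sketch that parses a SEL dump in the layout pasted above and counts watchdog resets per day. The function names are made up for illustration, and it assumes the Date/Time field is month/day/year, which matches the dates of this investigation.

```python
import re
from collections import Counter
from datetime import datetime

# Two records copied from the dump above, used as sample input.
SAMPLE = """\
Record:      11
Date/Time:   01/03/2024 19:44:59
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/04/2024 15:01:31
Source:      system
Severity:    Critical
Description: The watchdog timer reset the system.
"""


def parse_sel(text):
    """Split a SEL dump into one dict per record, keyed by field name."""
    records = []
    # Records are separated by a line made only of dashes.
    for block in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def watchdog_resets_per_day(records):
    """Count records whose description mentions a watchdog reset, per ISO date."""
    days = Counter()
    for rec in records:
        if "watchdog timer reset" in rec.get("Description", ""):
            # Assumption: the SEL prints dates as MM/DD/YYYY.
            stamp = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S")
            days[stamp.date().isoformat()] += 1
    return dict(days)


if __name__ == "__main__":
    print(watchdog_resets_per_day(parse_sel(SAMPLE)))
```

This only confirms what the SEL already says (the watchdog fired); it does not identify what kept the system from rebooting in time.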

Host rebooted by kamila@cumin1002 with reason: None

Is this reproducible with every reboot or just some? One thing worth doing is to connect to the serial console and then issue a reboot over Cumin. Maybe we're seeing a kernel oops during system teardown?

Additional findings:

  • the watchdog: watchdog0: watchdog did not stop! message seems to be a red herring, it's always there
  • the problem only occurs when running nohup reboot (which is what the cookbooks do), not when I run just reboot
  • it cannot be reproduced on an insetup system with overlayfs in use

It seems to happen every time I run nohup reboot. Edit: sorry, it's not deterministic, but it seems to happen more than half the time (regardless of whether I run it via cumin or type it in a terminal). There is no indication of an oops; I can't tell whether the problem has occurred until the BIOS tells me.

Change 987960 abandoned by Kamila Součková:

[operations/puppet@production] TEMPORARY changes for debugging T354413

Reason:

reverted parent

https://gerrit.wikimedia.org/r/987960

Note that this is non-deterministic: the problem seems to happen more than half the time but far from always. So several reboots may be required to reproduce. Yay!

Change 988507 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY role for debugging T354413 for mw1377

https://gerrit.wikimedia.org/r/988507

Change 988507 merged by Kamila Součková:

[operations/puppet@production] TEMPORARY role for debugging T354413 for mw1377

https://gerrit.wikimedia.org/r/988507

Change 988510 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY for debugging T354413: enable overlayfs

https://gerrit.wikimedia.org/r/988510

Change 988510 merged by Kamila Součková:

[operations/puppet@production] TEMPORARY for debugging T354413: enable overlayfs

https://gerrit.wikimedia.org/r/988510

Change 988263 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Revert "TEMPORARY for debugging T354413: enable overlayfs"

https://gerrit.wikimedia.org/r/988263

Change 988263 merged by Kamila Součková:

[operations/puppet@production] Revert "TEMPORARY for debugging T354413: enable overlayfs"

https://gerrit.wikimedia.org/r/988263

Change 988656 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY for debugging T354413: kubernetes::node

https://gerrit.wikimedia.org/r/988656

Change 988656 merged by Kamila Součková:

[operations/puppet@production] TEMPORARY for debugging T354413: kubernetes::node

https://gerrit.wikimedia.org/r/988656

Change 988664 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY role for debugging T354413: puppet7

https://gerrit.wikimedia.org/r/988664

Change 988664 merged by Kamila Součková:

[operations/puppet@production] TEMPORARY role for debugging T354413: puppet7

https://gerrit.wikimedia.org/r/988664

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye executed with errors:

  • mw1377 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202401081652_kamila_698453_mw1377.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202401081654_kamila_698453_mw1377.out, asking the operator what to do
    • First Puppet run failed and the operator aborted
    • The reimage failed, see the cookbook logs for the details

Change 988676 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw1377: change role to insetup

https://gerrit.wikimedia.org/r/988676

Change 988676 merged by Kamila Součková:

[operations/puppet@production] mw1377: change role to insetup

https://gerrit.wikimedia.org/r/988676

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye completed:

  • mw1377 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401081717_kamila_702504_mw1377.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Note a few other things that have been tried:

kamila triaged this task as High priority. Jan 8 2024, 5:41 PM
kamila updated the task description.

Change 989192 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] TEMPORARY for debugging T354413: add role hiera

https://gerrit.wikimedia.org/r/989192

Mentioned in SAL (#wikimedia-operations) [2024-01-09T15:37:39Z] <akosiaris> depool and reboot mw1349 for a test T354413

I've unloaded the wdat_wdt module and issued one more reboot on mw1378 (the tests on mw1349 have led nowhere).

And previously I would see

[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
[  OK  ] Finished Reboot.
[  OK  ] Reached target Reboot.
[  386.995482] watchdog: watchdog0: watchdog did not stop!

whereas with the module unloaded now I see

[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
[  OK  ] Finished Reboot.
[  OK  ] Reached target Reboot.
[ 2189.702299] systemd-shutdown[1]: Waiting for process: containerd-shim
[ 2269.817404] systemd-shutdown[1]: Could not stop MD /dev/md0: Device or resource busy
[ 2269.835737] systemd-shutdown[1]: Failed to finalize DM devices, MD devices, ignoring.
[ 2270.846859] reboot: Restarting system

and a successful reboot.

I am gonna repeat this experiment on mw1378 a couple of times.

Some more rough data:

That containerd-shim message led me to track down which containerd-shim we were talking about. After some trial and error [1], it appears that one of the containers (istio ingressgateway), which we run on every Kubernetes node as a daemonset, takes a while to stop. This is probably the trigger for the following race condition:

  1. The reboot command starts the shutdown process.
  2. istio ingress containerd-shim is issued a signal to stop from systemd but takes a while to gracefully stop (for unknown reasons currently)
  3. The kernel has also stopped informing the watchdog (which is, in this batch of systems, powered by the wdat_wdt kernel module) that it is around
  4. The watchdog decides to reboot the system before systemd-shutdown has a chance to reboot the system
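The race in the steps above can be sketched as a comparison of two deadlines. This is only a toy model: the class and field names are invented for illustration, and the timeout and durations in the example are hypothetical values, not numbers read from the hardware or from wdat_wdt.

```python
from dataclasses import dataclass


@dataclass
class ShutdownAttempt:
    last_ping_s: float         # when the kernel last informed the watchdog
    shutdown_done_s: float     # when systemd-shutdown would issue the reboot
    watchdog_timeout_s: float  # hypothetical hardware timeout


def outcome(attempt):
    """Return which side wins the race for a single shutdown attempt."""
    deadline = attempt.last_ping_s + attempt.watchdog_timeout_s
    if attempt.shutdown_done_s < deadline:
        return "clean reboot"
    # The watchdog fires first, and the firmware then stops at the F1 prompt.
    return "watchdog reset"


if __name__ == "__main__":
    # Hypothetical numbers: a 60 s timeout, a fast and a slow container stop.
    fast = ShutdownAttempt(last_ping_s=0.0, shutdown_done_s=1.2, watchdog_timeout_s=60.0)
    slow = ShutdownAttempt(last_ping_s=0.0, shutdown_done_s=81.0, watchdog_timeout_s=60.0)
    print(outcome(fast))
    print(outcome(slow))
```

In this model, anything that shortens the shutdown (stopping the slow container first) or removes the deadline (unloading the watchdog module) makes the clean reboot win.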

In some tests on mw1349 (which differs from mw1378 in that it a) is a MediaWiki host, not a Kubernetes one, and b) runs buster, not bullseye), I think I got close to triggering this, but so far systemd-shutdown has been faster than the watchdog.

There are a couple of ways to avoid this manually:

  • Removing the kernel module wdat_wdt (in which case we don't even see the watchdog: watchdog0: watchdog did not stop! message, as in the output I pasted above).
  • Manually stopping kubelet and ingressgateway with a command like sudo docker stop k8s_istio-proxy_istio-ingressgateway-7khms_istio-system_79e01ffb-3e75-4263-9b41-f1fec3ec7c49_11

In this case we see something like

[  OK  ] Finished Reboot.
[  OK  ] Reached target Reboot.
[   95.370553] watchdog: watchdog0: watchdog did not stop!
[   95.503223] systemd-shutdown[1]: Could not stop MD /dev/md0: Device or resource busy
[   95.521114] systemd-shutdown[1]: Failed to finalize DM devices, MD devices, ignoring.
[   96.531523] reboot: Restarting system

This looks like the messages I saw on mw1349 (where I failed to reproduce this up to now)

[  OK  ] Started Reboot.
[  OK  ] Reached target Reboot.
[  362.149733] watchdog: watchdog0: watchdog did not stop!
[  363.419076] reboot: Restarting system
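To compare such console captures, one rough measure is the time between the first timestamped kernel line after the systemd targets and the final "reboot: Restarting system" line. The sketch below computes that from pasted console text; the function name is hypothetical and the sample is the mw1378 capture from the module-unloaded reboot.

```python
import re

# Matches kernel-style lines such as "[ 2189.702299] message";
# systemd status lines like "[  OK  ] ..." do not match.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

SAMPLE = """\
[  OK  ] Finished Reboot.
[  OK  ] Reached target Reboot.
[ 2189.702299] systemd-shutdown[1]: Waiting for process: containerd-shim
[ 2269.817404] systemd-shutdown[1]: Could not stop MD /dev/md0: Device or resource busy
[ 2269.835737] systemd-shutdown[1]: Failed to finalize DM devices, MD devices, ignoring.
[ 2270.846859] reboot: Restarting system
"""


def final_phase_seconds(console_log):
    """Seconds from the first timestamped line to 'reboot: Restarting system'.

    Returns None if the restart line never appears, which is what a
    capture of a watchdog-triggered reset would look like.
    """
    first = None
    for line in console_log.splitlines():
        match = STAMP.match(line)
        if match is None:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if first is None:
            first = stamp
        if message.startswith("reboot: Restarting system"):
            return stamp - first
    return None


if __name__ == "__main__":
    print(f"{final_phase_seconds(SAMPLE):.2f} s")
```

On the captures in this comment, the slow shutdown spends roughly 81 seconds in this final phase, while the runs where the container was already stopped take about one second.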

With @kamila, we'll dive more into this tomorrow so we can come up with a recommendation.

Icinga downtime and Alertmanager silence (ID=39a549ae-98fb-4aef-878d-0821f2d1ea4b) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Trying to reproduce wdat_wdt watchdog problem

mw1349.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=9b28fccf-ebb0-4701-b5cb-3d157b3ca2b0) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Trying to reproduce wdat_wdt watchdog problem

mw1378.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=116c5a1a-2682-42d9-b281-94b33ec2e23c) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Trying to reproduce wdat_wdt watchdog problem

mw1349.eqiad.wmnet

Change 989455 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] kubernetes::node: Blacklist wdat_wdt kernel module

https://gerrit.wikimedia.org/r/989455

Change 989460 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] kmod::blacklist: Allow also rmmoding modules

https://gerrit.wikimedia.org/r/989460

Change 989460 merged by Kamila Součková:

[operations/puppet@production] kmod::blacklist: Allow also rmmoding modules

https://gerrit.wikimedia.org/r/989460

Icinga downtime and Alertmanager silence (ID=b2284900-4b6b-4cc1-aba3-ee88a4fb1e3e) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Trying to reproduce wdat_wdt watchdog problem

mw1378.eqiad.wmnet

Change 989455 merged by Kamila Součková:

[operations/puppet@production] kubernetes::node: Blacklist wdat_wdt kernel module

https://gerrit.wikimedia.org/r/989455

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1380.eqiad.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=32e531db-6a10-4d67-adb6-cb3288c935b2) set by akosiaris@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Trying to reproduce wdat_wdt watchdog problem

mw1349.eqiad.wmnet

Appears to be related to the wdat_wdt watchdog driver (all affected hosts have that driver). Kernels >= 5.10.205-1 should have a related patch backported (https://patchwork.kernel.org/project/linux-watchdog/patch/[email protected]/), but this also happens on newer kernels.

Correcting myself, it's 5.10.127-1, not 5.10.205-1 per https://metadata.ftp-master.debian.org/changelogs/main/l/linux-signed-amd64/linux-signed-amd64_5.10.197+1_changelog

But the funny thing is I can't find that patch in https://salsa.debian.org/kernel-team/linux/-/blob/master/debian/patches/series?ref_type=heads

I am probably missing something.

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1380.eqiad.wmnet with OS bullseye completed:

  • mw1380 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401101457_kamila_1087167_mw1380.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1381.eqiad.wmnet with OS bullseye

This patch wasn't directly backported by Debian, but was backported for the 5.10 LTS kernel series, which Debian rebases to frequently. Specifically for this patch, the backport landed in the 5.10.122 upstream kernel and the first upload to include that kernel was 5.10.127-1:
https://tracker.debian.org/news/1341816/accepted-linux-510127-1-source-into-proposed-updates-stable-new-proposed-updates/
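As a quick way to apply that rule to a host's kernel package version, here is a minimal sketch that extracts the upstream part of a Debian 5.10 kernel version string and compares it with 5.10.122. The function names are invented for illustration, and the parsing is deliberately simplistic: it looks only at the leading x.y.z and is not a general Debian version comparison.

```python
import re

# First upstream 5.10 LTS release carrying the backport, per the comment above.
FIRST_FIXED_UPSTREAM = (5, 10, 122)


def upstream_version(debian_version):
    """Extract the leading x.y.z from a string such as '5.10.127-1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", debian_version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {debian_version!r}")
    return tuple(int(part) for part in match.groups())


def has_wdat_wdt_backport(debian_version):
    """True if a 5.10-series kernel is at or past the release with the backport."""
    version = upstream_version(debian_version)
    if version[:2] != (5, 10):
        # This sketch only knows about the 5.10 LTS series.
        raise ValueError("only 5.10.x kernels are covered by this check")
    return version >= FIRST_FIXED_UPSTREAM


if __name__ == "__main__":
    for candidate in ("5.10.120-1", "5.10.127-1", "5.10.205-1"):
        print(candidate, has_wdat_wdt_backport(candidate))
```

Note that this only tells you whether the patch is present; as mentioned earlier in the thread, the reboot problem also shows up on kernels that include it.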

And that was the context I was missing. My thanks!

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1381.eqiad.wmnet with OS bullseye completed:

  • mw1381 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401101537_kamila_1093611_mw1381.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=fe00207a-5a06-4fc6-a98d-1de1261b924f) set by kamila@cumin1002 for 1:00:00 on 5 host(s) and their services with reason: testing reboot

mw[1379-1383].eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1378.eqiad.wmnet with OS bullseye

Change 989192 abandoned by Kamila Součková:

[operations/puppet@production] TEMPORARY for debugging T354413: add role hiera

Reason:

not needed anymore (I hope)

https://gerrit.wikimedia.org/r/989192

Change 989562 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Clean up the temporary changes for debugging T354413

https://gerrit.wikimedia.org/r/989562

Change 989562 merged by Kamila Součková:

[operations/puppet@production] Clean up the temporary changes for debugging T354413

https://gerrit.wikimedia.org/r/989562

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1378.eqiad.wmnet with OS bullseye completed:

  • mw1378 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401101655_kamila_1112003_mw1378.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host mw1377.eqiad.wmnet with OS bullseye completed:

  • mw1377 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401101734_kamila_1119888_mw1377.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

With Alex's patches, I ran 7 reimages and 20 reboots without the issue reappearing. It might be worthwhile to understand the issue better to see if that workaround is adequate, but as it appears to not be blocking us anymore, I'm closing this.