
Upgrade clouddb* hosts to Bookworm
Closed, ResolvedPublic

Description

Clouddb* hosts currently run Bullseye with MariaDB 10.6. We are going to stop packaging 10.6 for Bullseye and focus only on Bookworm (to avoid duplicating work). Most of our database infrastructure is already running Bookworm and we've had no performance issues.

Please schedule the upgrade of clouddb hosts to Bookworm (in terms of databases there's nothing new as the version will still be 10.6).

EDIT: make sure to upgrade the MariaDB package to the latest minor version 10.6.19 (T372536)

Reimage:

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020

Package upgrade to 10.6.19:

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020

The reimage procedure is at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host
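
For reference, the per-host flow can be scripted from the cumin host. A minimal sketch: the cookbook names match those that appear in this task's logs, but the exact flags and the helper function itself are assumptions, so check `cookbook sre.hosts.reimage --help` before running anything.

```shell
# Hypothetical helper: print the two commands to run from a cumin host
# for one clouddb reimage (downtime first, then the reimage cookbook).
print_reimage_steps() {
  local host="$1"
  echo "sudo cookbook sre.hosts.downtime --hours 1 -r 'Reimaging ${host} T365424' '${host}.eqiad.wmnet'"
  echo "sudo cookbook sre.hosts.reimage --os bookworm -t T365424 ${host}"
}

# Print the steps for every host in the list above.
for h in clouddb1013 clouddb1014 clouddb1015 clouddb1016 \
         clouddb1017 clouddb1018 clouddb1019 clouddb1020; do
  print_reimage_steps "$h"
done
```

This only prints the commands, so it can be reviewed before anything is executed.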

Event Timeline


@fnegri before clouddb1021 gets decommissioned, it could be a good test for you to reimage it to Bookworm and see what the process will look like for the rest of the hosts. cc @BTullis

Thanks. Yes, @fnegri I'm more than happy for you to reimage clouddb1021, if it would be helpful for you.
That host is about ready to be decommissioned now (T368518) so my plan was to turn off all sections and leave them stopped for a week or so, whilst we check for anything that goes wrong as a result.
If this could be helpful as a testbed for reimaging the other clouddb hosts, then feel free to go ahead.

@BTullis I think you can proceed with your test and turn off all sections for a week. When that is done and you are confident nothing goes wrong as a result, I will proceed with the reimage. After the reimage is done, we can decommission it.

Great, thanks. I was thinking also of putting it into the insetup::data_engineering role. Otherwise, having this server with sections disabled will probably cause problems for sre.wikireplica.add-wiki and similar cookbooks.
Would that still be useful for you, or would it make more sense for you to do the reimage first, then switch roles afterwards?

I think it makes sense to do your test first. I can change back the role before the reimage.

Change #1054516 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the role of clouddb1021 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1054516

Change #1054516 merged by Btullis:

[operations/puppet@production] Switch the role of clouddb1021 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1054516

@BTullis can I revert your patch and switch back the role for clouddb1021 to wmcs::db::wikireplicas::dedicated::analytics_multiinstance?

clouddb1021 was decommed today by @BTullis before reading this comment, so it can no longer be used as a test host. :/

@Marostegui I think I can depool any random clouddb host and use that one as a test for reimaging. clouddb1019 is already depooled, so that's an easy candidate. In the worst case, if the reimage doesn't go smoothly, we can keep that host depooled for a few days while we fix it, and it should not cause any impact to end users.

Change #1059888 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure clouddb servers with reuse-parts-test

https://gerrit.wikimedia.org/r/1059888

I'm going ahead and reimaging clouddb1019, pairing with @BTullis

Change #1059888 merged by Btullis:

[operations/puppet@production] Configure clouddb servers with reuse-parts-test

https://gerrit.wikimedia.org/r/1059888

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1019.eqiad.wmnet with OS bookworm

fnegri changed the task status from Open to In Progress.Aug 5 2024, 2:59 PM

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1019.eqiad.wmnet with OS bookworm completed:

  • clouddb1019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408051402_fnegri_298453_clouddb1019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@fnegri, I forgot to mention something. I think there is a manual step that we need to carry out on these hosts after reimage.
The reason is this:

image.png (447×478 px, 57 KB)

We have a prometheus-mysqld-exporter for each MariaDB instance on the host. However, there is a stray process which seems to have been created at some point during the install, but is not disabled and so it remains in an error state.

I think that we just have to do this once for each host:

sudo systemctl disable prometheus-mysqld-exporter.service
sudo systemctl mask prometheus-mysqld-exporter.service

It would be nice if it were in puppet, but I think it hasn't been high enough priority to fix it yet.
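
Until it is in Puppet, the one-time fix could be pushed to all hosts at once with cumin. A sketch; the host selector syntax and the helper are assumptions, verify against `cumin --help` first.

```shell
# Build (but do not run) a cumin one-liner that disables and masks the
# stray bare exporter unit on every matching host; the per-instance
# prometheus-mysqld-exporter@<section> units are left untouched.
build_exporter_fix() {
  local selector="$1"
  echo "sudo cumin '${selector}' 'systemctl disable prometheus-mysqld-exporter.service && systemctl mask prometheus-mysqld-exporter.service'"
}

build_exporter_fix 'clouddb10*'
```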

@BTullis thanks! That was actually in my checklist at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host but I somehow managed to miss it. :/

Mentioned in SAL (#wikimedia-operations) [2024-08-06T14:28:53Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T14:29:06Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T16:03:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T16:03:18Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1020.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1020.eqiad.wmnet with OS bookworm completed:

  • clouddb1020 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408061623_fnegri_498616_clouddb1020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-08T13:25:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-08T13:25:16Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1018.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1018.eqiad.wmnet with OS bookworm completed:

  • clouddb1018 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408081344_fnegri_901336_clouddb1018.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

> Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

The last step of the cookbook failed with https://phabricator.wikimedia.org/P67253 but the reimage was successful, and the host is now repooled.

The failure was likely caused by me pressing "enter" 3 times without noticing instead of typing "go".

Mentioned in SAL (#wikimedia-operations) [2024-08-08T14:51:58Z] <fnegri@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424"

Mentioned in SAL (#wikimedia-operations) [2024-08-08T14:52:32Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424"

Mentioned in SAL (#wikimedia-operations) [2024-08-13T09:40:27Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-13T09:40:40Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm completed:

  • clouddb1016 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408131002_fnegri_1751466_clouddb1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-16T13:43:27Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-16T13:43:55Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408161407_fnegri_2379971_clouddb1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" clouddb1017.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

The reimage cookbook for clouddb1017 failed only because MariaDB took a bit longer to catch up with the primary, and the cookbook did not wait long enough.

Replication lag is now back at zero and the Icinga status is green. I have repooled the host.
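
A quick way to check catch-up on each instance before repooling is to read `Seconds_Behind_Master` out of `SHOW SLAVE STATUS\G`. The parsing helper below is made up for illustration; on the host you would pipe real output into it, e.g. `sudo mysql -S /run/mysqld/mysqld.s1.sock -e 'SHOW SLAVE STATUS\G' | parse_lag`.

```shell
# Extract the Seconds_Behind_Master value from `SHOW SLAVE STATUS\G`
# output read on stdin; a value of 0 means replication has caught up.
parse_lag() {
  awk -F': *' '$1 ~ /Seconds_Behind_Master/ {print $2}'
}
```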

Mentioned in SAL (#wikimedia-operations) [2024-08-19T12:27:34Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-19T12:27:47Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm completed:

  • clouddb1015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408191249_fnegri_2910028_clouddb1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-20T13:06:12Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-20T13:06:26Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1014.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1014.eqiad.wmnet with OS bookworm completed:

  • clouddb1014 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408201331_fnegri_3115857_clouddb1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1013.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1013.eqiad.wmnet with OS bookworm completed:

  • clouddb1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408201424_fnegri_3124669_clouddb1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All hosts are now running Bookworm. I will keep this task open until I've also upgraded MariaDB to version 10.6.19.

@Marostegui what is the procedure that you follow for minor-version upgrades? I would follow the same procedure as for rebooting, but skip the actual reboot.

What I normally do is:

  • Stop slave on each instance
  • Stop each instance's daemon (never all of them at the same time): systemctl stop mariadb@s1 etc
  • apt full-upgrade
  • Start each instance: systemctl start mariadb@s1 etc
  • There is no need to do this, but I normally do it: mysql_upgrade --force -S $SOCKET_PATH
  • Start slave on each instance
  • Enjoy
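
The steps above can be sketched as a small dry-run script for one multi-instance host. The section names and socket paths are assumptions (each clouddb host serves a different set of sections); the helper only prints the commands so the plan can be reviewed before executing anything by hand.

```shell
# Print the minor-upgrade command sequence for the given sections,
# following the steps listed above: stop replication, stop each daemon
# one at a time, upgrade, restart, then resume replication.
print_upgrade_plan() {
  local sections="$1"   # e.g. "s1 s2 s3" -- hypothetical section list
  local s
  for s in $sections; do
    echo "mysql -S /run/mysqld/mysqld.${s}.sock -e 'STOP SLAVE;'"
    echo "systemctl stop mariadb@${s}"   # one at a time, never all at once
  done
  echo "apt full-upgrade"
  for s in $sections; do
    echo "systemctl start mariadb@${s}"
    echo "mysql_upgrade --force -S /run/mysqld/mysqld.${s}.sock"  # optional
    echo "mysql -S /run/mysqld/mysqld.${s}.sock -e 'START SLAVE;'"
  done
}

print_upgrade_plan "s1 s2 s3"
```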

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:03:26Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:03:42Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:22:15Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:22:30Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:35:09Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:35:24Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T10:06:16Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T10:06:31Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:30:53Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:08Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:13Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:28Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:50:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:50:20Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T14:09:50Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T14:10:05Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T15:48:51Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T15:49:07Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424

fnegri updated the task description. (Show Details)
fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.

All clouddb* hosts have been upgraded to MariaDB 10.6.19, following the process at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host. I've also upgraded the other upgradable packages and rebooted all hosts after the upgrade.