
Upgrade clouddb* hosts to Bookworm
Closed, ResolvedPublic

Description

Clouddb* hosts currently run Bullseye with MariaDB 10.6. We are going to stop packaging 10.6 for Bullseye and focus only on Bookworm (to avoid duplicating work). Most of our database infrastructure is already running Bookworm and we've had no performance issues.

Please schedule the upgrade of clouddb hosts to Bookworm (in terms of databases there's nothing new as the version will still be 10.6).

EDIT: make sure to upgrade the MariaDB package to the latest minor version 10.6.19 (T372536)

Reimage:

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020

Package upgrade to 10.6.19:

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020

The reimage procedure is at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host
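
For reference, the per-host flow can be scripted from the cumin host. A minimal sketch: the cookbook names match those that appear in this task's logs, but the exact flags and the helper function itself are assumptions, so check `cookbook sre.hosts.reimage --help` before running anything.

```shell
# Hypothetical helper: print the two commands to run from a cumin host
# for one clouddb reimage (downtime first, then the reimage cookbook).
print_reimage_steps() {
  local host="$1"
  echo "sudo cookbook sre.hosts.downtime --hours 1 -r 'Reimaging ${host} T365424' '${host}.eqiad.wmnet'"
  echo "sudo cookbook sre.hosts.reimage --os bookworm -t T365424 ${host}"
}

# Print the steps for every host in the list above.
for h in clouddb1013 clouddb1014 clouddb1015 clouddb1016 \
         clouddb1017 clouddb1018 clouddb1019 clouddb1020; do
  print_reimage_steps "$h"
done
```

This only prints the commands, so it can be reviewed before anything is executed.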

Event Timeline


@fnegri before clouddb1021 gets decommissioned, it could be a good test for you to reimage it to Bookworm and see what the process will look like for the rest of the hosts. cc @BTullis

Thanks. Yes, @fnegri I'm more than happy for you to reimage clouddb1021, if it would be helpful for you.
That host is about ready to be decommissioned now (T368518) so my plan was to turn off all sections and leave them stopped for a week or so, whilst we check for anything that goes wrong as a result.
If this could be helpful as a testbed for reimaging the other clouddb hosts, then feel free to go ahead.

@BTullis I think you can proceed with your test and turn off all sections for a week. When that is done and you are confident nothing goes wrong as a result, I will proceed with the reimage. After the reimage is done, we can decommission it.

Great, thanks. I was thinking also of putting it into the insetup::data_engineering role. Otherwise, having this server with sections disabled will probably cause problems for sre.wikireplica.add-wiki and similar cookbooks.
Would that still be useful for you, or would it make more sense for you to do the reimage first, then switch roles afterwards?

I think it makes sense to do your test first. I can change back the role before the reimage.

Change #1054516 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the role of clouddb1021 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1054516

Change #1054516 merged by Btullis:

[operations/puppet@production] Switch the role of clouddb1021 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1054516

@BTullis can I revert your patch and switch back the role for clouddb1021 to wmcs::db::wikireplicas::dedicated::analytics_multiinstance?

clouddb1021 was decommed today by @BTullis before reading this comment, so it can no longer be used as a test host. :/

@Marostegui I think I can depool any random clouddb host and use that one as a test for reimaging. clouddb1019 is already depooled, so that's an easy candidate. In the worst case, if the reimage doesn't go smoothly, we can keep that host depooled for a few days while we fix it, and it should not cause any impact to end users.

Change #1059888 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure clouddb servers with reuse-parts-test

https://gerrit.wikimedia.org/r/1059888

I'm going ahead and reimaging clouddb1019, pairing with @BTullis

Change #1059888 merged by Btullis:

[operations/puppet@production] Configure clouddb servers with reuse-parts-test

https://gerrit.wikimedia.org/r/1059888

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1019.eqiad.wmnet with OS bookworm

fnegri changed the task status from Open to In Progress.Aug 5 2024, 2:59 PM

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1019.eqiad.wmnet with OS bookworm completed:

  • clouddb1019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408051402_fnegri_298453_clouddb1019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@fnegri, I forgot to mention something. I think there is a manual step that we need to carry out on these hosts after reimage.
The reason is this:

image.png (447×478 px, 57 KB)

We have a prometheus-mysqld-exporter for each MariaDB instance on the host. However, there is a stray process which seems to have been created at some point during the install, but is not disabled and so it remains in an error state.

I think that we just have to do this once for each host:

sudo systemctl disable prometheus-mysqld-exporter.service
sudo systemctl mask prometheus-mysqld-exporter.service

It would be nice if it were in puppet, but I think it hasn't been high enough priority to fix it yet.
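
Until it is in Puppet, the one-time fix could be pushed to all hosts at once with cumin. A sketch; the host selector syntax and the helper are assumptions, verify against `cumin --help` first.

```shell
# Build (but do not run) a cumin one-liner that disables and masks the
# stray bare exporter unit on every matching host; the per-instance
# prometheus-mysqld-exporter@<section> units are left untouched.
build_exporter_fix() {
  local selector="$1"
  echo "sudo cumin '${selector}' 'systemctl disable prometheus-mysqld-exporter.service && systemctl mask prometheus-mysqld-exporter.service'"
}

build_exporter_fix 'clouddb10*'
```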

@BTullis thanks! That was actually in my checklist at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host but I somehow managed to miss it. :/

Mentioned in SAL (#wikimedia-operations) [2024-08-06T14:28:53Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T14:29:06Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T16:03:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-06T16:03:18Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1020.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1020.eqiad.wmnet with OS bookworm completed:

  • clouddb1020 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408061623_fnegri_498616_clouddb1020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-08T13:25:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-08T13:25:16Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Reimaging clouddb1018 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1018.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1018.eqiad.wmnet with OS bookworm completed:

  • clouddb1018 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408081344_fnegri_901336_clouddb1018.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

> Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

The last step of the cookbook failed with https://phabricator.wikimedia.org/P67253 but the reimage was successful, and the host is now repooled.

The failure was likely caused by me pressing "enter" 3 times without noticing instead of typing "go".

Mentioned in SAL (#wikimedia-operations) [2024-08-08T14:51:58Z] <fnegri@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424"

Mentioned in SAL (#wikimedia-operations) [2024-08-08T14:52:32Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Running sync-netbox-hiera manually because it failed during the reimage - fnegri@cumin1002 - T365424"

Mentioned in SAL (#wikimedia-operations) [2024-08-13T09:40:27Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-13T09:40:40Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Reimaging clouddb1016 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1016.eqiad.wmnet with OS bookworm completed:

  • clouddb1016 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408131002_fnegri_1751466_clouddb1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-16T13:43:27Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-16T13:43:55Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Reimaging clouddb1017 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1017.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408161407_fnegri_2379971_clouddb1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" clouddb1017.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

The reimage cookbook for clouddb1017 failed only because MariaDB took a bit longer to catch up with the primary, and the cookbook did not wait long enough.

Replication lag is now back at zero and the Icinga status is green. I have repooled the host.
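
A quick way to check catch-up on each instance before repooling is to read `Seconds_Behind_Master` out of `SHOW SLAVE STATUS\G`. The parsing helper below is made up for illustration; on the host you would pipe real output into it, e.g. `sudo mysql -S /run/mysqld/mysqld.s1.sock -e 'SHOW SLAVE STATUS\G' | parse_lag`.

```shell
# Extract the Seconds_Behind_Master value from `SHOW SLAVE STATUS\G`
# output read on stdin; a value of 0 means replication has caught up.
parse_lag() {
  awk -F': *' '$1 ~ /Seconds_Behind_Master/ {print $2}'
}
```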

Mentioned in SAL (#wikimedia-operations) [2024-08-19T12:27:34Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-19T12:27:47Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1015.eqiad.wmnet with OS bookworm completed:

  • clouddb1015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408191249_fnegri_2910028_clouddb1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-08-20T13:06:12Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-08-20T13:06:26Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1014.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1014.eqiad.wmnet with OS bookworm completed:

  • clouddb1014 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408201331_fnegri_3115857_clouddb1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host clouddb1013.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host clouddb1013.eqiad.wmnet with OS bookworm completed:

  • clouddb1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408201424_fnegri_3124669_clouddb1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All hosts are now running Bookworm. I will keep this task open until I've also upgraded MariaDB to version 10.6.19.

@Marostegui what is the procedure that you follow for minor-version upgrades? I would follow the same procedure as for rebooting, but skip the actual reboot.

What I normally do is:

  • Stop slave on each instance
  • Stop each instance's daemon (never all of them at the same time): systemctl stop mariadb@s1 etc
  • apt full-upgrade
  • Start each instance: systemctl start mariadb@s1 etc
  • There is no need to do this, but I normally do it: mysql_upgrade --force -S $SOCKET_PATH
  • Start slave on each instance
  • Enjoy
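
The steps above can be sketched as a small dry-run script for one multi-instance host. The section names and socket paths are assumptions (each clouddb host serves a different set of sections); the helper only prints the commands so the plan can be reviewed before executing anything by hand.

```shell
# Print the minor-upgrade command sequence for the given sections,
# following the steps listed above: stop replication, stop each daemon
# one at a time, upgrade, restart, then resume replication.
print_upgrade_plan() {
  local sections="$1"   # e.g. "s1 s2 s3" -- hypothetical section list
  local s
  for s in $sections; do
    echo "mysql -S /run/mysqld/mysqld.${s}.sock -e 'STOP SLAVE;'"
    echo "systemctl stop mariadb@${s}"   # one at a time, never all at once
  done
  echo "apt full-upgrade"
  for s in $sections; do
    echo "systemctl start mariadb@${s}"
    echo "mysql_upgrade --force -S /run/mysqld/mysqld.${s}.sock"  # optional
    echo "mysql -S /run/mysqld/mysqld.${s}.sock -e 'START SLAVE;'"
  done
}

print_upgrade_plan "s1 s2 s3"
```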

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:03:26Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:03:42Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading mariadb on clouddb1013 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:22:15Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:22:30Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrading mariadb on clouddb1014 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:35:09Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T09:35:24Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrading mariadb on clouddb1015 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T10:06:16Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T10:06:31Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1016 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:30:53Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:08Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:13Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:31:28Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrading mariadb on clouddb1017 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:50:04Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T13:50:20Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrading mariadb on clouddb1018 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T14:09:50Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T14:10:05Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrading mariadb on clouddb1019 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T15:48:51Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424

Mentioned in SAL (#wikimedia-operations) [2024-09-16T15:49:07Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrading mariadb on clouddb1020 T365424

fnegri updated the task description. (Show Details)
fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.

All clouddb* hosts have been upgraded to MariaDB 10.6.19, following the process at https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host. I've also upgraded the other upgradable packages and rebooted all hosts after the upgrade.