
upgrade mwmaint servers to buster
Closed, ResolvedPublic

Description

mwmaint1002 is the server where we run scheduled tasks, which makes it a very special host. We need to come up with a plan for how this host will be upgraded, or whether it will be replaced by a new one, e.g. mwmaint1003.

The server should not be migrated or replaced before Jan 2021

edit: now using this ticket for both mwmaint servers, also mwmaint2001

  • install mwmaint2002
  • decom mwmaint2001
  • upgrade mwmaint1002 while codfw is active DC
  • upgrade mwmaint2002 while eqiad is active DC

Event Timeline

jijiki triaged this task as Medium priority. Nov 10 2020, 3:59 PM

I could take this one (later). Have done mwmaint upgrade in the past. I would ideally like to create mwmaint1003 and eventually flip over.

You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and will be chewing through enwiki for some days, so best to leave mwmaint1002 in place until that's finished. (T264991)

We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS updates as well (reboots e.g. are a total pain with the current SPOF setup that we have)

You're probably already thinking about this, but just to make sure it's said out loud: mwmaint1002 is still running updateCollation for the ICU upgrade, and will be chewing through enwiki for some days, so best to leave mwmaint1002 in place until that's finished. (T264991)

Yes, definitely not planning to touch the existing server. I was hoping to get new hardware to install in parallel.

We should have two mwmaint servers per DC anyway (with some mechanism to flip the active one), some failover capability is needed outside of OS updates as well (reboots e.g. are a total pain with the current SPOF setup that we have)

ACK, I might take a look at improving puppet code to allow switching between multiple servers per DC.

I unintentionally created some confusion I think, and I am very sorry. I have updated the description to reflect that our target for this quarter is to have done as much preliminary work as we can regarding the upgrades of any mediawiki servers.

Dzahn renamed this task from "upgrade mwmaint1002 to buster" to "upgrade mwmaint servers to buster". Feb 18 2021, 6:25 PM
Dzahn updated the task description.

renaming this ticket to cover both mwmaint* servers and not be just for eqiad alone

Change 665144 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch mwmaint2001 to use buster installer

https://gerrit.wikimedia.org/r/665144

Change 665144 merged by Dzahn:
[operations/puppet@production] install_server: switch mwmaint2001 to use buster installer

https://gerrit.wikimedia.org/r/665144

Mentioned in SAL (#wikimedia-operations) [2021-02-18T23:11:16Z] <mutante> mwmaint2001 - will be rebooted for OS upgrade - T267607

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mwmaint2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102182316_dzahn_17848_mwmaint2001_codfw_wmnet.log.

Change 665225 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap: remove mwmaint2001 from "dsh" groups

https://gerrit.wikimedia.org/r/665225

Change 665225 merged by Dzahn:
[operations/puppet@production] scap: remove mwmaint2001 from "dsh" groups

https://gerrit.wikimedia.org/r/665225

Completed auto-reimage of hosts:

['mwmaint2001.codfw.wmnet']

and were ALL successful.

IIRC the previous update for the mwmaint servers happened via a hardware replacement: mwmaint1002 was a new server which replaced terbium. Procedure-wise it's probably best if we reimage an existing mw* server in eqiad with the mediawiki::maintenance role and then fall back to mwmaint1002 once it is reimaged? But that would require either adding some logic in Hiera to flag whether a server running role::mediawiki::maintenance is the currently active one (most of the tasks are triggered via the common profile::mediawiki::periodic_job), or alternatively disabling Puppet and stopping the systemd timers manually.
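The "flag the active server" idea above can be sketched as a small guard that a periodic job wrapper might run before doing any work. This is only an illustration under stated assumptions: `ACTIVE_MAINT_HOST` and `is_active_maint_host` are hypothetical names, not part of the existing Puppet code, and the real flag would come from Hiera rather than a shell variable.

```shell
#!/bin/sh
# Hypothetical guard: succeed only on the host flagged as the active
# maintenance server. ACTIVE_MAINT_HOST stands in for a value that
# Puppet would render from Hiera (assumption, not existing code).

is_active_maint_host() {
    # $1: FQDN of this host, $2: FQDN flagged as active
    [ -n "$2" ] && [ "$1" = "$2" ]
}

# Example: mwmaint1002 stays active while a second host stands by.
ACTIVE_MAINT_HOST="mwmaint1002.eqiad.wmnet"

for host in mwmaint1002.eqiad.wmnet mwmaint1003.eqiad.wmnet; do
    if is_active_maint_host "$host" "$ACTIVE_MAINT_HOST"; then
        echo "$host: active, periodic jobs may run"
    else
        echo "$host: standby, periodic jobs are skipped"
    fi
done
```

An empty flag is treated as "nobody is active", so a missing Hiera value would stop jobs everywhere instead of starting them on both hosts.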

T274170 introduced new hardware, mwmaint2002, which can be used now. timing :p

Dzahn changed the task status from Open to Stalled. Mar 19 2021, 9:17 PM

mwmaint1002 will be upgraded during the DC switchover period in Q4

So to summarise: the plan is to reimage mwmaint1002 now that eqiad is passive, and then reimage mwmaint2002 once eqiad is primary again?

Change 704290 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: let mwmaint1002 use buster installer

https://gerrit.wikimedia.org/r/704290

Change 704290 merged by Dzahn:

[operations/puppet@production] DHCP: let mwmaint1002 use buster installer

https://gerrit.wikimedia.org/r/704290

Change 704293 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch mwmaint.discovery (noc.wm.org backend) from eqiad to codfw

https://gerrit.wikimedia.org/r/704293

Change 704297 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: add tests for noc.wikimedia.org

https://gerrit.wikimedia.org/r/704297

Change 704300 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mediawiki::maintenance: open ferm hole for deployment servers to port 80

https://gerrit.wikimedia.org/r/704300

Change 704300 merged by Dzahn:

[operations/puppet@production] mediawiki::maintenance: open ferm hole for deployment servers to port 80

https://gerrit.wikimedia.org/r/704300

Change 704297 merged by Dzahn:

[operations/puppet@production] httpbb: add tests for noc.wikimedia.org

https://gerrit.wikimedia.org/r/704297

Mentioned in SAL (#wikimedia-operations) [2021-07-13T10:54:52Z] <mutante> switching https://noc.wikimedia.org backened from eqiad to codfw for mwmaint1002 OS upgrade, not affecting config-master/pybal, tests passed (T267607)

Change 704293 merged by Dzahn:

[operations/dns@master] switch mwmaint.discovery (noc.wm.org backend) from eqiad to codfw

https://gerrit.wikimedia.org/r/704293

Dzahn changed the task status from Stalled to Open. Jul 13 2021, 11:11 AM

Yes, that's correct. We are reimaging eqiad first. Just switched the noc.wikimedia.org backend to codfw to avoid any downtime of that. mwmaint2002 will be done once eqiad is primary again. People are aware that the real test of the periodic maintenance jobs only happens when we switch back; we decided to accept that, given that we don't change the PHP version.
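The ordering rule used here (only reimage the mwmaint host in the passive data center) can be written down as a tiny lookup. This is a sketch for illustration; `reimage_candidate` is a hypothetical helper, and only the host names and data center names come from this task.

```shell
#!/bin/sh
# Map the currently active data center to the mwmaint host that is
# safe to reimage, i.e. the one in the passive data center.
# reimage_candidate is a hypothetical helper name.

reimage_candidate() {
    # $1: name of the active data center
    case "$1" in
        codfw) echo "mwmaint1002.eqiad.wmnet" ;;
        eqiad) echo "mwmaint2002.codfw.wmnet" ;;
        *)     echo "unknown data center: $1" >&2; return 1 ;;
    esac
}

reimage_candidate codfw   # codfw is active, so the eqiad host can go first
reimage_candidate eqiad   # after switching back, the codfw host follows
```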

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mwmaint1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107131113_dzahn_25582_mwmaint1002_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-07-13T11:13:54Z] <mutante> mwmaint1002 - reimaging with buster (T267607)

Completed auto-reimage of hosts:

['mwmaint1002.eqiad.wmnet']

and were ALL successful.

[mwmaint1002:~] $ lsb_release -c
Codename:	buster

mwmaint1002 is on buster now. puppet runs without errors or warnings.
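For repeating this check on other hosts, the manual `lsb_release -c` step could be wrapped in a small function. This is a minimal sketch; `codename_is` is a hypothetical helper that reads `lsb_release -c` style output on stdin, so it can be tried without access to the host.

```shell
#!/bin/sh
# Read `lsb_release -c` style output on stdin and check the codename.
# codename_is is a hypothetical helper name.

codename_is() {
    # $1: expected codename, e.g. buster
    # awk splits on whitespace by default, which covers the tab
    # between "Codename:" and the value.
    actual=$(awk '$1 == "Codename:" { print $2 }')
    [ "$actual" = "$1" ]
}

# Demo with the output shown above instead of a live host.
if printf 'Codename:\tbuster\n' | codename_is buster; then
    echo "host reports buster"
fi
```

On a real host this would be used as `lsb_release -c | codename_is buster`.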

https://noc.wikimedia.org is hosted by mwmaint2002 in codfw for now and we have new httpbb tests for it.

@Legoktm done ^ the noc site is now hosted in codfw (leaving it like that until we switch back, right?). and mwmaint1002 is now on buster and puppet did not show any issues. it has the warning MOTD telling users not to use it right now and go to mwmaint2002 instead, just like before reimaging it.

Also we have this now which shows the noc site works on both hosts also after reimage:

[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/noc/* --hosts mwmaint1002.eqiad.wmnet,mwmaint2002.codfw.wmnet
Sending to 2 hosts...
PASS: 3 requests sent to each of 2 hosts. All assertions passed.

and running a scap pull now.

Change 704786 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: use https in tests for noc.wikimedia.org

https://gerrit.wikimedia.org/r/704786

Change 704786 merged by Dzahn:

[operations/puppet@production] httpbb: use https in tests for noc.wikimedia.org

https://gerrit.wikimedia.org/r/704786

Dzahn changed the task status from Open to Stalled. Aug 10 2021, 11:12 AM
Dzahn set Due Date to Sep 12 2021, 10:00 PM.

Change 721358 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: switch mwmaint2002 from stretch to buster installer

https://gerrit.wikimedia.org/r/721358

Change 721546 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch mwmaint.discovery.wmnet from codfw to eqiad

https://gerrit.wikimedia.org/r/721546

Change 721546 merged by Dzahn:

[operations/dns@master] switch mwmaint.discovery.wmnet from codfw to eqiad

https://gerrit.wikimedia.org/r/721546

Change 721358 merged by Dzahn:

[operations/puppet@production] DHCP: switch mwmaint2002 from stretch to buster installer

https://gerrit.wikimedia.org/r/721358

Mentioned in SAL (#wikimedia-operations) [2021-09-16T14:35:07Z] <mutante> reimaging mwmaint2002 to buster (T267607, T245757)

Dzahn changed the task status from Stalled to In Progress. Sep 16 2021, 2:39 PM
Dzahn updated the task description.

https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad.

mwmaint2002 has been upgraded to buster. monitoring all green.