Figure out how x2 should be handled in DC switchover
Closed, Resolved, Public

Description

Today while testing the DC switchover scripts, we noticed x2 is already read-write in codfw. The live test script made x2 in codfw read-only, triggering a page. Based on reading T269324: Productionize x2 databases and comments from @Krinkle, having x2 read-write in both DCs seems intentional.

So how should x2 be treated during the switchover? Can we ignore it entirely and leave it read-write in both DCs during the switchover?

Or should it go read-only in both DCs and then be made read-write in both DCs post-switchover?

@Krinkle suggested to "treat it like parser cache", which would be the former option of ignoring it entirely.

Event Timeline

Legoktm created this task.

Treating it like parsercache would also be my first approach.
I would be comfortable with doing RO on both DCs, then the switchover and then the RW on both DCs again.

Please note that x2 isn't in use yet.

Change 701471 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/software/spicerack@master] Revert "mysql_legacy.py: Add x2"

https://gerrit.wikimedia.org/r/701471

Change 701474 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/software/spicerack@master] mysql_legacy: Allow excluding sections from set_core_masters_readonly()

https://gerrit.wikimedia.org/r/701474

Change 701475 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/cookbooks@master] sre.switchdc.mediawiki: Handle x2 specially

https://gerrit.wikimedia.org/r/701475

Summary from #wikimedia-databases:

22:00:46 <marostegui> I am fine either way, I would prefer to set x2 as read-only and then rw once the switch is done, but 1) it is not in use and has no data 2) don't know how hard it is
22:01:08 <marostegui> legoktm: It is a matter of how important it is to have consistent data on x2, which is not something I can really tell, more a question for Timo
22:03:35 <legoktm> actually I don't think it would be too hard
22:04:05 <marostegui> Up to you and timo I would say

So I've submitted two sets of patches for the two approaches discussed:

  • spicerack: Revert "mysql_legacy.py: Add x2": this basically has switchdc ignore x2. It'll stay read-write in both DCs throughout the entire process. AFAICT this is the same thing we do for parsercache hosts.
  • cookbooks: sre.switchdc.mediawiki: Handle x2 specially: this has us treat x2 specially, it'll go read-only during the switchover and then afterward, we set it as read-write in the passive DC in addition to in the active DC. It also skips setting x2 as read-only during the live test, which is what triggered the page earlier today.

Legoktm asked me to copy the comments I had given on the patches here. I think Manuel has already spoken my mind: it is for the application setup (Timo?) to decide, but

a) I don't like having "special" cases and
b) IF there is a chance of inconsistencies happening or replication breaking when we write to both DCs at the same time, I would prefer to treat it as a normal section (active: read-write, passive: read-only), as writes can happen almost by accident (monitoring); otherwise stick to the "write anywhere" parsercache model, assuming the application cannot create inconsistencies and the data there is non-canonical.

The last thing I wrote is probably not something to do for this iteration. But as @Kormat has standardized the several replication models in hiera (mariadb.yaml), hopefully in the future the automation won't need to handle sections by name and can instead use that centralized configuration, with a different behaviour per model (MW primary DC read-write and passive DC read-only, read-write everywhere, etc.). That way we won't have to update the automation again in the future.
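
As a rough illustration of that configuration-driven idea, the sketch below maps each section to a replication model and derives the read-only state from the model instead of the section name. Everything in it is hypothetical: the `ReplicationModel` names, the `SECTION_MODELS` mapping and the `should_be_read_only()` helper are made up for illustration and do not reflect the actual layout of mariadb.yaml or any spicerack API.

```python
from enum import Enum


class ReplicationModel(Enum):
    """Hypothetical replication models; names are illustrative only."""

    PRIMARY_DC_RW = "primary-dc-rw"    # read-write in the MW primary DC, read-only elsewhere
    RW_EVERYWHERE = "rw-everywhere"    # read-write in every DC (parsercache-like)


# Hypothetical stand-in for data that would be loaded from central configuration.
SECTION_MODELS = {
    "s1": ReplicationModel.PRIMARY_DC_RW,
    "x1": ReplicationModel.PRIMARY_DC_RW,
    "x2": ReplicationModel.RW_EVERYWHERE,
}


def should_be_read_only(section, dc, primary_dc, models=SECTION_MODELS):
    """Return True if the master of `section` in `dc` should be read-only."""
    if models[section] is ReplicationModel.RW_EVERYWHERE:
        return False
    return dc != primary_dc
```

With a lookup like this, adding a new section with an existing model would only need a configuration change, not an automation change.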

In the short-term, this is the simplest thing to do, and as x2 is not currently in use, i'm voting to go ahead with this approach this time. I've +1'd it.

  • cookbooks: sre.switchdc.mediawiki: Handle x2 specially: this has us treat x2 specially, it'll go read-only during the switchover and then afterward, we set it as read-write in the passive DC in addition to in the active DC. It also skips setting x2 as read-only during the live test, which is what triggered the page earlier today.

From a db perspective, i don't think we care about whether or not x2 goes ro during a dc switch. It's really an application question. What you're proposing is probably the 'safe' option once x2 is actually in use, but given the current timeframe i'd definitely be in favour of not trying to get this in before the dc switch next week.

Change 701471 merged by jenkins-bot:

[operations/software/spicerack@master] Revert "mysql_legacy.py: Add x2"

https://gerrit.wikimedia.org/r/701471

Ack, thanks for all the input. For next week we'll just ignore x2, it'll stay RW in both DCs throughout. @Krinkle does that also work as the long-term solution, or did you want something different?

Unfortunately, my patch to just ignore x2 didn't really work. spicerack gets the list of core_dbs by querying A:db-core and A:db-role-master, which returns 12 hosts, while it expects 11 since it doesn't know about x2. So it really does need to know about x2 :/
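
To make the mismatch concrete, here is a minimal sketch of the kind of sanity check described above: the number of matched masters has to equal the number of known sections times the number of datacenters. The section list, the host names and the `check_master_count()` helper are assumptions for illustration, not spicerack's actual code.

```python
# Illustrative list of 11 sections that deliberately omits x2 (an assumption,
# not necessarily spicerack's real CORE_SECTIONS).
SECTIONS_WITHOUT_X2 = (
    "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "x1", "es4", "es5",
)


def check_master_count(matched_hosts, sections, datacenters):
    """Raise ValueError unless there is exactly one master per section per DC."""
    expected = len(sections) * len(datacenters)
    if len(matched_hosts) != expected:
        raise ValueError(
            f"Expected {expected} masters, but the query matched {len(matched_hosts)}"
        )
    return expected


# Hypothetical host names, one per known section in a single DC.
known_hosts = [f"db-{section}.example" for section in SECTIONS_WITHOUT_X2]
```

Calling `check_master_count(known_hosts, SECTIONS_WITHOUT_X2, ("codfw",))` passes, while adding one extra x2 master to `known_hosts` makes it raise, which is the 12 versus 11 situation.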

I live hacked this onto cumin1001 for now:

diff --git a/spicerack/mysql_legacy.py b/spicerack/mysql_legacy.py
index a69cc74..be423e9 100644
--- a/spicerack/mysql_legacy.py
+++ b/spicerack/mysql_legacy.py
@@ -134,7 +134,7 @@ class MysqlLegacy:
             spicerack.mysql_legacy.MysqlLegacyRemoteHosts: an instance with the remote targets.
 
         """
-        query_parts = ["A:db-core"]
+        query_parts = ["A:db-core", "not A:db-section-x2"]
         dc_multipler = len(CORE_DATACENTERS)
         section_multiplier = len(CORE_SECTIONS)

Ack, thanks for all the input. For next week we'll just ignore x2, it'll stay RW in both DCs throughout. @Krinkle does that also work as the long-term solution, or did you want something different?

Yes, this would be my preferred outcome as well. This would be least risky and easiest to reason about from the application point of view, I think. When MW is in read-only, it still expects to be able to queue jobs and (try to) persist data via MainStash (x2), same as with parsercache, memcached, and php-apcu.

Hmmm, interesting. Ok, this means that x2 is going to stay a special-case permanently. I had been assuming it was going to be the 'new normal' when we have multi-dc serving happening. That's good to know, thanks!

Ok, so we need to treat it like parsercache in that regard.

Change 701475 abandoned by Legoktm:

[operations/cookbooks@master] sre.switchdc.mediawiki: Handle x2 specially

Reason:

I figured out how to do this without needing a cookbook patch, see Change-Id: I05bd81f2b0837f7340b6271efa6df7406d8e8380.

https://gerrit.wikimedia.org/r/701475

Most of the spicerack confusion and trouble is that x2 matches A:db-core even though it's more like parsercache. If it didn't match that, we could just exclude it from spicerack entirely.

I updated https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701474 so that it contains an ACTIVE_ACTIVE_SECTIONS list that currently only contains x2. spicerack will skip those sections when setting masters to read-only and later back to read-write. This also keeps all the logic in spicerack and doesn't require changing the cookbook.
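
The exclusion could be sketched roughly as follows. Only the `ACTIVE_ACTIVE_SECTIONS` name comes from the patch description; the shortened section list, the `sections_to_toggle()` and `set_read_only()` helpers and the dict-based state are assumptions for illustration, and the merged change may be structured differently.

```python
# Shortened, illustrative section list; x2 is known but handled differently.
CORE_SECTIONS = ("s1", "s2", "x1", "x2")
# Sections that stay read-write in both DCs throughout a switchover.
ACTIVE_ACTIVE_SECTIONS = ("x2",)


def sections_to_toggle(sections=CORE_SECTIONS, active_active=ACTIVE_ACTIVE_SECTIONS):
    """Return the sections whose read-only flag the switchover should change."""
    return [section for section in sections if section not in active_active]


def set_read_only(state, read_only):
    """Update `state` (section -> read_only flag) in place, skipping active/active sections."""
    for section in sections_to_toggle(tuple(state)):
        state[section] = read_only
    return state


# Every section starts read-write; x2 stays that way after the call below.
state = set_read_only({section: False for section in CORE_SECTIONS}, True)
```

Because x2 is still in the section list, a host-count check would still account for it, while the read-only and read-write steps leave it untouched.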

Change 701474 merged by jenkins-bot:

[operations/software/spicerack@master] mysql_legacy: Re-add x2 and properly support active/active sections

https://gerrit.wikimedia.org/r/701474

Still needs a new spicerack release, but hopefully finally fixed now :)