Rebuild sanitarium hosts
Closed, ResolvedPublic

Assigned To
Authored By
Ladsgroup
May 25 2023, 5:35 AM
Referenced Files
F37094416: image.png
Jun 5 2023, 5:49 PM
F37084024: image.png
May 30 2023, 10:22 AM
Tokens
"Yellow Medal" token, awarded by MusikAnimal."Love" token, awarded by Lemonaka."Yellow Medal" token, awarded by Ladsgroup."Piece of Eight" token, awarded by MJL.

Description

NOTE: Per T337446#8882092, replag is likely to keep increasing until mid next week. This only affects s1, s2, s3, s4, s5 and s7. The rest of the sections should be working as normal.

db1154 and db1155 have their replication broken due to different errors. For example, for s5:

PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Can't find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1161-bin.001646, end_log_pos 385492288

For s4 and s7, replication is also broken, but on different tables.

Sections with broken replication are: s1, s2, s5, s7
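
For reference, the broken state can be inspected per replication connection on the multi-source sanitarium hosts. A minimal sketch, assuming the MariaDB replication connection names match the section names:

SHOW SLAVE 's5' STATUS\G
-- key fields: Slave_SQL_Running (No), Last_SQL_Errno (1032), Last_SQL_Error
SHOW ALL SLAVES STATUS\G
-- lists every replication connection on the host (e.g. s1, s3, s5 on db1154)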

Broken summary:

db1154:

  • s1 (caught up)
  • s3 (caught up)
  • s5 (caught up)

db1155:

  • s2 (caught up)
  • s4 (caught up)
  • s7 (caught up)

Recloning process

s1:

  • clouddb1013
  • clouddb1017
  • clouddb1021

s2:

  • clouddb1014
  • clouddb1018
  • clouddb1021

s3:

  • clouddb1013
  • clouddb1017
  • clouddb1021

s4:

  • clouddb1015
  • clouddb1019
  • clouddb1021

s5:

  • clouddb1016
  • clouddb1020
  • clouddb1021

s7:

  • clouddb1014
  • clouddb1018
  • clouddb1021

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Thanks for the report. It was only on clouddb1021 but not on the others (as I did the transfer) before we found this issue. I have fixed it on the other two, sorry for the inconvenience. Lots of moving pieces in all of this.

No apologies necessary, and thank you :-) Confirmed working for me/the tool in question.

Is there an estimate for when things'll be fully restored? Y'all are great.

If nothing happens, tomorrow everything should be back.
However, I will probably rebuild s4 tomorrow (it will probably take 2-3 days), as I don't fully trust its data anymore since it broke earlier today - I fixed the row manually but there could be more stuff under the hood.

s1 is fully recloned, and catching up.

I am going to start with s4 to be on the safe side.

Mentioned in SAL (#wikimedia-operations) [2023-05-31T04:59:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1221 (sanitarium s4 master) T337446', diff saved to https://phabricator.wikimedia.org/P48640 and previous config saved to /var/cache/conftool/dbconfig/20230531-045927-root.json
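
For anyone unfamiliar with the SAL entry above: depooling a host via dbctl looks roughly like the sketch below (illustrative only; the exact invocation depends on the local conftool setup):

dbctl instance db1221 depool
dbctl config commit -m 'Depool db1221 (sanitarium s4 master) T337446'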

Change 924772 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1154: Enable notifications

https://gerrit.wikimedia.org/r/924772

Change 924772 merged by Marostegui:

[operations/puppet@production] db1154: Enable notifications

https://gerrit.wikimedia.org/r/924772

s4 on clouddb1021 has been recloned; views, heartbeat, grants etc. have been added. Once it has caught up, I will reclone the other two hosts from it.
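
For context, a generic way to reclone one replica from another looks roughly like the sketch below, using mariabackup; this is not necessarily the exact transfer tooling used here, and the host/paths are hypothetical:

# on the source host: take a consistent snapshot and stream it to the target
mariabackup --backup --stream=xbstream | ssh target-host 'mbstream -x -C /srv/sqldata'
# on the target host: prepare the datadir, then re-add grants, views and heartbeat
mariabackup --prepare --target-dir=/srv/sqldata
# finally, set the replication coordinates from xtrabackup_binlog_info and START SLAVE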

Marostegui updated the task description. (Show Details)

s4 has been fully recloned; clouddb1019:3314 is now catching up with its master.

I'm not sure what's causing it (regarding s1), but I'm finding some bots are not returning up-to-date reports. With s1 down for 5 days, there should be a backlog of lengthy reports but I'm seeing short reports or none at all. Did every new edit since May 25th get restored and integrated? Sorry that I don't know the correct terminology.

Can you give us more details about how to debug this? s1 data is up to date now so the reports should be providing the right data unless there's a queue and/or a cache layer somewhere
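
One quick way to rule out the replicas themselves, from a Toolforge bastion (a sketch, assuming the standard heartbeat_p view on the wikireplicas):

sql enwiki
SELECT lag FROM heartbeat_p.heartbeat WHERE shard = 's1';  -- replication lag in seconds
SELECT MAX(rev_timestamp) FROM revision;                   -- should be very close to the current time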

Change 925286 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1155: Enable notifications

https://gerrit.wikimedia.org/r/925286

Change 925286 merged by Marostegui:

[operations/puppet@production] db1155: Enable notifications

https://gerrit.wikimedia.org/r/925286

Marostegui lowered the priority of this task from Unbreak Now! to High. Jun 1 2023, 8:47 AM

I am reducing the priority of this task as all the hosts have now been recloned and data should be up to date.
We shouldn't be surprised if s6 and s8 (the sections that never break) end up breaking on the sanitarium hosts, as, if the problem was 10.4.29, data might have been corrupted there too and simply hasn't shown up yet.
I am going to do some data checking now on the recloned versions before closing this task, hopefully by Monday if everything goes well over the next few days.
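
As an illustration of the kind of spot check involved (hypothetical table choice, not the actual verification tooling): run the same aggregate on the recloned host and on its upstream master and compare the results.

SELECT COUNT(*), MAX(fp_page_id) FROM dewiki.flaggedpage_pending;
-- run on both the sanitarium host and its master; the numbers should match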

Things might still be slow on some of the tools, as we are adding the special indexes used in wikireplicas; that can be tracked at T337734.

Something should be in Tech News

Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
From a skim of all the above, the best I can guess at (probably very inaccurate!) is:

For a few days last week, [readers/editors] in some regions experienced delays seeing edits being visible, which also caused problems for some tools. This was caused by problems in the secondary databases. This should now be fixed.

Production databases didn't have lag. Only the cloud replicas did, with lag on the order of days. Basically, they stopped getting any updates for around a week due to data integrity issues.

Hope that clears it up a bit. (On phone, otherwise I would have suggested an exact phrasing change.)

Wikireplicas had outdated data and were unavailable for around a week. There were periods where not even old data was available.
Tools have most likely experienced intermittent unavailability from Wednesday last week until today. We are still adding indexes, so even though everything is up, some tools may still be slow.

This outage didn't affect production.

The slowdown issue is resolved now. There are still some replicas that don't have the indexes yet, but all of them are depooled, so there is no user-facing slowdown anymore.

Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).

Some tools and bots returned outdated information due to database breakage, and may have been down entirely while it was being fixed. These issues have now been fixed.

Possibly could link to https://en.wikipedia.org/wiki/Wikipedia:Replication_lag but that's English-only.

Thank you immensely @Legoktm that's exactly what I needed. :) Now added. If anyone has changes, please make them directly there, within the next ~23 hours. Thanks.

Ladsgroup moved this task from In progress to Done on the DBA board.

The hosts have been fully rebuilt and are working as expected, without any major replag anymore. The indexes have been added too, so I'm closing this. Some follow-ups are needed (like T337961), but the user-facing parts are done. Sorry for the inconvenience, and a major wikilove to @Marostegui, who worked day and night over the last week and weekend to get everything back to normal.

Agreed, immense thanks to @Marostegui and also you, Ladsgroup!

I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This of course is insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement (editathons don't have live tracking, stewards can't query for global contribs, important bots stop working, etc.). I.e. I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages.

Much thanks from my side as well, especially to @Marostegui.

I do think that, at the very least, we should have some way to recover from severe incidents like these a whole lot faster. Maybe having a delayed replica that we can use as a data source to speed up recovery, or something like a puppet run that preps a 'fresh' replica instance every single day, to make sure all the parts needed for that are known to be good?
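
(The delayed-replica idea maps onto a built-in MariaDB feature; a minimal sketch, assuming a manual setup rather than the puppet-managed configuration it would really live in:)

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 86400;  -- keep this replica 24 hours behind its master
START SLAVE;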

I think this one required too much learning on the job for something this critical, and the sole reason is that it luckily doesn't happen too often, but I think the whole process was too involved for everyone. It was affecting and disrupting too many people and teams, which I think is the point of reference we should be using instead of "it's not production".

s1 looks to be down again. (Edit: Now tracked at T338172)

sd@tools-sgebastion-10:~$ sql enwiki
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

DB-wise things are good:

image.png (257×1 px, 68 KB)

I think something is broken on the network side of things. Please file a separate ticket.

fyi: I started an incident doc at https://wikitech.wikimedia.org/wiki/Incidents/2023-05-28_wikireplicas_lag because it was requested to have this incident in the next incident review ritual on Tuesday. I'll add some more information tomorrow and Monday, but feel free to add anything I missed.

You might want to sync up with @KOfori because he's also started an IR. And I have captured a much more detailed timeline, so maybe we need to merge the two.

There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml

Is it safe to assume we're back in a sane state and can turn this back on?

Let's go for it Brandon!

Change 924508 merged by BBlack:

[operations/puppet@production] wikireplicas: restore pybal monitoring

https://gerrit.wikimedia.org/r/924508

Mentioned in SAL (#wikimedia-operations) [2023-09-18T14:04:17Z] <bblack> lvs1020, lvs1018: restarting pybal to re-enable healthchecks for wikireplicas ( T337446 -> https://gerrit.wikimedia.org/r/924508 )