
labsdb1011 mariadb crashed
Closed, Resolved · Public

Description

We were paged at Tue Sep 24 22:52:48 UTC 2019 for a mariadb crash on labsdb1011. It left a bunch of diagnostics in the logs, and some tables were apparently marked corrupt (though these appear to be tables related to wmf-pt-kill and MySQL events rather than wikis).

It appears to have recovered for the most part, but I think it needs checking. This isn't a normal condition, after all.
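For reference, a generic way to verify the kind of tables the error log flagged (the wmf-pt-kill bookkeeping tables and the events stored in the mysql schema) would be something along these lines; the database and table names below are illustrative, not taken from the actual log:

  # Check every table in the mysql system schema, which holds the events table.
  mysqlcheck --check mysql

  # Or check individual tables from inside the server.
  mysql -e "CHECK TABLE mysql.event;"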


Event Timeline

Bstorm created this task.

These are the logs from around the time of the crash: P9170

Change 538987 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

Change 538987 merged by Bstorm:
[operations/puppet@production] wiki replicas: depool labsdb1011 just in case of issues

https://gerrit.wikimedia.org/r/538987

labsdb1011 is now depooled. @Marostegui if that doesn't seem useful, please repool it. I'm just hoping to prevent any possible harm in case it isn't working right.

From what I can see:

  • No HW errors.
  • Nothing relevant on the graphs that could indicate what caused the issue.
  • Those warnings about the event scheduler are "normal".
  • No queries being killed by the query killer right before the crash.

Apart from the logs you pasted:

Sep 24 22:48:28 labsdb1011 kernel: [10074305.111470] mysqld[5779]: segfault at 18 ip 0000560ae346d099 sp 00007f8d8a561b10 error 4 in mysqld[560ae301c000+f9e000]
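For what it's worth, the faulting address in that line can be turned into a symbol if matching debug symbols are installed; the binary path below is the usual Debian location and is an assumption:

  # Offset inside the mysqld mapping = instruction pointer - mapping base
  # (both values come from the kernel line above).
  printf '0x%x\n' $(( 0x560ae346d099 - 0x560ae301c000 ))   # 0x451099

  # Resolve the offset to function/line (needs the matching dbgsym package).
  addr2line -f -e /usr/sbin/mysqld 0x451099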

I have started replication on all threads - let's see if there is any corruption there. As we are running ROW, any data drift will break replication.
Once it has caught up, I will run a data check against another host anyway, just in case.
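For context, these hosts replicate several sections, so starting and checking all threads with MariaDB multi-source replication would look roughly like the following (a sketch, not the exact commands run):

  # Start every replication connection on a multi-source MariaDB replica.
  sudo mysql -e "START ALL SLAVES;"

  # Confirm all IO/SQL threads are running and watch the lag drop.
  sudo mysql -e "SHOW ALL SLAVES STATUS\G" \
    | grep -E 'Connection_name|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'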

My bet is on a long heavy query that made the server run out of resources (although I would have expected an OOM there...)
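An OOM kill would show up in the kernel log, so ruling it out is straightforward; for example:

  # The kernel logs OOM kills explicitly; no matches here means no OOM kill.
  dmesg -T | grep -iE 'out of memory|oom'
  journalctl -k --since "2019-09-24 22:00" | grep -i oom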

Mentioned in SAL (#wikimedia-operations) [2019-09-25T05:06:42Z] <marostegui> Run a data check on labsdb1011 - T233766

s4 (commonswiki) data comparison came back clean.
Ongoing:
s1 enwiki
s2 multiple wikis (https://raw.githubusercontent.com/wikimedia/operations-mediawiki-config/master/dblists/s2.dblist)
s8 wikidata

s8 wikidata clean

As replication has also been working fine for almost 8 hours (and any data drift would break replication), I am going to repool this host.

Marostegui claimed this task.

I am going to close this as resolved; there is not much else we can do now. If it crashes again, or we see replication break, it could mean there is indeed data corruption as a result of the crash, and we might need to explore other approaches (such as recloning it).
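If recloning ever became necessary, one generic approach is a physical copy with mariabackup (the actual procedure and tooling used for these hosts may well differ):

  # On a healthy source replica: take and prepare a consistent physical backup.
  mariabackup --backup --target-dir=/srv/clone --user=root --password="$PW"
  mariabackup --prepare --target-dir=/srv/clone

  # Copy /srv/clone to the target's empty datadir, fix ownership, start mysqld,
  # then point replication at the coordinates recorded by the backup.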

Out of curiosity, did you run the checks against their master, between replicas or something else?

> Out of curiosity, did you run the checks against their master, between replicas or something else?

Between replicas
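A minimal way to spot-check a database between two replicas (host and table names below are placeholders; the actual comparison used dedicated tooling) could look like:

  # Checksum a few tables on each replica and diff the results. Run while
  # replication is quiet, or the checksums will differ spuriously.
  for host in labsdb1011.eqiad.wmnet labsdb1012.eqiad.wmnet; do
    mysql -h "$host" -N -e "CHECKSUM TABLE page, revision EXTENDED" commonswiki \
      > "checksums.$host"
  done
  diff checksums.labsdb1011.eqiad.wmnet checksums.labsdb1012.eqiad.wmnet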

Marostegui closed subtask Restricted Task as Resolved. Dec 17 2019, 8:34 AM