This rename of a user with 200K contributions created between 0 and 4 minutes of lag on all enwiki slaves.
Identify why and avoid it in the future.
Status | Subtype | Assigned | Task
---|---|---|---
Declined | None | | T3268 Database replication lag issues (tracking)
Duplicate | None | | T108551 Database locked error while publishing article using CX
Resolved | None | aaron | T95501 Fix causes of replica lag and get it to under 5 seconds at peak
Resolved | None | aaron | T116425 Rename user creates lag on enwiki
@Legoktm told me that RenameUser.php does not wait for the slaves; fix that in the future.
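A minimal sketch (not the actual patch) of what "wait for slaves after each batch" looks like, assuming the wfGetDB()/wfWaitForSlaves() globals of that MediaWiki era; the table, columns, batch size and $oldName/$newName variables are purely illustrative:

```php
<?php
// Illustrative only: table/column names and variables are made up.
$batchSize = 500; // hypothetical batch size
$dbw = wfGetDB( DB_MASTER );

do {
	// Grab the next chunk of rows that still need updating.
	$ids = $dbw->selectFieldValues(
		'revision',
		'rev_id',
		[ 'rev_user_text' => $oldName ],
		__METHOD__,
		[ 'LIMIT' => $batchSize ]
	);
	if ( $ids ) {
		$dbw->update(
			'revision',
			[ 'rev_user_text' => $newName ],
			[ 'rev_id' => $ids ],
			__METHOD__
		);
		// The point of the fix: pause until the slaves have caught up
		// before issuing the next write batch.
		wfWaitForSlaves();
	}
} while ( count( $ids ) === $batchSize );
```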
Change 248486 had a related patch set uploaded (by Aaron Schulz):
Make jobs wait for slaves after each batch
Change 248583 had a related patch set uploaded (by Aaron Schulz):
Batch the updates done for page delete/restore handling
Change 248583 merged by jenkins-bot:
Batch the updates done for page delete/restore handling
For the second time in a week, this caused lag on db1036, affecting watchlists, contributions and recent changes for all s2 users. I think this merits an Unbreak Now.
(The first time was T141340.)
As a way to mitigate this (though not fully solve it), I will try to load-balance this host with another one to give it double the resources (but this may not work).
Here is the list of queries (not copied here to avoid exposing account info). They are not individually that large; maybe a wait for slaves was missing (or it does not wait for slaves that have 0 weight but a specific role?).
Change 302952 had a related patch set uploaded (by Jcrespo):
Pool db1074 as backup rc node in case db1036 lags
LoadBalancer::waitForAll() does not check 0-load slaves, so if such a slave can't keep up with the other DBs it ends up in trouble.
OTOH, we might not care about 'vslow'/'dump' so much, so checking them all might not be good either. Some middle ground would be best.
Knowing that, I can give the ones we care about a load of 1, fixing the issue through configuration.
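For illustration only (this is an LBFactoryMulti-style excerpt, not the real db-eqiad.php; only db1036 is a real host name from this task, the other names and numbers are placeholders), the configuration-side fix amounts to giving the rc/watchlist slave a non-zero main load so waitForAll() checks its lag:

```php
<?php
// Hypothetical s2 excerpt; db10XX/db10YY are placeholder host names.
'sectionLoads' => [
	's2' => [
		'db10XX' => 0,   // master (placeholder)
		'db10YY' => 100, // general-traffic replica (placeholder)
		'db1036' => 1,   // rc/watchlist replica: load 1 instead of 0,
		                 // so LoadBalancer::waitForAll() now checks it
	],
],
'groupLoadsBySection' => [
	's2' => [
		'recentchanges' => [ 'db1036' => 1 ],
		'watchlist' => [ 'db1036' => 1 ],
	],
],
```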
Going back to High priority because of the previous deployment.
There was slave lag on Meta-Wiki (s7) a little while ago, causing writes to be locked for more than three minutes. I suspect (but am not completely sure) that it was also related to a global rename done at about the same time. https://meta.wikimedia.org/wiki/Special:CentralAuth/Derakhshan shows 7000+ edits on fawiki, which is also on s7. Probably worth more investigation.
I don't think a 7k row rename matters. I do, however, see a huge spike in write query time warnings in the DBPerformance log on huwiki (also s7) at the moment. They come from recentchanges updates (rc_patrolled updates) in ChangeNotification jobs.
I made https://gerrit.wikimedia.org/r/303417 for the above problem (the jobs triggered a hook that triggered the patrol flag callback).
I can confirm the long-running writes on the master were from queries such as:
UPDATE /* RevisionReviewForm::updateRecentChanges */ `recentchanges` SET rc_patrolled = '1' WHERE rc_cur_id = '148' AND rc_type = '1' AND (rc_timestamp <= '20160724202301')
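As an illustration of how a write like that could be kept small (this is not the actual fix in https://gerrit.wikimedia.org/r/303417), the same update could be chunked by rc_id with a wait for slaves between batches; $pageId, $cutoff and the 100-row batch size are placeholders, the rc_type value is taken from the logged query:

```php
<?php
// Illustrative only, not the RevisionReviewForm/ChangeNotification fix:
// chunk the rc_patrolled update by rc_id and wait for slaves in between.
$dbw = wfGetDB( DB_MASTER );
$conds = [
	'rc_cur_id' => $pageId,   // e.g. 148 in the logged query
	'rc_type' => 1,           // as in the logged query
	'rc_patrolled' => 0,      // only rows that still need flipping
	'rc_timestamp <= ' . $dbw->addQuotes( $cutoff ),
];
do {
	$ids = $dbw->selectFieldValues(
		'recentchanges', 'rc_id', $conds, __METHOD__, [ 'LIMIT' => 100 ]
	);
	if ( $ids ) {
		$dbw->update(
			'recentchanges',
			[ 'rc_patrolled' => 1 ],
			[ 'rc_id' => $ids ],
			__METHOD__
		);
		wfWaitForSlaves();
	}
} while ( count( $ids ) === 100 );
```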