Description
Details
Status | Assigned | Task
---|---|---
Resolved | dcausse | T143862 s3 throughput tripled since 24 august
Declined | None | T143911 Minimize potential job queue traffic disruption by setting a new mysql mediawiki loadbalancer tag "jobqueue"
Event Timeline
mysql -BN -h db1077 -e "SHOW PROCESSLIST" | awk '{print $3}' | cut -d':' -f1 | sort | uniq -c | sort -nr | head -n10
12 10.64.32.34
12 10.64.32.33
11 10.64.32.149
9 10.64.48.141
8 10.64.32.39
8 10.64.16.67
8 10.64.16.66
8 10.64.16.64
7 10.64.32.36
7 10.64.32.32
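For anyone who wants to repeat this check from a script, here is a rough Python sketch of what the pipeline above does: take the host column of the tab-separated `SHOW PROCESSLIST` output, strip the client port, and count connections per client. The function name `top_clients` and the sample rows are made up for illustration; only the column layout (Id, User, Host, ...) comes from MySQL itself.

```python
from collections import Counter


def top_clients(processlist_text, n=10):
    """Count connections per client host.

    Expects the tab-separated text printed by
    `mysql -BN -e "SHOW PROCESSLIST"`, where the third column is
    `host:port`. Returns the n busiest hosts as (host, count) pairs.
    """
    counts = Counter()
    for line in processlist_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host = fields[2].split(":", 1)[0]  # drop the client port
        counts[host] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample rows, not real output from db1077.
    sample = (
        "1\twikiuser\t10.64.32.34:51234\tenwiki\tQuery\t0\tSending data\tSELECT 1\n"
        "2\twikiuser\t10.64.32.34:51240\tenwiki\tSleep\t3\t\tNULL\n"
        "3\twikiuser\t10.64.32.33:40000\tenwiki\tSleep\t1\t\tNULL\n"
    )
    for host, count in top_clients(sample):
        print(count, host)
```

Hosts with equal counts may come out in a different order than with `sort -nr`, since `Counter.most_common` keeps ties in first-seen order.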
For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.
The queries seem to be, at least in part, Title::loadRestrictions from job runners.
Potential offenders:
- RestbaseUpdateJobOnDependencyChange
- cirrusSearchCheckerJob
Please indicate if you performed any deployment or change around 19:30 UTC (±1 hour) yesterday.
Maybe it's not the problem here, but I still need to find out why that's the case. This host does not run misc cron jobs (and so not the wikidata dumps from cron) but only xml/sql dumps.
On the CirrusSearch side, here is what I know:
- 18:44 UTC: Config change to send More Like queries to eqiad (I really don't see how that could be related, but you never know)
- ~19:00 UTC: Saneitizer job started going crazy. @EBernhardson is looking into this. If needed we can disable the cronjob that is feeding the saneitize queue. @dcausse probably has more context about this.
<gehel> jynus: T143862 is likely related to the saneitizer issue. dcausse is looking into it.
Change 306639 had a related patch set uploaded (by Gehel):
CirrusSearch: disable saneitizer cron job
Change 306602 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId
Change 306649 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId
Around that time (2016-08-24T19:30Z) we also deployed ChangeProp, which was going through a backlog at an elevated rate (~2k reqs/sec), but most of those requests were Varnish purges and did not come from the JobRunners (just FYI).
Raising to UBN: https://gerrit.wikimedia.org/r/306649 should be deployed before wmf16 reaches group2.
Change 306602 abandoned by EBernhardson:
Fix a typo in BC code that handles toId => toPageId
Reason:
doing I97735a28b8 instead
Change 306649 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId
Change 306687 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId
I've deleted the relevant job queues across all wikis, which should reduce the load for now. Until the above patch is deployed, though, this could happen again. The changes in the patch make deleting the queue a second time unnecessary.
You can ignore the subtask and close this independently; I just wanted to write a follow-up to minimize future issues.
Change 306687 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId
Mentioned in SAL [2016-08-25T18:19:16Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/Job/CheckerJob.php: SWAT: [[gerrit:306687|Fix a typo in BC code that handles toId => toPageId (T143862)]] (duration: 00m 47s)
Change 306639 abandoned by Gehel:
CirrusSearch: disable saneitizer cron job
Reason:
Issue has been fixed, no need to disable those jobs.