s3 throughput tripled since 24 August
Closed, Resolved; Public

Description

Screenshot from 2016-08-25 09-15-49.png (629×657 px, 59 KB)

Event Timeline

mysql -BN -h db1077 -e "SHOW PROCESSLIST" | awk '{print $3}' | cut -d':' -f1 | sort | uniq -c | sort -nr | head -n10
     12 10.64.32.34
     12 10.64.32.33
     11 10.64.32.149
      9 10.64.48.141
      8 10.64.32.39
      8 10.64.16.67
      8 10.64.16.66
      8 10.64.16.64
      7 10.64.32.36
      7 10.64.32.32
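
The same per-client tally can be sketched in Python for cases where the process list has been saved to a file or needs further filtering. This is only an illustration, not part of the original investigation: it assumes the tab-separated layout that `mysql -BN` produces, with the host as the third column in `host:port` form. The function name `top_clients` and the sample rows (including the `wikiuser` account and the port numbers) are hypothetical.

```python
from collections import Counter


def top_clients(processlist_text, limit=10):
    """Count connections per client IP in tab-separated SHOW PROCESSLIST output.

    Assumes the column order Id, User, Host, db, Command, Time, State, Info,
    with Host given as "ip:port".
    """
    counts = Counter()
    for line in processlist_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        host = fields[2].split(":", 1)[0]  # drop the port, keep the IP
        counts[host] += 1
    # most_common() sorts by count, highest first
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only
    sample = (
        "1\twikiuser\t10.64.32.34:40000\tenwiki\tQuery\t0\tNULL\tSELECT 1\n"
        "2\twikiuser\t10.64.32.33:40001\tenwiki\tSleep\t3\t\tNULL\n"
        "3\twikiuser\t10.64.32.34:40002\tenwiki\tSleep\t1\t\tNULL\n"
    )
    for ip, n in top_clients(sample):
        print(n, ip)
```

Unlike the `awk` pipeline, which splits on any whitespace, this sketch splits on tabs only, so a user name containing spaces would not shift the columns.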

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

The queries seem to be, at least in part, Title::loadRestrictions calls from job runners.

Potential offenders:

  • RestbaseUpdateJobOnDependencyChange
  • cirrusSearchCheckerJob

Please indicate whether you performed any deployment or change around 19:30 UTC (±1 hour) yesterday.

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

Maybe it's not the problem here, but I still need to find out why that's the case. This host does not run misc cron jobs (so it does not run the wikidata dumps from cron); it only runs xml/sql dumps.

On the Cirrus side, here is what I know:

jcrespo removed a subscriber: ArielGlenn.
<gehel> jynus: T143862 is likely related to the saneitizer issue. dcausse is looking into it.

Change 306639 had a related patch set uploaded (by Gehel):
CirrusSearch: disable saneitizer cron job

https://gerrit.wikimedia.org/r/306639

Change 306602 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306602

Change 306649 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306649

Around that time (2016-08-24T19:30Z) we also deployed ChangeProp, which was going through a backlog at elevated speed (~2k reqs/sec), but most of them were Varnish purges, and they were not coming from the JobRunners (just FYI).

dcausse triaged this task as Unbreak Now! priority. Aug 25 2016, 12:09 PM

Raising to UBN: https://gerrit.wikimedia.org/r/306649 should be deployed before wmf16 reaches group2.

Change 306602 abandoned by EBernhardson:
Fix a typo in BC code that handles toId => toPageId

Reason:
doing I97735a28b8 instead

https://gerrit.wikimedia.org/r/306602

Change 306649 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306649

Change 306687 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306687

I've deleted the relevant job queues across all wikis, which should reduce the load for now. Until the above patch is deployed, though, this could happen again. The changes in the patch make deleting the queue a second time unnecessary.

You can ignore the subtask and close this independently; I just wanted to write a follow-up to minimize future issues.

Change 306687 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306687

Mentioned in SAL [2016-08-25T18:19:16Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/Job/CheckerJob.php: SWAT: [[gerrit:306687|Fix a typo in BC code that handles toId => toPageId (T143862)]] (duration: 00m 47s)

Change 306639 abandoned by Gehel:
CirrusSearch: disable saneitizer cron job

Reason:
Issue has been fixed, no need to disable those jobs.

https://gerrit.wikimedia.org/r/306639