s3 throughput tripled since 24 august
Closed, Resolved · Public

Assigned To

Authored By

	jcrespo
	Aug 25 2016, 7:16 AM

Description

Screenshot from 2016-08-25 09-15-49.png (629×657 px, 59 KB)

Details

Subject	Repo	Branch	Lines +/-
CirrusSearch: disable saneitizer cron job	operations/puppet	production	+1 -0
Fix a typo in BC code that handles toId => toPageId	mediawiki/extensions/CirrusSearch	wmf/1.28.0-wmf.16	+58 -6
Fix a typo in BC code that handles toId => toPageId	mediawiki/extensions/CirrusSearch	master	+58 -6
Fix a typo in BC code that handles toId => toPageId	mediawiki/extensions/CirrusSearch	master	+39 -5


Related Objects

Status	Subtype	Assigned	Task
Resolved		dcausse	T143862 s3 throughput tripled since 24 august
Declined		None	T143911 Minimize potential job queue traffic disruption by setting a new mysql mediawiki loadbalancer tag "jobqueue"

Event Timeline

jcrespo created this task. Aug 25 2016, 7:16 AM

Restricted Application added a subscriber: Aklapper. Aug 25 2016, 7:16 AM

mysql -BN -h db1077 -e "SHOW PROCESSLIST" | awk '{print $3}' | cut -d':' -f1 | sort | uniq -c | sort -nr | head -n10
     12 10.64.32.34
     12 10.64.32.33
     11 10.64.32.149
      9 10.64.48.141
      8 10.64.32.39
      8 10.64.16.67
      8 10.64.16.66
      8 10.64.16.64
      7 10.64.32.36
      7 10.64.32.32
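
The one-liner above can also be written as a short script. This is a minimal sketch, assuming the tab-separated output of mysql -BN -e "SHOW PROCESSLIST" has been captured as text, with the Host column (host:port) in third position. The function name and the sample rows are hypothetical. Splitting on tabs rather than on whitespace keeps the Host column aligned even if the User column contains a space.

```python
from collections import Counter


def top_client_hosts(processlist_text, limit=10):
    """Count connections per client host, most frequent first."""
    counts = Counter()
    for line in processlist_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        # The Host column looks like "10.64.32.34:51234"; drop the port.
        host = fields[2].split(":", 1)[0]
        counts[host] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical sample rows for illustration, not real data from db1077.
    sample = "\n".join([
        "1\twikiuser\t10.64.32.34:40001\tenwiki\tQuery\t0\tinit\tSELECT 1",
        "2\twikiuser\t10.64.32.34:40002\tenwiki\tSleep\t3\t\tNULL",
        "3\twikiuser\t10.64.32.33:40003\tenwiki\tSleep\t1\t\tNULL",
    ])
    for host, n in top_client_hosts(sample):
        print(f"{n:7d} {host}")
```

The result is a list of (host, count) pairs, so it can be compared between two captures to see which clients account for an increase.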

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

The queries seem to be, at least in part, Title::loadRestrictions calls from job runners.

Potential offenders:

RestbaseUpdateJobOnDependencyChange
cirrusSearchCheckerJob

Please indicate if you performed any deployment or change around 19:30 (+-1 hour) UTC yesterday.

In T143862#2581460, @jcrespo wrote:

For starters, snapshot1006.eqiad.wmnet is accessing a non-dump host; this is a bug, but I do not think it is the problem here.

Maybe it's not the problem here, but I still need to find out why that's the case. This host does not run misc cron jobs (and so not the wikidata dumps from cron) but only xml/sql dumps.

On the CirrusSearch side, here is what I know:

18:44 UTC: Config change to send More Like queries to eqiad (I really don't see how that could be related, but you never know)
~19:00 UTC: Saneitizer job started going crazy. @EBernhardson is looking into this. If needed, we can disable the cron job that feeds the saneitize queue. @dcausse probably has more context about this.

<gehel> jynus: T143862 is likely related to the saneitizer issue. dcausse is looking into it.

Change 306639 had a related patch set uploaded (by Gehel):
CirrusSearch: disable saneitizer cron job

https://gerrit.wikimedia.org/r/306639

gerritbot added a project: Patch-For-Review. Aug 25 2016, 8:27 AM

Gehel mentioned this in rOPUPdc2a4d034b94: CirrusSearch: disable saneitizer cron job. Aug 25 2016, 8:34 AM

jcrespo mentioned this in T143870: Some mw snapshot hosts are accessing main db servers. Aug 25 2016, 8:59 AM

Change 306602 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306602

Change 306649 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306649

Around that time (2016-08-24T19:30Z) we also deployed ChangeProp, which was going through a backlog at elevated speed (~2k reqs/sec), but most of those requests were Varnish purges and did not come from the JobRunners (just FYI).

dcausse triaged this task as Unbreak Now! priority. Aug 25 2016, 12:09 PM

dcausse added a project: Discovery-Search (Current work).

Restricted Application added subscribers: Luke081515, TerraCodes. Aug 25 2016, 12:09 PM

Raising to UBN: https://gerrit.wikimedia.org/r/306649 should be deployed before wmf16 reaches group2.
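
For readers unfamiliar with the pattern named in the patch title, here is a generic sketch of backward-compatibility handling for a renamed job parameter, so that jobs queued by older code remain readable after a deployment. It is not the CirrusSearch code. Only the parameter names toId and toPageId come from the patch title; the function name, the dictionary layout, and the behaviour when both keys are present are assumptions made for illustration.

```python
def normalize_job_params(params):
    """Return a copy of params that uses the new key name 'toPageId'.

    Hypothetical helper: accepts the legacy key 'toId' from jobs that
    were queued before the rename, and prefers the new key if both exist.
    """
    normalized = dict(params)  # do not mutate the caller's dict
    if "toPageId" not in normalized and "toId" in normalized:
        # Legacy job: move the value to the new key.
        normalized["toPageId"] = normalized.pop("toId")
    else:
        # New-style job (or neither key): drop any stale legacy key.
        normalized.pop("toId", None)
    if "toPageId" not in normalized:
        raise KeyError("job params contain neither 'toPageId' nor 'toId'")
    return normalized
```

In a shim of this kind, a misspelled key name means legacy jobs are never matched, so a test that feeds in the old parameter name is a cheap safeguard.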

dcausse claimed this task. Aug 25 2016, 12:11 PM

dcausse moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Nikerabbit mentioned this in T143863: In "Edit parent tasks" dialog, searching "all open objects" for some text takes a long time. Aug 25 2016, 2:02 PM

Change 306602 abandoned by EBernhardson:
Fix a typo in BC code that handles toId => toPageId

Reason:
doing I97735a28b8 instead

https://gerrit.wikimedia.org/r/306602

Change 306649 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306649

Change 306687 had a related patch set uploaded (by DCausse):
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306687

I've deleted the relevant job queues across all wikis, which should reduce the load for now. Until the above patch is deployed, though, this could happen again. The changes in the patch make deleting the queue a second time unnecessary.

I confirm it worked:

Screenshot from 2016-08-25 18-51-50.png (638×898 px, 82 KB)

jcrespo created subtask T143911: Minimize potential job queue traffic disruption by setting a new mysql mediawiki loadbalancer tag "jobqueue". Aug 25 2016, 4:59 PM

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-08-30_(1.28.0-wmf.17)). Aug 25 2016, 5:00 PM

You can ignore the subtask and close this independently, I just wanted to write a follow-up to minimize future issues.

Change 306687 merged by jenkins-bot:
Fix a typo in BC code that handles toId => toPageId

https://gerrit.wikimedia.org/r/306687

Mentioned in SAL [2016-08-25T18:19:16Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/Job/CheckerJob.php: SWAT: [[gerrit:306687|Fix a typo in BC code that handles toId => toPageId (T143862)]] (duration: 00m 47s)