Wikidata dispatching is currently stuck: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1&from=1532595040590&to=1532605840590
Problems started around 10:00 UTC.
matej_suchanek | |
Jul 26 2018, 11:51 AM |
F24164255: image.png | |
Jul 26 2018, 8:42 PM |
F24156211: image.png | |
Jul 26 2018, 12:09 PM |
Wikidata dispatching is currently stuck: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1&from=1532595040590&to=1532605840590
Problems started around 10:00 UTC.
Looking at the logs on mwmaint1001 I am seeing logs of failing to grab locks.
T178652 would probably have made this situation recover quicker.
No idea what caused these locks to get stuck.
It looks like https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch-script is also currently broken, probably due to T196609 and the dashboard not yet being updated.
It looks like most wikis have now recovered their locks, I'll continue monitoring:
Looks like the same pattern as can be seen in that graph is now happening again and all locks seem to be taken and there is a failure to grab locks for dispatching.
Dispatch lag is again increasing....
Something to note, because the locks are no longer in the DB, we end up selecting the same 15 or so wikis that are locked all of the time.
It could be that the other wikis actually don't have locks:
12:34:54 Could not lock any of the candidate client wikis for dispatching 12:34:54 Failed to grab dispatch lock for plwiki 12:34:54 Failed to grab dispatch lock for ukwiki 12:34:54 Failed to grab dispatch lock for ruwiki 12:34:54 Failed to grab dispatch lock for zhwiki 12:34:54 Failed to grab dispatch lock for rowiki 12:34:54 Failed to grab dispatch lock for itwiki 12:34:54 Failed to grab dispatch lock for euwiki 12:34:54 Failed to grab dispatch lock for arwiki 12:34:54 Failed to grab dispatch lock for svwiki 12:34:54 Failed to grab dispatch lock for commonswiki 12:34:54 Failed to grab dispatch lock for huwiki 12:34:54 Failed to grab dispatch lock for eswiki 12:34:54 Failed to grab dispatch lock for ptwiki 12:34:54 Failed to grab dispatch lock for uzwiki 12:34:54 Failed to grab dispatch lock for enwiki 12:34:54 Could not lock any of the candidate client wikis for dispatching 12:34:54 Failed to grab dispatch lock for ptwiki 12:34:54 Failed to grab dispatch lock for zhwiki 12:34:54 Failed to grab dispatch lock for enwiki 12:34:54 Failed to grab dispatch lock for arwiki 12:34:54 Failed to grab dispatch lock for uzwiki 12:34:54 Failed to grab dispatch lock for eswiki 12:34:54 Failed to grab dispatch lock for huwiki 12:34:54 Failed to grab dispatch lock for itwiki 12:34:54 Failed to grab dispatch lock for euwiki 12:34:54 Failed to grab dispatch lock for commonswiki 12:34:54 Failed to grab dispatch lock for plwiki 12:34:54 Failed to grab dispatch lock for ruwiki 12:34:54 Failed to grab dispatch lock for rowiki 12:34:54 Failed to grab dispatch lock for ukwiki 12:34:54 Failed to grab dispatch lock for svwiki 12:34:54 Could not lock any of the candidate client wikis for dispatching
before using the redis lock manager the status of the lock from the db was also in the select so that locked dbs would not be selected at all.
Mentioned in SAL (#wikimedia-operations) [2018-07-26T12:38:03Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: wikidatawiki back to .13 T200420
Change 448034 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Wikidata dispatch, select 20 wikis instead of 15
Change 448034 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata dispatch, select 20 wikis instead of 15
Mentioned in SAL (#wikimedia-operations) [2018-07-26T12:44:01Z] <addshore@deploy1001> Synchronized wmf-config/Wikibase.php: T200420 - Wikidata, dispatch, select 20 instead of 15 wikis (duration: 00m 55s)
Change 448037 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] Verbose logging for wikidata dispatching
Change 448037 merged by Alexandros Kosiaris:
[operations/puppet@production] Verbose logging for wikidata dispatching
Mentioned in SAL (#wikimedia-operations) [2018-07-26T13:44:00Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Use new wikibase dispatch lock manager on wikidatawiki T200420 T178652 (duration: 00m 55s)
Change 448042 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Add a logger to wikibaseDispatchRedisLockManager
Change 448043 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Add wikibaseDispatchRedisLockManager to wmgMonologChannels
Change 448042 merged by jenkins-bot:
[operations/mediawiki-config@master] Add a logger to wikibaseDispatchRedisLockManager
Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:34:37Z] <addshore@deploy1001> sync-file aborted: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 02s)
Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:35:52Z] <addshore@deploy1001> Synchronized wmf-config/Wikibase.php: Add a logger to wikibaseDispatchRedisLockManager T200420 (duration: 00m 56s)
Change 448043 merged by jenkins-bot:
[operations/mediawiki-config@master] Add wikibaseDispatchRedisLockManager to wmgMonologChannels
Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:38:52Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add wikibaseDispatchRedisLockManager to wmgMonologChannels T200420 (duration: 00m 54s)
I deployed T178652 which should hide some of the brokenness, as the locks should now only remain for 15 mins instead of an hour, but the underlying issue still remains.
I added a logger to the LockManager used for dispatching, but that hasn't logged anything yet...
Currently staring at RedisLockManager and QuorumLockManager.
While testing some stuff with eval.php if I run the following:
$lock = str_replace( ' ', '_', "Wikibase.wikidatawiki.dispatchChanges.dewiki" ); $result = $lockManager->unlock( [ $lock ] ); var_dump( $result->__toString() ); $result = $lockManager->lock( [ $lock ] ); var_dump( $result->__toString() );
I get:
> $lock = str_replace( ' ', '_', "Wikibase.wikidatawiki.dispatchChanges.dewiki" ); > $result = $lockManager->unlock( [ $lock ] ); > var_dump( $result->__toString() ); string(292) "<OK, collected 1 error(s) on the way, no value set> +------+---------------------------+------------------------------------------+ | 1 | lockmanager-notlocked | Wikibase.wikidatawiki.dispatchChanges.de | +------+---------------------------+------------------------------------------+ " > $result = $lockManager->lock( [ $lock ] ); > var_dump( $result->__toString() ); string(295) "<Error, collected 1 error(s) on the way, no value set> +------+---------------------------+------------------------------------------+ | 1 | lockmanager-fail-acquirel | Wikibase.wikidatawiki.dispatchChanges.de | +------+---------------------------+------------------------------------------+
Which seems odd, as I can neither unlock or lock the key.
If I try another key, for example "Wikibase.wikidatawiki.dispatchChanges.addshoretest" locking and unlocking works as expected.
Mentioned in SAL (#wikimedia-operations) [2018-07-26T15:21:32Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add LockManager logging T200420 (duration: 00m 55s)
Mentioned in SAL (#wikimedia-operations) [2018-07-26T15:27:20Z] <addshore@deploy1001> Synchronized wmf-config: Remove unused wikibaseDispatchRedisLockManager logging T200420 (duration: 00m 56s)
Un assigning myself for now as I'm not actively looking at this right now.
Due to the 15 min lock time out the dispatch lag is staying reasonably low, but the underlying issue is still not really known and not fixed.
You mean the condition on chd_touched in getCandidateClients()? That would roughly correlate with lock state.
It's odd that engageClientLock() is blocking for the SQL coordinator but not the LockManager-using subclass (Database::lock is 5s by default). That would make the LM-based selectClient() more likely to not find anything and return false (if there are a few busy wikis), as it only does one pass with no usleep(). Since dispatchMaxTime is not PHP_INT_MAX (but 720) and I don't see --max-passes in puppet, then max-passes for dispatchChanges.php runs is 1, right? Thus, once selectClient() returns false, then dispatchChanges.php would exit afaik. The loop/delay (idle-delay) logic would not happen in such cases.
There should only really ever be a number of wikis busy equal to the number of dispatchers being run, so ~4 or 5?
as it only does one pass with no usleep(). Since dispatchMaxTime is not PHP_INT_MAX (but 720) and I don't see --max-passes in puppet, then max-passes for dispatchChanges.php runs is 1, right?
Some of the options are now in mediawiki-config, see https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/Wikibase.php#L131-L134, so a maxtime is set and max passes will be infinate.
Ah, right, I read that ternary backwards, <<$maxTime < PHP_INT_MAX ? PHP_INT_MAX : 1>>.
So, it looks like when lockManager->unlock is called in LockManagerSqlChangeDispatchCoordinator chd_lock is used, which is null, so no lock is removed in redis....
I still have no idea how this triggered at 11am today, it looks like it will be due to .13 -> .14, but rolling that back didn't fix the problem.........
@aaron has made a patch to make Lockmanager throw when trying to unlock a null lock @ https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/448080/
The issue in Wikibase should be fixed first though to avoid dispatching throwing exceptions and dieing & spamming logs.
Change 448103 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Use getClientLockName value for releaseClientLock when dispatching
Change 448113 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.14] Use getClientLockName value for releaseClientLock when dispatching
The fix in the patch above has been on mwmaint1001 for around 15 mins now (I hacked it on there), and checked that the correct locks were being removed by name.
Dispatch seems to have recovered, now just need to back port it, and then we can investigate what actually triggered this in the first place.....
Change 448113 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.14] Use getClientLockName value for releaseClientLock when dispatching
Mentioned in SAL (#wikimedia-operations) [2018-07-26T20:39:50Z] <addshore@deploy1001> Synchronized php-1.32.0-wmf.14/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php: Use getClientLockName value for releaseClientLock when dispatching T200420 (duration: 00m 57s)
Looks like after battling this all day service has been restored.... although still slightly confused as to how this happened...
Change 448103 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use getClientLockName value for releaseClientLock when dispatching