Page MenuHomePhabricator

Replication lag detection broken in wmf.22
Closed, ResolvedPublicPRODUCTION ERROR

Description

After rolling out 1.31.0-wmf.22 to group1 wikis I immediately saw ~200 errors per second like the following:

Exception: Database is read-only: The database has been automatically locked while the replica database servers catch up to the master.
File : Line: /srv/mediawiki/php-1.31.0-wmf.22/includes/libs/rdbms/database/Database.php:920

Stack:

#0 /srv/mediawiki/php-1.31.0-wmf.22/includes/libs/rdbms/database/Database.php(1622): Wikimedia\Rdbms\Database->query(string, string)
#1 [internal function]: Wikimedia\Rdbms\Database->insert(string, array, string)
#2 /srv/mediawiki/php-1.31.0-wmf.22/includes/libs/rdbms/database/DBConnRef.php(49): call_user_func_array(array, array)
#3 /srv/mediawiki/php-1.31.0-wmf.22/includes/libs/rdbms/database/DBConnRef.php(312): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#4 /srv/mediawiki/php-1.31.0-wmf.22/includes/externalstore/ExternalStoreDB.php(97): Wikimedia\Rdbms\DBConnRef->insert(string, array, string)
#5 /srv/mediawiki/php-1.31.0-wmf.22/includes/externalstore/ExternalStore.php(194): ExternalStoreDB->store(string, string)
#6 /srv/mediawiki/php-1.31.0-wmf.22/includes/externalstore/ExternalStore.php(167): ExternalStore::insertWithFallback(array, string, array)
#7 /srv/mediawiki/php-1.31.0-wmf.22/includes/Storage/SqlBlobStore.php(218): ExternalStore::insertToDefault(string)
#8 /srv/mediawiki/php-1.31.0-wmf.22/includes/Storage/RevisionStore.php(361): MediaWiki\Storage\SqlBlobStore->storeBlob(string, array)
#9 /srv/mediawiki/php-1.31.0-wmf.22/includes/Revision.php(1097): MediaWiki\Storage\RevisionStore->insertRevisionOn(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DatabaseMysqli)
#10 /srv/mediawiki/php-1.31.0-wmf.22/includes/page/WikiPage.php(1918): Revision->insertOn(Wikimedia\Rdbms\DatabaseMysqli)
#11 /srv/mediawiki/php-1.31.0-wmf.22/includes/page/WikiPage.php(1673): WikiPage->doCreate(WikitextContent, integer, User, string, array)
#12 /srv/mediawiki/php-1.31.0-wmf.22/includes/filerepo/file/LocalFile.php(1653): WikiPage->doEditContent(WikitextContent, string, integer, boolean, User)
#13 /srv/mediawiki/php-1.31.0-wmf.22/includes/deferred/AutoCommitUpdate.php(42): Closure$LocalFile::recordUpload2(Wikimedia\Rdbms\DatabaseMysqli, string)
#14 /srv/mediawiki/php-1.31.0-wmf.22/includes/deferred/DeferredUpdates.php(259): AutoCommitUpdate->doUpdate()
#15 /srv/mediawiki/php-1.31.0-wmf.22/includes/deferred/DeferredUpdates.php(210): DeferredUpdates::runUpdate(AutoCommitUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#16 /srv/mediawiki/php-1.31.0-wmf.22/includes/deferred/DeferredUpdates.php(127): DeferredUpdates::execute(array, string, integer)
#17 /srv/mediawiki/php-1.31.0-wmf.22/includes/MediaWiki.php(606): DeferredUpdates::doUpdates(string, integer)
#18 /srv/mediawiki/php-1.31.0-wmf.22/includes/api/ApiMain.php(557): MediaWiki::preOutputCommit(DerivativeContext)
#19 /srv/mediawiki/php-1.31.0-wmf.22/includes/api/ApiMain.php(513): ApiMain->executeActionWithErrorHandling()
#20 /srv/mediawiki/php-1.31.0-wmf.22/api.php(94): ApiMain->execute()
#21 /srv/mediawiki/w/api.php(3): include(string)
#22 {main}

Oddly, it also seems to have triggered a ton of Notice: Array to string conversion in /srv/mediawiki/php-1.31.0-wmf.21/includes/libs/rdbms/database/position/MySQLMasterPos.php on line 41 but it indicates the old branch, wmf.21 so I'm not sure how these are related.

Rolling back to wmf.21 caused the errors to immediately cease.

Event Timeline

Assigned to @aaron because he's the only author to touch the code in question.

Mentioned in SAL (#wikimedia-operations) [2018-02-21T20:53:42Z] <twentyafterfour> MediaWiki Train for 1.31.0-wmf.22 is blocked by T187942

The array-to-string conversion bug was already in the old branch---it got triggered likely because of the underlying lag as an unrelated symptom.

I've looking at the 21->22 logs, changes, and trying things on mw.org. I don't see a read-only problem there and https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= looks fine. I don't see any lag in the DBs or seen by MW in that time (LoadBalancer graph at Grafana, those the resolution is low).

'+channel:DBReplication +"read-only"' is mostly empty. The bump in "Server {host} is not replicating?" was for one server and ended before the wmf22 deploy.

$dbr->getLag() and $lb->getLagTimes() works fine in eval.php on wmf22 wikis as well.

So the log message is probably a false alarm, but why? I'm also digging through the changes trying to see what might be the cause...

Change 413292 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/Echo@master] Add various read-only mode checks

https://gerrit.wikimedia.org/r/413292

Change 413293 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/ContentTranslation@master] Check read-only mode in ApiContentTranslationSave

https://gerrit.wikimedia.org/r/413293

The php warning is noise. The "Database is read-only" flood is an actual bug...no idea why that happened.

Change 413318 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add ExternalStoreMedium::isReadOnly() method

https://gerrit.wikimedia.org/r/413318

Change 413319 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make LocalFile check early if the revision store is available

https://gerrit.wikimedia.org/r/413319

Change 413293 abandoned by Aaron Schulz:
Check read-only mode in ApiContentTranslationSave

https://gerrit.wikimedia.org/r/413293

Change 413293 restored by Aaron Schulz:
Check read-only mode in ApiContentTranslationSave

https://gerrit.wikimedia.org/r/413293

Change 413292 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Add various read-only mode checks

https://gerrit.wikimedia.org/r/413292

Change 413293 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Check read-only mode in ApiContentTranslationSave

https://gerrit.wikimedia.org/r/413293

Change 413474 had a related patch set uploaded (by Chad; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.22] Add ExternalStoreMedium::isReadOnly() method

https://gerrit.wikimedia.org/r/413474

Change 413318 merged by jenkins-bot:
[mediawiki/core@master] Add ExternalStoreMedium::isReadOnly() method

https://gerrit.wikimedia.org/r/413318

Change 413474 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.22] Add ExternalStoreMedium::isReadOnly() method

https://gerrit.wikimedia.org/r/413474

Change 413484 had a related patch set uploaded (by Chad; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.22] Make LocalFile check early if the revision store is available

https://gerrit.wikimedia.org/r/413484

Change 413484 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.22] Make LocalFile check early if the revision store is available

https://gerrit.wikimedia.org/r/413484

Change 413319 merged by jenkins-bot:
[mediawiki/core@master] Make LocalFile check early if the revision store is available

https://gerrit.wikimedia.org/r/413319

Change 413665 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DBMasterPos implement Serializable

https://gerrit.wikimedia.org/r/413665

Let me know if we can do something on infrastructure as a followup to avoid or detect this in the future- like better beta dbs, with GTID, or simulating lag there.

Mostly better! Still got a spike of the following, all with wikidatawiki:

#0 /srv/mediawiki/php-1.31.0-wmf.22/includes/externalstore/ExternalStore.php(168): ExternalStore::insertWithFallback(array, string, array)
#1 /srv/mediawiki/php-1.31.0-wmf.22/includes/Storage/SqlBlobStore.php(218): ExternalStore::insertToDefault(string)
#2 /srv/mediawiki/php-1.31.0-wmf.22/includes/Storage/RevisionStore.php(368): MediaWiki\Storage\SqlBlobStore->storeBlob(string, array)
#3 /srv/mediawiki/php-1.31.0-wmf.22/includes/Revision.php(1097): MediaWiki\Storage\RevisionStore->insertRevisionOn(MediaWiki\Storage\MutableRevisionRecord, Wikimedia\Rdbms\DatabaseMysqli)
#4 /srv/mediawiki/php-1.31.0-wmf.22/includes/page/WikiPage.php(1918): Revision->insertOn(Wikimedia\Rdbms\DatabaseMysqli)
#5 /srv/mediawiki/php-1.31.0-wmf.22/includes/page/WikiPage.php(1673): WikiPage->doCreate(Wikibase\ItemContent, integer, User, string, array)
#6 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(293): WikiPage->doEditContent(Wikibase\ItemContent, string, integer, boolean, User)
#7 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/repo/includes/Store/Sql/WikiPageEntityStore.php(190): Wikibase\Repo\Store\WikiPageEntityStore->saveEntityContent(Wikibase\ItemContent, string, User, integer, boolean)
#8 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/lib/includes/Store/TypeDispatchingEntityStore.php(88): Wikibase\Repo\Store\WikiPageEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, boolean)
#9 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/repo/includes/EditEntity.php(713): Wikibase\Lib\Store\TypeDispatchingEntityStore->saveEntity(Wikibase\DataModel\Entity\Item, string, User, integer, boolean)
#10 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/repo/includes/Api/EntitySavingHelper.php(351): Wikibase\EditEntity->attemptSave(Wikibase\DataModel\Entity\Item, string, integer, string)
#11 /srv/mediawiki/php-1.31.0-wmf.22/extensions/Wikibase/repo/includes/Api/ModifyEntity.php(286): Wikibase\Repo\Api\EntitySavingHelper->attemptSaveEntity(Wikibase\DataModel\Entity\Item, string)
#12 /srv/mediawiki/php-1.31.0-wmf.22/includes/api/ApiMain.php(1580): Wikibase\Repo\Api\ModifyEntity->execute()
#13 /srv/mediawiki/php-1.31.0-wmf.22/includes/api/ApiMain.php(542): ApiMain->executeAction()
#14 /srv/mediawiki/php-1.31.0-wmf.22/includes/api/ApiMain.php(513): ApiMain->executeActionWithErrorHandling()
#15 /srv/mediawiki/php-1.31.0-wmf.22/api.php(94): ApiMain->execute()
#16 /srv/mediawiki/w/api.php(3): include(string)
#17 {main}

So rolled wikidatawiki back to wmf.21, but other wikis moved forward ok :)

Change 413799 had a related patch set uploaded (by Chad; owner: Aaron Schulz):
[mediawiki/core@wmf/1.31.0-wmf.22] rdbms: make DBMasterPos implement Serializable

https://gerrit.wikimedia.org/r/413799

Change 413665 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DBMasterPos implement Serializable

https://gerrit.wikimedia.org/r/413665

Change 413799 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.22] rdbms: make DBMasterPos implement Serializable

https://gerrit.wikimedia.org/r/413799

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM