Make LoadMonitor server states more up-to-date and respond to outages more quickly
Closed, Resolved · Public

Authored By: aaron, Oct 13 2020, 6:34 PM

Description

This task calls for several things:

Make the connectivity scaling factor converge faster so that connection attempts to downed servers stop sooner. Currently, the logs still get flooded with connection errors. We should both use a higher "moving average ratio" and possibly a better moving average function to better handle varying traffic.
Make the cache keys per DB-server so that extra query groups do not require separate cache keys (even if they are a subset of the main traffic servers or at least overlap). When checking which servers need state updates, the order should be shuffled. This makes it easier to keep the data more up-to-date since it no longer requires a for-loop that always has to connect to everything.
Simplify the tiered apcu/WANCache logic to use either apcu (web mode) or the local cluster cache (CLI mode). Since both have the BagOStuff interface, this would simplify the code significantly. Placeholders should be used liberally for a few seconds when there is no stale value and the cache mutex is already held. Cluster cache updates from web requests should use WRITE_BACKGROUND to avoid latency.
Mitigate network slowness when LoadMonitor polls/gauges servers (e.g. lower connection timeout and set read timeout with mysqli). Maybe there could be a LoadBalancer::CONN_GAUGE_PROBE constant for a third connection class category to help this. Past outages have involved connections hanging in the ACCEPT state, or queries (including heartbeat table and SHOW ones) being slow.
Consider adding mysql connected/running/max_connections status variables to the server state polling to see how close a server is to being overloaded.
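
To illustrate the first point, here is a minimal sketch of how the "moving average ratio" affects convergence. It assumes a plain exponential moving average, which may differ from what LoadMonitor actually uses; the names update_scale and polls_until_below are hypothetical and exist only for this example.

```python
def update_scale(previous, sample, ratio):
    """One exponential-moving-average step (assumed form, not LoadMonitor's code)."""
    return ratio * sample + (1.0 - ratio) * previous


def polls_until_below(threshold, ratio, start=1.0):
    """Count polls of a dead server (sample 0.0) until the factor drops below threshold."""
    value, polls = start, 0
    while value >= threshold:
        value = update_scale(value, 0.0, ratio)
        polls += 1
    return polls


if __name__ == "__main__":
    # Compare a low and a high moving average ratio.
    for ratio in (0.1, 0.5):
        print(ratio, polls_until_below(0.01, ratio))
```

Under these assumptions, a ratio of 0.1 needs 44 failed polls to bring the factor below 0.01, while a ratio of 0.5 needs 7. The trade-off is that a higher ratio also reacts more strongly to a single noisy sample.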

Server failure scenarios we should be able to quickly handle:

Immediately refuses connection attempts (mysql not listening)
Immediately rejects connection attempts (max_connections)
Times out for connection attempts (packet loss, mysql accept() loop stuck)
Takes a long time to accept connections (packet loss, CPU use)
Takes a long time to run even trivial non-data queries like SET SESSION sql_mode = 'x' (packet loss, CPU/IO use)
Replication stops or slows to a crawl (packet loss, replicated query failure, corruption, CPU/IO use, futex deadlocks)
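
As a rough illustration of how a probe could tell some of these scenarios apart, the sketch below classifies the outcome of a TCP connection attempt. It is not LoadMonitor code: classify_probe, tcp_connect and the slow_after threshold are hypothetical, and it only sees TCP-level symptoms. A max_connections rejection, slow queries and replication lag show up at the MySQL protocol level and would need a real client.

```python
import socket
import time


def tcp_connect(host, port, timeout):
    """Open and immediately close a TCP connection, raising on failure."""
    socket.create_connection((host, port), timeout=timeout).close()


def classify_probe(connect, slow_after=0.1):
    """Run a connect callable and map its outcome to a coarse label."""
    start = time.monotonic()
    try:
        connect()
    except ConnectionRefusedError:
        return "refused"   # nothing is listening on the port
    except (socket.timeout, TimeoutError):
        return "timeout"   # no answer within the connection timeout
    except OSError:
        return "error"     # any other network-level failure
    elapsed = time.monotonic() - start
    return "slow" if elapsed > slow_after else "ok"


# Example call (host name is a placeholder):
# classify_probe(lambda: tcp_connect("db3.example", 3306, timeout=1.0))
```

Passing the connect step in as a callable keeps the classification logic testable without a real database server.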

See T314020 for ideas about tracking connection attempt failures.

Details

Subject	Repo	Branch	Lines +/-
rdbms: avoid session variable SET query for LoadMonitor connections	mediawiki/core	master	+47 -36
Add ConnectTimeoutScenario and related tweaks	mediawiki/extensions/EventSimulator	master	+140 -25
rdbms: improve caching and state convergence in LoadMonitor	mediawiki/core	master	+263 -270
rdbms: tweak the refresh probability in LoadMonitor	mediawiki/core	master	+42 -11
rdbms: add CONN_UNTRACKED_GAUGE LoadBalancer flag for LoadMonitor gauging	mediawiki/core	master	+103 -77
rdbms: improve the moving average method in LoadMonitor	mediawiki/core	master	+34 -38

Related Objects

Status	Assigned	Task
Resolved	Ladsgroup	T314020 LoadMonitor connection weighting reimagined
Resolved	aaron	T265386 Make LoadMonitor server states more up-to-date and respond to outages more quickly
Resolved	aaron	T322689 Proper reconnection and error handling for Database::queryMulti()

Event Timeline

aaron created this task. Oct 13 2020, 6:34 PM

Restricted Application added a project: Platform Engineering. Oct 13 2020, 6:34 PM

Restricted Application added a subscriber: Aklapper.

Gilles moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board. Oct 13 2020, 6:53 PM

Naike removed a project: Platform Engineering. Oct 13 2020, 8:08 PM

Change 636465 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: rewrite LoadMonitor caching of server states

https://gerrit.wikimedia.org/r/636465

gerritbot added a project: Patch-For-Review. Nov 26 2020, 4:10 AM

Krinkle moved this task from Doing (old) to Backlog: Maintenance, non-prioritized on the Performance-Team board. Mar 2 2021, 8:56 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board. Oct 16 2021, 1:36 AM

aaron triaged this task as Lowest priority. Jan 7 2022, 1:10 AM

Krinkle removed a project: Patch-For-Review. Apr 13 2022, 10:56 PM

aaron updated the task description. Sep 16 2022, 5:33 PM

aaron updated the task description. Sep 16 2022, 5:43 PM

aaron moved this task from Backlog: Maintenance, non-prioritized to Doing: Goals on the Performance-Team board. Oct 18 2022, 7:00 PM

aaron renamed this task from Rewrite LoadMonitor to better handle cache regeneration and improve separation of concern to Make LoadMonitor server states more up-to-date and respond to outages more quickly. Oct 26 2022, 10:20 PM

aaron updated the task description.

aaron added a subtask: T322689: Proper reconnection and error handling for Database::queryMulti(). Nov 14 2022, 7:50 PM

Change 851777 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add Database method to get connection counts and lag

https://gerrit.wikimedia.org/r/851777

gerritbot added a project: Patch-For-Review. Nov 23 2022, 9:05 PM

Krinkle added a parent task: T314020: LoadMonitor connection weighting reimagined. Nov 24 2022, 7:09 AM

aaron updated the task description. Nov 24 2022, 7:19 AM

aaron updated the task description. Dec 5 2022, 10:07 PM

aaron closed subtask T322689: Proper reconnection and error handling for Database::queryMulti() as Resolved. Dec 13 2022, 11:27 PM

Change 868514 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add LoadBalancer::CONN_INTENT_PROBE flag for use by LoadMonitor

https://gerrit.wikimedia.org/r/868514

Change 868514 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add CONN_UNTRACKED_GAUGE LoadBalancer flag for LoadMonitor gauging

https://gerrit.wikimedia.org/r/868514

Krinkle mentioned this in T325389: Deprecate ILoadBalancer::getAnyOpenConnection(). Jan 8 2023, 10:55 PM

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.18; 2023-01-09). Jan 8 2023, 11:00 PM

Change 882792 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: improve the moving average method in LoadMonitor

https://gerrit.wikimedia.org/r/882792

Change 882793 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: tweak the refresh probability in LoadMonitor

https://gerrit.wikimedia.org/r/882793

aaron updated the task description. Jan 30 2023, 7:11 PM

Change 882792 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve the moving average method in LoadMonitor

https://gerrit.wikimedia.org/r/882792

ReleaseTaggerBot edited projects, added MW-1.40-notes (1.40.0-wmf.24; 2023-02-20); removed MW-1.40-notes (1.40.0-wmf.18; 2023-01-09). Feb 15 2023, 10:00 PM

Modelling the probability function from gerrit 882793 with STATE_TTR_MIN = 0.2 and STATE_TTR_MAX=1.0.

Request rate	Refresh rate r^1	Refresh rate r^2	Refresh rate r^3
1	1.00	1.00	1.00
2	1.23	1.08	1.03
4	1.48	1.21	1.12
8	1.81	1.39	1.24
16	2.18	1.61	1.38
32	2.58	1.85	1.55
64	2.99	2.12	1.73
128	3.37	2.39	1.92
256	3.72	2.68	2.13
512	4.02	2.96	2.34
1024	4.26	3.23	2.55
2048	4.45	3.48	2.77
4096	4.60	3.71	2.98
8192	4.71	3.92	3.19

Produced by P44694, validated with P44695.
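
For readers without access to the pastes, the table can be approximated with a short model. The sketch below is not the P44694 script. It assumes the refresh probability is ((age - STATE_TTR_MIN) / (STATE_TTR_MAX - STATE_TTR_MIN))^k clamped to [0, 1], that requests arrive at exactly equal spacing, and that regeneration is instantaneous. The function names are hypothetical.

```python
def refresh_probability(age, exponent, ttr_min=0.2, ttr_max=1.0):
    """Assumed probability that a request seeing a value of this age refreshes it."""
    x = (age - ttr_min) / (ttr_max - ttr_min)
    return min(1.0, max(0.0, x)) ** exponent


def refresh_rate(request_rate, exponent):
    """Expected refreshes per second for equally spaced requests."""
    survive = 1.0            # probability that no request has refreshed yet
    expected_interval = 0.0  # expected time between two refreshes
    i = 0
    while survive > 0.0:
        i += 1
        age = i / request_rate
        p = refresh_probability(age, exponent)
        expected_interval += age * survive * p
        survive *= 1.0 - p
    return 1.0 / expected_interval


if __name__ == "__main__":
    for rate in (1, 2, 4, 8):
        print(rate, [round(refresh_rate(rate, k), 2) for k in (1, 2, 3)])
```

With these assumptions the model gives 1.23, 1.08 and 1.03 for a request rate of 2, and 1.48 and 1.81 in the r^1 column for request rates of 4 and 8, which agrees with the corresponding table entries.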

You can express that as a cache hit ratio, but that only tells you that the best exponent is infinity, i.e. a step function. But a step function has the worst stampede protection. The idea is that there is some probability function which balances cache hit ratio with stampede protection.

The model above uses equally-spaced requests so can't easily tell you about stampede protection. Maybe it's time to wheel out the big guns. EventSimulator has an easy way to do Poisson-distributed requests.
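
A minimal sketch of what such a Poisson-driven experiment could look like follows. It is not EventSimulator and not the script used for the results in this task. The probability function is the same assumed clamped power form with STATE_TTR_MIN = 0.2 and STATE_TTR_MAX = 1.0, there is no cache mutex, and "contention" here simply means the number of regenerations in flight at once. All names are hypothetical.

```python
import random


def refresh_probability(age, exponent, ttr_min=0.2, ttr_max=1.0):
    """Assumed probability that a request seeing a value of this age refreshes it."""
    x = (age - ttr_min) / (ttr_max - ttr_min)
    return min(1.0, max(0.0, x)) ** exponent


def simulate(rate, regen_delay, exponent=3, duration=200.0, seed=1):
    """Return (regenerations per second, max concurrent regenerations)."""
    rng = random.Random(seed)
    t = 0.0
    value_time = 0.0   # time at which the cached value was last stored
    in_flight = []     # completion times of regenerations still running
    regens = 0
    max_contention = 0
    while True:
        t += rng.expovariate(rate)  # Poisson arrivals: exponential gaps
        if t >= duration:
            break
        finished = [c for c in in_flight if c <= t]
        if finished:
            # A completed regeneration stores a fresh value.
            value_time = max(value_time, max(finished))
            in_flight = [c for c in in_flight if c > t]
        if rng.random() < refresh_probability(t - value_time, exponent):
            in_flight.append(t + regen_delay)
            regens += 1
            max_contention = max(max_contention, len(in_flight))
    return regens / duration, max_contention


if __name__ == "__main__":
    for k in (1, 3):
        print(k, simulate(128, 0.008, exponent=k))
```

Because requests that arrive while a regeneration is still running see the old value, a slower regeneration leaves a longer window in which several requests can decide to refresh at once. That is the stampede effect the equally spaced model hides.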

aaron updated the task description. Feb 21 2023, 7:58 PM

Aklapper edited projects, added Patch-Needs-Improvement; removed Patch-For-Review. Feb 21 2023, 9:40 PM

CDanis subscribed. Feb 23 2023, 5:43 PM

I modified the cubic formula after testing 4k/8k qps and 8-16ms generation delays.
Toying around with https://www.desmos.com/calculator/r0j2enc9ej was also useful.

I improved the testing script to tabulate more data and use a queue for writes that would be concurrent. {F36897387}

The old formula (0-500ms random TTL) had:

isStateRefreshDueV1 (ac="ave contention", mc="max contention"):

Req/s	Regen/s (2ms delay)	Regen/s (8ms delay)	Regen/s (16ms delay)
1	0.81 [ac=0.00, mc=2]	0.81 [ac=0.01, mc=4]	0.82 [ac=0.01, mc=3]
2	1.37 [ac=0.00, mc=3]	1.38 [ac=0.01, mc=3]	1.40 [ac=0.02, mc=4]
4	2.16 [ac=0.00, mc=3]	2.20 [ac=0.02, mc=4]	2.25 [ac=0.03, mc=4]
8	3.19 [ac=0.01, mc=3]	3.28 [ac=0.03, mc=4]	3.41 [ac=0.05, mc=5]
16	4.56 [ac=0.01, mc=3]	4.75 [ac=0.04, mc=4]	5.00 [ac=0.08, mc=5]
32	6.49 [ac=0.01, mc=4]	6.86 [ac=0.05, mc=5]	7.39 [ac=0.11, mc=6]
64	9.25 [ac=0.02, mc=3]	10.00 [ac=0.08, mc=5]	11.06 [ac=0.16, mc=6]
128	13.20 [ac=0.03, mc=4]	14.72 [ac=0.11, mc=5]	16.76 [ac=0.23, mc=7]
256	18.89 [ac=0.04, mc=4]	21.97 [ac=0.16, mc=5]	26.35 [ac=0.35, mc=7]
512	27.20 [ac=0.05, mc=4]	33.72 [ac=0.23, mc=6]	42.21 [ac=0.52, mc=9]
1024	39.64 [ac=0.08, mc=4]	52.04 [ac=0.34, mc=6]	70.12 [ac=0.78, mc=11]
2048	57.83 [ac=0.11, mc=4]	83.39 [ac=0.51, mc=8]	121.84 [ac=1.23, mc=11]
4096	86.73 [ac=0.16, mc=5]	140.22 [ac=0.78, mc=9]	221.49 [ac=2.01, mc=15]
8192	129.96 [ac=0.22, mc=6]	242.93 [ac=1.22, mc=11]	414.84 [ac=3.44, mc=22]
16384	201.23 [ac=0.33, mc=6]	440.19 [ac=1.99, mc=15]	800.72 [ac=6.25, mc=29]

The results of the new cubic formula and some other formulas are shown below:

isStateRefreshDuePow3DelayAware (ac="ave contention", mc="max contention"):

Req/s	Regen/s (2ms delay)	Regen/s (8ms delay)	Regen/s (16ms delay)
1	0.56 [ac=0.00, mc=2]	0.50 [ac=0.00, mc=3]	0.42 [ac=0.01, mc=3]
2	0.79 [ac=0.00, mc=3]	0.68 [ac=0.01, mc=3]	0.55 [ac=0.01, mc=3]
4	1.01 [ac=0.00, mc=3]	0.86 [ac=0.01, mc=3]	0.68 [ac=0.01, mc=4]
8	1.24 [ac=0.00, mc=3]	1.04 [ac=0.01, mc=3]	0.82 [ac=0.01, mc=4]
16	1.48 [ac=0.00, mc=3]	1.25 [ac=0.01, mc=4]	0.98 [ac=0.02, mc=4]
32	1.76 [ac=0.00, mc=3]	1.50 [ac=0.01, mc=4]	1.18 [ac=0.02, mc=4]
64	2.10 [ac=0.00, mc=3]	1.79 [ac=0.01, mc=4]	1.41 [ac=0.02, mc=4]
128	2.49 [ac=0.00, mc=3]	2.14 [ac=0.02, mc=5]	1.71 [ac=0.03, mc=4]
256	2.98 [ac=0.01, mc=3]	2.58 [ac=0.02, mc=3]	2.04 [ac=0.03, mc=4]
512	3.57 [ac=0.01, mc=3]	3.12 [ac=0.02, mc=4]	2.51 [ac=0.04, mc=4]
1024	4.28 [ac=0.01, mc=3]	3.74 [ac=0.03, mc=4]	3.05 [ac=0.05, mc=4]
2048	5.12 [ac=0.01, mc=3]	4.53 [ac=0.03, mc=4]	3.72 [ac=0.06, mc=4]
4096	6.07 [ac=0.01, mc=3]	5.45 [ac=0.04, mc=3]	4.61 [ac=0.07, mc=5]
8192	7.13 [ac=0.01, mc=3]	6.63 [ac=0.05, mc=4]	5.34 [ac=0.08, mc=4]
16384	8.85 [ac=0.02, mc=3]	8.16 [ac=0.06, mc=4]	7.03 [ac=0.10, mc=5]

isStateRefreshDuePow4DelayAware (ac="ave contention", mc="max contention"):

Req/s	Regen/s (2ms delay)	Regen/s (8ms delay)	Regen/s (16ms delay)
1	0.55 [ac=0.00, mc=2]	0.50 [ac=0.00, mc=3]	0.44 [ac=0.01, mc=3]
2	0.76 [ac=0.00, mc=3]	0.68 [ac=0.01, mc=3]	0.58 [ac=0.01, mc=4]
4	0.96 [ac=0.00, mc=3]	0.84 [ac=0.01, mc=3]	0.71 [ac=0.01, mc=4]
8	1.14 [ac=0.00, mc=3]	1.00 [ac=0.01, mc=3]	0.84 [ac=0.01, mc=4]
16	1.32 [ac=0.00, mc=3]	1.16 [ac=0.01, mc=3]	0.97 [ac=0.02, mc=5]
32	1.52 [ac=0.00, mc=4]	1.35 [ac=0.01, mc=4]	1.13 [ac=0.02, mc=5]
64	1.75 [ac=0.00, mc=3]	1.55 [ac=0.01, mc=4]	1.31 [ac=0.02, mc=4]
128	2.01 [ac=0.00, mc=3]	1.80 [ac=0.01, mc=4]	1.52 [ac=0.02, mc=4]
256	2.33 [ac=0.00, mc=3]	2.09 [ac=0.02, mc=3]	1.77 [ac=0.03, mc=4]
512	2.68 [ac=0.01, mc=3]	2.43 [ac=0.02, mc=3]	2.05 [ac=0.03, mc=4]
1024	3.08 [ac=0.01, mc=3]	2.81 [ac=0.02, mc=4]	2.38 [ac=0.04, mc=4]
2048	3.53 [ac=0.01, mc=2]	3.24 [ac=0.02, mc=3]	2.80 [ac=0.04, mc=4]
4096	4.12 [ac=0.01, mc=3]	3.75 [ac=0.03, mc=4]	3.30 [ac=0.05, mc=4]
8192	4.75 [ac=0.01, mc=2]	4.51 [ac=0.03, mc=3]	4.03 [ac=0.06, mc=4]
16384	5.52 [ac=0.01, mc=2]	5.34 [ac=0.04, mc=4]	4.55 [ac=0.07, mc=3]

isStateRefreshDuePow5DelayAware (ac="ave contention", mc="max contention"):

Req/s	Regen/s (2ms delay)	Regen/s (8ms delay)	Regen/s (16ms delay)
1	0.54 [ac=0.00, mc=2]	0.50 [ac=0.00, mc=3]	0.45 [ac=0.01, mc=3]
2	0.74 [ac=0.00, mc=3]	0.68 [ac=0.01, mc=3]	0.60 [ac=0.01, mc=4]
4	0.92 [ac=0.00, mc=3]	0.84 [ac=0.01, mc=4]	0.73 [ac=0.01, mc=4]
8	1.08 [ac=0.00, mc=3]	0.98 [ac=0.01, mc=4]	0.85 [ac=0.01, mc=4]
16	1.23 [ac=0.00, mc=3]	1.12 [ac=0.01, mc=4]	0.97 [ac=0.01, mc=5]
32	1.39 [ac=0.00, mc=3]	1.26 [ac=0.01, mc=4]	1.10 [ac=0.02, mc=4]
64	1.56 [ac=0.00, mc=3]	1.43 [ac=0.01, mc=4]	1.25 [ac=0.02, mc=4]
128	1.76 [ac=0.00, mc=3]	1.61 [ac=0.01, mc=3]	1.41 [ac=0.02, mc=5]
256	1.97 [ac=0.00, mc=3]	1.83 [ac=0.01, mc=4]	1.62 [ac=0.02, mc=4]
512	2.20 [ac=0.00, mc=3]	2.07 [ac=0.02, mc=4]	1.84 [ac=0.03, mc=4]
1024	2.51 [ac=0.01, mc=3]	2.33 [ac=0.02, mc=3]	2.07 [ac=0.03, mc=4]
2048	2.82 [ac=0.01, mc=3]	2.64 [ac=0.02, mc=3]	2.41 [ac=0.04, mc=5]
4096	3.13 [ac=0.01, mc=3]	2.97 [ac=0.02, mc=3]	2.70 [ac=0.04, mc=4]
8192	3.59 [ac=0.01, mc=2]	3.50 [ac=0.03, mc=3]	3.14 [ac=0.04, mc=3]
16384	3.95 [ac=0.01, mc=2]	3.82 [ac=0.03, mc=3]	3.57 [ac=0.05, mc=4]

Change 882793 merged by jenkins-bot:

[mediawiki/core@master] rdbms: tweak the refresh probability in LoadMonitor

https://gerrit.wikimedia.org/r/882793

ReleaseTaggerBot edited projects, added MW-1.40-notes (1.40.0-wmf.27; 2023-03-13); removed MW-1.40-notes (1.40.0-wmf.24; 2023-02-20). Mar 13 2023, 5:00 AM

aaron mentioned this in T331914: Update WANCache preemptive refresh probability function. Mar 13 2023, 6:20 PM

@aaron: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

aaron claimed this task. Apr 17 2023, 7:55 PM

I compared https://gerrit.wikimedia.org/r/c/mediawiki/core/+/636465/ and git master using small scripts:

lm_test.php

echo "Testing with apc.enable_cli=" . ini_get('apc.enable_cli');
$lb = \Wikimedia\TestingAccessWrapper::newFromObject( \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancer() );
$prior = []; while(1) { $loads = $lb->groupLoads['']; $lb->getLoadMonitor()->scaleLoads( $loads, $lb->getLocalDomainID() ); if ( $loads !== $prior ) { echo "loads=[" . implode(',',$loads) . "] @" . round(microtime(1),3) . "\n"; } $prior = $loads; time_nanosleep( 0, 10e6 ); }

lm_test.sh

#!/bin/bash
sudo killall php
for i in {1..99}; do php maintenance/run.php eval < lm_test.php > /dev/null & done
php maintenance/run.php eval < lm_test.php

In terminal A:

./lm_test.sh

Then, in terminal B:

sudo systemctl stop mariadb@db3 && date +%s.%3N && sleep 20 && sudo systemctl start mariadb@db3 && date +%s.%3N

On my 3-server setup, this shows the convergence time of the load weights (db1,db2,db3) from [0,100,100] => [0,100,1] after stopping db3, and then from [0,100,1] => [0,100,99] after starting it again. I get ~5.5-7 seconds for master and ~2.5-3 seconds for https://gerrit.wikimedia.org/r/882793 .

I also tried adding a db5 load with a bogus port then directly injected a cache entry via eval.php:

$cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$cache->set( $cache->makeGlobalKey( 'rdbms-gauge', '2', 's1', 'db1', 'db5' ), [ 'up' => 1.0, 'lag' => 0, 'time' => microtime(1), 'delay' => 0 ], 60 ); echo microtime(1) . "\n";

This yielded similar results.

I ran this with usleep() during the poll phase (1ms-1000ms). Based on previously poor results, the isStateRefreshDue() method exponent was tweaked from $genRatio*128 to min($genRatio*64,0.1) to stay more responsive to slow servers.

I reran the isStateRefreshDue() stochastic simulations for that below.
isStateRefreshDuePow4DelayAwareLate (ac="ave contention", mc="max contention", t="total time"):

Req/s	Regen/s (16ms delay)	Regen/s (512ms delay)	Regen/s (1024ms delay)	Regen/s (4096ms delay)
1 (t=2000001)	0.02 [ac=0.00, mc=2]	0.02 [ac=0.01, mc=3]	0.02 [ac=0.02, mc=4]	0.01 [ac=0.05, mc=5]
2 (t=1000002)	0.02 [ac=0.00, mc=2]	0.02 [ac=0.01, mc=3]	0.02 [ac=0.02, mc=4]	0.02 [ac=0.06, mc=6]
4 (t=500000)	0.02 [ac=0.00, mc=2]	0.02 [ac=0.01, mc=3]	0.02 [ac=0.02, mc=5]	0.02 [ac=0.07, mc=5]
8 (t=250000)	0.03 [ac=0.00, mc=2]	0.03 [ac=0.01, mc=3]	0.03 [ac=0.03, mc=4]	0.02 [ac=0.08, mc=5]
16 (t=125000)	0.03 [ac=0.00, mc=2]	0.03 [ac=0.01, mc=3]	0.03 [ac=0.03, mc=4]	0.03 [ac=0.09, mc=8]
32 (t=62500)	0.03 [ac=0.00, mc=2]	0.03 [ac=0.02, mc=3]	0.04 [ac=0.03, mc=5]	0.03 [ac=0.11, mc=7]
64 (t=31250)	0.04 [ac=0.00, mc=2]	0.04 [ac=0.02, mc=3]	0.04 [ac=0.04, mc=4]	0.04 [ac=0.13, mc=6]
128 (t=15625)	0.05 [ac=0.00, mc=2]	0.05 [ac=0.02, mc=3]	0.05 [ac=0.05, mc=5]	0.05 [ac=0.15, mc=8]
256 (t=7812)	0.05 [ac=0.00, mc=2]	0.05 [ac=0.03, mc=3]	0.06 [ac=0.05, mc=4]	0.05 [ac=0.18, mc=5]
512 (t=3906)	0.06 [ac=0.00, mc=1]	0.06 [ac=0.03, mc=3]	0.07 [ac=0.06, mc=6]	0.07 [ac=0.21, mc=6]
1024 (t=1953)	0.07 [ac=0.00, mc=2]	0.07 [ac=0.04, mc=3]	0.08 [ac=0.07, mc=4]	0.09 [ac=0.26, mc=6]
2048 (t=976)	0.08 [ac=0.00, mc=1]	0.09 [ac=0.04, mc=3]	0.10 [ac=0.09, mc=4]	0.11 [ac=0.32, mc=6]
4096 (t=488)	0.09 [ac=0.00, mc=2]	0.11 [ac=0.05, mc=3]	0.11 [ac=0.10, mc=2]	0.14 [ac=0.37, mc=6]
8192 (t=244)	0.11 [ac=0.00, mc=1]	0.10 [ac=0.05, mc=2]	0.14 [ac=0.12, mc=3]	0.19 [ac=0.47, mc=8]

Change 914030 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/EventSimulator@master] Add ConnectTimeoutScenario and related tweaks

https://gerrit.wikimedia.org/r/914030

gerritbot added a project: Patch-For-Review. May 2 2023, 1:28 AM

Restricted Application removed a project: Patch-Needs-Improvement. May 2 2023, 1:28 AM

EventSimulator modelling of Gerrit 636465, tentative initial results. db1119 fails at t=5, waiting for the configured connection timeout and then incrementing the failed connection count metric. Failures of LoadMonitor related connections can be seen at t=6, and failures of regular connections can be seen at t=8. Aaron's patch reduces the failure count at t=6 and makes the ramp down of connection attempts about 0.5 seconds quicker.

Change 636465 merged by jenkins-bot:

[mediawiki/core@master] rdbms: improve caching and state convergence in LoadMonitor

https://gerrit.wikimedia.org/r/636465

Change 914030 merged by jenkins-bot:

[mediawiki/extensions/EventSimulator@master] Add ConnectTimeoutScenario and related tweaks

https://gerrit.wikimedia.org/r/914030

aaron updated the task description. May 4 2023, 1:36 AM

What timeout did you use? The LoadMonitor probe connections use 1 second.

aaron closed this task as Resolved. May 9 2023, 4:35 AM

Change 922180 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: avoid session variable SET query for LoadMonitor connections

https://gerrit.wikimedia.org/r/922180

Change 922180 merged by jenkins-bot: