This task calls for several things:
- Make the connectivity scaling factor converge faster so that connection attempts to downed servers stop sooner. Currently, the logs still get flooded with connection errors. We should use both a higher "moving average ratio" and possibly a better moving average function to better handle varying traffic.
- Make the cache keys per DB-server so that extra query groups do not require separate cache keys (even if their servers are a subset of the main traffic servers or at least overlap with them). When checking which servers need state updates, the order should be shuffled. This makes it easier to keep the data up-to-date since it no longer requires a for-loop that always has to connect to everything.
- Simplify the tiered apcu/WANCache logic to use either apcu (web mode) or the local cluster cache (CLI mode). Since both have the BagOStuff interface, this would simplify the code significantly. Placeholders should be used liberally for a few seconds when there is no stale value and the cache mutex is already held. Cluster cache updates from web requests should use WRITE_BACKGROUND to avoid latency.
- Mitigate network slowness when LoadMonitor polls/gauges servers (e.g. lower connection timeout and set read timeout with mysqli). Maybe there could be a LoadBalancer::CONN_GAUGE_PROBE constant for a third connection class category to help this. Past outages have involved connections hanging in the ACCEPT state, or queries (including heartbeat table and SHOW ones) being slow.
- Consider adding mysql connected/running/max_connections status variables to the server state polling to see how close a server is to being overloaded.
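As a rough illustration of the first item, here is a minimal sketch of a moving average whose ratio has a high floor (so back-to-back failures converge quickly) and also grows with the time since the last sample (so stale values fade when traffic is sparse). It is written in Python for brevity rather than in MediaWiki's PHP, and every name in it (`ConnectivityScore`, `min_ratio`, `half_life`) is hypothetical, not an existing LoadMonitor API.

```python
class ConnectivityScore:
    """Hypothetical time-aware moving average of connection outcomes.

    Samples are 1.0 (connection succeeded) or 0.0 (connection failed).
    """

    def __init__(self, min_ratio=0.5, half_life=5.0, initial=1.0):
        # Assumed tunables: min_ratio is the floor for the "moving average
        # ratio"; half_life (seconds) controls how fast old state fades.
        self.min_ratio = min_ratio
        self.half_life = half_life
        self.value = initial
        self.last_update = None

    def observe(self, success, now):
        """Blend one connection outcome into the score and return it."""
        sample = 1.0 if success else 0.0
        if self.last_update is None:
            ratio = self.min_ratio
        else:
            elapsed = max(0.0, now - self.last_update)
            # The longer since the last sample, the less the old value counts.
            time_ratio = 1.0 - 0.5 ** (elapsed / self.half_life)
            ratio = max(self.min_ratio, time_ratio)
        self.value += ratio * (sample - self.value)
        self.last_update = now
        return self.value


score = ConnectivityScore()
for t in (0.0, 0.1, 0.2):
    score.observe(False, now=t)   # three rapid failures
after_failures = score.value      # 1.0 -> 0.5 -> 0.25 -> 0.125
score.observe(True, now=100.2)    # one success after a long quiet period
after_recovery = score.value      # close to 1.0 again
```

With a floor of 0.5, three consecutive failures already drop the factor to 0.125, while a single success after a long idle gap restores it almost fully. Whether those particular values are appropriate is a tuning question this sketch does not answer.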
Server failure scenarios we should be able to quickly handle:
- Immediately refuses connection attempts (mysql not listening)
- Immediately rejects connection attempts (max_connections)
- Times out for connection attempts (packet loss, mysql accept() loop stuck)
- Takes a long time to accept connections (packet loss, CPU use)
- Takes a long time to run even trivial non-data queries like SET SESSION sql_mode = 'x' (packet loss, CPU/IO use)
- Replication stops or slows to a crawl (packet loss, replicated query failure, corruption, CPU/IO use, futex deadlocks)
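One way to reason about these scenarios is to map each probe outcome to a distinct state so that the monitor can react differently to each. The following Python sketch is only an illustration under assumptions: the `ProbeResult` fields, the error labels, and the thresholds are all hypothetical, and treating an unknown lag (`None`) as stalled replication is a modelling choice, not documented behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProbeState(Enum):
    OK = "ok"
    REFUSED = "refused"                  # mysql not listening
    REJECTED = "rejected"                # max_connections reached
    CONNECT_TIMEOUT = "connect_timeout"  # no answer within the timeout
    SLOW_ACCEPT = "slow_accept"          # connected, but slowly
    SLOW_QUERY = "slow_query"            # trivial query was slow or timed out
    REPLICATION_STALLED = "replication_stalled"


@dataclass
class ProbeResult:
    # Hypothetical labels: None, "refused", "too_many_connections", "timeout"
    connect_error: Optional[str]
    connect_seconds: float
    query_seconds: Optional[float]  # None: the trivial query did not finish
    lag_seconds: Optional[float]    # None: lag unknown (assumed stalled)


def classify(result, slow_connect=0.5, slow_query=0.5, max_lag=30.0):
    """Map one gauge probe to a failure scenario (thresholds are assumptions)."""
    errors = {
        "refused": ProbeState.REFUSED,
        "too_many_connections": ProbeState.REJECTED,
        "timeout": ProbeState.CONNECT_TIMEOUT,
    }
    if result.connect_error is not None:
        return errors[result.connect_error]
    if result.connect_seconds > slow_connect:
        return ProbeState.SLOW_ACCEPT
    if result.query_seconds is None or result.query_seconds > slow_query:
        return ProbeState.SLOW_QUERY
    if result.lag_seconds is None or result.lag_seconds > max_lag:
        return ProbeState.REPLICATION_STALLED
    return ProbeState.OK


healthy = classify(ProbeResult(None, 0.01, 0.01, 1.0))
slow = classify(ProbeResult(None, 0.9, 0.01, 0.0))
```

The checks run in the order a connection progresses (connect, accept latency, trivial query, replication), so the earliest failing stage determines the reported state.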
See T314020 for ideas about tracking connection attempt failures.