Page MenuHomePhabricator

[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-07-23
Closed, ResolvedPublic

Description

tools-db-3 is 9 hours behind the primary, due to a long transaction. This time the table is s55957__ncc.infos.

fnegri@cloudcumin1001:~$ sudo cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction
START - Cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction
###########################################################################
Replica: tools-db-3.tools.eqiad1.wikimedia.cloud
Suspicious tables:
    Table_map: `s55957__ncc`.`infos` mapped
Suspicious queries:
    #Q> DELETE FROM `infos`
END (PASS) - Cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction (exit_code=0)

Event Timeline

fnegri changed the task status from Open to In Progress.Jul 23 2024, 10:27 AM
fnegri claimed this task.
fnegri created this task.
fnegri added a project: cloud-services-team.

I acked the alert for 24 hours, hopefully it will catch up by then.

fnegri triaged this task as High priority.Jul 24 2024, 3:26 PM

Replication is still stuck processing the same DELETE transaction. From past experience, it usually takes a few days to complete the transaction and catch up. I haven't yet found a way to estimate the time it's gonna take, as that can vary depending on the size of the transaction.

Update: I think I found a way to estimate how long the transaction will take, thanks to this brilliant StackOverflow post.

MariaDB [(none)]> SHOW ENGINE INNODB STATUS \G

[...]

---TRANSACTION 138649748342, ACTIVE 142482 sec starting index read
mysql tables in use 1, locked 1
15588 lock struct(s), heap size 1761400, 1315111 row lock(s), undo log entries 1299525

[...]

0.00 inserts/s, 0.00 updates/s, 3.00 deletes/s, 3.00 reads/s

The transaction has to update 1315111 rows (the number of "row locks" above), and has already completed 1299525 rows (the number of undo log entries above). That means we're at 100 * (1299525 / 1315111) = 98.8%.

The transaction has been active for 142482, so it completed an average of 1299525/142482 = 9.12 rows per second. The speed seems not to be constant, though, because the last line above shows a current speed of 3.00 deletes/s. If it continues at that speed, it should complete the transaction in (1315111-1299525)/3 = 5195 sec = 1.44 hours.

Unfortunately there was one wrong assumption in my calculations above: the number of row lock(s) reported by SHOW ENGINE INNODB STATUS does not match the actual number of rows to be deleted. The number of row locks keeps on increasing, maybe because rows are locked incrementally. I assume the second number (undo log entries) does show the correct number of processed rows, and that has now increased from 1299525 to 1500150. The total number it has to reach is not 1315111 as I incorrectly assumed yesterday, but I believe is the total number of rows in the table:

MariaDB [(none)]> SELECT COUNT(*) FROM s55957__ncc.infos;
+----------+
| COUNT(*) |
+----------+
|  2080762 |
+----------+

That means MariaDB still need to process 2080762-1500150 = 580612 rows, which at the current speed of 3 rows/sec will take 580612/3 = 193537 sec = 53,8 hours.

fnegri closed this task as Resolved.EditedJul 29 2024, 2:21 PM

Replication resumed at around 2024-07-27 23:30 UTC and caught up with the primary around 16 hours later.

Screenshot 2024-07-29 at 15.59.42.png (614×1 px, 82 KB)