While investigating T367778 ([wikireplicas] frequent replag spikes in clouddb hosts), I discovered an issue with our Grafana MySQL dashboard. Most charts use irate[5m], which can produce misleading artifacts when zooming out if the metric is "spiky": irate only looks at the last two samples in each window, so at a coarse step most of the data between plotted points is ignored.
An example is the bytes_sent metric for server clouddb1019:13314 in the past 90 days. This is what the dashboard currently shows:
It looks like the values for bytes_sent before 2024-06-12 are very low, but if I change the query to use rate[1h], I get this graph:
Zooming in and using the original query (irate[5m]), you can see the actual shape of the traffic:
I can see two possible fixes:
- Replacing irate[5m] with rate[$__rate_interval] seems to work fine in most situations, but because rate averages over the whole window, it may hide some short spikes that irate captures.
- Copying the approach used by percona/grafana-dashboards, which uses max(rate[$interval] or irate[5m]).
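
As a rough sketch, the two options could look like this for the bytes_sent panel. The metric name mysql_global_status_bytes_sent (the usual mysqld_exporter name) and the instance label selector are assumptions and may not match the actual panel definition. $interval is assumed to be a dashboard variable, as in the Percona dashboards, while $__rate_interval is built into Grafana.

```promql
# Current query (assumed shape): rate computed from the last two samples in a 5m window
irate(mysql_global_status_bytes_sent{instance="clouddb1019:13314"}[5m])

# Option 1: average rate over a window that Grafana scales with the panel step
rate(mysql_global_status_bytes_sent{instance="clouddb1019:13314"}[$__rate_interval])

# Option 2: Percona-style query, using rate where it returns data and irate otherwise
max(
  rate(mysql_global_status_bytes_sent{instance="clouddb1019:13314"}[$interval])
  or
  irate(mysql_global_status_bytes_sent{instance="clouddb1019:13314"}[5m])
)
```

Note that in PromQL, `or` keeps every series from its left-hand side and only adds right-hand series whose label sets are missing on the left. In option 2 the irate result is therefore only used when rate returns nothing for that series, for example when $interval is too short to contain two samples.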
Whichever fix we choose, we should probably apply it to all metrics in the MySQL dashboard that are currently using irate[5m].