Grafana MySQL charts can be inconsistent when zooming out
Open, Needs Triage, Public

Description

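While investigating T367778: [wikireplicas] frequent replag spikes in clouddb hosts, I discovered an issue with our Grafana MySQL dashboard. Most charts use irate[5m], but this can produce artifacts when zooming out if the metric is "spiky". irate only looks at the last two samples in each 5-minute window, so when the graph step is much larger than 5 minutes, each plotted point reflects a single short instant and any spike that falls between points is missed.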

An example is the bytes_sent metric for server clouddb1019:13314 over the past 90 days. This is what the dashboard currently shows:

Screenshot 2024-07-31 at 13.48.19.png (626×1 px, 169 KB)

It looks like the values for bytes_sent before 2024-06-12 are very low, but if I change the query to use rate[1h], I get this graph:

Screenshot 2024-07-31 at 13.49.38.png (636×1 px, 216 KB)

Zooming in and using the original query (irate[5m]), you can see the actual shape of the traffic:

Screenshot 2024-07-31 at 13.52.18.png (540×1 px, 155 KB)
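
The same effect can be reproduced outside Grafana. The following is a minimal Python sketch (not PromQL, and not taken from the dashboard) that simulates a bursty counter and evaluates it at different graph steps. The 60-second scrape interval, the traffic shape and all helper names are assumptions made for illustration, and the rate helper is a simplification that ignores Prometheus' extrapolation and counter-reset handling.

```python
SCRAPE = 60  # assumed scrape interval, in seconds


def make_counter(hours):
    """Return (timestamp, value) samples of a hypothetical bytes counter.

    Traffic is 1,000 B/s, except for a 100,000 B/s burst during
    minutes 20-29 of every hour.
    """
    samples, total = [], 0
    for i in range(hours * 3600 // SCRAPE + 1):
        t = i * SCRAPE
        samples.append((t, total))
        minute = (t // 60) % 60
        per_second = 100_000 if 20 <= minute < 30 else 1_000
        total += per_second * SCRAPE
    return samples


def in_window(samples, t, seconds):
    """Samples whose timestamp falls in [t - seconds, t]."""
    return [(ts, v) for ts, v in samples if t - seconds <= ts <= t]


def irate(samples, t, seconds=300):
    """Per-second rate computed from the LAST TWO samples in the window."""
    (t1, v1), (t2, v2) = in_window(samples, t, seconds)[-2:]
    return (v2 - v1) / (t2 - t1)


def rate(samples, t, seconds):
    """Simplified per-second rate from the FIRST and LAST samples."""
    pts = in_window(samples, t, seconds)
    (t1, v1), (t2, v2) = pts[0], pts[-1]
    return (v2 - v1) / (t2 - t1)


samples = make_counter(hours=24)
hourly = range(3600, 24 * 3600 + 1, 3600)       # zoomed out: one point per hour
per_minute = range(300, 24 * 3600 + 1, SCRAPE)  # zoomed in: one point per minute

zoomed_out_irate = [irate(samples, t) for t in hourly]
zoomed_out_rate = [rate(samples, t, 3600) for t in hourly]
zoomed_in_irate = [irate(samples, t) for t in per_minute]

print("irate[5m], 1h step:", set(zoomed_out_irate))
print("rate[1h],  1h step:", set(zoomed_out_rate))
print("irate[5m], 1m step: max =", max(zoomed_in_irate))
```

In this sketch every hourly evaluation lands in a quiet minute, so the zoomed-out irate series never sees the bursts, while rate over one hour averages them in and the zoomed-in irate series shows their real height. Real traffic is less regular, which is why the actual chart looks inconsistent rather than uniformly low.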

I can see two possible fixes:

  1. Replacing irate[5m] with rate[$__rate_interval]. This seems to work fine in most situations, but because rate averages over the whole window, it might flatten some spikes that irate captures.
  2. Copying the approach used by percona/grafana-dashboards, which uses max(rate[$interval] or irate[5m]).

We should probably do this for all metrics in the MySQL dashboard that are currently using irate[5m].
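
The trade-off between the two options can be sketched in Python under the same kind of assumptions: a hypothetical counter scraped every 60 seconds, a simplified rate without extrapolation or counter-reset handling, and illustrative helper names. rate_interval follows Grafana's documented formula for $__rate_interval, max($__interval + scrape interval, 4 * scrape interval). rate_or_irate is only my reading of the "or" in option 2 for a single series: use the rate result when there is one, otherwise fall back to irate[5m].

```python
SCRAPE = 60  # assumed scrape interval, in seconds


def make_counter(hours):
    """Hypothetical counter: 1,000 B/s, with a 100,000 B/s burst
    during minutes 20-29 of every hour."""
    samples, total = [], 0
    for i in range(hours * 3600 // SCRAPE + 1):
        t = i * SCRAPE
        samples.append((t, total))
        minute = (t // 60) % 60
        total += (100_000 if 20 <= minute < 30 else 1_000) * SCRAPE
    return samples


def in_window(samples, t, seconds):
    """Samples whose timestamp falls in [t - seconds, t]."""
    return [(ts, v) for ts, v in samples if t - seconds <= ts <= t]


def rate(samples, t, seconds):
    """Simplified rate: first vs. last sample, None with fewer than two."""
    pts = in_window(samples, t, seconds)
    if len(pts) < 2:
        return None
    (t1, v1), (t2, v2) = pts[0], pts[-1]
    return (v2 - v1) / (t2 - t1)


def irate(samples, t, seconds=300):
    """Rate from the last two samples in the window, None with fewer than two."""
    pts = in_window(samples, t, seconds)
    if len(pts) < 2:
        return None
    (t1, v1), (t2, v2) = pts[-2:]
    return (v2 - v1) / (t2 - t1)


def rate_interval(step, scrape=SCRAPE):
    """Grafana's formula for $__rate_interval (step stands in for $__interval)."""
    return max(step + scrape, 4 * scrape)


def rate_or_irate(samples, t, interval):
    """Option 2 for one series: rate, falling back to irate[5m] if rate is empty."""
    value = rate(samples, t, interval)
    return value if value is not None else irate(samples, t)


samples = make_counter(hours=24)

# Option 1, zoomed out (1h step): no burst is skipped, but each is averaged down.
option1_zoomed_out = [rate(samples, t, rate_interval(3600))
                      for t in range(3600, 24 * 3600 + 1, 3600)]

# Option 1, zoomed in (1m step): the window shrinks, so the bursts show in full.
option1_zoomed_in = [rate(samples, t, rate_interval(60))
                     for t in range(300, 24 * 3600 + 1, 60)]

# Option 2 with an interval shorter than the scrape interval: rate sees a
# single sample and returns nothing, so irate answers instead.
fallback_value = rate_or_irate(samples, 1260, 30)

print("option 1, 1h step: max =", max(option1_zoomed_out))
print("option 1, 1m step: max =", max(option1_zoomed_in))
print("option 2 fallback:", fallback_value)
```

In this simulation option 1 never skips a burst at any zoom level, but when zoomed out it reports a windowed average well below the true peak, which is the "hidden spike" risk mentioned in the first option.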