Page MenuHomePhabricator

Remove redundant metrics "MediaWiki.centralauth.centrallogin_errors.*"
Closed, ResolvedPublic

Description

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume are sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

Are you saying we should remove it? (I'd be on board with that)

This data is used for some alerts: https://gerrit.wikimedia.org/g/operations/puppet/+/c678553ed26e63dbad2a0a0924f96962b6447a72/modules/profile/manifests/graphite/alerts.pp#47 – no idea who or where receives them.

That also links to a dashboard: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 – which is probably redundant to https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=13. No idea how to find if anything else uses the data.

Are you saying we should remove it? (I'd be on board with that)

I think we should either log SpecialCentral[Auto]Login steps to Logstash at the debug level (so people running into issues can use X-Wikimedia-Debug to provide a trace) and also log the final success/failure to statsd (so we can see trends over time), or we should log to Logastash at the info level, in which case we can also use that for trends/alerts (a little more painfully, OTOH with less code to maintain).

This data is used for some alerts:

That was added in rOPUP4dd4e50b6fbc: graphite::alerts: add alerting on session loss by @Joe so maybe he knows if it is still needed.


It seems that the old MediaWiki.centralauth.centrallogin_errors.* metric is exactly the same as the new MediaWiki.authmanager.centrallogin.*.failure.* metric. We should stop logging the old one, and update the dashboards/monitoring that use it. (Maybe we could combine them to show historical data in the dashboard, that would be neat if it's not too difficult.)

https://codesearch.wmcloud.org/search/?q=centrallogin_errors

Event Timeline

https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties

Last 7 days, we store one datapoint per 1 minute, as flushed by Statsd.
Last 14 days, we keep one datapoint per 5 minutes, aggregated at ingestion time (avg for most properties, except sum for some).

Each data source is only queried once. Depending on date time range, you'll query either Graphite's 7d sequences, or it's longer term sequences.

Older data thus is naturally flatter, with fewer spikes and dips.

The average should remain accurate, eg if used to derive a multiple or total over the same period. However generating another average will produce unstable results, and does not have an (obvious) meaning to me.

I see, but that doesn't really explain it to me. How come that identical datapoints-per-1-minute result in different datapoints-per-5-minutes when aggregated? (Are the aggregation periods not aligned?)

They don't even average out to the same value. One of the metrics consistently has higher values. (link)

Well, at least the 'sum' is consistent. (link)

Change 977781 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Remove logging to "centralauth.centrallogin_errors.*"

https://gerrit.wikimedia.org/r/977781

I see, but that doesn't really explain it to me. How come that identical datapoints-per-1-minute result in different datapoints-per-5-minutes when aggregated?

The authevents metrics are split into api/web, so even though the individual reported values are the same, there are twice as many buckets and there is less aggregation happening.

I never quite understood why avg is inaccurate (for percentile metrics, it's impossible to get an exact result by aggregating aggregates, but for sum/avg it should be fine) but whatever mechanism causes avg aggregation to lose fidelity is used more for the old metric.

Change 977781 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Remove logging to "centralauth.centrallogin_errors.*"

https://gerrit.wikimedia.org/r/977781

Ah, that sounds plausible enough, thanks.

Change 978679 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/puppet@production] Update CentralAuth login failures metric

https://gerrit.wikimedia.org/r/978679

I updated both https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=41 and https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&viewPanel=3 to show both the old and the new data sources. Currently they're shown in a slightly ugly overlapped way, but the old one will stop being logged next week.

Change 978679 merged by Filippo Giunchedi:

[operations/puppet@production] Update CentralAuth login failures metric

https://gerrit.wikimedia.org/r/978679