Maniphest T351948

Remove redundant metrics "MediaWiki.centralauth.centrallogin_errors.*"
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	matmarex
	Nov 25 2023, 3:26 AM

Description

In T327046#9262816, @matmarex wrote:

In T327046#9252499, @Tgr wrote:

BTW there's also some pre-existing monitoring in SpecialCentralLogin::showError() which sends statd data for centralauth.centrallogin_errors.<error-message-key> stats. I think I'd rather use Logstash for that kind of thing. We did the same messagekey-based-statsd thing for the authevents channel, and never used it much. Seeing changes in volume are sometimes useful for security-wise interesting error types (like bad password or bad captcha) but central login doesn't really have anything like that. And Logstash still lets one visualise volumes with a little (but not that much) more effort.

Are you saying we should remove it? (I'd be on board with that)

This data is used for some alerts: https://gerrit.wikimedia.org/g/operations/puppet/+/c678553ed26e63dbad2a0a0924f96962b6447a72/modules/profile/manifests/graphite/alerts.pp#47 – no idea who or where receives them.

That also links to a dashboard: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 – which is probably redundant to https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=13. No idea how to find if anything else uses the data.

In T327046#9271099, @Tgr wrote:

In T327046#9262816, @matmarex wrote:

Are you saying we should remove it? (I'd be on board with that)

I think we should either log SpecialCentral[Auto]Login steps to Logstash at the debug level (so people running into issues can use X-Wikimedia-Debug to provide a trace) and also log the final success/failure to statsd (so we can see trends over time), or we should log to Logastash at the info level, in which case we can also use that for trends/alerts (a little more painfully, OTOH with less code to maintain).

This data is used for some alerts:

That was added in rOPUP4dd4e50b6fbc: graphite::alerts: add alerting on session loss by @Joe so maybe he knows if it is still needed.

It seems that the old MediaWiki.centralauth.centrallogin_errors.* metric is exactly the same as the new MediaWiki.authmanager.centrallogin.*.failure.* metric. We should stop logging the old one, and update the dashboards/monitoring that use it. (Maybe we could combine them to show historical data in the dashboard, that would be neat if it's not too difficult.)

https://codesearch.wmcloud.org/search/?q=centrallogin_errors

Details

	Subject	Repo	Branch	Lines +/-
	Update CentralAuth login failures metric	operations/puppet	production	+3 -2
	Remove logging to "centralauth.centrallogin_errors.*"	mediawiki/extensions/CentralAuth	master	+0 -9

Customize query in gerrit

Related Objects

Mentioned Here: rOPUP4dd4e50b6fbc: graphite::alerts: add alerting on session loss
T327046: Improve (or identify) monitoring for CentralAuth autologins on Wikimedia wikis

Event Timeline

matmarex created this task.Nov 25 2023, 3:26 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 25 2023, 3:26 AM

I've noticed something curious, I wonder if anyone can explain it to me.

When I look at the data for these two metrics for the last 6 days, they are exactly the same:

https://grafana-rw.wikimedia.org/explore?orgId=1&left={"queries":[{"target":"summarize(groupByNodes(MediaWiki.centralauth.centrallogin_errors.*.rate,'sum',2),'1h','avg',false)"},{"target":"summarize(groupByNodes(MediaWiki.authmanager.centrallogin.*.failure.*.rate,'sum',2,4),'1h','avg',false)"}],"range":{"from":"now-6d","to":"now"}}

But if I use the same query to view the last 8 days, they are slightly different:

https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties

Last 7 days, we store one datapoint per 1 minute, as flushed by Statsd.
Last 14 days, we keep one datapoint per 5 minutes, aggregated at ingestion time (avg for most properties, except sum for some).

Each data source is only queried once. Depending on date time range, you'll query either Graphite's 7d sequences, or it's longer term sequences.

Older data thus is naturally flatter, with fewer spikes and dips.

The average should remain accurate, eg if used to derive a multiple or total over the same period. However generating another average will produce unstable results, and does not have an (obvious) meaning to me.

I see, but that doesn't really explain it to me. How come that identical datapoints-per-1-minute result in different datapoints-per-5-minutes when aggregated? (Are the aggregation periods not aligned?)

They don't even average out to the same value. One of the metrics consistently has higher values. (link)

Well, at least the 'sum' is consistent. (link)

larissagaulia moved this task from Inbox, needs triage to Current Sprint on the MediaWiki-Platform-Team board.Nov 27 2023, 12:43 PM

Change 977781 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/CentralAuth@master] Remove logging to "centralauth.centrallogin_errors.*"

https://gerrit.wikimedia.org/r/977781

gerritbot added a project: Patch-For-Review.Nov 27 2023, 7:58 PM

In T351948#9357367, @matmarex wrote:

I see, but that doesn't really explain it to me. How come that identical datapoints-per-1-minute result in different datapoints-per-5-minutes when aggregated?

The authevents metrics are split into api/web, so even though the individual reported values are the same, there are twice as many buckets and there is less aggregation happening.

I never quite understood why avg is inaccurate (for percentile metrics, it's impossible to get an exact result by aggregating aggregates, but for sum/avg it should be fine) but whatever mechanism causes avg aggregation to lose fidelity is used more for the old metric.

Change 977781 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Remove logging to "centralauth.centrallogin_errors.*"

https://gerrit.wikimedia.org/r/977781

Maintenance_bot removed a project: Patch-For-Review.Nov 27 2023, 10:30 PM

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.7; 2023-11-28).Nov 27 2023, 11:00 PM

Ah, that sounds plausible enough, thanks.

Change 978679 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/puppet@production] Update CentralAuth login failures metric

https://gerrit.wikimedia.org/r/978679

gerritbot added a project: Patch-For-Review.Nov 29 2023, 7:20 PM

I updated both https://grafana.wikimedia.org/d/000000004/authentication-metrics?orgId=1&viewPanel=41 and https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&viewPanel=3 to show both the old and the new data sources. Currently they're shown in a slightly ugly overlapped way, but the old one will stop being logged next week.

Change 978679 merged by Filippo Giunchedi: