Alert the Growth team when number of available task recommendations drops significantly
Open, Low, Public

Description

Recently, I noticed a feature-wide outage (T345188) by accident. The initial symptom of the outage was the image recommendation task pool being drained until it was nearly empty. This outage could have been noticed more quickly and reliably if we had better monitoring available.

We should think about adding better monitoring for Growth's structured edits. As a first step, this could include alerting when the number of suggestions drops significantly (by more than 20%, for example?). On very small wikis, this could still cause spurious alerts (if a wiki has merely 10 suggestions, 20% is 2 suggestions), but I believe it is better to resolve an unnecessary page than to keep noticing outages by accident. If that turns out to happen frequently, we can always tune the alerting policies to be more accurate, but we have to start somewhere.
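
To make the proposed rule concrete, here is a minimal sketch of the check, assuming the current task count is compared against an earlier sample of the same metric. The function name, the `threshold_percent` default of 20, and the optional `min_baseline` guard for tiny wikis are illustrative assumptions, not an existing implementation.

```python
def dropped_significantly(previous: int, current: int,
                          threshold_percent: int = 20,
                          min_baseline: int = 0) -> bool:
    """Return True when the task count fell by more than threshold_percent.

    Hypothetical helper: `previous` and `current` are two samples of the
    number of available task recommendations for one wiki and task type.
    """
    if previous <= 0 or previous < min_baseline:
        # No meaningful baseline to compare against, so never alert.
        return False
    # Integer arithmetic avoids floating-point surprises at the boundary:
    # (previous - current) / previous > threshold_percent / 100
    return (previous - current) * 100 > threshold_percent * previous
```

With these defaults, a wiki going from 10 to 7 suggestions trips the check, while going from 10 to 8 (exactly 20%) does not. Passing a non-zero `min_baseline` would silence such tiny wikis, at the cost of monitoring them less closely.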

Docs: https://wikitech.wikimedia.org/wiki/Alertmanager

Event Timeline

We only have the link-recommendation task pool in Grafana. This is configured in two different places: listTaskCounts.php has a condition in reportTaskCounts to only report task counts to statsd for the link-recommendation task type, and puppet then explicitly passes --tasktype link-recommendation (so the script doesn't run for other task types in the first place).

The script code gives the following reasoning for the limitation:

// Limit to link recommendations to avoid excessive use of statsd metrics as we don't
// care too much about the others. Maybe there will be a nicer way to handle this in
// the future with Prometheus.

That's fair, as currently reporting to statsd for each task type means 1 metric per topic plus 1 global metric. I think it makes sense to stop producing the per-topic metrics, as they make very little sense for small-ish wikis, while the global task count is still useful even on small-ish wikis.

I'm in favour of removing the per-topic metrics for everything (incl. link recommendation), but they are currently used in Grafana (albeit in a broken dashboard) and might be relied on by other engineers. I'll file a separate task about removing the per-topic metrics. In the meantime, let's produce per-topic metrics only for link-recommendation (where they are already produced) and not for other task types.
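
The interim policy described above could be modelled roughly like this. It is an illustrative Python sketch, not the actual listTaskCounts.php code: the function name, the metric key format, and the `image-recommendation` identifier used in the usage note are assumptions made for the example.

```python
# Task types that keep their per-topic metrics for now (interim policy).
PER_TOPIC_TASK_TYPES = {"link-recommendation"}


def metrics_to_report(task_type: str,
                      counts_by_topic: dict[str, int],
                      total: int) -> dict[str, int]:
    """Build a {metric key: value} mapping for one task type.

    Hypothetical key layout: one total per task type, plus one key per
    topic only for task types listed in PER_TOPIC_TASK_TYPES.
    """
    # The total is passed in rather than summed, because this sketch does
    # not assume that topics are disjoint.
    metrics = {f"{task_type}.total": total}
    if task_type in PER_TOPIC_TASK_TYPES:
        for topic, count in counts_by_topic.items():
            metrics[f"{task_type}.topic.{topic}"] = count
    return metrics
```

Under this model, `metrics_to_report("image-recommendation", {"art": 3, "sports": 5}, 7)` yields a single total metric, while the same call for `link-recommendation` also yields one metric per topic.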

Once the task counts for all task types are in statsd, we can follow https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts to implement the alerting.

Change 953343 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] listTaskCounts: Push total task counts to statsd for all tasks

https://gerrit.wikimedia.org/r/953343

Change 953344 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] growthexperiments: Run listTaskCounts for all task types

https://gerrit.wikimedia.org/r/953344

Change 953347 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] alertmanager: route Growth team alerts

https://gerrit.wikimedia.org/r/953347

lmata subscribed.

Moving to radar, please let me know if there's anything we can help with.

Thanks! Review of the puppet patch would be helpful :). It is by no means urgent, given I still need to get the data I want to base the alerting on from MediaWiki into graphite, but AFAIK those two parts should be largely independent.

You are correct that the two parts are independent. Please feel free to add me as a reviewer of the relevant changes; I'm happy to help.

Thanks for the offer! Done. It's https://gerrit.wikimedia.org/r/c/operations/puppet/+/953347/ for the Puppet part and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/953343/ for the pushing data part.

Change 953347 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: route Growth team alerts

https://gerrit.wikimedia.org/r/953347

Change 953343 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] listTaskCounts: Push total task counts to statsd for all tasks

https://gerrit.wikimedia.org/r/953343

Change 957396 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.41.0-wmf.26] listTaskCounts: Push total task counts to statsd for all tasks

https://gerrit.wikimedia.org/r/957396

Change 957396 merged by Urbanecm:

[mediawiki/extensions/GrowthExperiments@wmf/1.41.0-wmf.26] listTaskCounts: Push total task counts to statsd for all tasks

https://gerrit.wikimedia.org/r/957396

Mentioned in SAL (#wikimedia-operations) [2023-09-14T15:47:52Z] <urbanecm@deploy1002> Started scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]]

Mentioned in SAL (#wikimedia-operations) [2023-09-14T15:55:29Z] <urbanecm@deploy1002> Finished scap: Backport for [[gerrit:957396|listTaskCounts: Push total task counts to statsd for all tasks (T345204)]], [[gerrit:957758|linkTaskCounts: Stop producing per-topic statsd data (T345210)]] (duration: 07m 37s)

Change 953344 merged by Clément Goubert:

[operations/puppet@production] growthexperiments: Run listTaskCounts for all task types

https://gerrit.wikimedia.org/r/953344

I'm untagging o11y here since I don't think there's anything immediately actionable for us. Please reach out if that's not the case.