Page MenuHomePhabricator

[Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api
Closed, ResolvedPublic

Description

User Story
As a platform engineer, I need to implement proper error handling in a streaming processor so that errors are understood and corrected

Like any other network call errors are likely to happen when fetching data from the MW api, such errors should be handled properly so that data loss is limited.
Eventual consistency issues are also likely to happen in such architecture those problems should be mitigated as much as possible to increase the consistency of the output of this processor.

What was proven to work relatively well is:

  • retry any error at most 3 times with 1 sec interval between retries
  • retry events that are considered recent, in our case a good value for recent is 10s (c.f. T279698)

Regarding unrecoverable errors they should be captured in a flink side-output so that they can be analyzed after the fact.
Ideally errors should be classified to ease the analysis of those, ideas taken from the wikidata service updater:

  • not found, the revision or title is not found: cause might be eventual consistency, page actually deleted before we had a chance to read its content (likely to happen during backfills)
  • no content, on http 204
  • unexpected response: on http >= 300
  • empty response: valid http code but no body in the response
  • unknown: for everything else
Done is:
  • Transient network errors do not cause loss of data
  • Eventual consistency issues are understood and mitigated
  • Unrecoverable errors are classified and captured in a side-output
  • (optional) the error side-output has a proper schema, is stored in kafka and persisted in hive for analysis purposes
Related work

Event Timeline

I used this in the past for building Python based services.

https://tenacity.readthedocs.io/en/latest/#waiting-before-retrying

Does anything exist like this for Scala? Maybe this...

https://index.scala-lang.org/hipjim/scala-retry

Could be an easy way for teams that implement services to set their own retry logic for calls

There's a number of libraries to support retries (with the typical backoff mechanisms you'd expect). Not that the choice is set in stone, but this was one of the requirements I looked at when choosing an HTTP library for the PoC.

The library I ended up using allows for pluggable backends (concurrency models), that should make retry logic on network errors pretty much zero cost to implement. The backend I'm using is Future based, at this would be their recommended retry plugin https://github.com/softwaremill/retry. There's more details at https://sttp.softwaremill.com/en/latest/resilience.html?highlight=retry#backend-specific-retries

It gets a bit more tricky when we want to retry on _successful_ requests; for example a query for a rev_id not yet available in the db replica, would return a 200 with a payload that describes the miss (we currently log these kind of responses).

Actually, no we should not close this.

Eventual consistency issues are understood and mitigated

This needs work. I'm starting to write https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment#mediawiki.page_content_change_semantics, and we should figure this out.

retry events that are considered recent, in our case a good value for recent is 10s (c.f. T279698)

@dcausse do you retry on HTTP 200 with badrevids response? I think this is the 'page deleted before event is processed' problem? Or do you only retry on recent events if there is an actual HTTP error?

Ottomata renamed this task from [Shared Event Platform] Implement error handling and retry logic when fetching data from the MW api to [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api.Jun 5 2023, 5:41 PM
Ottomata moved this task from Backlog to Sprint 14 B on the Event-Platform board.
Ottomata edited projects, added Event-Platform (Sprint 14 B); removed Event-Platform.

@gmodena what should we do with the page_content_change event when we get badrevids? Right now, we discard it, so e.g. the page create event from page_change won't exist in the page_content_change stream.

I have a feeling we should just produce it as is into page_content_change with no content...or maybe some way of indicating that we could not get the content? Not sure though. cc also @Milimetric

@gmodena what should we do with the page_content_change event when we get badrevids? Right now, we discard it, so e.g. the page create event from page_change won't exist in the page_content_change stream.

I have a feeling we should just produce it as is into page_content_change with no content...or maybe some way of indicating that we could not get the content? Not sure though. cc also @Milimetric

I briefly discussed with @Milimetric and @ArielGlenn today. This will need some more thought. We might want to forward the message with a delimited for "bad revision". This might require a schema change.

We need to understand a bit more the semantic of "bad revision", and corner cases around MW that might lead to revision to be updated in the database (to fix content issues).

I'm picking up this phab up in sprint 14B and I'll sync with @ArielGlenn and @Milimetric to explore / document these semantics.

I'm picking up this phab up in sprint 14B and I'll sync with @ArielGlenn and @Milimetric to explore / document these semantics.

I summarized my findings re failure semantics of page content change enrichment at https://wikitech.wikimedia.org/wiki/MediaWiki_Event_Enrichment#mediawiki.page_content_change_semantics

More feedback and corrections if I got something wrong are welcome. In terms of action points some MRs/CRs will be linked to this task.

The main change I want to propose is to add a is_bad_rev_id to the revision entity fragment of page_change. The field is boolean because it is pretty much domain specific and coupled to Mediawiki APIs. In some cases, additional metadata already present in page_change (e.g. visibility status) might help with further qualifying the error type.

When the page_content_change application receives a "badrevids" response from Mediawiki API, an enriched event will be published with no content body and is_bad_rev_id set to true.

This will notify consumers that reconciliation of the stream with some ground truth might be necessary (depending on use cases and SLOs error budget).

Nice!

Small correction:

Content suppression: page is deleted before we had a chance to read its content

Let's avoid calling this content suppression, as that means something different than page deletion. Actual content suppression (is_content_visible: false) can only happen on revisions that are not the current version of the page.

When the page_content_change application receives a "badrevids" response from Mediawiki API, an enriched event will be published with no content body and is_bad_rev_id set to true.

Let's do a little bikeshedding of this name. Action API does use the term badrevids, but nothing else does. The fact that both MW REST and WMF REST APIs don't use this term might indicate that we should call this something else.

What is a proper/good term in MediaWiki for a page or revision that has been deleted, or is not present for some other reason? is_missing? cc @daniel in case you have any suggestions?

Actually, a codesearch for 'badrev' returns some more results, including a BadRevisionException in MW core.

retry events that are considered recent, in our case a good value for recent is 10s (c.f. T279698)

@dcausse do you retry on HTTP 200 with badrevids response? I think this is the 'page deleted before event is processed' problem? Or do you only retry on recent events if there is an actual HTTP error?

We do not use the action API in production yet so the only experience I have is with wikibase Special:EntityData and thus we only rely on the HTTP status code, 404 we retry for 10s (eventual consistency)
For badrevids I think that you should do the same, my impression is that the revision id you are requesting is not yet available on the mysql replica you're hitting.
I might be mistaken but content suppression appears with a "texthidden": true flag (see the content rvprop here: https://www.mediawiki.org/wiki/API:Revisions) in the api response, not in the badrevids section.

I might be mistaken but content suppression appears with a "texthidden": true flag (see the content rvprop here: https://www.mediawiki.org/wiki/API:Revisions) in the api response, not in the badrevids section.

Correcting myself, deleted pages will appear in the badrevids section too, but I agree that naming this field with bad revision sounds a bit misleading in this context.

For badrevids I think that you should do the same, my impression is that the revision id you are requesting is not yet available on the mysql replica you're hitting.

You have a point here. However. if we assume that client must reconcile data to account for eventual consistency of replicas, I wonder it this makes sense for our use case. I analyzed "badrevids" events from the enrichment application error topic, and in 98.5% of case the revision was still bad after a retry. This retry happened days after the event was first processed.

Do you have some metrics about how often fast retries on lagging replicas are successful (that is, some stats how long lag tends to be)?

Correcting myself, deleted pages will appear in the badrevids section too, but I agree that naming this field with bad revision sounds a bit misleading in this context.

You are right.

I was under the impression that "bad revision" was a canonical term used for cases like this. After a chat with @daniel it looks like it's not the case.

Let's re-think terminology. Maybe "missing content" (and is_content_body_missing in page_change) could be more appropriate?

I analyzed "badrevids" events from the enrichment application error topic, and in 98.5% of case the revision was still bad after a retry. This retry happened days after the event was first processed.

This might be a bit misleading. The sample size contains historical data, and we can't discriminate the reason of a badrevids response.

For badrevids I think that you should do the same, my impression is that the revision id you are requesting is not yet available on the mysql replica you're hitting.

You have a point here. However. if we assume that client must reconcile data to account for eventual consistency of replicas, I wonder it this makes sense for our use case. I analyzed "badrevids" events from the enrichment application error topic, and in 98.5% of case the revision was still bad after a retry. This retry happened days after the event was first processed.

Do you have some metrics about how often fast retries on lagging replicas are successful (that is, some stats how long lag tends to be)?

I did a quick analysis a couple years ago in T279698 (wikidata only), on a period of 8 days out of 390 events triggering a HTTP 404 (assuming similar badrevids semantic here), 249 could be salvaged with a retry. Most of the ones that could not be salvaged are the ones processed 10s after their event-time (late processing). In the end I found that retrying a 404 when the processing_time - event_time < 10s is generally worthwhile.

I did a quick analysis a couple years ago in T279698 (wikidata only), on a period of 8 days out of 390 events triggering a HTTP 404 (assuming similar badrevids semantic here), 249 could be salvaged with a retry. Most of the ones that could not be salvaged are the ones processed 10s after their event-time (late processing). In the end I found that retrying a 404 when the processing_time - event_time < 10s is generally worthwhile.

Thanks for the pointer. If anything, we might need to revisit the retry & timeout policy on 404s.

Oh, this is interesting. This bit of info could be useful to reason about retries:
The following HTTP headers are set:
Retry-After: a recommended minimum number of seconds that the client should wait before retrying
X-Database-Lag: The number of seconds of lag of the most lagged replica

I think we should adopt (or at least evaluate) the approach they recommend for lag mitigation. @dcausse did you come across this param? Is it something you considered?

I think we should adopt (or at least evaluate) the approach they recommend for lag mitigation. @dcausse did you come across this param? Is it something you considered?

No we did not, I thought that this param was mainly useful to slow down bot edits but here we do only reads so I'm not sure it might help unless we believe that the read operations we make can slow down replication?

Thanks for the pointer. If anything, we might need to revisit the retry & timeout policy on 404s.

Just realized that we consider badrevids as a 404 equivalent in the search update pipeline that uses the action api.

I'm not sure it might help unless we believe that the read operations we make can slow down replication?

No, but we could use it to know if the reason we are getting badrevid is because of replica lag.

E.g.

curr_time: 10:00:05
rev_dt: 10:00:00

If replica lag > 5 seconds, then rev_dt is probably not on the replica. So using max_lag=5 in this case would let us know why we get a badrevid: because of replica lag. I think? Or am I off here?

I'm not sure it might help unless we believe that the read operations we make can slow down replication?

No, but we could use it to know if the reason we are getting badrevid is because of replica lag.

E.g.

curr_time: 10:00:05
rev_dt: 10:00:00

If replica lag > 5 seconds, then rev_dt is probably not on the replica. So using max_lag=5 in this case would let us know why we get a badrevid: because of replica lag. I think? Or am I off here?

This is the lag of the most lagged replica so it does not necessarily mean that you are hitting this one with your MW api request, if maxlag is 5s it's still possible to get the revision you ask even when the request before the event is 5s of age.
Now if:

  • using max_lag=5
  • event age is > 5s
  • the request does not fail because of maxlag
  • you have badrevids

Can you be 100% sure that it's a deleted revision not the mysql replica you're hitting being behind?

I'm not sure, I don't think the intent of maxlag is to help with eventual consistency but possible?
My impression is that naively retrying requests receiving badrevids if the event age is < 10s is a good trade-off.

This is the lag of the most lagged replica so it does not necessarily mean that you are hitting this one with your MW api request, if maxlag is 5s it's still possible to get the revision you ask even when the request before the event is 5s of age.

Next to giving an upper bound, maxlag could help add semantic information on a "no content" event. This may help downstream users with their own reconciliation / retry logic.

@otto @dcausse I wonder if the is_bad_rev_id field (pending rename) should be an enum instead of bool, and account for badrevids and maxlag errors.

My impression is that naively retrying requests receiving badrevids if the event age is < 10s is a good trade-off.

I don't disagree here.

I wonder if the is_bad_rev_id field (pending rename) should be an enum instead of bool

Ya possibly, to give some indication of why the rev is bad.

Change 931957 had a related patch set uploaded (by Gmodena; author: Gmodena):

[schemas/event/primary@master] error: add error_type field

https://gerrit.wikimedia.org/r/931957

Change 931957 merged by jenkins-bot:

[schemas/event/primary@master] error: add error_type field

https://gerrit.wikimedia.org/r/931957