Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition: replies, deletes, undos, unfollows, etc get dropped if their target post is still processing #1361

Closed
qazmlp opened this issue Oct 5, 2024 · 20 comments
Labels
bug User-facing breakage and reliability issues within Bridgy Fed. now safety Related to user safety.

Comments

@qazmlp
Copy link

qazmlp commented Oct 5, 2024

A reply just (sporadically?) didn't make it to Bluesky. The log at https://fed.brid.gy/log?path=%2Fqueue%2Fwebmention%2C%2Finbox&start_time=1728105708&key=https%3A%2F%2Fwolfdo.gg%2Fnotes%2F9yz9tw8dfgws6ex9%2Factivity shows the following

I 2024-10-05 05:21:46.205812+00:00 Already seen this activity https://wolfdo.gg/notes/9yz9tw8dfgws6ex9/activity

Relevant thread on Bluesky: https://bsky.app/profile/qazm.bsky.social/post/3l5qmbrxis223

That looks like a race condition somewhere in the ActivityPub inbox to me.
(This activity doesn't have the green checkmark for delivering it to Bluesky.)

@snarfed
Copy link
Owner

snarfed commented Oct 5, 2024

Hah, race condition indeed. That reply (call it B) was to another reply (A) by the same author that was still processing when B came in. BF saw that A wasn't bridged, so it didn't bridge B either, due to the logic that replies to non-bridged post aren't bridged.

Ugh. I can think of ways to solve this, but they're all complicated. I don't see an easy fix yet.

(The log message was unrelated, we got two inbox deliveries of this reply and that log message was for the second one. Our logic for finding log messages is not the best 🤷)

@snarfed snarfed changed the title AP => AT reply dropped erroneously Race condition: replies get dropped if their in-reply-to post is still processing Oct 5, 2024
@TomCasavant
Copy link

Let me know if you want me to file a different issue for this, but I'm pretty sure this is related to this issue

https://fed.brid.gy/ap/@[email protected] <- this bot auto reposts any post from https://fed.brid.gy/ap/@[email protected] if certain conditions are met and it boosts them basically immediately after the post is created. It looks like Bridgy is seeing the boost but doesn't do anything about it (I'm guessing because Bridgy hasn't processed the original post yet)

@snarfed
Copy link
Owner

snarfed commented Oct 14, 2024

Ah, interesting, true! That does sound likely.

@snarfed
Copy link
Owner

snarfed commented Oct 19, 2024

Got another report of this, #1393 ^, for deletes. Probably the same issue.

One way to fix this could be to error out if we get a reply, repost, quote, delete, or undo of an object that either we haven't seen yet or isn't bridged. We'd then retry the receive task 5m later, and then again 10m after that.

I'm reluctant to do that though, since we get a lot of these, and the vast majority of the time, the object in question will never be bridged. We'd end up tripling our load for those incoming activities to fix a rare edge case. 😕

@TomCasavant
Copy link

FWIW, I ended up fixing my issue by just delaying the boost by 30 seconds. Though I'm not sure how useful that info is for fixing the actual issue

@qazmlp
Copy link
Author

qazmlp commented Oct 20, 2024

One complication is that I don't think there's any guarantee an e.g. Create won't be sent after its matching Delete in ActivityPub (unless the instance cancels relevant outbox items (synchronously), which Mastodon doesn't appear to do at all).

Backoff due to connectivity issues can sporadically (and individually explicitly somewhat randomly!) reorder activities across a long time span, at least in Mastodon.

Mastodon fixes this problem with a database lock on the post id and by permanently tombstoning e.g. statuses it receives a Delete for. That creates significant 'garbage' of course, which I don't think exists as such in ATProto.
(Sidenote: That technically means tombstones have to be keyed on object id and owner in order to prevent denial of service attacks via tombstone spam against Actors whose object ids are predictable.)

I think it also asynchronously fetches the target of an Announce if it doesn't have it already, but I think in bridgy's case it would probably be better to create a trigger with a timeout somehow that resumes processing in case the post is in fact bridged, unless it's a self-boost. (Can that even reliably be distinguished without fetching the Announced post, though?)

@Tamschi Tamschi added the bug User-facing breakage and reliability issues within Bridgy Fed. label Oct 31, 2024
@snarfed
Copy link
Owner

snarfed commented Nov 6, 2024

Related: #1338

Copy link

Got another report of this, #1393 ^, for deletes. Probably the same issue.

One way to fix this could be to error out if we get a reply, repost, quote, delete, or undo of an object that either we haven't seen yet or isn't bridged. We'd then retry the receive task 5m later, and then again 10m after that.

I'm reluctant to do that though, since we get a lot of these, and the vast majority of the time, the object in question will never be bridged. We'd end up tripling our load for those incoming activities to fix a rare edge case. 😕

This did happened to me too! I forgot to mark my post as mention only lol, and deleted it second after it get posted.

Bsky post link: https://bsky.app/profile/mashi.wip.la/post/3laciq5hih5p2
Masto post link: https://masto.ai/@atou/113437693154993234

@Tamschi Tamschi changed the title Race condition: replies get dropped if their in-reply-to post is still processing Race condition: replies and deletes get dropped if their target post is still processing Nov 10, 2024
@dos1
Copy link

dos1 commented Nov 13, 2024

Just stumbled upon it as well 😆 https://bsky.app/profile/dos.social.librem.one.ap.brid.gy/post/3lau4ps7fqe62

It seems quite serious to me. If I delete something soon after posting, it's very likely because I really don't want it to stay there.

@snarfed snarfed changed the title Race condition: replies and deletes get dropped if their target post is still processing Race condition: replies, deletes, undos, etc get dropped if their target post is still processing Nov 15, 2024
@snarfed snarfed changed the title Race condition: replies, deletes, undos, etc get dropped if their target post is still processing Race condition: replies, deletes, undos, unfollows, etc get dropped if their target post is still processing Nov 15, 2024
@Donnnno
Copy link

Donnnno commented Nov 18, 2024

Ah, this happened to me too! :')
https://bsky.app/profile/neppy.cat/post/3lb43ta3tbyl2
https://bsky.app/profile/neppy.cat/post/3lb7vejx4gfw2

is there a way to force delete posts on bluesky? 👀

Copy link

Same, it'll be nice if there's at least a way to delete it

@Tamschi
Copy link
Collaborator

Tamschi commented Nov 18, 2024

I should probably put the safety label on this too, since it can lead to situations where the user is unable to depublish a post (aside from taking the whole account offline).

@snarfed I think there does need to be some level of general Delete persistence, at least due to #1361 (comment).

One option would be to have any Deletes by at least previously bridged accounts tombstone-flag posts in the database by their URI. That could also act as a synchronisation point, where the sending task first creates a pending-flag in that same place, which, if the Delete handler sees that while placing the tombstone, causes the latter to optimistically delete the post in the PDS.

If the Delete-handler was too fast, then the Create or Update handler will see the tombstone when it checks after its other tasks and optimistically delete the post in the PDS on its end. Individually each delete towards ATProto could fail, but overall at least one of them should go through.

That should be fairly reasonable load-wise, at least if the URIs can be prefix-deduplicated or otherwise compressed in the database.
Even if not, this scheme is low-latency and doesn't involve retries, so there should be no(t much) scheduling load.

@Tamschi Tamschi added the safety Related to user safety. label Nov 18, 2024
@snarfed
Copy link
Owner

snarfed commented Nov 18, 2024

Agreed on the safety label and priority here. Re design, we could definitely add persistence for Deletes, but I'm still inclined to try #1361 (comment) first, ie just retry Deletes and Undos where we don't have a matching object or activity yet. Simpler, and with our current config of retrying after 5m and then again 10m later, we'd have a 15m total window to wait for the missing object/activity. Should almost always be enough!

@snarfed snarfed added the now label Nov 18, 2024
@snarfed
Copy link
Owner

snarfed commented Nov 19, 2024

Another approach here could be to delay initial processing of Deletes and Undos by a minute or two. That should catch the vast majority of these, enough that we might not need to add the retries and load I worried about.

@Archenoth
Copy link

Archenoth commented Nov 20, 2024

One possible way to work around this in the short term could be to add a button to posts on https://fed.brid.gy/ap/<account> pages to "refresh" individual posts

image

That way, if someone notices weirdness, they could click something to fix it!

Also, because this would just be a "refresh", you also wouldn't need to worry about authentication, or users trying to delete other people's posts

@snarfed
Copy link
Owner

snarfed commented Nov 29, 2024

^ I shipped a 2m delay for delete, undo, and stop-following, which should hopefully fix this. Fingers crossed!

@snarfed snarfed closed this as completed Nov 29, 2024
Copy link

Thank you! But how about the existing deletes that failed to bridge before? Is there anything we can do?

@snarfed
Copy link
Owner

snarfed commented Nov 30, 2024

@atouu I can delete them manually. I've already deleted the one you mentioned in #1361 (comment) . Let me know if you have others!

@Archenoth
Copy link

Like, in this issue directly? Or..?
(Because I imagine there are quite a few of these between the people who might be reading this)

@snarfed
Copy link
Owner

snarfed commented Nov 30, 2024

Yup! Here is fine, or emailed to [email protected]. I'd fix them myself if there was an obvious way to find them in our logs, but sadly there isn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug User-facing breakage and reliability issues within Bridgy Fed. now safety Related to user safety.
Projects
None yet
Development

No branches or pull requests

8 participants