
Image cache issue when 'over-writing' an image on commons
Closed, Duplicate (Public)

Assigned To
Authored By
Stemoc
Nov 19 2015, 5:29 AM
Referenced Files
F3047381: slnqlmsyQg4Ei8bS_5lbgFA.png
Dec 4 2015, 3:37 PM
F3029094: Screenshot_2015-11-30_15-25-33.png
Nov 30 2015, 3:30 PM
Tokens
"The World Burns" token, awarded by Liuxinyu970226, Thibaut120094, zhuyifei1999, Perhelion, and Steinsplitter.

Description

Recently I have seen that when you upload over a previous image, the changes don't seem to show up (for the picture at least; the listed resolution and size are current). I thought it was an issue with my browser at first, but then I tried with 4 other browsers on my PC and they all showed the same previous image. The thumbnail in the file history may or may not look updated, it depends. Here are two examples:

  1. I made a major change to this image (https://commons.wikimedia.org/wiki/File:Kim_Jong-Un_Photorealistic-Sketch.jpg) but it doesn't look changed at all, unless you look at the thumbnail below.
  2. I tried this on another image, a higher-resolution one, and as before pressed "Ctrl+F5" multiple times, but it still did not update to the "less pinkish" recent version (https://commons.wikimedia.org/wiki/File:Bernie_Sanders_September_2015.jpg).

An admin on Commons told me he has had issues like this as well, and another user on IRC told me to change the pixel size by 1 to get the "current" version of the image... is this a known issue?

I'm on FF43b1 on Windows Vista, just in case.

Related Objects

Status | Assigned
Resolved | aaron
Resolved | aaron
Resolved | Ottomata
Duplicate | BBlack
Resolved | ema
Resolved | aaron
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | BBlack
Resolved | ema
Resolved | ema
Resolved | ema
Resolved | BBlack
Resolved | ema
Resolved | daniel
Resolved | GWicke
Resolved | Ottomata
Invalid | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | RobH
Resolved | Ottomata
Resolved | Cmjohnson
Resolved | elukey
Resolved | RobH
Resolved | mobrovac
Resolved | Eevans
Declined | csteipp
Resolved | csteipp
Resolved | GWicke
Resolved | ssastry
Resolved | Pchelolo
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | Ottomata
Resolved | madhuvishy
Resolved | Ottomata
Resolved | madhuvishy
Resolved | Ottomata
Resolved | mobrovac
Resolved | mobrovac
Resolved | BBlack
Resolved | ema
Open | None
Resolved | CDanis

Event Timeline

There are a very large number of changes, so older changes are hidden.

I can confirm this issue again with:

  1. File:Faro_card_game.jpg -- reverts did not update the "original file".
  2. A Help desk thread about File:Continuity_of_the_Exponential_at_0.svg -- in this case only the PNG thumbs work as expected; the "original file" fails.

Possible regression?

(Setting priority to High as this issue affects a wide range of reuploads.)

The issue is likely related to the varnish cache (tested with X-Wikimedia-Debug):

$ curl https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg 2> /dev/null | md5sum
f9b13ad5cc4d5a44b0f2554570097186  -
$ curl -H 'X-Wikimedia-Debug: 1' https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg 2> /dev/null | md5sum
5c7493f4df7baa9d9e4a15107245d709  -
faidon added a subscriber: BBlack.
ori removed ori as the assignee of this task. (Nov 20 2015, 2:51 PM)
faidon raised the priority of this task from High to Unbreak Now! (Nov 20 2015, 2:51 PM)

The vhtcpd stats look weird to me ( http://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Upload%20caches%20eqiad&r=month&st=1448034519&host_regex= ). Right now vhtcpd_inpkts_sane seems to be hovering in the single digits, where last week it was between 100 and 200. And on Monday there seemed to be some giant spike.
(Edit: I realize now that might be a reasonable result of splitting up the upload and normal HTCP streams.)

Anyways, unless there was some change in measurement methodology, that doesn't look right to me.

Given the timing of this all, I wonder if it has something to do with moving upload htcp to its own ip ( 290639a96e381d8 )

Yeah, the rates actually did look "correct" in the sense that when filtering only for upload purges, the packet rate in the original combined stream was very low compared to the much higher rate of text purges. That was one of the drivers for the split, to avoid sending all of the pointless text-cluster purges to the numerous upload caches.

We should try to reproduce this reliably and test per-datacenter on the command line (using curl with --resolve to point at a specific upload-lb.$site.wikimedia.org).
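
For reference, a per-datacenter comparison along those lines might look roughly like this. This is a sketch only, assuming the four sites listed and that dig returns a usable edge IP for each; --resolve pins the request to that specific edge:

# Compare the bytes served for one URL by each edge site; a differing md5sum flags a stale cache.
url='https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg'
for site in eqiad codfw esams ulsfo; do
    ip=$(dig +short "upload-lb.${site}.wikimedia.org" | head -n1)
    sum=$(curl -s --resolve "upload.wikimedia.org:443:${ip}" "$url" | md5sum | cut -d' ' -f1)
    echo "$site $ip $sum"
done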

I tried testing with https://commons.wikimedia.org/wiki/File:2014_ebola_virus_epidemic_in_West_Africa_simplified.svg yesterday. I think I tested all data centres, and none of them were getting purges.

I observed this effect today on two images from a multi-page DjVu file, https://commons.wikimedia.org/wiki/File:Henryk_Sienkiewicz_-_Wiry_01.djvu :

  1. https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Henryk_Sienkiewicz_-_Wiry_01.djvu/page146-1024px-Henryk_Sienkiewicz_-_Wiry_01.djvu.jpg
  2. https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Henryk_Sienkiewicz_-_Wiry_01.djvu/page147-1024px-Henryk_Sienkiewicz_-_Wiry_01.djvu.jpg

A new version of the file was uploaded yesterday.

I resolved this problem by manually purging the appropriate file pages on Commons a few times:
https://commons.wikimedia.org/w/index.php?title=File%3AHenryk_Sienkiewicz_-_Wiry_01.djvu&page=146&action=purge
https://commons.wikimedia.org/w/index.php?title=File%3AHenryk_Sienkiewicz_-_Wiry_01.djvu&page=147&action=purge
A single purge did not help (maybe I should have waited longer for the effect, or maybe it behaves a bit randomly?).

Just FYI: while working on Wikisource with Commons-based DjVu files, I have observed this effect for years. It likely happens when a new file version is uploaded while thumbnails from the old version are still being generated or are extensively used on other wikis. It may also be correlated with long response times from the Commons servers (high load?), but I'm unsure.

So I was just helping someone with a thumb caching issue on IRC:

https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg/220px-Police_-_rue_de_la_R%C3%A9publique_Saint-Denis_-_18_nov_2015.jpg was returning an old cached thumb on esams (but not the other clusters). Purging it worked to clear the file, so perhaps the issue in this bug is intermittent, or maybe there's just a lot of packet loss around.

Apparently it's impossible to download the overwritten version of the image unless you change the pixel size, and the "Original file" link will download the previous version instead.

Change 255807 had a related patch set uploaded (by BBlack):
Revert "purging: do not VCL-filter on domain regex"

https://gerrit.wikimedia.org/r/255807

Change 255808 had a related patch set uploaded (by BBlack):
Revert "upload purging: do not listen on text/mobile addr"

https://gerrit.wikimedia.org/r/255808

Change 255807 merged by BBlack:
Revert "purging: do not VCL-filter on domain regex"

https://gerrit.wikimedia.org/r/255807

Change 255808 merged by BBlack:
Revert "upload purging: do not listen on text/mobile addr"

https://gerrit.wikimedia.org/r/255808

Above I've done some partial reversion of the multiple commits involved in the multicast split work as an experiment.

At this point, MediaWiki will still be splitting HTCP to separate addresses for the text/mobile and upload sets, but the original domain filters are back in place and upload is once again listening to the text/mobile multicast address in addition to its own. If this experiment doesn't resolve the issue (for new upload->purge; it won't fix existing issues), the next step would be to revert the mediawiki-config commit https://gerrit.wikimedia.org/r/#/c/249121/ . If that also doesn't resolve it, then it's most likely unrelated to the multicast changes.

zhuyifei1999 claimed this task.

No further reports of this bug AFAIK. Bug presumably resolved.

https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg <= reverted to wrong version? :P
https://commons.wikimedia.org/wiki/File:John_Lennon_Rock%27n_Roll_-_Hamburg.jpg <= Did you try clearing your browser cache? I'm finding nothing wrong here. Maybe it's a geographic difference (loading the cached image from different datacenters), or simply a browser-cached old version.

Regardless, even if the bug appears "resolved" now (at least, back to normal rare levels of occurrence), my changes were just experimental - we'll still need to track down what the real cause is and then put real fixes in place that still allow the traffic to be split.

Geographic location may play a role here, but then we would have a data sync issue somewhere. The thumb from the first file is still wrong (multiple browsers tested, purge had no effect); the full-size version may be OK. The second image seems to be resolved.
This is actually a major problem if a purge command is not passed through at all.

For me, the first one linked above in zhuyifei's comment looks like this:

Screenshot_2015-11-30_15-25-33.png (833×1 px, 565 KB)

(the picture contents on the right in the above screenshot has a license noted here) >.>

Thumbnail is wrong.

For the 388px size of https://commons.wikimedia.org/wiki/File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg

esams was returning the wrong version (I've since purged that file, so you can't see it anymore). The Age header was set to 94485 (26.2 hours ago) [at the time of writing it's Mon Nov 30 20:26:02 UTC]. Thus this is an example that happened after the revert in question went through.

For the 77px size of the same image - https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg/77px-Jennifer_Winget_at_the_launch_of_Watch_Time%27s_magazine_11.jpg

esams was showing the wrong version with Age: 96131; X-Cache: cp1050 hit (3), cp3038 hit (40), cp3033 frontend miss (0); Last-Modified: Sun, 29 Nov 2015 18:02:03 GMT. codfw was showing the wrong version with Age: 96141; X-Cache: cp1050 hit (1), cp2014 hit (9), cp2014 frontend miss (0); Last-Modified: Sun, 29 Nov 2015 18:02:03 GMT. Note that the version being incorrectly displayed was uploaded at 2015-11-29T17:51:53 UTC, and the later version which should have overridden it was uploaded at 2015-11-29T20:32:24. Thus a race condition seems unlikely.

[Subsequently, the cache has been purged.] ulsfo and eqiad were returning correct versions.

All this sounds like there is just exceptionally high packet loss on 239.128.0.113 relative to 239.128.0.112 (otherwise, why is it so intermittent?). Given we go to all the trouble of giving HTCP packets sequential IDs per host, it would be nice if we could set something up to keep track of how many IDs are received, so we could quantify the rate of packet loss.
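
Short of teaching vhtcpd to report those sequence IDs, a crude loss estimate is possible by counting purge packets over the same window on the sending appserver and on one upload cache, then comparing the totals. A sketch only; the interface name and the 5-minute window are arbitrary assumptions:

# Run the same capture simultaneously on an appserver and on an upload cache host,
# then compare the two line counts; the difference approximates purge packets lost in transit.
sudo timeout 300 tcpdump -n -l -i eth0 'udp and dst host 239.128.0.113 and dst port 4827' 2>/dev/null | wc -l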

Well anything's possible, but AFAIK there's no special treatment of 239.128.0.113 vs .112 in terms of multicast routing. We've always faced some small percentage of loss due to dropped UDP packets, but for the moment that's considered acceptable vs the costs of re-engineering a more-reliable mechanism for transmitting purges... but if anything that situation should have improved with the multicast split, as they're now queued separately on the sending and receiving host sockets.

The other general issue we face (both historically and now) is the cache-layering race: these images are stored in multiple layers of cache (2-3, depending on the datacenter of the request), and no attempt is currently made to order the invalidations to happen in the bottom layers first, so it's possible for an upper layer to process an invalidation and then refresh the image from a lower layer before said lower layer has processed the same invalidation, and thus end up with the wrong image back in cache.

The split of the multicast addresses could also potentially have negatively affected those races... perhaps being queued with (interleaved with) the higher-volume text invalidation traffic slowed them down enough in certain cases to make it less of an issue. The best option right now would be to actually trace the HTCP traffic in response to manual purges and debug it from there, which I'll try to do this week...

> The other general issue we face (both historically and now) is the cache-layering race: these images are stored in multiple layers of cache (2-3, depending on the datacenter of the request), and no attempt is currently made to order the invalidations to happen in the bottom layers first, so it's possible for an upper layer to process an invalidation and then refresh the image from a lower layer before said lower layer has processed the same invalidation, and thus end up with the wrong image back in cache.

Sample size of 1, but the last-modified timestamps on the 77px of File:Jennifer_Winget_at_the_launch_of_Watch_Time's_magazine_11.jpg don't seem consistent with that happening. [I've amended my previous comment to include the Last-Modified header.]

I received a user complaint about https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/PL_Karol_May_-_Dolina_%C5%9Amierci.djvu/page191-1024px-PL_Karol_May_-_Dolina_%C5%9Amierci.djvu.jpg - esams indeed had an older version it was serving (other caching centers were OK). The odd thing was, I had to do the purge 3 times before it seemed to go through for esams (and the reporter indicated that s/he had tried purging too, to no effect).

So, I did some tracing this morning using one of the files mentioned earlier in this ticket. The Commons file in question is https://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg . You'll note the page lists the upload.wm.o links for the original as well as scaled "thumbs" at 4 different widths:

https://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg
https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/200px-Continuity_of_the_Exponential_at_0.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/500px-Continuity_of_the_Exponential_at_0.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/1000px-Continuity_of_the_Exponential_at_0.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/2000px-Continuity_of_the_Exponential_at_0.svg.png

I pulled all of these up in a browser and they're all valid and working, and at least visually appear to have the same data currently.
Then I did several purges of this file using the browser URL https://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg?action=purge while tracing HTCP network traffic on cp1071.eqiad.wmnet (an upload cache local to eqiad, so there should be no routing-loss issues to even think about).
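
The dumps below are ngrep-style output; a capture roughly like the following on the cache host (the interface name is an assumption) produces that format and watches both multicast groups:

# Print HTCP purge payloads for both the text/mobile (.112) and upload (.113) groups.
sudo ngrep -q -d eth0 '' 'udp and dst port 4827 and (dst host 239.128.0.112 or dst host 239.128.0.113)'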

In my first several purges, I observed the following HTCP (note commons URLs on .112 and upload URLs on .113 as expected with the split):

U 10.64.32.45:59104 -> 239.128.0.112:4827
  .q...k....$.....HEAD.Mhttp://commons.wikimedia.org/wiki/File:Continuity_of_the_Exponential_at_0.svg..HTTP/1.0....

U 10.64.32.45:59104 -> 239.128.0.112:4827
  ..........$.....HEAD.ihttp://commons.wikimedia.org/w/index.php?title=File:Continuity_of_the_Exponential_at_0.svg&action=history..HTTP/1.0....

U 10.64.32.45:33277 -> 239.128.0.113:4827
  ..........$.....HEAD..https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Continuity_of_the_Exponential_at_0.svg/1000px-Continuity_of_the_Exponential_at_0.svg.png..HTTP/1.0....

U 10.64.32.45:51868 -> 239.128.0.113:4827
  .~...x....$.....HEAD.Zhttps://upload.wikimedia.org/wikipedia/commons/b/be/Continuity_of_the_Exponential_at_0.svg..HTTP/1.0....

Then a few minutes later when I tried several times again, I got all of the above but without even the 1000px variant (just 2x commons + 1x upload HTCP for the main image itself). None of the purges ever sent an HTCP for the 200px, 500px, or 2000px variants. So I can definitely see how action=purge isn't going to fix all broken thumbnails - whether by bug or by design, it doesn't even attempt to purge them all, and that has nothing to do with the multicast split.

If we assume uploading a new image runs the same code as action=purge for purging the upload.wm.o image and thumbs, then the same problem is going to exist there as well...

Testing further, there's currently a complaint on commons about: https://commons.wikimedia.org/wiki/File:Rolamite.svg at https://commons.wikimedia.org/wiki/Commons:Help_desk#New_uploaded_version_of_.SVG_file_won.27t_become_the_current_version - the scaled image shown there on the commons page itself as a preview has the upper wheel on the right, whereas the correct image (and all the scaled upload links below the preview) have the upper wheel on the left. action=purge gives these HTCP:

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....|....K-....HEAD.^https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/1000px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K.....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/100px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K/....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/120px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K0....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/180px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....|....K1....HEAD.^https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/2000px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K2....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/200px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K3....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/220px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K4....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/240px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K5....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/500px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:51635 -> 239.128.0.113:4827
  .....{....K6....HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/768px-Rolamite.svg.png..HTTP/1.0....

U 10.64.48.90:56915 -> 239.128.0.113:4827
  .d...^....K7....HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

U 10.64.48.90:44306 -> 239.128.0.112:4827
  .W...Q....K8....HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.48.90:44306 -> 239.128.0.112:4827
  .s...m....K9....HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

Which seems to be covering a lot of thumbs, but not the thumb that's actually being used as the preview image on commons, which is: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/512px-Rolamite.svg.png .

Trying two more times immediately afterwards, I only got:

U 10.64.48.90:45347 -> 239.128.0.113:4827
  .d...^....Kd....HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

U 10.64.48.90:46624 -> 239.128.0.112:4827
  .W...Q....Ke....HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.48.90:46624 -> 239.128.0.112:4827
  .s...m....Kf....HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

U 10.64.32.42:55295 -> 239.128.0.112:4827
  .W...Q..........HEAD.3http://commons.wikimedia.org/wiki/File:Rolamite.svg..HTTP/1.0....

U 10.64.32.42:55295 -> 239.128.0.112:4827
  .s...m..........HEAD.Ohttp://commons.wikimedia.org/w/index.php?title=File:Rolamite.svg&action=history..HTTP/1.0....

U 10.64.32.42:44759 -> 239.128.0.113:4827
  .....{..........HEAD.]https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Rolamite.svg/320px-Rolamite.svg.png..HTTP/1.0....

U 10.64.32.42:45333 -> 239.128.0.113:4827
  .d...^..........HEAD.@https://upload.wikimedia.org/wikipedia/commons/f/f5/Rolamite.svg..HTTP/1.0....

Thinking this through logically: either there's a code/data bug in how we're purging thumbnails in the general case at the application level, or the remaining infrastructure-level possibility is that the outbound purge packets from the appserver processing my action=purge request are overflowing a socket buffer and are simply unreliable/lossy on the outbound side (I don't think it would be the inbound side). My gut feeling from observation is that it's the former, though, as I don't see full thumbnail purge lists coming through even occasionally in repeated tests.
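
One cheap way to sanity-check the socket-buffer theory on the appserver would be to watch the kernel's UDP error counters before and after a burst of purges. A sketch only; the exact counter names vary by kernel and net-tools version:

# UDP error counters; SndbufErrors climbing during a burst of purges would point at local drops
# on the sending side rather than multicast routing loss.
netstat -su | grep -iE 'error|buffer'
# or read the raw counters directly:
grep '^Udp:' /proc/net/snmp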

Digging through some of our repos looking for a possibly related code regression as a shot in the dark, I do see a few code changes that could be peripherally related, depending on the real deployment date of the merge to production wikis:

https://gerrit.wikimedia.org/r/#/c/252696/
https://gerrit.wikimedia.org/r/#/c/251201/

In general, I don't fully understand how we're getting from action=purge to HTCP packets for specific thumbnails - is there someone on the MW side who understands it deeply? Black-box observation indicates that we're saving a list of generated thumbnail sizes somewhere (memcache? mysql?), and maybe tracking timestamps of when they were last generated and/or purged so that we don't do duplicate purge work, which might be working against us here. The fact that the purged list seems to shrink as I test leans in that direction - we should probably always purge all known thumbs, not just the ones we think haven't already been purged since the last change...

I've also staged a revert of the mediawiki-config side of the multicast split at https://gerrit.wikimedia.org/r/#/c/256730/ . With the current pair of reversions on the puppet/cache side of things, it's fine to apply this one as well (but they'll have to be re-applied in the correct order later if we go back again). I still don't think this is the direct cause, but it may be worth trying, on the theory that thumbnail purging has basically always been slightly broken and mostly saved by timing coincidences, and that the split affected those timing coincidences significantly. If we want to try this, I think it's fair game to go ahead and merge->deploy it regardless of current freeze(s)...

The list of thumbnails to purge comes from the list of thumbnails in Swift. Since these are also deleted on purge, the next ?action=purge will have fewer thumbnails (only the ones recreated since then).

That seems problematic since swift and varnish are caching thumbnails separately. Is Swift also listening to HTCP, or does it get the purge via some other mechanism? Either way, the nasty issue here is then that if the purge succeeds on Swift and fails on varnish, further purges will never try to remove the invalid ones still stuck in varnish.

And to expound a bit more on the cache-layering problem mentioned earlier (Swift is another layer to consider!): when a thumbnail is generated in Swift and broadly accessed by users through the various cache endpoints, the places in which it's cached, and the diagram of which caches fetch from which sources, look like this:

slnqlmsyQg4Ei8bS_5lbgFA.png (720×960 px, 38 KB)

Each of those boxes with varnish-fe or varnish-be on them also represents multiple hosts: in the fe case, different users hit different equivalent fe hosts with separate caches; in the be case, there's hashing by filename, so all users would hit the same host for the same image. In all cases, a cache miss for any box in the diagram would immediately re-fetch the image from the next connected thing down on the chart.

So, if a purge request is broadcast to swift and all varnishes simultaneously, immediately, and only once, we have a big race condition here. If the image was already cached in all of those places, and the invalidation arrives at varnish-fe (esams), then a user fetches the image from varnish-fe (esams) before the invalidation has been processed at varnish-be (esams), then varnish-fe (esams) will re-fetch the outdated image and keep it long after the invalidation has finished processing at all the other layers. This is true anytime invalidate->re-fetch-for-user happens at any layer before the invalidation completes at a lower layer. The only way to avoid the race is to purge sequentially from the bottom up in the diagram...
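
Purely to illustrate that ordering constraint (the hostnames and the HTTP PURGE verb here are hypothetical; production invalidation goes over HTCP, and Swift gets HTTP DELETEs), a bottom-up purge would have to finish each layer before touching the one above it:

# Hypothetical layered purge, lowest layer first, so an upper layer that re-fetches
# mid-purge can only pick up an already-invalidated (i.e. fresh) copy.
path='/wikipedia/commons/thumb/f/f5/Rolamite.svg/512px-Rolamite.svg.png'
for host in swift-fe.example.internal varnish-be.example.internal varnish-fe.example.internal; do
    curl -s -o /dev/null -w "${host}: %{http_code}\n" \
         -X PURGE -H 'Host: upload.wikimedia.org' "http://${host}${path}"
done

Per the diagram above, that ordering would be Swift, then the eqiad backends, then the remote-site backends, then the frontends.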

>> The list of thumbnails to purge comes from the list of thumbnails in Swift. Since these are also deleted on purge, the next ?action=purge will have fewer thumbnails (only the ones recreated since then).
>
> That seems problematic since swift and varnish are caching thumbnails separately. Is Swift also listening to HTCP, or does it get the purge via some other mechanism? Either way, the nasty issue here is then that if the purge succeeds on Swift and fails on varnish, further purges will never try to remove the invalid ones still stuck in varnish.

Yeah, that's a well-known bit of suckiness in the system. The VCL-bucketing-based thumbnail purges gilles worked on would solve the listing sync problem.

Swift just gets DELETEs via HTTP, not the UDP purger.
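
For the Swift half of that, a manual DELETE of a stale thumbnail object would look roughly like this; a sketch only, where the endpoint, account, container and token are placeholders rather than production values:

# Remove one stale thumbnail object directly from Swift over its HTTP API.
# All values below are placeholders; the real token would come from the auth service out of band.
SWIFT_URL='https://swift.example.internal/v1/AUTH_example'
CONTAINER='example-thumb-container'
OBJECT='f/f5/Rolamite.svg/512px-Rolamite.svg.png'
TOKEN='placeholder-token'
curl -s -o /dev/null -w '%{http_code}\n' \
     -X DELETE -H "X-Auth-Token: ${TOKEN}" "${SWIFT_URL}/${CONTAINER}/${OBJECT}"

The varnish half, by contrast, only ever sees the HTCP UDP packets traced above, which is how the two layers' views of what exists can drift apart.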

We can try the mediawiki-config revert during one of the Monday SWATs, I think ( https://gerrit.wikimedia.org/r/#/c/256730/ ) - it may still have some pattern effect that reduces the rate back down.

https://commons.wikimedia.org/wiki/File:Megabalanus_coccopoma.jpg
Another file to run experiments with - the logo was removed but still shows in the bottom right of many thumbs.

More reports on WP:VP/T keep coming in.

The revert was pushed off for tomorrow, because SWAT today was too full of post-freeze changes...

This was merged in the SWAT that happened around 00:00-01:00 UTC Dec 8 (so about 17 hours before this post). Any new reports or observations of bad behavior since then? (Note this won't fix existing broken thumbnails, but it may affect the rate of breakage for new uploads / overwrites.)

Seems to be behaving correctly for me now. Should probably wait a bit more before closing?

Thank you for fixing this :D

We shouldn't close at all, as these fixes are just workarounds and don't get to the heart of the issue, but regardless I do want to wait for more reports for another day or so before calling this a good temporary workaround.

BBlack lowered the priority of this task from Unbreak Now! to High. (Dec 9 2015, 8:30 PM)

It happened again today: https://commons.wikimedia.org/wiki/File:KutlugAtaman.JPG
It was overwritten, reverted by me, and the intermediate file deleted. Despite multiple purges the deleted image is still shown, although some thumbs show the most recent version (or have not been updated since the first overwrite).

@Denniss - problems today are unrelated, they're from general random purge loss due to: T124418