Page MenuHomePhabricator

Wikisource Export: Cache all API requests
Closed, ResolvedPublic5 Estimated Story Points

Description

As a Wikisource user, I want the credits information to be cached, so that generation of some ebooks. (in cases when the same book is being generated) is more quick and efficient.

Background: Follow-up to T222936. We'd like to eventually cache exported files on disk, but how we go about this will depend on how we implement the job queue (T253283). For now, we can just add simple 10-minute (configurable) caching to the result of all API requests (including those to phetools/credits.py, which is among the slowest bits). First we'll want to switch all API requests to use Symfony's HTTP Client ([insert ticket here]), then we can utilizing its built-in caching.

Acceptance criteria:

  • Cache all API requests
  • Cache adapter should be configurable, default to the file system
  • Cache period should be configurable, default 10 minutes

Event Timeline

MusikAnimal renamed this task from Wikisource Export: Cache credits information to Wikisource Export: Cache all API requests.Oct 21 2020, 9:39 PM
MusikAnimal updated the task description. (Show Details)

I have discussed this ticket with @satdeep_gill and we're wondering if the cache can be for longer than 10 minutes (for example, 12 hours), if there is a way to check for changes/if the cache system is updated if there are changes?

ifried set the point value for this task to 5.Oct 22 2020, 5:43 PM
ifried moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".Oct 22 2020, 7:40 PM
ARamirez_WMF set Due Date to Nov 4 2020, 5:00 AM.

The first part of this is ready for review: https://github.com/wsexport/tool/pull/272

As for the cache duration, any time is as easy as any other (although we should keep an eye on the storage usage of the cache if it's too long; I don't think 12 hours is likely to be an issue though). That'll come in the 2nd patch (which is https://github.com/wsexport/tool/pull/266 and WIP).

@Samwilson I am finding that, with caching enabled, images are not being included in ebooks (e.g. https://wsexport-test.wmflabs.org/?lang=en&page=Life+in+a+Thousand+Worlds&format=epub-3&fonts=)

They do appear when I disable caching (e.g. https://wsexport-test.wmflabs.org/?lang=en&page=Life+in+a+Thousand+Worlds&format=epub-3&fonts=&nocache=1).

Could you check if you are seeing the same thing?

@Samwilson I am finding that, with caching enabled, images are not being included in ebooks (e.g. https://wsexport-test.wmflabs.org/?lang=en&page=Life+in+a+Thousand+Worlds&format=epub-3&fonts=)

They do appear when I disable caching (e.g. https://wsexport-test.wmflabs.org/?lang=en&page=Life+in+a+Thousand+Worlds&format=epub-3&fonts=&nocache=1).

Could you check if you are seeing the same thing?

Pinging @Samwilson again, in case this got lost over Christmas ;)

Sorry, I did see this but you're right it promptly got forgotten in favour of Christmas pudding.

It looks like we're hitting this bug: https://github.com/Kevinrob/guzzle-cache-middleware/issues/82

I'll add a workaround (or fix the bug).

With API requests now being cached, you should notice (on wsexport-test) that after the first export of a book, subsequent exports of the same book are quicker.

I have not detected anymore bugs with assets not being included.

On wsexport-test, I exported a number of books multiple times in a row and checked the files sizes did not change (by more than a few bytes).

I also compared the size of the same ebook on wsexport-test and production. There were differences in size, but mostly less than 5KB. The only significant differences in size were due to already known bugs.

I chose books from Category:Featured_texts, which are relatively large books, most of which have images.

Due to the caching period being set to 12 hours on wsexport-test, I couldn't practically test what happens when the cache time runs out. However, on my local version I did reduce the cache period so I could experiment with exporting ebooks after the period runs out. I did not detect any problems.

@ARamirez_WMF: Hi, the Due Date set for this open task is more than three months ago. Can you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

This is now on production. We have also updated the cache time to be 6 hours to help prevent shortages of disk space. As we have resolved some previous issues identified with this work, I'm marking this as Done.

Talking of disk space, we could probably increase that 6 hours again now, because we've got plenty of space:

Screenshot_2021-03-10 Cloud VPS Project Board - Grafana.png (666×1 px, 69 KB)

The little zigzgs are 6 hours; if their frequency halved we'd still be unlikely to get to 50% usage I think, but more people would be served with cached books (presumably; I guess we don't know that till after T270770).