Handoff Proton service to Reading Infrastructure
Closed, ResolvedPublic

Description

As the work on the new Proton service is done, we need to make sure that all requirements are fulfilled before we hand off the service to Reading Infrastructure.

Acceptance criteria

  • Verify that the code documentation is up to date
  • Project documentation exists on MediaWiki.org
  • Deployment instructions are available on Wikitech
  • Provide code coverage
  • Provide documentation about logs/grafana/stats
  • People from Reading Infrastructure have rights to access/deploy proton (see T211382)
  • Verify Proton can handle Queue timeouts properly
  • Security review (T177765)
  • Drop proton-staging as it is not used (chromium-pdf.reading-web-staging.eqiad.wmflabs) - can be done after the handoff

Signoff notes

security review T177765

The security review was re-closed in T177765#4868192 after a handful of subtasks were created.

Event Timeline

@Jhernandez are there any points you would like to add to the Acceptance Criteria section?

A couple of things come to mind:

  • A list of the current maintenance tasks with steps/descriptions
  • Project page on mediawiki.org with relevant information
  • A meeting to sync after the documentation has been delivered and the team has had a chance to look at it and prepare any questions

@Mholloway @bearND, anything else you can think about?

I hope the project page on mediawiki.org with relevant information also includes:

  • an architecture overview showing the major components. I know there's a queue involved. What are the other parts, etc.?
  • links to repo(s)
  • anything useful for debugging
  • how to solve/investigate typical issues
Aklapper renamed this task from Handoff Proton service to Reading Infrastucture to Handoff Proton service to Reading Infrastructure. Nov 29 2018, 6:10 AM

@Jhernandez could you provide me with a list of people who require access to deploy/manage the Proton service? Just @Mholloway and @bearND?

@pmiazga I'm not sure who will be working on it on the team so I'd suggest adding all the engineers, except for James who is working on multimedia right now: @bearND @Mholloway @MSantos @Tgr

In the spirit of minimising the number of tasks that Infra take on immediately after taking up maintenance of the service, we should squash T210460: Eliminate usage of mocha-eslint for Proton too.

Unit test coverage can be checked by running npm run coverage:

=============================== Coverage summary ===============================
Statements   : 74.02% ( 436/589 )
Branches     : 44.83% ( 104/232 )
Functions    : 90% ( 54/60 )
Lines        : 74.61% ( 432/579 )
================================================================================

Most important bits:

  • queue has 98.94% coverage
  • renderer.js has 68.42% coverage (there are no tests for the extra safety checks when Chromium returns something it shouldn't, for aborting requests, or for killing the Chromium browser if it refuses to exit)
  • html2pdf-v1.js route has 78.57% coverage (aborting the request and the cases where the HTTP request returns 500/503 are not tested)

The untested bits are swagger-ui.js and queueLogger.js, which IMHO are not that important to test. QueueLogger subscribes to many Queue events and logs them; if something is wrong with QueueLogger, both the logs and the Grafana dashboards would be empty.
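To illustrate the point about QueueLogger, here is a minimal sketch of the subscribe-and-log pattern. It is not the Proton code: the event names (enqueue, done, timeout), the job shape, and the in-memory logger are all hypothetical, and the queue is stood in for by a plain Node.js EventEmitter.

```javascript
// Sketch only: the event names ('enqueue', 'done', 'timeout'), the job
// shape and the logger interface are hypothetical, not taken from Proton.
const { EventEmitter } = require('events');

class QueueLogger {
  /**
   * @param {EventEmitter} queue - emits lifecycle events for queued jobs
   * @param {{ info: Function, warn: Function }} logger - log sink
   */
  constructor(queue, logger) {
    // Each subscription turns one queue event into one log entry.
    queue.on('enqueue', (job) => logger.info(`queued ${job.id}`));
    queue.on('done', (job) => logger.info(`finished ${job.id}`));
    queue.on('timeout', (job) => logger.warn(`timed out ${job.id}`));
  }
}

// Collect log lines in memory so the behaviour is easy to inspect.
const lines = [];
const memoryLogger = {
  info: (msg) => lines.push(`INFO ${msg}`),
  warn: (msg) => lines.push(`WARN ${msg}`),
};

// Stand-in for the real queue: anything that emits events will do here.
const queue = new EventEmitter();
new QueueLogger(queue, memoryLogger);

queue.emit('enqueue', { id: 'job-1' });
queue.emit('timeout', { id: 'job-1' });

console.log(lines);
```

Because the logger only listens, the queue keeps working if the logger is never attached or its subscriptions are wrong; the only visible symptom is missing log lines and metrics, which is why an empty dashboard would be the first sign of a problem.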

Created ticket to grant access rights to the service: T211382

Thanks for the thorough introduction, @pmiazga!

Some notes I have from the meeting (not blockers, I just want to put them somewhere):

  • Discuss with ops how Chromium updates should work. If we pin Chromium, we don't get security updates, which is not great. If we don't pin it, the service can break without warning whenever Puppet updates the OS packages, which is also not great. We should also document the steps for testing with a specific Chromium version on the beta server.
  • Need to find out how the service interacts with Varnish. (I imagine requests go through Varnish and get coalesced like everything else; whether to cache large PDF responses is not so trivial though. Also our multi-layered Varnish setup was meant for traffic spikes, which probably don't really happen for PDF downloads, so storing them in both a frontend and backend Varnish seems like a waste of space.)
  • T177765#4822198
  • File followup tasks about (eventual) language variant support.
  • We might want to figure out how to make the beta cluster service use production page content.

Note to self, it would be nice to have a vagrant role for this.

The information on how to deploy (and the beta cluster server name) is on Wikitech. Today I'll push the tool that builds the API links for testing to Toolforge. @alexhollender verified the PDFs and they look good. Now we wait until Reading Infrastructure says "we take it from now on".

Jdlrobson updated the task description.

Reflected your comment on the task relationships. If the three production blockers don't block the handoff, I can fix that.

To clarify what I said above, I think the three tasks should block the production switchover. I have no opinion on whether they should block the handoff - that's something for managers to battle out :) I see no problem with RI taking over Proton now, then finishing up those issues, then deploying to traffic. (They are not much effort, in any case - T213363 is pretty much done at this point, T213362 is just a few lines of code and T213366 probably just involves waiting for ops to respond.) I should probably have said this sooner, sorry for the confusion.

Removed the last two open subtasks so that they block only the deployment task, not the handoff task.

Drop proton-staging as it is not used (chromium-pdf.reading-web-staging.eqiad.wmflabs) - can be done after the handoff

I've stopped the instance for now until @pmiazga confirms that it should be deleted.

This doesn't block me signing off this task.

🎉

Per our (@Jhernandez, @Tgr, @pmiazga, and me) conversation in last Thursday's Audiences Platform Sync, Proton has officially been handed over to Reading Infrastructure for ongoing maintenance. I'll follow up with an email to [email protected] and [email protected] to confirm this.