
Serve production traffic via Kubernetes
Open, In Progress, Medium, Public

Assigned To
None
Authored By
jijiki
Sep 8 2021, 1:48 AM

Description

As we are getting closer and closer to a fully functional MW-on-K8s image, we can start discussing our in-production testing and rollout options.
(Task description will be updated as we figure out our next steps)

Background

History
When we migrated to PHP7, users were served by PHP7 based on the existence of a cookie, set by MediaWiki, and the X-Seven header, set at the traffic layer. At the application layer, Apache would route a request to the relevant backend, HHVM or php-fpm.

Having a cookie allowed us to first let beta users in, and then progressively increase the amount of anonymous user traffic served via PHP7. We then continued by converting API servers to php7_only servers, and finally converted all jobrunners. Additionally, we split the outer caching layer into PHP7-rendered pages and HHVM-rendered pages (vary-slotting). Note, though, that we did not do this for parsercache. At the time we were only using Varnish, so all of this logic was written in VCL, with some additional Apache config.

Now
This migration is slightly different:

  • The caching layer consists of Varnish and ATS (VCL and Lua)
  • The decision of where to route an incoming request will be taken at the caching layer
  • We have 4 MediaWiki clusters: api, app, jobrunners, and parsoid
  • We are older

Proposed Plans

After a brief discussion with Traffic and Performance-Team, we have:

Proposal #1: URL routing

Given that app and api servers share the same configuration, and assuming that initially we will have the same discovery URL, e.g. mw-k8s-rw.discovery.wmnet,
we can start by routing some low-traffic URLs to Kubernetes, for example https://en.wikipedia.org/wiki/Barack_Obama. When we are more comfortable, we can start migrating some small wikis and, eventually, migrate them all.
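The routing decision described above can be sketched as follows. This is a minimal illustration in Python, not the real implementation (which would live in VCL/Lua at the caching layer); the baremetal backend name and the migrated-URL set are assumptions for the example.

```python
# Hypothetical sketch of the per-URL routing decision.
# mw-k8s-rw.discovery.wmnet comes from the task description;
# the baremetal backend name and migrated sets are illustrative.

MW_K8S_BACKEND = "mw-k8s-rw.discovery.wmnet"
BAREMETAL_BACKEND = "appservers-rw.discovery.wmnet"  # assumed name

# Start with a handful of low-traffic URLs, then grow to whole wikis.
MIGRATED_URLS = {
    ("en.wikipedia.org", "/wiki/Barack_Obama"),
}
MIGRATED_WIKIS: set[str] = set()  # later: {"test.wikipedia.org", ...}

def pick_backend(host: str, path: str) -> str:
    """Route a request to k8s if its URL (or its whole wiki) has migrated."""
    if host in MIGRATED_WIKIS or (host, path) in MIGRATED_URLS:
        return MW_K8S_BACKEND
    return BAREMETAL_BACKEND
```

Because the decision is a pure function of the URL, a given article is always served by the same backend, which is what keeps the edge cache unpolluted.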

Pros

  • No complex and dangerous VCL and Lua changes
  • Edge cache will not be polluted since we will always have the k8s-rendered article
  • Easy edge cache invalidation (single pages or entire wikis)

Cons

  • Less control over the traffic served
  • Won't be able to create a beta feature
  • Longer rollout
  • Slightly more complex rollbacks (traffic layer change + edge cache invalidation)

Beta users

In parsercache we have the ability to specify a key prefix and TTL for specific objects. Additionally, logged-in users bypass the caches in our traffic layer. Given that, we could possibly have beta users:

  • A user has a special cookie indicating they are part of the k8s beta
  • When a server stores a page in parsercache (and, in turn, in memcached), it uses a key prefix and a shorter TTL (cache slotting/Vary)
  • Beta users can always compare a page by simply opening it as an anonymous user
  • Beta users are more likely to report problems
  • We can run this for as long as we want
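The parsercache slotting above could look roughly like this. The key prefix and both TTL values are illustrative assumptions, not the actual MediaWiki code or production settings:

```python
# Illustrative sketch of parsercache key slotting for k8s beta users.
# Prefixing the key keeps k8s-rendered entries separate from
# baremetal-rendered ones; a shorter TTL limits how long stale
# beta-rendered content can live.

DEFAULT_TTL = 30 * 24 * 3600   # assumed normal parsercache TTL (30 days)
BETA_TTL = 24 * 3600           # assumed shorter TTL for the beta slot (1 day)

def parsercache_key(page_key: str, is_k8s_beta: bool) -> tuple[str, int]:
    """Return the (cache key, TTL) pair to use when storing a parse."""
    if is_k8s_beta:
        return ("k8s:" + page_key, BETA_TTL)
    return (page_key, DEFAULT_TTL)
```

Since beta users never read or write the unprefixed slot, the two renderings can never mix, which is the "no cache pollution" property listed below.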

Pros

  • No edge and parser cache pollution (pages rendered by k8s mixed with pages rendered by baremetal servers)
  • User reports

Cons

  • Browsing will be considerably slower for beta users; we could consider adjusting the TTL a bit

Rollout Example

  1. X-Wikimedia-Debug
  2. Beta users/parsercache slotting
  3. Low traffic urls
  4. Low traffic wikis from group0
  5. Some group1 wikis
  6. Parsoid (?)
  7. All wikis except enwiki
  8. enwiki (Fin)

Note: Running jobs, timers, and standalone scripts are going to be approached differently

Proposal #2: Use a k8s cookie

Users with the cookie will be routed to k8s, and will have their own cache in the traffic layer (Varnish+ATS). This is similar to how we rolled out PHP7; the difference is that previously, routing and cookie setting took place in the application layer, while now we would have to do this in the traffic layer.
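The cookie check itself is simple; the hard part (per the cons below) is doing it, and the matching cache slotting, in VCL/Lua. A minimal sketch, with a hypothetical cookie name:

```python
# Hypothetical sketch of cookie-based routing at the traffic layer.
# The cookie name "mw-k8s-beta" is an assumption for illustration;
# in the PHP7 migration the equivalent decision used a cookie plus
# the X-Seven header.

K8S_COOKIE = "mw-k8s-beta"

def routes_to_k8s(cookies: dict[str, str]) -> bool:
    """A request carrying the beta cookie goes to the k8s backend and
    gets its own edge cache slot (i.e. the cache would Vary on it)."""
    return cookies.get(K8S_COOKIE) == "1"
```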

Pros

  • We have previous experience with this kind of rollout
  • Beta users
  • Better control over amount of traffic served
  • Easier to roll back (?)

Cons

  • Complex VCL and Lua changes for edge cache slotting (not enough test coverage there)
  • Edge cache invalidation issue (i.e. how do we invalidate only the k8s-rendered cache at the edge?)
  • Where will we calculate if an anonymous user should get the k8s cookie or not?
  • Traffic would like to avoid this solution

Proposal #3: Per cluster rollout (winner)

We can create Kubernetes services to serve some (initially internal) traffic, and then do a per-cluster migration. For instance, we could create an api-internal-r{w,o}.discovery.wmnet service, and then start moving internal services over to it.
This approach was used at the beginning for all migrations, and will continue to be used for T333120: Migrate internal traffic to k8s

Proposal #4: Percentage-based global traffic redirect (followup to winner)

See T336038: Add traffic sampling support to mw-on-k8s.lua ATS script
A Lua script was added to ATS. It supports:

  • Sending any percentage of traffic for a domain to mw-on-k8s (including 0% and 100%)
  • Sending any percentage of global traffic to mw-on-k8s

This approach will be used going forward, with the current thresholds described in Roll out phase 2.
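The sampling decision can be sketched as below. This is a Python rendition of the shape of the logic, not the mw-on-k8s.lua script itself; the domains and percentages are example values, not the production thresholds.

```python
import random

# Minimal sketch of percentage-based traffic sampling, in the spirit
# of the mw-on-k8s.lua ATS script (T336038). All values below are
# illustrative, not the production configuration.

PER_DOMAIN_PERCENT = {
    "test.wikipedia.org": 100.0,  # example: fully migrated domain
    "en.wikipedia.org": 0.0,      # example: not yet sampled
}
GLOBAL_PERCENT = 5.0  # assumed global rate for domains without an override

def send_to_k8s(host: str, rng=random) -> bool:
    """Sample a request: per-domain threshold if set, else the global one.

    rng.random() is in [0, 1), so a 0% threshold never matches and a
    100% threshold always does, covering both edge cases in the list above.
    """
    percent = PER_DOMAIN_PERCENT.get(host, GLOBAL_PERCENT)
    return rng.random() * 100.0 < percent
```

Incrementing the global percentage in small steps is what drives the phase 2 rollout described below.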

After discussions, serviceops has decided to mix and match ideas from the above proposals.

Roll out

Roll out phase 1: Start serving a small portion of content from specific wikis

Roll out phase 2: Migrate global traffic by increments

Roll out phase 3: Cleanup, scripts, and stragglers, oh my!


Event Timeline


Change 957241 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Fix test-commons redirect

https://gerrit.wikimedia.org/r/957241

Change 957241 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Fix test-commons redirect

https://gerrit.wikimedia.org/r/957241

Mentioned in SAL (#wikimedia-operations) [2023-09-13T08:46:30Z] <claime> Running puppet on cp-text P:trafficserver::backend - T290536

Change 961351 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Remove wikidata exception

https://gerrit.wikimedia.org/r/961351

Change 961351 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Remove wikidata exception

https://gerrit.wikimedia.org/r/961351

T355292: Port videoscaling to kubernetes should probably be a subtask of this (or maybe a subtask of T321899)? At least I’ve been told that videoscalers are blockers for the k8s migration being considered complete, and T355292 seems to be the currently active task in that area.

Clement_Goubert changed the task status from Open to In Progress. Jul 8 2024, 11:26 AM
Clement_Goubert updated the task description.

Closed T321899: Create mw-videoscaler helmfile deployment which was a placeholder task, T355292: Port videoscaling to kubernetes is indeed where the work on this is more accurately tracked.

I've done a bit of housekeeping on this task and its children.