Heterogeneous deployment/Train deploys

Weekly steps

Monday: Sync up with your deployment partner

As of October 2019, there are two people assigned to each week's train: one as primary and one as backup. The following are rough guidelines for sharing the work, and should be improved as we learn more.

On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
- Updates on IRC while your partner is working, and updates on the train blocker ticket while they're offline, seem to be a useful pattern.
- Liberal use of video chat for pairing on hard problems is encouraged.
- It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
- If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the Deployments calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.

Tuesday: New branch creation and deploy

Before the deploy window

All pre-deploy steps have been automated.

- Branch cut happens on releases-jenkins. (Note: The link to the branch cut job will report "Not Found" until you log into releases-jenkins.) The changes that are part of a given branch can be found on the corresponding change log page on mediawiki.org.
- `scap stage-train auto` is run automatically by a scheduled job (described as a systemd timer under Troubleshooting automated jobs).

Refer to #Troubleshooting_automated_jobs if something goes wrong.

During the deploy window

Step		host	command	example
0-0	Create and auto-merge/deploy the group0 patch	deploy1002	USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train At the prompt "What station do you want the train to be at (0-4)?", select the index corresponding to group 0 ([2]) and press enter. Now wait for scap to finish the deployment.	Stations shown by scap below its train banner: [0] START, [1] testwikis 1.41.0-wmf.26, [2] group0 1.41.0-wmf.25, [3] group1 1.41.0-wmf.25, [4] group2 1.41.0-wmf.25
0-1	Verify production has indeed switched	MediaWiki.org	Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
0-2	Monitor production logs	logstash etc.	Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
0-3	Update roadmap page	mw:MediaWiki 1.43/Roadmap	Change the `Deployed to group` (if you're using VisualEditor) or the 3rd parameter of the `WMFReleaseTableRow` template (if you're using the wikitext editor) to `0` (deployed to group0)	{{WMFReleaseTableHead}} {{WMFReleaseTableRow|12|2018-07-10|0}}
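
Step 0-1 can also be checked from a terminal. The sketch below assumes the wiki's public Action API (`action=query&meta=siteinfo`) and its `generator` field, neither of which is mentioned in the steps above; the function name is made up for illustration. It only parses a response, so fetching it is left to the caller.

```shell
# Hypothetical helper: read a siteinfo JSON response on stdin and print the
# MediaWiki version found in its "generator" field
# (for example "MediaWiki 1.43.0-wmf.6" prints "1.43.0-wmf.6").
# Assumption: the response is the API's usual JSON output. Not part of scap.
mw_version_from_siteinfo() {
    sed -n 's/.*"generator":[[:space:]]*"MediaWiki \([^"]*\)".*/\1/p'
}

# Usage (assumed endpoint; run from any host that has curl):
#   curl -s 'https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
#       | mw_version_from_siteinfo
```

Compare the printed value with the VERSION you just deployed; the Special:Version check described in the table remains the reference.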

Wednesday: group0 to group1 deploy

Meta / coordination

Attend the Train Log Triage meeting with members of the Core Platform Team and others.

Step		host	command	example
1-0	Create and auto-merge/deploy the group1 patch	deploy1002	USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train At the prompt "What station do you want the train to be at (0-4)?", select the index corresponding to group 1 ([3]) and press enter. Now wait for scap to finish the deployment.	Stations shown by scap below its train banner: [0] START, [1] testwikis 1.41.0-wmf.26, [2] group0 1.41.0-wmf.26, [3] group1 1.41.0-wmf.25, [4] group2 1.41.0-wmf.25
1-1	Verify production has indeed switched	English Wiktionary	Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
1-2	Monitor production logs	logstash etc.	Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
1-3	Update roadmap page	mw:MediaWiki 1.43/Roadmap	Change the `Deployed to group` (if you're using VisualEditor) or the 3rd parameter of the `WMFReleaseTableRow` template (if you're using the wikitext editor) to `1` (deployed to group1)	{{WMFReleaseTableHead}} {{WMFReleaseTableRow|12|2018-07-10|1}} ... {{WMFReleaseTableFooter}}
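
For the wikitext route in the roadmap step, the row can be generated rather than typed. This is a minimal sketch: the function name is hypothetical, and the first two parameters are simply copied through as they appear in the table's example (`12` and `2018-07-10`); only the third parameter's meaning (the group reached) is stated above.

```shell
# Hypothetical helper: print a WMFReleaseTableRow line for the roadmap page.
# $1 and $2 are passed through unchanged, as in the example row.
# $3 is the group the version has reached: 0, 1, or 2.
roadmap_row() {
    case $3 in
        0|1|2) printf '{{WMFReleaseTableRow|%s|%s|%s}}\n' "$1" "$2" "$3" ;;
        *) echo "group must be 0, 1, or 2" >&2; return 1 ;;
    esac
}
```

For example, `roadmap_row 12 2018-07-10 1` prints the row to paste after the Wednesday deploy.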

Thursday: group{0,1} to all deploy

Step		host	command	example
2-0	Create and auto-merge/deploy the group2 patch	deploy1002	USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train At the prompt "What station do you want the train to be at (0-4)?", select the index corresponding to group 2 ([4]) and press enter. Now wait for scap to finish the deployment.	Stations shown by scap below its train banner: [0] START, [1] testwikis 1.41.0-wmf.26, [2] group0 1.41.0-wmf.26, [3] group1 1.41.0-wmf.26, [4] group2 1.41.0-wmf.25
2-1	Verify production has indeed switched	English Wikipedia	Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION)
2-2	Monitor production logs	logstash etc.	Monitor IRC and logstash and/or logspam-watch for problems, see #Places to Watch for Breakage
2-3	Update roadmap page	mw:MediaWiki 1.43/Roadmap	Change the `Deployed to group` (if you're using VisualEditor) or the 3rd parameter of the `WMFReleaseTableRow` template (if you're using the wikitext editor) to `2` (deployed to all)	{{WMFReleaseTableHead}} {{WMFReleaseTableRow|12|2018-07-10|2}} ... {{WMFReleaseTableFooter}}
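
The three daily tables share one pattern: each day moves the train one station further. As a quick reference, the sketch below maps a weekday to the index to enter at the `scap train` prompt. The function name is hypothetical, and the index order is the one shown by the prompt in the tables; always confirm it against what scap actually prints.

```shell
# Hypothetical helper: map a weekday to the station index to type at the
# `scap train` prompt, following the weekly schedule above.
# Index order as shown by the prompt: [0] START, [1] testwikis,
# [2] group0, [3] group1, [4] group2.
train_station_for_day() {
    case $1 in
        Tuesday)   echo 2 ;;  # group0
        Wednesday) echo 3 ;;  # group1
        Thursday)  echo 4 ;;  # group2
        *) echo "no train step scheduled for: $1" >&2; return 1 ;;
    esac
}
```

With an English locale, `train_station_for_day "$(LC_ALL=C date +%A)"` gives the index for today.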

Breakage

There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.

In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. This matters most when several other possible causes are in play at the same time, because rolling back eliminates the train as one of them.
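
The one-hour rule can be written down as a simple check. This is only an illustration of the guideline, assuming you have both timestamps as Unix epoch seconds; the function name is hypothetical and nothing in the deploy tooling is known to implement it.

```shell
# Hypothetical helper: succeed (exit 0) when an error was first seen within
# one hour (3600 seconds) after a train deployment.
# $1 = deployment time, $2 = time the error was first seen, both in Unix
# epoch seconds. Where the timestamps come from is up to you.
error_within_rollback_window() {
    deploy_ts=$1
    error_ts=$2
    elapsed=$((error_ts - deploy_ts))
    [ "$elapsed" -ge 0 ] && [ "$elapsed" -le 3600 ]
}
```

If the check succeeds and the error is unexplained, the guideline above says to roll back.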

Rollback

Rolling back a wikiversion change should be quick. Roll back production before you send patches up to Gerrit, since waiting on Jenkins may take a while:

USERNAME@deploy1002:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1002:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'

# Now that you've synced the revert, push the patch up to Gerrit. Run git commit --amend first so the commit gets a Change-Id.
# Ideally, also add the train blocker task ID to the Bug: field of this commit.
USERNAME@deploy1002:/srv/mediawiki-staging$ git commit --amend --no-edit
# [VERSION] below is the new version, e.g.: 1.43.0-wmf.6
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2

Example:

USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2
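
To avoid typos in the push target, it can be built from the version string. A minimal sketch, with a hypothetical function name and a deliberately loose sanity check on the version format (based only on the examples on this page, such as 1.43.0-wmf.6):

```shell
# Hypothetical helper: build the Gerrit push target used in the rollback
# steps above for a given train version.
train_push_ref() {
    case $1 in
        [0-9]*.[0-9]*.[0-9]*-wmf.[0-9]*) ;;  # looks like 1.43.0-wmf.6
        *) echo "unexpected version: $1" >&2; return 1 ;;
    esac
    # %% prints a literal % in printf's format string.
    printf 'HEAD:refs/for/master%%topic=%s,l=Code-Review+2\n' "$1"
}
```

Usage: `git push origin "$(train_push_ref 1.34.0-wmf.0)"`. The check rejects an unfilled placeholder such as `[VERSION]` before anything is pushed.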

Alternatively, if the rollback doesn't need to happen immediately and you can afford a few minutes, you can simply run the scap train command again to go back to a previous stage (however when deploying to group0 this would also make test servers go back!):

USERNAME@deploy1002:/srv/mediawiki-staging$ scap train

- Wait for the patch to merge and to be fetched back down to the deployment server.
- Update the roadmap page.

Troubleshoot Kubernetes deployment

To get events for the service mw-api-ext on eqiad:

kube-env mw-api-ext eqiad
kubectl get events

See Kubernetes/Troubleshooting#Troubleshooting_a_deployment.

Places to Watch for Breakage

Train deployers should check for breakage while rolling out the train, as they are effectively the first line of defense for train deploys.

Given limited resources, it is not possible to monitor every dashboard during the train. A limited set of signals is actively monitored, and a much larger set may be monitored.

See MediaWiki_Engineering/Guides/Monitor_production_errors for a detailed breakdown of the log triage process.

Places we monitor

These are the places Release Engineering actively monitors during the train.

IRC
- Primary channel is #wikimedia-operations. This is where official deployment communications happen, alerts are broadcast, etc.
- For more channels see MediaWiki on IRC and IRC/Channels
Logs
- Current mwlog (mwlog1001 or mwlog2002, depending on primary datacenter):
  - logspam-watch
  - Logfiles can be found in /srv/mw-log
- Logstash
  - mediawiki-errors dashboard gives the full firehose of almost all errors
  - MediaWiki New Errors ECS is a workboard with known issues filtered out, useful for surfacing new breakage
- See the Wikimedia-production-error workboard for known issues
Grafana
- Application Servers RED - k8s Dashboard

Other places to look

These links are not actively monitored by Release Engineering, but may be useful for troubleshooting and investigation of problems with the train.

Logstash mw-client-errors dashboard
- New errors appearing more than 1000 times in a 12-hour period should be considered blockers
- See also Grafana dashboard with summary of average error rate over time
Grafana
- Varnish http-errors dashboard (HTTP 5XX % should have 3+ 0s after the decimal point, e.g. 0.0001%)
- Frontend Responses NGINX vs Varnish
- Production Logging
- Minerva Client Errors - Browser JS errors count (only wikipedias on mobile)
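
The two numeric rules of thumb in the list above can be expressed as small checks. This is a sketch with hypothetical function names; the values have to be read off the dashboards by hand, since nothing here queries Logstash or Grafana.

```shell
# Hypothetical helpers encoding the two thresholds from the list above.

# Succeeds when an HTTP 5XX percentage has at least three zeros after the
# decimal point, i.e. is below 0.001 (pass the number without the % sign).
http_5xx_rate_ok() {
    awk -v pct="$1" 'BEGIN { exit !(pct + 0 < 0.001) }'
}

# Succeeds when a new client error should be considered a blocker:
# more than 1000 occurrences in a 12-hour period.
client_error_is_blocker() {
    [ "$1" -gt 1000 ]
}
```

For example, `http_5xx_rate_ok 0.0001` succeeds, while `http_5xx_rate_ok 0.02` fails and is worth a closer look.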

If the train is blocked

A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers). You can find the current week's task at https://train-blockers.toolforge.org.
Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.

Checklist

If there are blocking tasks, please do the following:

Make sure all tasks blocking the train are set to UBN! priority in Phabricator.
Comment on the task asking for an ETA, or whether the problem can be solved by reverting a recent commit.

Send e-mail to:

[email protected]
[email protected]
Ping private #engineering-all Slack channel
Subject: [Train] {version} status update

Body

The {version} version of MediaWiki is blocked[0].

The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
further until these issues are resolved:

* {Phab task name} - {phab task link}

Once these issues are resolved train can resume. If these issues are
resolved on a Friday the train will resume Monday.

Thank you for your help resolving these issues!

-- Your humble train toiler

[0]. <{link to phab task for train}>
[1]. <https://versions.toolforge.org/>

Add relevant people (see Developers/Maintainers) to the blocking task
Ping relevant people in IRC
Once the train is unblocked, be sure to thank the folks who helped unblock it
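
The status e-mail in the checklist above can be filled in with a small script instead of by hand. The sketch below reproduces the template text; the function name and argument order are assumptions made for illustration, and sending the message is left to your mail client.

```shell
# Hypothetical helper: fill in the "[Train] {version} status update" e-mail.
# Arguments: version, groups already on the new version (e.g. "group0"),
# link to the train blocker task, then one "task name - task link" string
# per blocking task.
train_blocked_email() {
    version=$1
    groups=$2
    train_task=$3
    shift 3
    printf 'Subject: [Train] %s status update\n\n' "$version"
    printf 'The %s version of MediaWiki is blocked[0].\n\n' "$version"
    printf 'The new version is deployed to %s[1], but can proceed no\n' "$groups"
    printf 'further until these issues are resolved:\n\n'
    for blocker in "$@"; do
        printf '* %s\n' "$blocker"
    done
    cat <<EOF

Once these issues are resolved train can resume. If these issues are
resolved on a Friday the train will resume Monday.

Thank you for your help resolving these issues!

-- Your humble train toiler

[0]. <$train_task>
[1]. <https://versions.toolforge.org/>
EOF
}
```

Call it with the version, the groups reached so far, the train task link, and each blocker, then paste the output into the message.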

Troubleshooting automated jobs

Troubleshooting pre-sync failure
What you're seeing	Likely problem	How to fix it
You received an email that indicates the automated branch cut job has failed.	The job has failed.	Follow the link in the email to the failed build. Inspect the console and continue below to troubleshoot.
The failed build console includes the message `<url> was rejected by a test failure`	The branch-cut change for `mediawiki/core` has failed in CI.	Follow the link to the change in Gerrit. Remove any existing +2 vote and re-vote +2 to trigger gate-and-submit. If the change is merged, all is well (but you should report the flaky behavior). If it fails again, continue below to troubleshoot.
The branch-cut change has failed in CI again (above).	This is a real test failure.	Yell for help from developers in Slack (#engineering-all) and/or on IRC (#wikimedia-releng ?). After a fix has been merged into the mainline branch and backported to the version branch, click rebuild last in Jenkins to rerun the branch-cut job.
You received an email with subject line FAIL: train-presync	The systemd timer that runs `scap stage-train auto` has failed.	Continue below to troubleshoot.
The email contains `.gitmodules does not exist. Did the train branch commit get merged?`.	The automated branch cut job has failed.	Head to the top of this table and troubleshoot the branch cut failure. Once you've solved the issue, re-run `scap stage-train --yes auto` on the deployment server.
The email contains `ERROR: git am: error: Failed to merge in the changes`.	Security patches have failed to apply cleanly.	Ping the Phabricator task for the security patch and ask for a rebase. Once they've resolved the issue, re-run `scap stage-train --yes auto` on the deployment server. This command will check out the code on the deployment server and deploy to test wikis.
The email contains `ssh: connect to host <host> port 22: Connection timed out`.	?	?
The email contains `error: insufficient permission for adding an object to repository database .git/objects`.	?	?
Something else.	???	Get help from your backup conductor and fellow RelEngineers to troubleshoot the failure. Once you have solved the issue, be sure to update this section with: what you saw, the root problem, how you fixed it.
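
The table above is essentially a lookup from an error message to a likely problem, which can be sketched as a `case` statement. The function name is hypothetical, and only the messages with a known cause in the table are covered; everything else prints "unknown" and should be handled by hand.

```shell
# Hypothetical triage helper: read the text of a failure e-mail or build
# console on stdin and print the likely problem, using the messages listed
# in the table above.
classify_train_failure() {
    msg=$(cat)
    case $msg in
        *"was rejected by a test failure"*)
            echo "branch-cut change failed in CI" ;;
        *".gitmodules does not exist"*)
            echo "automated branch cut job failed" ;;
        *"git am: error: Failed to merge in the changes"*)
            echo "security patches failed to apply cleanly" ;;
        *)
            echo "unknown" ;;
    esac
}
```

Pipe the e-mail body or the console log into `classify_train_failure`, then follow the matching row's "How to fix it" column.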

Incident documentation

If there were problems during the train, follow the instructions at Incident documentation for incident reports and post-mortem review.
Use the Create report form to create a new page, train-[VERSION]. Example: Incident documentation/20181212-Train-1.33.0-wmf.8.
For the Timeline section, events from SAL and the Phabricator task are a good start.

Footnotes