
Heterogeneous deployment/Train deploys

From Wikitech
Trainbows not Painbows


Weekly steps

Monday: Sync up with your deployment partner

As of October 2019, there are two people assigned to each week's train: one as primary and one as backup. These are rough guidelines for sharing the work and should be improved as we learn more.

  • On Monday, communicate with your partner and establish how you'll collaborate over the course of the week.
    • Posting updates on IRC while your partner is working, and on the train blocker ticket if they're offline, seems to be a useful pattern.
    • Liberal use of video chat for pairing on hard problems is encouraged.
    • It seems to work well to have the primary do the work of cutting the branch, syncing wikis, etc., while the backup keeps an eye on logs, works on improvements to deploy tooling, and is generally an extra pair of eyes for the whole process.
    • If you are in doubt about any part of the process and it's during your partner's working hours, consult them first and get their help in resolving your questions.
  • If one member of the pair is in the European window and one is in the American window, both train deployment windows should be reserved on the Deployments calendar. This gives a backup deployer a defined window for moving the train forward outside the primary's working hours, if it becomes necessary.
  • If the train is blocked or there are any other issues, communicate the transfer of responsibility on the train blocker ticket by assigning it to the responsible party and leaving a note.

Tuesday: New branch creation and deploy

Before the deploy window

All pre-deploy steps have been automated.

  • Branch cut happens on releases-jenkins. (Note: the link to the branch cut job will report "Not Found" until you log into releases-jenkins.) The changes that are part of a given branch can be found on the corresponding change log page on mediawiki.org.
  • scap stage-train auto is run by a cron job

Refer to #Troubleshooting_automated_jobs if something goes wrong.

During the deploy window
Steps (with the host or page involved, followed by the command or an example):
0-0 Create and auto-merge/deploy the group0 patch (host: deploy1002)
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
                                                                         
          ____                                           
          |DD|_____T_                                    
          |_ |wmf.26|<                                   
            @-@-@-oo                                     
=========================================================================
  START   testwikis       group0          group1          group2         
          1.41.0-wmf.26   1.41.0-wmf.25   1.41.0-wmf.25   1.41.0-wmf.25  
  [0]     [1]             [2]             [3]             [4]            
What station do you want the train to be at (0-4)?

Select the index corresponding to group 0 ([2]) and press enter. Now wait for scap to finish the deployment.

0-1 Verify production has indeed switched (MediaWiki.org): Verify that mediawikiwiki has switched to the new version (Installed software, Product: MediaWiki, Version: VERSION).
0-2 Monitor production logs (logstash etc.): Monitor IRC and logstash and/or logspam-watch for problems; see #Places to Watch for Breakage.
0-3 Update roadmap page (mw:MediaWiki 1.43/Roadmap): Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 0 (deployed to group0):
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|0}}
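
Step 0-1 above (verifying that a wiki has switched) could also be checked from a script. The following is a minimal sketch, not part of scap: it assumes the wiki exposes the standard MediaWiki action API at /w/api.php and that the siteinfo "generator" field has the form "MediaWiki VERSION". The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_url(api_base: str) -> str:
    """Build a siteinfo query URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }
    return api_base + "?" + urllib.parse.urlencode(params)


def reported_version(siteinfo: dict) -> str:
    """Extract the version from the 'generator' field.

    Assumption: the field looks like 'MediaWiki 1.41.0-wmf.26'.
    """
    generator = siteinfo["query"]["general"]["generator"]
    return generator.removeprefix("MediaWiki ").strip()


def has_switched(siteinfo: dict, expected_version: str) -> bool:
    """True if the wiki reports exactly the expected version."""
    return reported_version(siteinfo) == expected_version


def fetch_siteinfo(api_base: str) -> dict:
    """Fetch siteinfo over the network (the User-Agent string is a placeholder)."""
    request = urllib.request.Request(
        siteinfo_url(api_base),
        headers={"User-Agent": "train-check-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

For example, has_switched(fetch_siteinfo("https://www.mediawiki.org/w/api.php"), "1.41.0-wmf.26") would be expected to return True once group0 is on that version. This complements the manual check of the version page; it does not replace watching the logs.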

Wednesday: group0 to group1 deploy

Meta / coordination

Attend the Train Log Triage meeting with members of the Core Platform Team and others.

Steps (with the host or page involved, followed by the command or an example):
1-0 Create and auto-merge/deploy the group1 patch (host: deploy1002)
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
                                                                         
                          ____                                           
                          |DD|_____T_                                    
                          |_ |wmf.26|<                                   
                            @-@-@-oo                                     
=========================================================================
  START   testwikis       group0          group1          group2         
          1.41.0-wmf.26   1.41.0-wmf.26   1.41.0-wmf.25   1.41.0-wmf.25  
  [0]     [1]             [2]             [3]             [4]           
What station do you want the train to be at (0-4)?

Select the index corresponding to group 1 ([3]) and press enter. Now wait for scap to finish the deployment.

1-1 Verify production has indeed switched (English Wiktionary): Verify that the English Wiktionary (and other group1 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION).
1-2 Monitor production logs (logstash etc.): Monitor IRC and logstash and/or logspam-watch for problems; see #Places to Watch for Breakage.
1-3 Update roadmap page (mw:MediaWiki 1.43/Roadmap): Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 1 (deployed to group1):
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|1}}
...
{{WMFReleaseTableFooter}}
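
The wikitext edit in the roadmap step amounts to changing the third parameter of one WMFReleaseTableRow call. A minimal sketch of that edit follows, assuming the first parameter identifies the row (12 in the example above) and that rows are written on one line without nested templates. The function name is hypothetical, and the sketch only transforms text; it does not save anything to the wiki.

```python
import re

# Matches {{WMFReleaseTableRow|<id>|<date>|<group>}} with simple, unnested parameters.
ROW_PATTERN = re.compile(
    r"\{\{WMFReleaseTableRow\|(?P<row>[^|{}]+)\|(?P<date>[^|{}]+)\|(?P<group>[^|{}]*)\}\}"
)


def set_deployed_group(wikitext: str, row_id: str, group: int) -> str:
    """Return wikitext with the third parameter of the matching row set to `group`.

    0 = deployed to group0, 1 = deployed to group1, 2 = deployed to all.
    Rows whose first parameter differs from row_id are left untouched.
    """
    if group not in (0, 1, 2):
        raise ValueError("group must be 0, 1 or 2")

    def replace(match: re.Match) -> str:
        if match.group("row").strip() != row_id:
            return match.group(0)
        return "{{WMFReleaseTableRow|%s|%s|%d}}" % (
            match.group("row"),
            match.group("date"),
            group,
        )

    return ROW_PATTERN.sub(replace, wikitext)
```

Calling set_deployed_group(text, "12", 1) on the Tuesday example turns the row into {{WMFReleaseTableRow|12|2018-07-10|1}}, which matches the Wednesday example.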

Thursday: group{0,1} to all deploy

Steps (with the host or page involved, followed by the command or an example):
2-0 Create and auto-merge/deploy the group2 patch (host: deploy1002)
USERNAME@deploy1002:/srv/mediawiki-staging/$ scap train
                                                                         
                                          ____                                           
                                          |DD|_____T_                                    
                                          |_ |wmf.26|<                                   
                                            @-@-@-oo                                     
=========================================================================
  START   testwikis       group0          group1          group2         
          1.41.0-wmf.26   1.41.0-wmf.26   1.41.0-wmf.26   1.41.0-wmf.25  
  [0]     [1]             [2]             [3]             [4]           
What station do you want the train to be at (0-4)?

Select the index corresponding to group 2 ([4]) and press enter. Now wait for scap to finish the deployment.

2-1 Verify production has indeed switched (English Wikipedia): Verify that the English Wikipedia (and other group2 wikis) have switched to the new version (Installed software, Product: MediaWiki, Version: VERSION).
2-2 Monitor production logs (logstash etc.): Monitor IRC and logstash and/or logspam-watch for problems; see #Places to Watch for Breakage.
2-3 Update roadmap page (mw:MediaWiki 1.43/Roadmap): Change the Deployed to group (if you're using VisualEditor) or the 3rd parameter of the WMFReleaseTableRow template (if you're using the wikitext editor) to 2 (deployed to all):
{{WMFReleaseTableHead}}
{{WMFReleaseTableRow|12|2018-07-10|2}}
...
{{WMFReleaseTableFooter}}

Breakage

There will be times when this process does not go smoothly. There are guidelines for what to do when that happens.

In general, if an unexplained error occurs within 1 hour of a train deployment, always roll back the train. Rolling back is especially important when several other possible causes are in play, because it eliminates the train as one of them.
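
The rule of thumb above can be written down as a small predicate. This is only a sketch of the guideline as stated: the function name and the "explained" flag are illustrative, and the decision in practice remains a human judgment call.

```python
from datetime import datetime, timedelta

# The guideline's window: errors within 1 hour of a train deployment.
ROLLBACK_WINDOW = timedelta(hours=1)


def should_roll_back(deployed_at: datetime, error_first_seen: datetime, explained: bool) -> bool:
    """True when an unexplained error appeared within the window after a deploy."""
    if explained:
        # The cause is already known, so the blanket rule does not apply.
        return False
    elapsed = error_first_seen - deployed_at
    return timedelta(0) <= elapsed <= ROLLBACK_WINDOW
```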

Rollback

Rolling back a wikiversions change should be quick. Roll back production first, before you send patches up to Gerrit, since waiting on Jenkins may take a while:

USERNAME@deploy1002:/srv/mediawiki-staging$ git revert $(git log -1 --format=%H -- wikiversions.json)
USERNAME@deploy1002:/srv/mediawiki-staging$ scap sync-wikiversions 'Revert "group[0|1] wikis to [VERSION]"'

# Now that you've synced the revert, push the patch up to Gerrit. Run git commit --amend first so the commit gets a Change-Id.
# Ideally, you should also add the train blocker task id to the Bug: field for this commit
USERNAME@deploy1002:/srv/mediawiki-staging$ git commit --amend --no-edit
# [VERSION] below is the new version, e.g.: 1.43.0-wmf.6
USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=[VERSION],l=Code-Review+2

Example:

USERNAME@deploy1002:/srv/mediawiki-staging$ git push origin HEAD:refs/for/master%topic=1.34.0-wmf.0,l=Code-Review+2

Alternatively, if the rollback doesn't need to happen immediately and you can afford a few minutes, you can simply run the scap train command again to go back to a previous stage (however, when deploying to group0, this would also make the test servers go back!):

USERNAME@deploy1002:/srv/mediawiki-staging$ scap train
  • Wait for the patch to merge and be fetched back down to the deployment server.
  • Update the roadmap page (#Update roadmap).
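
After a rollback it can help to confirm which versions wikiversions.json now points at. The sketch below assumes the file is a flat JSON object mapping each wiki's database name to a version string such as "php-1.41.0-wmf.25"; that layout is an assumption here, as are the function names.

```python
import json
from collections import Counter


def load_wikiversions(path: str) -> dict:
    """Read wikiversions.json (assumed layout: {"dbname": "php-VERSION", ...})."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def versions_in_use(wikiversions: dict) -> Counter:
    """Count how many wikis are on each version string."""
    return Counter(wikiversions.values())


def wikis_on(wikiversions: dict, version: str) -> list:
    """List the wikis that still point at the given version string."""
    return sorted(name for name, current in wikiversions.items() if current == version)
```

Running wikis_on(load_wikiversions("wikiversions.json"), "php-1.41.0-wmf.26") in /srv/mediawiki-staging after reverting a group1 deploy would be expected to list only the wikis that were meant to stay on the new version.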

Troubleshoot Kubernetes deployment

To get events for the service mw-api-ext on eqiad:

kube-env mw-api-ext eqiad
kubectl get events

See Kubernetes/Troubleshooting#Troubleshooting_a_deployment.
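
When the event list is long, it may be easier to look only at warnings. The sketch below filters the JSON form of the same command (kubectl get events -o json). It assumes kube-env has already been run in the calling shell so that kubectl targets the right service and datacenter; the helper names are illustrative.

```python
import json
import subprocess


def warning_events(event_list: dict) -> list:
    """Return (object, reason, message) tuples for events of type Warning."""
    warnings = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue
        involved = event.get("involvedObject", {})
        target = involved.get("kind", "?") + "/" + involved.get("name", "?")
        warnings.append((target, event.get("reason", ""), event.get("message", "")))
    return warnings


def fetch_events() -> dict:
    """Run kubectl in the current environment and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "events", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A typical use would be: for target, reason, message in warning_events(fetch_events()): print(target, reason, message).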


Places to Watch for Breakage

Train deployers should check for breakage while rolling out the train, since they are effectively the first line of defense for train deploys.

Given limited resources, it is not possible to monitor every dashboard during the train. A limited set of signals is actively monitored, and a much larger set may be consulted when needed.

See MediaWiki_Engineering/Guides/Monitor_production_errors for a detailed breakdown of the log triage process.

Places we monitor

These are the places Release Engineering actively monitors during the train.

Other places to look

These links are not actively monitored by Release Engineering, but may be useful for troubleshooting and investigation of problems with the train.

If the train is blocked

  • A task will be assigned to you, for example T191059 (1.32.0-wmf.13 deployment blockers) (you can see that week's task at https://train-blockers.toolforge.org)
  • Any open subtasks block the train from moving forward. This means no further deployments until the blockers are resolved.

Checklist

If there are blocking tasks, please do the following:

  • Make sure all tasks blocking the train are set to UBN! (Unbreak Now!) priority in Phabricator.
  • Comment on the task asking for an ETA, or whether this can be solved by reverting a recent commit.
  • Send e-mail to:
    • [email protected]
    • [email protected]
    • Ping private #engineering-all Slack channel
    • Subject: [Train] {version} status update
    • Body
      The {version} version of MediaWiki is blocked[0].
      
      The new version is deployed to {group(s){0,1,2}}[1], but can proceed no
      further until these issues are resolved:
      
      * {Phab task name} - {phab task link}
      
      Once these issues are resolved train can resume. If these issues are
      resolved on a Friday the train will resume Monday.
      
      Thank you for your help resolving these issues!
      
      -- Your humble train toiler
      
      [0]. <{link to phab task for train}>
      [1]. <https://versions.toolforge.org/>
      
  • Add relevant people (see Developers/Maintainers) to the blocking task
  • Ping relevant people in IRC
  • Once train is unblocked be sure to thank the folks who helped unblock it
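
The status e-mail in the checklist can be filled in mechanically. The following sketch renders the subject and body from the template above; the function name and argument names are hypothetical, and sending the message is deliberately left out.

```python
def render_blocked_email(version: str, deployed_to: str, blockers: list, train_task_url: str):
    """Return (subject, body) for the '[Train] {version} status update' e-mail.

    blockers is a list of (task name, task link) pairs; deployed_to is a
    human-readable description such as 'group0' or 'group0 and group1'.
    """
    if not blockers:
        raise ValueError("a blocked-train e-mail needs at least one blocking task")

    subject = f"[Train] {version} status update"
    blocker_lines = "\n".join(f"* {name} - {link}" for name, link in blockers)
    body = (
        f"The {version} version of MediaWiki is blocked[0].\n"
        "\n"
        f"The new version is deployed to {deployed_to}[1], but can proceed no\n"
        "further until these issues are resolved:\n"
        "\n"
        f"{blocker_lines}\n"
        "\n"
        "Once these issues are resolved train can resume. If these issues are\n"
        "resolved on a Friday the train will resume Monday.\n"
        "\n"
        "Thank you for your help resolving these issues!\n"
        "\n"
        "-- Your humble train toiler\n"
        "\n"
        f"[0]. <{train_task_url}>\n"
        "[1]. <https://versions.toolforge.org/>\n"
    )
    return subject, body
```

Keeping the wording in one function means every status update links the train task and the versions page the same way.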

Troubleshooting automated jobs

Troubleshooting pre-sync failure
  • What you're seeing: You received an email that indicates the automated branch cut job has failed.
    • Likely problem: The job has failed.
    • How to fix it: Follow the link in the email to the failed build. Inspect the console and continue below to troubleshoot.
  • What you're seeing: The failed build console includes the message "<url> was rejected by a test failure".
    • Likely problem: The branch-cut change for mediawiki/core has failed in CI.
    • How to fix it: Follow the link to the change in Gerrit. Remove any existing +2 vote and re-vote +2 to trigger gate-and-submit. If the change is merged, all is well (but you should report the flaky behavior). If it fails again, continue below to troubleshoot.
  • What you're seeing: The branch-cut change has failed in CI again (above).
    • Likely problem: This is a real test failure.
    • How to fix it: Yell for help from developers in Slack (#engineering-all) and/or on IRC (#wikimedia-releng ?). After a fix has been merged into the mainline branch and backported to the version branch, click "rebuild last" in Jenkins to rerun the branch-cut job.
  • What you're seeing: You received an email with subject line "FAIL: train-presync".
    • Likely problem: The systemd timer that runs scap stage-train auto has failed.
    • How to fix it: Continue below to troubleshoot.
  • What you're seeing: The email contains ".gitmodules does not exist. Did the train branch commit get merged?".
    • Likely problem: The automated branch cut job has failed.
    • How to fix it: Head to the top of this list and troubleshoot the branch cut failure. Once you've solved the issue, re-run scap stage-train --yes auto on the deployment server.
  • What you're seeing: The email contains "ERROR: git am: error: Failed to merge in the changes."
    • Likely problem: Security patches have failed to apply cleanly.
    • How to fix it: Ping the Phabricator task for the security patch and ask for a rebase. Once they've resolved the issue, re-run scap stage-train --yes auto on the deployment server. This command will check out the code on the deployment server and deploy to test wikis.
  • What you're seeing: The email contains "ssh: connect to host <host> port 22: Connection timed out."
    • Likely problem: ?
    • How to fix it: ?
  • What you're seeing: The email contains "error: insufficient permission for adding an object to repository database .git/objects."
    • Likely problem: ?
    • How to fix it: ?
  • What you're seeing: Something else.
    • Likely problem: ???
    • How to fix it: Get help from your backup conductor and fellow RelEngineers to troubleshoot the failure. Once you have solved the issue, be sure to update this section with: what you saw, the root problem, how you fixed it.
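
The entries above amount to a lookup from a message fragment to a likely problem. A minimal sketch of that lookup follows; it covers only the entries whose cause is documented here, the fragments are copied from this section, and the function name is hypothetical.

```python
# (message fragment, likely problem) pairs taken from the entries above.
KNOWN_FAILURES = [
    ("was rejected by a test failure",
     "The branch-cut change for mediawiki/core has failed in CI."),
    (".gitmodules does not exist",
     "The automated branch cut job has failed."),
    ("git am: error: Failed to merge in the changes",
     "Security patches have failed to apply cleanly."),
]

UNKNOWN = "Unknown: get help from your backup conductor and document what you find."


def likely_problem(text: str) -> str:
    """Return the documented likely problem for an email or console excerpt."""
    for fragment, problem in KNOWN_FAILURES:
        if fragment in text:
            return problem
    return UNKNOWN
```

New fragments can be appended to KNOWN_FAILURES as the undocumented entries (the ssh timeout and the .git/objects permission error) get diagnosed.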

Incident documentation
