Migrate Termbox SSR from Node 16 to 18
Closed, Resolved (Public)

Description

We want to update the Wikidata Termbox server-side rendering (SSR) service to Node 18. The Blubber file was updated in “Migrate from node16 to node18” (building the image version 2024-01-22-163619-production), and the new version was updated in values-test.yaml with no problems observed on Test Wikidata, but an update in values.yaml caused problems and had to be reverted – Logstash showed some errors on the Wikibase side like

Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server

and the SSR itself printed lots of errors like this to the kubectl logs:

{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q100626766","revision":2032799759,"format":"json"}},"url":"/termbox?entity=Q100626766&revision=2032799759&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ100626766&preferredLanguages=en","reqId":"b17d22b1-01ea-4879-82b3-96a15d5f7b6c","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:48.263Z","v":0}                        
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q100626766","revision":2032799759,"format":"json"}},"url":"/termbox?entity=Q100626766&revision=2032799759&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ100626766&preferredLanguages=en","reqId":"b17d22b1-01ea-4879-82b3-96a15d5f7b6c","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:48.336Z","v":0}                        
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q123049437","revision":2003292132,"format":"json"}},"url":"/termbox?entity=Q123049437&revision=2003292132&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ123049437&preferredLanguages=en","reqId":"2250f175-6433-403b-b975-4404a459734c","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:49.700Z","v":0}                        
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q78603151","revision":1581930543,"format":"json"}},"url":"/termbox?entity=Q78603151&revision=1581930543&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ78603151&preferredLanguages=en","reqId":"bc474f84-1bf5-47ff-bb6b-41a71751dc40","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:52.468Z","v":0}                           
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q1","revision":103,"format":"json"}},"url":"/termbox?language=de&entity=Q1&revision=103&editLink=%2Fedit%2FQ1347&preferredLanguages=de%7Cen","reqId":"675d6a65-99e8-49ed-bde9-c7e9a357d201","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:52.827Z","v":0}                                                                                             
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q1","revision":103,"format":"json"}},"url":"/termbox?language=de&entity=Q1&revision=103&editLink=%2Fedit%2FQ1347&preferredLanguages=de%7Cen","reqId":"5ec7ea20-4229-495c-9a04-8492d68ac71e","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:54.438Z","v":0}                                                                                             
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q28194017","revision":1373118015,"format":"json"}},"url":"/termbox?entity=Q28194017&revision=1373118015&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ28194017&preferredLanguages=en","reqId":"9ca27f83-3883-4d8b-a911-a78371917117","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:54.824Z","v":0}                           
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q86367866","revision":1517776825,"format":"json"}},"url":"/termbox?entity=Q86367866&revision=1517776825&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ86367866&preferredLanguages=en","reqId":"3417ab41-e267-4548-9891-c3e12b24f52e","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:56.983Z","v":0}                           
{"name":"wikibase-termbox","hostname":"termbox-production-7dfff557f8-4p9m9","pid":17,"level":"ERROR","message":"connect ECONNREFUSED ::1:6500","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"www.wikidata.org:6500"},"url":"index.php","params":{"title":"Special:EntityData","id":"Q33254393","revision":1774045136,"format":"json"}},"url":"/termbox?entity=Q33254393&revision=1774045136&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ33254393&preferredLanguages=en","reqId":"801d4cd4-d858-4a68-9c8a-91e2cd61d74a","levelPath":"error/service","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:59.374Z","v":0}

(Full file in /home/lucaswerkmeister-wmde/lucas-termbox-logs-2024-01-23 on deploy2002.) The important error appears to be “connect ECONNREFUSED ::1:6500”.
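To confirm that a single error dominates a log file like the one above, the JSON lines can be tallied by their msg field. This is only a minimal sketch: the function name tally_errors is made up for illustration, and the sample records are abbreviated versions of the lines quoted above (most fields dropped), not additional log data.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count ERROR-level records by their "msg" field, skipping non-JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # kubectl output may contain non-JSON lines; ignore them.
            continue
        if record.get("level") == "ERROR":
            counts[record.get("msg", "<no msg>")] += 1
    return counts


# Abbreviated from the log lines quoted above (assumption: only these fields matter here).
sample = [
    '{"name":"wikibase-termbox","level":"ERROR","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:48.263Z"}',
    '{"name":"wikibase-termbox","level":"ERROR","msg":"connect ECONNREFUSED ::1:6500","time":"2024-01-23T15:22:48.336Z"}',
    "this line is not JSON",
]

for message, count in tally_errors(sample).most_common():
    print(count, message)
```

On the real file, the same function could be fed the open file object instead of the sample list.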

We need to figure out what’s going wrong here and how to fix it.

Event Timeline

the new version was updated in values-test.yaml with no problems observed on Test Wikidata

Note: I phrased this very carefully – I assume, but am not at all sure, that Test Wikidata uses the service configured by values-test.yaml. There’s some sort of matrix between the two wikis (wikidatawiki, testwikidatawiki), three clusters (staging, eqiad, codfw), and three files (values.yaml, values-staging.yaml, values-test.yaml), and I don’t really understand how they match up.

But anyway, @Jdforrester-WMF might have found the fix already, pointing at cxserver: Force 127.0.0.1 instead of localhost.

Then it would also make sense that values-test.yaml didn’t cause an error, because that has WIKIBASE_REPO_HOSTNAME_ALIAS: mw-api-int-ro.discovery.wmnet, while values.yaml has WIKIBASE_REPO_HOSTNAME_ALIAS: localhost.
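A likely explanation for why only the localhost configuration broke, offered as an assumption rather than something confirmed in this task: as far as I know, Node 17 and later hand back resolved addresses in the order the resolver returns them, where older versions put IPv4 addresses first. If localhost resolves to both ::1 and 127.0.0.1 and the proxy only listens on IPv4, the client would then try ::1 and be refused. The sketch below only models that ordering decision in Python; pick_address and the two order names are illustrative, not an API of Node or of Termbox.

```python
import ipaddress


def pick_address(resolved, order):
    """Return the address a client would try first under the given ordering policy."""
    if order == "verbatim":
        # Keep the resolver's order untouched.
        return resolved[0]
    if order == "ipv4first":
        # Move IPv4 addresses ahead of IPv6 ones, otherwise preserving order.
        ipv4 = [a for a in resolved if ipaddress.ip_address(a).version == 4]
        ipv6 = [a for a in resolved if ipaddress.ip_address(a).version == 6]
        return (ipv4 + ipv6)[0]
    raise ValueError(f"unknown order: {order}")


# Assumption: the resolver lists the IPv6 loopback address before the IPv4 one.
localhost_addresses = ["::1", "127.0.0.1"]

print(pick_address(localhost_addresses, "ipv4first"))
print(pick_address(localhost_addresses, "verbatim"))
```

A hostname such as mw-api-int-ro.discovery.wmnet does not go through the loopback proxy at all, which would fit the observation that values-test.yaml was unaffected.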

Then it would also make sense that values-test.yaml didn’t cause an error, because that has WIKIBASE_REPO_HOSTNAME_ALIAS: mw-api-int-ro.discovery.wmnet, while values.yaml has WIKIBASE_REPO_HOSTNAME_ALIAS: localhost.

Aha. Interesting divergence – is that intentional?

I don’t know – it’s probably related to T334064, but I didn’t really understand a lot of what was going on in that task, to be honest. There’s some difference between the two in production config too:

wmf-config/InitialiseSettings.php:'wmgWikibaseSSRTermboxServerUrl' => [
wmf-config/InitialiseSettings.php-   'wikidatawiki' => 'http://localhost:6008/termbox',
wmf-config/InitialiseSettings.php-   'testwikidatawiki' => 'http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox',
wmf-config/InitialiseSettings.php-],

So maybe there’s a reason why (IIUC) values-test.yaml directly connects to mw-api-int-ro.discovery.wmnet:4446, while values.yaml talks to localhost:6500 which according to .fixtures.yaml is a proxy to mw-api-int.discovery.wmnet:4446.

The deeper reason behind most of this mess is probably the uniqueness of the test release. There is no other environment where we have a test release currently and thus some of the assumptions made elsewhere to provide functionality don't apply to it. Service mesh support as well as the DNS records are such exceptions and the difference in configuration to reflect the above is a consequence.

My gut feeling, probably supported by some stuff in this task, says that the high level end-result is wasted effort every time some actions need to be taken that (even tangentially) affect termbox. Either in T334064 or in this task, special consideration needed/needs to happen to accommodate the test release. Some of these thoughts were also captured (or at least alluded to) in T226814 when the test release was introduced, albeit not so clearly stated (and the situation has changed considerably since 2019).

My high level suggestion would be to re-evaluate if the test helm release actually serves a useful purpose (I know it serves test.wikidata.org but it apparently gets updated very infrequently. All termbox releases have been at the same version for 10 months now, so can't we just have test.wikidata.org use the main one?). If not, let's just stop having it. If yes, we might need to kick the can down the road a bit more until we decide we need to somehow support this type of helm release, because we currently have no other use cases and thus no current plans to support such uses.

So maybe there’s a reason why (IIUC) values-test.yaml directly connects to mw-api-int-ro.discovery.wmnet:4446, while values.yaml talks to localhost:6500 which according to .fixtures.yaml is a proxy to mw-api-int.discovery.wmnet:4446.

Fixtures are test/CI data; they aren't used outside of that scope. For the same reason, they are often dummy data and might or might not reflect some actual situation (in this case they do reflect reality, but that's more happenstance than anything else).

My high level suggestion would be to re-evaluate if the test helm release actually serves a useful purpose (I know it serves test.wikidata.org but it apparently gets updated very infrequently. All termbox releases have been at the same version for 10 months now, so can't we just have test.wikidata.org use the main one?).

IMHO it’s useful to be able to test a new Termbox version on Test Wikidata before deploying it to Wikidata – but as we’ve seen in this task, the current setup doesn’t support that perfectly, because there are too many differences in the test release.

Would it be possible to have just one helm release, but have Test Wikidata use the staging cluster while Wikidata uses the eqiad and codfw clusters?

Otherwise, I think it wouldn’t be the end of the world if we just lost this ability, and always had to deploy to Test Wikidata and Wikidata together; the impact is that mobile users without JavaScript lose access to terms until the deployer notices the problem and rolls back to the old version, which should be acceptable IMHO. (Though I’d want to check that with Product if we decide to go this way.)

IMHO it’s useful to be able to test a new Termbox version on Test Wikidata before deploying it to Wikidata – but as we’ve seen in this task, the current setup doesn’t support that perfectly, because there are too many differences in the test release.

Agreed on the last part. On the first part, it depends on what a failure of Termbox would mean for your end users and whether it indeed makes sense to have 1 more safety net (in addition to staging). It's a product decision as you say. If it would help them make that decision, the dashboard for the /termbox API endpoint is at https://grafana-rw.wikimedia.org/d/wJRbI7FGk/termbox?orgId=1&var-dc=thanos&var-site=All&var-service=termbox&var-prometheus=k8s&var-container_name=All&from=now-6M&to=now&viewPanel=12&editPanel=12

A quick reading over the last 12 months shows a multimodal distribution. It's split into 2 main sections: 1 that's before June 2023 and after mid-November 2023, and the in-between (Northern Summer+Autumn, let's call it). The traffic in the latter pattern apparently tripled and then subsided again. I have no idea if this is a seasonal effect or a result of some code changes. In any case, the amount of rps implies a small amount of concurrent users globally, so there's an argument to be made that it might be OK to not have a testing ground.

Would it be possible to have just one helm release, but have Test Wikidata use the staging cluster while Wikidata uses the eqiad and codfw clusters?

Meaning merging the functionality of test into the functionality of the staging release? It certainly is possible, although that would mean overloading the functionality of the staging release. We do have a big open question of what the staging releases mean to deployers after all and whether they indeed find them useful. I am a bit ambivalent about that approach, but it certainly is possible and if product people deem it useful to have a test release, we can go down that path.

Otherwise, I think it wouldn’t be the end of the world if we just lost this ability, and always had to deploy to Test Wikidata and Wikidata together; the impact is that mobile users without JavaScript lose access to terms until the deployer notices the problem and rolls back to the old version, which should be acceptable IMHO. (Though I’d want to check that with Product if we decide to go this way.)

Thanks for this input, I appreciate it.

Meaning merging the functionality of test into the functionality of the staging release? It certainly is possible, although that would mean overloading the functionality of the staging release. We do have a big open question of what the staging releases mean to deployers after all and whether they indeed find them useful. I am a bit ambivalent about that approach, but it certainly is possible and if product people deem it useful to have a test release, we can go down that path.

Yeah, that’s certainly an open question for me – I don’t know what the current functionality of the staging release is. FWIW, when I’ve deployed new Termbox versions in the past, I never tested the deployment “directly” (though I assume it would be possible – curl some internal URL?) – I only ever tested it through MediaWiki, by looking at new items on Wikidata or Test Wikidata and checking whether they had a server-side rendered termbox or not. But IIUC, this only allows testing two of the possible release+cluster combinations: Wikidata targets the production release on the eqiad/codfw clusters, I assume, while Test Wikidata targets some other combination (I don’t know which one).

(I also just noticed that helmfile.yaml lists three releases: production, staging, and test. I don’t think I was aware of the test one before, to be honest. Edit: Nonsense, that’s the one I updated in this Gerrit change. But there’s definitely something about the releases and clusters/environments that I didn’t realize, because I previously thought of it as 2×3 combinations, when it seems to be more 3×3. Maybe the thing I missed is that the name “staging” is used both for a release and for an environment/cluster?)

Yeah, that’s certainly an open question for me – I don’t know what the current functionality of the staging release is.

It was always envisioned as a safety net. One could deploy there before deploying to production in order to catch errors that standard CI/CD failed to catch. That being said, I think that it's natural for all these kinds of environments to obtain more roles than the intended ones and I don't think we have made a very consistent effort to communicate what the vision was.

FWIW, when I’ve deployed new Termbox versions in the past, I never tested the deployment “directly” (though I assume it would be possible – curl some internal URL?) – I only ever tested it through MediaWiki, by looking at new items on Wikidata or Test Wikidata and checking whether they had a server-side rendered termbox or not. But IIUC, this only allows testing two of the possible release+cluster combinations: Wikidata targets the production release on the eqiad/codfw clusters, I assume, while Test Wikidata targets some other combination (I don’t know which one).

There is a loop here. wikidata.org uses the production release of each helmfile environment (eqiad/codfw to match our DCs). The production releases use wikidata.org in their own turn (that's the loop I was pointing out).

The staging release isn't being used by anything. It is using wikidata.org production as well (you can tell by the fact that the only thing it overrides from the main values.yaml file is the number of replicas).

test.wikidata.org uses the test release. The test release uses test.wikidata.org in turn (same loop as above).

As for a curl request example, here it is

deploy2002:~$ curl https://staging.svc.eqiad.wmnet:4004/_info
{"name":"wikibase-termbox","version":"0.1.0"}

Your test release is accessible as

deploy1002:~$ curl http://staging.svc.eqiad.wmnet:3031/_info
{"name":"wikibase-termbox","version":"0.1.0"}

Note the difference in ports and HTTPS vs HTTP. The test release, being the unique thing that it is, doesn't have TLS support as it doesn't use the service mesh.
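For scripting the same check, here is a rough Python equivalent of the two curl calls. It is a sketch only: the base URLs are copied from the examples above, the names RELEASE_BASE_URLS, info_url and fetch_info are invented for illustration, and fetch_info is defined but deliberately not called because it only works from inside the production network (and the HTTPS endpoint presumably needs the internal CA to be trusted).

```python
import json
import urllib.request
from urllib.parse import urljoin

# Base URLs taken from the curl examples above.
RELEASE_BASE_URLS = {
    "staging": "https://staging.svc.eqiad.wmnet:4004/",  # TLS via the service mesh
    "test": "http://staging.svc.eqiad.wmnet:3031/",      # plain HTTP, no service mesh
}


def info_url(release):
    """Build the /_info URL for a release."""
    return urljoin(RELEASE_BASE_URLS[release], "_info")


def fetch_info(release, timeout=5.0):
    """Fetch and decode the /_info document (requires internal network access)."""
    with urllib.request.urlopen(info_url(release), timeout=timeout) as response:
        return json.load(response)


for release in RELEASE_BASE_URLS:
    print(release, info_url(release))
```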

(I also just noticed that helmfile.yaml lists three releases: production, staging, and test. I don’t think I was aware of the test one before, to be honest. Edit: Nonsense, that’s the one I updated in this Gerrit change. But there’s definitely something about the releases and clusters/environments that I didn’t realize, because I previously thought of it as 2×3 combinations, when it seems to be more 3×3. Maybe the thing I missed is that the name “staging” is used both for a release and for an environment/cluster?)

It's 4 releases in total. 1 production release per helmfile environment (or cluster/DC/data center[1]) and 2 releases, named staging and test that both reside in a kubernetes cluster named staging. The corresponding environment in helmfile is also named staging.

Some of the above can possibly be treated as implementation details. There is nothing forcing us to have the staging and test releases in a different cluster/environment; they could also reside in the eqiad/codfw ones (addressing them would be a little bit different but not by much). We just chose to go that way for some historical reasons.

[1] You can use those terms kinda interchangeably for this specific discussion we have right now. That might not always be the case, but it sure is right now
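To keep the release/environment matrix straight, the explanation above can be written down as a small table in code. This is just a sketch of my reading of the comments in this task, not something generated from helmfile.yaml; the Release type and its field names are made up.

```python
from typing import NamedTuple, Optional


class Release(NamedTuple):
    name: str                   # helm release name
    environment: str            # helmfile environment / Kubernetes cluster
    served_wiki: Optional[str]  # wiki whose MediaWiki calls this release, if any
    repo_wiki: str              # wiki this release fetches entity data from


# As described above: 1 production release per DC, plus staging and test
# releases that both live in the cluster/environment named "staging".
RELEASES = [
    Release("production", "eqiad", "wikidata.org", "wikidata.org"),
    Release("production", "codfw", "wikidata.org", "wikidata.org"),
    Release("staging", "staging", None, "wikidata.org"),
    Release("test", "staging", "test.wikidata.org", "test.wikidata.org"),
]

for release in RELEASES:
    user = release.served_wiki or "nothing"
    print(f"{release.name}@{release.environment}: used by {user}, talks to {release.repo_wiki}")
```

The name “staging” appearing in both the first and second column is exactly the ambiguity discussed above.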

Thanks a lot – I’ve added some of that information at https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Deployment where it will hopefully be helpful in future.

With your explanation, I think the test and staging releases are each somewhat useful (though I wouldn’t mind if you want to remove one of them either). Additionally, it sounds like it would be useful to make the test release in particular less special; I guess ideally, values-test.yaml would override the config.public.WIKIBASE_REPO (test.wikidata.org instead of www.wikidata.org) and the main_app.version (so we can bump this version before the production one), but almost nothing else? But to me that seems like a separate task. What do you think?

Meanwhile, we should still fix the localhost issue of the production release. My understanding is that changing localhost to 127.0.0.1 might work, but T355686 has been proposed as an alternative solution that might be more sustainable; do you have any preference which one we should go for?

(I was also wondering why the HEALTHCHECK_QUERY in values.yaml, which looks correct to me, didn’t prevent the broken deployment – but as far as I can tell, it’s not actually connected to any Kubernetes liveness/readiness/startup probes like I had assumed. It ends up in some OpenAPI spec x-amples (curl 'https://staging.svc.eqiad.wmnet:4004/?spec') and that’s apparently all.)

With your explanation, I think the test and staging releases are each somewhat useful (though I wouldn’t mind if you want to remove one of them either). Additionally, it sounds like it would be useful to make the test release in particular less special; I guess ideally, values-test.yaml would override the config.public.WIKIBASE_REPO (test.wikidata.org instead of www.wikidata.org) and the main_app.version (so we can bump this version before the production one), but almost nothing else? But to me that seems like a separate task. What do you think?

Definitely a different task. I am also not at all sure right now that the test release can easily be folded in like that; we'll have to see if the service mesh is able to support >1 release being exposed like that.

Meanwhile, we should still fix the localhost issue of the production release. My understanding is that changing localhost to 127.0.0.1 might work, but T355686 has been proposed as an alternative solution that might be more sustainable; do you have any preference which one we should go for?

T355686 is the preferable approach here, solving the problem more generically by having envoy dual stack binding and avoiding having every single application hardcoding localhost to 127.0.0.1.

(I was also wondering why the HEALTHCHECK_QUERY in values.yaml, which looks correct to me, didn’t prevent the broken deployment – but as far as I can tell, it’s not actually connected to any Kubernetes liveness/readiness/startup probes like I had assumed. It ends up in some OpenAPI spec x-amples (curl 'https://staging.svc.eqiad.wmnet:4004/?spec') and that’s apparently all.)

That thing is used by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/service-checker/+/refs/heads/master which runs in our monitoring infrastructure. It utilizes the x-amples stanza to construct and issue queries to the service as part of monitoring, effectively mimicking a simple "user", at least as far as the x-amples stanzas for every endpoint instruct it to.
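Roughly, the idea of turning x-amples stanzas into monitoring requests can be sketched like this. It is not the service-checker implementation: the spec layout (a list of examples per operation, each with a request.query mapping and a response.status) is my assumption about the format, and the sample spec, host name and function name are hypothetical.

```python
from urllib.parse import urlencode


def requests_from_spec(spec, base_url):
    """Turn x-amples stanzas into (method, url, expected_status) tuples."""
    checks = []
    for path, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            if not isinstance(operation, dict):
                continue
            for example in operation.get("x-amples", []):
                request = example.get("request", {})
                query = urlencode(request.get("query", {}))
                url = base_url.rstrip("/") + path + (f"?{query}" if query else "")
                status = example.get("response", {}).get("status", 200)
                checks.append((method.upper(), url, status))
    return checks


# Hypothetical, heavily trimmed spec; the real one comes from the service's ?spec endpoint.
sample_spec = {
    "paths": {
        "/termbox": {
            "get": {
                "x-amples": [
                    {
                        "title": "hypothetical termbox check",
                        "request": {"query": {"entity": "Q1", "revision": 103, "language": "de"}},
                        "response": {"status": 200},
                    }
                ]
            }
        }
    }
}

checks = requests_from_spec(sample_spec, "https://termbox.example:4004")
for method, url, status in checks:
    print(method, url, "expects", status)
```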

Definitely a different task. I am also not at all sure right now that the test release can easily be folded in like that; we'll have to see if the service mesh is able to support >1 release being exposed like that.

Created T355955: [SW] [GENERAL] Simplify Termbox SSR test release.

T355686 is the preferable approach here, solving the problem more generically by having envoy dual stack binding and avoiding having every single application hardcoding localhost to 127.0.0.1.

Alright, then let’s see how that task develops. I’ve set myself a calendar reminder to come back to this task in ~two weeks, because I don’t think we should have a known broken version tagged as latest indefinitely – if the general solution doesn’t happen soon, we should either hard-code 127.0.0.1 after all (we can always revert it later) or revert the Node 18 upgrade for now. (But that’s not meant to hurry or pressure T355686 at all, I just want to make sure we don’t forget about the Wikidata part :))

That thing is used by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/service-checker/+/refs/heads/master which runs in our monitoring infrastructure. It utilizes the x-amples stanza to construct and issue queries to the service as part of monitoring, effectively mimicking a simple "user", at least as far as the x-amples stanzas for every endpoint instruct it to.

I see, thanks. So the broken termbox would’ve shown up in monitoring sooner or later even without me testing it, but it didn’t automatically hold back the new version.

Alright, then let’s see how that task develops. I’ve set myself a calendar reminder to come back to this task in ~two weeks, because I don’t think we should have a known broken version tagged as latest indefinitely – if the general solution doesn’t happen soon, we should either hard-code 127.0.0.1 after all (we can always revert it later) or revert the Node 18 upgrade for now. (But that’s not meant to hurry or pressure T355686 at all, I just want to make sure we don’t forget about the Wikidata part :))

Cool, thanks for the patience.

I see, thanks. So the broken termbox would’ve shown up in monitoring sooner or later even without me testing it, but it didn’t automatically hold back the new version.

Yes. If you do want to call the tool manually, you can via something like

deploy1002:~$ service-checker-swagger -t 60 termbox.svc.eqiad.wmnet https://termbox.discovery.wmnet:4004
All endpoints are healthy

Mess with the arguments a bit and you can test out all 4 releases with this. Note that in our infra we only test against the 2 production releases.

Change 999882 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] termbox: Bump module dependencies

https://gerrit.wikimedia.org/r/999882

Cool, thanks for the patience.

Patches are up for review!

Patches are up for review!

Looks alright to me – I think if another SRE can review the general changes, we can try them out in Termbox and see if it works or not.

Change 999882 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: Bump module dependencies

https://gerrit.wikimedia.org/r/999882

Patches have been deployed, simple curl tests as well as service-checker-swagger checks have passed. I double checked the diff, envoy is listening now on both IPv6 and IPv4.

I think you are unblocked on this and can proceed with the migration.
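For anyone wanting to see the original failure mode in isolation, the following sketch starts a listener bound only to the IPv4 loopback address and then tries to connect over both loopback addresses. It stands in for the pre-fix proxy only by analogy; the port is picked by the OS rather than being 6500, and the helper name reachable is made up. With a dual-stack listener, both attempts would be expected to succeed.

```python
import socket


def reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ECONNREFUSED as well as hosts without IPv6 support.
        return False


# Listener bound to the IPv4 loopback address only (stand-in for an IPv4-only proxy).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

ipv4_ok = reachable("127.0.0.1", port)
ipv6_ok = reachable("::1", port)
listener.close()

print("127.0.0.1 reachable:", ipv4_ok)
print("::1 reachable:", ipv6_ok)
```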

Change 1003400 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Change 1003400 merged by jenkins-bot:

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Lucas_Werkmeister_WMDE claimed this task.

I deployed the update and it’s working as far as I can tell – I think we’re done here! Thanks a lot @akosiaris :)

Change 1003404 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Termbox: Update submodule

https://gerrit.wikimedia.org/r/1003404

Change 1003404 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Termbox: Update submodule

https://gerrit.wikimedia.org/r/1003404