hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (252 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF) [ Global Accounts ]

Recent Activity

Wed, Nov 6

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

Looks like the same crashpad flood issue again. The service needs a restart, and I think we should implement the flags @TheDJ has mentioned.

Wed, Nov 6, 4:33 PM · serviceops, Patch-For-Review, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Tue, Oct 29

hnowlan added a comment to T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE'.

Just to note, Joely has verified the SSH key in this ticket via Slack.

Tue, Oct 29, 9:52 AM · SRE, SRE-Access-Requests

Fri, Oct 25

hnowlan edited projects for T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness, added: SRE-OnFire; removed SRE.
Fri, Oct 25, 3:55 PM · SRE-OnFire, Sustainability (Incident Followup)
hnowlan updated subscribers of T378182: Grant Access to ldap/nda for Deepesha Burse WMDE.

This access requires signing an NDA, adding @KFrancis as per access request documentation. Thanks!

Fri, Oct 25, 3:50 PM · SRE, LDAP-Access-Requests
hnowlan moved T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE' from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 25, 3:48 PM · SRE, SRE-Access-Requests
hnowlan moved T378182: Grant Access to ldap/nda for Deepesha Burse WMDE from Backlog to NDA Pending on the LDAP-Access-Requests board.
Fri, Oct 25, 3:48 PM · SRE, LDAP-Access-Requests
hnowlan closed T378181: Grant Access to ldap/wmde for Deepesha Burse WMDE as Invalid.

Closing as a dupe, following up in T378182.

Fri, Oct 25, 3:40 PM · SRE, LDAP-Access-Requests
hnowlan updated subscribers of T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE'.

This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks!

Fri, Oct 25, 3:37 PM · SRE, SRE-Access-Requests
hnowlan changed the status of T377773: Give Dumps 1.0 access to gmodena from Open to Stalled.
Fri, Oct 25, 3:36 PM · SRE, SRE-Access-Requests
hnowlan moved T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE' from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 25, 3:35 PM · SRE, SRE-Access-Requests

Wed, Oct 23

hnowlan added a comment to T300383: Requesting access to Analytics Private Data Users for Tanja Andic.

Key updated - please let me know if it works.

Wed, Oct 23, 5:12 PM · SRE, SRE-Access-Requests
hnowlan closed T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 as Resolved.

Sessionstore in codfw and eqiad is running with an Envoy TLS terminator, and latencies etc. look acceptable.

Wed, Oct 23, 4:35 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan closed T377792: Grant bd808 membership in the contint-roots and contint-docker groups as Resolved.

Merged!

Wed, Oct 23, 9:10 AM · SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Running the client directly against a k8s worker IP also succeeds, which means that kube-proxy most likely isn't to blame here.

Wed, Oct 23, 9:09 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Tue, Oct 22

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

eqiad is currently using the mesh - codfw is not. We decided to leave this config in place for the evening to get certainty and allow for time constraints. eqiad is looking fine so far. If an emergency revert is needed, both 2adb4cf4c6aa6e534aa7a596e796f5f099abc60f and 622bec969ea59a4352abc1e6daa20313ae1fe4f3 will need to be reverted before applying in eqiad

Tue, Oct 22, 5:58 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

When connecting the same client to a k8s pod IP, the encoding and download of the file complete successfully, so some point in the communication path in between is definitely at fault here. We can now say with reasonable confidence that Envoy and Apache are not to blame. Isolating which part it is will be a bit of a challenge, but it's a clearer task.

Tue, Oct 22, 4:32 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

I've mocked up a horrible Frankenstein script that mimics the TimedMediaHandler behaviour - when directly calling shellbox-video.discovery.wmnet via it, we see the exact same behaviour. This means that at the very least we can rule out failures at the Jobqueue or RunSingleJob layer:

Tue, Oct 22, 3:25 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Are the HTTP requests using chunked transfer encoding or not? (I'm assuming it's all HTTP 1.1 and not 2.0.)

Tue, Oct 22, 11:52 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan created T377830: RunSingleJob.php's readonly backoff behaviour will never be triggered.
Tue, Oct 22, 11:36 AM · serviceops-radar, WMF-JobQueue
hnowlan added a comment to T377773: Give Dumps 1.0 access to gmodena.

Could you please specify which groups access is needed for? There are a few dumps groups but it appears that @gmodena should inherit access to all of them by virtue of being part of the old platform engineering group. The full list of groups is here.

Hi @hnowlan ,

I don't know how group assignment works, but as far as I understand I should be able to impersonate dumpsgen.

I can ssh onto clouddumps, but when I try (as suggested):

[gmodena@clouddumps1002] $ sudo -u dumpsgen whoami

I'm asked for password auth (that fails).

Tue, Oct 22, 10:53 AM · SRE, SRE-Access-Requests
hnowlan added a comment to T377773: Give Dumps 1.0 access to gmodena.

Could you please specify which groups access is needed for? There are a few dumps groups but it appears that @gmodena should inherit access to all of them by virtue of being part of the old platform engineering group. The full list of groups is here.

Tue, Oct 22, 9:45 AM · SRE, SRE-Access-Requests

Mon, Oct 21

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Minor datapoint that hasn't been noted: when testing with a larger file that takes longer to convert, we're seeing the same behaviour. This lends credence to the idea that this issue is not caused by a timeout, and is most likely caused by some kind of issue with the handling and reading of responses, most likely beyond shellbox.

Mon, Oct 21, 5:01 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan removed a project from T35245: SVG files: text (and tspan) elements misplaced when rasterizing to PNG thumbnails/previews (multi-valued x/y, dx/dy attributes): Upstream.

I've removed the Upstream tag as requested. T40010 may be of interest for similar threads of conversation; it might be worth making this task a subtask of that one for now.

Mon, Oct 21, 9:56 AM · Thumbor, Wikimedia-SVG-rendering

Fri, Oct 18

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

Chromium is leaking processes, most likely leaving chromium_crashpads lying around after a failure:

root@wikikube-worker2070:/home/hnowlan# ps uax| grep chrome_crashpad | wc -l
115357
Fri, Oct 18, 9:24 AM · serviceops, Patch-For-Review, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs
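
As a possible follow-up to the raw count above, here is a minimal sketch of grouping the stray handlers by parent PID, to see which process is leaving them behind. The function name `tally_by_parent` and the `<ppid> <comm>` input format are assumptions made for illustration; nothing here comes from the task itself.

```shell
# Hypothetical helper: tally processes whose command name matches $1,
# grouped by parent PID.
# Input on stdin: one "<ppid> <comm>" pair per line.
# Output: "<count> <ppid>" lines, highest count first.
tally_by_parent() {
    awk -v name="$1" '$2 ~ name { n[$1]++ } END { for (p in n) print n[p], p }' |
        sort -rn
}
```

On a worker this could be fed from something like `ps -e -o ppid= -o comm= | tally_by_parent chrome_crashpad`. Reading from stdin rather than calling `ps` inside the function also avoids the self-match that `ps | grep` can introduce.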
hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

I suspect that the issue is that we don't close browser instances, or that we somehow end up in a situation with stale ones. Given the level of traffic/support of the PDF service, would it be enough to just restart the service?

Fri, Oct 18, 9:20 AM · serviceops, Patch-For-Review, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Thu, Oct 17

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@hnowlan @Eevans we could do the following:

  1. Test staging to verify that everything is good.
  2. Depool eqiad from discovery, apply the change, check, repool and watch metrics.
  3. If everything looks good, we move to codfw else we rollback

There is the chance to impact users, but it will be limited and in a controlled environment. Plus we already tested the latency with echostore and the new setting worked nicely. What do you think?

Thu, Oct 17, 9:21 AM · Patch-For-Review, serviceops, Data-Persistence

Wed, Oct 16

hnowlan closed T371699: Build and add Mercurius to PHP base image as Resolved.

Mercurius is now built into the php8.1-fpm-multiversion-base image as of docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2.

Wed, Oct 16, 4:56 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T371699: Build and add Mercurius to PHP base image , a subtask of T355292: Port videoscaling to kubernetes, as Resolved.
Wed, Oct 16, 4:51 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T371699: Build and add Mercurius to PHP base image .

Debian packages are now in the apt repo

Wed, Oct 16, 2:48 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Tue, Oct 15

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

This appears to be a rerun of T375521 - temporary fix last time was a roll restart, but there's clearly a deeper issue.

Tue, Oct 15, 9:34 AM · serviceops, Patch-For-Review, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Mon, Oct 14

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@hnowlan if echostore turns out to work as expected (it sounds so from the other task), we could keep the ball rolling and do sessionstore too, wdyt?

Mon, Oct 14, 2:05 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan placed T320398: Expand upon Kask/Sessionstore documentation up for grabs.
Mon, Oct 14, 1:50 PM · SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup)
hnowlan added a comment to T350143: Write AQS 1 deprecation announcement.

aqs1 is disabled in restbase and the puppet configuration has been removed. All that remains is to archive the codebase and deploy repos.

Mon, Oct 14, 12:32 PM · AQS2.0, Data Products
hnowlan closed T371761: Add bdrwiki to RESTBase as Resolved.
Mon, Oct 14, 12:22 PM · Essential-Work, MediaWiki-Engineering, Content-Transform-Team, RESTBase
hnowlan closed T371761: Add bdrwiki to RESTBase, a subtask of T371760: Post-creation work for bdrwiki, as Resolved.
Mon, Oct 14, 12:21 PM · Countervandalism-Network, Wiki-Setup
hnowlan placed T300914: cpjobqueue not achieving configured concurrency up for grabs.
Mon, Oct 14, 11:51 AM · Platform Team Workboards (Platform Engineering Reliability), WMF-JobQueue, Platform Engineering

Oct 9 2024

hnowlan added a project to T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats : Structured Data Engineering.
Oct 9 2024, 5:19 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan triaged T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats as High priority.
Oct 9 2024, 5:18 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan created T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats .
Oct 9 2024, 5:17 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan added a comment to T376766: echostore's TLS certificate expires on 2024-10-13.

The main reason sessionstore didn't roll ahead with using the mesh was concern around the extremely broad impact any issues might have incurred. The risk profile for echostore is a lot lower, so I think we can move ahead with testing the mesh. I can't quite remember what they were, but I'm fairly sure there's a bug or two in the chart logic, though nothing that isn't obvious and can't be ironed out :)

Oct 9 2024, 9:27 AM · serviceops

Oct 5 2024

hnowlan added a comment to T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC.

Just to explain the issue: a while ago, a rate-limiting feature that was known to be problematic was re-enabled in an emergency due to a harmful surge in traffic. It was left enabled, which caused this issue to recur. I've since disabled the feature and we'll be removing it to prevent it from being erroneously triggered again. However, the fact that this required manual reporting and wasn't noticed on the SRE side isn't really acceptable, so next week I'll be working on adding per-format alerting, so that an increase in errors for a single format is caught before it can have a wide impact. That work will be tracked in T376538.

Oct 5 2024, 5:43 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
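
To illustrate the per-format idea in the comment above, here is a rough sketch that flags formats whose share of 429 responses exceeds a threshold. It is not the alerting tracked in T376538: the function name `formats_over_threshold` and the `<format> <http_status>` input format are hypothetical, chosen only to show the calculation.

```shell
# Hypothetical helper: print "<format> <errors>/<total>" for each format whose
# fraction of 429 responses is above the threshold $1 (a value between 0 and 1).
# Input on stdin: one "<format> <http_status>" pair per line (assumed format).
formats_over_threshold() {
    awk -v t="$1" '
        { total[$1]++; if ($2 == 429) err[$1]++ }
        END {
            for (f in total)
                if (err[f] / total[f] > t)
                    print f, err[f] "/" total[f]
        }' | sort
}
```

Computing the ratio per format, rather than over all requests, is what lets a failure confined to one format (such as PDF) stand out even when overall error rates look normal.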
Don-vip awarded T376538: Per-format monitoring for Thumbor a Fox token.
Oct 5 2024, 5:38 PM · serviceops, Thumbor
hnowlan added a comment to T376534: HTTP 429 errors: PDF thumbnails on Commons not displayed.

Thanks for the report - this was caused by T372470. I'm seeing recoveries on thumbnailing those files, could you confirm?

Oct 5 2024, 5:23 PM · Thumbor
hnowlan created T376538: Per-format monitoring for Thumbor .
Oct 5 2024, 5:19 PM · serviceops, Thumbor
hnowlan reopened T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC as "Open".

I'm seeing recoveries on most of the linked images, but reopening this until we're sure this is resolved.

Oct 5 2024, 5:13 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
hnowlan closed T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC as Resolved.

found T376509 while investigating 429 for https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Feedback_form_Odia_Wikipedia_outreach.pdf/page1-180px-Feedback_form_Odia_Wikipedia_outreach.pdf.jpg

https://commons.wikimedia.org/wiki/File:Feedback_form_Odia_Wikipedia_outreach.pdf

  • 463px (embedded by default in description page) seems fine but I guess that's some kind of cache hit?
  • 180px (embedded in Special:ListFiles) is 429
  • other sizes linked from description page are ok.
  • other sizes I pulled out of thin air also don't work.

ahhhhh, now I found T372470#10113572.

Oct 5 2024, 4:44 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
hnowlan closed T376509: investigate ThumbnailRender volume 2024-09-20 til 2024-10-04 (thumbnail, thumbor) as Invalid.
Oct 5 2024, 4:35 PM · serviceops, Thumbor
hnowlan added a comment to T376509: investigate ThumbnailRender volume 2024-09-20 til 2024-10-04 (thumbnail, thumbor).

High ThumbnailRender volume is normal, this is a constant background process that is ongoing to generate thumbnails on newly uploaded files. The change in the graphs from eqiad to codfw is part of the datacentre switchover (T370962).

Oct 5 2024, 4:33 PM · serviceops, Thumbor

Oct 3 2024

hnowlan added a comment to T371699: Build and add Mercurius to PHP base image .

Mercurius images for bookworm and bullseye are now building via CI (with some modifications for bullseye): https://gitlab.wikimedia.org/hnowlan/mercurius/-/artifacts

Oct 3 2024, 12:00 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Sep 30 2024

hnowlan lowered the priority of T374436: Large file uploads broken via Special:Upload from Unbreak Now! to Medium.
Sep 30 2024, 10:27 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 27 2024

hnowlan added a comment to T374436: Large file uploads broken via Special:Upload.

That looks more like a few thousand times a month on commons to me. Am I reading it wrong?

Sep 27 2024, 3:14 PM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 25 2024

hnowlan closed T375069: wikifunctions error messages are too large for logstash as Declined.

This is fundamentally a bug in NormalizedException or MediaWiki-libs-RequestTimeout; we don't control PHP's exception stack trace length. I filed T374618: Trim exceptions (?in wikimedia/normalized-exception) before they get to syslog, so that they aren't jsonTruncated about this last week – should we mark this as a dupe of that? Dependent on that?

Sep 25 2024, 10:00 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda
hnowlan closed T375069: wikifunctions error messages are too large for logstash, a subtask of T374231: wikifunctions mediawiki instance can't sustain more than 5rps, as Declined.
Sep 25 2024, 9:59 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda

Sep 19 2024

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Just to note, I've been testing by forcing a reencode of this video in VP9 format. This can also be tested by grabbing a job from kafka using kafkacat (kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200) and then POSTing the inner parts of the event via curl to a specific videoscaler to test logging changes etc:

time curl -H "Host: videoscaler.discovery.wmnet" -k -v -v -X POST -d '{"database":"testwiki","type":"webVideoTranscode","params": {"transcodeMode":"derivative" ,"transcodeKey":"240p.vp9.webm","prioritized":false,"manualOverride":true,"remux":false,"requestId":"A_REQ_ID","namespace":6,"title":"CC_1916_10_02_ThePawnshop.mpg"},"mediawiki_signature":"A_SIG"}' https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php
Sep 19 2024, 4:50 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T374911: Some POST of thumbnails to Swift time out.

These all appear to be requests from jobrunner hosts, which leads me to assume they're from the ThumbnailRender job. Could it be an ordering issue where we're triggering thumbnail generation during upload or something? The images themselves all seem to be fine when requested directly.

Sep 19 2024, 1:55 PM · Unstewarded-production-error, MediaWiki-Uploading, Thumbor, Data-Persistence, SRE-swift-storage, Wikimedia-production-error

Sep 18 2024

hnowlan created P69295 (An Untitled Masterwork).
Sep 18 2024, 4:58 PM
hnowlan created T375069: wikifunctions error messages are too large for logstash.
Sep 18 2024, 10:54 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda

Sep 17 2024

hnowlan created P69220 (An Untitled Masterwork).
Sep 17 2024, 2:40 PM

Sep 16 2024

hnowlan added a comment to T374860: Retire mw_wikiversion_difference check.

I think that's fairly on the money, we can probably remove this now. We still have some bare metal deployments on debug (but I think scap is aware of this versioning during a deploy) and videoscaler hosts so we're not completely free of it. But I think at this point we stand to lose little from removing it.

Sep 16 2024, 4:03 PM · serviceops, SRE Observability (FY2024/2025-Q1), Observability-Alerting

Sep 13 2024

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

We have at least partially addressed the healthchecking issues by introducing a second readiness probe on the shellbox app container that checks for an ffmpeg process running, which appears to be working quite well.

Sep 13 2024, 4:17 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
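
A minimal sketch of what a process-based readiness check of this kind might look like. This is an assumption-laden illustration, not the probe that was deployed: it assumes the intent is to report "not ready" while an ffmpeg process is running, and the function name `ready_when_idle` and its stdin input are hypothetical.

```shell
# Hypothetical check: succeed (ready) when no process named exactly "ffmpeg"
# appears on stdin, fail (busy) otherwise.
# Input on stdin: one process name per line.
ready_when_idle() {
    if grep -qx 'ffmpeg'; then
        return 1    # an encode is in progress: report not ready
    fi
    return 0        # idle: report ready
}
```

In a container this could be fed from something like `ps -e -o comm= | ready_when_idle`, with the exit status used as the result of an exec-style probe.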

Sep 12 2024

hnowlan triaged T374436: Large file uploads broken via Special:Upload as Unbreak Now! priority.
Sep 12 2024, 2:27 PM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 11 2024

hnowlan added a comment to T372849: Determine switchover changes for migration of video scaling to k8s.

At this point in time I'd say it's not out of the question that we could have mercurius up and running some jobs, but for the purposes of the switchover I think it makes sense to revert to using videoscalers for the short term. It's a much more well understood problem space and while I hope to have some jobs running via mercurius, I really doubt we'd be doing it for *all* jobs.

Sep 11 2024, 4:04 PM · Datacenter-Switchover, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault:

root@mw1451:/home/hnowlan# for i in `seq 200`; do curl -s 10.67.165.241:9181/fpm-status| grep ^active; sleep 0.2; done | sort | uniq -c
     18 active processes:     1
    182 active processes:     2
Sep 11 2024, 11:31 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a race condition. When all of the following were executed the pod was actively running an ffmpeg process:

Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Sep 11 2024, 10:59 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
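
A sketch of one way to quantify this kind of flapping instead of eyeballing repeated curl output: poll the endpoint, print one HTTP status code per line, and tally the codes. The function names `poll_status` and `tally_codes` are hypothetical, and the polling interval simply mirrors the 0.2 s used elsewhere in this task.

```shell
# Hypothetical helper: request URL $1 a total of $2 times, printing one HTTP
# status code per line (curl reports 000 if the connection fails).
poll_status() {
    url=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        # -o /dev/null drops the body; -w prints the status code plus a newline
        curl -s -o /dev/null -w '%{http_code}\n' "$url"
        sleep 0.2
        i=$((i + 1))
    done
}

# Hypothetical helper: summarise status codes read from stdin
# as "<count> <code>" lines.
tally_codes() {
    sort | uniq -c | awk '{ print $1, $2 }'
}
```

Usage would be along the lines of `poll_status 'http://<pod-ip>:9181/healthz?min_avail_workers=1' 200 | tally_codes`, run from inside the pod's network namespace as in the commands above. Printing the status code with a trailing newline also avoids the response text running into the next shell prompt.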

Sep 10 2024

hnowlan renamed T374436: Large file uploads broken via Special:Upload from Large file uploads broken on (at least) group0 to Large file uploads broken via Special:Upload.
Sep 10 2024, 11:42 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading
hnowlan created T374436: Large file uploads broken via Special:Upload.
Sep 10 2024, 11:39 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 9 2024

hnowlan created T374350: Majority of thumbor containers on pods occasionally getting into a stuck state .
Sep 9 2024, 11:24 AM · serviceops, Thumbor
hnowlan added a comment to T345953: [L] 3d2png uses unsupported/unmaintained packages.

The "deploy" config in package.json is set to node=12.22.12 on target=debian:bullseye - should that then be updated to 18.19.0? (and is there a reason that that is not already the case? would changing it break something?)

The canvas version currently in use (2.11.2), is supposed to work on node>=6, and AFAICT, none of the other existing package versions actually require v18. That said, I'm all for upgrading to more recent versions, especially if we're already running it in prod. I'm not sure how to properly move forward with that, though - esp. since package.json seems to have conflicting information.

Sep 9 2024, 10:05 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Structured Data Engineering, Technical-Debt, Security, 3D

Sep 6 2024

hnowlan removed projects from T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets: netops, Infrastructure-Foundations.
Sep 6 2024, 4:15 PM · serviceops
hnowlan created T374258: Comm Error: backplane 0 when reimaging wikikube-worker2095.
Sep 6 2024, 3:42 PM · SRE, ops-codfw, serviceops, DC-Ops
hnowlan updated the task description for T373916: Relabel codfw kubernetes nodes.
Sep 6 2024, 12:53 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Sep 5 2024

hnowlan updated the task description for T373916: Relabel codfw kubernetes nodes.
Sep 5 2024, 10:47 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Sep 4 2024

hnowlan created P68649 (An Untitled Masterwork).
Sep 4 2024, 2:57 PM
hnowlan added a comment to T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC.

Almost no new PDF file thumbnails can be generated from codfw: https://commons.wikimedia.org/wiki/Special:NewFiles?user=&mediatype%5B%5D=OFFICE&start=&end=&wpFormIdentifier=specialnewimages&limit=50&offset=
The situation is a lot better when switching to eqiad. But the thumbnails rendered in eqiad somehow aren't shared with codfw, so when switching back to codfw, the thumbnails are still not loading.

Sep 4 2024, 9:29 AM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor

Sep 3 2024

hnowlan created T373916: Relabel codfw kubernetes nodes.
Sep 3 2024, 5:54 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
akosiaris awarded T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd a Love token.
Sep 3 2024, 3:15 PM · serviceops, Infrastructure-Foundations
hnowlan added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

I don't see anything obvious in the diff of those two packages.
The systems prior to yesterday seem to have installed systemd-timesyncd during d-i, whereas the new ones did not. There the user creation is the first thing logged in syslog (so right after d-i?).

Sep 3 2024, 9:56 AM · serviceops, Infrastructure-Foundations
hnowlan renamed T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd from Issues reimaging kubernetes workers due to package user issues in systemd-timesyncd to Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .
Sep 3 2024, 9:40 AM · serviceops, Infrastructure-Foundations

Sep 2 2024

hnowlan updated the task description for T373591: Relabel codfw kubernetes nodes.
Sep 2 2024, 5:21 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
hnowlan updated the task description for T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .
Sep 2 2024, 5:10 PM · serviceops, Infrastructure-Foundations
hnowlan updated the task description for T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .
Sep 2 2024, 5:10 PM · serviceops, Infrastructure-Foundations
hnowlan removed projects from T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd : netops, SRE.
Sep 2 2024, 5:10 PM · serviceops, Infrastructure-Foundations
hnowlan created T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .
Sep 2 2024, 5:07 PM · serviceops, Infrastructure-Foundations
hnowlan edited P68518 Masterwork From Distant Lands.
Sep 2 2024, 11:51 AM

Aug 30 2024

hnowlan updated the task description for T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 4:08 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
hnowlan updated the task description for T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 12:28 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
hnowlan reopened T356241: Move video transcoding to use Shellbox, a subtask of T355292: Port videoscaling to kubernetes, as In Progress.
Aug 30 2024, 11:00 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan reopened T356241: Move video transcoding to use Shellbox as "In Progress".
Aug 30 2024, 11:00 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Aug 29 2024

hnowlan added a comment to T345953: [L] 3d2png uses unsupported/unmaintained packages.

Currently in prod we run 3d2png against nodejs v18.19.0 (I believe this was required for some of the canvas components), so hopefully that can help relieve some of these issues!

Aug 29 2024, 3:46 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Structured Data Engineering, Technical-Debt, Security, 3D
hnowlan updated the task description for T373591: Relabel codfw kubernetes nodes.
Aug 29 2024, 2:03 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
hnowlan added a comment to T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH.

fwiw the videoscaler load issues in T373517 are down to how pod/worker timeouts are managed rather than overall capacity, don't worry about it in relation to your work! Currently commons is back on metal, so the majority of encodes are happening there.

Aug 29 2024, 2:03 PM · TimedMediaHandler-Transcode

Aug 28 2024

hnowlan triaged T373517: shellbox-video pods being restarted prematurely as High priority.
Aug 28 2024, 3:05 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan updated the task description for T373517: shellbox-video pods being restarted prematurely.
Aug 28 2024, 1:11 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

As a stopgap measure we've already introduced retries to the videoscaling job

Aug 28 2024, 1:03 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan created T373517: shellbox-video pods being restarted prematurely.
Aug 28 2024, 12:54 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Aug 26 2024

hnowlan closed T356241: Move video transcoding to use Shellbox as Resolved.

TimedMediaHandler now uses shellbox by default.

Aug 26 2024, 2:19 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T356241: Move video transcoding to use Shellbox, a subtask of T355292: Port videoscaling to kubernetes, as Resolved.
Aug 26 2024, 2:15 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Aug 23 2024

hnowlan changed the status of T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC from Open to Stalled.

This appears to have dropped. Leaving open to get patches resolved at a later point

Aug 23 2024, 11:34 AM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor

Aug 16 2024

hnowlan added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Here's a dump of the pod details of each of those IP addresses, counted and sorted by number of appearances: https://phabricator.wikimedia.org/P67349

Aug 16 2024, 12:01 PM · User-notice-archive, MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, Wikimedia-Incident, DBA, Wikimedia-production-error
hnowlan added a comment to T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC.

To better trace this issue, could I get a sample of failing URLs please?

Aug 16 2024, 11:18 AM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor