User Details
- User Since
- Jan 6 2020, 12:19 PM (252 w, 6 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- HNowlan (WMF)
Wed, Nov 6
Looks like the same crashpad flood issue again. The service needs a restart, and I think we should implement the flags @TheDJ has mentioned.
Tue, Oct 29
Just to note, Joely has verified the SSH key in this ticket via Slack.
Fri, Oct 25
This access requires signing an NDA; adding @KFrancis as per the access request documentation. Thanks!
Closing as a duplicate; following up in T378181.
This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks!
Wed, Oct 23
Key updated - please let me know if it works.
sessionstore in codfw and eqiad is running with an Envoy TLS terminator, and latencies etc. look acceptable.
Merged!
Running the client directly against a k8s worker IP also succeeds, which means that kube-proxy most likely isn't to blame here.
Tue, Oct 22
eqiad is currently using the mesh - codfw is not. We decided to leave this config in place for the evening, both to build confidence and to allow for time constraints. eqiad is looking fine so far. If an emergency revert is needed, both 2adb4cf4c6aa6e534aa7a596e796f5f099abc60f and 622bec969ea59a4352abc1e6daa20313ae1fe4f3 will need to be reverted before applying in eqiad.
When connecting the same client to a k8s pod IP, the encoding and download of the file complete successfully, so some point of the communication path in between is definitely at fault here. We can now say with reasonable confidence that Envoy and Apache are not to blame. Isolating which part will be a bit of a challenge, but it's a clearer task.
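To make that comparison concrete, here is a rough sketch of the kind of side-by-side test being described; the namespace, pod IP, port and /healthz path below are placeholders and assumptions, not the real failing request:
# List pod IPs for the service (namespace name is an assumption)
kubectl -n shellbox-video get pods -o wide
# Same request against the discovery/service address and directly against a pod IP
POD_IP=10.67.0.10   # placeholder - substitute a real IP from the kubectl output above
curl -sk -o /dev/null -w 'service: %{http_code} %{time_total}s\n' \
  -H 'Host: shellbox-video.discovery.wmnet' https://shellbox-video.discovery.wmnet/healthz
curl -sk -o /dev/null -w 'pod:     %{http_code} %{time_total}s\n' \
  -H 'Host: shellbox-video.discovery.wmnet' http://$POD_IP:8080/healthz
The same loop can be pointed at a worker node IP to reproduce the kube-proxy comparison mentioned above.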
I've mocked up a horrible Frankenstein script that mimics the TimedMediaHandler behaviour - when calling shellbox-video.discovery.wmnet directly via it, we see the exact same failure. This means that at the very least we can rule out failures at the Jobqueue or RunSingleJob layer:
Mon, Oct 21
Minor datapoint that hasn't been noted - when testing with a larger file that takes longer to convert, we see the same behaviour. This adds credence to the idea that the issue is not caused by a timeout but by some problem in how responses are handled and read, most likely beyond shellbox.
I've removed the Upstream tag as requested. T40010 may be of interest for similar threads of conversation, might be worth making this task a subtask of that one for now.
Fri, Oct 18
Chromium is leaking processes, leaving chrome_crashpad processes lying around, most likely after a failure:
root@wikikube-worker2070:/home/hnowlan# ps uax | grep chrome_crashpad | wc -l
115357
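For reference, a small sketch for keeping an eye on (and eventually reaping) the leaked processes; whether it is safe to simply kill them is an assumption here, not something established above:
# Count leaked crashpad processes on the host
pgrep -fc chrome_crashpad
# Inspect a few before doing anything destructive
pgrep -fa chrome_crashpad | head
# Reap them once confirmed safe (left commented out on purpose)
# pkill -f chrome_crashpad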
Wed, Oct 16
Mercurius is now built into the php8.1-fpm-multiversion-base image as of docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2.
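A quick way to sanity-check the image (the binary name and its presence on $PATH are assumptions on my part):
docker pull docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2
docker run --rm --entrypoint /bin/sh \
  docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2 \
  -c 'command -v mercurius'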
Debian packages are now in the apt repo
Tue, Oct 15
This appears to be a rerun of T375521 - temporary fix last time was a roll restart, but there's clearly a deeper issue.
Mon, Oct 14
aqs1 is disabled in restbase and the puppet configuration has been removed. All that remains is to archive the codebase and deploy repos.
Oct 9 2024
The main reason sessionstore didn't go ahead with using the mesh was concern about the extremely broad impact any issues would have had. The risk profile for echostore is a lot lower, so I think we can move ahead with testing the mesh. I can't quite remember what they were, but I'm fairly sure there's a bug or two in the chart logic - nothing that isn't obvious and can't be ironed out :)
Oct 5 2024
Just to explain the issue - a while ago, a rate-limiting feature that was known to be problematic was re-enabled in an emergency due to a harmful surge in traffic. It was left enabled and caused this issue to recur. I've since disabled the feature and we'll be removing it to prevent it being erroneously triggered again. However, the fact that this required manual reporting and wasn't noticed on the SRE side isn't really acceptable, so next week I'll be working on adding per-format alerting so that an increase in errors for a single format is caught before it can have a wide impact; this will be tracked in T376538.
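As a sketch of the direction for that per-format alerting - the Prometheus host, metric and label names below are hypothetical placeholders, not the real ones:
# Error rate per thumbnail format over the last 5 minutes; an alert would fire when a
# single format's rate climbs well above its baseline
curl -sG 'http://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=sum by (format) (rate(thumbnail_errors_total{status=~"5.."}[5m]))'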
Thanks for the report - this was caused by T372470. I'm seeing recoveries on thumbnailing those files, could you confirm?
I'm seeing recoveries on most of the linked images, but reopening this until we're sure this is resolved.
High ThumbnailRender volume is normal, this is a constant background process that is ongoing to generate thumbnails on newly uploaded files. The change in the graphs from eqiad to codfw is part of the datacentre switchover (T370962).
Oct 3 2024
Mercurius images for bookworm and bullseye are now building via CI (with some modifications for bullseye): https://gitlab.wikimedia.org/hnowlan/mercurius/-/artifacts
Sep 19 2024
Just to note, I've been testing by forcing a reencode of this video in VP9 format. This can also be tested by grabbing a job from kafka using kafkacat (kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200) and then POSTing the inner parts of the event via curl to a specific videoscaler to test logging changes etc:
time curl -H "Host: videoscaler.discovery.wmnet" -k -v -v -X POST -d '{"database":"testwiki","type":"webVideoTranscode","params": {"transcodeMode":"derivative" ,"transcodeKey":"240p.vp9.webm","prioritized":false,"manualOverride":true,"remux":false,"requestId":"A_REQ_ID","namespace":6,"title":"CC_1916_10_02_ThePawnshop.mpg"},"mediawiki_signature":"A_SIG"}' https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php
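Gluing the two steps together, roughly (the jq selection assumes the relevant fields sit at the top level of the job event, and signature handling is glossed over):
kafkacat -C -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200 -e \
  | head -1 \
  | jq -c '{database, type, params, mediawiki_signature}' \
  > /tmp/job.json
time curl -k -H "Host: videoscaler.discovery.wmnet" -X POST -d @/tmp/job.json \
  https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php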
These all appear to be requests from jobrunner hosts, which leads me to assume they're from the ThumbnailRender job. Could it be an ordering issue where we're triggering thumbnail generation during upload or something? The images themselves all seem to be fine when requested directly.
Sep 16 2024
I think that's fairly on the money; we can probably remove this now. We still have some bare metal deployments on debug hosts (though I think scap is aware of this versioning during a deploy) and on videoscaler hosts, so we're not completely free of it. But at this point I think we stand to lose little from removing it.
Sep 13 2024
We have at least partially addressed the healthchecking issues by introducing a second readiness probe on the shellbox app container that checks for an ffmpeg process running, which appears to be working quite well.
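For context, the extra check is roughly of this shape (a sketch only - the exact probe command, and whether a running ffmpeg should map to ready or not-ready, are assumptions here):
#!/bin/sh
# Mark the pod not-ready while a transcode is in flight so no new work is routed to it
if pgrep -x ffmpeg >/dev/null; then
  exit 1
fi
exit 0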
Sep 11 2024
At this point in time I'd say it's not out of the question that we could have mercurius up and running some jobs, but for the purposes of the switchover I think it makes sense to revert to using videoscalers for the short term. It's a much better understood problem space, and while I hope to have some jobs running via mercurius, I really doubt we'd be doing it for *all* jobs.
From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault:
root@mw1451:/home/hnowlan# for i in `seq 200`; do curl -s 10.67.165.241:9181/fpm-status | grep ^active; sleep 0.2; done | sort | uniq -c
     18 active processes: 1
    182 active processes: 2
The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a race condition. When all of the following were executed the pod was actively running an ffmpeg process:
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OK
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
root@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailable
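A quick loop to quantify how often the endpoint answers OK versus 503 while the encode is running, reusing the same pattern as the fpm-status check above:
nsenter -t 313825 -n sh -c \
  'for i in $(seq 50); do curl -s "10.67.139.145:9181/healthz?min_avail_workers=1"; echo; sleep 0.2; done' \
  | sort | uniq -c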
Aug 29 2024
Currently in prod we run 3d2png against nodejs v18.19.0 (I believe this was required for some of the canvas components), so hopefully that can help relieve some of these issues!
fwiw the videoscaler load issues in T373517 are down to how pod/worker timeouts are managed rather than overall capacity - don't worry about it in relation to your work! Currently commons is back on metal, so the majority of encodes are happening there.
Aug 28 2024
As a stopgap measure we've already introduced retries to the videoscaling job
Aug 26 2024
TimedMediaHandler now uses shellbox by default.
Aug 23 2024
This appears to have dropped. Leaving open to get patches resolved at a later point
Aug 16 2024
Here's a dump of the pod details of each of those IP addresses, counted and sorted by number of appearances: https://phabricator.wikimedia.org/P67349
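For reference, the shape of the commands that can produce such a dump (the input file of sampled IPs and the exact columns are assumptions):
# /tmp/ips.txt: one IP per failing request, extracted from the logs beforehand (hypothetical)
sort /tmp/ips.txt | uniq -c | sort -rn > /tmp/ip_counts.txt
# Map each IP back to its pod/namespace
kubectl get pods -A -o wide --no-headers \
  | grep -Ff <(awk '{print $2}' /tmp/ip_counts.txt)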
To better trace this issue, could I get a sample of failing URLs please?