hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (254 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF) [ Global Accounts ]

Recent Activity

Yesterday

hnowlan created T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights.
Tue, Nov 19, 5:31 PM · Kubernetes, serviceops
hnowlan added a comment to T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.

Things done to address this issue so far:

  • Alerting added to detect (unlikely) recurrence (T379559)
  • Prevent failing pods from remaining in service, returning errors (T379561)
Tue, Nov 19, 1:23 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a comment to T380257: Error 429, Too Many Requests for some image.

This appears to be a problem with the image: rsvg-convert returns rendering error: NoMemory (which is a bit misleading). My understanding of SVG internals is relatively limited, but it seems like the line patternTransform="matrix(0.142 -0.0168 -0.0205 -0.1008 -91816.0078 -14072.0449)" is a recurrence of T292439.

Tue, Nov 19, 1:02 PM · Thumbor
hnowlan added a comment to T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.

If it helps, I still face problems, last one three minutes ago with https://commons.wikimedia.org/wiki/File:Telline_(Donax_trunculus)_(Ifremer_00673-78543).jpg

Tue, Nov 19, 12:38 PM · Structured-Data-Backlog, serviceops, Thumbor

Mon, Nov 18

hnowlan closed T379559: Alert on high Thumbor per-pod error rate as Resolved.
Mon, Nov 18, 5:11 PM · serviceops, Thumbor
hnowlan closed T379559: Alert on high Thumbor per-pod error rate, a subtask of T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error, as Resolved.
Mon, Nov 18, 5:11 PM · Structured-Data-Backlog, serviceops, Thumbor
kamila awarded T356241: Move video transcoding to use Shellbox a Party Time token.
Mon, Nov 18, 3:24 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
CDanis awarded T356241: Move video transcoding to use Shellbox a Party Time token.
Mon, Nov 18, 2:53 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T356241: Move video transcoding to use Shellbox as Resolved.

As of the 13th of November, all video transcoding has been moved to shellbox-video. The service seems quite stable. We'll reclaim the videoscaler hardware at a later point.

Mon, Nov 18, 2:51 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T356241: Move video transcoding to use Shellbox, a subtask of T355292: Port videoscaling to kubernetes, as Resolved.
Mon, Nov 18, 2:46 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
kamila awarded T373517: shellbox-video pods being restarted prematurely a Party Time token.
Mon, Nov 18, 1:35 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T373517: shellbox-video pods being restarted prematurely as Resolved.

We've migrated to shellbox-video and the pod failures are no longer an issue thanks to the use of both the process check and tcp keepalives.

Mon, Nov 18, 10:14 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T373517: shellbox-video pods being restarted prematurely, a subtask of T356241: Move video transcoding to use Shellbox, as Resolved.
Mon, Nov 18, 10:14 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Thu, Nov 14

hnowlan added a subtask for T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error: T379901: Create tool to monitor and automatically delete misbehaving pods.
Thu, Nov 14, 10:53 AM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a parent task for T379901: Create tool to monitor and automatically delete misbehaving pods: T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.
Thu, Nov 14, 10:53 AM · serviceops, Kubernetes

Wed, Nov 13

hnowlan closed T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods, a subtask of T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error, as Resolved.
Wed, Nov 13, 5:39 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan closed T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods as Resolved.
Wed, Nov 13, 5:39 PM · serviceops, Thumbor

Tue, Nov 12

hnowlan changed the status of T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods from Open to In Progress.
Tue, Nov 12, 12:29 PM · serviceops, Thumbor
hnowlan changed the status of T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods, a subtask of T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error, from Open to In Progress.
Tue, Nov 12, 12:29 PM · Structured-Data-Backlog, serviceops, Thumbor

Mon, Nov 11

Don-vip awarded T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error a Hungry Hippo token.
Mon, Nov 11, 6:31 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan triaged T379569: Reconsider use of `timeout` in Thumbor as High priority.
Mon, Nov 11, 4:39 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan claimed T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods.
Mon, Nov 11, 4:04 PM · serviceops, Thumbor
hnowlan created T379569: Reconsider use of `timeout` in Thumbor.
Mon, Nov 11, 4:02 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan created T379561: Thumbor haproxy readiness check isn't failing on unhealthy pods.
Mon, Nov 11, 2:45 PM · serviceops, Thumbor
hnowlan renamed T379559: Alert on high Thumbor per-pod error rate from Alert on high per-pod error rate to Alert on high Thumbor per-pod error rate.
Mon, Nov 11, 2:33 PM · serviceops, Thumbor
hnowlan created T379559: Alert on high Thumbor per-pod error rate.
Mon, Nov 11, 2:33 PM · serviceops, Thumbor
hnowlan raised the priority of T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error from High to Unbreak Now!.
Mon, Nov 11, 2:31 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan renamed T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error from Majority of thumbor containers on pods occasionally getting into a stuck state to Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.
Mon, Nov 11, 1:29 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a comment to T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.

tldr: we have an issue with tiff conversion that is causing workers to block indefinitely, revealing a multitude of issues.

Mon, Nov 11, 1:26 PM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a comment to T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.

If we see a recurrence of this in future, please isolate the pod rather than delete it so it can be debugged

Mon, Nov 11, 10:23 AM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a comment to T379426: "Error: 500, Internal Server Error" during thumbnail generation.

This is a recurrence of T374350

Mon, Nov 11, 10:23 AM · Wikimedia-Incident, SRE, Thumbor

Wed, Nov 6

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

Looks like the same crashpad flood issue again. The service needs a restart, and I think we should implement the flags @TheDJ has mentioned.

Wed, Nov 6, 4:33 PM · serviceops, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Tue, Oct 29

hnowlan added a comment to T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE'.

Just to note Joely has verified the SSH key in this ticket via slack

Tue, Oct 29, 9:52 AM · SRE, SRE-Access-Requests

Fri, Oct 25

hnowlan edited projects for T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness, added: SRE-OnFire; removed SRE.
Fri, Oct 25, 3:55 PM · SRE-OnFire, Sustainability (Incident Followup)
hnowlan updated subscribers of T378182: Grant Access to ldap/nda for Deepesha Burse WMDE.

This access requires signing an NDA, adding @KFrancis as per access request documentation. Thanks!

Fri, Oct 25, 3:50 PM · SRE, LDAP-Access-Requests
hnowlan moved T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE' from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 25, 3:48 PM · SRE, SRE-Access-Requests
hnowlan moved T378182: Grant Access to ldap/nda for Deepesha Burse WMDE from Backlog to NDA Pending on the LDAP-Access-Requests board.
Fri, Oct 25, 3:48 PM · SRE, LDAP-Access-Requests
hnowlan closed T378181: Grant Access to ldap/wmde for Deepesha Burse WMDE as Invalid.

closing as dupe, following up in T378181

Fri, Oct 25, 3:40 PM · SRE, LDAP-Access-Requests
hnowlan updated subscribers of T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE'.

This request first requires signing an NDA with Legal - tagging @KFrancis as per the access request process. Thanks!

Fri, Oct 25, 3:37 PM · SRE, SRE-Access-Requests
hnowlan changed the status of T377773: Give Dumps 1.0 access to gmodena from Open to Stalled.
Fri, Oct 25, 3:36 PM · SRE, SRE-Access-Requests
hnowlan moved T378082: Requesting access to 'deployment' for 'Joely Rooke WMDE' from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 25, 3:35 PM · SRE, SRE-Access-Requests

Wed, Oct 23

hnowlan added a comment to T300383: Requesting access to Analytics Private Data Users for Tanja Andic.

Key updated - please let me know if it works.

Wed, Oct 23, 5:12 PM · SRE, SRE-Access-Requests
hnowlan closed T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 as Resolved.

sessionstore codfw and eqiad are running with an envoy tls terminator, and latencies etc look acceptable.

Wed, Oct 23, 4:35 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan closed T377792: Grant bd808 membership in the contint-roots and contint-docker groups as Resolved.

Merged!

Wed, Oct 23, 9:10 AM · SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Running the client directly against a k8s worker IP also succeeds, which means that kube-proxy most likely isn't to blame here.

Wed, Oct 23, 9:09 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Tue, Oct 22

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

eqiad is currently using the mesh - codfw is not. We decided to leave this config in place for the evening to get certainty and allow for time constraints. eqiad is looking fine so far. If an emergency revert is needed, both 2adb4cf4c6aa6e534aa7a596e796f5f099abc60f and 622bec969ea59a4352abc1e6daa20313ae1fe4f3 will need to be reverted before applying in eqiad

Tue, Oct 22, 5:58 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

When connecting the same client to a k8s pod IP, the encoding and download of the file complete successfully, so some point in the communication path in between is definitely at fault. We can now say with reasonable confidence that Envoy and Apache are not at fault here. Isolating which part will be a bit of a challenge, but it's a clearer task.

Tue, Oct 22, 4:32 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

I've mocked up a horrible Frankenstein script that mimics the TimedMediaHandler behaviour - when directly calling shellbox-video.discovery.wmnet via it, we see the exact same behaviour. This means that at the very least we can rule out failures at the Jobqueue or RunSingleJob layer:

Tue, Oct 22, 3:25 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Are the HTTP requests using chunked transfer encoding or not? (I'm assuming it's all HTTP 1.1 and not 2.0)

Tue, Oct 22, 11:52 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan created T377830: RunSingleJob.php's readonly backoff behaviour will never be triggered.
Tue, Oct 22, 11:36 AM · serviceops-radar, WMF-JobQueue
hnowlan added a comment to T377773: Give Dumps 1.0 access to gmodena.

Could you please specify which groups access is needed to? There are a few dumps groups but it appears that @gmodena should inherit access to all of them by virtue of being part of the old platform engineering group. The full list of groups is here.

Hi @hnowlan ,

I don't know how group assignment works, but as far as I understand I should be able to impersonate dumpsgen.

I can ssh onto clouddumps, but when I try (as suggested):

[gmodena@clouddumps1002] $ sudo -u dumpsgen whoami

I'm asked for password auth (that fails).

Tue, Oct 22, 10:53 AM · SRE, SRE-Access-Requests
hnowlan added a comment to T377773: Give Dumps 1.0 access to gmodena.

Could you please specify which groups access is needed to? There are a few dumps groups but it appears that @gmodena should inherit access to all of them by virtue of being part of the old platform engineering group. The full list of groups is here.

Tue, Oct 22, 9:45 AM · SRE, SRE-Access-Requests

Mon, Oct 21

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Minor datapoint that hasn't been noted - when testing with a larger file that takes longer to convert, we're seeing the same behaviour. This adds credence to the idea that this issue is not caused by a timeout, and is most likely caused by some kind of issue with the handling of and reading of responses, most likely beyond shellbox.

Mon, Oct 21, 5:01 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan removed a project from T35245: SVG files: text (and tspan) elements misplaced when rasterizing to PNG thumbnails/previews (multi-valued x/y, dx/dy attributes): Upstream.

I've removed the Upstream tag as requested. T40010 may be of interest for similar threads of conversation, might be worth making this task a subtask of that one for now.

Mon, Oct 21, 9:56 AM · Thumbor, Wikimedia-SVG-rendering

Oct 18 2024

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

Chromium is leaking processes, most likely leaving chromium_crashpads lying around after a failure:

root@wikikube-worker2070:/home/hnowlan# ps uax| grep chrome_crashpad | wc -l
115357
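
A count taken this way can include the grep process itself, since its own command line contains the pattern. A minimal sketch of a filter that skips such lines follows; `count_procs` is a hypothetical helper name introduced for illustration, and it assumes `ps`-style lines on standard input.

```shell
# Hypothetical helper: count lines of `ps` output that mention a pattern,
# skipping any line that is itself a grep invocation.
count_procs() {
  awk -v pat="$1" '
    index($0, pat) && !index($0, "grep") { n++ }
    END { print n + 0 }
  '
}

# Example usage against the live process table:
# ps uax | count_procs chrome_crashpad
```

With a count this large the off-by-one hardly matters, but the same filter is reusable when checking whether the number drops back towards zero after a restart.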
Oct 18 2024, 9:24 AM · serviceops, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs
hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

I suspect that the issue is that we don't close browser instances, or that we somehow end up in a situation with stale browser instances. Given the level of traffic/support of the PDF service, would it be enough to just restart the service?

Oct 18 2024, 9:20 AM · serviceops, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Oct 17 2024

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@hnowlan @Eevans we could do the following:

  1. Test staging to verify that everything is good.
  2. Depool eqiad from discovery, apply the change, check, repool and watch metrics.
  3. If everything looks good, we move to codfw else we rollback

There is the chance to impact users, but it will be limited and in a controlled environment. Plus we already tested the latency with echostore and the new setting worked nicely. What do you think?

Oct 17 2024, 9:21 AM · Patch-For-Review, serviceops, Data-Persistence

Oct 16 2024

hnowlan closed T371699: Build and add Mercurius to PHP base image as Resolved.

Mercurius is now built into the php8.1-fpm-multiversion-base image as of docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.30-2.

Oct 16 2024, 4:56 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan closed T371699: Build and add Mercurius to PHP base image , a subtask of T355292: Port videoscaling to kubernetes, as Resolved.
Oct 16 2024, 4:51 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T371699: Build and add Mercurius to PHP base image .

Debian packages are now in the apt repo

Oct 16 2024, 2:48 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Oct 15 2024

hnowlan added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

This appears to be a rerun of T375521 - temporary fix last time was a roll restart, but there's clearly a deeper issue.

Oct 15 2024, 9:34 AM · serviceops, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Oct 14 2024

hnowlan added a comment to T363996: Sessionstore's discovery TLS cert will expire before end of May 2024.

@hnowlan if echostore turns out to work as expected (it sounds so from the other task), we could keep the ball rolling and do session store too wdyt?

Oct 14 2024, 2:05 PM · Patch-For-Review, serviceops, Data-Persistence
hnowlan placed T320398: Expand upon Kask/Sessionstore documentation up for grabs.
Oct 14 2024, 1:50 PM · SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup)
hnowlan added a comment to T350143: Write AQS 1 deprecation announcement.

aqs1 is disabled in restbase and the puppet configuration has been removed. All that remains is to archive the codebase and deploy repos.

Oct 14 2024, 12:32 PM · AQS2.0, Data Products
hnowlan closed T371761: Add bdrwiki to RESTBase as Resolved.
Oct 14 2024, 12:22 PM · Essential-Work, MediaWiki-Engineering, Content-Transform-Team, RESTBase
hnowlan closed T371761: Add bdrwiki to RESTBase, a subtask of T371760: Post-creation work for bdrwiki, as Resolved.
Oct 14 2024, 12:21 PM · Countervandalism-Network, Wiki-Setup
hnowlan placed T300914: cpjobqueue not achieving configured concurrency up for grabs.
Oct 14 2024, 11:51 AM · Platform Team Workboards (Platform Engineering Reliability), WMF-JobQueue, Platform Engineering

Oct 9 2024

hnowlan added a project to T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats : Structured Data Engineering.
Oct 9 2024, 5:19 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan triaged T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats as High priority.
Oct 9 2024, 5:18 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan created T376828: Thumbor's use of the `expensive` poolcounter queue can break rendering formats .
Oct 9 2024, 5:17 PM · Structured-Data-Backlog, Structured Data Engineering, serviceops, Thumbor
hnowlan added a comment to T376766: echostore's TLS certificate expires on 2024-10-13.

The main reason sessionstore didn't roll ahead with using the mesh was concern around the extremely broad impact any issues might have incurred. The risk profile for echostore is a lot lower, so I think we can move ahead with testing the mesh. I can't quite remember what they were, but I'm fairly sure there's a bug or two in the chart logic, but nothing that isn't obvious and can't be ironed out :)

Oct 9 2024, 9:27 AM · serviceops

Oct 5 2024

hnowlan added a comment to T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC.

Just to explain the issue: a while ago, a rate-limiting feature that was known to be problematic was re-enabled in an emergency due to a harmful surge in traffic. It was left enabled, which caused this issue to recur. I've since disabled the feature, and we'll be removing it to prevent it from being erroneously triggered again. However, the fact that this required manual reporting and wasn't noticed on the SRE side isn't really acceptable, so next week I'll be working on adding per-format alerting (tracked in T376538) so that an increase in errors for a single format is caught before it can have a wide impact.

Oct 5 2024, 5:43 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
Don-vip awarded T376538: Per-format monitoring for Thumbor a Fox token.
Oct 5 2024, 5:38 PM · serviceops, Thumbor
hnowlan added a comment to T376534: HTTP 429 errors: PDF thumbnails on Commons not displayed.

Thanks for the report - this was caused by T372470. I'm seeing recoveries on thumbnailing those files, could you confirm?

Oct 5 2024, 5:23 PM · Thumbor
hnowlan created T376538: Per-format monitoring for Thumbor .
Oct 5 2024, 5:19 PM · serviceops, Thumbor
hnowlan reopened T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC as "Open".

I'm seeing recoveries on most of the linked images, but reopening this until we're sure this is resolved.

Oct 5 2024, 5:13 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
hnowlan closed T372470: Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC as Resolved.

found T376509 while investigating 429 for https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Feedback_form_Odia_Wikipedia_outreach.pdf/page1-180px-Feedback_form_Odia_Wikipedia_outreach.pdf.jpg

https://commons.wikimedia.org/wiki/File:Feedback_form_Odia_Wikipedia_outreach.pdf

  • 463px (embedded by default in description page) seems fine but I guess that's some kind of cache hit?
  • 180px (embedded in Special:ListFiles) is 429
  • other sizes linked from description page are ok.
  • other sizes I pulled out of thin air also don't work.

ahhhhh, now I found T372470#10113572.

Oct 5 2024, 4:44 PM · Patch-For-Review, serviceops, All-and-every-Wikisource, Thumbor
hnowlan closed T376509: investigate ThumbnailRender volume 2024-09-20 til 2024-10-04 (thumbnail, thumbor) as Invalid.
Oct 5 2024, 4:35 PM · serviceops, Thumbor
hnowlan added a comment to T376509: investigate ThumbnailRender volume 2024-09-20 til 2024-10-04 (thumbnail, thumbor).

High ThumbnailRender volume is normal, this is a constant background process that is ongoing to generate thumbnails on newly uploaded files. The change in the graphs from eqiad to codfw is part of the datacentre switchover (T370962).

Oct 5 2024, 4:33 PM · serviceops, Thumbor

Oct 3 2024

hnowlan added a comment to T371699: Build and add Mercurius to PHP base image .

Mercurius images for bookworm and bullseye are now building via CI (with some modifications for bullseye): https://gitlab.wikimedia.org/hnowlan/mercurius/-/artifacts

Oct 3 2024, 12:00 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Sep 30 2024

hnowlan lowered the priority of T374436: Large file uploads broken via Special:Upload from Unbreak Now! to Medium.
Sep 30 2024, 10:27 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 27 2024

hnowlan added a comment to T374436: Large file uploads broken via Special:Upload.

That looks more like a few thousand times a month on commons to me. Am I reading it wrong?

Sep 27 2024, 3:14 PM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 25 2024

hnowlan closed T375069: wikifunctions error messages are too large for logstash as Declined.

This is fundamentally a bug in NormalizedException or MediaWiki-libs-RequestTimeout; we don't control PHP's exception stack trace length. I filed T374618: Trim exceptions (?in wikimedia/normalized-exception) before they get to syslog, so that they aren't jsonTruncated about this last week – should we mark this as a dupe of that? Dependent on that?

Sep 25 2024, 10:00 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda
hnowlan closed T375069: wikifunctions error messages are too large for logstash, a subtask of T374231: wikifunctions mediawiki instance can't sustain more than 5rps, as Declined.
Sep 25 2024, 9:59 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda

Sep 19 2024

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

Just to note, I've been testing by forcing a reencode of this video in VP9 format. This can also be tested by grabbing a job from kafka using kafkacat (kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200) and then POSTing the inner parts of the event via curl to a specific videoscaler to test logging changes etc:

time curl -H "Host: videoscaler.discovery.wmnet" -k -v -v -X POST -d '{"database":"testwiki","type":"webVideoTranscode","params": {"transcodeMode":"derivative" ,"transcodeKey":"240p.vp9.webm","prioritized":false,"manualOverride":true,"remux":false,"requestId":"A_REQ_ID","namespace":6,"title":"CC_1916_10_02_ThePawnshop.mpg"},"mediawiki_signature":"A_SIG"}' https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php
Sep 19 2024, 4:50 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T374911: Some POST of thumbnails to Swift time out.

These all appear to be requests from jobrunner hosts, which leads me to assume they're from the ThumbnailRender job. Could it be an ordering issue where we're triggering thumbnail generation during upload or something? The images themselves all seem to be fine when requested directly.

Sep 19 2024, 1:55 PM · Unstewarded-production-error, MediaWiki-Uploading, Thumbor, Data-Persistence, SRE-swift-storage, Wikimedia-production-error

Sep 18 2024

hnowlan created P69295 (An Untitled Masterwork).
Sep 18 2024, 4:58 PM
hnowlan created T375069: wikifunctions error messages are too large for logstash.
Sep 18 2024, 10:54 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda

Sep 17 2024

hnowlan created P69220 (An Untitled Masterwork).
Sep 17 2024, 2:40 PM

Sep 16 2024

hnowlan added a comment to T374860: Retire mw_wikiversion_difference check.

I think that's fairly on the money, we can probably remove this now. We still have some bare metal deployments on debug (but I think scap is aware of this versioning during a deploy) and videoscaler hosts so we're not completely free of it. But I think at this point we stand to lose little from removing it.

Sep 16 2024, 4:03 PM · serviceops, SRE Observability (FY2024/2025-Q1), Observability-Alerting

Sep 13 2024

hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

We have at least partially addressed the healthchecking issues by introducing a second readiness probe on the shellbox app container that checks for an ffmpeg process running, which appears to be working quite well.
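
As a rough illustration of a process-based readiness check, the sketch below succeeds only when no ffmpeg process is present. The function name `probe_idle` and the choice to report a busy pod as not ready are assumptions made for illustration; the actual probe used for shellbox-video is not described here.

```shell
# Hypothetical readiness check: return 0 (ready) when no ffmpeg process
# exists, return 1 (not ready) when one does. The polarity is an assumption.
probe_idle() {
  # Optional $1: newline-separated process names (handy for testing);
  # defaults to the command names in the live process table.
  procs="${1-$(ps -eo comm=)}"
  if printf '%s\n' "$procs" | grep -qx 'ffmpeg'; then
    return 1
  fi
  return 0
}
```

A check of this shape looks at the process table directly, so it does not depend on the HTTP health endpoint answering consistently while a worker is busy.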

Sep 13 2024, 4:17 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Sep 12 2024

hnowlan triaged T374436: Large file uploads broken via Special:Upload as Unbreak Now! priority.
Sep 12 2024, 2:27 PM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 11 2024

hnowlan added a comment to T372849: Determine switchover changes for migration of video scaling to k8s.

At this point in time I'd say it's not out of the question that we could have mercurius up and running some jobs, but for the purposes of the switchover I think it makes sense to revert to using videoscalers for the short term. It's a much more well understood problem space and while I hope to have some jobs running via mercurius, I really doubt we'd be doing it for *all* jobs.

Sep 11 2024, 4:04 PM · Datacenter-Switchover, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault:

root@mw1451:/home/hnowlan# for i in `seq 200`; do curl -s 10.67.165.241:9181/fpm-status| grep ^active; sleep 0.2; done | sort | uniq -c
     18 active processes:     1
    182 active processes:     2
Sep 11 2024, 11:31 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
hnowlan added a comment to T373517: shellbox-video pods being restarted prematurely.

The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a race condition. When all of the following were executed the pod was actively running an ffmpeg process:

Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OKroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OKroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
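
To make an inconsistency like this easier to quantify, the responses can be tallied by HTTP status code rather than read off the terminal. This is a sketch only: `tally_codes` is a hypothetical helper, and the commented loop reuses the address from the output above without the `nsenter` wrapper.

```shell
# Hypothetical helper: read one HTTP status code per line on stdin and
# print "<code> <count>" for each distinct code.
tally_codes() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (URL quoted so the shell does not treat "?" as a glob;
# run it in the pod's network namespace, as in the commands above):
# for i in $(seq 50); do
#   curl -s -o /dev/null -w '%{http_code}\n' \
#     'http://10.67.139.145:9181/healthz?min_avail_workers=1'
#   sleep 0.2
# done | tally_codes
```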
Sep 11 2024, 10:59 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Sep 10 2024

hnowlan renamed T374436: Large file uploads broken via Special:Upload from Large file uploads broken on (at least) group0 to Large file uploads broken via Special:Upload.
Sep 10 2024, 11:42 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading
hnowlan created T374436: Large file uploads broken via Special:Upload.
Sep 10 2024, 11:39 AM · Unstewarded-production-error, MW-1.43-notes (1.43.0-wmf.26; 2024-10-08), Patch-For-Review, Regression, Wikimedia-production-error, serviceops, MediaWiki-Uploading

Sep 9 2024

hnowlan created T374350: Thumbor workers hang indefinitely when conducting some tiff operations, leading to user-facing error.
Sep 9 2024, 11:24 AM · Structured-Data-Backlog, serviceops, Thumbor
hnowlan added a comment to T345953: [L] 3d2png uses unsupported/unmaintained packages.

The "deploy" config in package.json is set to node=12.22.12 on target=debian:bullseye - should that then be updated to 18.19.0? (and is there a reason that that is not already the case? would changing it break something?)

The canvas version currently in use (2.11.2), is supposed to work on node>=6, and AFAICT, none of the other existing package versions actually require v18. That said, I'm all for upgrading to more recent versions, especially if we're already running it in prod. I'm not sure how to properly move forward with that, though - esp. since package.json seems to have conflicting information.

Sep 9 2024, 10:05 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Structured Data Engineering, Technical-Debt, Security, 3D

Sep 6 2024

hnowlan removed projects from T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets: netops, Infrastructure-Foundations.
Sep 6 2024, 4:15 PM · serviceops