Coder (specifically tailnet) is using a lot of memory with 2000 workspaces and 30% receiving web terminal traffic (potentially OOM killed because of it) #11797

mafredri · 2024-01-24T14:24:22Z

With 2000 workspaces and sending web terminal traffic to 30% of them (600), we see that coder memory usage shoots up.

The OOM kill did not happen, however, until we also started a concurrent workspace app proxying test. But I suppose web terminals and workspace apps function on much the same principle.

From the heap, we can observe that a lot of this memory is allocated by WireGuard:

IIRC, single tailnet was supposed to help with this. So I'm wondering if this is a regression?
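For anyone wanting to make a similar heap observation, here is a minimal sketch of capturing a heap profile from a Go process with the standard `runtime/pprof` package. It is a generic illustration and not how coderd exposes its profiles; the helper name `captureHeapProfile` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// captureHeapProfile is a hypothetical helper. It forces a GC so the
// profile reflects up-to-date allocation statistics, then serializes
// the heap profile into memory.
func captureHeapProfile() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	prof, err := captureHeapProfile()
	if err != nil {
		fmt.Println("heap profile failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of heap profile\n", len(prof))
}
```

Written to a file, the result can be inspected with `go tool pprof -top <file>` to see which packages hold the most memory.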

// ping @coadler @spikecurtis

coadler · 2024-01-24T20:26:14Z

Was this run on main? #11789 has a bugfix that definitely has a negative performance impact on single tailnet, but that should be resolved by #11810.

mafredri · 2024-01-24T22:55:11Z

@coadler nah, this was v2.7.1, I can try running against main tomorrow to see if it performs any better. 👍🏻

spikecurtis · 2024-01-25T10:44:18Z

I don't see how #11789 would affect WireGuard, since WireGuard is just pushing IP (layer 3) packets and HTTP connection caching is TCP (layer 4). Also, the web terminal uses a single websocket, so caching the connection won't help.

mafredri · 2024-01-25T17:18:05Z

I gave https://ghcr.io/coder/coder-preview:main-2.7.1-devel-3d76e1b55 a try. With 100 web terminal connections, memory usage seemed to have improved, but after another try with 500 web terminal connections, we ran out of memory again (16 GB limit) with two coder instances. In other words, we went from 2 GB used to >16 GB + OOM.

I still believe there's been a regression since we initially tried out the single_tailnet experiment. There we could handle 500 web terminals with ease and only 4 GB of RAM.
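As a rough back-of-the-envelope comparison of the two runs, the figures above can be divided by the connection count. This sketch assumes memory scales linearly with connections, that 1 GB = 1000 MB, and that 16 GB is only a lower bound since the process was OOM killed at the limit. It also ignores whether the limit applies per instance or in total. The helper `perConnMB` is hypothetical.

```go
package main

import "fmt"

// perConnMB is a hypothetical helper: total memory in GB spread evenly
// across connections, reported in MB (assuming 1 GB = 1000 MB).
func perConnMB(totalGB float64, conns int) float64 {
	return totalGB * 1000 / float64(conns)
}

func main() {
	// Lower bound: memory hit the 16 GB limit before the OOM kill.
	regressed := perConnMB(16, 500)
	// Earlier single_tailnet experiment: 500 web terminals in 4 GB.
	earlier := perConnMB(4, 500)
	fmt.Printf("now: >= %.1f MB/conn, earlier: %.1f MB/conn, ratio: >= %.1fx\n",
		regressed, earlier, regressed/earlier)
}
```

Under those assumptions the difference is at least a factor of four per connection.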

spikecurtis · 2024-01-26T05:19:31Z

Perhaps we aren't correctly detecting the agents as not-legacy.

I propose we don't chase our tails just yet because I have a PR to remove wsconncache entirely, which moots the legacy vs not legacy question.

mafredri added the api (Area: HTTP API) and networking (Area: networking) labels Jan 24, 2024

cdr-bot bot added the bug label Jan 24, 2024

mafredri mentioned this issue Jan 24, 2024

Prometheus metrics aggregator pegging coderd CPU and potentially causing OOM kill #11775

Closed

github-actions bot added the stale (This issue is like stale bread.) label Jul 25, 2024

github-actions bot closed this as not planned Aug 1, 2024
