Coder (specifically tailnet) is using a lot of memory with 2000 workspaces and 30% of them receiving web terminal traffic (potentially OOM killed because of it) #11797

Closed
mafredri opened this issue Jan 24, 2024 · 5 comments
Labels: api (Area: HTTP API), networking (Area: networking), stale (This issue is like stale bread.)

Comments

@mafredri
Member

With 2000 workspaces and web terminal traffic being sent to 30% of them (600), we see coder memory usage shoot up.

The OOM kill did not happen, however, until we also started a concurrent workspace app proxying test. But I suppose web terminal and workspace apps function on much the same principle.

image

From the heap, we can observe that a lot of this memory is allocated by WireGuard:

image image
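For readers who want to reproduce this kind of heap inspection in a Go service, here is a minimal sketch using the standard library's `runtime/pprof`. It is not necessarily how the profiles in the screenshots above were captured; the helper name `heapProfileBytes` and the output path `heap.pprof` are hypothetical and chosen only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfileBytes returns a heap profile of the current process in the
// gzip-compressed pprof format. (Hypothetical helper, for illustration only.)
func heapProfileBytes() ([]byte, error) {
	// Force a GC first so the profile reflects up-to-date allocation statistics.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfileBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "heap profile failed:", err)
		os.Exit(1)
	}

	// "heap.pprof" is an assumed output path, not something from the issue.
	if err := os.WriteFile("heap.pprof", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to heap.pprof\n", len(data))
}
```

The resulting file can be inspected with `go tool pprof -top heap.pprof`, which lists the functions responsible for the most in-use memory; that is the kind of view in which a single package dominating the heap would stand out.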

IIRC, single tailnet was supposed to help with this. So I'm wondering if this is a regression?

// ping @coadler @spikecurtis

@coadler
Contributor

coadler commented Jan 24, 2024

Was this run on main? #11789 has a bugfix that definitely has a negative performance impact on single tailnet, but that should be resolved by #11810.

@mafredri
Member Author

@coadler nah, this was v2.7.1, I can try running against main tomorrow to see if it performs any better. 👍🏻

@spikecurtis
Contributor

I don't see how #11789 would affect WireGuard, since WireGuard is just pushing IP (layer 3) packets and HTTP connection caching operates on TCP (layer 4). Also, the web terminal uses a single websocket, so caching the connection won't help.

@mafredri
Member Author

I gave https://ghcr.io/coder/coder-preview:main-2.7.1-devel-3d76e1b55 a try, and with 100 web terminal connections memory usage seemed to have improved. But after another try with 500 web terminal connections, we ran out of memory again (16 GB limit) with two coder instances. In other words, we went from 2 GB used to >16 GB + OOM.

I still believe there's been a regression since we initially tried out the single_tailnet experiment. There we could handle 500 web terminals with ease and only 4 GB of RAM.

@spikecurtis
Contributor

Perhaps we aren't correctly detecting the agents as not-legacy.

I propose we don't chase our tails just yet because I have a PR to remove wsconncache entirely, which moots the legacy vs not legacy question.

@github-actions bot added the stale label on Jul 25, 2024
@github-actions bot closed this as not planned on Aug 1, 2024
3 participants