Potential Memory Leak in 2.3 #10550
Comments
Thank you for submitting this issue, @notwedtm - can you confirm which version you are running?
We have a fuzzy matcher in the Argo app; Argo shows which version we've been running.
cc @mafredri
@notwedtm could you share what settings you are using for Coder? Of particular interest: whether you're using an external or embedded DB, have enabled experimental Coder settings, are using external provisioners, Coder replicas, etc. Would you also be able to share some pprof dumps from when you notice high memory usage? (This requires enabling pprof on the server.) Essentially, we'd be interested in seeing the files produced by the pprof endpoints.
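For reference, once pprof is enabled the profiles can be pulled straight from the pprof HTTP endpoints. Below is a minimal sketch in Go, assuming the standard net/http/pprof handler; the address 127.0.0.1:6060 is an assumption, not a Coder-specific default, so substitute whatever pprof address your deployment exposes.

```go
// Hypothetical helper, not part of Coder: download a few pprof profiles
// from a Go process that exposes net/http/pprof and save them to disk.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumed pprof listen address; adjust for your deployment.
	base := "http://127.0.0.1:6060/debug/pprof"

	for _, profile := range []string{"heap", "goroutine", "allocs"} {
		resp, err := http.Get(fmt.Sprintf("%s/%s", base, profile))
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetch", profile, "failed:", err)
			continue
		}
		out, err := os.Create(profile + ".pprof")
		if err == nil {
			_, err = io.Copy(out, resp.Body)
			out.Close()
		}
		resp.Body.Close()
		if err != nil {
			fmt.Fprintln(os.Stderr, "save", profile, "failed:", err)
		}
	}
}
```

The resulting files can then be inspected locally with `go tool pprof <file>`.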
We are running an external DB (RDS Postgres) and deploying with the Kubernetes Helm chart (config snipped for sensitive values).
I'll work on getting profiles running today and report back!
@matifali Would it be possible to get a public key so we can encrypt the profile data before we send it to you? |
cc: @mafredri
You can email it to (my first name)@coder.com.
One thing we see before the pod fails is that it logs an authorization error mentioning one of our users.
@nnn342 thanks for the pprof dumps. We looked through them but nothing immediately stood out. Do you happen to know what the memory consumption was at the time you captured the dumps? If it was a recently started instance, memory usage may not have accumulated (yet). Would you say that memory usage accumulates over time, or is it a sudden increase leading to OOM? In the latter case, it may be hard to capture via pprof dumps. For instance, do you have any idea what transpired over the weekend, Oct 28 - 29? It almost seems like a lack of use led to a lot of restarts, suggesting there's more than a memory-overconsumption component to this.
Can you try disabling the users one by one as you encounter those that "crash the system" (if we assume for a moment that this is the cause), or does the list of "culprits" also contain active users that need to access Coder? Unfortunately, other than that, I can't immediately think of any tips to stop this behavior without understanding why it may be happening. Downgrading to v2.2.1 is probably your best bet for now. Another option would be to help us narrow this down by trying earlier v2.3.x versions. Knowing which version introduced the issue would be helpful, but this is not a great way to treat a production system, so the viability largely depends on whether you can reproduce this outside prod.
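To help answer the gradual-accumulation vs. sudden-spike question, one option is to sample the resident set size of the coder process on an interval and look at the trend afterwards. A generic sketch (not Coder tooling), assuming a Linux node where /proc is readable and the target PID is passed as the only argument:

```go
// Generic sketch: poll a process's VmRSS from /proc so the growth
// pattern (gradual climb vs. sudden jump) is visible over time.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

func rssKB(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: rsswatch <pid>")
		os.Exit(1)
	}
	for {
		rss, err := rssKB(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "read failed:", err)
		} else {
			fmt.Printf("%s rss=%s\n", time.Now().Format(time.RFC3339), rss)
		}
		time.Sleep(30 * time.Second)
	}
}
```

A steadily climbing RSS suggests a leak that pprof heap diffs should reveal, while a flat line followed by a jump points at a single expensive operation that a one-off dump is unlikely to catch.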
This sounds very peculiar. Out of curiosity, how are you determining that this user doesn't interact with Coder? @Emyrk, regarding this error message/behavior, do you know if there were any auth/RBAC changes between v2.2 and v2.3.3 that might be responsible or cause memory leaks? Any caching changes that might be accessing the wrong entry, etc.?
@mafredri no big changes to the authz packages, but how we interact with them is always changing. Auth changes were definitely added; I would need to walk through them to see if there were any mistakes like this. I'll try to find some time. The only thing about investigating that route is that any member hitting the dashboard would, I suspect, generate these logs. What can help a little bit is adding the log fields attached to that message.
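As an illustration of what attaching fields to that log line could look like, here is a generic sketch using the standard library's log/slog; the field names and the logging package are assumptions for the example, not Coder's actual implementation.

```go
// Generic illustration, not Coder's code: attach structured fields to an
// authorization-denied log line so reports can be traced to a user,
// route, and action.
package main

import (
	"log/slog"
	"os"
)

func logAuthzDenied(logger *slog.Logger, username, route, action string) {
	logger.Warn("authorization check failed",
		slog.String("username", username), // hypothetical field names
		slog.String("route", route),
		slog.String("action", action),
	)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	// Made-up example values.
	logAuthzDenied(logger, "example-user", "/api/v2/example", "read")
}
```

With fields like these in the output, it becomes easier to tell whether the denied requests all come from the same route (expected dashboard noise) or from something unusual.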
This happens to a few teammates, but one teammate in particular sees it against their username most often.
That route is not available to members, so that failure is expected. We should fix the FE to not request this route if it is currently doing so.
We believe we have identified the cause and implemented a fix in #10685. The behavior of the identified bug matches the experiences described in this issue. The PR will be merged tomorrow and we will make a patch release soon thereafter. PS. The auth issue described here is most likely not related to the memory leak leading to OOM kill. If that continues to be a problem after the patch, I would recommend opening a separate issue for it.
Original issue description:
After upgrading the Coder Helm chart from v2.2 to v2.3 on 10/20, we started seeing significant memory usage (tens of gigabytes), high CPU usage (multiple cores), and constant container restarts.
CPU usage: [graph image]
Memory usage: [graph image]
Container restarts: [graph image]