Potential Memory Leak in 2.3 #10550

Closed
notwedtm opened this issue Nov 6, 2023 · 15 comments · Fixed by #10685
Labels: s1 (Bugs that break core workflows. Only humans may set this.)

notwedtm commented Nov 6, 2023

After upgrading the Coder Helm chart from v2.2 to v2.3 on 10/20, we started seeing significantly increased memory usage (tens of gigabytes), CPU usage (multiple cores), and constant container restarts.

CPU Usage: [screenshot]

Memory Usage: [screenshot]

Container Restarts: [screenshot]

cdr-bot (bot) added the bug label Nov 6, 2023
ericpaulsen (Member)

thank you for submitting this issue @notwedtm - can you confirm the version is 2.3.2?

notwedtm (Author) commented Nov 6, 2023

We have a fuzzy matcher in the Argo app:

repoURL: https://helm.coder.com/v2
chart: coder
targetRevision: ~> 2.3.2

Argo shows that we've been on 2.3.3 for about 7 days now.

ammario (Member) commented Nov 6, 2023

cc @mafredri

mafredri (Member) commented Nov 6, 2023

@notwedtm could you share what settings you are using for Coder? Of particular interest is whether you're using an external or embedded DB, have enabled any experimental Coder settings, are using external provisioners, multiple Coder replicas, etc.

Would you also be able to share some pprof dumps from when you notice high memory usage? (This requires enabling pprof for coder server via --pprof-enable or CODER_PPROF_ENABLE=true.)

Essentially, we'd be interested in seeing the files produced by the following (check /tmp/pprof after it has run):

mkdir /tmp/pprof
for p in allocs heap goroutine; do wget -O /tmp/pprof/$p.gz http://localhost:6060/debug/pprof/$p; done

(The value of http://localhost:6060 will depend on whether or not you've changed the pprof address and/or are running it on the same container as coder.)
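
If you'd like a first look at the dumps yourself before sending them, here is a minimal sketch of inspecting them locally with the standard Go pprof tooling (this assumes Go is installed on your workstation; the paths match the loop above and port 8081 is arbitrary):

# print the top memory consumers recorded in the heap profile
go tool pprof -top /tmp/pprof/heap.gz

# or browse the profile interactively in a web UI at http://localhost:8081
go tool pprof -http=:8081 /tmp/pprof/heap.gz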

notwedtm (Author) commented Nov 6, 2023

We are running with an external DB (RDS Postgres) and with the Kubernetes Helm chart using the following config (sensitive values snipped):

coder:
  labels:
    tags.datadoghq.com/env: ops
    tags.datadoghq.com/service: coder
  podLabels:
    tags.datadoghq.com/env: ops
    tags.datadoghq.com/service: coder
    tags.datadoghq.com/version: 2.3.3
  podAnnotations:
    ad.datadoghq.com/coder.logs: '[{"source": "coder", "service": "coder"}]'
  tls:
    secretNames: 
      - coder-tls
  service:
    externalTrafficPolicy: Local
    sessionAffinity: None
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
      external-dns.alpha.kubernetes.io/hostname: "<< SNIP >>"
  env:
    - name: CODER_PG_CONNECTION_URL
      valueFrom:
        secretKeyRef:
          name: coder-db-credentials
          key: url
    - name: CODER_ACCESS_URL
      value: "<< SNIP >>"
    - name: CODER_REDIRECT_TO_ACCESS_URL
      value: "true"
    - name: CODER_WILDCARD_ACCESS_URL
      value: "<< SNIP >>"
    - name: CODER_OIDC_ISSUER_URL
      value: "<< SNIP >>"
    - name: CODER_OIDC_EMAIL_DOMAIN
      value: "<< SNIP >>"
    - name: CODER_OIDC_SIGN_IN_TEXT
      value: "<< SNIP >>"
    - name: CODER_DISABLE_PASSWORD_AUTH
      value: "true"
    - name: CODER_OIDC_GROUP_AUTO_CREATE
      value: "true"
    - name: CODER_OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: coder-auth
          key: CLIENT_ID
    - name: CODER_OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: coder-auth
          key: CLIENT_SECRET
    - name: CODER_GITAUTH_0_ID
      value: "gitlab"
    - name: CODER_GITAUTH_0_TYPE
      value: "gitlab"
    - name: CODER_GITAUTH_0_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: coder-gitlab-credentials
          key: GITLAB_CLIENT_ID
    - name: CODER_GITAUTH_0_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: coder-gitlab-credentials
          key: GITLAB_CLIENT_SECRET
  ingress:
    host: "<<SNIP>>"
    tls:
      enable: true
      secretName: coder-tls
      wildcardSecretName: coder-tls
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: << IRSA ROLE >>
  volumes:
    - name: terraformrc
      secret:
        secretName: terraformrc
  volumeMounts:
    - name: terraformrc
      mountPath: /home/coder/.terraformrc
      subPath: .terraformrc
      readOnly: true

I'll work on getting profiles running today and report back!
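
For reference, a minimal sketch of how pprof could be enabled through the same values file, by appending to the env list above (only CODER_PPROF_ENABLE is taken from the earlier suggestion; everything else stays unchanged):

coder:
  env:
    # ... existing env entries from above ...
    - name: CODER_PPROF_ENABLE   # exposes /debug/pprof, on localhost:6060 unless the pprof address is changed
      value: "true"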

matifali added the waiting-for-info label Nov 7, 2023
nnn342 commented Nov 7, 2023

@matifali Would it be possible to get a public key so we can encrypt the profile data before we send it to you?

ericpaulsen (Member)

cc: @mafredri

spikecurtis (Contributor)

-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGVLHoQBEADRHesBYNKCO1D2RmmRCEp/qRWVHSTZPOLln55+xseh6p29Btgf
e+zRej1kl9Jf9cxo4/ItmPpMB8DzvqFnt6eKUusCiL6Gcjn99snEMxL1KsWGxXXc
YBJogVX07TxxnKG2laFG0xsEe7Qk4HtucirTwT3eCIAGQQ6Vx06LgMFo8Zs/+VUf
kdBZWnEnxOC3m5ABd2uShvuROFqLl/OprhTGBRlYfA+rEa/Ov1j9Jx5XKmm+gzA7
33CSGV/Hi+XdHSCoOGtep6b6f4jG4ytiWG2CXbKM/wCPnwGxqZRJn1w+MVVeF9HE
4OSWwMivPvZAjDf5SaQfMkEGRY6HMr2xI5jdYOh+By1ZFqWg5HuCbCvu6QW+cboj
DhVMrv+i0Tlt4nH0J9QHxkudKZNNGKjADmzfKCHfkWZQeed/Xt4lEXUsEg6bre5v
+nEa8XWPXMY7a05QYN3ZXUtH/7H+AZ6aTaoYBnwKq/uxuyiDPVZBX7f0mx5OzREJ
s9m1YF1GaPU77Oq/5gbccAOlFvdWHGFuqZKJuZoYOaZF/r5veWf0vpQuSfp7V4kS
+gvoDR1uHtGtkYCFy3FVK/eyCFWXu5CDkEa4BFVvGJyEHRSs5J+/SCLw+/RfTv6C
E2p6hYppYqTMJMOcE51m87Rm7xoK7Bf7D39lq7VXFcm5x5aJnivKyMIkVwARAQAB
tClTcGlrZSBDdXJ0aXMgKENvZGVib29rKSA8c3Bpa2VAY29kZXIuY29tPokCVAQT
AQgAPhYhBBEQrpQFSuKBpxYYAMeZ2pG/TwZFBQJlSx6EAhsDBQkNKwtiBQsJCAcC
BhUKCQgLAgQWAgMBAh4BAheAAAoJEMeZ2pG/TwZFxl0QAL+hTEDnRYl4YcfJLnNC
uAfvTW6N2TNyYI2V6tZzCxfZemYvj6UgLSRSmb1FbBJnQe56eLn0rAr4+8XrCyL2
Dlay/1KFFA4OJTSizc3M9XQg6SK7k5/LGdIgjHzRSFhr1nkp3PD3mDM12LQmEsz8
HxZnadazaOsi0R7AY776HvyVw5WTHQK+ch8c/Pvghjp3cFwdH8OPE1l+7518bSwt
WIcbOdX/HkpqZqGtgPnxhpROTcZrWGfMGBbLGWjBQ1rl97vYrmrUUmaqOsvt1FRC
WEjDR7Uoxk186UVovosC+q0aHTnEJvPQB8P4dJA1HdsH3y1Pcpq9FDUaAR2Wl2s3
Z0qeymNDDcWk8SraVb8CDL2drQ/+nrZ1YQfpmxCkQtLzUVdWHvQNfNxT3UoQEe5N
geHqU8DeIIaOdaJLIxVmXw0yU0xrz2I6ytUh2rBoXHlHPm39Kbe1ffvTek5nBL64
tdowc3IqTr5GAf3iiuIjhFP9c3J8x+pHH4l+qpDmHeONq6of+NsspUKZAjca12qD
QByIpuT2C0WMYCu39Wf+ROt76KXSptyleygKRZzAuGK8XL6DbDUwYJKf8LbTh4Gx
lHWRuyQf0wXkFLGEWLMNoUJoPKdUiC3+5q5ue+46pYQ52zA7mUyxuPPVQ2WrBltB
HVCdtd0SVOVJDhEtnYwQPqOluQINBGVLHoQBEADjfGgmBD1IGgzVcX/Vp0QUjeze
qe7MDcJe7zO0UOOWwGNU99jpOU1S01Q6Sa1iNMJdokUxQdJ/wY54HuTVhzQtGHHf
cezEsx5XAMRrYQVXQV/uQDylDVbCmWXUGJo3Tn2441WwChbp7koRa23ij7fXoiPA
GRXhnanzL8gezOUJfmTRfcDUks3Iqznskzi0KBKvDt+Jllu8mEZgYnIkvuyPj72g
Vkz1hbAUatio2GIm7u8eXPKZ898DyZdN5mYFl/ZNP/LVb6dg4rvW0aECNi/Lly4L
D8W7lKaphPQy1JOfKVoO395zdHcygCQO2R/rpv3x8LjKXMRptXDMZiZp/NjTn3xs
j0yMi+KNW9xlKGJl9meLmbgAqVg/gH0xfBE+pQMthWWiDXJV5Ah9DRNxWrsA5YVS
/SxtQSMZK3/BS/2SAFJCpvbTab/ObtV92OflEOiaccMtITojYwP9S3V0o3qbgQ/K
cVfo0g0c7nBQhZZZ1eVk27Oj8/xiC/ixiF6bH6xl3cel0lm49gDCQx+UOrE1so4Z
0l8FC7OJ3tXE0bz2EQbh3b74FOmDGG2HTiGL2IRufzZRQce0AmDlP1ThlEnrclGt
jAcuodkJIn+QCc9pkfg0rKx98mLTewghwnwvw2sTBb0eEAaouOdES0oVEhigJ17Y
krxzD7bfoYcieA79AwARAQABiQI8BBgBCAAmFiEEERCulAVK4oGnFhgAx5nakb9P
BkUFAmVLHoQCGwwFCQ0rC2IACgkQx5nakb9PBkXrPBAAs2ctLPJpWMIRx1YEUnxq
AVhOvjOvNohGXGTcFoJFBNzVi9dQ3qiJ3RRJ4Xia6AYWkz+Dch3yzRT8mDjRCbPc
dc8Q5L9/cPtRA4vMCd3mWEApZYQRXNQ/krLaZujbh0MLGB2lbcSPRYxnzu2oLkxz
dFxTp225b4RLsljWoVK/6LpRbBfEyujPaMhaGaftSsH1S9a9og132+c82uB1jfko
SXAKEwuIltKk6X4ceHqHTWE45XIYqEsJDdI4TaoIrUlV6kOsUrBZxYIUb5nYUfd/
qB9+S7WCwfuLx0w+0kUjJqW7f2CsyJsfQ6Z1WtbeKzVcXwnH7ivNWv78CrrX0unN
twx2eZZi+Vn3zCi7zK4oMZe9Ek3/B7kbMS78uiW8FrAhHhIR0oq33pgs4zsn8P81
m1+tToXY4lEJjscuY8OFkaA12aKOQM1yCiFuYfL0GyZ7W/DngmS4v2Ay4JBTvcty
UHYu70QLlLev1x4hIpaU9kL3jAm/j3icnWY9RZxN+9e9tORBU10BzZilCYatmqWx
T44KPjKvQ03O3AUHOab2TZQI3kim4dp2PnN7b2wvvuPf54wHomR3iuX55Xb8ibze
8BGyuLlHisiISHeqvTrzyGG5c6b7NsAsmN18D9ARewEQ2fPDzo+dNJD1pGD5pIaP
GQLIQEzMohC6K2zMRuwZZ8s=
=nawJ
-----END PGP PUBLIC KEY BLOCK-----

You can email them to (my first name) @coder.com.
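
One possible way to encrypt the dumps with this key before emailing them; a minimal sketch assuming gpg is available and the key block above is saved as coder-pubkey.asc (file names are placeholders):

# import the public key (gpg prints the imported key's ID and UID)
gpg --import coder-pubkey.asc

# bundle the profiles and encrypt them for that key
tar czf pprof.tar.gz /tmp/pprof
gpg --encrypt --armor --recipient "<imported key ID or email>" pprof.tar.gz

# attach the resulting pprof.tar.gz.asc to the email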

nnn342 commented Nov 9, 2023

One thing we see before the pod fails is that it logs coderd: requester is not authorized to access the object with a username that has not interacted with Coder in any way.
It is usually one specific user, but it can be different users, almost like a weighted average.
Is there anything we can do to mitigate the OOM pod restarts in the short term? We upped the memory, but that only helped temporarily.
Use a VM? Downgrade to v2.2? Turn off any services, or force SSH-only use?

matifali removed the waiting-for-info label Nov 9, 2023
mafredri (Member) commented Nov 9, 2023

@nnn342 thanks for the pprof dumps. We looked through them but nothing immediately stood out. Do you happen to know what the memory consumption was at the time when you captured the dumps? If it was a recently started instance, memory usage may not have accumulated (yet).

Would you say that the memory usage accumulates over time, or is it a sudden increase leading to OOM? In the latter case, it may be hard to capture via pprof dumps. For instance, do you have any idea what transpired over the weekend, Oct 28 - 29? It almost seems like a lack of use led to a lot of restarts, suggesting there is more to this than just memory overconsumption.

Is there anything we can do to mitigate the OOM pod restarts in the short term? We upped the memory but that only helped short term.

Can you try disabling the users one by one as you encounter those that "crash the system" (if we assume for a moment that this is the cause), or does the list of "culprits" also contain active users who need to access Coder?

Unfortunately, other than that, I can't immediately think of any tips to stop this behavior without understanding why it may be happening. Downgrading to v2.2.1 is probably your best bet for now. Another option would be to help us narrow this down by trying earlier v2.3.x versions. Knowing which version introduced it would be helpful, but this is not a great way to treat a production system, so the viability largely depends on whether you can reproduce this outside prod.
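
If you do go the downgrade/bisect route, a minimal sketch of pinning the chart in the Argo app shared earlier, replacing the fuzzy ~> 2.3.2 matcher (2.2.1 is just the example target here):

repoURL: https://helm.coder.com/v2
chart: coder
targetRevision: 2.2.1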

One thing we see before the pod fails is it will say coderd: requester is not authorized to access the object and it will give a username that is has not interacted with Coder in any way.

This sounds very peculiar. Out of curiosity, how are you determining that this user doesn't interact with Coder?

@Emyrk wrt this error message/behavior, do you know if there were any auth/RBAC changes between v2.2 and v2.3.3 that might be responsible or could cause memory leaks? Any caching changes that might be accessing the wrong entry, etc.?

Emyrk (Member) commented Nov 9, 2023

@mafredri there were no big changes to the authz packages, but how we interact with them is always changing. Auth changes were definitely added; I would need to walk through them to see if there were any mistakes like this. I'll try to find some time.

The only caveat with investigating that route is that the requester is not authorized to access the object log is very common. For example, our dashboard always attempts to load /stats and /health, which will 403/404 for regular users. This is not correct behavior, but the FE handles it gracefully, so we have not fixed it yet.

So I would suspect that any member hitting the dashboard generates these logs. What would help a bit is sharing the log fields attached to that message, specifically roles, route, action, and object (omit the ID field). Those should not be private or confidential in any way, but they would indicate which route is generating the log and where to look on our end.
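
A minimal sketch of pulling those log lines out of the deployment (the coder namespace and deployment name are assumptions; adjust to your setup):

kubectl -n coder logs deployment/coder --since=24h --timestamps \
  | grep "requester is not authorized to access the object"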

nnn342 commented Nov 10, 2023

This happens to a few teammates, but one teammate in particular has it happen to their username most often.

coderd: requester is not authorized to access the object internal_error= request_id=REMOVED roles="[member organization-member:REMOVED]" actor_id=REMOVED actor_name=commonuser scope=all route=/api/v2/deployment/config action=read object={"id":"","owner":"","org_owner":"","type":"deployment_config","acl_user_list":null,"acl_group_list":null}

Emyrk (Member) commented Nov 13, 2023

That route is not available to members, so that failure is expected. We should fix the FE to not request this route if it is currently doing so.

spikecurtis added the s1 label Nov 13, 2023
mafredri self-assigned this Nov 13, 2023
mafredri (Member)

We believe we have identified the cause and implemented a fix in #10685. The behavior of the identified bug matches the experiences described in this issue. The PR will be merged tomorrow and we will make a patch release soon thereafter.

PS. The auth issue described here is most likely not related to the memory leak leading to OOM kill. If that continues to be a problem after the patch, I would recommend opening a separate issue for it.

iml-miles

Just a quick note here: after pushing 2.4.0 out live at 6 PM last night, we have seen no memory spikes!

[screenshot]
