Potential Memory Leak in 2.3 #10550

Closed
notwedtm opened this issue Nov 6, 2023 · 15 comments · Fixed by #10685
Labels: s1 (Bugs that break core workflows. Only humans may set this.)

notwedtm commented Nov 6, 2023

After upgrading the Coder Helm chart from v2.2 to v2.3 on 10/20, we started seeing significantly increased memory usage (tens of gigabytes), CPU usage (multiple cores), and constant container restarts.

CPU Usage: [screenshot]

Memory Usage: [screenshot]

Container Restarts: [screenshot]

cdr-bot (bot) added the bug label Nov 6, 2023
ericpaulsen (Member)

thank you for submitting this issue @notwedtm - can you confirm the version is 2.3.2?

notwedtm (Author) commented Nov 6, 2023

We have a fuzzy matcher in the Argo app:

repoURL: https://helm.coder.com/v2
chart: coder
targetRevision: ~> 2.3.2

Argo shows that we've been on 2.3.3 for about 7 days now.

ammario (Member) commented Nov 6, 2023

cc @mafredri

mafredri (Member) commented Nov 6, 2023

@notwedtm could you share what settings you are using for Coder? Of particular interest is whether you're using an external or embedded DB, have enabled any experimental Coder settings, are using external provisioners, multiple Coder replicas, etc.

Would you also be able to share some pprof dumps from when you notice high memory usage? (This requires enabling pprof for coder server via --pprof-enable or CODER_PPROF_ENABLE=true.)

Essentially, we'd be interested in seeing the files produced by the following (check /tmp/pprof after it has run):

mkdir /tmp/pprof
for p in allocs heap goroutine; do wget -O /tmp/pprof/$p.gz http://localhost:6060/debug/pprof/$p; done

(The value of http://localhost:6060 will depend on whether or not you've changed the pprof address and/or are running it on the same container as coder.)
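
If you'd like a first look at the dumps yourself before sending them, here is a minimal sketch of inspecting them locally with the standard Go pprof tooling (this assumes Go is installed on your workstation; the paths match the loop above and port 8081 is arbitrary):

# print the top memory consumers recorded in the heap profile
go tool pprof -top /tmp/pprof/heap.gz

# or browse the profile interactively in a web UI at http://localhost:8081
go tool pprof -http=:8081 /tmp/pprof/heap.gz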

notwedtm (Author) commented Nov 6, 2023

We are running with an external DB (RDS Postgres) and with the Kubernetes Helm chart using the following config (sensitive values snipped):

coder:
  labels:
    tags.datadoghq.com/env: ops
    tags.datadoghq.com/service: coder
  podLabels:
    tags.datadoghq.com/env: ops
    tags.datadoghq.com/service: coder
    tags.datadoghq.com/version: 2.3.3
  podAnnotations:
    ad.datadoghq.com/coder.logs: '[{"source": "coder", "service": "coder"}]'
  tls:
    secretNames: 
      - coder-tls
  service:
    externalTrafficPolicy: Local
    sessionAffinity: None
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
      external-dns.alpha.kubernetes.io/hostname: "<< SNIP >>"
  env:
    - name: CODER_PG_CONNECTION_URL
      valueFrom:
        secretKeyRef:
          name: coder-db-credentials
          key: url
    - name: CODER_ACCESS_URL
      value: "<< SNIP >>"
    - name: CODER_REDIRECT_TO_ACCESS_URL
      value: "true"
    - name: CODER_WILDCARD_ACCESS_URL
      value: "<< SNIP >>"
    - name: CODER_OIDC_ISSUER_URL
      value: "<< SNIP >>"
    - name: CODER_OIDC_EMAIL_DOMAIN
      value: "<< SNIP >>"
    - name: CODER_OIDC_SIGN_IN_TEXT
      value: "<< SNIP >>"
    - name: CODER_DISABLE_PASSWORD_AUTH
      value: "true"
    - name: CODER_OIDC_GROUP_AUTO_CREATE
      value: "true"
    - name: CODER_OIDC_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: coder-auth
          key: CLIENT_ID
    - name: CODER_OIDC_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: coder-auth
          key: CLIENT_SECRET
    - name: CODER_GITAUTH_0_ID
      value: "gitlab"
    - name: CODER_GITAUTH_0_TYPE
      value: "gitlab"
    - name: CODER_GITAUTH_0_CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: coder-gitlab-credentials
          key: GITLAB_CLIENT_ID
    - name: CODER_GITAUTH_0_CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: coder-gitlab-credentials
          key: GITLAB_CLIENT_SECRET
  ingress:
    host: "<<SNIP>>"
    tls:
      enable: true
      secretName: coder-tls
      wildcardSecretName: coder-tls
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: << IRSA ROLE >>
  volumes:
    - name: terraformrc
      secret:
        secretName: terraformrc
  volumeMounts:
    - name: terraformrc
      mountPath: /home/coder/.terraformrc
      subPath: .terraformrc
      readOnly: true

I'll work on getting profiles running today and report back!
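
For reference, a minimal sketch of how pprof could be enabled through the same values file, by appending to the env list above (only CODER_PPROF_ENABLE is taken from the earlier suggestion; everything else stays unchanged):

coder:
  env:
    # ... existing env entries from above ...
    - name: CODER_PPROF_ENABLE   # exposes /debug/pprof, on localhost:6060 unless the pprof address is changed
      value: "true"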

matifali added the waiting-for-info label Nov 7, 2023
nnn342 commented Nov 7, 2023

@matifali Would it be possible to get a public key so we can encrypt the profile data before we send it to you?

ericpaulsen (Member)

cc: @mafredri

spikecurtis (Contributor)

-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBGVLHoQBEADRHesBYNKCO1D2RmmRCEp/qRWVHSTZPOLln55+xseh6p29Btgf
e+zRej1kl9Jf9cxo4/ItmPpMB8DzvqFnt6eKUusCiL6Gcjn99snEMxL1KsWGxXXc
YBJogVX07TxxnKG2laFG0xsEe7Qk4HtucirTwT3eCIAGQQ6Vx06LgMFo8Zs/+VUf
kdBZWnEnxOC3m5ABd2uShvuROFqLl/OprhTGBRlYfA+rEa/Ov1j9Jx5XKmm+gzA7
33CSGV/Hi+XdHSCoOGtep6b6f4jG4ytiWG2CXbKM/wCPnwGxqZRJn1w+MVVeF9HE
4OSWwMivPvZAjDf5SaQfMkEGRY6HMr2xI5jdYOh+By1ZFqWg5HuCbCvu6QW+cboj
DhVMrv+i0Tlt4nH0J9QHxkudKZNNGKjADmzfKCHfkWZQeed/Xt4lEXUsEg6bre5v
+nEa8XWPXMY7a05QYN3ZXUtH/7H+AZ6aTaoYBnwKq/uxuyiDPVZBX7f0mx5OzREJ
s9m1YF1GaPU77Oq/5gbccAOlFvdWHGFuqZKJuZoYOaZF/r5veWf0vpQuSfp7V4kS
+gvoDR1uHtGtkYCFy3FVK/eyCFWXu5CDkEa4BFVvGJyEHRSs5J+/SCLw+/RfTv6C
E2p6hYppYqTMJMOcE51m87Rm7xoK7Bf7D39lq7VXFcm5x5aJnivKyMIkVwARAQAB
tClTcGlrZSBDdXJ0aXMgKENvZGVib29rKSA8c3Bpa2VAY29kZXIuY29tPokCVAQT
AQgAPhYhBBEQrpQFSuKBpxYYAMeZ2pG/TwZFBQJlSx6EAhsDBQkNKwtiBQsJCAcC
BhUKCQgLAgQWAgMBAh4BAheAAAoJEMeZ2pG/TwZFxl0QAL+hTEDnRYl4YcfJLnNC
uAfvTW6N2TNyYI2V6tZzCxfZemYvj6UgLSRSmb1FbBJnQe56eLn0rAr4+8XrCyL2
Dlay/1KFFA4OJTSizc3M9XQg6SK7k5/LGdIgjHzRSFhr1nkp3PD3mDM12LQmEsz8
HxZnadazaOsi0R7AY776HvyVw5WTHQK+ch8c/Pvghjp3cFwdH8OPE1l+7518bSwt
WIcbOdX/HkpqZqGtgPnxhpROTcZrWGfMGBbLGWjBQ1rl97vYrmrUUmaqOsvt1FRC
WEjDR7Uoxk186UVovosC+q0aHTnEJvPQB8P4dJA1HdsH3y1Pcpq9FDUaAR2Wl2s3
Z0qeymNDDcWk8SraVb8CDL2drQ/+nrZ1YQfpmxCkQtLzUVdWHvQNfNxT3UoQEe5N
geHqU8DeIIaOdaJLIxVmXw0yU0xrz2I6ytUh2rBoXHlHPm39Kbe1ffvTek5nBL64
tdowc3IqTr5GAf3iiuIjhFP9c3J8x+pHH4l+qpDmHeONq6of+NsspUKZAjca12qD
QByIpuT2C0WMYCu39Wf+ROt76KXSptyleygKRZzAuGK8XL6DbDUwYJKf8LbTh4Gx
lHWRuyQf0wXkFLGEWLMNoUJoPKdUiC3+5q5ue+46pYQ52zA7mUyxuPPVQ2WrBltB
HVCdtd0SVOVJDhEtnYwQPqOluQINBGVLHoQBEADjfGgmBD1IGgzVcX/Vp0QUjeze
qe7MDcJe7zO0UOOWwGNU99jpOU1S01Q6Sa1iNMJdokUxQdJ/wY54HuTVhzQtGHHf
cezEsx5XAMRrYQVXQV/uQDylDVbCmWXUGJo3Tn2441WwChbp7koRa23ij7fXoiPA
GRXhnanzL8gezOUJfmTRfcDUks3Iqznskzi0KBKvDt+Jllu8mEZgYnIkvuyPj72g
Vkz1hbAUatio2GIm7u8eXPKZ898DyZdN5mYFl/ZNP/LVb6dg4rvW0aECNi/Lly4L
D8W7lKaphPQy1JOfKVoO395zdHcygCQO2R/rpv3x8LjKXMRptXDMZiZp/NjTn3xs
j0yMi+KNW9xlKGJl9meLmbgAqVg/gH0xfBE+pQMthWWiDXJV5Ah9DRNxWrsA5YVS
/SxtQSMZK3/BS/2SAFJCpvbTab/ObtV92OflEOiaccMtITojYwP9S3V0o3qbgQ/K
cVfo0g0c7nBQhZZZ1eVk27Oj8/xiC/ixiF6bH6xl3cel0lm49gDCQx+UOrE1so4Z
0l8FC7OJ3tXE0bz2EQbh3b74FOmDGG2HTiGL2IRufzZRQce0AmDlP1ThlEnrclGt
jAcuodkJIn+QCc9pkfg0rKx98mLTewghwnwvw2sTBb0eEAaouOdES0oVEhigJ17Y
krxzD7bfoYcieA79AwARAQABiQI8BBgBCAAmFiEEERCulAVK4oGnFhgAx5nakb9P
BkUFAmVLHoQCGwwFCQ0rC2IACgkQx5nakb9PBkXrPBAAs2ctLPJpWMIRx1YEUnxq
AVhOvjOvNohGXGTcFoJFBNzVi9dQ3qiJ3RRJ4Xia6AYWkz+Dch3yzRT8mDjRCbPc
dc8Q5L9/cPtRA4vMCd3mWEApZYQRXNQ/krLaZujbh0MLGB2lbcSPRYxnzu2oLkxz
dFxTp225b4RLsljWoVK/6LpRbBfEyujPaMhaGaftSsH1S9a9og132+c82uB1jfko
SXAKEwuIltKk6X4ceHqHTWE45XIYqEsJDdI4TaoIrUlV6kOsUrBZxYIUb5nYUfd/
qB9+S7WCwfuLx0w+0kUjJqW7f2CsyJsfQ6Z1WtbeKzVcXwnH7ivNWv78CrrX0unN
twx2eZZi+Vn3zCi7zK4oMZe9Ek3/B7kbMS78uiW8FrAhHhIR0oq33pgs4zsn8P81
m1+tToXY4lEJjscuY8OFkaA12aKOQM1yCiFuYfL0GyZ7W/DngmS4v2Ay4JBTvcty
UHYu70QLlLev1x4hIpaU9kL3jAm/j3icnWY9RZxN+9e9tORBU10BzZilCYatmqWx
T44KPjKvQ03O3AUHOab2TZQI3kim4dp2PnN7b2wvvuPf54wHomR3iuX55Xb8ibze
8BGyuLlHisiISHeqvTrzyGG5c6b7NsAsmN18D9ARewEQ2fPDzo+dNJD1pGD5pIaP
GQLIQEzMohC6K2zMRuwZZ8s=
=nawJ
-----END PGP PUBLIC KEY BLOCK-----

You can email them to (my first name) @coder.com.
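
One possible way to encrypt the dumps with this key before emailing them; a minimal sketch assuming gpg is available and the key block above is saved as coder-pubkey.asc (file names are placeholders):

# import the public key (gpg prints the imported key's ID and UID)
gpg --import coder-pubkey.asc

# bundle the profiles and encrypt them for that key
tar czf pprof.tar.gz /tmp/pprof
gpg --encrypt --armor --recipient "<imported key ID or email>" pprof.tar.gz

# attach the resulting pprof.tar.gz.asc to the email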

nnn342 commented Nov 9, 2023

One thing we see before the pod fails is that it logs coderd: requester is not authorized to access the object with a username that has not interacted with Coder in any way.
It is usually one specific user, but it can be different users, almost like a weighted average.
Is there anything we can do to mitigate the OOM pod restarts in the short term? We upped the memory, but that only helped temporarily.
Use a VM? Downgrade to v2.2? Turn off any services, or force SSH-only use?

matifali removed the waiting-for-info label Nov 9, 2023
mafredri (Member) commented Nov 9, 2023

@nnn342 thanks for the pprof dumps. We looked through them but nothing immediately stood out. Do you happen to know what the memory consumption was at the time when you captured the dumps? If it was a recently started instance, memory usage may not have accumulated (yet).

Would you say that the memory usage accumulates over time, or is it a sudden increase leading to OOM? In the latter case, it may be hard to capture via pprof dumps. For instance, do you have any idea what transpired over the weekend, Oct 28 - 29? It almost seems like a lack of use led to a lot of restarts, suggesting there is more to this than just memory overconsumption.

Is there anything we can do to mitigate the OOM pod restarts in the short term? We upped the memory but that only helped short term.

Can you try disabling the users one by one as you encounter those that "crash the system" (if we assume for a moment that this is the cause), or does the list of "culprits" also contain active users who need to access Coder?

Unfortunately, other than that, I can't immediately think of any tips to stop this behavior without understanding why it may be happening. Downgrading to v2.2.1 is probably your best bet for now. Another option would be to help us narrow this down by trying earlier v2.3.x versions. Knowing which version introduced it would be helpful, but this is not a great way to treat a production system, so the viability largely depends on whether you can reproduce this outside prod.
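
If you do go the downgrade/bisect route, a minimal sketch of pinning the chart in the Argo app shared earlier, replacing the fuzzy ~> 2.3.2 matcher (2.2.1 is just the example target here):

repoURL: https://helm.coder.com/v2
chart: coder
targetRevision: 2.2.1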

One thing we see before the pod fails is it will say coderd: requester is not authorized to access the object and it will give a username that is has not interacted with Coder in any way.

This sounds very peculiar. Out of curiosity, how are you determining that this user doesn't interact with Coder?

@Emyrk wrt this error message/behavior, do you know if there were any auth/RBAC changes between v2.2 and v2.3.3 that might be responsible or could cause memory leaks? Any caching changes that might be accessing the wrong entry, etc.?

Emyrk (Member) commented Nov 9, 2023

@mafredri there were no big changes to the authz packages, but how we interact with them is always changing. Auth changes were definitely added; I would need to walk through them to see if there were any mistakes like this. I'll try to find some time.

The only caveat with investigating that route is that the requester is not authorized to access the object log is very common. For example, our dashboard always attempts to load /stats and /health, which will 403/404 for regular users. This is not correct behavior, but the FE handles it gracefully, so we have not fixed it yet.

So I would suspect that any member hitting the dashboard generates these logs. What would help a bit is sharing the log fields attached to that message, specifically roles, route, action, and object (omit the ID field). Those should not be private or confidential in any way, but they would indicate which route is generating the log and where to look on our end.
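
A minimal sketch of pulling those log lines out of the deployment (the coder namespace and deployment name are assumptions; adjust to your setup):

kubectl -n coder logs deployment/coder --since=24h --timestamps \
  | grep "requester is not authorized to access the object"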

nnn342 commented Nov 10, 2023

This happens to a few teammates, but one teammate in particular has it happen to their username most often.

coderd: requester is not authorized to access the object internal_error= request_id=REMOVED roles="[member organization-member:REMOVED]" actor_id=REMOVED actor_name=commonuser scope=all route=/api/v2/deployment/config action=read object={"id":"","owner":"","org_owner":"","type":"deployment_config","acl_user_list":null,"acl_group_list":null}

Emyrk (Member) commented Nov 13, 2023

That route is not available to members, so that failure is expected. We should fix the FE to not request this route if it is currently doing so.

spikecurtis added the s1 label Nov 13, 2023
mafredri self-assigned this Nov 13, 2023
mafredri (Member)

We believe we have identified the cause and implemented a fix in #10685. The behavior of the identified bug matches the experiences described in this issue. The PR will be merged tomorrow and we will make a patch release soon thereafter.

PS. The auth issue described here is most likely not related to the memory leak leading to OOM kill. If that continues to be a problem after the patch, I would recommend opening a separate issue for it.

iml-miles

Just a quick note here: after pushing 2.4.0 out live at 6 PM last night, we have seen no memory spikes!

[screenshot]
