UI memory leak on a workspace detail page #15921
Comments
I managed to capture a heap profile, but it's too large to send here |
@raphaelfff What is your Coder version? |
2.18.1 |
this seems to be happening on one specific workspace only, other workspaces are fine |
Is there anything specific on that workspace? Could you share a template that can reproduce the workspace causing memory leak? |
other ws with same template work fine |
It almost sounds like something derived from the telemetry is causing this. (Could it be the build timeline? It's a ws created under 2.17.) |
I've been having this issue too, and it seems to happen more often (or perhaps faster?) on workspaces that have been running longer or have more logs in their history. It just happened to me (page froze, eventually "Aw, snap!" page) on a workspace that had been up for 180h, I restarted it, and now the page is fairly snappy. |
I also had one workspace (the only one I use actively) that crashed my Chrome browser every 2 minutes. I restarted everything (Coder server and workspace) and now it works again. I did not see this in Firefox before, but I needed to switch to Chrome for a project. Coder v2.18.1. But even after that, loading logs crashes the window with "Out of Memory". |
That returned instantly for me, I confirmed that this workspace is currently experiencing the issue before testing. I'm not actively using it at the moment, so I'm happy to do whatever testing is needed.
|
I am not experiencing it anymore (after workspace update) and it was my only workspace, but here are my timings anyway: {
"provisioner_timings": [
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:22.799815Z",
"ended_at": "2025-01-02T17:25:28.292937Z",
"stage": "init",
"source": "terraform",
"action": "initializing terraform",
"resource": "state file"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.700169Z",
"ended_at": "2025-01-02T17:25:28.702566Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_provisioner.me"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.700267Z",
"ended_at": "2025-01-02T17:25:28.702558Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_workspace_owner.me"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.70247Z",
"ended_at": "2025-01-02T17:25:28.704096Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_workspace.me"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.706564Z",
"ended_at": "2025-01-02T17:25:28.709574Z",
"stage": "plan",
"source": "coder",
"action": "state refresh",
"resource": "coder_agent.main"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.706839Z",
"ended_at": "2025-01-02T17:25:28.726667Z",
"stage": "plan",
"source": "docker",
"action": "state refresh",
"resource": "docker_image.main"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.717254Z",
"ended_at": "2025-01-02T17:25:28.718379Z",
"stage": "plan",
"source": "coder",
"action": "state refresh",
"resource": "coder_app.code-server"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.738701Z",
"ended_at": "2025-01-02T17:25:28.879501Z",
"stage": "plan",
"source": "docker",
"action": "state refresh",
"resource": "docker_container.workspace[0]"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:28.977971Z",
"ended_at": "2025-01-02T17:25:29.427195Z",
"stage": "graph",
"source": "terraform",
"action": "building terraform dependency graph",
"resource": "state file"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:29.731249Z",
"ended_at": "2025-01-02T17:25:29.732097Z",
"stage": "apply",
"source": "coder",
"action": "delete",
"resource": "coder_app.code-server"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:29.743888Z",
"ended_at": "2025-01-02T17:25:29.745745Z",
"stage": "apply",
"source": "coder",
"action": "create",
"resource": "coder_app.code-server"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:29.746273Z",
"ended_at": "2025-01-02T17:25:31.354019Z",
"stage": "apply",
"source": "docker",
"action": "delete",
"resource": "docker_container.workspace[0]"
},
{
"job_id": "dd7af91f-7908-498f-a1d0-42af56de1d06",
"started_at": "2025-01-02T17:25:31.401457Z",
"ended_at": "2025-01-02T17:25:31.998322Z",
"stage": "apply",
"source": "docker",
"action": "create",
"resource": "docker_container.workspace[0]"
}
],
"agent_script_timings": [
{
"started_at": "2025-01-02T17:25:32.914668Z",
"ended_at": "2025-01-02T17:25:37.055461Z",
"exit_code": 0,
"stage": "start",
"status": "ok",
"display_name": "Startup Script",
"workspace_agent_id": "eb7fcbb7-4007-4b49-9e3d-019e645d3c03",
"workspace_agent_name": "main"
}
],
"agent_connection_timings": [
{
"started_at": "2025-01-02T17:25:32.612251Z",
"ended_at": "2025-01-02T17:25:32.858563Z",
"stage": "connect",
"workspace_agent_id": "eb7fcbb7-4007-4b49-9e3d-019e645d3c03",
"workspace_agent_name": "main"
}
]
} |
Thank you, @mcm and @archef2000—this is super helpful! I’ll use this data to mock the API endpoint and see what happens. |
@BrunoQuaresma I just joined the Coder Discord, feel free to reach out there if you need anything as well. My coder instance is very much non-production so there's not really anything sensitive in there, and I can run live tests etc. |
That's what I get on the broken ws:
|
@raphaelfff, was your workspace running when you got the response? I’m assuming the issue occurs when the workspace is up and running, correct? I’m asking because, typically, when a workspace is running, it returns provisioner timings, but in the response you shared, that field is empty. |
Workspace is running, but it was created ages ago, before the timeline stuff was in place |
Don't know if it helps, but here are my timings again after running it for some time and the code-server browser page crashing in under a minute. {
"provisioner_timings": [
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:14.819777Z",
"ended_at": "2025-01-05T12:47:18.654211Z",
"stage": "init",
"source": "terraform",
"action": "initializing terraform",
"resource": "state file"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.003863Z",
"ended_at": "2025-01-05T12:47:19.004863Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_workspace.me"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.004077Z",
"ended_at": "2025-01-05T12:47:19.004819Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_provisioner.me"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.004141Z",
"ended_at": "2025-01-05T12:47:19.004989Z",
"stage": "plan",
"source": "coder",
"action": "read",
"resource": "data.coder_workspace_owner.me"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.00933Z",
"ended_at": "2025-01-05T12:47:19.030038Z",
"stage": "plan",
"source": "docker",
"action": "state refresh",
"resource": "docker_image.main"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.01011Z",
"ended_at": "2025-01-05T12:47:19.013026Z",
"stage": "plan",
"source": "coder",
"action": "state refresh",
"resource": "coder_agent.main"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.019742Z",
"ended_at": "2025-01-05T12:47:19.020578Z",
"stage": "plan",
"source": "coder",
"action": "state refresh",
"resource": "coder_app.code-server"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.056489Z",
"ended_at": "2025-01-05T12:47:19.494507Z",
"stage": "graph",
"source": "terraform",
"action": "building terraform dependency graph",
"resource": "state file"
},
{
"job_id": "49fe19b4-19e9-4320-8e17-1a63164453da",
"started_at": "2025-01-05T12:47:19.815179Z",
"ended_at": "2025-01-05T12:47:20.238378Z",
"stage": "apply",
"source": "docker",
"action": "create",
"resource": "docker_container.workspace[0]"
}
],
"agent_script_timings": [],
"agent_connection_timings": [
{
"started_at": "2025-01-05T12:47:20.782132Z",
"ended_at": "2025-01-05T12:47:21.05562Z",
"stage": "connect",
"workspace_agent_id": "27941bd8-2f3b-4c0a-ad1d-46ea90cca242",
"workspace_agent_name": "main"
}
]
} |
@raphaelfff @mcm @archef2000 Would it be possible for one of you to add me as a user in your deployment so I can debug this issue directly? I’ve been trying to replicate it but haven’t had any success so far 😞. |
@BrunoQuaresma I'm unable to add you to the deployment/workspace, but I'm happy to hop on a Zoom call to debug. What's your email so I can schedule a call? |
@BrunoQuaresma yeah absolutely, i am |
Hey @raphaelfff, sorry for the delay. I’m going to check with @mcm to see if I can directly access the deployment as a user and try to figure out the problem there. If I can’t resolve it, I’ll schedule a call with you for sure. Does that work for you? |
I'm afraid this issue needs to be reopened. I just upgraded to 2.18.5, and the issue still arises. |
Thank you @raphaelfff. I will let @BrunoQuaresma look into this. |
cc: @DanielleMaywood |
I took some more snapshots, here are a couple findings:
Could it be that the timeline view becomes stupidly long and tries to render all ticks for the duration (21 days in that case)? |
@raphaelfff could you please share what your network request for this endpoint looks like? 🙏 #15921 (comment) |
I already did: #15921 (comment) |
@raphaelfff I mean, after updating to the latest Coder version. |
|
What's happening is, for some reason, the agent connection timings are returning a very old date. So, when the component tries to calculate the range and ticks for the chart, it blows up because there are too many ticks. Since the component expects the range to be in milliseconds, not days, it generates millions of ticks, causing a memory explosion. @DanielleMaywood I think we talked about this, but I forgot—why are the agent connection timings returning a very old date compared to the dates returned by the agent script timings? |
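To illustrate the tick explosion described above, here is a minimal TypeScript sketch. It is not the actual Coder chart code: `countTicks` and the 1 ms tick interval are assumptions, chosen only to show the order of magnitude when a range measured in days is ticked at a millisecond scale. The 21-day-old start date is likewise an assumed example, not a value from a real response.

```typescript
// Hypothetical helper, not the real chart component: counts how many ticks
// a chart would draw for a time range at a fixed tick interval.
function countTicks(
  startedAt: string,
  endedAt: string,
  tickIntervalMs: number,
): number {
  const rangeMs = Date.parse(endedAt) - Date.parse(startedAt);
  if (rangeMs <= 0 || tickIntervalMs <= 0) {
    return 0;
  }
  return Math.ceil(rangeMs / tickIntervalMs);
}

// An agent connection timing shared in this thread: roughly a quarter second.
const normalTicks = countTicks(
  "2025-01-02T17:25:32.612251Z",
  "2025-01-02T17:25:32.858563Z",
  1, // assumed tick interval of 1 ms
);

// Assumed example: same end time, but a start date 21 days earlier.
const staleTicks = countTicks(
  "2024-12-12T17:25:32.612251Z",
  "2025-01-02T17:25:32.858563Z",
  1,
);

console.log({ normalTicks, staleTicks });
```

With these inputs the first range needs a few hundred ticks, while the second needs well over a billion, which is enough to exhaust the memory of a browser tab if every tick is rendered.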
The problem also affects the code-server running on a workspace: when using it, it also crashes after some time. |
I think I have an idea of what’s going on. This happens for workspaces with non-ephemeral resources like storage. So, if the storage is created today and a workspace build is triggered in one month, the timing returned will point one month back, creating a large time range that breaks the UI. @DanielleMaywood @dannykopping Should we just ignore these kinds of resources when returning the build timings? |
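As a rough sketch of the hypothesis above, the snippet below computes the overall time range that a chart would have to cover for a list of timings. The `started_at`, `ended_at`, and `resource` fields follow the JSON shared in this thread; `timingsRangeMs` and the `hypothetical_storage.home` entry are assumptions made up for illustration, not Coder code or real data.

```typescript
// Minimal shape of a timing entry, based on the JSON shared in this thread.
interface Timing {
  started_at: string;
  ended_at: string;
  resource: string;
}

// Hypothetical helper: the span from the earliest start to the latest end.
function timingsRangeMs(timings: Timing[]): number {
  if (timings.length === 0) {
    return 0;
  }
  const starts = timings.map((t) => Date.parse(t.started_at));
  const ends = timings.map((t) => Date.parse(t.ended_at));
  return Math.max(...ends) - Math.min(...starts);
}

// Two entries taken from the second timings payload in this thread.
const currentBuild: Timing[] = [
  {
    started_at: "2025-01-05T12:47:14.819777Z",
    ended_at: "2025-01-05T12:47:18.654211Z",
    resource: "state file",
  },
  {
    started_at: "2025-01-05T12:47:19.815179Z",
    ended_at: "2025-01-05T12:47:20.238378Z",
    resource: "docker_container.workspace[0]",
  },
];

// Assumed entry: a long-lived resource whose timing points one month back.
const staleEntry: Timing = {
  started_at: "2024-12-05T12:00:00.000Z",
  ended_at: "2024-12-05T12:00:05.000Z",
  resource: "hypothetical_storage.home",
};

const freshRangeMs = timingsRangeMs(currentBuild);
const mixedRangeMs = timingsRangeMs([...currentBuild, staleEntry]);

console.log({ freshRangeMs, mixedRangeMs });
```

The build on its own spans a few seconds. Adding a single entry from a month earlier stretches the range to more than 30 days, even though the work actually measured still takes only seconds.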
We are indeed using storage |
Do we have a reproduction of this behaviour?
I don't think this is the correct approach. If there's indeed a bug leading to incorrect timeouts, we must fix it. Additionally, we should either add 1) scaling and/or 2) maximum sizes for timing spans. There's no value in having the user scroll beyond one or two "pages" (i.e. widths) of the graph. We don't need precision here, we just need an indicative view of where the time was spent. For this reason I think scaling is probably the cleanest solution - but it has the obvious downside that short timings would become impossibly narrow; I think that's OK because you only really care about the lengthy span. |
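One way to read the scaling suggestion above is to bound the number of ticks regardless of the range. The sketch below is only an illustration under that assumption: the `SCALES_MS` ladder, the default limit of 20 ticks, and the helper names are all hypothetical and not taken from the Coder codebase.

```typescript
// Candidate tick intervals in milliseconds, from 1 ms up to 1 day.
// Both this ladder and the default limit below are illustrative assumptions.
const SCALES_MS = [
  1, 10, 100, 1_000, 10_000, 60_000, 600_000, 3_600_000, 86_400_000,
];

// Hypothetical helper: pick the smallest interval that keeps the tick count
// at or below maxTicks. For ranges too long for the ladder, divide evenly.
function pickTickInterval(rangeMs: number, maxTicks: number): number {
  if (!Number.isInteger(maxTicks) || maxTicks < 1) {
    throw new RangeError("maxTicks must be a positive integer");
  }
  for (const scale of SCALES_MS) {
    if (Math.ceil(rangeMs / scale) <= maxTicks) {
      return scale;
    }
  }
  return Math.ceil(rangeMs / maxTicks);
}

// Number of ticks actually drawn once the interval has been scaled.
function scaledTickCount(rangeMs: number, maxTicks = 20): number {
  return Math.ceil(rangeMs / pickTickInterval(rangeMs, maxTicks));
}

const shortBuildMs = 5_419; // a build lasting a few seconds
const threeWeeksMs = 21 * 24 * 60 * 60 * 1000; // the 21-day case from this thread

console.log(pickTickInterval(shortBuildMs, 20), scaledTickCount(shortBuildMs));
console.log(pickTickInterval(threeWeeksMs, 20), scaledTickCount(threeWeeksMs));
```

The trade-off is the one already mentioned: with a coarse interval, short timings become very narrow on a long range, but the chart stays cheap to render.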
Hum... I don't think incorrect timeouts are the issue here 🤔. What is causing the memory leak is the FE trying to render the chart for a long range, such as one week, when the range is expected to be in seconds or minutes. This leads me to a question about how we calculate build times. For me, it does not make sense to include non-ephemeral resources in the estimation, since they are not related to the build specifically, like a storage that is created on the first build and reused for the subsequent ones.
I also think precision is not a huge thing here; more important, for me, is which types of resources we consider during the build time measurement. |
@BrunoQuaresma I'm in total agreement about sending back the correct measurements, which is why I asked if you have a reproduction so we can validate this. I can't see how the situation you described would occur. |
@dannykopping I will try to reproduce this on dev.coder.com and share the workspace with you 🙏 |
Experiencing this as well. |
Maybe some of these issues are going to be fixed by #17514 |
What about the vscode workspace crashing? |
Updates!
Please reach out to us if you are still facing this issue after this gets released. |
I don't have many details, other than that when opening the workspace details page, the page freezes up and eventually crashes (the Oh snap! page)
That makes it pretty tricky to capture the issue...