-
Notifications
You must be signed in to change notification settings - Fork 881
Add health-check for coder_apps
#2662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Kubernetes has an initial delay and a period that polls at an interval. It could be an interesting thing to add: I wonder if this should be enabled or disabled by default? I can see it both ways. |
Kubernetes is a good API to peek at, but I think we can simplify considerably. It has two checks,
Since we're just deciding whether to allow the app to be clickable on the dashboard, we should only have one kind of check. Also, I think we can drop the "initial delay." Presumably that is there to prevent Kubernetes from restarting a container that just takes a long time to start. For our use case, it doesn't matter if the app fails the first health probes --- we just continue to wait. |
I think there could still be value in "initial delay" so that we can accurately inform the user that there's a problem. They may be wondering why the app stays grayed-out forever? |
I think an |
However, I'm seeing now the unhealthy threshold in GCP is simply a number of checks so I could also see a few options working together to accurately check if an app starts in time, or fails. 🤷🏼 |
@Emyrk do you have context on the effort for this in V1 in terms of complexity? Sounds like this might need an RFC. |
Do you mean for V1 generic applications? @misskniss I haven't worked with this part of the code yet in V2, so I can't comment on v2 complexity. |
@misskniss this issue isn't relevant to v1. What context is missing from the issue that would require an RFC? Ben filled this out yesterday. |
I feel like we can confidently come to a solution with the information we have, so it's up to the engineering owner how to settle on a schema (RFC, comments in GitHub, etc) I think we mostly just need an owner and estimate to get things going. The idea of an RFC came up because it won't be a direct port from v1, but I've seen us implement features in different ways, such as done in #2989 and #2179 |
@kylecarbs V1 only came up because we were asking about who had experience with it previously is all. @Emyrk you were not in the session we were discussing it in but people thought you may have good insight on things we liked and did not like in V1, not that we expected you to take ownership necessarily. Though it is up for grabs if you want it. |
I wrote the V1 implementation so I could maybe help out here. I would love to hear feedback on what the pros/cons were of the first version...I never got much at the time (could be a good or a bad thing!). |
@sreya - I haven't heard any cons of the health checks in v1, it continues to work well for me. One thing I'm worried about (but lack the full context) is that v1 health checks may not work well for apps that may take up to 3 minutes to download and start. This will be common in Coder OSS. Based on that, we were discussing a few dashboard designs/changes to the schema to support longer wait times, as opposed to keeping a request open. Would love your thoughts here. |
Architecturally, I think we should go with the local agent actually performing the health checks, and then reporting changes in status up to Coder Server, which puts it in the DB. This matches Kubernetes architecture where the local kubelet does health checking against pods running on the node. I thought I heard it mentioned on the call that Coder Server should do it, but I think this isn't great because:
cc @kylecarbs |
Entirely agree with @spikecurtis! |
Adding an extra opinion on this. Would be nice to have this endpoint returning some status like “not found”, “initializing”, “ready” |
coder_apps
and items in coder_script
before allowing users to click them in the workspace UIcoder_apps
edited by @bpmct
Problem statement
Throughout Coder's documentation and examples, the startup_script is used to install web IDEs onto the workspace, such as code-server, Jetbrains Projector, JupyterLab, etc. From there, users connect via links in the dashboard. In the template, this is defined via the coder_app resource.
When the workspace starts, it takes 15-60s for the IDEs to install before a user can get to the page. When they click it before the app loads, there's a 404 page:
However, when you refresh 30 seconds later, it works!
Definition of done
From the dashboard, the app cannot be opened until the health check passes, or the app is eventually deemed unhealthy.
Prior art
Health checks are implemented for generic apps in Coder Classic with support for
exec
andhttp
based health checks, but with a hardcoded interval/timeout/unhealthy threshold. There is not a loading indicator in the Ui, but when an app is clicked, the tab loads forx
seconds until the health check passes/fails.Ideas
@bpmct: Unlike health checks in Coder Classic, I think health checks would benefit from a configurable
unhealthy threshold
since applications will often be installed during runtime, leading to longer-than-normal "wait times." GCP follows this patternAdd health check to
coder_app
resourceBefore the unhealthy threshold, a loading indicator could be present making it clear to users the app is still unhealthy/loading until the health check passes or times out. Perhaps the app is also unclickable
After the threshold is exceeded (e.g 3 mins), the app can have a red/error indicator if the health check never passes:
The text was updated successfully, but these errors were encountered: