add health check for out-of-date provisioner daemon #10676
@BrunoQuaresma pinged me about the healthcheck section schema. Here is the first draft:

```json
{
  "provisionerd": [
    {
      "name": "provisioner-0",
      "version": "0.23.4",
      "provisioners": ["echo", "terraform"],
      "tags": ["ml-dev", "qa"],
      "created_at": "<timestamp>",
      "last_seen_at": "<timestamp>"
    },
    {
      "name": "provisioner-1",
      ...
    }
  ]
}
```

Let me know your thoughts!
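For illustration, a minimal Go sketch of a response type matching that draft; the struct and field names here are assumptions for the discussion, not the actual coderd/codersdk types.

```go
package healthcheck

import "time"

// ProvisionerDaemonsReport mirrors the draft JSON above. Names here are
// assumptions for illustration, not the real coderd/codersdk types.
type ProvisionerDaemonsReport struct {
	Provisionerd []ProvisionerDaemonEntry `json:"provisionerd"`
}

// ProvisionerDaemonEntry describes a single registered daemon.
type ProvisionerDaemonEntry struct {
	Name         string    `json:"name"`
	Version      string    `json:"version"`
	Provisioners []string  `json:"provisioners"`
	Tags         []string  `json:"tags"`
	CreatedAt    time.Time `json:"created_at"`
	LastSeenAt   time.Time `json:"last_seen_at"`
}
```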
I think that's a reasonable place to start.
I will see what I can do here:
@johnstcn Do you think we should rename …? EDIT: I'm happy to do this, just asking for your thoughts.
@mtojek Yeah, I think that makes sense.
@spikecurtis raised questions around the vision, so let me elaborate more on the MVP with reference to the healthcheck requirements:
My interpretation of this is some kind of heartbeat. Is that correct? Are we just having Coderd do the heartbeats as long as there is a connection to the provisionerd, or does it need to originate at the provisionerd (e.g. to detect provisioners that are deadlocked even though the connection is up)?
In a deployment where external provisioner daemons are scaled up and down, you'll see a bunch of provisioner daemons that haven't reported in because they were stopped in a scale-down event. I think we need to be careful in the design not to be "alarmist" about these: just note that they were last connected within 24 hours and are now not connected. Don't call this a "warning."
My original thought was to have provisionerdserver bump `last_seen_at` itself. Having provisionerd ping a separate endpoint that bumps `last_seen_at` is a bit further removed than I'd like.
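A rough sketch of that approach, assuming a hypothetical store method (not an existing coderd query); the idea is that any RPC handled by provisionerdserver refreshes the timestamp, so no separate heartbeat endpoint is needed:

```go
package provisionerdserver

import (
	"context"
	"time"

	"github.com/google/uuid"
)

// DaemonStore is a stand-in for the database layer; the method below is
// hypothetical, purely for illustration.
type DaemonStore interface {
	UpdateProvisionerDaemonLastSeenAt(ctx context.Context, id uuid.UUID, lastSeenAt time.Time) error
}

// bumpLastSeen would be called from each RPC handler (AcquireJob, UpdateJob,
// CompleteJob, ...) so last_seen_at advances whenever the daemon talks to
// coderd over its existing connection.
func bumpLastSeen(ctx context.Context, store DaemonStore, daemonID uuid.UUID) error {
	return store.UpdateProvisionerDaemonLastSeenAt(ctx, daemonID, time.Now().UTC())
}
```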
Tying the …
I know that Cian wrote the mini-design doc (thanks!), but it would be nice to close the conversation.
I defer the final decision to the person responsible for implementation, but my recommendation is to make a best effort. If a deadlock detection mechanism is hard to implement, I'm fine with scheduling it as a follow-up. A provisionerd heartbeat would already be a great improvement.
Do you know, @spikecurtis, if the existing provisionerd could handle the scale-down signal somehow? Even responding to an OS signal and firing an HTTP request to coderd could be a nice starting point.
This, or an artificial, hidden ALIVE packet streamed similarly to job log entries.
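A minimal sketch of the signal idea, assuming a hypothetical notification call to coderd (no such endpoint is confirmed here):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// notifyCoderdShutdown is hypothetical: it would mark this daemon as cleanly
// stopped (e.g. via an HTTP call to coderd), so the healthcheck can tell
// "scaled down" apart from "silently disappeared".
func notifyCoderdShutdown(ctx context.Context) error {
	return nil
}

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	<-sigCh // scale-down events typically deliver SIGTERM

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := notifyCoderdShutdown(ctx); err != nil {
		log.Printf("best-effort shutdown notification failed: %v", err)
	}
}
```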
I'm going to say that surfacing provisioner lock-ups is outside the scope of this specific issue ("add health check for out-of-date provisioner daemon"). We can look deeper into this as a follow-up enhancement.
@stirby current status: this week I've been finishing off the final pieces of this issue. Please see the issue description for the task list and updates.
Part of #10676

- Adds a health section for provisioner daemons (mostly cannibalized from the Workspace Proxy section)
- Adds a corresponding storybook entry for the provisioner daemons health section
- Fixes an issue where dismissing the provisioner daemons warnings would result in a 500 error
- Adds provisioner daemon error codes to docs
Closing based on the completed to-do list in the issue description.
Extracted from #9558
If provisioner daemons are out of date with the server version, unexpected behavior can occur. The health check endpoint `/api/v2/debug/health` should display configured provisioners and their versions.

TODOs:

- Add `provisioner_daemon.version` to the database (gallant_galois0).
- Add a `major.minor` version check of active provisioner daemons: feat(coderd): add provisioner_daemons to /debug/health endpoint #11393, fix(coderd/healthcheck): add daemon-specific warnings to healthcheck output #11490
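As a sketch of what the `major.minor` check could look like, using golang.org/x/mod/semver; this is an illustration of the idea, not the code that landed in the PRs above:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/mod/semver"
)

// sameMajorMinor reports whether two versions agree on major.minor,
// ignoring the patch component.
func sameMajorMinor(a, b string) bool {
	// semver.MajorMinor expects a leading "v".
	if !strings.HasPrefix(a, "v") {
		a = "v" + a
	}
	if !strings.HasPrefix(b, "v") {
		b = "v" + b
	}
	if !semver.IsValid(a) || !semver.IsValid(b) {
		return false
	}
	return semver.MajorMinor(a) == semver.MajorMinor(b)
}

func main() {
	fmt.Println(sameMajorMinor("2.5.1", "2.5.0")) // true: patch drift is fine
	fmt.Println(sameMajorMinor("2.5.1", "2.4.0")) // false: daemon is out of date
}
```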