add health check for out-of-date provisioner daemon #10676


Closed · 8 tasks done · Tracked by #8971
mtojek opened this issue Nov 14, 2023 · 13 comments

mtojek (Member) commented Nov 14, 2023

Extracted from #9558

If provisioner daemons are out of date with the server version, unexpected behavior can occur. The health check endpoint /api/v2/debug/health should display configured provisioners and their versions.
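
As a rough illustration of what "out of date" could mean here, a minimal sketch of the version comparison, assuming the health report carries each daemon's reported version and the server's build version as plain strings like "2.5.1" (the type and field names below are hypothetical, not the actual coder/coder API):

package health

import "golang.org/x/mod/semver"

// ProvisionerDaemonReport is a hypothetical stand-in for whatever the health
// report ends up containing per daemon.
type ProvisionerDaemonReport struct {
    Name    string
    Version string // version reported by the daemon
}

// outOfDate flags a daemon whose major.minor differs from the server's,
// which is roughly the situation where unexpected behavior can occur.
func outOfDate(serverVersion string, d ProvisionerDaemonReport) bool {
    // golang.org/x/mod/semver requires a leading "v", e.g. "v2.5.1".
    return semver.MajorMinor("v"+serverVersion) != semver.MajorMinor("v"+d.Version)
}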

TODOs:

cdr-bot added the feature label Nov 14, 2023
mtojek added the feature label Nov 14, 2023
mtojek changed the title from "add health check for out-of-date provisioner" to "add health check for out-of-date provisioner daemon" Nov 14, 2023
johnstcn self-assigned this Nov 17, 2023
mtojek (Member, Author) commented Nov 29, 2023

@BrunoQuaresma pinged me about the healthcheck section schema. Here is the first draft:

{
    "provisionerd": [
        {
            "name": "provisioner-0",
            "version": "0.23.4",
            "provisioners": ["echo", "terraform"],
            "tags": ["ml-dev", "qa"],
            "created_at": "<timestamp>",
            "last_seen_at": "<timestamp>"
        },
        {
            "name": "provisioner-1",
            ...
        }
    ]
}

Let me know your thoughts!
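
For reference, a rough Go mirror of that draft, just to make the field types explicit (the struct names are assumed; the real report type may look different):

package healthcheck

import "time"

// ProvisionerDaemonEntry mirrors one element of the "provisionerd" array above.
type ProvisionerDaemonEntry struct {
    Name         string    `json:"name"`
    Version      string    `json:"version"`
    Provisioners []string  `json:"provisioners"`
    Tags         []string  `json:"tags"`
    CreatedAt    time.Time `json:"created_at"`
    LastSeenAt   time.Time `json:"last_seen_at"`
}

// ProvisionerDaemonsReport is the healthcheck section wrapping those entries.
type ProvisionerDaemonsReport struct {
    Provisionerd []ProvisionerDaemonEntry `json:"provisionerd"`
}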

johnstcn (Member) commented:
@BrunoQuaresma pinged me about the healthcheck section schema. Here is the first draft: […]

I think that's a reasonable place to start.

mtojek (Member, Author) commented Nov 30, 2023

I will see what I can do here:

Prune inactive provisioner daemons from database.
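
A minimal sketch of what that pruning could look like, assuming a provisioner_daemons table with a last_seen_at column (the table and column names, interval, and threshold are assumptions, not the actual schema):

package prune

import (
    "context"
    "database/sql"
    "time"
)

// pruneLoop periodically deletes provisioner daemons that have not been seen
// within olderThan. It runs until the context is cancelled.
func pruneLoop(ctx context.Context, db *sql.DB, olderThan, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            cutoff := time.Now().Add(-olderThan)
            _, _ = db.ExecContext(ctx,
                `DELETE FROM provisioner_daemons WHERE last_seen_at < $1`, cutoff)
        }
    }
}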

mtojek (Member, Author) commented Dec 4, 2023

@johnstcn Do you think we should rename updated_at to last_seen_at, since that is exactly what this field means?

EDIT:

I'm happy to do this; just asking for your thoughts.

johnstcn (Member) commented Dec 5, 2023

@mtojek Yeah I think that makes sense.

mtojek (Member, Author) commented Dec 11, 2023

@spikecurtis raised questions about the vision, so let me elaborate on the MVP with respect to the healthcheck requirements:

  1. We'd like to include a basic section for provisioners in the healthcheck that signals whether a provisioner failed to report its status in time. The last_seen_at property (and the corresponding field in the database) will be refreshed every few minutes.
  2. The healthcheck section will evaluate whether a provisioner is up or unresponsive, say, if it has not reported its state within 15 minutes. For now, we can consider only provisioners that checked in during the last 24h (see the sketch after this list).
  3. Older entries would be swept by a DeleteOld... job.
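
A small, purely illustrative sketch of those staleness rules (the 15-minute and 24-hour thresholds come from the points above; the status strings are assumptions):

package healthcheck

import "time"

const (
    staleAfter  = 15 * time.Minute // unresponsive after this much silence
    ignoreAfter = 24 * time.Hour   // older than this: not considered at all
)

// daemonStatus classifies a provisioner daemon based on when it last checked in.
func daemonStatus(now, lastSeenAt time.Time) string {
    switch age := now.Sub(lastSeenAt); {
    case age > ignoreAfter:
        return "not considered" // likely scaled down; not treated as a warning
    case age > staleAfter:
        return "unresponsive"
    default:
        return "ok"
    }
}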

spikecurtis (Contributor) commented:
We'd like to include a basic section for provisioners in the healthcheck that signals whether a provisioner failed to report its status in time. The last_seen_at property (and the corresponding field in the database) will be refreshed every few minutes.

My interpretation of this is some kind of heartbeat. Is that correct? Are we just having Coderd do the heartbeats as long as there is a connection to the provisionerd, or does it need to originate at the Provisionerd (e.g. to detect provisioners that are deadlocked even though the connection is up)?

The healthcheck section will evaluate whether a provisioner is up or unresponsive, say, if it has not reported its state within 15 minutes. For now, we can consider only provisioners that checked in during the last 24h.

In a deployment where external provisioner daemons are scaled up and down, you'll see a bunch of provisioner daemons that haven't reported in because they were stopped in a scale-down event. I think we need to be careful in the design not to be "alarmist" about these: just note that they were last connected within 24 hours and are now not connected. Don't call this a "warning."

johnstcn (Member) commented:
My interpretation of this is some kind of heartbeat. Is that correct? Are we just having Coderd do the heartbeats as long as there is a connection to the provisionerd, or does it need to originate at the Provisionerd (e.g. to detect provisioners that are deadlocked even though the connection is up)?

My original thought was to have provisionerdserver bump last_seen_at whenever AcquireJob or AcquireJobWithCancel is called. In the latter case, we may need to spawn a separate goroutine to bump last_seen_at while the in-memory channel is waiting for a job.

Having provisionerd ping a separate endpoint that bumps last_seen_at is a bit further removed than I'd like.

spikecurtis (Contributor) commented Dec 11, 2023

My original thought was to have provisionerdserver bump last_seen_at whenever AcquireJob or AcquireJobWithCancel is called. In the latter case, we may need to spawn a separate goroutine to bump last_seen_at while the in-memory channel is waiting for a job.

Tying the last_seen_at bump explicitly to jobs is going to create annoying edge cases. What if a job takes 30 minutes to run? We should just create a goroutine that bumps it periodically as long as the provisioner is connected.
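
A sketch of that connection-lifetime heartbeat: bump last_seen_at on a ticker for as long as the daemon's connection is alive, independent of any running job. The updateLastSeenAt callback is a stand-in for whatever store method actually performs the update; the names here are assumptions.

package provisionerdserver

import (
    "context"
    "time"

    "github.com/google/uuid"
)

// heartbeat refreshes last_seen_at for daemonID every interval until the
// connection context is cancelled.
func heartbeat(ctx context.Context, daemonID uuid.UUID, interval time.Duration,
    updateLastSeenAt func(context.Context, uuid.UUID, time.Time) error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done(): // connection closed or server shutting down
            return
        case now := <-ticker.C:
            _ = updateLastSeenAt(ctx, daemonID, now)
        }
    }
}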

mtojek (Member, Author) commented Dec 11, 2023

I know that Cian wrote the mini-design doc (thanks!), but it would be nice to close the conversation.

Are we just having Coderd do the heartbeats as long as there is a connection to the provisionerd, or does it need to be originated at the Provisionerd (e.g. to detect provisioners that are deadlocked even tho the connection is up)?

I defer the final decision to the person responsible for implementation, but my recommendation is to make a best effort. If a deadlock detection mechanism is hard to implement, I'm fine with scheduling it as a follow-up. A provisionerd heartbeat alone would already be a great improvement.

because they were stopped in a scale-down event.

Do you know, @spikecurtis, if the existing provisionerd could handle the scale-down signal somehow? Even responding to an OS signal and firing an HTTP request to coderd could be a nice start.
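
A sketch of that idea, assuming provisionerd would catch SIGTERM/SIGINT and make a best-effort call before exiting; the notifyCoderd callback is purely hypothetical, since no such deregistration endpoint exists today:

package main

import (
    "context"
    "os"
    "os/signal"
    "syscall"
)

// runUntilSignal blocks until the context ends or a termination signal
// arrives, then makes a best-effort attempt to tell coderd we are going away
// so the health check does not flag this daemon as unresponsive.
func runUntilSignal(ctx context.Context, notifyCoderd func(context.Context) error) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
    defer signal.Stop(sigCh)

    select {
    case <-ctx.Done():
    case <-sigCh:
        _ = notifyCoderd(ctx)
    }
}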

What if a job takes 30 minutes to run? We should just create a goroutine that bumps it periodically as long as the provisioner is connected.

This, or an artificial, hidden ALIVE packet streamed similarly to job log entries.

johnstcn (Member) commented:
I'm going to say that surfacing provisioner lock-ups is outside the scope of this specific issue ("add health check for out-of-date provisioner daemon"). We can look deeper into this as a follow-up enhancement.

johnstcn (Member) commented Jan 4, 2024

@stirby current status: this week I've been finishing off the final pieces of this issue. Please see the issue description for the task list and updates.

johnstcn added a commit that referenced this issue Jan 8, 2024
Part of #10676

- Adds a health section for provisioner daemons (mostly cannibalized from the Workspace Proxy section)
- Adds a corresponding Storybook entry for the provisioner daemons health section
- Fixes an issue where dismissing the provisioner daemons warnings would result in a 500 error
- Adds provisioner daemon error codes to the docs
johnstcn (Member) commented Jan 9, 2024

Closing based on completed to-do list in issue description.

johnstcn closed this as completed Jan 9, 2024