-
Notifications
You must be signed in to change notification settings - Fork 887
feat: add computed workspace and agent health fields to the api #8280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may not see the full picture, but if we want to KISS, why just not add 2 properties for every agent:
"health": {
"healthy": false,
"reason": "Foobar..."
}
I have a feeling that failing_sections
seems to be an overkill for what we're trying to achieve.
@@ -1262,6 +1262,24 @@ func convertWorkspaceAgent(derpMap *tailcfg.DERPMap, coordinator tailnet.Coordin | |||
workspaceAgent.ReadyAt = &dbAgent.ReadyAt.Time | |||
} | |||
|
|||
switch { | |||
case workspaceAgent.Status != codersdk.WorkspaceAgentConnected && workspaceAgent.LifecycleState == codersdk.WorkspaceAgentLifecycleOff: | |||
workspaceAgent.Health.Reason = "agent is not running" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be not in the loop, but what is the reason for returning the human-readable reason over API? Did you consider putting this mapping in the site/UI code?
Let's say that we want the site/UI to take an action on agent is taking too long to connect
, does it mean that we have to compare strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a valid concern. I suppose this method allows us to skip the introduction of a new enum and simply surface the reason in the UI. This is similar to how we do with API errors currently where we have a human readable reason in the HTTP response.
If we make this an enum, it needs to be kept up-to-date over UI, CLI, etc.
We've done that in this PR, except we also have the property on the workspace (which is currently only based on the sum of the agent healths).
I somewhat agree since currently we only rely on agent healths. However, out of a UI simplicity perspective, it's nice to be able to check one property on the workspace instead of diving into all the agents to figure out where the issue is. This field could easily (and probably should?) be expanded to include provisioning as well. To illustrate the use-case, let's say we add provisioning to health and if it fails we have The naming here comes mainly from trying to match the debug endpoint we have, as mentioned in #6461 (comment). |
Let me challenge this. Coder workspaces use at most 1 (maybe 2?) agent per workspace. It does not require complex UI processing on the frontend side. Also, I would consider deferring
I would not drive the API design with front-end shortcuts. What is important for a good API design is making endpoints concise and idempotent. IMHO |
I do see your point, however, I would rather equate this with restful APIs where e.g. the URL to the "next page of results" is part of the response. Likewise this That said, I'm fine with removing the sections feature. The tradeoff is that if we do add more reasons for unhealthiness, we'll need to find and update all the places that only took agents into consideration (for better or worse). |
Thanks for addressing the issue. I think that this form is still open for modifications unless we predict other health factors to consider? Anyway, as chatted with you over slack, I don't intend to slow down this feature delivery, so if more voices are for keeping the |
@mtojek thinking about this some more, how do you feel about the following format for the API response? {
// ...
"health": {
"healthy": false,
"failing_agents": ["dbd7447f-e4e2-4d9f-92a2-136f84df4b2f"]
},
"resources": [
{
// ...
"agents": [
{
"id": "dbd7447f-e4e2-4d9f-92a2-136f84df4b2f",
// ...
"health": {
"healthy": false,
"reason": "timeout|disconnected|start_error|..."
}
}
]
}
]
} We've eliminated the duplication but kept the response informative. This can't be extended in a generic way in the future, so if we want more knobs for health (like workspace build failures, etc.), we need to refactor the code that uses it. That should be fairly simple, though. From a client/UI perspective this is a bit harder to use than duplicating the information in the workspace health, but if we're fine with that, we can use this approach. Whether or not
|
I like this approach more. It is intuitive, and a machine reader won't be required to process the details of the agent structure. In theory, we don't expect our APIs to be human-readable, but I don't know if this requirement changes in the future.
Workspace build is a process of provisioning the workspace, so it can't expose any health metrics by its nature 🤔. I would rather stick to the workspace health. |
0b74ac8
to
c09d750
Compare
c09d750
to
9774488
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome 👍
802801e
to
36ec2ba
Compare
36ec2ba
to
fba12e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, we can use ENUM or human-readable messages. I don't have thoughts about this... I already worked with both and never saw a HUGE difference in development or maintainability in the long term so whatever is best for the BE.
Alright, let's go with this format for now 👍🏻. If we want an enum instead, we can introduce it later or add a second field ( |
This PR adds support for workspace and agent health fields.
See: #8280 (comment)
Previous iterations
Example output for a workspace:Update: We now only have a health field for agents to keep it simple:
To check on workspace health, one can do
const unhealthyAgents = workspace.latest_build.resources.map(r => r.agents).flat().filter(a => !a.health.healhty)
Refs #6461