feat: add computed workspace and agent health fields to the api #8280

mafredri · 2023-06-30T16:32:26Z

This PR adds support for workspace and agent health fields.

Previous iterations

Example output for a workspace:

{
    // ...
    "health": {
        "healthy": false,
        "failing_sections": [
            "agents.dbd7447f-e4e2-4d9f-92a2-136f84df4b2f"
        ],
        "agents": {
            "dbd7447f-e4e2-4d9f-92a2-136f84df4b2f": {
                "healthy": false,
                "reason": "agent startup script exited with an error"
            }
        }
    }
}

Update: We now only have a health field for agents to keep it simple:

{
	// ...
	"health": {
        "healthy": false,
        "reason": "agent startup script exited with an error"
    }
}

To check on workspace health, one can do const unhealthyAgents = workspace.latest_build.resources.map(r => r.agents).flat().filter(a => !a.health.healhty)

Refs #6461

mtojek

I may not see the full picture, but if we want to KISS, why just not add 2 properties for every agent:

"health": {
    "healthy": false,
    "reason": "Foobar..."
}

I have a feeling that failing_sections seems to be an overkill for what we're trying to achieve.

codersdk/workspaces.go

mtojek · 2023-07-03T09:30:12Z

coderd/workspaceagents.go

@@ -1262,6 +1262,24 @@ func convertWorkspaceAgent(derpMap *tailcfg.DERPMap, coordinator tailnet.Coordin
 		workspaceAgent.ReadyAt = &dbAgent.ReadyAt.Time
 	}

+	switch {
+	case workspaceAgent.Status != codersdk.WorkspaceAgentConnected && workspaceAgent.LifecycleState == codersdk.WorkspaceAgentLifecycleOff:
+		workspaceAgent.Health.Reason = "agent is not running"


I might be not in the loop, but what is the reason for returning the human-readable reason over API? Did you consider putting this mapping in the site/UI code?

Let's say that we want the site/UI to take an action on agent is taking too long to connect, does it mean that we have to compare strings?

This is a valid concern. I suppose this method allows us to skip the introduction of a new enum and simply surface the reason in the UI. This is similar to how we do with API errors currently where we have a human readable reason in the HTTP response.

If we make this an enum, it needs to be kept up-to-date over UI, CLI, etc.

codersdk/workspaces.go

mafredri · 2023-07-03T09:56:03Z

I may not see the full picture, but if we want to KISS, why just not add 2 properties for every agent:
"health": {
    "healthy": false,
    "reason": "Foobar..."
}

We've done that in this PR, except we also have the property on the workspace (which is currently only based on the sum of the agent healths).

I have a feeling that failing_sections seems to be an overkill for what we're trying to achieve.

I somewhat agree since currently we only rely on agent healths. However, out of a UI simplicity perspective, it's nice to be able to check one property on the workspace instead of diving into all the agents to figure out where the issue is. This field could easily (and probably should?) be expanded to include provisioning as well.

To illustrate the use-case, let's say we add provisioning to health and if it fails we have failing_sections = ['provision'] and { provision: { healthy: false, reason: "error from terraform" }. Now the UI can list workspaces and simply do if (!workspace.health.healthy) { for (let section of workspace.health.failing_sections) { show(workspace.health[section].reason) } }. This is simplified doesn't work as-is on the agents map (agents.uuid key) without extra processing, but hopefully illustrates the use-case.

The naming here comes mainly from trying to match the debug endpoint we have, as mentioned in #6461 (comment).

mtojek · 2023-07-03T10:15:09Z

it's nice to be able to check one property on the workspace instead of diving into all the agents to figure out where the issue is.

Let me challenge this.

Coder workspaces use at most 1 (maybe 2?) agent per workspace. It does not require complex UI processing on the frontend side. Also, I would consider deferring failing_sections for later until we have reasons to add the property. I would not build a super-flexible endpoint if we don't have plans to use the field. More fields = more maintenance.

This is simplified doesn't work as-is on the agents map (agents.uuid key) without extra processing, but hopefully illustrates the use-case.

I would not drive the API design with front-end shortcuts. What is important for a good API design is making endpoints concise and idempotent. IMHOfailing_sections is definitely redundant, and I would rather treat it as a field for future optimizations.

mafredri · 2023-07-03T10:23:12Z

I would not drive the API design with front-end shortcuts. What is important for a good API design is making endpoints concise and idempotent. IMHOfailing_sections is definitely redundant, and I would rather treat it as a field for future optimizations.

I do see your point, however, I would rather equate this with restful APIs where e.g. the URL to the "next page of results" is part of the response. Likewise this failing_sections field is pointing towards what caused the workspace to become unhealthy.

That said, I'm fine with removing the sections feature. The tradeoff is that if we do add more reasons for unhealthiness, we'll need to find and update all the places that only took agents into consideration (for better or worse).

mtojek · 2023-07-03T16:09:24Z

That said, I'm fine with removing the sections feature. The tradeoff is that if we do add more reasons for unhealthiness, we'll need to find and update all the places that only took agents into consideration (for better or worse).

Thanks for addressing the issue. I think that this form is still open for modifications unless we predict other health factors to consider?

Anyway, as chatted with you over slack, I don't intend to slow down this feature delivery, so if more voices are for keeping the failing_sections metadata, I will not block this PR. I just don't see this to be super-complicated now, so that we need an extra helper field to "calculate" workspace health.

mafredri · 2023-07-04T11:22:58Z

@mtojek thinking about this some more, how do you feel about the following format for the API response?

{
    // ...
    "health": {
        "healthy": false,
        "failing_agents": ["dbd7447f-e4e2-4d9f-92a2-136f84df4b2f"]
    },
    "resources": [
        {
            // ...
            "agents": [
                {
                    "id": "dbd7447f-e4e2-4d9f-92a2-136f84df4b2f",
                    // ...
                    "health": {
                        "healthy": false,
                        "reason": "timeout|disconnected|start_error|..."
                    }
                }
            ]
        }
    ]
}

We've eliminated the duplication but kept the response informative. This can't be extended in a generic way in the future, so if we want more knobs for health (like workspace build failures, etc.), we need to refactor the code that uses it. That should be fairly simple, though.

From a client/UI perspective this is a bit harder to use than duplicating the information in the workspace health, but if we're fine with that, we can use this approach.

Whether or not reason should be human readable or an enum (enum of the "bad" states for status|lifecycle_state), I'll leave up to @BrunoQuaresma and @bpmct.

~~PS. Maybe we should make health a property of workspace build (latest_build) instead of the workspace itself? Not sure... conceptually both work.~~

mtojek · 2023-07-04T11:32:04Z

I like this approach more. It is intuitive, and a machine reader won't be required to process the details of the agent structure. In theory, we don't expect our APIs to be human-readable, but I don't know if this requirement changes in the future.

PS. Maybe we should make health a property of workspace build (latest_build) instead of the workspace itself? Not sure... conceptually both work.

Workspace build is a process of provisioning the workspace, so it can't expose any health metrics by its nature 🤔. I would rather stick to the workspace health.

mtojek

Awesome 👍

coderd/workspaces_test.go

BrunoQuaresma

LGTM, we can use ENUM or human-readable messages. I don't have thoughts about this... I already worked with both and never saw a HUGE difference in development or maintainability in the long term so whatever is best for the BE.

mafredri · 2023-07-10T09:00:00Z

LGTM, we can use ENUM or human-readable messages. I don't have thoughts about this... I already worked with both and never saw a HUGE difference in development or maintainability in the long term so whatever is best for the BE.

Alright, let's go with this format for now 👍🏻. If we want an enum instead, we can introduce it later or add a second field (reason_code, etc.).

feat: add computed workspace and agent health fields to the api

60489f9

github-actions bot assigned mafredri Jun 30, 2023

mafredri mentioned this pull request Jun 30, 2023

"Running" on workspace list isn't helpful if the agent is disconnected #6461

Closed

update golden files

16976c0

mtojek reviewed Jul 3, 2023

View reviewed changes

remove health property from workspace

15cfc3f

mtojek mentioned this pull request Jul 3, 2023

feat: delay pending-in-queue banner #8309

Merged

mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch 2 times, most recently from 0b74ac8 to c09d750 Compare July 4, 2023 12:38

implement proposal, add tests and examples

9774488

mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from c09d750 to 9774488 Compare July 4, 2023 12:47

Merge branch 'main' into mafredri/feat-add-workspace-and-agent-health

a62c6ae

mafredri marked this pull request as ready for review July 4, 2023 13:04

mafredri requested review from mtojek, BrunoQuaresma and bpmct July 4, 2023 13:06

mtojek approved these changes Jul 4, 2023

View reviewed changes

coderd/workspaces_test.go Show resolved Hide resolved

coderd/workspaces_test.go Outdated Show resolved Hide resolved

mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from 802801e to 36ec2ba Compare July 4, 2023 16:25

more tests

fba12e2

mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from 36ec2ba to fba12e2 Compare July 4, 2023 16:28

mtojek closed this in #8309 Jul 6, 2023

github-actions bot locked and limited conversation to collaborators Jul 6, 2023

mafredri reopened this Jul 6, 2023

BrunoQuaresma approved these changes Jul 6, 2023

View reviewed changes

Merge branch 'main' into mafredri/feat-add-workspace-and-agent-health

f51c061

mafredri merged commit b73f9d8 into main Jul 10, 2023

mafredri deleted the mafredri/feat-add-workspace-and-agent-health branch July 10, 2023 09:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add computed workspace and agent health fields to the api #8280

feat: add computed workspace and agent health fields to the api #8280

Uh oh!

mafredri commented Jun 30, 2023 •

edited

Loading

Uh oh!

mtojek left a comment

Uh oh!

Uh oh!

Uh oh!

mtojek Jul 3, 2023

Uh oh!

mafredri Jul 3, 2023

Uh oh!

Uh oh!

mafredri commented Jul 3, 2023

Uh oh!

mtojek commented Jul 3, 2023

Uh oh!

mafredri commented Jul 3, 2023 •

edited

Loading

Uh oh!

mtojek commented Jul 3, 2023

Uh oh!

mafredri commented Jul 4, 2023 •

edited

Loading

Uh oh!

mtojek commented Jul 4, 2023

Uh oh!

mtojek left a comment

Uh oh!

Uh oh!

Uh oh!

BrunoQuaresma left a comment

Uh oh!

mafredri commented Jul 10, 2023

Uh oh!

Uh oh!

feat: add computed workspace and agent health fields to the api #8280

feat: add computed workspace and agent health fields to the api #8280

Uh oh!

Conversation

mafredri commented Jun 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mtojek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mtojek Jul 3, 2023

Choose a reason for hiding this comment

Uh oh!

mafredri Jul 3, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mafredri commented Jul 3, 2023

Uh oh!

mtojek commented Jul 3, 2023

Uh oh!

mafredri commented Jul 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mtojek commented Jul 3, 2023

Uh oh!

mafredri commented Jul 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mtojek commented Jul 4, 2023

Uh oh!

mtojek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

BrunoQuaresma left a comment

Choose a reason for hiding this comment

Uh oh!

mafredri commented Jul 10, 2023

Uh oh!

Uh oh!

mafredri commented Jun 30, 2023 •

edited

Loading

mafredri commented Jul 3, 2023 •

edited

Loading

mafredri commented Jul 4, 2023 •

edited

Loading