Skip to content

feat: add computed workspace and agent health fields to the api #8280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 10, 2023

Conversation

mafredri
Copy link
Member

@mafredri mafredri commented Jun 30, 2023

This PR adds support for workspace and agent health fields.

See: #8280 (comment)

Previous iterations Example output for a workspace:
{
    // ...
    "health": {
        "healthy": false,
        "failing_sections": [
            "agents.dbd7447f-e4e2-4d9f-92a2-136f84df4b2f"
        ],
        "agents": {
            "dbd7447f-e4e2-4d9f-92a2-136f84df4b2f": {
                "healthy": false,
                "reason": "agent startup script exited with an error"
            }
        }
    }
}

Update: We now only have a health field for agents to keep it simple:

{
	// ...
	"health": {
        "healthy": false,
        "reason": "agent startup script exited with an error"
    }
}

To check on workspace health, one can do const unhealthyAgents = workspace.latest_build.resources.map(r => r.agents).flat().filter(a => !a.health.healhty)

Refs #6461

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not see the full picture, but if we want to KISS, why just not add 2 properties for every agent:

"health": {
    "healthy": false,
    "reason": "Foobar..."
}

I have a feeling that failing_sections seems to be an overkill for what we're trying to achieve.

@@ -1262,6 +1262,24 @@ func convertWorkspaceAgent(derpMap *tailcfg.DERPMap, coordinator tailnet.Coordin
workspaceAgent.ReadyAt = &dbAgent.ReadyAt.Time
}

switch {
case workspaceAgent.Status != codersdk.WorkspaceAgentConnected && workspaceAgent.LifecycleState == codersdk.WorkspaceAgentLifecycleOff:
workspaceAgent.Health.Reason = "agent is not running"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be not in the loop, but what is the reason for returning the human-readable reason over API? Did you consider putting this mapping in the site/UI code?

Let's say that we want the site/UI to take an action on agent is taking too long to connect, does it mean that we have to compare strings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid concern. I suppose this method allows us to skip the introduction of a new enum and simply surface the reason in the UI. This is similar to how we do with API errors currently where we have a human readable reason in the HTTP response.

If we make this an enum, it needs to be kept up-to-date over UI, CLI, etc.

@mafredri
Copy link
Member Author

mafredri commented Jul 3, 2023

I may not see the full picture, but if we want to KISS, why just not add 2 properties for every agent:

"health": {
    "healthy": false,
    "reason": "Foobar..."
}

We've done that in this PR, except we also have the property on the workspace (which is currently only based on the sum of the agent healths).

I have a feeling that failing_sections seems to be an overkill for what we're trying to achieve.

I somewhat agree since currently we only rely on agent healths. However, out of a UI simplicity perspective, it's nice to be able to check one property on the workspace instead of diving into all the agents to figure out where the issue is. This field could easily (and probably should?) be expanded to include provisioning as well.

To illustrate the use-case, let's say we add provisioning to health and if it fails we have failing_sections = ['provision'] and { provision: { healthy: false, reason: "error from terraform" }. Now the UI can list workspaces and simply do if (!workspace.health.healthy) { for (let section of workspace.health.failing_sections) { show(workspace.health[section].reason) } }. This is simplified doesn't work as-is on the agents map (agents.uuid key) without extra processing, but hopefully illustrates the use-case.

The naming here comes mainly from trying to match the debug endpoint we have, as mentioned in #6461 (comment).

@mtojek
Copy link
Member

mtojek commented Jul 3, 2023

it's nice to be able to check one property on the workspace instead of diving into all the agents to figure out where the issue is.

Let me challenge this.

Coder workspaces use at most 1 (maybe 2?) agent per workspace. It does not require complex UI processing on the frontend side. Also, I would consider deferring failing_sections for later until we have reasons to add the property. I would not build a super-flexible endpoint if we don't have plans to use the field. More fields = more maintenance.

This is simplified doesn't work as-is on the agents map (agents.uuid key) without extra processing, but hopefully illustrates the use-case.

I would not drive the API design with front-end shortcuts. What is important for a good API design is making endpoints concise and idempotent. IMHOfailing_sections is definitely redundant, and I would rather treat it as a field for future optimizations.

@mafredri
Copy link
Member Author

mafredri commented Jul 3, 2023

I would not drive the API design with front-end shortcuts. What is important for a good API design is making endpoints concise and idempotent. IMHOfailing_sections is definitely redundant, and I would rather treat it as a field for future optimizations.

I do see your point, however, I would rather equate this with restful APIs where e.g. the URL to the "next page of results" is part of the response. Likewise this failing_sections field is pointing towards what caused the workspace to become unhealthy.

That said, I'm fine with removing the sections feature. The tradeoff is that if we do add more reasons for unhealthiness, we'll need to find and update all the places that only took agents into consideration (for better or worse).

@mtojek
Copy link
Member

mtojek commented Jul 3, 2023

That said, I'm fine with removing the sections feature. The tradeoff is that if we do add more reasons for unhealthiness, we'll need to find and update all the places that only took agents into consideration (for better or worse).

Thanks for addressing the issue. I think that this form is still open for modifications unless we predict other health factors to consider?

Anyway, as chatted with you over slack, I don't intend to slow down this feature delivery, so if more voices are for keeping the failing_sections metadata, I will not block this PR. I just don't see this to be super-complicated now, so that we need an extra helper field to "calculate" workspace health.

@mafredri
Copy link
Member Author

mafredri commented Jul 4, 2023

@mtojek thinking about this some more, how do you feel about the following format for the API response?

{
    // ...
    "health": {
        "healthy": false,
        "failing_agents": ["dbd7447f-e4e2-4d9f-92a2-136f84df4b2f"]
    },
    "resources": [
        {
            // ...
            "agents": [
                {
                    "id": "dbd7447f-e4e2-4d9f-92a2-136f84df4b2f",
                    // ...
                    "health": {
                        "healthy": false,
                        "reason": "timeout|disconnected|start_error|..."
                    }
                }
            ]
        }
    ]
}

We've eliminated the duplication but kept the response informative. This can't be extended in a generic way in the future, so if we want more knobs for health (like workspace build failures, etc.), we need to refactor the code that uses it. That should be fairly simple, though.

From a client/UI perspective this is a bit harder to use than duplicating the information in the workspace health, but if we're fine with that, we can use this approach.

Whether or not reason should be human readable or an enum (enum of the "bad" states for status|lifecycle_state), I'll leave up to @BrunoQuaresma and @bpmct.

PS. Maybe we should make health a property of workspace build (latest_build) instead of the workspace itself? Not sure... conceptually both work.

@mtojek
Copy link
Member

mtojek commented Jul 4, 2023

I like this approach more. It is intuitive, and a machine reader won't be required to process the details of the agent structure. In theory, we don't expect our APIs to be human-readable, but I don't know if this requirement changes in the future.

PS. Maybe we should make health a property of workspace build (latest_build) instead of the workspace itself? Not sure... conceptually both work.

Workspace build is a process of provisioning the workspace, so it can't expose any health metrics by its nature 🤔. I would rather stick to the workspace health.

@mafredri mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch 2 times, most recently from 0b74ac8 to c09d750 Compare July 4, 2023 12:38
@mafredri mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from c09d750 to 9774488 Compare July 4, 2023 12:47
@mafredri mafredri marked this pull request as ready for review July 4, 2023 13:04
Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 👍

@mafredri mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from 802801e to 36ec2ba Compare July 4, 2023 16:25
@mafredri mafredri force-pushed the mafredri/feat-add-workspace-and-agent-health branch from 36ec2ba to fba12e2 Compare July 4, 2023 16:28
@mtojek mtojek closed this in #8309 Jul 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jul 6, 2023
@mafredri mafredri reopened this Jul 6, 2023
Copy link
Collaborator

@BrunoQuaresma BrunoQuaresma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, we can use ENUM or human-readable messages. I don't have thoughts about this... I already worked with both and never saw a HUGE difference in development or maintainability in the long term so whatever is best for the BE.

@mafredri
Copy link
Member Author

LGTM, we can use ENUM or human-readable messages. I don't have thoughts about this... I already worked with both and never saw a HUGE difference in development or maintainability in the long term so whatever is best for the BE.

Alright, let's go with this format for now 👍🏻. If we want an enum instead, we can introduce it later or add a second field (reason_code, etc.).

@mafredri mafredri merged commit b73f9d8 into main Jul 10, 2023
@mafredri mafredri deleted the mafredri/feat-add-workspace-and-agent-health branch July 10, 2023 09:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants