"Running" on workspace list isn't helpful if the agent is disconnected #6461

Closed
bpmct opened this issue Mar 6, 2023 · 15 comments · Fixed by #8548

Comments

@bpmct
Member

bpmct commented Mar 6, 2023

Could we instead have a combined state that shows whether the agent is connected, whether it's still running the startup script, whether it's reachable/unreachable, etc.?

@mafredri
Member

mafredri commented Mar 7, 2023

I'm thinking about how we should represent this in the UI/CLI. What makes most sense to me is to have a workspace health field since we still want to differentiate between a started/stopped state. The two columns could essentially be combined into one since health for a stopped workspace doesn't make much sense, but this should illustrate the general idea.

Before:
image

After:
image

This is a bit trickier to express in the CLI, since we don't have popups to give extra information.

Perhaps we can simply add a new column:

❯ coder list
WORKSPACE      TEMPLATE  STATUS   HEALTH     LAST BUILT  OUTDATED  STARTS AT                         STOPS AFTER
mafredri/test  docker    Stopped  N/A        21d23h      true      -                                 8h
mafredri/work  docker    Started  Unhealthy  1d          false     9:30AM Sun-Sat (Europe/Helsinki)  14h
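
A minimal sketch of how the proposed column could be filled in, assuming a hypothetical `health_column` helper and a boolean `healthy` flag; neither is part of the existing CLI. It only encodes the rule from the mock-up above: health is meaningless for a stopped workspace, so that row shows N/A.

```python
from typing import Optional


def health_column(status: str, healthy: Optional[bool]) -> str:
    """Return the text for the hypothetical HEALTH column of `coder list`."""
    # Health only makes sense while the workspace is started.
    if status != "Started" or healthy is None:
        return "N/A"
    return "Healthy" if healthy else "Unhealthy"


# Rows mirror the mock-up: (workspace, status, healthy flag).
rows = [
    ("mafredri/test", "Stopped", None),
    ("mafredri/work", "Started", False),
]
for name, status, healthy in rows:
    print(f"{name:<15}{status:<9}{health_column(status, healthy)}")
```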

The user can then use show to find out why it's unhealthy:

❯ coder show work
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│ RESOURCE                    STATUS       LIFECYCLE       VERSION          ACCESS             │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_container.workspace                                                                   │
│ └─ main (linux, amd64)      ⦿ connected  x startup_failed  v0.18.1+e3a4861   coder ssh work  │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_image.workspace                                                                       │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_image.base                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_image.go                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_image.template                                                                        │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│ docker_volume.home_volume                                                                    │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

@bpmct
Member Author

bpmct commented Mar 13, 2023

I'm definitely in favor of more of a single lifecycle state being displayed to users

@BrunoQuaresma
Collaborator

We could do the following:

  • Show a green badge with the label "Healthy" when the workspace is ok and all the agents are ok too.
  • Show a red or orange badge with the label "Unhealthy" when the workspace is not ok or the agents are not running. When the user hovers over the badge we display a popover showing the agents and their statuses. We could also add a message like "Looks like the workspace is not running or its agents are failing. You may solve this by restarting the workspace."
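
The two bullets above can be sketched as a small data model. This is only an illustration: `Badge`, `build_badge`, and the choice to treat the status string "connected" as "ok" are assumptions, not the actual frontend code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Badge:
    label: str
    color: str
    popover: List[str] = field(default_factory=list)  # one line per agent
    message: str = ""


def build_badge(workspace_ok: bool, agents: Dict[str, str]) -> Badge:
    """Build the badge; `agents` maps a hypothetical agent name to its status."""
    # Assumption: an agent counts as "ok" only when its status is "connected".
    all_agents_ok = all(status == "connected" for status in agents.values())
    if workspace_ok and all_agents_ok:
        return Badge(label="Healthy", color="green")
    return Badge(
        label="Unhealthy",
        color="red",
        popover=[f"{name}: {status}" for name, status in agents.items()],
        message="Looks like the workspace is not running or its agents are "
                "failing. You may solve this by restarting the workspace.",
    )
```

Note that with this rule a workspace without any agents is healthy whenever the workspace itself is ok; whether that is desirable is left open here.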

@bpmct
Member Author

bpmct commented Mar 14, 2023

When the user hovers over the badge we display a popover showing the agents and their statuses. We could also add a message like "Looks like the workspace is not running or its agents are failing. You may solve this by restarting the workspace."

I'm not a fan of hovering to see additional error details, as not every user will intuitively hover. For example, the agent troubleshooting URL is hard to find. On the workspace list page, that is fair, but I think on the page itself we should promote restarting with a big button.

@BrunoQuaresma
Collaborator

@mafredri it looks like an FE task. Is there anything you want to do in the BE? Or are you interested in taking the FE as well?

@BrunoQuaresma
Collaborator

BrunoQuaresma commented Jun 21, 2023

I think the task #6462 is related to it as well

@mafredri
Member

it looks like an FE task. Is there anything you want to do in the BE? Or are you interested in taking the FE as well?

@BrunoQuaresma happy to defer FE to you. As for the BE, I think potential requirements may be easier to identify after we have an implementation of this. For example, do we want a single field on the agent that says health: healthy/unhealthy/etc. to simplify the FE logic?

@BrunoQuaresma
Collaborator

@mafredri this would be great!

@matifali matifali added the s2 Broken use cases or features (with a workaround). Only humans may set this. label Jun 26, 2023
@mafredri
Member

Alright @BrunoQuaresma and @bpmct, we can add a health field to the agent, and another one to the workspace (which is a summary of the agent health fields).

We could start off with a simple enum: healthy and unhealthy and then add more states as we need them.
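
As a rough sketch of that idea (the names `Health` and `workspace_health` are hypothetical, not the eventual API), the workspace value could simply be a summary of the per-agent values:

```python
from enum import Enum
from typing import Iterable


class Health(str, Enum):
    # Start with two values; more can be added later if needed.
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def workspace_health(agent_healths: Iterable[Health]) -> Health:
    """Summarize agent health: one unhealthy agent makes the workspace unhealthy."""
    if all(h is Health.HEALTHY for h in agent_healths):
        return Health.HEALTHY
    return Health.UNHEALTHY
```

For example, `workspace_health([Health.HEALTHY, Health.UNHEALTHY])` evaluates to `Health.UNHEALTHY`.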

The only question in my mind is, how can the UI help the user understand why a workspace is unhealthy based on the new health field? Should we add more values to the enum or would we just end up reimplementing the status/lifecycle state fields? Then again @BrunoQuaresma's proposal is good in that this doesn't matter and the simple state is enough.

To document my thought process, and possibly prompt some ideas from all of you, here are the (currently possible) enumerated states of an agent (status:lifecycle state):

connecting:created
timeout:created
connected:created
connected:starting
(connected:start_timeout)
connected:start_error
connected:ready
connected:shutting_down
(connected:shutdown_timeout)
connected:shutdown_error
connected:off
disconnected:created
disconnected:starting
(disconnected:start_timeout)
disconnected:start_error
disconnected:ready
disconnected:shutting_down
(disconnected:shutdown_timeout)
disconnected:shutdown_error
disconnected:off

States that I would consider healthy:

connecting:created
connected:created
connected:starting
connected:ready

There's one state representing a timeout or an unknown state; we don't know if it's healthy or unhealthy, nor do we know if it's ever going to be:

timeout:created

There are two states representing soft timeouts; these could be considered either healthy or unhealthy:

(connected:start_timeout)
(connected:shutdown_timeout)

The rest could all be considered unhealthy.

If we ever want to support restarting agents on request, even these states could be considered healthy:

connected:shutting_down
connected:off
disconnected:off
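
The grouping above can be written down as a lookup. This is a sketch under the assumption that states are passed as "status:lifecycle" strings exactly as listed; `classify` and the category names are made up for illustration, and the "restartable" states are left as unhealthy because restarting agents on request is not supported yet.

```python
# States considered healthy in the comment above.
HEALTHY = {
    ("connecting", "created"),
    ("connected", "created"),
    ("connected", "starting"),
    ("connected", "ready"),
}
# The agent never connected; health cannot be determined.
UNKNOWN = {("timeout", "created")}
# Soft timeouts; could be treated as either healthy or unhealthy.
SOFT_TIMEOUT = {
    ("connected", "start_timeout"),
    ("connected", "shutdown_timeout"),
}


def classify(state: str) -> str:
    """Classify a "status:lifecycle" pair such as "connected:ready"."""
    status, _, lifecycle = state.partition(":")
    pair = (status, lifecycle)
    if pair in HEALTHY:
        return "healthy"
    if pair in UNKNOWN:
        return "unknown"
    if pair in SOFT_TIMEOUT:
        return "soft_timeout"
    # "The rest could all be considered unhealthy."
    return "unhealthy"
```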

@bpmct
Member Author

bpmct commented Jun 29, 2023

I don't think we should add more values to the enum. Displaying healthy/unhealthy will be easier for the user to understand IMO.

However, I think we could surface the "reason" as a different property in the API request, perhaps one that shows up when you hover over "unhealthy." I think https://dev.coder.com/api/v2/debug/health has a good design for this.

I assume we're not storing health information in the DB, right? Just computing it per-request based on workspace/agent state?
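
A minimal sketch of what "computed per request, with a separate reason" could look like. `AgentHealth` and `agent_health` are hypothetical names, and the rules are deliberately simplified rather than a full classification of every agent state:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentHealth:
    healthy: bool
    reason: Optional[str] = None  # only set when the agent is unhealthy


def agent_health(status: str, lifecycle: str) -> AgentHealth:
    """Derive health on demand from the agent's current state; nothing is stored."""
    # Simplified rule: a disconnected or timed-out agent is unhealthy.
    if status in ("disconnected", "timeout"):
        return AgentHealth(False, f"agent status is {status}")
    # Simplified rule: a failed startup or shutdown script is unhealthy.
    if lifecycle in ("start_error", "shutdown_error"):
        return AgentHealth(False, f"agent lifecycle state is {lifecycle}")
    return AgentHealth(True)
```

Keeping `healthy` a plain boolean leaves the enum untouched, while `reason` gives a tooltip or `coder show` something to display.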

In unhealthy scenarios (e.g. agent disconnected), we should offer a button to restart the workspace.

Hope that helps!

@bpmct
Member Author

bpmct commented Jun 29, 2023

Also, I'm more than happy to help with the user-facing tooltip/coder show copy for each of those states you mentioned. Let me know what you think.

@mafredri
Member

I created a draft implementation of the health field, borrowing concepts from /debug/health, in #8280. Thoughts?

I still need to think some more about how this field will behave when e.g. the workspace is stopping. Stopping is technically healthy, but the workspace could be considered "unhealthy" from a usability perspective.

@mafredri
Member

@BrunoQuaresma I created a draft for this, but could use some guidance for the popover/tooltip: #8387. Also, does this align with your vision?

@mtojek
Member

mtojek commented Jul 17, 2023

Hey @mafredri and @BrunoQuaresma! I have just seen this issue, so I'm wondering if we can close this one. Otherwise, could you please document what is left?

@mafredri
Member

mafredri commented Jul 17, 2023

@mtojek I'd say we still want to improve coder list to show health status.

With coder list, we can't do the same as the WebUI (add an exclamation mark to status), so a separate healthy/unhealthy column might be ideal.

There's also coder show [workspace], but its whole UI needs to be changed (IMO). It's currently a "pretty table" which we've mostly moved away from, and I think it should show health and lifecycle state too. This could be a separate issue.

6 participants