Drill-down view: workspace network latency & disconnects #6724

Closed
mtojek opened this issue Mar 22, 2023 · 9 comments

@mtojek
Member

mtojek commented Mar 22, 2023

Add Prometheus metrics for network disconnects (with reason) and user latency in workspaces, with drill-down by workspace, IDE, and connection type.

  1. Add Prometheus metrics around network disconnects (incl. reason)
  2. Measure user latency, with drill-down by these dimensions:
     • workspace
     • IDE
     • connection type
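
A rough sketch of how the two requirements above could map onto labelled series, written in plain Python rather than taken from the Coder codebase. The metric names (`agent_disconnects_total`, `agent_connection_latency_seconds`), the label names, and the example label values are all hypothetical; the point is only that the disconnect reason and the drill-down dimensions become labels.

```python
from collections import defaultdict


class LabelledMetrics:
    """Tiny in-memory store that renders Prometheus-style text lines."""

    def __init__(self):
        # Key: (metric name, sorted tuple of (label, value) pairs).
        self.counters = defaultdict(float)
        self.gauges = {}

    @staticmethod
    def _key(name, labels):
        return name, tuple(sorted(labels.items()))

    def inc(self, name, **labels):
        """Count one event, e.g. a disconnect with a given reason."""
        self.counters[self._key(name, labels)] += 1

    def set_gauge(self, name, value, **labels):
        """Record the latest observed value, e.g. a latency sample."""
        self.gauges[self._key(name, labels)] = value

    def render(self):
        """Return one 'name{labels} value' line per series, sorted by name."""
        lines = []
        merged = {**self.counters, **self.gauges}
        for (name, labels), value in sorted(merged.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        return "\n".join(lines)


# Hypothetical usage: one disconnect and one latency sample.
metrics = LabelledMetrics()
metrics.inc("agent_disconnects_total", workspace="dev", reason="timeout")
metrics.set_gauge(
    "agent_connection_latency_seconds",
    0.042,
    workspace="dev",
    ide="vscode",
    connection_type="derp",
)
print(metrics.render())
```

Keeping the reason and the dimensions as labels means a single metric can be filtered or grouped per workspace, IDE, or connection type at query time, at the cost of higher series cardinality.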
@mtojek mtojek added this to the Coder Operability milestone Mar 22, 2023
@mtojek mtojek self-assigned this Mar 29, 2023
@mtojek
Member Author

mtojek commented Apr 3, 2023

I'm going with exposing agent metrics via Prometheus endpoint.

@bpmct bpmct changed the title Drill-down view: workspace network latency Drill-down view: workspace network latency & dosconnects Apr 3, 2023
@mtojek mtojek changed the title Drill-down view: workspace network latency & dosconnects Drill-down view: workspace network latency & disconnects Apr 4, 2023
@mtojek
Member Author

mtojek commented Apr 17, 2023

Status update:

Agent connection, latency, and session stats are now exposed via Prometheus. The next step is to collect and expose details related to agent timeouts.

@mtojek
Member Author

mtojek commented Apr 24, 2023

Plan for this week:

Implement a metrics collector in coderd to aggregate metrics from agents. Let's collect magicsock metrics first.

  • coderd: adjust /report-stats API endpoint to receive metrics
  • agent: send magicsock metrics
  • coderd: save metrics to Prometheus registry
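
The three steps above could be sketched roughly as follows. This is an illustration in plain Python, not the actual coderd code: the payload shape, the `merge_agent_report` helper, the agent IDs, and the metric name are assumptions made for the example, and the real /report-stats schema may look quite different.

```python
import json


def merge_agent_report(registry, agent_id, payload):
    """Merge one agent's reported metrics into a shared registry.

    registry: dict mapping (metric name, agent id) -> latest value
    payload:  JSON text of the assumed form
              {"metrics": [{"name": "...", "value": 1}]}
    """
    report = json.loads(payload)
    for metric in report.get("metrics", []):
        # Keep only the most recent value reported by each agent.
        registry[(metric["name"], agent_id)] = float(metric["value"])
    return registry


# Hypothetical usage: two agents report the same magicsock-style metric.
registry = {}
merge_agent_report(
    registry,
    "agent-1",
    '{"metrics": [{"name": "example_magicsock_sends", "value": 3}]}',
)
merge_agent_report(
    registry,
    "agent-2",
    '{"metrics": [{"name": "example_magicsock_sends", "value": 7}]}',
)
for (name, agent_id), value in sorted(registry.items()):
    print(f'{name}{{agent_id="{agent_id}"}} {value}')
```

Keying the registry by metric name and agent keeps each agent's series separate, so a stale or disconnected agent can later be dropped without touching the others.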

@mtojek
Member Author

mtojek commented Apr 27, 2023

Hey @mafredri! Did you identify any places in the code where we could inject extra metrics or counters to make agent timeouts easier to debug? I'm wondering if there are any gaps that could be addressed here.

@mafredri
Member

@mtojek not really. I still haven’t managed to isolate the problem and can’t really say what metric would help. That said, agent metrics are only one side of the coin; knowing what the client sees could help in such situations. Something I’m doing now is adding logging on the client side. But we probably shouldn’t be sending client metrics to the server, at least not normally.

@mtojek
Member Author

mtojek commented Apr 28, 2023

There is a significant problem with logging on the agent side: admins have to ask users to copy, or at least review, their logs for issues. We definitely need something centralized, ideally a sink for logs, but I wouldn't mind some extra metrics, even vague ones like agent_vscode_network_timeout or agent_ssh_bad_peer_certificate.

@mtojek
Member Author

mtojek commented May 17, 2023

Battle plan:

@matifali
Member

matifali commented May 18, 2023

Related: #4680

@mtojek
Member Author

mtojek commented May 25, 2023

I'm going to detach #7581 from the plan and keep it as a separate issue, so I'm resolving this one.

@mtojek mtojek closed this as completed May 25, 2023