Skip to content

feat: add connection statistics for workspace agents #6469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Mar 9, 2023
Merged

Conversation

kylecarbs
Copy link
Member

@kylecarbs kylecarbs commented Mar 7, 2023

image

This adds a bar at the bottom of the dashboard (only visible to admins) with periodically updating statistics on workspaces.

After merge, I'll add warnings for the DERPForcedWebsocket property to indicate reduced throughput.

@kylecarbs kylecarbs requested review from ammario and bpmct March 7, 2023 01:42
@kylecarbs kylecarbs self-assigned this Mar 7, 2023
@kylecarbs kylecarbs requested a review from mafredri March 7, 2023 16:09
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a preliminary look at this, I'll finish my review tomorrow.

agent/agent.go Outdated
select {
case a.connStatsChan <- stats:
// Only store the latest stat when it's successfully sent!
// Otherwise, it should be sent again on the next iteration.
a.latestStat.Store(stats)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the previous comment about Tailscale resetting counts on every report, I'd think this current implementation will lose stats?

I imagine a more safe way to update the stats would be something along the lines of:

a.statMu.Lock()
a.stats.RxBytes += ...

select {
case a.connStatsChan <- a.stats: // note: a copy assuming basic struct
	// sent
default:
	// dropped this report
}
a.statMu.Unlock()

This way we're always incrementing the numbers, even from dropped reports.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, good point!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we'd be better off blocking instead of dropping? It seems like that's fine from the Tailscale side, and then we don't really run any risks here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to block instead. Let me know your thoughts! The loop retries anyways.

coderd/coderd.go Outdated
@@ -405,15 +408,16 @@ func New(options *Options) *API {
r.Post("/csp/reports", api.logReportCSPViolations)

r.Get("/buildinfo", buildInfo)
r.Route("/deployment", func(r chi.Router) {
r.Use(apiKeyMiddleware)
r.Get("/config", api.deploymentConfig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking change? /config/deployment => /deployment/config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this endpoint is only relied on in the dashboard, I wouldn't consider this breaking, but if you think it is I'm fine to put ! with it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong preference, just being cautious. 😄

@ammario ammario removed their request for review March 7, 2023 17:41
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look very closely at frontend, but backend looks mostly good. I still have some concerns about the agent stats reporting (see comments) but that's either a lack of my understanding or something we should fix, approving nonetheless.

@@ -406,7 +406,7 @@ func newConfig() *codersdk.DeploymentConfig {
Usage: "How frequently agent stats are recorded",
Flag: "agent-stats-refresh-interval",
Hidden: true,
Default: 10 * time.Minute,
Default: 30 * time.Second,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty big change, I think it's OK but increases spamminess somewhat.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely is, but I did some napkin math and I think it should be alright.

Even if a user has hundreds of workspaces, a few hundred more writes/minute shouldn't be a big deal. I suppose it might spam the logs, which I'll check and resolve before merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hate for all of coderd to be spammed with stat logs in a large deployment ;p

@@ -1270,10 +1267,16 @@ func (a *agent) startReportingConnectionStats(ctx context.Context) {
// Convert from microseconds to milliseconds.
stats.ConnectionMedianLatencyMS /= 1000

lastStat := a.latestStat.Load()
if lastStat != nil && reflect.DeepEqual(lastStat, stats) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this still confuses me a bit. If Tailscale stats aren't cumulative, isn't the only way this matches lastStat if there was no chatter (tx/rx), the latency and sessions for SSH/Code/JetBrains stayed the same?

Since we're also doing network pings in the latency check, I think there is a non-zero chance for multiple reportStats to be running concurrently, essentially competing about Load()/Store() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good points. I'll refactor this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this again, it seems like this should be fine.

This will only match if there's no traffic, but that's arguably great because then we aren't spamming the database with nonsense. I don't want to do this in coderd, because we'd need to query for the last stat to ensure it's not the same.

reportStats is blocking, and so subsequent agent stat refreshes will wait before running again, so I don't think they'd compete.

Let me know if I'm overlooking something or didn't understand properly, I'm sick and my brain is stuffy right now ;p

@@ -0,0 +1 @@
ALTER TABLE workspace_agent_stats ALTER COLUMN connection_median_latency_ms TYPE bigint;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work with non-empty data? You could consider adding two fixtures in testdata/fixtures (000106_pre_workspace_agent_stats_connection_latency.up.sql and 000107_post_workspace_agent_stats_connection_latency.up.sql). In the former you add a row with bigint value and in the latter you add a row with float value. If tests pass then all is good. 👍🏻

type DeploymentStats struct {
// AggregatedFrom is the time in which stats are aggregated from.
// This might be back in time a specific duration or interval.
AggregatedFrom time.Time `json:"aggregated_since" format:"date-time"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to keep json/field out of sync?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, just a mistake on my end. Good catch!

BuildingWorkspaces int64 `json:"building_workspaces"`
RunningWorkspaces int64 `json:"running_workspaces"`
FailedWorkspaces int64 `json:"failed_workspaces"`
StoppedWorkspaces int64 `json:"stopped_workspaces"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could utilize nesting, e.g. workspaces.pending, session_count.vscode, etc. Matter of preference, so dealers choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that's a bit easier to parse!

SessionCountReconnectingPTY int64 `json:"session_count_reconnecting_pty"`

WorkspaceRxBytes int64 `json:"workspace_rx_bytes"`
WorkspaceTxBytes int64 `json:"workspace_tx_bytes"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be under workspaces.rx_bytes. Singular vs plural (workspace, workspaces) above is a bit confusing currently.

@kylecarbs kylecarbs merged commit 5304b4e into main Mar 9, 2023
@kylecarbs kylecarbs deleted the exportstats branch March 9, 2023 03:05
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants