fix(cli): port-forward: update workspace last_used_at #12659

Merged: 21 commits merged into main from cj/cli-workspace-heartbeat on Mar 20, 2024

Conversation

@johnstcn (Member) commented Mar 19, 2024:

Fixes #12431

For the CLI to report workspace usage, we currently depend on the agent sending a stats report with a non-zero connection count; this is sourced from SetConnStatsCallback (tailnet/conn.go:621).
The problem with this approach is that Tailscale's connection tracking appears to track only "connection established" events, and is not equivalent to running e.g. ss -plunt. This means:

  • Having an open port-forward on its own will not be detected as an 'active connection'
  • Having an open port-forward with a single long-lived connection (e.g. a websocket) will be counted as only one activity datapoint, and subsequent stats-collection intervals will ignore that still-active connection.

This PR updates the coder port-forward command to periodically inform coderd that the workspace is being used:

  • Adds workspaceusage.Tracker which periodically batch-updates workspace LastUsedAt (see the sketch after this list)
  • Adds coderd endpoint to signal workspace usage
  • Updates coder port-forward to periodically hit this endpoint
  • Modifies BatchUpdateWorkspacesLastUsedAt to avoid overwriting with stale data
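
As a minimal sketch of the mark-and-flush pattern (not the PR's exact code: batchUpdater stands in for coderd's database store, and the field names are illustrative assumptions), the tracker looks roughly like this:

package workspaceusage

import (
	"context"
	"sync"
	"time"

	"github.com/google/uuid"
)

// batchUpdater stands in for the real database store.
type batchUpdater interface {
	BatchUpdateWorkspacesLastUsedAt(ctx context.Context, ids []uuid.UUID, lastUsedAt time.Time) error
}

// Tracker accumulates "this workspace was used" marks in memory and
// flushes them to the database in a single batch on every tick.
type Tracker struct {
	mu     sync.Mutex
	ids    map[uuid.UUID]struct{} // marked since the last flush
	db     batchUpdater
	tickCh <-chan time.Time
	doneCh chan struct{}
}

// Add marks a workspace as used. It is cheap enough to call on every
// request that signals usage.
func (t *Tracker) Add(id uuid.UUID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ids[id] = struct{}{}
}

// flush writes all marked workspaces in one batch and resets the set.
// A failed batch is dropped rather than retried: workspaces still in
// use will simply be re-marked before the next tick.
func (t *Tracker) flush(ctx context.Context, now time.Time) {
	t.mu.Lock()
	ids := make([]uuid.UUID, 0, len(t.ids))
	for id := range t.ids {
		ids = append(ids, id)
	}
	t.ids = make(map[uuid.UUID]struct{})
	t.mu.Unlock()
	if len(ids) == 0 {
		return
	}
	// The real code logs this error instead of discarding it.
	_ = t.db.BatchUpdateWorkspacesLastUsedAt(ctx, ids, now)
}

// loop flushes on every tick until the tracker is closed.
func (t *Tracker) loop(ctx context.Context) {
	for {
		select {
		case <-t.doneCh:
			return
		case now := <-t.tickCh:
			t.flush(ctx, now)
		}
	}
}

The real BatchUpdateWorkspacesLastUsedAt query additionally guards against stale data, presumably by only updating rows whose last_used_at is older than the new timestamp.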

In later PRs, we can:

  • Update coder ssh to also use this behaviour,
  • Update the workspace activity tracking package to also call ActivityBumpWorkspace so that last_used_at and deadline are handled in the same place,
  • Remove the existing behaviour of updating the workspace when stats are received from the agent.

Follow-ups:

  • We may need to ensure that multiple port-forwards to the same workspace only result in one coder process on a user's workstation updating the workspace at a time. My assumption here is that most users will have at most one or two port-forwards running at a time.
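
For context, the client side of this is a per-process heartbeat loop along these lines (a rough sketch; the codersdk method name, its signature, and the interval are assumptions, not necessarily the PR's exact code):

// Inside `coder port-forward`, assuming ctx, client, workspace, and
// logger from the surrounding command.
ticker := time.NewTicker(time.Minute)
defer ticker.Stop()
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Tell coderd the workspace is in use; failures are logged
			// and the next tick simply tries again.
			if err := client.PostWorkspaceUsage(ctx, workspace.ID); err != nil {
				logger.Warn(ctx, "update workspace usage", slog.Error(err))
			}
		}
	}
}()

Each port-forward process would run its own loop, which is why deduplicating updates across processes is left as a follow-up.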

@johnstcn self-assigned this Mar 19, 2024
@johnstcn changed the title from "feat: tunnel: update workspace last_used_at" to "feat(cli): port-forward: update workspace last_used_at" Mar 19, 2024
@johnstcn force-pushed the cj/cli-workspace-heartbeat branch 2 times, most recently from 17cb933 to 2412fb5 March 19, 2024 13:20
@johnstcn changed the title from "feat(cli): port-forward: update workspace last_used_at" to "fix(cli): port-forward: update workspace last_used_at" Mar 19, 2024
@johnstcn force-pushed the cj/cli-workspace-heartbeat branch from 2412fb5 to c99327c March 19, 2024 15:16
@johnstcn marked this pull request as ready for review March 19, 2024 15:32
@dannykopping (Contributor) left a comment:

Mostly nits & small {que,sugge}stions
Very clean! LGTM

johnstcn and others added 2 commits March 20, 2024 09:10
Co-authored-by: Danny Kopping <danny@coder.com>
@mtojek (Member) left a comment:

High-level review first

	LastUsedAt: now,
	IDs:        ids,
}); err != nil {
	wut.log.Error(ctx, "failed updating workspaces last_used_at", slog.F("count", count), slog.Error(err))
Member:

Is there any retry mechanism if the batch fails?

Member Author (@johnstcn):

There is a built-in assumption here that any failure of a single batch can be ignored as it gets retried frequently. I suppose consecutive failures should surface some form of error somewhere. I wonder if the database health page could show a summary of queries that have failed multiple consecutive executions?

Member:

> There is a built-in assumption here that any failure of a single batch can be ignored as it gets retried frequently. I suppose consecutive failures should surface some form of error somewhere

I might have missed it, but it would be cool to leave a comment stating this.

Member Author (@johnstcn):

Done
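
(For the record, the documented assumption presumably boils down to something like the following; paraphrased, not the exact comment text from the PR:)

// A single failed flush is acceptable: usage is tracked continuously
// and flushed frequently, so the last_used_at bump is only delayed
// until a subsequent flush succeeds.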

@mtojek (Member) left a comment:

Nice!

@dannykopping (Contributor) left a comment:

Awesome 👍

db, ps = dbtestutil.NewDB(t)
wuTick = make(chan time.Time)
wuFlush = make(chan int, 1)
wut = workspaceusage.New(db, workspaceusage.WithFlushChannel(wuFlush), workspaceusage.WithTickChannel(wuTick))
Member:

jiffy

Member Author (@johnstcn):

wat

// WithFlushChannel allows passing a channel that receives
// the number of marked workspaces every time Tracker flushes.
// For testing only and will panic if used outside of tests.
func WithFlushChannel(c chan int) Option {
	if flag.Lookup("test.v") == nil {
Member:

Nice strictness check! Feels like this could be a testutil func, even if obvious.
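
Something like this hypothetical testutil helper would capture it (the name and location are assumptions):

// MustBeInTest panics unless the binary is running under `go test`.
// The testing package registers the -test.v flag, so its absence
// means we are in a production binary.
func MustBeInTest(name string) {
	if flag.Lookup("test.v") == nil {
		panic("developer error: " + name + " must only be used in tests")
	}
}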

// Calling Loop after Close() is an error.
select {
case <-wut.doneCh:
	panic("developer error: Loop called after Close")
Member:

If we need to do this instead of just returning, I'd prefer returning an error here; I wouldn't want this to panic in a production instance after some future change to the code.

Feels like this mostly exists for tests (otherwise it could be launched from New), so all the more reason not to add the possibility of a panic.

Member Author (@johnstcn):

If we return an error, then we have two options:

  1. Exit with the error, which brings us back to potentially impacting prod, but just in a nicer-looking way.
  2. Just log the error and do nothing, which means the error will likely pass unnoticed until folks wonder why their workspace usage isn't being tracked.

I don't think 2) is a valid option. We're not guaranteed to catch it in tests, especially given the possibility of ignoring errors in slogtest.

There's an alternative possibility 1a) where we check the error and recreate the tracker if non-nil. WDYT?

Member Author (@johnstcn):

I forgot about option 3): have New start the loop and unexport the function.
Now it's un-possible!
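
For reference, option 3) amounts to having New own the goroutine, adapting the earlier sketch (field names and the default interval are still assumptions, not the PR's exact code):

func New(db batchUpdater, opts ...Option) *Tracker {
	t := &Tracker{
		ids:    make(map[uuid.UUID]struct{}),
		db:     db,
		doneCh: make(chan struct{}),
		tickCh: time.NewTicker(time.Minute).C, // default; WithTickChannel overrides in tests
	}
	for _, opt := range opts {
		opt(t)
	}
	// loop is unexported and started exactly once, here; callers can
	// no longer call it at all, let alone after Close.
	go t.loop(context.Background())
	return t
}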

Comment on lines 312 to 329
// Did last_used_at not update? Scratching your noggin? Here's why.
// Workspace usage tracking must be triggered manually in tests.
// The vast majority of existing tests do not depend on last_used_at
// and adding an extra background goroutine to all existing tests may
// lead to future flakes and goleak complaints.
// To do this, pass in your own WorkspaceUsageTracker like so:
//
// db, ps = dbtestutil.NewDB(t)
// wuTick = make(chan time.Time)
// wuFlush = make(chan int, 1)
// wut = workspaceusage.New(db, workspaceusage.WithFlushChannel(wuFlush), workspaceusage.WithTickChannel(wuTick))
// client = coderdtest.New(t, &coderdtest.Options{
//   WorkspaceUsageTracker: wut,
//   Database:              db,
//   Pubsub:                ps,
// })
//
// See TestPortForward for how this works in practice.
Member:

Lol, love the personality in this comment 😄

@johnstcn merged commit 92aa1eb into main Mar 20, 2024
@johnstcn deleted the cj/cli-workspace-heartbeat branch March 20, 2024 16:44
github-actions bot locked and limited conversation to collaborators Mar 20, 2024
Successfully merging this pull request may close these issues.

coder tunnel does not update workspace last_used with long-lived connections