Skip to content

fix(coderd): workspaceapps: update last_used_at when workspace app reports stats #11603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 16, 2024

Conversation

johnstcn
Copy link
Member

@johnstcn johnstcn commented Jan 12, 2024

Fixes #11509

  • Adds a new query BatchUpdateLastUsedAt
  • Adds calls to BatchUpdateLastUsedAt in app stats handler upon flush
  • Passes a stats flush chan to apptest setup scaffolding

This is not the 'correct' solution, but it puts a decent enough bandage on the problem until we figure out a more unified solution to this issue of updating the "Last Used At" field of workspaces based on certain events. I'll be opening a follow-up PR for this.

Now when the workspace apps stats collector flushes a batch of stats, we bump LastUsedAt for all affected workspaces.

Note: I'm just updating LastUsedAt to the same value for all workspaces. I don't know if there's a good way to insert a whole bunch of distinct values for a number of rows in a single transaction, and I want to keep this to a single query if possible.

I've verified that this works experimentally but this is the sort of thing that should be ossified in tests. I'm currently putting this into TestWorkspaceApps as it seems to be the place where all the testing of proxying flows happens.

Also note: the tests are sadly a bit racy; I've done what I can for the moment but I may need to spend some follow-up time refactoring.

@johnstcn johnstcn self-assigned this Jan 12, 2024
@johnstcn johnstcn force-pushed the cj/update-workspace-lastusedat-proxy branch 2 times, most recently from 2d44bf4 to 5b799e8 Compare January 15, 2024 17:16
@johnstcn johnstcn changed the title [WIP] fix(coderd): workspaceapps: update last_used_at when workspace app reports stats fix(coderd): workspaceapps: update last_used_at when workspace app reports stats Jan 15, 2024
@johnstcn johnstcn marked this pull request as ready for review January 15, 2024 17:20
@johnstcn johnstcn requested review from mafredri and mtojek January 15, 2024 17:20
Comment on lines +205 to +206
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: leaving this here to address in a follow-up. As this is now tied to stats collection, we bump whenever there are app usage stats for a workspace.

Comment on lines +221 to +222
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: follow-up

Comment on lines +240 to +241
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: follow-up

Comment on lines +349 to +350
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: follow-up

@@ -375,6 +396,7 @@ func Run(t *testing.T, appHostIsPrimary bool, factory DeploymentFactory) {
// TODO(@deansheather): This should be 400. There's a todo in the
// resolve request code to fix this.
require.Equal(t, http.StatusInternalServerError, resp.StatusCode)
assertWorkspaceLastUsedAtUpdated(t, appDetails)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: I'm counting this as a successful and valid attempt to access a workspace.

Comment on lines 1568 to 1591
// Accessing an app should update the workspace's LastUsedAt.
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtUpdated(t testing.TB, details *Details) {
t.Helper()

<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
details.FlushStats()
// Wait for stats to fully flush.
require.Eventually(t, func() bool {
ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
assert.NoError(t, err)
return ws.LastUsedAt.After(details.Workspace.LastUsedAt)
}, testutil.WaitShort, testutil.IntervalMedium, "workspace LastUsedAt not updated when it should have been")
}

// Except when it sometimes shouldn't (e.g. no access)
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtNotUpdated(t testing.TB, details *Details) {
t.Helper()

<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
details.FlushStats()
<-time.After(testutil.IntervalMedium) // Wait for stats to fully flush.
ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
require.NoError(t, err)
require.Equal(t, ws.LastUsedAt, details.Workspace.LastUsedAt, "workspace LastUsedAt updated when it should not have been")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: this is where all the race gremlins live:

  • Bicopy does not appear to exit immediately so if we flush stats immediately we may not get a finished session.
  • Even though a completed flush should mean that LastUsedAt has been bumped, I've observed some wiggle room (especially when wsproxy is involved).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we try looping the flush until we either time out or have the stats we're looking for? Might make sense to have FlushStats take a context too, just to be sure.

Comment on lines 445 to 450
proxyFlushCh := make(chan chan<- struct{})
flushStats := func() {
proxyFlushDone := make(chan struct{})
proxyFlushCh <- proxyFlushDone
<-proxyFlushDone
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: there should be no need to flush the stats on the primary here, as the proxy sends its stats directly to the primary endpoint which is wired up directly to the Reporter, which just inserts immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but without a flush, we can't ensure that Report() is being called (i.e. flushing the in-memory stats to coderd via post), right? So I think it may be needed here too, depending on what we're looking to achieve.

Copy link
Member Author

@johnstcn johnstcn Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading, any POSTS to /app-stats go straight to api.WorkspaceAppsStatsCollectorOptions.Reporter, and the wsproxy app stats reporter just hits that endpoint on flush. The stats reporter for the primary goes straight to the DB as well as far as I can tell.

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left few nit-picks, and I'm looking forward to follow-ups!

"nhooyr.io/websocket"

"cdr.dev/slog"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it expected or is the linter impaired?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be my linter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).

@johnstcn johnstcn requested a review from spikecurtis January 16, 2024 09:12
// temporary map to avoid O(q.workspaces*arg.workspaceIds)
m := make(map[uuid.UUID]struct{})
for _, id := range arg.IDs {
m[id] = struct{}{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be before the mutex lock, but it's such a minor perf change not sure it's worth changing.

n++
}
if n == 0 {
return sql.ErrNoRows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for err no rows here, an update never returns this unless RETURNING * is used.

"nhooyr.io/websocket"

"cdr.dev/slog"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).

}

if err := tx.InsertWorkspaceAppStats(ctx, batch); err != nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could make a case for us trying to do the last used update even if this fails, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inserting app stats is just doing a big insert into a single table (which IIRC is unlogged), so if we run into issues doing that my gut tells me that any further database queries might not be successful. Might not hurt to try, but I'm not sure how much the extra complexity would be worth it. Bear in mind that we will end up just trying to do this again in another 30 seconds!

uniqueIDs := slice.Unique(batch.WorkspaceID)
if err := tx.BatchUpdateWorkspaceLastUsedAt(ctx, database.BatchUpdateWorkspaceLastUsedAtParams{
IDs: uniqueIDs,
LastUsedAt: dbtime.Now(), // This isn't 100% accurate, but it's good enough.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it's close enough. We could ensure we use max time from the stats if we want slightly more accuracy (still off for some workspaces, though).

Comment on lines 445 to 450
proxyFlushCh := make(chan chan<- struct{})
flushStats := func() {
proxyFlushDone := make(chan struct{})
proxyFlushCh <- proxyFlushDone
<-proxyFlushDone
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but without a flush, we can't ensure that Report() is being called (i.e. flushing the in-memory stats to coderd via post), right? So I think it may be needed here too, depending on what we're looking to achieve.

@@ -442,6 +442,13 @@ func TestWorkspaceProxyWorkspaceApps(t *testing.T) {
"*",
}

proxyFlushCh := make(chan chan<- struct{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming these stats specific might keep things more clear.

Comment on lines 1568 to 1591
// Accessing an app should update the workspace's LastUsedAt.
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtUpdated(t testing.TB, details *Details) {
t.Helper()

<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
details.FlushStats()
// Wait for stats to fully flush.
require.Eventually(t, func() bool {
ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
assert.NoError(t, err)
return ws.LastUsedAt.After(details.Workspace.LastUsedAt)
}, testutil.WaitShort, testutil.IntervalMedium, "workspace LastUsedAt not updated when it should have been")
}

// Except when it sometimes shouldn't (e.g. no access)
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtNotUpdated(t testing.TB, details *Details) {
t.Helper()

<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
details.FlushStats()
<-time.After(testutil.IntervalMedium) // Wait for stats to fully flush.
ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
require.NoError(t, err)
require.Equal(t, ws.LastUsedAt, details.Workspace.LastUsedAt, "workspace LastUsedAt updated when it should not have been")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we try looping the flush until we either time out or have the stats we're looking for? Might make sense to have FlushStats take a context too, just to be sure.

t.Helper()

<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
details.FlushStats()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, looping flush seems better than waiting an arbitrary time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A loop-flush here would probably just execute after the first iteration.

@johnstcn
Copy link
Member Author

Filed #11643 for following up on unexpected stats inserts.

@johnstcn johnstcn merged commit d583aca into main Jan 16, 2024
@johnstcn johnstcn deleted the cj/update-workspace-lastusedat-proxy branch January 16, 2024 14:06
@github-actions github-actions bot locked and limited conversation to collaborators Jan 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: workspace.last_used_at not updated when accessing coder_app
3 participants