fix(coderd): workspaceapps: update last_used_at when workspace app reports stats #11603
Conversation
Force-pushed from 2d44bf4 to 5b799e8
```go
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: leaving this here to address in a follow-up. As this is now tied to stats collection, we bump whenever there are app usage stats for a workspace.
```go
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```go
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```go
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```diff
@@ -375,6 +396,7 @@ func Run(t *testing.T, appHostIsPrimary bool, factory DeploymentFactory) {
 	// TODO(@deansheather): This should be 400. There's a todo in the
 	// resolve request code to fix this.
 	require.Equal(t, http.StatusInternalServerError, resp.StatusCode)
+	assertWorkspaceLastUsedAtUpdated(t, appDetails)
```
review: I'm counting this as a successful and valid attempt to access a workspace.
```go
// Accessing an app should update the workspace's LastUsedAt.
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtUpdated(t testing.TB, details *Details) {
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
	// Wait for stats to fully flush.
	require.Eventually(t, func() bool {
		ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
		assert.NoError(t, err)
		return ws.LastUsedAt.After(details.Workspace.LastUsedAt)
	}, testutil.WaitShort, testutil.IntervalMedium, "workspace LastUsedAt not updated when it should have been")
}

// Except when it sometimes shouldn't (e.g. no access).
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtNotUpdated(t testing.TB, details *Details) {
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
	<-time.After(testutil.IntervalMedium) // Wait for stats to fully flush.
	ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
	require.NoError(t, err)
	require.Equal(t, ws.LastUsedAt, details.Workspace.LastUsedAt, "workspace LastUsedAt updated when it should not have been")
}
```
review: this is where all the race gremlins live:
- Bicopy does not appear to exit immediately, so if we flush stats right away we may not get a finished session.
- Even though a completed flush should mean that `LastUsedAt` has been bumped, I've observed some wiggle room (especially when wsproxy is involved).
Why don't we try looping the flush until we either time out or have the stats we're looking for? Might make sense to have `FlushStats` take a context too, just to be sure.
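A minimal sketch of what that looping flush could look like, assuming a hypothetical context-aware wrapper around the test harness's existing `FlushStats` and `Details` (the function name and loop shape are illustrative, not the PR's actual API):

```go
// flushUntilUpdated repeatedly triggers a stats flush and polls the workspace
// until LastUsedAt advances or the context expires. Hypothetical helper.
func flushUntilUpdated(ctx context.Context, details *Details) error {
	ticker := time.NewTicker(testutil.IntervalMedium)
	defer ticker.Stop()
	for {
		details.FlushStats() // trigger a flush on every iteration
		ws, err := details.SDKClient.Workspace(ctx, details.Workspace.ID)
		if err != nil {
			return err
		}
		if ws.LastUsedAt.After(details.Workspace.LastUsedAt) {
			return nil // the stats we're looking for have landed
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```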
enterprise/wsproxy/wsproxy_test.go (Outdated)
```go
proxyFlushCh := make(chan chan<- struct{})
flushStats := func() {
	proxyFlushDone := make(chan struct{})
	proxyFlushCh <- proxyFlushDone
	<-proxyFlushDone
}
```
review: there should be no need to flush the stats on the primary here, as the proxy sends its stats directly to the primary endpoint, which is wired up to the Reporter, which just inserts immediately.
Correct me if I'm wrong, but without a flush, we can't ensure that `Report()` is being called (i.e. flushing the in-memory stats to coderd via POST), right? So I think it may be needed here too, depending on what we're looking to achieve.
From my reading, any POSTs to `/app-stats` go straight to `api.WorkspaceAppsStatsCollectorOptions.Reporter`, and the wsproxy app stats reporter just hits that endpoint on flush. The stats reporter for the primary goes straight to the DB as well, as far as I can tell.
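For context, the flush-trigger pattern in the test snippet above relies on the collector draining the channel. A rough sketch of the receiving side, where `report` and `pending` are assumptions standing in for the real reporter call and in-memory stats, not the actual collector code:

```go
// Sketch of a collector loop supporting on-demand flushes: each value on
// proxyFlushCh is a "done" channel the test is blocked on; the loop reports
// whatever is pending, then closes it so the test's flushStats() unblocks.
for {
	select {
	case done := <-proxyFlushCh:
		_ = report(ctx, pending) // assumed reporter call; flushes in-memory stats
		pending = nil
		close(done)
	case <-ctx.Done():
		return
	}
}
```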
I left a few nit-picks, and I'm looking forward to the follow-ups!
"nhooyr.io/websocket" | ||
|
||
"cdr.dev/slog" |
nit: is it expected or is the linter impaired?
Might be my linter.
This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).
```go
// temporary map to avoid O(q.workspaces*arg.workspaceIds)
m := make(map[uuid.UUID]struct{})
for _, id := range arg.IDs {
	m[id] = struct{}{}
}
```
This could be before the mutex lock, but it's such a minor perf change I'm not sure it's worth changing.
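For illustration, the suggested reordering would look something like this (the `q.mutex` name and surrounding structure are assumed from the dbmem pattern, not quoted from the PR):

```go
// Build the lookup set before taking the lock; constructing the map touches
// no shared state, so it doesn't need to hold q.mutex.
m := make(map[uuid.UUID]struct{}, len(arg.IDs))
for _, id := range arg.IDs {
	m[id] = struct{}{}
}

q.mutex.Lock()
defer q.mutex.Unlock()
// ... iterate q.workspaces and update rows whose ID is in m ...
```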
coderd/database/dbmem/dbmem.go (Outdated)
```go
	n++
}
if n == 0 {
	return sql.ErrNoRows
}
```
No need for `sql.ErrNoRows` here; an update never returns it unless `RETURNING *` is used.
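If detecting a no-op matters, the idiomatic check with `database/sql` is `RowsAffected` rather than `sql.ErrNoRows`. A hedged sketch, with the query text and `pq.Array` binding (from lib/pq) as illustrative assumptions:

```go
// An UPDATE that matches zero rows still succeeds with database/sql;
// inspect RowsAffected to learn whether anything changed.
res, err := db.ExecContext(ctx,
	`UPDATE workspaces SET last_used_at = $1 WHERE id = ANY($2)`,
	now, pq.Array(ids))
if err != nil {
	return err
}
if n, err := res.RowsAffected(); err == nil && n == 0 {
	// Nothing matched; for a batch update this usually isn't an error.
}
```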
"nhooyr.io/websocket" | ||
|
||
"cdr.dev/slog" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).
```go
}

if err := tx.InsertWorkspaceAppStats(ctx, batch); err != nil {
	return err
}
```
One could make a case for us trying to do the last used update even if this fails, wdyt?
Inserting app stats is just doing a big insert into a single table (which IIRC is unlogged), so if we run into issues doing that, my gut tells me that any further database queries might not be successful. Might not hurt to try, but I'm not sure how much the extra complexity would be worth it. Bear in mind that we will end up just trying to do this again in another 30 seconds!
```go
uniqueIDs := slice.Unique(batch.WorkspaceID)
if err := tx.BatchUpdateWorkspaceLastUsedAt(ctx, database.BatchUpdateWorkspaceLastUsedAtParams{
	IDs:        uniqueIDs,
	LastUsedAt: dbtime.Now(), // This isn't 100% accurate, but it's good enough.
}); err != nil {
	return err
}
```
IMO, it's close enough. We could use the max time from the stats if we want slightly more accuracy (still off for some workspaces, though).
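A sketch of that alternative, assuming the batch carries per-session end times in a `SessionEndedAt` slice (the field name is an assumption):

```go
// Use the latest session end time in the batch instead of dbtime.Now().
// Still a single value shared across all workspaces in the batch.
var lastUsed time.Time
for _, endedAt := range batch.SessionEndedAt { // assumed field
	if endedAt.After(lastUsed) {
		lastUsed = endedAt
	}
}
if lastUsed.IsZero() {
	lastUsed = dbtime.Now() // fallback when the batch carries no end times
}
```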
enterprise/wsproxy/wsproxy_test.go (Outdated)
```diff
@@ -442,6 +442,13 @@ func TestWorkspaceProxyWorkspaceApps(t *testing.T) {
 	"*",
 }

+proxyFlushCh := make(chan chan<- struct{})
```
Naming these stats-specific might keep things clearer.
```go
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
```
Same here, looping flush seems better than waiting an arbitrary time?
A loop-flush here would probably just execute after the first iteration.
Force-pushed from bad5194 to 8573ba7
Filed #11643 for following up on unexpected stats inserts.
Fixes #11509
- Adds `BatchUpdateLastUsedAt`
- Calls `BatchUpdateLastUsedAt` in the app stats handler upon flush

This is not the 'correct' solution, but it puts a decent enough bandage on the problem until we figure out a more unified solution to the issue of updating the "Last Used At" field of workspaces based on certain events. I'll be opening a follow-up PR for this.

Now, when the workspace apps stats collector flushes a batch of stats, we bump `LastUsedAt` for all affected workspaces.

Note: I'm just updating `LastUsedAt` to the same value for all workspaces. I don't know if there's a good way to insert a whole bunch of distinct values for a number of rows in a single transaction, and I want to keep this to a single query if possible (see the sketch below).

I've verified that this works experimentally, but this is the sort of thing that should be ossified in tests. I'm currently putting this into `TestWorkspaceApps`, as it seems to be the place where all the testing of proxying flows happens.

Also note: the tests are sadly a bit racy; I've done what I can for the moment, but I may need to spend some follow-up time refactoring.
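On the single-query question: Postgres can pair each row with its own value by zipping arrays with `unnest`. A hedged sketch of what such a query could look like, not what this PR ships (the query text and parameter types are illustrative):

```go
// Hypothetical query giving each workspace its own timestamp in one
// statement by joining against two unnested, parallel arrays
// (Postgres-specific).
const batchUpdateDistinctLastUsedAt = `
UPDATE workspaces AS w
SET last_used_at = v.last_used_at
FROM unnest($1::uuid[], $2::timestamptz[]) AS v(id, last_used_at)
WHERE w.id = v.id
`
```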