fix(coderd): workspaceapps: update last_used_at when workspace app reports stats #11603
Conversation
Force-pushed from 2d44bf4 to 5b799e8
```go
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: leaving this here to address in a follow-up. As this is now tied to stats collection, we bump whenever there are app usage stats for a workspace.
```go
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```go
// TODO(cian): The initial redirect should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```go
// TODO(cian): A blocked request should not count as workspace usage.
// assertWorkspaceLastUsedAtNotUpdated(t, appDetails.AppClient(t), appDetails)
```
review: follow-up
```diff
@@ -375,6 +396,7 @@ func Run(t *testing.T, appHostIsPrimary bool, factory DeploymentFactory) {
 	// TODO(@deansheather): This should be 400. There's a todo in the
 	// resolve request code to fix this.
 	require.Equal(t, http.StatusInternalServerError, resp.StatusCode)
+	assertWorkspaceLastUsedAtUpdated(t, appDetails)
```
review: I'm counting this as a successful and valid attempt to access a workspace.
```go
// Accessing an app should update the workspace's LastUsedAt.
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtUpdated(t testing.TB, details *Details) {
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
	// Wait for stats to fully flush.
	require.Eventually(t, func() bool {
		ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
		assert.NoError(t, err)
		return ws.LastUsedAt.After(details.Workspace.LastUsedAt)
	}, testutil.WaitShort, testutil.IntervalMedium, "workspace LastUsedAt not updated when it should have been")
}

// Except when it sometimes shouldn't (e.g. no access).
// NOTE: Despite our efforts with the flush channel, this is inherently racy.
func assertWorkspaceLastUsedAtNotUpdated(t testing.TB, details *Details) {
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
	<-time.After(testutil.IntervalMedium) // Wait for stats to fully flush.
	ws, err := details.SDKClient.Workspace(context.Background(), details.Workspace.ID)
	require.NoError(t, err)
	require.Equal(t, ws.LastUsedAt, details.Workspace.LastUsedAt, "workspace LastUsedAt updated when it should not have been")
}
```
review: this is where all the race gremlins live:
- Bicopy does not appear to exit immediately, so if we flush stats right away we may not get a finished session.
- Even though a completed flush should mean that `LastUsedAt` has been bumped, I've observed some wiggle room (especially when wsproxy is involved).
Why don't we try looping the flush until we either time out or have the stats we're looking for? Might make sense to have `FlushStats` take a context too, just to be sure.
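A minimal sketch of what that looping flush could look like, assuming a hypothetical context-aware wrapper around the test harness's existing `FlushStats` and `Details` (the function name and loop shape are illustrative, not the PR's actual API):

```go
// flushUntilUpdated repeatedly triggers a stats flush and polls the workspace
// until LastUsedAt advances or the context expires. Hypothetical helper.
func flushUntilUpdated(ctx context.Context, details *Details) error {
	ticker := time.NewTicker(testutil.IntervalMedium)
	defer ticker.Stop()
	for {
		details.FlushStats() // trigger a flush on every iteration
		ws, err := details.SDKClient.Workspace(ctx, details.Workspace.ID)
		if err != nil {
			return err
		}
		if ws.LastUsedAt.After(details.Workspace.LastUsedAt) {
			return nil // the stats we're looking for have landed
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```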
enterprise/wsproxy/wsproxy_test.go (Outdated)
```go
proxyFlushCh := make(chan chan<- struct{})
flushStats := func() {
	proxyFlushDone := make(chan struct{})
	proxyFlushCh <- proxyFlushDone
	<-proxyFlushDone
}
```
review: there should be no need to flush the stats on the primary here, as the proxy sends its stats directly to the primary endpoint, which is wired up to the Reporter, which just inserts immediately.
Correct me if I'm wrong, but without a flush, we can't ensure that `Report()` is being called (i.e. flushing the in-memory stats to coderd via POST), right? So I think it may be needed here too, depending on what we're looking to achieve.
From my reading, any POSTs to `/app-stats` go straight to `api.WorkspaceAppsStatsCollectorOptions.Reporter`, and the wsproxy app stats reporter just hits that endpoint on flush. The stats reporter for the primary goes straight to the DB as well, as far as I can tell.
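For context, the flush-trigger pattern in the test snippet above relies on the collector draining the channel. A rough sketch of the receiving side, where `report` and `pending` are assumptions standing in for the real reporter call and in-memory stats, not the actual collector code:

```go
// Sketch of a collector loop supporting on-demand flushes: each value on
// proxyFlushCh is a "done" channel the test is blocked on; the loop reports
// whatever is pending, then closes it so the test's flushStats() unblocks.
for {
	select {
	case done := <-proxyFlushCh:
		_ = report(ctx, pending) // assumed reporter call; flushes in-memory stats
		pending = nil
		close(done)
	case <-ctx.Done():
		return
	}
}
```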
I left a few nit-picks, and I'm looking forward to the follow-ups!
"nhooyr.io/websocket" | ||
|
||
"cdr.dev/slog" |
nit: is it expected or is the linter impaired?
Might be my linter.
This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).
```go
// temporary map to avoid O(q.workspaces*arg.workspaceIds)
m := make(map[uuid.UUID]struct{})
for _, id := range arg.IDs {
	m[id] = struct{}{}
}
```
This could be before the mutex lock, but it's such a minor perf change I'm not sure it's worth changing.
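For illustration, the suggested reordering would look something like this (the `q.mutex` name and surrounding structure are assumed from the dbmem pattern, not quoted from the PR):

```go
// Build the lookup set before taking the lock; constructing the map touches
// no shared state, so it doesn't need to hold q.mutex.
m := make(map[uuid.UUID]struct{}, len(arg.IDs))
for _, id := range arg.IDs {
	m[id] = struct{}{}
}

q.mutex.Lock()
defer q.mutex.Unlock()
// ... iterate q.workspaces and update rows whose ID is in m ...
```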
coderd/database/dbmem/dbmem.go (Outdated)
```go
	n++
}
if n == 0 {
	return sql.ErrNoRows
}
```
No need for `sql.ErrNoRows` here; an update never returns it unless `RETURNING *` is used.
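If detecting a no-op matters, the idiomatic check with `database/sql` is `RowsAffected` rather than `sql.ErrNoRows`. A hedged sketch, with the query text and `pq.Array` binding (from lib/pq) as illustrative assumptions:

```go
// An UPDATE that matches zero rows still succeeds with database/sql;
// inspect RowsAffected to learn whether anything changed.
res, err := db.ExecContext(ctx,
	`UPDATE workspaces SET last_used_at = $1 WHERE id = ANY($2)`,
	now, pq.Array(ids))
if err != nil {
	return err
}
if n, err := res.RowsAffected(); err == nil && n == 0 {
	// Nothing matched; for a batch update this usually isn't an error.
}
```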
"nhooyr.io/websocket" | ||
|
||
"cdr.dev/slog" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, it's how these are usually grouped (coder/coder, cdr.dev, etc grouped together, separate from stdlib and 3rd party).
```go
}

if err := tx.InsertWorkspaceAppStats(ctx, batch); err != nil {
	return err
}
```
One could make a case for us trying to do the last used update even if this fails, wdyt?
Inserting app stats is just doing a big insert into a single table (which IIRC is unlogged), so if we run into issues doing that, my gut tells me that any further database queries might not be successful. Might not hurt to try, but I'm not sure how much the extra complexity would be worth it. Bear in mind that we will end up just trying to do this again in another 30 seconds!
```go
uniqueIDs := slice.Unique(batch.WorkspaceID)
if err := tx.BatchUpdateWorkspaceLastUsedAt(ctx, database.BatchUpdateWorkspaceLastUsedAtParams{
	IDs:        uniqueIDs,
	LastUsedAt: dbtime.Now(), // This isn't 100% accurate, but it's good enough.
}); err != nil {
	return err
}
```
IMO, it's close enough. We could use the max time from the stats if we want slightly more accuracy (still off for some workspaces, though).
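A sketch of that alternative, assuming the batch carries per-session end times in a `SessionEndedAt` slice (the field name is an assumption):

```go
// Use the latest session end time in the batch instead of dbtime.Now().
// Still a single value shared across all workspaces in the batch.
var lastUsed time.Time
for _, endedAt := range batch.SessionEndedAt { // assumed field
	if endedAt.After(lastUsed) {
		lastUsed = endedAt
	}
}
if lastUsed.IsZero() {
	lastUsed = dbtime.Now() // fallback when the batch carries no end times
}
```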
enterprise/wsproxy/wsproxy_test.go (Outdated)
```diff
@@ -442,6 +442,13 @@ func TestWorkspaceProxyWorkspaceApps(t *testing.T) {
 	"*",
 }

+proxyFlushCh := make(chan chan<- struct{})
```
Naming these stats-specific might keep things clearer.
```go
	t.Helper()

	<-time.After(testutil.IntervalMedium) // Wait for Bicopy to finish.
	details.FlushStats()
```
Same here, looping flush seems better than waiting an arbitrary time?
A loop-flush here would probably just execute after the first iteration.
Force-pushed from bad5194 to 8573ba7
Filed #11643 for following up on unexpected stats inserts.
Fixes #11509
- Adds `BatchUpdateLastUsedAt`
- Calls `BatchUpdateLastUsedAt` in the app stats handler upon flush

This is not the 'correct' solution, but it puts a decent enough bandage on the problem until we figure out a more unified solution to the issue of updating the "Last Used At" field of workspaces based on certain events. I'll be opening a follow-up PR for this.

Now, when the workspace apps stats collector flushes a batch of stats, we bump `LastUsedAt` for all affected workspaces.

Note: I'm just updating `LastUsedAt` to the same value for all workspaces. I don't know if there's a good way to insert a whole bunch of distinct values for a number of rows in a single transaction, and I want to keep this to a single query if possible (see the sketch below).

I've verified that this works experimentally, but this is the sort of thing that should be ossified in tests. I'm currently putting this into `TestWorkspaceApps`, as it seems to be the place where all the testing of proxying flows happens.

Also note: the tests are sadly a bit racy; I've done what I can for the moment, but I may need to spend some follow-up time refactoring.
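On the single-query question: Postgres can pair each row with its own value by zipping arrays with `unnest`. A hedged sketch of what such a query could look like, not what this PR ships (the query text and parameter types are illustrative):

```go
// Hypothetical query giving each workspace its own timestamp in one
// statement by joining against two unnested, parallel arrays
// (Postgres-specific).
const batchUpdateDistinctLastUsedAt = `
UPDATE workspaces AS w
SET last_used_at = v.last_used_at
FROM unnest($1::uuid[], $2::timestamptz[]) AS v(id, last_used_at)
WHERE w.id = v.id
`
```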