feat(coderd): batch agent stats inserts #8875
Conversation
notes:
cli/server.go
@@ -813,6 +814,21 @@ func (r *RootCmd) Server(newAPI func(context.Context, *coderd.Options) (*coderd.
	options.SwaggerEndpoint = cfg.Swagger.Enable.Value()
}

batchStatsTicker := time.NewTicker(29 * time.Second) // Hard-coding.
review: Would this be better set at a lower interval, or made configurable?
First of all, 29 sec looks magical :) You may need to comment on why it is exactly this value.
I didn't look deeper into the review yet, but you may want to introduce two flush triggers: a timeout and the buffer getting full.
Made the interval 1 second instead by default.
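For illustration, the two triggers discussed above (a periodic tick plus a "buffer full" signal) can be combined in a single flush loop. A minimal sketch; `Batcher`, `flushLever`, and `flush` are assumed names here, not necessarily the PR's exact API:

// Sketch of a flush loop serving both triggers plus a final drain.
type Batcher struct {
	flushLever chan struct{} // signaled by Add when the buffer fills up
}

func (b *Batcher) run(ctx context.Context, tick <-chan time.Time) {
	for {
		select {
		case <-tick:
			b.flush(ctx, "periodic") // time-based trigger
		case <-b.flushLever:
			b.flush(ctx, "buffer full") // size-based trigger from Add
		case <-ctx.Done():
			b.flush(ctx, "shutdown") // final drain before exiting
			return
		}
	}
}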
stat := statByAgent[agentID]
stat.AgentID = agentStat.AgentID
stat.TemplateID = agentStat.TemplateID
stat.UserID = agentStat.UserID
stat.WorkspaceID = agentStat.WorkspaceID
review: this was bugged before.
Good finding 👍 Will it be covered with tests now?
Yes, the tests added here were the impetus for this change!
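For readers wondering what kind of bug a pattern like this invites (a guess on my part; the thread doesn't say), Go's map-value copy semantics silently discard field writes on a value read out of a map:

// Hypothetical illustration, not the PR's code: if the map holds struct
// values, the local variable is a copy and the writes are lost.
statByAgent := map[uuid.UUID]StatRow{}
stat := statByAgent[agentID]
stat.AgentID = agentStat.AgentID // mutates the copy only
statByAgent[agentID] = stat      // write-back required, easy to forget
// Storing pointers (map[uuid.UUID]*StatRow) avoids the write-back entirely.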
unnest(@workspace_id :: uuid[]) AS workspace_id,
unnest(@template_id :: uuid[]) AS template_id,
unnest(@agent_id :: uuid[]) AS agent_id,
jsonb_array_elements(@connections_by_proto :: jsonb) AS connections_by_proto,
review: `connections_by_proto` needs to be handled as `jsonb[]` and not `jsonb[]`, as that ends up being passed to `pq.Array`, which turns it into something like `{{123,34,106,34,...125}}`.
:)

> jsonb[] and not jsonb[]
🙃 `jsonb` and not `jsonb[]`
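The working approach, roughly: marshal the per-row maps into one jsonb array on the Go side and let the query expand it with `jsonb_array_elements`. A sketch, where `params.ConnectionsByProto` is an assumed field name:

// One json.Marshal over the whole slice yields a single jsonb array,
// avoiding pq.Array turning a jsonb[] into raw byte garbage.
byProto := []map[string]int64{
	{"ssh": 2, "reconnecting_pty": 1}, // row 1
	{"ssh": 1},                        // row 2
}
payload, err := json.Marshal(byProto)
if err != nil {
	return fmt.Errorf("marshal connections by proto: %w", err)
}
params.ConnectionsByProto = payload // passed as a single jsonb value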
ctx, cancel := context.WithCancel(context.Background())

if options.StatsBatcher == nil {
	panic("developer error: options.StatsBatcher is nil")
Is there a reason why `New(options)` can't define a default instance?
No, just keeping to the existing pattern in this file.
coderd/batchstats/batcher.go
const (
	// DefaultBatchSize is the default size of the batcher's buffer.
	DefaultBatchSize = 1024
In a typical environment, how many items are in the buffer?
Assuming 1000 workspaces, the worst-case scenario is that they all attempt to push stats at once. That's extremely unlikely, however.
coderd/batchstats/batcher.go
b.buf.SessionCountSSH = append(b.buf.SessionCountSSH, st.SessionCountSSH)
b.buf.ConnectionMedianLatencyMS = append(b.buf.ConnectionMedianLatencyMS, st.ConnectionMedianLatencyMS)

// If the buffer is full, signal the flusher to flush immediately.
nit: do you think that we should flush before the buffer is full? Let's say at 80% occupancy?
What would the advantage of this be? If you're trying to avoid blocking during flush, blocking will happen regardless of when we flush. We could make a temporary copy of the buffer while we flush to avoid that, but batch-inserting 1000 entries happens fairly quickly from what I can tell.
We had a conversation about this.
We came up with passing the existing buffer to the lever and resetting the buffer. That way the buffer is new for the next Add and the previous data is pending the db query.
Pseudo code:
if len(b.buf.ID) == cap(b.buf.ID) {
	select {
	case b.flushLever <- b.buf:
		// Hand the full buffer to the flusher and start a fresh one.
		b.buf = database.InsertWorkspaceAgentStatsParams{}
	default:
		b.log.Error(context.Background(), "this should never happen, dropping agent stats :(")
	}
}
Some additional nits:
- We should probably use the constant instead of `cap` for comparison, since `cap` can grow: `len(b.buf.ID) >= b.batchSize`.
- We should use a select and throw an error log if we are backing up. We should drop the stats; if we are backing up, there is nothing we can do except log it anyway.
It ended up being simpler to do an early flush. However, I thought about this some more and I think using a buffered channel would also resolve this issue. This is a possible later enhancement.
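For the record, the buffered-channel variant mentioned here could look roughly like the sketch below; `agentStat` and `insertBatch` are placeholders, not the PR's API. `Add` hands off a full batch without blocking, and a single flusher goroutine drains the channel:

batches := make(chan []agentStat, 4) // small buffer absorbs insert latency

// Producer side, inside Add, once the current batch is full:
select {
case batches <- batch:
	batch = make([]agentStat, 0, batchSize) // fresh buffer for the next Add
default:
	log.Print("flusher backed up, dropping batch") // never block callers
}

// Consumer side, the flusher goroutine:
go func() {
	for batch := range batches {
		insertBatch(ctx, batch) // one bulk insert per batch received
	}
}()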
coderd/batchstats/batcher.go
b.buf.ConnectionsByProto = payload
}

err = b.store.InsertWorkspaceAgentStats(ctx, b.buf)
What happens if one "stat" in the batch is faulty? Is the whole batch rejected?
Bonus points: maybe we need a test for this.
Yes, the whole batch will be rejected. That's a good idea for a test.
It would be nice to exclude the faulty item and keep the rest of the batch, but I guess that would be tricky to implement; we could instead fall back to an iterative approach.
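The iterative fallback could be as simple as retrying row by row after a failed bulk insert, so only the faulty rows are dropped. A sketch with hypothetical `insertBatch`/`insertOne` helpers:

if err := insertBatch(ctx, batch); err != nil {
	// The bulk insert is all-or-nothing, so retry each row individually
	// and drop only the ones that are actually faulty.
	log.Printf("batch insert failed, retrying per row: %v", err)
	for _, st := range batch {
		if err := insertOne(ctx, st); err != nil {
			log.Printf("dropping faulty stat for agent %s: %v", st.AgentID, err)
		}
	}
}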
Not saying to solve this, but dang, our `server.go` start function is growing huge. We should really look into a better way to manage it.
Looks good 👍
b.flushLever <- struct{}{}
b.flushForced.Store(true)
Does the `flushForced` just prevent a deadlock by adding to the `flushLever` channel if it already has something?
Can we just use a select statement to prevent pushing to the `flushLever` channel if it already has an item?
I tried the select but it will still end up doing one extra unwanted flush.
Well, in theory an extra flush is a lesser evil than losing data.
This PR adds support for batching inserts to the `workspace_agents_stats` table.

Closes #8063
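For context, the batched insert relies on Postgres zipping parallel arrays back into rows. A simplified sketch of the query shape (not the PR's exact SQL, which has more columns and the jsonb handling discussed above):

// Each column arrives as one array parameter; unnest expands them in
// lockstep, and the protocol counts travel as a single jsonb array.
const bulkInsertSketch = `
INSERT INTO workspace_agents_stats (workspace_id, template_id, agent_id, connections_by_proto)
SELECT
	unnest($1 :: uuid[]),
	unnest($2 :: uuid[]),
	unnest($3 :: uuid[]),
	jsonb_array_elements($4 :: jsonb);
`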