
fix(agent): send metadata in batches #10225


Merged

mafredri merged 2 commits into main from mafredri/feat-agent-send-metadata-in-batches on Oct 13, 2023

Conversation

@mafredri (Member) commented Oct 11, 2023

Fixes #9782


I recommend reviewing with "ignore whitespace" enabled.


@mafredri changed the title from feat(agent): send metadata in batches to fix(agent): send metadata in batches on Oct 11, 2023
@mafredri force-pushed the mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint branch from eeb4adb to 69ebb05 on October 11, 2023 18:42
@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from 82b8659 to b7958c3 on October 11, 2023 18:44
@mafredri force-pushed the mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint branch from 69ebb05 to a3395ac on October 12, 2023 08:30
@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from b7958c3 to dd6935e on October 12, 2023 08:30
@mafredri force-pushed the mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint branch from a3395ac to 7e7f6c0 on October 12, 2023 10:26
@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from dd6935e to 431637a on October 12, 2023 10:27
			updatedMetadata[mr.key] = mr.result
			continue
		case <-report:
			if len(updatedMetadata) > 0 {
Member

If I'm understanding this correctly - say you have two agent metadata definitions, one which updates every 1 second, the other which updates every 10 seconds.

In the previous request-per-metadata-key approach, this would cause (0.1 + 1) = 1.1 requests per second, while in this new approach we would end up with a minimum of 1 request per second, as the more frequently updated metadatum would cause a batch request.

I think we should update the documentation with this change to reflect the new behaviour. I think it would also make sense to recommend that users keep the minimum metadata refresh interval in mind when writing their templates; from what I understand, any metadatum with a refresh interval of 1 second will cause frequent metadata updates.

Member Author

> In the previous request-per-metadata-key approach, this would cause (0.1 + 1) = 1.1 requests per second, while in this new approach we would end up with a minimum of 1 request per second.

I'm not sure where the 0.1 comes from in your example, but for simplicity's sake, let's say we have 1s and 2s intervals for the metadata, and each takes 0 ns to execute. In the previous implementation we would approximately do:

14:00:00 POST meta1
14:00:00 POST meta2
14:00:01 POST meta1
14:00:02 POST meta1
14:00:02 POST meta2
14:00:03 POST meta1

In the new implementation, we would:

14:00:00 POST meta1, meta2
14:00:01 POST meta1
14:00:02 POST meta1, meta2
14:00:03 POST meta1

With 2s and 3s metadata, it would look like this:

Old:

14:00:00 POST meta1
14:00:00 POST meta2
14:00:02 POST meta1
14:00:03 POST meta2
14:00:04 POST meta1
14:00:06 POST meta1
14:00:06 POST meta2

New:

14:00:00 POST meta1, meta2
14:00:02 POST meta1
14:00:03 POST meta2
14:00:04 POST meta1
14:00:06 POST meta1, meta2

This is an approximation under ideal conditions, though. Perhaps we should separate the collect and send triggers by, say, 500ms; that would increase the likelihood of ideal batching.

With regard to RPS: the new implementation reduces RPS but doesn't necessarily guarantee 1 RPS; it depends on the intervals and how long the commands take to execute.

> I think we should update the documentation with this change to reflect the new behaviour.

Are you referring to this https://coder.com/docs/v2/latest/templates/agent-metadata#db-write-load?

The write load will be about the same, but the writes will be more performant since they're batched. I suppose we could call one batch one write, even though we're updating multiple rows. I'll amend this part.
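
To make the accumulate-then-flush shape discussed above concrete, here is a minimal, self-contained sketch (assumed names and types, not the actual agent code): collectors overwrite the latest value per key, and a flush sends everything gathered since the last report as a single batch.

```go
package main

import (
	"fmt"
	"time"
)

// result stands in for codersdk.WorkspaceAgentMetadataResult.
type result struct {
	key, value string
}

func main() {
	results := make(chan result)     // collected metadata values
	report := make(chan struct{}, 1) // non-blocking "flush now" signal
	updated := map[string]string{}   // latest value per key since the last flush

	// Two fake metadata items with 1s and 2s intervals.
	go produce(results, report, "meta1", time.Second)
	go produce(results, report, "meta2", 2*time.Second)

	for start := time.Now(); time.Since(start) < 5*time.Second; {
		select {
		case r := <-results:
			// Remember only the most recent value per key; the next
			// flush sends all of them in a single request.
			updated[r.key] = r.value
		case <-report:
			if len(updated) > 0 {
				fmt.Printf("%s POST batch: %v\n", time.Now().Format("15:04:05"), updated)
				updated = map[string]string{}
			}
		}
	}
}

func produce(results chan<- result, report chan<- struct{}, key string, interval time.Duration) {
	for range time.Tick(interval) {
		results <- result{key: key, value: "collected"}
		select {
		case report <- struct{}{}: // request a flush without blocking
		default:
		}
	}
}
```

With the 1s and 2s intervals from the example, values that arrive around the same tick tend to coalesce into one POST, which is the batching effect shown in the timelines above.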

Member

@johnstcn Oct 12, 2023

Wouldn't an agent metadatum with a refresh interval of 10 seconds cause 1 request roughly every 10 seconds (i.e. 0.1 RPS)?

Member Author

Yes, this is the case in both implementations.

Member

@johnstcn left a comment

The change to the docs is the cherry on top here!

-You can expect (10 * 6 * 2) / 4 or 30 writes per second.
+You can expect at most (10 * 2 / 4) + (10 * 2 / 6) or ~8 writes per second.
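
Spelled out, with the example numbers from the docs snippet kept as-is (their meaning is not restated here):

(10 * 2 / 4) + (10 * 2 / 6) = 5 + 3.33… ≈ 8.3, hence "~8 writes per second",
versus (10 * 6 * 2) / 4 = 30 under the old wording.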

Member

@mtojek left a comment

Since you refactored this function a bit, I'm wondering if it's possible to add extra unit tests for healthy/unhealthy conditions. I'm thinking about singleflight vs. duplicate back-pressure, coderd being unavailable, etc.


if len(manifest.Metadata) > metadataLimit {
Member

We don't need this condition anymore? Just in case...

Member Author

Previously there was a buffered channel, but that's no longer required, so this artificial restriction was lifted. If we want to limit this, it should be done in the coder Terraform provider, e.g. by reporting that too many metadata items were added.

reportSemaphore <- struct{}{}
}()

err := a.client.PostMetadata(ctx, agentsdk.PostMetadataRequest{Metadata: metadata})
Member

This behavior doesn't change, right?

If the agent fails to send the metadata, it will be lost. There is no "retry" mechanism in place now?

Member Author

Correct, we don't retry since trying to send stale data wouldn't make much sense; instead, we wait until the next update is available and try to send that.
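
As a rough illustration of this fire-and-forget behaviour (assumed names; the actual agent code is structured differently), each batch gets a single attempt under a timeout and is dropped on error rather than retried:

```go
package main

import (
	"context"
	"log"
	"time"
)

// Metadata stands in for the agentsdk metadata payload.
type Metadata struct{ Key, Value string }

// reportLoop gives each batch one attempt under a timeout; on failure the
// batch is dropped and the loop simply waits for the next batch of fresh
// values instead of retrying with stale data.
func reportLoop(ctx context.Context, batches <-chan []Metadata, post func(context.Context, []Metadata) error) {
	const reportTimeout = 30 * time.Second
	for batch := range batches {
		sendCtx, cancel := context.WithTimeout(ctx, reportTimeout)
		err := post(sendCtx, batch)
		cancel()
		if err != nil {
			log.Printf("post metadata failed, dropping batch: %v", err)
		}
	}
}

func main() {
	batches := make(chan []Metadata, 1)
	batches <- []Metadata{{Key: "meta1", Value: "v1"}}
	close(batches)
	reportLoop(context.Background(), batches, func(_ context.Context, b []Metadata) error {
		log.Printf("posting %d item(s)", len(b))
		return nil
	})
}
```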

Member

Ok! I guess we can debate about the lack of source data for insights, but that's a separate issue.

Member Author

Ah, this is not used for insights, so it will not have an effect on that.

// benefit from canceling the previous send and starting a new one.
var (
	updatedMetadata = make(map[string]*codersdk.WorkspaceAgentMetadataResult)
	reportTimeout   = 30 * time.Second
Member

There are many magic timeouts hidden in this file. I'm wondering if we shouldn't move them to the top of the file and derive them from one another. For instance: instead of 5 sec, use reportTimeout * 1/6.
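
One possible reading of that suggestion, purely as a sketch (the names and the 1/6 ratio are illustrative, not what the PR actually does): hoist the scattered literals to the top of the file and derive the smaller timeouts from the larger one.

```go
package agent

import "time"

const (
	metadataReportTimeout  = 30 * time.Second
	metadataCollectTimeout = metadataReportTimeout / 6 // ~5s, the current fallback when md.Timeout is unset
)
```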

Member Author

With 5 sec, are you referring to the case where md.Timeout is not set (or zero), and we fall back to 5 sec?

I wanted to change as little about the metadata collection as possible, but I can go over this and see what can be consolidated.

This 30 second timeout was randomly picked by me as a crutch for scenarios where the network is super slow or down. It should be enough for at least something to get through, whilst avoiding sending very stale data.

Member

I see. I'm afraid that there are too many timeouts depending on each other in the entire implementation, and it might be hard to debug potential queueing problems.

Comment on lines +376 to +377
report = make(chan struct{}, 1)
collect = make(chan struct{}, 1)
Member

Would it change anything if we increased the buffers here?

Member Author

Yes. They have a single-slot buffer so that signaling is non-blocking and doesn't cause back-pressure. If we increased the channel sizes, we could get back-pressure when an operation takes longer than expected. Say collect had a buffer of 10: if one iteration of starting collection took 10s, the channel would fill up, and afterwards 10 collections would start back-to-back, even if each now finishes in 0.1s, possibly leading to 10 collections (and the corresponding load) within one second (vs. once per second).
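
A tiny sketch of the pattern described here (assumed names, not the PR's code): the size-1 buffer lets callers request work without blocking, and redundant requests coalesce into one pending wake-up instead of queueing up as back-pressure.

```go
package main

import "fmt"

// signal requests a collection/report without blocking: if a wake-up is
// already pending, the extra request is dropped (coalesced) rather than
// queued behind it.
func signal(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
	default:
	}
}

func main() {
	collect := make(chan struct{}, 1)

	// Ten rapid-fire requests arriving before the worker gets to run...
	for i := 0; i < 10; i++ {
		signal(collect)
	}

	// ...coalesce into a single pending wake-up.
	fmt.Println("pending wake-ups:", len(collect)) // 1, not 10
}
```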

// The last collected value isn't quite stale yet, so we skip it.
if collectedAt.Add(time.Duration(md.Interval) * intervalUnit).After(time.Now()) {
return
ctxTimeout := time.Duration(timeout) * time.Second
Member

I might be lost in the flow here... why is this 1sec timeout needed?

Member Author

This also didn't change in this PR.

The timeout here can be 1s or it can be higher. I suppose the logic is that if you want metadata to update every 1s, collection should run faster than 1s, but this is likely not ideal.

We could change this in the future or if it becomes a problem.

Member

@mtojek left a comment

I have a better understanding of the code now. LGTM!

@ammario removed their request for review on October 12, 2023 15:32
@mafredri force-pushed the mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint branch from 7e7f6c0 to 161c535 on October 13, 2023 13:10
@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from 4e3730e to 44586d1 on October 13, 2023 13:10
@mafredri force-pushed the mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint branch 2 times, most recently from 06d1c13 to e89259a on October 13, 2023 14:19
@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from 44586d1 to 3dd45f4 on October 13, 2023 14:19
Base automatically changed from mafredri/feat-agentsdk-use-agent-metadata-batch-endpoint to main on October 13, 2023 14:32
Member Author

mafredri commented Oct 13, 2023

Merge activity

  • Oct 13, 10:32 AM: Graphite rebased this pull request after merging its parent, because this pull request is set to merge when ready.
  • Oct 13, 10:48 AM: @mafredri merged this pull request with Graphite.

@mafredri force-pushed the mafredri/feat-agent-send-metadata-in-batches branch from 3dd45f4 to 4758603 on October 13, 2023 14:32
@mafredri merged commit 76c65b1 into main on Oct 13, 2023
@mafredri deleted the mafredri/feat-agent-send-metadata-in-batches branch on October 13, 2023 14:48
@github-actions bot locked and limited conversation to collaborators on Oct 13, 2023