feat: integrate agentAPI with resources monitoring logic #16438

Conversation
@@ -295,6 +295,42 @@ message Timing {
  Status status = 6;
}

message GetResourcesMonitoringConfigurationRequest {
This message is based on what is described in the RFC; it has been mutually discussed and approved by Vincent and @DanielleMaywood.
agent/resources_monitor.go
Outdated
"github.com/coder/quartz" | ||
) | ||
|
||
func (a *agent) pushResourcesMonitoring(ctx context.Context, aAPI proto.DRPCAgentClient24) error { |
Self review: this intermediate function adds an extra layer to simplify testing. Since all arguments are injected as parameters to PushResourcesMonitoringWithConfig, we can pass in the mocked versions.
agent/resources_monitor.go
Outdated
	configFetcher ResourcesMonitorConfigurationFetcher,
	datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
Self review: there are three situations in which we do not want to run the goroutine (see the sketch below):
- We cannot fetch the configuration from coderd.
- Resources monitoring is not enabled.
- We cannot instantiate the clistat client (the one used to fetch resources usage).
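For context, a minimal sketch of those guards - the `Enabled` config field, the `newResourcesFetcher` wrapper around clistat, and `startResourcesMonitoring` are assumptions for illustration, not the PR's exact code:

```go
func (a *agent) pushResourcesMonitoring(ctx context.Context, aAPI proto.DRPCAgentClient24) error {
	// 1. Bail out if the configuration cannot be fetched from coderd.
	config, err := aAPI.GetResourcesMonitoringConfiguration(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
		return xerrors.Errorf("get resources monitoring configuration: %w", err)
	}

	// 2. Do nothing if resources monitoring is not enabled for this agent.
	// (Enabled is assumed here as the config flag checked by the PR.)
	if !config.GetConfig().GetEnabled() {
		return nil
	}

	// 3. Bail out if the clistat-backed fetcher cannot be instantiated, since
	// it is the component used to read resources usage.
	fetcher, err := newResourcesFetcher() // hypothetical wrapper around the clistat client
	if err != nil {
		return xerrors.Errorf("create resources fetcher: %w", err)
	}

	// Hand off to the ticking goroutine (hypothetical signature).
	return startResourcesMonitoring(ctx, aAPI, config, fetcher)
}
```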
	}, nil
}

func (a *ResourcesMonitoringAPI) PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error) {
Self review: this function is just a placeholder - we are fine keeping it this way for now, as it will be replaced by @DanielleMaywood's PR.
agent/resources_monitor.go
Outdated
	datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))

	clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval*int32(time.Second)), func() error {
Self review: we prefer to log instead of returning an error when a problem occurs. With quartz's `TickerFunc`, returning an error would shut down the whole ticker and stop it; returning nil ends the current tick, but the next one still fires.
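For illustration, a minimal sketch of that pattern inside the `TickerFunc` callback; `fetchDatapoint` and `logger` are hypothetical here, while `datapointsQueue` and the ticker call mirror the diff above:

```go
clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval)*time.Second, func() error {
	datapoint, err := fetchDatapoint(ctx) // hypothetical helper
	if err != nil {
		// Returning this error would shut the whole ticker down, so log it
		// instead; the next tick still fires and can retry the fetch.
		logger.Error(ctx, "fetch resources datapoint", slog.Error(err))
		return nil
	}
	datapointsQueue.Push(datapoint)
	return nil
}, "resources_monitor")
```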
@spikecurtis I'd love to have your review - especially on the whole AgentAPI / TailnetAPI part - I want to be sure I'm bumping the version as we should.
Generally looking good, but I'm concerned that this doesn't match the RFC. See my comment in agent/resources_monitor.go.
agent/resources_monitor_test.go
Outdated
func TestPushResourcesMonitoringWithConfig(t *testing.T) {
	t.Parallel()

	tests := []struct {
Looking at the code coverage, we're missing a few test-cases here:
- disabled config pushes nothing
- errors when fetching memory/volume info
- errors when pushing to coderd
I improved the testability of this block a lot by creating a struct for the resources_fetcher part, so we can mock it.
I also added test cases for all the missing scenarios. It should be much better now.
agent/resources_monitor.go
Outdated
	ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
)

func PushResourcesMonitoringWithConfig(ctx context.Context,
This implementation does not seem to match the RFC design.
Once the buffer fills up, it'll send the full buffer to the control plane. However, it'll keep doing so every `TickInterval` thereafter, since the buffer only ever replaces the oldest value.
The RFC specifies that the payload should only be delivered every 15s for the 20 datapoints; this code performs collection and submission in the same ticker, with no distinction between them.
We also don't seem to be adding the `UNKNOWN` datapoints when values cannot be retrieved.
Was this a conscious choice? Is the RFC out-of-date, perhaps?
About the `UNKNOWN` status - this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle - from the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decide whether the resource, for the given datapoint, is `OK`, `NOK` or `UNKNOWN`.
About the tick logic - that's maybe a misunderstanding in the RFC, but when I defined it to be delivered every 15s, it was based on `TickInterval` being set to 15. Having both `TickInterval` and another value set independently would require some sync to avoid either sending the datapoints twice without a refresh, or fetching datapoints too frequently and therefore not pushing data to coderd on each tick.
TL;DR - I think the best solution is to have `TickInterval` handle both the refresh of resources and the push to coderd, so we are sure each push always carries the latest data.
> About the UNKNOWN status - this is handled directly on the coderd side.
> It will be visible in the part currently being done by Danielle - from the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decide whether the resource, for the given datapoint, is OK, NOK or UNKNOWN.

Can you point me to some specific code I can look at, please? Will you be inferring that a datapoint was missed by looking at the time of each datapoint? I wonder why we opted for this approach rather than explicitly sending a noop datapoint to indicate a problem.

> TL;DR - I think the best solution is to have TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push always carries the latest data.

This doesn't really address what I'm asking, though.
We're going to be sending all n datapoints every `TickInterval` once the queue fills up, since we never drain the queue but only shift off the oldest element.
Proto version changes look overall fine, but I want to review again re: RBAC
agent/resources_monitor.go
Outdated
type (
	ResourcesMonitorConfigurationFetcher func(ctx context.Context, params *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	ResourcesMonitorDatapointsPusher     func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
This should be an interface. With a bare function type, IDE tooling has a hard time finding implementations of this function. If you make it an interface, you can find implementations and see what this is.
I removed the `func` types and `ResourcesMonitorConfigurationFetcher`, and switched to an interface - it should be much better this way. 👀
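For reference, roughly the interface shape this converges on - the name and method mirror the diff, while treating the generated DRPC client as the production implementation is an assumption:

```go
// ResourcesMonitorDatapointsPusher replaces the bare function type, so IDE
// tooling can locate implementations.
type ResourcesMonitorDatapointsPusher interface {
	PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}
```

The generated `proto.DRPCAgentClient24` already exposes this method, so it satisfies the interface directly in production, and tests can supply a mock implementation instead.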
agent/resources_monitor.go
Outdated
	datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
Instead of passing a "fetcher", you should just pass the config and do the fetching before you call this function. It will make testing easier, and you can avoid defining `ResourcesMonitorConfigurationFetcher`, which is a real mouthful.
✅ changed too.
agent/resources_monitor.go
Outdated
		return xerrors.Errorf("failed to create resources fetcher: %w", err)
	}

	datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))
I think this would be a lot more readable with a `ResourceMonitor` struct that includes things like the resources fetcher and queue as members, instead of this nested function style. It makes it hard to keep track of the scope of things, that is, which things are per tick and which last between ticks.
You set up the long-lived struct, then start the ticker calling a method on the struct. That way it's very clear what is part of the tick and what outlives it. It also means you don't have to pass the `fetcher` to helpers like `fetchResourceMonitoredVolume`.
I reorganised things quite a bit based on the comments (sketched below), so now:
- the resources monitor logic lives in its own package
- it is built around a struct and methods, making it a proper entity
- struct members store the queue & config

🙏
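For reference, a rough sketch of that shape, assuming a `resourcesmonitor` package; the `Fetcher` method set, the `MemoryUsage` field names, and the exact config layout (`CollectionIntervalSeconds` under `Config`) are assumptions, not the PR's actual code:

```go
package resourcesmonitor

import (
	"context"
	"time"

	"cdr.dev/slog"

	"github.com/coder/coder/v2/agent/proto"
	"github.com/coder/quartz"
)

// Fetcher abstracts the clistat-backed collection so tests can mock it.
// This method set is an assumption for illustration.
type Fetcher interface {
	Memory() (total int64, used int64, err error)
}

// Monitor holds everything that outlives a single tick: the config, the
// datapoint buffer, the fetcher, and the DRPC client used to push datapoints.
type Monitor struct {
	logger  slog.Logger
	clock   quartz.Clock
	config  *proto.GetResourcesMonitoringConfigurationResponse
	fetcher Fetcher
	pusher  proto.DRPCAgentClient24
	queue   []*proto.PushResourcesMonitoringUsageRequest_Datapoint
}

// Start runs the ticker; anything created inside the callback is per-tick.
func (m *Monitor) Start(ctx context.Context) error {
	interval := time.Duration(m.config.Config.CollectionIntervalSeconds) * time.Second
	return m.clock.TickerFunc(ctx, interval, func() error {
		m.queue = append(m.queue, m.collect(ctx))
		if len(m.queue) > int(m.config.Config.NumDatapoints) {
			m.queue = m.queue[1:] // drop the oldest datapoint
		}
		if len(m.queue) == int(m.config.Config.NumDatapoints) {
			if _, err := m.pusher.PushResourcesMonitoringUsage(ctx, &proto.PushResourcesMonitoringUsageRequest{
				Datapoints: m.queue,
			}); err != nil {
				// Log rather than return: returning an error would stop the ticker.
				m.logger.Error(ctx, "push resources monitoring usage", slog.Error(err))
			}
		}
		return nil
	}, "resources_monitor").Wait()
}

// collect builds a single datapoint; a failed read is logged and left nil so
// coderd can classify it as UNKNOWN.
func (m *Monitor) collect(ctx context.Context) *proto.PushResourcesMonitoringUsageRequest_Datapoint {
	dp := &proto.PushResourcesMonitoringUsageRequest_Datapoint{}
	total, used, err := m.fetcher.Memory()
	if err != nil {
		m.logger.Error(ctx, "fetch memory usage", slog.Error(err))
		return dp
	}
	dp.Memory = &proto.PushResourcesMonitoringUsageRequest_Datapoint_MemoryUsage{
		Total: total,
		Used:  used,
	}
	return dp
}
```

The optimisation agreed later in this thread (shipping payloads even before the buffer fills up) would only change the `len(m.queue) ==` condition.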
agent/resources_monitor.go
Outdated
@@ -0,0 +1,132 @@
package agent
I think Go style would be to move this into its own package so you don't have to include some variant of "resource" and "monitor" in the name of everything to keep things clear.
I moved it to its own package ✅ it should indeed simplify the naming a lot.
}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	agent, err := a.AgentFn(ctx)
nit:
I think we could avoid using the `AgentFn` pattern here. We only need the agent's ID, but this function makes a call to the database, using the agent's ID, to fetch the agent. We could instead give `ResourcesMonitoringAPI` an `AgentID` field and just pass in the `AgentID` from `Options`.
A bit surprised there's such a simple option, but it makes sense - changed. Thanks! ✅
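For reference, a small sketch of that shape - only the `AgentID` field comes from the discussion; the other fields and the package layout are assumptions:

```go
package agentapi

import (
	"context"

	"github.com/google/uuid"

	"cdr.dev/slog"

	"github.com/coder/coder/v2/agent/proto"
	"github.com/coder/coder/v2/coderd/database"
)

// ResourcesMonitoringAPI carries the agent ID directly instead of resolving
// the agent through AgentFn.
type ResourcesMonitoringAPI struct {
	AgentID  uuid.UUID
	Database database.Store
	Log      slog.Logger
}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	// a.AgentID is injected from Options when the API is constructed, so no
	// database round-trip is needed just to learn which agent we serve; the
	// monitor rows for a.AgentID can be queried here directly.
	return &proto.GetResourcesMonitoringConfigurationResponse{}, nil
}
```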
@dannykopping @spikecurtis thanks for the first review. I changed and iterated on most of the things except the RBAC logic, as it is a whole topic on its own and I don't want to block the rest. Can I ask you for another review (without RBAC), to make sure we are all good on the overall logic?
Thanks!
I'm happy to give another review, but I consider the RBAC to be a blocker to merging.
Thanks @spikecurtis for the other review. I think we reached something much better here:
- Provisionerd can add entries
I still have concerns about how this data collection is implemented.

`GetResourcesMonitoringConfiguration` specifies that a collection will occur every 10 seconds. Datapoints will be added to the queue, and once the queue fills up it'll send the contents of the queue to coderd. Once the queue is full it never empties, so it'll just shift the oldest element off to accommodate a new one.

`CollectionIntervalSeconds` defines both the collection AND the send interval, which means that once the queue is full it'll send the full contents every `CollectionIntervalSeconds`, but...

`CollectionIntervalSeconds` is 10s by default, which is too slow for datapoint collection; the point of the system we discussed in the RFC was to collect often (maybe even once per second) and to buffer those datapoints, sending every so often so as to not overwhelm coderd. This implementation means we're sending the same datapoint `NumDatapoints-1` times to coderd. I don't see the point in this.

Regarding the processor: with this design it would take up to 200s / 3.3m (20 datapoints collected 10s apart) to go from an unhealthy state to a healthy state; that will not feel responsive, and we can do much better!

In another comment I described my reservations about this design concerning failed collections; from the answer it seems like coderd will be inferring those failed collections by noticing gaps in the data. How we'll get this precise I don't know (since we're not collecting timestamps with each datapoint), and it seems unnecessary since we already know at the source when this has occurred. Currently we're just skipping over these failed datapoints in the agent with a log and leaving coderd to guess at what happened; this doesn't seem like a resilient design to me.

@defelmnq pointed out that an empty datapoint would be sent if the collection failed (I missed this on line 73). coderd will also look at the monitoring config, determine that the datapoint is blank but shouldn't be, and classify it as `UNKNOWN`. I think it could've been more explicit from the agent's perspective, but this approach will work fine.
{
	// If one of the resources fails to be fetched, the datapoints should still be pushed with the other resources.
	name:   "ErrorFetchingVolume",
	config: &proto.GetResourcesMonitoringConfigurationResponse{
Nit: since these all have the same config, consider creating a const to improve readability.
Alternatively (I'd prefer this) you should change these configs up so that you avoid accidentally overfitting for this specific config.
// We expect one call per tick once the queue has reached config.NumDatapoints.
expectedCalls := tt.numTicks - int(tt.config.Config.NumDatapoints) + 1
require.Equal(t, expectedCalls, counterCalls)
cancel()
Nit: I don't think you need this here since you have a `defer cancel()` already, which will be invoked after this.
require.Len(t, req.Datapoints, 20)
require.Nil(t, req.Datapoints[0].Memory)
require.NotNil(t, req.Datapoints[0].Volumes)
require.Equal(t, &proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage{
	Volume: "/",
	Total:  100000,
	Used:   50000,
}, req.Datapoints[0].Volumes[0])
Nit: this feels awkward; the assertions should be happening in the test body, not the test case definitions.
Don't we always want to check that the returned values match expectations?
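One way to apply this, sketched against a hypothetical `runMonitor` helper and illustrative table fields (not the PR's actual test code):

```go
tests := []struct {
	name            string
	config          *proto.GetResourcesMonitoringConfigurationResponse
	wantDatapoints  int
	wantNilMemory   bool
	wantFirstVolume *proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage
}{
	// Cases hold only inputs and expected values; no require calls in here.
}

for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		t.Parallel()
		// runMonitor is a hypothetical helper that runs the monitor with the
		// given config and returns the last pushed request.
		req := runMonitor(t, tt.config)

		// All assertions live in the test body, so every case is checked the
		// same way.
		require.Len(t, req.Datapoints, tt.wantDatapoints)
		if tt.wantNilMemory {
			require.Nil(t, req.Datapoints[0].Memory)
		}
		if tt.wantFirstVolume != nil {
			require.Equal(t, tt.wantFirstVolume, req.Datapoints[0].Volumes[0])
		}
	})
}
```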
RBAC stuff looks OK now.
OK, @defelmnq and I had discussed the implementation and I'm happy to unblock this as it stands.
What I thought we were building and what I agreed to in the RFC were different, and for that I take responsibility. Apologies for the inertia this caused on this review!
One optimisation we agreed to do now is to start shipping the payloads even before the buffer fills up, so we can alert early if errors are present straight away.
For later:
My argument is that we should collect faster and send at the ~same rate, to achieve a higher resolution measurement and consequently have more reactive alerts / status updates.
To achieve this, we would need to send at least half (but probably all) of the previous `NumDatapoints` datapoints to ensure we can satisfy the 3 rules, which may only be true across the current + previous payloads (a sketch of such an evaluation follows the list):
- 4 datapoints in a row above threshold = `NOK`
- any 10 datapoints in a `NumDatapoints` payload above threshold = `NOK`
- all `NumDatapoints` within threshold = `OK`
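For later reference, a hedged sketch of how those three rules could be evaluated over a window of per-datapoint booleans (true = above threshold); this is hypothetical coderd-side logic, not code from this PR, and the behaviour for windows matching none of the rules is not specified by them:

```go
type state int

const (
	stateUnknown state = iota // none of the three rules matched
	stateOK
	stateNOK
)

func evaluate(aboveThreshold []bool) state {
	consecutive, maxConsecutive, total := 0, 0, 0
	for _, above := range aboveThreshold {
		if above {
			consecutive++
			total++
			if consecutive > maxConsecutive {
				maxConsecutive = consecutive
			}
		} else {
			consecutive = 0
		}
	}
	switch {
	case maxConsecutive >= 4: // 4 datapoints in a row above threshold
		return stateNOK
	case total >= 10: // any 10 datapoints in the payload above threshold
		return stateNOK
	case total == 0: // all datapoints within threshold
		return stateOK
	default:
		return stateUnknown
	}
}
```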
As part of the new resources monitoring logic - more specifically for OOM & OOD notifications - we need to update the AgentAPI and the agent's logic.
This PR aims to do that. More specifically:
We are updating the AgentAPI & TailnetAPI to version 24 to add two new methods to the AgentAPI: GetResourcesMonitoringConfiguration and PushResourcesMonitoringUsage.
This PR also adds new logic on the agent side, with a routine running and ticking - fetching the resources usage on each tick and storing it in a FIFO-like queue (a sketch of such a queue is included below).
Finally, this PR fixes a problem we had with the RBAC logic on the resources monitoring model, applying the same logic as we have for similar entities.
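For reference, a minimal sketch of what such a FIFO-like queue can look like (the real `NewResourcesMonitorQueue` in the PR may differ in detail):

```go
// Queue buffers the last `capacity` datapoints, dropping the oldest one once
// full, so it behaves as a sliding window over recent collections.
type Queue struct {
	capacity int
	items    []*proto.PushResourcesMonitoringUsageRequest_Datapoint
}

func NewQueue(capacity int) *Queue {
	return &Queue{capacity: capacity}
}

func (q *Queue) Push(dp *proto.PushResourcesMonitoringUsageRequest_Datapoint) {
	q.items = append(q.items, dp)
	if len(q.items) > q.capacity {
		q.items = q.items[1:] // drop the oldest datapoint
	}
}

func (q *Queue) IsFull() bool { return len(q.items) == q.capacity }

func (q *Queue) Items() []*proto.PushResourcesMonitoringUsageRequest_Datapoint {
	return q.items
}
```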