feat: integrate agentAPI with resources monitoring logic #16438

Conversation
@@ -295,6 +295,42 @@ message Timing {
  Status status = 6;
}

message GetResourcesMonitoringConfigurationRequest {
This message is based on what is described in the RFC; it has been mutually discussed and approved by Vincent and @DanielleMaywood.
agent/resources_monitor.go
Outdated
"github.com/coder/quartz" | ||
) | ||
|
||
func (a *agent) pushResourcesMonitoring(ctx context.Context, aAPI proto.DRPCAgentClient24) error { |
Self review: this intermediate function adds an extra layer to simplify testing. Since all arguments are injected as parameters to PushResourcesMonitoringWithConfig, we can pass in the mocked versions.
agent/resources_monitor.go
Outdated
	configFetcher ResourcesMonitorConfigurationFetcher,
	datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
Self review: there are three situations in which we do not want to run the goroutine (see the sketch below):
- We cannot fetch the configuration from coderd.
- Resources monitoring is not enabled.
- We cannot instantiate the clistat client (the one used to fetch resources usage).
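For context, a minimal sketch of those guards - the `Enabled` config field, the `newResourcesFetcher` wrapper around clistat, and `startResourcesMonitoring` are assumptions for illustration, not the PR's exact code:

```go
func (a *agent) pushResourcesMonitoring(ctx context.Context, aAPI proto.DRPCAgentClient24) error {
	// 1. Bail out if the configuration cannot be fetched from coderd.
	config, err := aAPI.GetResourcesMonitoringConfiguration(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
		return xerrors.Errorf("get resources monitoring configuration: %w", err)
	}

	// 2. Do nothing if resources monitoring is not enabled for this agent.
	// (Enabled is assumed here as the config flag checked by the PR.)
	if !config.GetConfig().GetEnabled() {
		return nil
	}

	// 3. Bail out if the clistat-backed fetcher cannot be instantiated, since
	// it is the component used to read resources usage.
	fetcher, err := newResourcesFetcher() // hypothetical wrapper around the clistat client
	if err != nil {
		return xerrors.Errorf("create resources fetcher: %w", err)
	}

	// Hand off to the ticking goroutine (hypothetical signature).
	return startResourcesMonitoring(ctx, aAPI, config, fetcher)
}
```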
	}, nil
}

func (a *ResourcesMonitoringAPI) PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error) {
Self review: this function is just a placeholder - we are fine keeping it this way for now, as it will be replaced by @DanielleMaywood's PR.
agent/resources_monitor.go
Outdated
	datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))

	clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval*int32(time.Second)), func() error {
Self review: we prefer to log instead of returning an error when a problem occurs. With quartz's `TickerFunc`, returning an error would shut down the whole ticker and stop it; returning nil ends the current tick, but the next one still fires.
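For illustration, a minimal sketch of that pattern inside the `TickerFunc` callback; `fetchDatapoint` and `logger` are hypothetical here, while `datapointsQueue` and the ticker call mirror the diff above:

```go
clk.TickerFunc(ctx, time.Duration(config.Config.TickInterval)*time.Second, func() error {
	datapoint, err := fetchDatapoint(ctx) // hypothetical helper
	if err != nil {
		// Returning this error would shut the whole ticker down, so log it
		// instead; the next tick still fires and can retry the fetch.
		logger.Error(ctx, "fetch resources datapoint", slog.Error(err))
		return nil
	}
	datapointsQueue.Push(datapoint)
	return nil
}, "resources_monitor")
```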
@spikecurtis I'd love to have your review - especially on the whole AgentAPI / TailnetAPI part - I want to be sure I'm bumping the version as we should.
Generally looking good, but I'm concerned that this doesn't match the RFC. See my comment in agent/resources_monitor.go.
agent/resources_monitor_test.go
Outdated
func TestPushResourcesMonitoringWithConfig(t *testing.T) {
	t.Parallel()

	tests := []struct {
Looking at the code coverage, we're missing a few test-cases here:
- disabled config pushes nothing
- errors when fetching memory/volume info
- errors when pushing to coderd
I improved the testability of this block a lot by creating a struct for the resources_fetcher part, so we can mock it.
I also added test cases for all the missing scenarios. It should be much better now.
agent/resources_monitor.go
Outdated
	ResourcesMonitorDatapointsPusher func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
)

func PushResourcesMonitoringWithConfig(ctx context.Context,
This implementation does not seem to match the RFC design.
Once the buffer fills up, it'll send the full buffer to the control plane. However, it'll keep doing so every `TickInterval` thereafter, since the buffer only ever replaces the oldest value.
The RFC specifies that the payload should only be delivered every 15s for the 20 datapoints; this code performs collection and submission in the same ticker, with no distinction between them.
We also don't seem to be adding the `UNKNOWN` datapoints when values cannot be retrieved.
Was this a conscious choice? Is the RFC out-of-date, perhaps?
About the `UNKNOWN` status - this is handled directly on the coderd side.
It will be visible in the part currently being done by Danielle - from the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decide whether the resource, for the given datapoint, is `OK`, `NOK` or `UNKNOWN`.
About the tick logic - that's maybe a misunderstanding in the RFC, but when I defined it to be delivered every 15s, it was based on `TickInterval` being set to 15. Having both `TickInterval` and another value set independently would require some sync to avoid either sending the datapoints twice without a refresh, or fetching datapoints too frequently and therefore not pushing data to coderd on each tick.
TL;DR - I think the best solution is to have `TickInterval` handle both the refresh of resources and the push to coderd, so we are sure each push always carries the latest data.
> About the UNKNOWN status - this is handled directly on the coderd side.
> It will be visible in the part currently being done by Danielle - from the configuration stored in the DB and what we return on the agent side, coderd (and the processing component) decide whether the resource, for the given datapoint, is OK, NOK or UNKNOWN.

Can you point me to some specific code I can look at, please? Will you be inferring that a datapoint was missed by looking at the time of each datapoint? I wonder why we opted for this approach rather than explicitly sending a noop datapoint to indicate a problem.

> TL;DR - I think the best solution is to have TickInterval handle both the refresh of resources and the push to coderd, so we are sure each push always carries the latest data.

This doesn't really address what I'm asking, though.
We're going to be sending all n datapoints every `TickInterval` once the queue fills up, since we never drain the queue but only shift off the oldest element.
Proto version changes look overall fine, but I want to review again re: RBAC
agent/resources_monitor.go
Outdated
type (
	ResourcesMonitorConfigurationFetcher func(ctx context.Context, params *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error)
	ResourcesMonitorDatapointsPusher     func(ctx context.Context, params *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
This should be an interface. With a bare function type, IDE tooling has a hard time finding implementations of this function. If you make it an interface, you can find implementations and see what this is.
I removed the `func` types and `ResourcesMonitorConfigurationFetcher`, and switched to an interface - it should be much better this way. 👀
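For reference, roughly the interface shape this converges on - the name and method mirror the diff, while treating the generated DRPC client as the production implementation is an assumption:

```go
// ResourcesMonitorDatapointsPusher replaces the bare function type, so IDE
// tooling can locate implementations.
type ResourcesMonitorDatapointsPusher interface {
	PushResourcesMonitoringUsage(ctx context.Context, req *proto.PushResourcesMonitoringUsageRequest) (*proto.PushResourcesMonitoringUsageResponse, error)
}
```

The generated `proto.DRPCAgentClient24` already exposes this method, so it satisfies the interface directly in production, and tests can supply a mock implementation instead.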
agent/resources_monitor.go
Outdated
	datapointsPusher ResourcesMonitorDatapointsPusher,
) error {
	config, err := configFetcher(ctx, &proto.GetResourcesMonitoringConfigurationRequest{})
	if err != nil {
Instead of passing a "fetcher", you should just pass the config and do the fetching before you call this function. It will make testing easier, and you can avoid defining `ResourcesMonitorConfigurationFetcher`, which is a real mouthful.
✅ changed too.
agent/resources_monitor.go
Outdated
		return xerrors.Errorf("failed to create resources fetcher: %w", err)
	}

	datapointsQueue := NewResourcesMonitorQueue(int(config.Config.NumDatapoints))
I think this would be a lot more readable with a `ResourceMonitor` struct that includes things like the resources fetcher and queue as members, instead of this nested function style. It makes it hard to keep track of the scope of things, that is, which things are per tick and which last between ticks.
You set up the long-lived struct, then start the ticker calling a method on the struct. That way it's very clear what is part of the tick and what outlives it. It also means you don't have to pass the `fetcher` to helpers like `fetchResourceMonitoredVolume`.
I reorganised things quite a bit based on the comments (sketched below), so now:
- the resources monitor logic lives in its own package
- it is built around a struct and methods, making it a proper entity
- struct members store the queue & config

🙏
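For reference, a rough sketch of that shape, assuming a `resourcesmonitor` package; the `Fetcher` method set, the `MemoryUsage` field names, and the exact config layout (`CollectionIntervalSeconds` under `Config`) are assumptions, not the PR's actual code:

```go
package resourcesmonitor

import (
	"context"
	"time"

	"cdr.dev/slog"

	"github.com/coder/coder/v2/agent/proto"
	"github.com/coder/quartz"
)

// Fetcher abstracts the clistat-backed collection so tests can mock it.
// This method set is an assumption for illustration.
type Fetcher interface {
	Memory() (total int64, used int64, err error)
}

// Monitor holds everything that outlives a single tick: the config, the
// datapoint buffer, the fetcher, and the DRPC client used to push datapoints.
type Monitor struct {
	logger  slog.Logger
	clock   quartz.Clock
	config  *proto.GetResourcesMonitoringConfigurationResponse
	fetcher Fetcher
	pusher  proto.DRPCAgentClient24
	queue   []*proto.PushResourcesMonitoringUsageRequest_Datapoint
}

// Start runs the ticker; anything created inside the callback is per-tick.
func (m *Monitor) Start(ctx context.Context) error {
	interval := time.Duration(m.config.Config.CollectionIntervalSeconds) * time.Second
	return m.clock.TickerFunc(ctx, interval, func() error {
		m.queue = append(m.queue, m.collect(ctx))
		if len(m.queue) > int(m.config.Config.NumDatapoints) {
			m.queue = m.queue[1:] // drop the oldest datapoint
		}
		if len(m.queue) == int(m.config.Config.NumDatapoints) {
			if _, err := m.pusher.PushResourcesMonitoringUsage(ctx, &proto.PushResourcesMonitoringUsageRequest{
				Datapoints: m.queue,
			}); err != nil {
				// Log rather than return: returning an error would stop the ticker.
				m.logger.Error(ctx, "push resources monitoring usage", slog.Error(err))
			}
		}
		return nil
	}, "resources_monitor").Wait()
}

// collect builds a single datapoint; a failed read is logged and left nil so
// coderd can classify it as UNKNOWN.
func (m *Monitor) collect(ctx context.Context) *proto.PushResourcesMonitoringUsageRequest_Datapoint {
	dp := &proto.PushResourcesMonitoringUsageRequest_Datapoint{}
	total, used, err := m.fetcher.Memory()
	if err != nil {
		m.logger.Error(ctx, "fetch memory usage", slog.Error(err))
		return dp
	}
	dp.Memory = &proto.PushResourcesMonitoringUsageRequest_Datapoint_MemoryUsage{
		Total: total,
		Used:  used,
	}
	return dp
}
```

The optimisation agreed later in this thread (shipping payloads even before the buffer fills up) would only change the `len(m.queue) ==` condition.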
agent/resources_monitor.go
Outdated
@@ -0,0 +1,132 @@
package agent
I think Go style would be to move this into its own package so you don't have to include some variant of "resource" and "monitor" in the name of everything to keep things clear.
I moved it to its own package ✅ it should indeed simplify the naming a lot.
}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	agent, err := a.AgentFn(ctx)
nit:
I think we could avoid using the `AgentFn` pattern here. We only need the agent's ID, but this function makes a call to the database, using the agent's ID, to fetch the agent. We could instead give `ResourcesMonitoringAPI` an `AgentID` field and just pass in the `AgentID` from `Options`.
A bit surprised there's such a simple option, but it makes sense - changed. Thanks! ✅
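For reference, a small sketch of that shape - only the `AgentID` field comes from the discussion; the other fields and the package layout are assumptions:

```go
package agentapi

import (
	"context"

	"github.com/google/uuid"

	"cdr.dev/slog"

	"github.com/coder/coder/v2/agent/proto"
	"github.com/coder/coder/v2/coderd/database"
)

// ResourcesMonitoringAPI carries the agent ID directly instead of resolving
// the agent through AgentFn.
type ResourcesMonitoringAPI struct {
	AgentID  uuid.UUID
	Database database.Store
	Log      slog.Logger
}

func (a *ResourcesMonitoringAPI) GetResourcesMonitoringConfiguration(ctx context.Context, _ *proto.GetResourcesMonitoringConfigurationRequest) (*proto.GetResourcesMonitoringConfigurationResponse, error) {
	// a.AgentID is injected from Options when the API is constructed, so no
	// database round-trip is needed just to learn which agent we serve; the
	// monitor rows for a.AgentID can be queried here directly.
	return &proto.GetResourcesMonitoringConfigurationResponse{}, nil
}
```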
@dannykopping @spikecurtis thanks for the first review. I changed and iterated on most of the things except the RBAC logic, as it is a whole topic on its own and I don't want to block the rest. Can I ask you for another review (without RBAC), to make sure we are all good on the overall logic?
Thanks!
I'm happy to give another review, but I consider the RBAC to be a blocker to merging.
Thanks @spikecurtis for the other review. I think we reached something much better here:
- Provisionerd can add entries
I still have concerns about how this data collection is implemented.

`GetResourcesMonitoringConfiguration` specifies that a collection will occur every 10 seconds. Datapoints will be added to the queue, and once the queue fills up it'll send the contents of the queue to coderd. Once the queue is full it never empties, so it'll just shift the oldest element off to accommodate a new one.

`CollectionIntervalSeconds` defines both the collection AND the send interval, which means that once the queue is full it'll send the full contents every `CollectionIntervalSeconds`, but...

`CollectionIntervalSeconds` is 10s by default, which is too slow for datapoint collection; the point of the system we discussed in the RFC was to collect often (maybe even once per second) and to buffer those datapoints, sending every so often so as to not overwhelm coderd. This implementation means we're sending the same datapoint `NumDatapoints-1` times to coderd. I don't see the point in this.

Regarding the processor: with this design it would take up to 200s / 3.3m (20 datapoints collected 10s apart) to go from an unhealthy state to a healthy state; that will not feel responsive, and we can do much better!

In another comment I described my reservations about this design concerning failed collections; from the answer it seems like coderd will be inferring those failed collections by noticing gaps in the data. How we'll get this precise I don't know (since we're not collecting timestamps with each datapoint), and it seems unnecessary since we already know at the source when this has occurred. Currently we're just skipping over these failed datapoints in the agent with a log and leaving coderd to guess at what happened; this doesn't seem like a resilient design to me.

@defelmnq pointed out that an empty datapoint would be sent if the collection failed (I missed this on line 73). coderd will also look at the monitoring config, determine that the datapoint is blank but shouldn't be, and classify it as `UNKNOWN`. I think it could've been more explicit from the agent's perspective, but this approach will work fine.
{
	// If one of the resources fails to be fetched, the datapoints should still be pushed with the other resources.
	name:   "ErrorFetchingVolume",
	config: &proto.GetResourcesMonitoringConfigurationResponse{
Nit: since these all have the same config, consider creating a const to improve readability.
Alternatively (I'd prefer this) you should change these configs up so that you avoid accidentally overfitting for this specific config.
// We expect one call per tick once the queue has reached config.NumDatapoints.
expectedCalls := tt.numTicks - int(tt.config.Config.NumDatapoints) + 1
require.Equal(t, expectedCalls, counterCalls)
cancel()
Nit: I don't think you need this here since you have a `defer cancel()` already, which will be invoked after this.
require.Len(t, req.Datapoints, 20)
require.Nil(t, req.Datapoints[0].Memory)
require.NotNil(t, req.Datapoints[0].Volumes)
require.Equal(t, &proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage{
	Volume: "/",
	Total:  100000,
	Used:   50000,
}, req.Datapoints[0].Volumes[0])
Nit: this feels awkward; the assertions should be happening in the test body, not the test case definitions.
Don't we always want to check that the returned values match expectations?
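One way to apply this, sketched against a hypothetical `runMonitor` helper and illustrative table fields (not the PR's actual test code):

```go
tests := []struct {
	name            string
	config          *proto.GetResourcesMonitoringConfigurationResponse
	wantDatapoints  int
	wantNilMemory   bool
	wantFirstVolume *proto.PushResourcesMonitoringUsageRequest_Datapoint_VolumeUsage
}{
	// Cases hold only inputs and expected values; no require calls in here.
}

for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		t.Parallel()
		// runMonitor is a hypothetical helper that runs the monitor with the
		// given config and returns the last pushed request.
		req := runMonitor(t, tt.config)

		// All assertions live in the test body, so every case is checked the
		// same way.
		require.Len(t, req.Datapoints, tt.wantDatapoints)
		if tt.wantNilMemory {
			require.Nil(t, req.Datapoints[0].Memory)
		}
		if tt.wantFirstVolume != nil {
			require.Equal(t, tt.wantFirstVolume, req.Datapoints[0].Volumes[0])
		}
	})
}
```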
RBAC stuff looks OK now.
OK, @defelmnq and I had discussed the implementation and I'm happy to unblock this as it stands.
What I thought we were building and what I agreed to in the RFC were different, and for that I take responsibility. Apologies for the inertia this caused on this review!
One optimisation we agreed to do now is to start shipping the payloads even before the buffer fills up, so we can alert early if errors are present straight away.
For later:
My argument is that we should collect faster and send at the ~same rate, to achieve a higher resolution measurement and consequently have more reactive alerts / status updates.
To achieve this, we would need to send at least half (but probably all) of the previous `NumDatapoints` datapoints to ensure we can satisfy the 3 rules, which may only be true across the current + previous payloads (a sketch of such an evaluation follows the list):
- 4 datapoints in a row above threshold = `NOK`
- any 10 datapoints in a `NumDatapoints` payload above threshold = `NOK`
- all `NumDatapoints` within threshold = `OK`
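For later reference, a hedged sketch of how those three rules could be evaluated over a window of per-datapoint booleans (true = above threshold); this is hypothetical coderd-side logic, not code from this PR, and the behaviour for windows matching none of the rules is not specified by them:

```go
type state int

const (
	stateUnknown state = iota // none of the three rules matched
	stateOK
	stateNOK
)

func evaluate(aboveThreshold []bool) state {
	consecutive, maxConsecutive, total := 0, 0, 0
	for _, above := range aboveThreshold {
		if above {
			consecutive++
			total++
			if consecutive > maxConsecutive {
				maxConsecutive = consecutive
			}
		} else {
			consecutive = 0
		}
	}
	switch {
	case maxConsecutive >= 4: // 4 datapoints in a row above threshold
		return stateNOK
	case total >= 10: // any 10 datapoints in the payload above threshold
		return stateNOK
	case total == 0: // all datapoints within threshold
		return stateOK
	default:
		return stateUnknown
	}
}
```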
As part of the new resources monitoring logic - more specifically for OOM & OOD notifications - we need to update the AgentAPI and the agent's logic.
This PR aims to do that. More specifically:
We are updating the AgentAPI & TailnetAPI to version 24 to add two new methods to the AgentAPI: GetResourcesMonitoringConfiguration and PushResourcesMonitoringUsage.
This PR also adds new logic on the agent side, with a routine running and ticking - fetching the resources usage on each tick and storing it in a FIFO-like queue (a sketch of such a queue is included below).
Finally, this PR fixes a problem we had with the RBAC logic on the resources monitoring model, applying the same logic as we have for similar entities.
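For reference, a minimal sketch of what such a FIFO-like queue can look like (the real `NewResourcesMonitorQueue` in the PR may differ in detail):

```go
// Queue buffers the last `capacity` datapoints, dropping the oldest one once
// full, so it behaves as a sliding window over recent collections.
type Queue struct {
	capacity int
	items    []*proto.PushResourcesMonitoringUsageRequest_Datapoint
}

func NewQueue(capacity int) *Queue {
	return &Queue{capacity: capacity}
}

func (q *Queue) Push(dp *proto.PushResourcesMonitoringUsageRequest_Datapoint) {
	q.items = append(q.items, dp)
	if len(q.items) > q.capacity {
		q.items = q.items[1:] // drop the oldest datapoint
	}
}

func (q *Queue) IsFull() bool { return len(q.items) == q.capacity }

func (q *Queue) Items() []*proto.PushResourcesMonitoringUsageRequest_Datapoint {
	return q.items
}
```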