-
Notifications
You must be signed in to change notification settings - Fork 874
feat: implement reconciliation loop #17261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement reconciliation loop #17261
Conversation
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
appeasing linter Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
This reverts commit 97cc4ff.
} | ||
|
||
if !preset.UsingActiveVersion && len(ps.Running) == 0 && len(ps.InProgress) == 0 { | ||
logger.Debug(ctx, "skipping reconciliation for preset; inactive, no running prebuilds, and no in-progress operations", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This short-circuit is not necessary: ReconcilePreset should accurately calculate the actions.
I don't think it's a good idea to have this short circuit because you're splitting the business logic of reconciliation into two different places that do the same thing today. If we want to modify the logic, it will be easy to make the change only in one place and have inconsistent or undesired behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll remove it
cc @dannykopping
Btw: there are a bit similar check here: https://github.com/coder/coder/blob/yevhenii/510-reconciliation-loop-v2/enterprise/coderd/prebuilds/reconcile.go#L320-L329
Should we keep it or remove as well?
|
||
eg.Go(func() error { | ||
// Pass outer context. | ||
err = c.ReconcilePreset(ctx, *ps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems strange to me that we take the snapshot under the ReconciliationLock transaction, but then don't use that transaction for the individual reconciliation of presets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reconciliation of presets are independent. If reconciliation for one preset will fail - we don't want to rollback changes for another presets.
Few more thoughts:
We need ReconciliationLock
only to prevent other coderd replicas to execute reconciliation iterations
in parallel because we run Reconciliation Loop
on all replicas.
So I guess we don't really need transaction - we just need distributed lock. So we can use pg_advisory_lock
instead of pg_advisory_xact_lock
?
But I see that we don't use pg_advisory_lock
anywhere.
fyi: SnapshotState
also creates transaction because it maybe called outside ReconciliationLock
. But this tx will be ignored within reconciliation loop - because we don't support nested txs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @evgeniy-scherbina for addressing my comments. LGTM 👍
- I saw somebody raised an idea of internal tests, so that could address the problem. I’m cool with that.
- A separate reconciliation loop for presets is understandable, but there is a risk of increasing code complexity in favor of a very little gain. Time difference would be a matter of seconds, right? Non-blocking this PR, but wanted to confirm if I’m not missing anything.
- Thanks for the explanation! I admit I took into consideration slightly different assumptions, more focused on ensuring the final state. Since the current logic reduces build failures, I will not challenge it further.
// reconciliation loop will bring these resources back into their desired numbers in an EVENTUALLY-consistent way. | ||
// | ||
// For example: we could decide to provision 1 new instance in this reconciliation. | ||
// While that workspace is being provisioned, another template version is created which means this same preset will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should clarify what "this same preset" means a little more precisely, since the preset ID will strictly be different for new template versions.
fields := []any{ | ||
slog.F("action_type", actions.ActionType), | ||
slog.F("create_count", actions.Create), | ||
slog.F("delete_count", len(actions.DeleteIDs)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a pity that we don't log the current state of the preset here anymore. This will be helpful for us when debugging reconciliation problems. What is the reason why you removed these fields?
https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/prebuilds/reconcile.go#L307-L314
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll put it back. Initially I removed it because I splitted Actions & State structures. Basically I only need 3 fields to make a reconciliation decision:
- backoff
- create
- deleteIDs
I don't need 12 fields including outdated
, extraneous
, failed
, etc... Because it maybe a bit confusing for API consumer if I see structure like this:
fe60b56#diff-e435a5b0fb81846382c9401a9058a87a473f9f39eca0e884d1dd981c4d2885a1R29-R41
Imagine you're API consumer and you see:
outdated
,extraneous
,failed
,deleteIDs
What decision to make? Should you delete outdated
or failed
or deleteIDs
?
Obviously you can always read the source and understand how it works, but it means it's bad API if you need to do it.
It was my thoughts during refactoring, but maybe I'm not right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dannykopping I put it back.
@@ -0,0 +1,53 @@ | |||
package prebuilds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need at least some of the API in the AGPL code because we need the noop implementation "running" until a premium license is used. Other than that, I think most of this can shift to enterprise.
I believe I had it this way because AGPL cannot import enterprise but the reverse is allowed. The interfaces reference the types which are in this package, and if we moved those to enterprise then it would violate the license.
This is how the implementation will be swapped out at runtime based on the license:
https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/coderd.go#L1165-L1188
Stop(ctx context.Context, cause error) | ||
} | ||
|
||
type Reconciler interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this refactor in a follow-up? Let's try keep the momentum on this PR.
Co-authored-by: Spike Curtis <spike@coder.com>
Co-authored-by: Spike Curtis <spike@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
Closes coder/internal#510
Refactoring Summary
1)
CalculateActions
FunctionIssues Before Refactoring:
ReconciliationActions
struct was partially initialized early, then mutated in multiple places, making the flow error-prone.Original source:
coder/coderd/prebuilds/state.go
Lines 13 to 167 in fe60b56
Improvements After Refactoring:
Refactored function:
coder/coderd/prebuilds/state.go
Lines 67 to 84 in eeb0407
2)
ReconciliationActions
StructIssues Before Refactoring:
Improvements After Refactoring:
ReconciliationActions
— defines the intended reconciliation action.ReconciliationState
— captures runtime state and metadata, primarily for logging and diagnostics.Original struct:
coder/coderd/prebuilds/reconcile.go
Lines 29 to 41 in fe60b56