Skip to content

feat: implement reconciliation loop #17261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 158 commits into from
Apr 17, 2025

Conversation

evgeniy-scherbina
Copy link
Contributor

@evgeniy-scherbina evgeniy-scherbina commented Apr 4, 2025

Closes coder/internal#510

Refactoring Summary

1) CalculateActions Function

Issues Before Refactoring:

  • Large function (~150 lines), making it difficult to read and maintain.
  • The control flow is hard to follow due to complex conditional logic.
  • The ReconciliationActions struct was partially initialized early, then mutated in multiple places, making the flow error-prone.

Original source:

func (p PresetState) CalculateActions(clock quartz.Clock, backoffInterval time.Duration) (*ReconciliationActions, error) {
// TODO: align workspace states with how we represent them on the FE and the CLI
// right now there's some slight differences which can lead to additional prebuilds being created
// TODO: add mechanism to prevent prebuilds being reconciled from being claimable by users; i.e. if a prebuild is
// about to be deleted, it should not be deleted if it has been claimed - beware of TOCTOU races!
var (
actual int32 // Running prebuilds for active version.
desired int32 // Active template version's desired instances as defined in preset.
eligible int32 // Prebuilds which can be claimed.
outdated int32 // Prebuilds which no longer match the active template version.
extraneous int32 // Extra running prebuilds for active version (somehow).
starting, stopping, deleting int32 // Prebuilds currently being provisioned up or down.
)
if p.Preset.UsingActiveVersion {
actual = int32(len(p.Running))
desired = p.Preset.DesiredInstances.Int32
}
for _, prebuild := range p.Running {
if p.Preset.UsingActiveVersion {
if prebuild.Ready {
eligible++
}
extraneous = int32(math.Max(float64(actual-p.Preset.DesiredInstances.Int32), 0))
}
if prebuild.TemplateVersionID == p.Preset.TemplateVersionID && !p.Preset.UsingActiveVersion {
outdated++
}
}
// In-progress builds are common across all presets belonging to a given template.
// In other words: these values will be identical across all presets belonging to this template.
for _, progress := range p.InProgress {
num := progress.Count
switch progress.Transition {
case database.WorkspaceTransitionStart:
starting += num
case database.WorkspaceTransitionStop:
stopping += num
case database.WorkspaceTransitionDelete:
deleting += num
}
}
var (
toCreate = int(math.Max(0, float64(
desired-(actual+starting)), // The number of prebuilds currently being stopped (should be 0)
))
toDelete = int(math.Max(0, float64(
outdated- // The number of prebuilds running above the desired count for active version
deleting), // The number of prebuilds currently being deleted
))
actions = &ReconciliationActions{
Actual: actual,
Desired: desired,
Eligible: eligible,
Outdated: outdated,
Extraneous: extraneous,
Starting: starting,
Stopping: stopping,
Deleting: deleting,
}
)
// If the template has become deleted or deprecated since the last reconciliation, we need to ensure we
// scale those prebuilds down to zero.
if p.Preset.Deleted || p.Preset.Deprecated {
toCreate = 0
toDelete = int(actual + outdated)
actions.Desired = 0
}
// We backoff when the last build failed, to give the operator some time to investigate the issue and to not provision
// a tonne of prebuilds (_n_ on each reconciliation iteration).
if p.Backoff != nil && p.Backoff.NumFailed > 0 {
actions.Failed = p.Backoff.NumFailed
backoffUntil := p.Backoff.LastBuildAt.Add(time.Duration(p.Backoff.NumFailed) * backoffInterval)
if clock.Now().Before(backoffUntil) {
actions.Create = 0
actions.DeleteIDs = nil
actions.BackoffUntil = backoffUntil
// Return early here; we should not perform any reconciliation actions if we're in a backoff period.
return actions, nil
}
}
// It's possible that an operator could stop/start prebuilds which interfere with the reconciliation loop, so
// we check if there are somehow more prebuilds than we expect, and then pick random victims to be deleted.
if extraneous > 0 {
// Sort running IDs by creation time so we always delete the oldest prebuilds.
// In general, we want fresher prebuilds (imagine a mono-repo is cloned; newer is better).
slices.SortFunc(p.Running, func(a, b database.GetRunningPrebuiltWorkspacesRow) int {
if a.CreatedAt.Before(b.CreatedAt) {
return -1
}
if a.CreatedAt.After(b.CreatedAt) {
return 1
}
return 0
})
for i := 0; i < int(extraneous); i++ {
if i >= len(p.Running) {
// This should never happen.
// TODO: move up
// c.logger.Warn(ctx, "unexpected reconciliation state; extraneous count exceeds running prebuilds count!",
// slog.F("running_count", len(p.Running)),
// slog.F("extraneous", extraneous))
continue
}
actions.DeleteIDs = append(actions.DeleteIDs, p.Running[i].ID)
}
// TODO: move up
// c.logger.Warn(ctx, "found extra prebuilds running, picking random victim(s)",
// slog.F("template_id", p.Preset.TemplateID.String()), slog.F("desired", desired), slog.F("actual", actual), slog.F("extra", extraneous),
// slog.F("victims", victims))
// Prevent the rest of the reconciliation from completing
return actions, nil
}
actions.Create = int32(toCreate)
// if toDelete > 0 && len(p.Running) != toDelete {
// TODO: move up
// c.logger.Warn(ctx, "mismatch between running prebuilds and expected deletion count!",
// slog.F("template_id", s.preset.TemplateID.String()), slog.F("running", len(p.Running)), slog.F("to_delete", toDelete))
// }
// TODO: implement lookup to not perform same action on workspace multiple times in $period
// i.e. a workspace cannot be deleted for some reason, which continually makes it eligible for deletion
for i := 0; i < toDelete; i++ {
if i >= len(p.Running) {
// TODO: move up
// Above warning will have already addressed this.
continue
}
actions.DeleteIDs = append(actions.DeleteIDs, p.Running[i].ID)
}
return actions, nil
}

Improvements After Refactoring:

  • Simplified and broken down into smaller, focused helper methods.
  • The flow of the function is now more linear and easier to understand.
  • Struct initialization is cleaner, avoiding partial and incremental mutations.

Refactored function:

func (p PresetSnapshot) CalculateActions(clock quartz.Clock, backoffInterval time.Duration) (*ReconciliationActions, error) {
// TODO: align workspace states with how we represent them on the FE and the CLI
// right now there's some slight differences which can lead to additional prebuilds being created
// TODO: add mechanism to prevent prebuilds being reconciled from being claimable by users; i.e. if a prebuild is
// about to be deleted, it should not be deleted if it has been claimed - beware of TOCTOU races!
actions, needsBackoff := p.needsBackoffPeriod(clock, backoffInterval)
if needsBackoff {
return actions, nil
}
if !p.isActive() {
return p.handleInactiveTemplateVersion()
}
return p.handleActiveTemplateVersion()
}


2) ReconciliationActions Struct

Issues Before Refactoring:

  • The struct mixed both actionable decisions and diagnostic state, which blurred its purpose.
  • It was unclear which fields were necessary for reconciliation logic, and which were purely for logging/observability.

Improvements After Refactoring:

  • Split into two clear, purpose-specific structs:
    • ReconciliationActions — defines the intended reconciliation action.
    • ReconciliationState — captures runtime state and metadata, primarily for logging and diagnostics.

Original struct:

// ReconciliationActions represents the set of actions which must be taken to achieve the desired state for prebuilds.
type ReconciliationActions struct {
Actual int32 // Running prebuilds for active version.
Desired int32 // Active template version's desired instances as defined in preset.
Eligible int32 // Prebuilds which can be claimed.
Outdated int32 // Prebuilds which no longer match the active template version.
Extraneous int32 // Extra running prebuilds for active version (somehow).
Starting, Stopping, Deleting int32 // Prebuilds currently being provisioned up or down.
Failed int32 // Number of prebuilds which have failed in the past CODER_WORKSPACE_PREBUILDS_RECONCILIATION_BACKOFF_LOOKBACK_PERIOD.
Create int32 // The number of prebuilds required to be created to reconcile required state.
DeleteIDs []uuid.UUID // IDs of running prebuilds required to be deleted to reconcile required state.
BackoffUntil time.Time // The time to wait until before trying to provision a new prebuild.
}

SasSwart and others added 30 commits March 17, 2025 12:11
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
appeasing linter

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
}

if !preset.UsingActiveVersion && len(ps.Running) == 0 && len(ps.InProgress) == 0 {
logger.Debug(ctx, "skipping reconciliation for preset; inactive, no running prebuilds, and no in-progress operations",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This short-circuit is not necessary: ReconcilePreset should accurately calculate the actions.

I don't think it's a good idea to have this short circuit because you're splitting the business logic of reconciliation into two different places that do the same thing today. If we want to modify the logic, it will be easy to make the change only in one place and have inconsistent or undesired behavior.

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll remove it
cc @dannykopping

Btw: there are a bit similar check here: https://github.com/coder/coder/blob/yevhenii/510-reconciliation-loop-v2/enterprise/coderd/prebuilds/reconcile.go#L320-L329

Should we keep it or remove as well?


eg.Go(func() error {
// Pass outer context.
err = c.ReconcilePreset(ctx, *ps)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems strange to me that we take the snapshot under the ReconciliationLock transaction, but then don't use that transaction for the individual reconciliation of presets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconciliation of presets are independent. If reconciliation for one preset will fail - we don't want to rollback changes for another presets.


Few more thoughts:

We need ReconciliationLock only to prevent other coderd replicas to execute reconciliation iterations in parallel because we run Reconciliation Loop on all replicas.

So I guess we don't really need transaction - we just need distributed lock. So we can use pg_advisory_lock instead of pg_advisory_xact_lock?
But I see that we don't use pg_advisory_lock anywhere.


fyi: SnapshotState also creates transaction because it maybe called outside ReconciliationLock. But this tx will be ignored within reconciliation loop - because we don't support nested txs.

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @evgeniy-scherbina for addressing my comments. LGTM 👍

  1. I saw somebody raised an idea of internal tests, so that could address the problem. I’m cool with that.
  2. A separate reconciliation loop for presets is understandable, but there is a risk of increasing code complexity in favor of a very little gain. Time difference would be a matter of seconds, right? Non-blocking this PR, but wanted to confirm if I’m not missing anything.
  3. Thanks for the explanation! I admit I took into consideration slightly different assumptions, more focused on ensuring the final state. Since the current logic reduces build failures, I will not challenge it further.

// reconciliation loop will bring these resources back into their desired numbers in an EVENTUALLY-consistent way.
//
// For example: we could decide to provision 1 new instance in this reconciliation.
// While that workspace is being provisioned, another template version is created which means this same preset will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should clarify what "this same preset" means a little more precisely, since the preset ID will strictly be different for new template versions.

fields := []any{
slog.F("action_type", actions.ActionType),
slog.F("create_count", actions.Create),
slog.F("delete_count", len(actions.DeleteIDs)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a pity that we don't log the current state of the preset here anymore. This will be helpful for us when debugging reconciliation problems. What is the reason why you removed these fields?

https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/prebuilds/reconcile.go#L307-L314

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll put it back. Initially I removed it because I splitted Actions & State structures. Basically I only need 3 fields to make a reconciliation decision:

  • backoff
  • create
  • deleteIDs

I don't need 12 fields including outdated, extraneous, failed, etc... Because it maybe a bit confusing for API consumer if I see structure like this:
fe60b56#diff-e435a5b0fb81846382c9401a9058a87a473f9f39eca0e884d1dd981c4d2885a1R29-R41

Imagine you're API consumer and you see:

  • outdated, extraneous, failed, deleteIDs

What decision to make? Should you delete outdated or failed or deleteIDs?
Obviously you can always read the source and understand how it works, but it means it's bad API if you need to do it.


It was my thoughts during refactoring, but maybe I'm not right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dannykopping I put it back.

@@ -0,0 +1,53 @@
package prebuilds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need at least some of the API in the AGPL code because we need the noop implementation "running" until a premium license is used. Other than that, I think most of this can shift to enterprise.

I believe I had it this way because AGPL cannot import enterprise but the reverse is allowed. The interfaces reference the types which are in this package, and if we moved those to enterprise then it would violate the license.

This is how the implementation will be swapped out at runtime based on the license:
https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/coderd.go#L1165-L1188

Stop(ctx context.Context, cause error)
}

type Reconciler interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this refactor in a follow-up? Let's try keep the momentum on this PR.

@evgeniy-scherbina evgeniy-scherbina merged commit 27bc60d into main Apr 17, 2025
37 checks passed
@evgeniy-scherbina evgeniy-scherbina deleted the yevhenii/510-reconciliation-loop-v2 branch April 17, 2025 13:29
@github-actions github-actions bot locked and limited conversation to collaborators Apr 17, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prebuild Reconciliation Loop
7 participants