-
Notifications
You must be signed in to change notification settings - Fork 875
feat: implement reconciliation loop #17261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
300e80f
b788237
8122595
bfb7c28
6639167
769ae1d
e7e9c27
3936047
8bdcafb
412d198
51773ec
23773c2
bc3ff44
7ff747e
baa3076
ed14fb3
3cc74fb
fc32154
d7b4ec4
e8b53f7
9df6554
ccc309e
ee1f16a
d040ddd
83a6722
cd70710
97cc4ff
4d59039
205d6af
e489e1b
1b29686
20470e4
7b9c8ce
e189a0b
692c0e5
f747db0
3166a42
aa6b490
bc4e7d2
f167b92
8fd34ab
7a8ec49
a64d661
c787cd2
bd38603
097f9c3
4cfdd6f
4a34d52
8d9cd45
f870d7e
18ad931
4667171
a26c094
6ed4121
2312f41
8da7f47
5150a5c
bff34ea
ef462b6
dc45165
9c8a352
eb80919
0b2bbee
3a97bf6
2eeb884
73f99e8
c942753
9badf7c
6763ba2
e5117d7
dfec884
5065ad6
e354956
4d3aab6
97b3886
fe60b56
eeb0407
e807d02
d78675e
7f60a5d
8f5c9f9
42582e1
14d924c
82e016c
c03ea52
3693d45
beb814f
a5418ac
f4f9b17
d11fd58
31d3bf6
5007a83
a79fe4c
c0246f4
ed608cb
70223e4
07808c2
a2ceeb6
73fb414
7e9c65f
2ca8030
ce83f92
3bc4d8a
2986574
aa22a8a
f41b19e
61a88e4
9ac7a2c
d0c2094
474fc06
9fac5d7
868d0b6
5f204f2
40b3e5f
5e3adbc
108720f
9d8f6b1
b994eec
eff754e
4b052be
bc5297c
0989636
2b57ac4
41d7e07
ab5fa8f
f485bc0
074f768
a47627a
eebb298
9a672dd
c2f4561
62fb3f4
742d0d3
f3e24b1
08aed24
951c8b5
9c1e82f
46e240c
6dc1f68
23964aa
0a4d053
98d203d
145b9ff
8b91668
cccdab2
a2e5643
1771c84
1fc551d
d99c5cb
b98ccd4
936cc38
32da1ab
5a403c0
908e6eb
77e4472
853af80
172177f
a825bf0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
package prebuilds | ||
|
||
import ( | ||
"context" | ||
|
||
"github.com/google/uuid" | ||
|
||
"github.com/coder/coder/v2/coderd/database" | ||
) | ||
|
||
type ReconciliationOrchestrator interface { | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Reconciler | ||
|
||
RunLoop(ctx context.Context) | ||
Stop(ctx context.Context, cause error) | ||
} | ||
|
||
type Reconciler interface { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current API feels too low-level. Ideally, we should expose higher-level methods such as:
Note:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's been a couple weeks but I believe we only expose these for tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe make it an internal interface and internal test? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added a detailed comments, explaining how this API can be used, and why it may sense to expose all these API. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @johnstcn I also wanted to do it, but:
So removing them means:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is no need for this interface to be exported for the metrics collector to work correctly, since it lives in the same package as the Tests can be inside the same package if they need to access unexported functions. Also, you don't need to export an interface to call exported functions on, say, the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we do this refactor in a follow-up? Let's try keep the momentum on this PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't agree that "momentum" should be a reason to check in code that doesn't meet our standards. What is to be gained by fixing it later rather than now? IMO it just makes it more likely that we don't fix the issue at all. Also, this interface isn't actually used anywhere within this PR, so fixing it within the scope of this PR should be trivial. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @spikecurtis I reduce number of methods in Now it looks like this: type Reconciler interface {
// ReconcileAll orchestrates the reconciliation of all prebuilds across all templates.
// It takes a global snapshot of the system state and then reconciles each preset
// in parallel, creating or deleting prebuilds as needed to reach their desired states.
ReconcileAll(ctx context.Context) error
} In future PRs maybe we'll need additional methods, but we can deal with it when we'll be working on those PRs. |
||
// SnapshotState MUST be called inside a repeatable-read tx. | ||
SnapshotState(ctx context.Context, store database.Store) (*ReconciliationState, error) | ||
// DetermineActions MUST be called inside a repeatable-read tx. | ||
DetermineActions(ctx context.Context, state PresetState) (*ReconciliationActions, error) | ||
// Reconcile MUST be called inside a repeatable-read tx. | ||
Reconcile(ctx context.Context, state PresetState, actions ReconciliationActions) error | ||
} | ||
|
||
type Claimer interface { | ||
Claim(ctx context.Context, store database.Store, userID uuid.UUID, name string, presetID uuid.UUID) (*uuid.UUID, error) | ||
Initiator() uuid.UUID | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
package prebuilds | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
import ( | ||
"context" | ||
|
||
"github.com/coder/coder/v2/coderd/database" | ||
) | ||
|
||
type NoopReconciler struct{} | ||
|
||
func NewNoopReconciler() *NoopReconciler { | ||
return &NoopReconciler{} | ||
} | ||
|
||
func (NoopReconciler) RunLoop(context.Context) {} | ||
func (NoopReconciler) Stop(context.Context, error) {} | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
func (NoopReconciler) SnapshotState(context.Context, database.Store) (*ReconciliationState, error) { | ||
return &ReconciliationState{}, nil | ||
} | ||
|
||
func (NoopReconciler) DetermineActions(context.Context, PresetState) (*ReconciliationActions, error) { | ||
return &ReconciliationActions{}, nil | ||
} | ||
|
||
func (NoopReconciler) Reconcile(context.Context, PresetState, ReconciliationActions) error { | ||
return nil | ||
} | ||
|
||
var _ ReconciliationOrchestrator = NoopReconciler{} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
package prebuilds | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
import ( | ||
"time" | ||
|
||
"github.com/google/uuid" | ||
"golang.org/x/xerrors" | ||
|
||
"github.com/coder/coder/v2/coderd/database" | ||
"github.com/coder/coder/v2/coderd/util/slice" | ||
) | ||
|
||
// ReconciliationState represents a full point-in-time snapshot of state relating to prebuilds across all templates. | ||
type ReconciliationState struct { | ||
Presets []database.GetTemplatePresetsWithPrebuildsRow | ||
RunningPrebuilds []database.GetRunningPrebuiltWorkspacesRow | ||
PrebuildsInProgress []database.CountInProgressPrebuildsRow | ||
Backoffs []database.GetPresetsBackoffRow | ||
} | ||
|
||
// PresetState is a subset of ReconciliationState but specifically for a single preset. | ||
type PresetState struct { | ||
Preset database.GetTemplatePresetsWithPrebuildsRow | ||
Running []database.GetRunningPrebuiltWorkspacesRow | ||
InProgress []database.CountInProgressPrebuildsRow | ||
Backoff *database.GetPresetsBackoffRow | ||
} | ||
|
||
// ReconciliationActions represents the set of actions which must be taken to achieve the desired state for prebuilds. | ||
type ReconciliationActions struct { | ||
Actual int32 // Running prebuilds for active version. | ||
Desired int32 // Active template version's desired instances as defined in preset. | ||
Eligible int32 // Prebuilds which can be claimed. | ||
Outdated int32 // Prebuilds which no longer match the active template version. | ||
Extraneous int32 // Extra running prebuilds for active version (somehow). | ||
Starting, Stopping, Deleting int32 // Prebuilds currently being provisioned up or down. | ||
Failed int32 // Number of prebuilds which have failed in the past CODER_WORKSPACE_PREBUILDS_RECONCILIATION_BACKOFF_LOOKBACK_PERIOD. | ||
Create int32 // The number of prebuilds required to be created to reconcile required state. | ||
DeleteIDs []uuid.UUID // IDs of running prebuilds required to be deleted to reconcile required state. | ||
BackoffUntil time.Time // The time to wait until before trying to provision a new prebuild. | ||
} | ||
|
||
func NewReconciliationState(presets []database.GetTemplatePresetsWithPrebuildsRow, runningPrebuilds []database.GetRunningPrebuiltWorkspacesRow, | ||
prebuildsInProgress []database.CountInProgressPrebuildsRow, backoffs []database.GetPresetsBackoffRow, | ||
) ReconciliationState { | ||
return ReconciliationState{Presets: presets, RunningPrebuilds: runningPrebuilds, PrebuildsInProgress: prebuildsInProgress, Backoffs: backoffs} | ||
} | ||
|
||
func (s ReconciliationState) FilterByPreset(presetID uuid.UUID) (*PresetState, error) { | ||
preset, found := slice.Find(s.Presets, func(preset database.GetTemplatePresetsWithPrebuildsRow) bool { | ||
return preset.ID == presetID | ||
}) | ||
if !found { | ||
return nil, xerrors.Errorf("no preset found with ID %q", presetID) | ||
} | ||
|
||
running := slice.Filter(s.RunningPrebuilds, func(prebuild database.GetRunningPrebuiltWorkspacesRow) bool { | ||
if !prebuild.CurrentPresetID.Valid { | ||
return false | ||
} | ||
return prebuild.CurrentPresetID.UUID == preset.ID && | ||
prebuild.TemplateVersionID == preset.TemplateVersionID // Not strictly necessary since presets are 1:1 with template versions, but no harm in being extra safe. | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
}) | ||
|
||
// These aren't preset-specific, but they need to inhibit all presets of this template from operating since they could | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// be in-progress builds which might impact another preset. For example, if a template goes from no defined prebuilds to defined prebuilds | ||
// and back, or a template is updated from one version to another. | ||
// We group by the template so that all prebuilds being provisioned for a prebuild are inhibited if any prebuild for | ||
// any preset in that template are in progress, to prevent clobbering. | ||
inProgress := slice.Filter(s.PrebuildsInProgress, func(prebuild database.CountInProgressPrebuildsRow) bool { | ||
return prebuild.TemplateID == preset.TemplateID | ||
}) | ||
|
||
var backoff *database.GetPresetsBackoffRow | ||
backoffs := slice.Filter(s.Backoffs, func(row database.GetPresetsBackoffRow) bool { | ||
return row.PresetID == preset.ID | ||
}) | ||
if len(backoffs) == 1 { | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
backoff = &backoffs[0] | ||
} | ||
|
||
return &PresetState{ | ||
Preset: preset, | ||
Running: running, | ||
InProgress: inProgress, | ||
Backoff: backoff, | ||
}, nil | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,167 @@ | ||
package prebuilds | ||
|
||
import ( | ||
"math" | ||
"slices" | ||
"time" | ||
|
||
"github.com/coder/quartz" | ||
|
||
"github.com/coder/coder/v2/coderd/database" | ||
) | ||
|
||
func (p PresetState) CalculateActions(clock quartz.Clock, backoffInterval time.Duration) (*ReconciliationActions, error) { | ||
// TODO: align workspace states with how we represent them on the FE and the CLI | ||
// right now there's some slight differences which can lead to additional prebuilds being created | ||
|
||
// TODO: add mechanism to prevent prebuilds being reconciled from being claimable by users; i.e. if a prebuild is | ||
evgeniy-scherbina marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// about to be deleted, it should not be deleted if it has been claimed - beware of TOCTOU races! | ||
|
||
var ( | ||
actual int32 // Running prebuilds for active version. | ||
desired int32 // Active template version's desired instances as defined in preset. | ||
eligible int32 // Prebuilds which can be claimed. | ||
outdated int32 // Prebuilds which no longer match the active template version. | ||
extraneous int32 // Extra running prebuilds for active version (somehow). | ||
starting, stopping, deleting int32 // Prebuilds currently being provisioned up or down. | ||
) | ||
|
||
if p.Preset.UsingActiveVersion { | ||
actual = int32(len(p.Running)) | ||
desired = p.Preset.DesiredInstances.Int32 | ||
} | ||
|
||
for _, prebuild := range p.Running { | ||
if p.Preset.UsingActiveVersion { | ||
if prebuild.Ready { | ||
eligible++ | ||
} | ||
|
||
extraneous = int32(math.Max(float64(actual-p.Preset.DesiredInstances.Int32), 0)) | ||
} | ||
|
||
if prebuild.TemplateVersionID == p.Preset.TemplateVersionID && !p.Preset.UsingActiveVersion { | ||
outdated++ | ||
} | ||
} | ||
|
||
// In-progress builds are common across all presets belonging to a given template. | ||
// In other words: these values will be identical across all presets belonging to this template. | ||
for _, progress := range p.InProgress { | ||
num := progress.Count | ||
switch progress.Transition { | ||
case database.WorkspaceTransitionStart: | ||
starting += num | ||
case database.WorkspaceTransitionStop: | ||
stopping += num | ||
case database.WorkspaceTransitionDelete: | ||
deleting += num | ||
} | ||
} | ||
|
||
var ( | ||
toCreate = int(math.Max(0, float64( | ||
desired-(actual+starting)), // The number of prebuilds currently being stopped (should be 0) | ||
)) | ||
toDelete = int(math.Max(0, float64( | ||
outdated- // The number of prebuilds running above the desired count for active version | ||
deleting), // The number of prebuilds currently being deleted | ||
)) | ||
|
||
actions = &ReconciliationActions{ | ||
Actual: actual, | ||
Desired: desired, | ||
Eligible: eligible, | ||
Outdated: outdated, | ||
Extraneous: extraneous, | ||
Starting: starting, | ||
Stopping: stopping, | ||
Deleting: deleting, | ||
} | ||
) | ||
|
||
// If the template has become deleted or deprecated since the last reconciliation, we need to ensure we | ||
// scale those prebuilds down to zero. | ||
if p.Preset.Deleted || p.Preset.Deprecated { | ||
toCreate = 0 | ||
toDelete = int(actual + outdated) | ||
actions.Desired = 0 | ||
} | ||
|
||
// We backoff when the last build failed, to give the operator some time to investigate the issue and to not provision | ||
// a tonne of prebuilds (_n_ on each reconciliation iteration). | ||
if p.Backoff != nil && p.Backoff.NumFailed > 0 { | ||
actions.Failed = p.Backoff.NumFailed | ||
|
||
backoffUntil := p.Backoff.LastBuildAt.Add(time.Duration(p.Backoff.NumFailed) * backoffInterval) | ||
|
||
if clock.Now().Before(backoffUntil) { | ||
actions.Create = 0 | ||
actions.DeleteIDs = nil | ||
actions.BackoffUntil = backoffUntil | ||
|
||
// Return early here; we should not perform any reconciliation actions if we're in a backoff period. | ||
return actions, nil | ||
} | ||
} | ||
|
||
// It's possible that an operator could stop/start prebuilds which interfere with the reconciliation loop, so | ||
// we check if there are somehow more prebuilds than we expect, and then pick random victims to be deleted. | ||
if extraneous > 0 { | ||
// Sort running IDs by creation time so we always delete the oldest prebuilds. | ||
// In general, we want fresher prebuilds (imagine a mono-repo is cloned; newer is better). | ||
slices.SortFunc(p.Running, func(a, b database.GetRunningPrebuiltWorkspacesRow) int { | ||
if a.CreatedAt.Before(b.CreatedAt) { | ||
return -1 | ||
} | ||
if a.CreatedAt.After(b.CreatedAt) { | ||
return 1 | ||
} | ||
|
||
return 0 | ||
}) | ||
|
||
for i := 0; i < int(extraneous); i++ { | ||
if i >= len(p.Running) { | ||
// This should never happen. | ||
// TODO: move up | ||
// c.logger.Warn(ctx, "unexpected reconciliation state; extraneous count exceeds running prebuilds count!", | ||
// slog.F("running_count", len(p.Running)), | ||
// slog.F("extraneous", extraneous)) | ||
continue | ||
} | ||
|
||
actions.DeleteIDs = append(actions.DeleteIDs, p.Running[i].ID) | ||
} | ||
|
||
// TODO: move up | ||
// c.logger.Warn(ctx, "found extra prebuilds running, picking random victim(s)", | ||
// slog.F("template_id", p.Preset.TemplateID.String()), slog.F("desired", desired), slog.F("actual", actual), slog.F("extra", extraneous), | ||
// slog.F("victims", victims)) | ||
|
||
// Prevent the rest of the reconciliation from completing | ||
return actions, nil | ||
} | ||
|
||
actions.Create = int32(toCreate) | ||
|
||
// if toDelete > 0 && len(p.Running) != toDelete { | ||
// TODO: move up | ||
// c.logger.Warn(ctx, "mismatch between running prebuilds and expected deletion count!", | ||
// slog.F("template_id", s.preset.TemplateID.String()), slog.F("running", len(p.Running)), slog.F("to_delete", toDelete)) | ||
// } | ||
|
||
// TODO: implement lookup to not perform same action on workspace multiple times in $period | ||
// i.e. a workspace cannot be deleted for some reason, which continually makes it eligible for deletion | ||
for i := 0; i < toDelete; i++ { | ||
if i >= len(p.Running) { | ||
// TODO: move up | ||
// Above warning will have already addressed this. | ||
continue | ||
} | ||
|
||
actions.DeleteIDs = append(actions.DeleteIDs, p.Running[i].ID) | ||
} | ||
|
||
return actions, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't catch it, but why is
prebuilds
located incoderd
? Are there any components or interfaces used in non-commercial license mode? What I can see here are mainly interfaces and snapshots.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the idea is we have all interfaces and related types defined in
coderd
. Consequently we have to define methods in the same package. IIRC in goalng you can't define methods outside the package.@dannykopping should know more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but since
prebuilds
will be enterprise only, could we just park the entire implementation there?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need at least some of the API in the AGPL code because we need the noop implementation "running" until a premium license is used. Other than that, I think most of this can shift to enterprise.
I believe I had it this way because AGPL cannot import enterprise but the reverse is allowed. The interfaces reference the types which are in this package, and if we moved those to enterprise then it would violate the license.
This is how the implementation will be swapped out at runtime based on the license:
https://github.com/coder/coder/blob/dk/prebuilds/enterprise/coderd/coderd.go#L1165-L1188