Skip to content

feat: track resource replacements when claiming a prebuilt workspace #17571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
May 14, 2025

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Apr 25, 2025

Closes coder/internal#369

We can't know whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place apriori; we can only detect it at plan time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template.

This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers.

Drift details will now be logged in the workspace build logs:

image

Plus a notification will be sent to template admins when this situation arises:

image

A new metric - coderd_prebuilt_workspaces_resource_replacements_total - will also increment each time a workspace encounters replacements.

We only track that a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know apriori which replacement would cause this.
For example, say we have 2 replacements: a docker_container and a null_resource; we don't know which one might
cause an issue (or indeed if either would), so we just track the replacement.

@github-actions github-actions bot added the stale This issue is like stale bread. label May 6, 2025
@dannykopping dannykopping force-pushed the dk/logreplacements branch 3 times, most recently from 4859410 to 634082b Compare May 7, 2025 21:06
@github-actions github-actions bot removed the stale This issue is like stale bread. label May 8, 2025
@dannykopping dannykopping force-pushed the dk/logreplacements branch from 0322146 to 13168f4 Compare May 8, 2025 12:38
@@ -75,6 +75,7 @@ message CompletedJob {
repeated provisioner.Resource resources = 2;
repeated provisioner.Timing timings = 3;
repeated provisioner.Module modules = 4;
repeated provisioner.ResourceReplacement resourceReplacements = 5;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is backwards-compatible since repeated fields are effectively optional if nil length.

// nolint:gocritic // Necessary to query all the required data.
ctx = dbauthz.AsSystemRestricted(ctx)
// Since this may be called in a fire-and-forget fashion, we need to give up at some point.
trackCtx, trackCancel := context.WithTimeout(ctx, time.Minute)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a best-effort attempt to warn operators of this situation; it's ok if it times out, we'll get a log to trace this with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this partially covers my earlier comment. I still think it'd be good to take shutdowns into consideration for these and define what that behavior should be (rather than undefined). Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly. Likewise these can be left running even if a CompleteJob is interrupted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly.

Good call! I've now passed in the provisionerdserver's lifecycle context.

Likewise these can be left running even if a CompleteJob is interrupted.

I believe that's already the case?

@@ -258,7 +258,7 @@ func getStateFilePath(workdir string) string {
}

// revive:disable-next-line:flag-parameter
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, destroy bool) (*proto.PlanComplete, error) {
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, metadata *proto.Metadata) (*proto.PlanComplete, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the meat of the feature; everything else is just plumbing between system and user eyeball.

level := proto.LogLevel_INFO

// Terraform indicates that a resource will be deleted and recreated by showing the change along with this substring.
if bytes.Contains(line, []byte("# forces replacement")) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit flimsy; open to other ideas.
In any case, this is just sugar. The fact that the plan, with all its drift details, are shown will be sufficient. Highlighting the lines is just a courtesy to the user.

@@ -13,43 +16,65 @@ import (
"github.com/coder/coder/v2/coderd/prebuilds"
)

const (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Muddies the purpose of the PR a bit, but it was a worthwhile driveby refactoring given that we're adding a new metric (MetricResourceReplacementsCount) and we need to check for its value in a test.

@dannykopping dannykopping changed the title WIP! feat: log resource replacements when claiming a prebuilt workspace feat: track resource replacements when claiming a prebuilt workspace May 8, 2025
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…replacement(s)

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
also a test that was broken from an earlier fix

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping force-pushed the dk/logreplacements branch from 13168f4 to 70f9a53 Compare May 8, 2025 12:48
@dannykopping dannykopping marked this pull request as ready for review May 8, 2025 12:59
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major issues uncovered, my main concern is the goroutine management in CompleteJob, other than that mostly minor nits and suggestions.

dannykopping added a commit that referenced this pull request May 12, 2025
…17757)

Used in combination with
coder/terraform-provider-coder#396

This is required by both #17475 and
#17571

Operators may need to conditionalize their templates to perform certain
operations once a prebuilt workspace has been claimed. This value will
**only** be set once a claim takes place and a subsequent `terraform
apply` occurs. Any `terraform apply` runs thereafter will be
indistinguishable from a normal run on a workspace.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments but my main concern is around using the atomic.Pointer for the notifier, I don't really see a reason for it in which case I think the code should be simplified.

The other remaining concern is around goroutine/shutdown which could lead to test flakes (leak detection).

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
… API

avoid all this pointer juggling

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…nyway

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping requested a review from spikecurtis May 14, 2025 09:39
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Copy link
Contributor

@spikecurtis spikecurtis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review limited to the protobuf / version changes.

@dannykopping dannykopping enabled auto-merge (squash) May 14, 2025 09:51
@dannykopping dannykopping disabled auto-merge May 14, 2025 09:53
@dannykopping dannykopping merged commit 6e96778 into main May 14, 2025
34 checks passed
@dannykopping dannykopping deleted the dk/logreplacements branch May 14, 2025 12:52
@github-actions github-actions bot locked and limited conversation to collaborators May 14, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inform template admins that resources will be replaced
4 participants