-
Notifications
You must be signed in to change notification settings - Fork 914
feat: track resource replacements when claiming a prebuilt workspace #17571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
coderd/database/migrations/000318_resource_replacements_notification.up.sql
Outdated
Show resolved
Hide resolved
4859410
to
634082b
Compare
0322146
to
13168f4
Compare
@@ -75,6 +75,7 @@ message CompletedJob { | |||
repeated provisioner.Resource resources = 2; | |||
repeated provisioner.Timing timings = 3; | |||
repeated provisioner.Module modules = 4; | |||
repeated provisioner.ResourceReplacement resourceReplacements = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is backwards-compatible since repeated
fields are effectively optional
if nil length.
// nolint:gocritic // Necessary to query all the required data. | ||
ctx = dbauthz.AsSystemRestricted(ctx) | ||
// Since this may be called in a fire-and-forget fashion, we need to give up at some point. | ||
trackCtx, trackCancel := context.WithTimeout(ctx, time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a best-effort attempt to warn operators of this situation; it's ok if it times out, we'll get a log to trace this with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this partially covers my earlier comment. I still think it'd be good to take shutdowns into consideration for these and define what that behavior should be (rather than undefined). Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly. Likewise these can be left running even if a CompleteJob is interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now these routines will be rudely interrupted during shutdown rather than exiting cleanly.
Good call! I've now passed in the provisionerdserver
's lifecycle context.
Likewise these can be left running even if a CompleteJob is interrupted.
I believe that's already the case?
@@ -258,7 +258,7 @@ func getStateFilePath(workdir string) string { | |||
} | |||
|
|||
// revive:disable-next-line:flag-parameter | |||
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, destroy bool) (*proto.PlanComplete, error) { | |||
func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, metadata *proto.Metadata) (*proto.PlanComplete, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the meat of the feature; everything else is just plumbing between system and user eyeball.
level := proto.LogLevel_INFO | ||
|
||
// Terraform indicates that a resource will be deleted and recreated by showing the change along with this substring. | ||
if bytes.Contains(line, []byte("# forces replacement")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit flimsy; open to other ideas.
In any case, this is just sugar. The fact that the plan, with all its drift details, are shown will be sufficient. Highlighting the lines is just a courtesy to the user.
coderd/database/migrations/000320_resource_replacements_notification.up.sql
Outdated
Show resolved
Hide resolved
@@ -13,43 +16,65 @@ import ( | |||
"github.com/coder/coder/v2/coderd/prebuilds" | |||
) | |||
|
|||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Muddies the purpose of the PR a bit, but it was a worthwhile driveby refactoring given that we're adding a new metric (MetricResourceReplacementsCount
) and we need to check for its value in a test.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…replacement(s) Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
also a test that was broken from an earlier fix Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
13168f4
to
70f9a53
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No major issues uncovered, my main concern is the goroutine management in CompleteJob, other than that mostly minor nits and suggestions.
coderd/database/migrations/000320_resource_replacements_notification.up.sql
Outdated
Show resolved
Hide resolved
coderd/database/migrations/000320_resource_replacements_notification.up.sql
Outdated
Show resolved
Hide resolved
…17757) Used in combination with coder/terraform-provider-coder#396 This is required by both #17475 and #17571 Operators may need to conditionalize their templates to perform certain operations once a prebuilt workspace has been claimed. This value will **only** be set once a claim takes place and a subsequent `terraform apply` occurs. Any `terraform apply` runs thereafter will be indistinguishable from a normal run on a workspace. --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments but my main concern is around using the atomic.Pointer
for the notifier, I don't really see a reason for it in which case I think the code should be simplified.
The other remaining concern is around goroutine/shutdown which could lead to test flakes (leak detection).
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
… API avoid all this pointer juggling Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
…nyway Signed-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My review limited to the protobuf / version changes.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Closes coder/internal#369
We can't know whether a replacement (i.e. drift of terraform state leading to a resource needing to be deleted/recreated) will take place apriori; we can only detect it at
plan
time, because the provider decides whether a resource must be replaced and it cannot be inferred through static analysis of the template.This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers.
Drift details will now be logged in the workspace build logs:
Plus a notification will be sent to template admins when this situation arises:
A new metric -
coderd_prebuilt_workspaces_resource_replacements_total
- will also increment each time a workspace encounters replacements.We only track that a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know apriori which replacement would cause this.
For example, say we have 2 replacements: a
docker_container
and anull_resource
; we don't know which one mightcause an issue (or indeed if either would), so we just track the replacement.