fix: make handleManifest always signal dependents #13141

Merged
1 commit merged into main from spike/13139-handleOK-promise on May 6, 2024

Conversation

@spikecurtis (Contributor) commented on May 3, 2024:

Fixes #13139

Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole apiConnRoutineManager to tear down routines.

Instead, we use a new object called a checkpoint that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
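
Conceptually, the checkpoint is a one-shot signal that also carries an error. A minimal sketch of the idea (illustrative only; names and details may differ from the merged code):

```go
package agent

import "context"

// checkpoint lets dependent goroutines wait for a routine to reach a point
// in its lifecycle, and learn whether it got there successfully.
type checkpoint struct {
	done chan struct{}
	err  error
}

func newCheckpoint() *checkpoint {
	return &checkpoint{done: make(chan struct{})}
}

// complete marks the checkpoint as reached. Pass nil for success, or an
// error to unblock waiters with a failure. It must be called exactly once.
func (c *checkpoint) complete(err error) {
	c.err = err
	close(c.done)
}

// wait blocks until the checkpoint completes or the context is canceled,
// and returns the error the checkpoint was completed with.
func (c *checkpoint) wait(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-c.done:
		return c.err
	}
}
```

A dependent routine calls wait(ctx) before proceeding, so if the routine it depends on fails, it is unblocked with that error instead of hanging forever.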

@mafredri (Member) left a review:

I like the checkpoint implementation, just some minor stuff but otherwise LGTM (no need to re-review unless you want.)


```go
// complete the checkpoint. Pass nil to indicate the checkpoint was ok.
func (c *checkpoint) complete(err error) {
	c.err = err
```
@mafredri (Member) commented on the hunk above:

How about using sync.Once here to simplify usage and allow multiple calls to complete?
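
For reference, a sync.Once-based complete might look roughly like this (a sketch of the suggestion under discussion, not code from this PR); the first caller's result wins and later calls become no-ops:

```go
type checkpoint struct {
	once sync.Once // requires "sync" in the imports
	done chan struct{}
	err  error
}

// complete records the first error (or nil) it is called with and unblocks
// waiters; subsequent calls are silently ignored instead of panicking.
func (c *checkpoint) complete(err error) {
	c.once.Do(func() {
		c.err = err
		close(c.done)
	})
}
```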

@spikecurtis (Contributor, author):

That introduces the possibility of multiple calls racing success vs failure. Better to ensure that there is only one call to complete().

@mafredri (Member):

In what way would it be racy? sync.Once is atomic AFAIK and first come, first served; the other caller remains blocked waiting for the first to complete.

@mafredri (Member) added on May 3, 2024:

Perhaps another way to put it: I think it would be preferable for this to be safe to use incorrectly and document how it should be used, vs incorrect use (or carelessness) resulting in a runtime panic.

@spikecurtis (Contributor, author):

It's racy in the sense that if you have two callers, one saying "success" and one saying "failure", then they can race each other to determine which gets reported to things waiting on the checkpoint.

@mafredri (Member):

I think I understand your concern, but I have a hard time understanding how it's relevant here. If someone introduces that race right now, the program will panic instead. So it needs to be avoided in either case, i.e. it's a non-issue.
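
For context: assuming complete() signals by closing a done channel (which the panic behavior discussed here implies), a second call trips Go's built-in panic for closing an already-closed channel:

```go
package main

func main() {
	done := make(chan struct{})
	close(done)
	close(done) // panic: close of closed channel
}
```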

@spikecurtis (Contributor, author):

> I think it would be preferable for this to be safe to use incorrectly and document how it should be used, vs incorrect use (or carelessness) resulting in a runtime panic.

I think that kind of defensive programming is a disservice in this case. Silently absorbing incorrect/careless use in a non-deterministic way is a source of frustrating and hard to diagnose bugs.

There is a genuine case to be made for a central service like Coderd needing to avoid panicking because of programming bugs, but here in the Agent, I think it's preferable to panic, print a stack trace, and exit.

@mafredri (Member):

I can definitely see the benefit of that approach as well, but I do feel we should have a high bar for introducing sharp edges that can result in decreased developer flow. If we feel this is important enough to panic, wouldn't another approach be to detect a second call and log it as an error + stack trace?

We don't know how well the logs from a panic will be persisted either; the user may attempt to rebuild their workspace, and ultimately we've caused an inconvenience and are none the wiser.

@spikecurtis force-pushed the spike/13139-handleOK-promise branch 2 times, most recently from 92f0f66 to 094ce97, on May 6, 2024 05:19
@mafredri (Member) left a review:

Thanks for adding the critical log handling, LGTM!

@spikecurtis force-pushed the spike/13139-superfluous-ctx branch from 2a73bb4 to 49cee21 on May 6, 2024 10:22
@spikecurtis force-pushed the spike/13139-handleOK-promise branch from 094ce97 to 4e4c469 on May 6, 2024 10:22
Base automatically changed from spike/13139-superfluous-ctx to main on May 6, 2024 10:33
@spikecurtis force-pushed the spike/13139-handleOK-promise branch from 4e4c469 to 4ad81b0 on May 6, 2024 10:33
@spikecurtis (Contributor, author) commented on May 6, 2024:

Merge activity

  • May 6, 6:34 AM EDT: Graphite rebased this pull request as part of a merge.
  • May 6, 6:47 AM EDT: @spikecurtis merged this pull request with Graphite.

@spikecurtis merged commit d51c691 into main on May 6, 2024
25 checks passed
@spikecurtis deleted the spike/13139-handleOK-promise branch on May 6, 2024 10:47
The github-actions bot locked and limited the conversation to collaborators on May 6, 2024
Development

Successfully merging this pull request may close: Agent RPC socket failure during handleManifest can lock up the agent (#13139)
2 participants