Skip to content

chore: refactor agent connection updates #11301

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 2, 2024

Conversation

spikecurtis
Copy link
Contributor

Refactors the code that handles monitoring an agent websocket with pings and updating the connection times in the DB.

Consolidates v1 and v2 agent APIs under the same code for this.

One substantive change (not just a refactor) is that I've made it so that we actually disconnect if the agent fails to respond to our pings, rather than the old behavior where we would update the database, but not actually tear down the websocket.

Copy link
Contributor Author

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@spikecurtis spikecurtis requested a review from mafredri December 21, 2023 07:36
AnyTimes().
Return(database.WorkspaceBuild{ID: build.ID}, nil)

go uut.mind(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on context to shut down this method its racy in that we may trigger goleak sporadically. Might be best to synchronize this to the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goleak waits for goroutines to complete up to a timeout. I don't believe we are in danger of triggering it unless the routine deadlocks, in which case goleak will have found a legit bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default for only 2 seconds I believe, and I don’t think it’s configurable. Plenty of time for a slow env like Windows runner to trigger the edge case, I’d say.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even on a slow Windows runner, 2 seconds is absolutely ages. I've never seen goleak give a false positive on a goroutine that was unblocked and just waiting to be scheduled by the runtime and to finish its work. Have you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen such failures in the past, which is why I raised the concern. In my experience it can and has resulted in rare flakes. I'd like to concretely say which scenarios, but truth be told I don't remember.

I also can't say whether or not it's the case for the current Windows runners, the only thing I can clearly say is that for the old GH Actions Windows runner 2 seconds was definitely not a long time and such expectations often resulted in a flake. 😄 We run so many things in parallel it's unclear how much time is needed, the more things exit leaving other things behind, the closer we get to filling up that 2 second gap, I suppose.

@spikecurtis spikecurtis force-pushed the spike/refactor-agent-update branch from e1f4d89 to 1f31780 Compare January 2, 2024 07:12
@spikecurtis spikecurtis requested a review from mafredri January 2, 2024 07:43
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing, but other than that this looks good 👍🏻.

PS. Now that I know about the uut pattern, I kinda like it. ☺️

return
}
lastPing.Store(ptr.Ref(time.Now()))
func (api *API) startAgentWebsocketMonitor(ctx context.Context,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we now have a close method, I don't think we need to take a context here anymore, wdyt? For start we could simply create the context/cancel func based on api context here and pass it along.

PS. Thanks for the naming change, this now feels very obvious what it does when reading the code!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the API context is used when sending the final updated that the agent has been disconnected.

The context we accept here, and want for start(), is tied to the connection.

@spikecurtis spikecurtis force-pushed the spike/refactor-agent-update branch from 1f31780 to 61f5d26 Compare January 2, 2024 11:54
@spikecurtis spikecurtis merged commit c9b7d61 into main Jan 2, 2024
@spikecurtis spikecurtis deleted the spike/refactor-agent-update branch January 2, 2024 12:04
Copy link
Contributor Author

Merge activity

@github-actions github-actions bot locked and limited conversation to collaborators Jan 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants