Skip to content

fix: Avoid deadlock in AgentReportStats Close during agent Close #5415

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 14, 2022

Conversation

mafredri
Copy link
Member

Since AgentReportStats takes a stats function which was doing mutex
locking on agent shutdown, it was possible for there to be a deadlock
depending on how the AgentReportsStats Close function is implemented.

This mostly seems to happen on Windows test runners as it's pretty hard
to hit this edge case. The bug currently only exists in the test
implementation of AgentReportStats, however, this was refactored to be
more robust in case of future changes.

Since AgentReportStats takes a stats function which was doing mutex
locking on agent shutdown, it was possible for there to be a deadlock
depending on how the AgentReportsStats Close function is implemented.

This mostly seems to happen on Windows test runners as it's pretty hard
to hit this edge case. The bug currently only exists in the test
implementation of AgentReportStats, however, this was refactored to be
more robust in case of future changes.
_ = cl.Close()
}); err != nil {
a.logger.Error(ctx, "report stats goroutine", slog.Error(err))
_ = cl.Close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Is stats reporter startup failure fatal? I.e. do we return an error or not?

@mafredri mafredri self-assigned this Dec 14, 2022
@mafredri mafredri marked this pull request as ready for review December 14, 2022 15:53
@mafredri mafredri requested review from kylecarbs and mtojek December 14, 2022 15:53
@@ -561,28 +580,6 @@ func (a *agent) init(ctx context.Context) {
}

go a.runLoop(ctx)
cl, err := a.client.AgentReportStats(ctx, a.logger, func() *codersdk.AgentStats {
stats := map[netlogtype.Connection]netlogtype.Counts{}
a.closeMutex.Lock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: The deadlock happens between

case c.statsChan <- stats():
and
<-doneCh
.

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part that is confusing for me is that you're modifying the agent initialization phase, and you have to assume that the agent or its network is already dead. Maybe we should split it into a few elements?

@mafredri mafredri merged commit 012a9e7 into main Dec 14, 2022
@mafredri mafredri deleted the mafredri/fix-report-stats-mutex-deadlock branch December 14, 2022 16:45
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants