-
Notifications
You must be signed in to change notification settings - Fork 887
fix: Avoid deadlock in AgentReportStats Close during agent Close #5415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Since AgentReportStats takes a stats function which was doing mutex locking on agent shutdown, it was possible for there to be a deadlock depending on how the AgentReportsStats Close function is implemented. This mostly seems to happen on Windows test runners as it's pretty hard to hit this edge case. The bug currently only exists in the test implementation of AgentReportStats, however, this was refactored to be more robust in case of future changes.
_ = cl.Close() | ||
}); err != nil { | ||
a.logger.Error(ctx, "report stats goroutine", slog.Error(err)) | ||
_ = cl.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Is stats reporter startup failure fatal? I.e. do we return an error or not?
@@ -561,28 +580,6 @@ func (a *agent) init(ctx context.Context) { | |||
} | |||
|
|||
go a.runLoop(ctx) | |||
cl, err := a.client.AgentReportStats(ctx, a.logger, func() *codersdk.AgentStats { | |||
stats := map[netlogtype.Connection]netlogtype.Counts{} | |||
a.closeMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part that is confusing for me is that you're modifying the agent initialization phase, and you have to assume that the agent or its network is already dead. Maybe we should split it into a few elements?
Since AgentReportStats takes a stats function which was doing mutex
locking on agent shutdown, it was possible for there to be a deadlock
depending on how the AgentReportsStats Close function is implemented.
This mostly seems to happen on Windows test runners as it's pretty hard
to hit this edge case. The bug currently only exists in the test
implementation of AgentReportStats, however, this was refactored to be
more robust in case of future changes.