Merged
Fix another race
ammario committed Jun 8, 2023
commit 23a6833c775c5f5e57d45c071132fd30f9c6f4c4
15 changes: 10 additions & 5 deletions cli/ssh.go
@@ -117,10 +117,7 @@ func (r *RootCmd) ssh() *clibase.Cmd {
 			// the logger is created but before any goroutines or wind-down
 			// defers (e.g. context cancels) are declared.
 			var wg sync.WaitGroup
-			defer func() {
-				cancel()
-				wg.Wait()
-			}()
+			defer wg.Wait()

workspace, workspaceAgent, err := getWorkspaceAndAgent(ctx, inv, client, codersdk.Me, inv.Args[0])
if err != nil {
@@ -192,7 +189,15 @@ func (r *RootCmd) ssh() *clibase.Cmd {
 			wg.Add(1)
 			go func() {
 				defer wg.Done()
-				watchAndClose(ctx, rawSSH.Close, logger, client, workspace)
+				watchAndClose(ctx, func() error {
+					rawSSH.Close()
+					// If we don't close Stdin, the io.Copy below may
+					// block indefinitely on Stdin Read.
+					if rc, ok := inv.Stdin.(io.Closer); ok {
+						rc.Close()
+					}
Contributor:

This smells funny in that it shouldn't be the job of the cli command to close stdin.

Also, in the case where rawSSH goes down unexpectedly, but the workspace stays up, I think that we could still be blocked on reading from stdin, creating a deadlock with the waitgroup.

It should be the job of the clibase invoker to close stdin when the command is done -- but here that can create a deadlock with the waitgroup, since the main command won't return until the waitgroup is done.

Instead of having the waitgroup cover all the command goroutines except the main one, why don't we make it cover all the command goroutines, including the main one? Then we can have a separate goroutine that just waits on the group and closes the log file.

e.g.

var wg sync.WaitGroup
wg.Add(1)
defer wg.Done()
if logDirPath != "" {
    logfile, err := os.OpenFile(...)
    if err != nil {...}
    go func() {
        wg.Wait()
        logfile.Close()
    }()
}

wg.Add(1)
go func() {
    defer wg.Done()
    ...
}()

Member Author:

I agree my solution is janky. The challenge with closing the logfile in a separate goroutine is that it creates a leak, which was the main thing I was trying to fix when opening this PR.

Contributor:

Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Contributor:

And, it's not just a question of jankiness. As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.

Member Author:

> Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Member Author:

> As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.

I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

Contributor:

> I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

I'm worried about io.Copy blocking on Read() - the same thing you're trying to prevent here, but with rawSSH just closing for a different reason.

Contributor:

> Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Have you tried it and seen goleak firing? If there are goroutines still running, goleak will sleep and retry up to some maximum timeout, exactly so that things like what I'm proposing have time to resolve themselves before goleak complains about them.

I can't imagine this would be the case for waiting for a waitgroup and closing a file, but if for some reason the goroutine takes longer to resolve than goleak's default timeout, we can adjust the timeout.

Member Author:

Didn't know about the retry. I'll implement your proposal.

+					return nil
+				}, logger, client, workspace)
 			}()

 			wg.Add(1)