Skip to content
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
114 changes: 73 additions & 41 deletions cli/ssh.go
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,9 @@ import (
"net/url"
"os"
"os/exec"
"path"
"path/filepath"
"strings"
"sync"
"time"

"github.com/gen2brain/beeep"
Expand Down Expand Up @@ -50,8 +50,7 @@ func (r *RootCmd) ssh() *clibase.Cmd {
identityAgent string
wsPollInterval time.Duration
noWait bool
logDir string
logToFile bool
logDirPath string
)
client := new(codersdk.Client)
cmd := &clibase.Cmd{
Expand All @@ -74,24 +73,34 @@ func (r *RootCmd) ssh() *clibase.Cmd {
logger.Error(ctx, "command exit", slog.Error(retErr))
}
}()
if logToFile {
// we need a way to ensure different ssh invocations don't clobber
// each other's logs. Date-time strings will likely have collisions
// in unit tests and/or scripts unless we extend precision out to
// sub-millisecond, which seems unwieldy. A simple 5-character random
// string will do it, since the operating system already tracks
// dates and times for file IO.
qual, err := cryptorand.String(5)
if logDirPath != "" {
nonce, err := cryptorand.StringCharset(cryptorand.Lower, 5)
if err != nil {
return xerrors.Errorf("generate random qualifier: %w", err)
return xerrors.Errorf("generate nonce: %w", err)
}
logPth := path.Join(logDir, fmt.Sprintf("coder-ssh-%s.log", qual))
logFile, err := os.Create(logPth)
logFilePath := filepath.Join(
logDirPath,
fmt.Sprintf(
"coder-ssh-%s-%s.log",
// The time portion makes it easier to find the right
// log file.
time.Now().Format("20060102-150405"),
// The nonce prevents collisions, as SSH invocations
// frequently happen in parallel.
nonce,
),
)
logFile, err := os.OpenFile(
logFilePath,
os.O_CREATE|os.O_APPEND|os.O_WRONLY|os.O_EXCL,
0o600,
)
if err != nil {
return xerrors.Errorf("error opening %s for logging: %w", logPth, err)
return xerrors.Errorf("error opening %s for logging: %w", logDirPath, err)
}
logger = slog.Make(sloghuman.Sink(logFile))
defer logFile.Close()

logger = slog.Make(sloghuman.Sink(logFile))
if r.verbose {
logger = logger.Leveled(slog.LevelDebug)
}
Expand All @@ -100,6 +109,16 @@ func (r *RootCmd) ssh() *clibase.Cmd {
client.Logger = logger
}

// This WaitGroup solves for a race condition where we were logging
// while closing the log file in in a defer. It probably solves
// others too.
//
// Its position in this function is important. It must be after
// the logger is created but before any goroutines or wind-down
// defers (e.g. context cancels) are declared.
var wg sync.WaitGroup
defer wg.Wait()

workspace, workspaceAgent, err := getWorkspaceAndAgent(ctx, inv, client, codersdk.Me, inv.Args[0])
if err != nil {
return err
Expand Down Expand Up @@ -166,9 +185,24 @@ func (r *RootCmd) ssh() *clibase.Cmd {
return xerrors.Errorf("connect SSH: %w", err)
}
defer rawSSH.Close()
go watchAndClose(ctx, rawSSH.Close, logger, client, workspace)

wg.Add(1)
go func() {
defer wg.Done()
watchAndClose(ctx, func() error {
rawSSH.Close()
// If we don't close Stdin, the io.Copy below may
// block indefinitely on Stdin Read.
if rc, ok := inv.Stdin.(io.Closer); ok {
rc.Close()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells funny in that it shouldn't be the job of the cli command to close stdin.

Also, in the case where rawSSH goes down unexpectedly, but the workspace stays up, I think that we could still be blocked on reading from stdin, creating a deadlock with the waitgroup.

It should be the job of the clibase invoker to close stdin when the command is done -- but here that can create a deadlock with the waitgroup, since the main command won't return until the waitgroup is done.

Instead of considering the waitgroup to be all the command goroutines except the main one, why don't we make the waitgroup all the command goroutines including the main one? Then we can have a separate goroutine that just waits for the group, and closes the log file.

e.g.

var wg sync.WaitGroup
wg.Add(1)
defer wg.Done()
if logDirPath != "" {
    logfile, err := os.OpenFile(...)
    if err != nil {...}
    go func() {
        wg.Wait()
        logfile.Close()
    }
}

wg.Add(1)
go func() {
    defer wg.Done()
    ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree my solution is janky. The challenge with closing the logfile in a separate goroutine is it creates a leak — which was the main thing I was trying to fix when opening this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, it's not just a question of jankiness. As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.

I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

I'm worried about io.Copy blocking on Read() - the same thing you're trying to prevent here, but with rawSSH just closing for a different reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Have you tried it and seen goleak firing? If there are goroutines still running, goleak will sleep and retry up to some maximum timeout, exactly so that things like what I'm proposing have time to resolve themselves before goleak complains about them.

I can't imagine this would be the case for waiting for a waitgroup and closing a file, but if for some reason the goroutine takes longer to resolve than goleak's default timeout, we can adjust the timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about the retry. I'll implement your proposal.

return nil
}, logger, client, workspace)
}()

wg.Add(1)
go func() {
defer wg.Done()
// Ensure stdout copy closes incase stdin is closed
// unexpectedly. Typically we wouldn't worry about
// this since OpenSSH should kill the proxy command.
Expand Down Expand Up @@ -201,19 +235,24 @@ func (r *RootCmd) ssh() *clibase.Cmd {
return xerrors.Errorf("ssh session: %w", err)
}
defer sshSession.Close()
go watchAndClose(
ctx,
func() error {
err := sshSession.Close()
logger.Debug(ctx, "session close", slog.Error(err))
err = sshClient.Close()
logger.Debug(ctx, "client close", slog.Error(err))
return nil
},
logger,
client,
workspace,
)

wg.Add(1)
go func() {
defer wg.Done()
watchAndClose(
ctx,
func() error {
err := sshSession.Close()
logger.Debug(ctx, "session close", slog.Error(err))
err = sshClient.Close()
logger.Debug(ctx, "client close", slog.Error(err))
return nil
},
logger,
client,
workspace,
)
}()

if identityAgent == "" {
identityAgent = os.Getenv("SSH_AUTH_SOCK")
Expand Down Expand Up @@ -354,18 +393,11 @@ func (r *RootCmd) ssh() *clibase.Cmd {
Value: clibase.BoolOf(&noWait),
},
{
Flag: "log-dir",
Default: os.TempDir(),
Description: "Specify the location for the log files.",
Env: "CODER_SSH_LOG_DIR",
Value: clibase.StringOf(&logDir),
},
{
Flag: "log-to-file",
Flag: "log-dir",
Description: "Specify the directory containing SSH diagnostic log files.",
Env: "CODER_SSH_LOG_DIR",
FlagShorthand: "l",
Env: "CODER_SSH_LOG_TO_FILE",
Description: "Enable diagnostic logging to file.",
Value: clibase.BoolOf(&logToFile),
Value: clibase.StringOf(&logDirPath),
},
}
return cmd
Expand Down
26 changes: 7 additions & 19 deletions cli/ssh_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -261,7 +261,7 @@ func TestSSH(t *testing.T) {
client, workspace, agentToken := setupWorkspaceForAgent(t, nil)
_, _ = tGoContext(t, func(ctx context.Context) {
// Run this async so the SSH command has to wait for
// the build and agent to connect!
// the build and agent to connect.
agentClient := agentsdk.New(client.URL)
agentClient.SetSessionToken(agentToken)
agentCloser := agent.New(agent.Options{
Expand Down Expand Up @@ -411,20 +411,14 @@ func TestSSH(t *testing.T) {
t.Run("FileLogging", func(t *testing.T) {
t.Parallel()

dir := t.TempDir()
logDir := t.TempDir()

client, workspace, agentToken := setupWorkspaceForAgent(t, nil)
inv, root := clitest.New(t, "ssh", workspace.Name, "-l", "--log-dir", dir)
inv, root := clitest.New(t, "ssh", "-l", logDir, workspace.Name)
clitest.SetupConfig(t, client, root)
pty := ptytest.New(t).Attach(inv)
w := clitest.StartWithWaiter(t, inv)

ctx, cancel := context.WithTimeout(context.Background(), testutil.WaitLong)
defer cancel()

cmdDone := tGo(t, func() {
err := inv.WithContext(ctx).Run()
assert.NoError(t, err)
})
pty.ExpectMatch("Waiting")

agentClient := agentsdk.New(client.URL)
Expand All @@ -439,17 +433,11 @@ func TestSSH(t *testing.T) {

// Shells on Mac, Windows, and Linux all exit shells with the "exit" command.
pty.WriteLine("exit")
<-cmdDone
w.RequireSuccess()

entries, err := os.ReadDir(dir)
ents, err := os.ReadDir(logDir)
require.NoError(t, err)
for _, e := range entries {
t.Logf("logdir entry: %s", e.Name())
if strings.HasPrefix(e.Name(), "coder-ssh") {
return
}
}
t.Fatal("failed to find ssh logfile")
require.Len(t, ents, 1, "expected one file in logdir %s", logDir)
})
}

Expand Down
7 changes: 2 additions & 5 deletions cli/testdata/coder_ssh_--help.golden
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,8 @@ Start a shell into a workspace
Specifies which identity agent to use (overrides $SSH_AUTH_SOCK),
forward agent must also be enabled.

--log-dir string, $CODER_SSH_LOG_DIR (default: /tmp)
Specify the location for the log files.

-l, --log-to-file bool, $CODER_SSH_LOG_TO_FILE
Enable diagnostic logging to file.
-l, --log-dir string, $CODER_SSH_LOG_DIR
Specify the directory containing SSH diagnostic log files.

--no-wait bool, $CODER_SSH_NO_WAIT
Specifies whether to wait for a workspace to become ready before
Expand Down
14 changes: 2 additions & 12 deletions docs/cli/ssh.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,24 +39,14 @@ Specifies whether to forward the GPG agent. Unsupported on Windows workspaces, b

Specifies which identity agent to use (overrides $SSH_AUTH_SOCK), forward agent must also be enabled.

### --log-dir
### -l, --log-dir

| | |
| ----------- | ------------------------------- |
| Type | <code>string</code> |
| Environment | <code>$CODER_SSH_LOG_DIR</code> |
| Default | <code>/tmp</code> |

Specify the location for the log files.

### -l, --log-to-file

| | |
| ----------- | ----------------------------------- |
| Type | <code>bool</code> |
| Environment | <code>$CODER_SSH_LOG_TO_FILE</code> |

Enable diagnostic logging to file.
Specify the directory containing SSH diagnostic log files.

### --no-wait

Expand Down