refactor: PTY & SSH #7100

spikecurtis · 2023-04-12T10:01:22Z

Our agentssh package handles running commands in a pseudoterminal, and we've had several problems with not returning command output.

The implementation prior to this PR seems to behave how we want it to, (at least on Linux), but it

uses needlessly complex concurrency and synchronization
incorrectly explains why it works
doesn't fix the problem on Windows

This leaves us in a precarious situation where if someone tries to fix or enhance this function, they'll misunderstand how and why the current code works.

This PR attempts to address these problems.

This was the subject of several fixes: #6777 #6833 and #6862

The initial bug was that the SSH handler runs io.Copy() in a goroutine to copy the command output to the SSH session, but we didn't wait for this copy to finish before closing down the session, causing a race that sometimes lost output.

However, simply waiting for the io.Copy() to finish would result in a deadlock because of another, much more subtle bug in the way pseudoterminals are handled by the OS.

Linux/macOS

The current code makes an unnecessary call to the low-level syscall dup to duplicate a file descriptor. I'll try to explain what was happening and why it was bad:

On POSIX systems, the way pseudo-TTYs work is that open a pair of files, which I'll refer to as the pty and the tty to match the terminology in our code (Linux itself calls them ptm and pts). The tty acts just like a "real" teletype device for any program interacting with it --- and when we launch commands, we pass the tty as all three of the stdio file descriptors (stdin, stdout, stderr) to the child process.

After forking/exec'ing to start the child process this is what things looked like:

go process
PTY:
  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process
  - tty  ------------------------------------------------------------^

Notice that our go process kept the tty file open. What happens is when you read from the pty, and there is no data available to read, it blocks until either

there is data to read
all file descriptors to the tty are closed

We kept an open file descriptor to the tty so even after the child process exited, we never hit condition 2, until we called Close() on our PTY object, which then closed both pty and tty at the same time.

The current code gets around this by calling dup on the pty file descriptor, so we end up with two references to the PTY.

go process
SSH Handler:
  - stdout ------------------------------+
PTY:                                     V
  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process
  - tty  ------------------------------------------------------------^

The explanation given for this is that it allows us to close stdin and stdout separately, but that's a misunderstanding. The child process interacts only with the tty for both stdin and stdout --- duplicating the pty doesn't change this.

However, this does allow us to break the deadlock, because calling Close() on the PTY closes tty, but since we have the duplicated file, we can still read the output as long as the child process is still alive or there is unread buffered data in the kernel. But, this is needlessly wasteful of file descriptors (which are a limited kernel resource), and essentially uses a kernel primitive to accomplish the task of closing tty without closing pty.

The essential features of this PR on Linux/macOS are to

Close the tty immediately after creating the child process - this allows us to safely read all the output from the child process without fear of deadlocking.
Fix the SSH handler to simply call io.Copy() to copy output, then grab the exit code and return --- no extra file descriptors, wait groups, or synchronization.

Windows

Instead of a single read-write file descriptor, Windows uses a pair of pipes (which are unidirectional). On Windows, though, the pseudoconsole is hosted by a separate process, called the Console Host (conhost.exe), rather than being handled entirely in-kernel like on Linux/macOS. The lifecycle of the pseudoterminal is explicit via a Handle returned on creation. So on Windows, we create 5 handles:

Input Read Pipe
Input Write Pipe
Output Read Pipe
Output Write Pipe
Console

In the current implementation, we keep all 5 handles open until the PTY is closed, at which point we close all 5. This then causes a race for a reader trying to read all the data before the pipe gets closed.

Docs for PseudoConsoles explain that once you pass the Input Read Pipe and Output Write Pipe to the pseudoconsole host, you should close them.

On Windows we also have to close the Pseudoconsole itself to unblock our output reads, since the Pseudoconsole is a separate process from the child application, and it has an explicit lifetime. So, in this PR, we also add code that closes the Pseudoconsole as soon as the child application exits (for PTYs attached to commands). Crucially, we leave the Output Read Pipe open, so that we can still read any data buffered in the OS.

Other features

split the PTY interface in 2, since there are still use cases in the code where we create a PTY but don't start a child process. So, we have one interface when we are interacting with a child, and a different interface when we hold both ends of the PTY -- this is done for safety so that you can't attempt to read from the tty after we've closed it.
add a unit test to check that we clean up orphaned processes when the SSH client disconnects
add a unit test to check for truncated output from commands started in a PTY

Signed-off-by: Spike Curtis <spike@coder.com>

mafredri

Thanks for digging into this! I was aware of the method by which it worked but seems did a poor job explaining it... not to mention being brave enough to venture this deep into PTYs. 😄

I did a little testing locally and couldn't immediately find any issues with this approach, so I'm hoping there won't be any edge-cases with processes closing their inputs/outputs while still running.

Had a few remarks but all-in-all, looks good, good job!

pty/pty_windows.go

mafredri · 2023-04-19T15:03:56Z

pty/pty_windows.go

+	// may have already closed the PseudoConsole when the command exited, so that
+	// output reads can get to EOF.  In that case, we don't need to close it
+	// again here.
+	if p.console != windows.InvalidHandle {


Assigned both here and in waitInternal, probably needs protection.

Protected by the closeMutex

Ah right, the p.pw threw me off. I noticed Resize is not protected though!

pty/ptytest/ptytest.go

mafredri · 2023-04-19T15:11:36Z

pty/ptytest/ptytest.go

 func (p *PTY) Close() error {
 	p.t.Helper()
+	pErr := p.PTY.Close()
+	eErr := p.outExpecter.close("close")


A lot of the ceremony in close is due to closing pty, which is no longer done there. We should probably simplify it and make it less spammy. (The theory was that we might be hanging in some of the cleanup, resulting in weird flakes, but that has not been the case.)

The outExpecter.close() doesn't close the pty itself, but it does monitor the cleanup of the various goroutines we created to log the output and run expectations against it. I've left that part intact as my change is meant to leave ptytest functionally equivalent to what it was before.

pty/start_test.go

mafredri · 2023-04-19T15:27:52Z

pty/start_test.go

+		b := make([]byte, 1)
+		go func() {
+			_, err := r.Read(b)
+			readErrs <- err


Suggested change

readErrs <- err

select {

case readErrs <- err:

default:

}

I don't think this is needed.

The whole purpose of this little goroutine is to allow me to call Read() but still be able to timeout in this function if the context expires. So I read in a goroutine, then select whether the read completed before the context expires. The chan error is buffered, so even if the context expires, the goroutine will be able to finish once Read() returns. Note that an expired context calls return so there can only be one read goroutine started at a time.

mafredri · 2023-04-19T15:31:43Z

pty/pty_other.go

 		Writer: p.tty,
 	}
 }

+func (p *otherPty) OutputReader() io.Reader {
+	return &ptmReader{p.pty}


It's unfortunate we can't pass the *os.File here since io.Copy can sometimes do some slightly more performant operations with it. (That's more of a nice-to-have though, so this is fine.)

I don't think that applies for copying from an *os.File. io.Copy() looks for WriteTo() on the source, which *os.File does not implement.

It does apply when you are copying to an *os.File, since it implements ReadFrom() --- InputWriter() still returns the *os.File directly, so we can take advantage of that optimization.

mafredri · 2023-04-19T16:22:14Z

agent/agentssh/agentssh.go

-type readNopCloser struct{ io.Reader }
+// ptySession is the interface to the ssh.Session that startPTYSession uses
+// we use an interface here so that we can fake it in tests.
+type ptySession interface {


Is there a reason you chose a stripped down interface vs using ssh.Session directly? I see the test but we also have a fake context there with all the methods so I'm not seeing the benefit per-se. I feel this creates a bit of needless indirection. Not critical to change now, I'll see if I make some changes down the line as I do some refactoring for the session handling anyway.

ssh.Session is an even bigger interface!

I couldn't see any way to directly instantiate any of the concrete implementations in like gliderlabs.

My first stab at the test I wanted to do created the whole server like the other ssh tests do, but I found the even after calling Close() on the network connection, the session context in the handler wasn't closed quickly. So, this construction narrows down the big ssh.Session interface into just what we need, and allows me to more precisely control when the context expires in tests.

It's a bit annoying that I have to mock out the context methods, but unfortunately, ssh.Session is defined to have a method Context() that returns ssh.Context rather than context.Context.

Unfortunately the go type-checker isn't smart enough to understand that Context() ssh.Context must also satisfy Context() context.Context since context.Context is a strict subset of ssh.Context. Alas.

agent/agent_test.go

go.mod

pty/pty_other.go

deansheather · 2023-04-19T21:09:16Z

pty/pty_windows.go

+				// if we already have an error from the command, prefer that error
+				// but if the command succeeded and closing the PseudoConsole fails
+				// then record that error so that we have a chance to see it
+				p.cmdErr = err


It'd be nice to log the error whether it's kept or discarded

It would be nice, but I don't have a logger, and I'd have to change all the init signatures to get one, so I decided on this.

Another alternative I considered is panicking---something pretty messed up at the OS level is happening if we get an error here. We're pretty much guaranteed to crash the agent if we panic, so it's likely we'd get told about it.

pty/start_test.go

Signed-off-by: Spike Curtis <spike@coder.com>

pty/start_windows.go

mafredri

The windows Resize method still needs protection on reading p.console and I'm not sold on the outExpecter naming (as commented), but won't block the PR for that.

Once the the resize is fixed and the test pass, LGTM.

Signed-off-by: Spike Curtis <spike@coder.com>

kylecarbs

Feel free to defer my reviews to Dean and Mathias.

I'm extremely happy you dug into this!

pty/pty.go

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis added 2 commits April 11, 2023 15:00

Add ssh tests for longoutput, orphan

0075d7d

Signed-off-by: Spike Curtis <spike@coder.com>

PTY/SSH tests & improvements

a491d4f

Signed-off-by: Spike Curtis <spike@coder.com>

github-actions bot assigned spikecurtis Apr 12, 2023

spikecurtis added 2 commits April 12, 2023 15:14

Fix some tests

2cf357a

Signed-off-by: Spike Curtis <spike@coder.com>

Fix linting

28c0646

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis changed the title ~~PTY & SSH fixes~~ refactor: PTY & SSH Apr 12, 2023

spikecurtis added 11 commits April 12, 2023 12:08

fmt

872e357

Signed-off-by: Spike Curtis <spike@coder.com>

Fix windows test

3f21e30

Signed-off-by: Spike Curtis <spike@coder.com>

Windows copy test

e83ff6e

Signed-off-by: Spike Curtis <spike@coder.com>

WIP Windows pty handling

b610579

Signed-off-by: Spike Curtis <spike@coder.com>

Fix truncation tests

90bfe94

Signed-off-by: Spike Curtis <spike@coder.com>

Appease linter/fmt

d6e131c

Signed-off-by: Spike Curtis <spike@coder.com>

Fix typo

2c9c6ef

Signed-off-by: Spike Curtis <spike@coder.com>

Rework truncation test to not assume OS buffers

e39e885

Signed-off-by: Spike Curtis <spike@coder.com>

Disable orphan test on Windows --- uses sh

8ec3d1f

Signed-off-by: Spike Curtis <spike@coder.com>

agent_test running SSH in pty use ptytest.Start

df424e6

Signed-off-by: Spike Curtis <spike@coder.com>

More detail about closing pseudoconsole on windows

50e3fec

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis marked this pull request as ready for review April 19, 2023 07:13

spikecurtis requested review from kylecarbs, deansheather and mafredri April 19, 2023 07:13

mafredri reviewed Apr 19, 2023

View reviewed changes

deansheather approved these changes Apr 19, 2023

View reviewed changes

spikecurtis added 2 commits April 20, 2023 11:07

Code review fixes

c09083e

Signed-off-by: Spike Curtis <spike@coder.com>

Rearrange ptytest method order

439107d

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis requested a review from mafredri April 20, 2023 11:15

mafredri reviewed Apr 20, 2023

View reviewed changes

pty/start_windows.go Outdated Show resolved Hide resolved

mafredri approved these changes Apr 20, 2023

View reviewed changes

spikecurtis added 2 commits April 21, 2023 05:21

Protect pty.Resize on windows from races

c6a3229

Signed-off-by: Spike Curtis <spike@coder.com>

Fix windows bugs

aa94546

Signed-off-by: Spike Curtis <spike@coder.com>

kylecarbs reviewed Apr 21, 2023

View reviewed changes

pty/pty.go Show resolved Hide resolved

spikecurtis added 3 commits April 24, 2023 06:06

PTY doesn't extend PTYCmd

0f07cb9

Signed-off-by: Spike Curtis <spike@coder.com>

merge branch 'main' of https://github.com/coder/coder into spike/pty-ssh

eaabb3a

Fix windows types

50060d8

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis merged commit daee91c into main Apr 24, 2023

spikecurtis deleted the spike/pty-ssh branch April 24, 2023 10:53

github-actions bot locked and limited conversation to collaborators Apr 24, 2023

-			readErrs <- err
+			select {
+			case readErrs <- err:
+			default:
+			}

refactor: PTY & SSH #7100

refactor: PTY & SSH #7100

Uh oh!

Conversation

spikecurtis commented Apr 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linux/macOS

Windows

Other features

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

kylecarbs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

spikecurtis commented Apr 12, 2023 •

edited

Loading