Skip to content

refactor: PTY & SSH #7100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Apr 24, 2023
Merged

refactor: PTY & SSH #7100

merged 22 commits into from
Apr 24, 2023

Conversation

spikecurtis
Copy link
Contributor

@spikecurtis spikecurtis commented Apr 12, 2023

Our agentssh package handles running commands in a pseudoterminal, and we've had several problems with not returning command output.

The implementation prior to this PR seems to behave how we want it to, (at least on Linux), but it

  1. uses needlessly complex concurrency and synchronization
  2. incorrectly explains why it works
  3. doesn't fix the problem on Windows

This leaves us in a precarious situation where if someone tries to fix or enhance this function, they'll misunderstand how and why the current code works.

This PR attempts to address these problems.

This was the subject of several fixes: #6777 #6833 and #6862

The initial bug was that the SSH handler runs io.Copy() in a goroutine to copy the command output to the SSH session, but we didn't wait for this copy to finish before closing down the session, causing a race that sometimes lost output.

However, simply waiting for the io.Copy() to finish would result in a deadlock because of another, much more subtle bug in the way pseudoterminals are handled by the OS.

Linux/macOS

The current code makes an unnecessary call to the low-level syscall dup to duplicate a file descriptor. I'll try to explain what was happening and why it was bad:

On POSIX systems, the way pseudo-TTYs work is that open a pair of files, which I'll refer to as the pty and the tty to match the terminology in our code (Linux itself calls them ptm and pts). The tty acts just like a "real" teletype device for any program interacting with it --- and when we launch commands, we pass the tty as all three of the stdio file descriptors (stdin, stdout, stderr) to the child process.

After forking/exec'ing to start the child process this is what things looked like:

go process
PTY:
  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process
  - tty  ------------------------------------------------------------^

Notice that our go process kept the tty file open. What happens is when you read from the pty, and there is no data available to read, it blocks until either

  1. there is data to read
  2. all file descriptors to the tty are closed

We kept an open file descriptor to the tty so even after the child process exited, we never hit condition 2, until we called Close() on our PTY object, which then closed both pty and tty at the same time.

The current code gets around this by calling dup on the pty file descriptor, so we end up with two references to the PTY.

go process
SSH Handler:
  - stdout ------------------------------+
PTY:                                     V
  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process
  - tty  ------------------------------------------------------------^

The explanation given for this is that it allows us to close stdin and stdout separately, but that's a misunderstanding. The child process interacts only with the tty for both stdin and stdout --- duplicating the pty doesn't change this.

However, this does allow us to break the deadlock, because calling Close() on the PTY closes tty, but since we have the duplicated file, we can still read the output as long as the child process is still alive or there is unread buffered data in the kernel. But, this is needlessly wasteful of file descriptors (which are a limited kernel resource), and essentially uses a kernel primitive to accomplish the task of closing tty without closing pty.

The essential features of this PR on Linux/macOS are to

  1. Close the tty immediately after creating the child process - this allows us to safely read all the output from the child process without fear of deadlocking.
  2. Fix the SSH handler to simply call io.Copy() to copy output, then grab the exit code and return --- no extra file descriptors, wait groups, or synchronization.

Windows

Instead of a single read-write file descriptor, Windows uses a pair of pipes (which are unidirectional). On Windows, though, the pseudoconsole is hosted by a separate process, called the Console Host (conhost.exe), rather than being handled entirely in-kernel like on Linux/macOS. The lifecycle of the pseudoterminal is explicit via a Handle returned on creation. So on Windows, we create 5 handles:

  1. Input Read Pipe
  2. Input Write Pipe
  3. Output Read Pipe
  4. Output Write Pipe
  5. Console

In the current implementation, we keep all 5 handles open until the PTY is closed, at which point we close all 5. This then causes a race for a reader trying to read all the data before the pipe gets closed.

Docs for PseudoConsoles explain that once you pass the Input Read Pipe and Output Write Pipe to the pseudoconsole host, you should close them.

On Windows we also have to close the Pseudoconsole itself to unblock our output reads, since the Pseudoconsole is a separate process from the child application, and it has an explicit lifetime. So, in this PR, we also add code that closes the Pseudoconsole as soon as the child application exits (for PTYs attached to commands). Crucially, we leave the Output Read Pipe open, so that we can still read any data buffered in the OS.

Other features

  • split the PTY interface in 2, since there are still use cases in the code where we create a PTY but don't start a child process. So, we have one interface when we are interacting with a child, and a different interface when we hold both ends of the PTY -- this is done for safety so that you can't attempt to read from the tty after we've closed it.
  • add a unit test to check that we clean up orphaned processes when the SSH client disconnects
  • add a unit test to check for truncated output from commands started in a PTY

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis changed the title PTY & SSH fixes refactor: PTY & SSH Apr 12, 2023
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis marked this pull request as ready for review April 19, 2023 07:13
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this! I was aware of the method by which it worked but seems did a poor job explaining it... not to mention being brave enough to venture this deep into PTYs. 😄

I did a little testing locally and couldn't immediately find any issues with this approach, so I'm hoping there won't be any edge-cases with processes closing their inputs/outputs while still running.

Had a few remarks but all-in-all, looks good, good job!

// may have already closed the PseudoConsole when the command exited, so that
// output reads can get to EOF. In that case, we don't need to close it
// again here.
if p.console != windows.InvalidHandle {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigned both here and in waitInternal, probably needs protection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protected by the closeMutex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, the p.pw threw me off. I noticed Resize is not protected though!

func (p *PTY) Close() error {
p.t.Helper()
pErr := p.PTY.Close()
eErr := p.outExpecter.close("close")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the ceremony in close is due to closing pty, which is no longer done there. We should probably simplify it and make it less spammy. (The theory was that we might be hanging in some of the cleanup, resulting in weird flakes, but that has not been the case.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outExpecter.close() doesn't close the pty itself, but it does monitor the cleanup of the various goroutines we created to log the output and run expectations against it. I've left that part intact as my change is meant to leave ptytest functionally equivalent to what it was before.

b := make([]byte, 1)
go func() {
_, err := r.Read(b)
readErrs <- err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
readErrs <- err
select {
case readErrs <- err:
default:
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed.

The whole purpose of this little goroutine is to allow me to call Read() but still be able to timeout in this function if the context expires. So I read in a goroutine, then select whether the read completed before the context expires. The chan error is buffered, so even if the context expires, the goroutine will be able to finish once Read() returns. Note that an expired context calls return so there can only be one read goroutine started at a time.

Writer: p.tty,
}
}

func (p *otherPty) OutputReader() io.Reader {
return &ptmReader{p.pty}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate we can't pass the *os.File here since io.Copy can sometimes do some slightly more performant operations with it. (That's more of a nice-to-have though, so this is fine.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that applies for copying from an *os.File. io.Copy() looks for WriteTo() on the source, which *os.File does not implement.

It does apply when you are copying to an *os.File, since it implements ReadFrom() --- InputWriter() still returns the *os.File directly, so we can take advantage of that optimization.

type readNopCloser struct{ io.Reader }
// ptySession is the interface to the ssh.Session that startPTYSession uses
// we use an interface here so that we can fake it in tests.
type ptySession interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you chose a stripped down interface vs using ssh.Session directly? I see the test but we also have a fake context there with all the methods so I'm not seeing the benefit per-se. I feel this creates a bit of needless indirection. Not critical to change now, I'll see if I make some changes down the line as I do some refactoring for the session handling anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ssh.Session is an even bigger interface!

I couldn't see any way to directly instantiate any of the concrete implementations in like gliderlabs.

My first stab at the test I wanted to do created the whole server like the other ssh tests do, but I found the even after calling Close() on the network connection, the session context in the handler wasn't closed quickly. So, this construction narrows down the big ssh.Session interface into just what we need, and allows me to more precisely control when the context expires in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit annoying that I have to mock out the context methods, but unfortunately, ssh.Session is defined to have a method Context() that returns ssh.Context rather than context.Context.

Unfortunately the go type-checker isn't smart enough to understand that Context() ssh.Context must also satisfy Context() context.Context since context.Context is a strict subset of ssh.Context. Alas.

// if we already have an error from the command, prefer that error
// but if the command succeeded and closing the PseudoConsole fails
// then record that error so that we have a chance to see it
p.cmdErr = err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to log the error whether it's kept or discarded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice, but I don't have a logger, and I'd have to change all the init signatures to get one, so I decided on this.

Another alternative I considered is panicking---something pretty messed up at the OS level is happening if we get an error here. We're pretty much guaranteed to crash the agent if we panic, so it's likely we'd get told about it.

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis requested a review from mafredri April 20, 2023 11:15
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The windows Resize method still needs protection on reading p.console and I'm not sold on the outExpecter naming (as commented), but won't block the PR for that.

Once the the resize is fixed and the test pass, LGTM.

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Copy link
Member

@kylecarbs kylecarbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to defer my reviews to Dean and Mathias.

I'm extremely happy you dug into this!

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis merged commit daee91c into main Apr 24, 2023
@spikecurtis spikecurtis deleted the spike/pty-ssh branch April 24, 2023 10:53
@github-actions github-actions bot locked and limited conversation to collaborators Apr 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants