-
Notifications
You must be signed in to change notification settings - Fork 887
refactor: PTY & SSH #7100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: PTY & SSH #7100
Conversation
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this! I was aware of the method by which it worked but seems did a poor job explaining it... not to mention being brave enough to venture this deep into PTYs. 😄
I did a little testing locally and couldn't immediately find any issues with this approach, so I'm hoping there won't be any edge-cases with processes closing their inputs/outputs while still running.
Had a few remarks but all-in-all, looks good, good job!
// may have already closed the PseudoConsole when the command exited, so that | ||
// output reads can get to EOF. In that case, we don't need to close it | ||
// again here. | ||
if p.console != windows.InvalidHandle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigned both here and in waitInternal, probably needs protection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Protected by the closeMutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, the p.pw
threw me off. I noticed Resize
is not protected though!
pty/ptytest/ptytest.go
Outdated
func (p *PTY) Close() error { | ||
p.t.Helper() | ||
pErr := p.PTY.Close() | ||
eErr := p.outExpecter.close("close") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the ceremony in close
is due to closing pty
, which is no longer done there. We should probably simplify it and make it less spammy. (The theory was that we might be hanging in some of the cleanup, resulting in weird flakes, but that has not been the case.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outExpecter.close()
doesn't close the pty itself, but it does monitor the cleanup of the various goroutines we created to log the output and run expectations against it. I've left that part intact as my change is meant to leave ptytest
functionally equivalent to what it was before.
b := make([]byte, 1) | ||
go func() { | ||
_, err := r.Read(b) | ||
readErrs <- err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readErrs <- err | |
select { | |
case readErrs <- err: | |
default: | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed.
The whole purpose of this little goroutine is to allow me to call Read()
but still be able to timeout in this function if the context expires. So I read in a goroutine, then select
whether the read completed before the context expires. The chan error
is buffered, so even if the context expires, the goroutine will be able to finish once Read()
returns. Note that an expired context calls return
so there can only be one read goroutine started at a time.
Writer: p.tty, | ||
} | ||
} | ||
|
||
func (p *otherPty) OutputReader() io.Reader { | ||
return &ptmReader{p.pty} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate we can't pass the *os.File
here since io.Copy
can sometimes do some slightly more performant operations with it. (That's more of a nice-to-have though, so this is fine.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that applies for copying from an *os.File
. io.Copy()
looks for WriteTo()
on the source, which *os.File
does not implement.
It does apply when you are copying to an *os.File
, since it implements ReadFrom()
--- InputWriter()
still returns the *os.File
directly, so we can take advantage of that optimization.
type readNopCloser struct{ io.Reader } | ||
// ptySession is the interface to the ssh.Session that startPTYSession uses | ||
// we use an interface here so that we can fake it in tests. | ||
type ptySession interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you chose a stripped down interface vs using ssh.Session
directly? I see the test but we also have a fake context there with all the methods so I'm not seeing the benefit per-se. I feel this creates a bit of needless indirection. Not critical to change now, I'll see if I make some changes down the line as I do some refactoring for the session handling anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ssh.Session
is an even bigger interface
!
I couldn't see any way to directly instantiate any of the concrete implementations in like gliderlabs
.
My first stab at the test I wanted to do created the whole server like the other ssh tests do, but I found the even after calling Close()
on the network connection, the session context in the handler wasn't closed quickly. So, this construction narrows down the big ssh.Session
interface into just what we need, and allows me to more precisely control when the context expires in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit annoying that I have to mock out the context methods, but unfortunately, ssh.Session
is defined to have a method Context()
that returns ssh.Context
rather than context.Context
.
Unfortunately the go type-checker isn't smart enough to understand that Context() ssh.Context
must also satisfy Context() context.Context
since context.Context
is a strict subset of ssh.Context
. Alas.
// if we already have an error from the command, prefer that error | ||
// but if the command succeeded and closing the PseudoConsole fails | ||
// then record that error so that we have a chance to see it | ||
p.cmdErr = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to log the error whether it's kept or discarded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice, but I don't have a logger, and I'd have to change all the init signatures to get one, so I decided on this.
Another alternative I considered is panicking---something pretty messed up at the OS level is happening if we get an error here. We're pretty much guaranteed to crash the agent if we panic, so it's likely we'd get told about it.
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The windows Resize method still needs protection on reading p.console
and I'm not sold on the outExpecter
naming (as commented), but won't block the PR for that.
Once the the resize is fixed and the test pass, LGTM.
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to defer my reviews to Dean and Mathias.
I'm extremely happy you dug into this!
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Our agentssh package handles running commands in a pseudoterminal, and we've had several problems with not returning command output.
The implementation prior to this PR seems to behave how we want it to, (at least on Linux), but it
This leaves us in a precarious situation where if someone tries to fix or enhance this function, they'll misunderstand how and why the current code works.
This PR attempts to address these problems.
This was the subject of several fixes: #6777 #6833 and #6862
The initial bug was that the SSH handler runs
io.Copy()
in a goroutine to copy the command output to the SSH session, but we didn't wait for this copy to finish before closing down the session, causing a race that sometimes lost output.However, simply waiting for the
io.Copy()
to finish would result in a deadlock because of another, much more subtle bug in the way pseudoterminals are handled by the OS.Linux/macOS
The current code makes an unnecessary call to the low-level syscall
dup
to duplicate a file descriptor. I'll try to explain what was happening and why it was bad:On POSIX systems, the way pseudo-TTYs work is that open a pair of files, which I'll refer to as the
pty
and thetty
to match the terminology in our code (Linux itself calls themptm
andpts
). Thetty
acts just like a "real" teletype device for any program interacting with it --- and when we launch commands, we pass thetty
as all three of the stdio file descriptors (stdin, stdout, stderr) to the child process.After forking/exec'ing to start the child process this is what things looked like:
Notice that our go process kept the
tty
file open. What happens is when you read from thepty
, and there is no data available to read, it blocks until eithertty
are closedWe kept an open file descriptor to the
tty
so even after the child process exited, we never hit condition 2, until we calledClose()
on ourPTY
object, which then closed bothpty
andtty
at the same time.The current code gets around this by calling
dup
on thepty
file descriptor, so we end up with two references to the PTY.The explanation given for this is that it allows us to close
stdin
andstdout
separately, but that's a misunderstanding. The child process interacts only with thetty
for bothstdin
andstdout
--- duplicating thepty
doesn't change this.However, this does allow us to break the deadlock, because calling
Close()
on thePTY
closestty
, but since we have the duplicated file, we can still read the output as long as the child process is still alive or there is unread buffered data in the kernel. But, this is needlessly wasteful of file descriptors (which are a limited kernel resource), and essentially uses a kernel primitive to accomplish the task of closingtty
without closingpty
.The essential features of this PR on Linux/macOS are to
tty
immediately after creating the child process - this allows us to safely read all the output from the child process without fear of deadlocking.io.Copy()
to copy output, then grab the exit code and return --- no extra file descriptors, wait groups, or synchronization.Windows
Instead of a single read-write file descriptor, Windows uses a pair of pipes (which are unidirectional). On Windows, though, the pseudoconsole is hosted by a separate process, called the Console Host (conhost.exe), rather than being handled entirely in-kernel like on Linux/macOS. The lifecycle of the pseudoterminal is explicit via a Handle returned on creation. So on Windows, we create 5 handles:
In the current implementation, we keep all 5 handles open until the PTY is closed, at which point we close all 5. This then causes a race for a reader trying to read all the data before the pipe gets closed.
Docs for PseudoConsoles explain that once you pass the Input Read Pipe and Output Write Pipe to the pseudoconsole host, you should close them.
On Windows we also have to close the Pseudoconsole itself to unblock our output reads, since the Pseudoconsole is a separate process from the child application, and it has an explicit lifetime. So, in this PR, we also add code that closes the Pseudoconsole as soon as the child application exits (for PTYs attached to commands). Crucially, we leave the Output Read Pipe open, so that we can still read any data buffered in the OS.
Other features
PTY
interface in 2, since there are still use cases in the code where we create a PTY but don't start a child process. So, we have one interface when we are interacting with a child, and a different interface when we hold both ends of the PTY -- this is done for safety so that you can't attempt to read from thetty
after we've closed it.