-
Notifications
You must be signed in to change notification settings - Fork 888
fix: give SSH stdio sessions a chance to close before closing netstack #10815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current dependencies on/for this PR: This stack of pull requests is managed by Graphite. |
Hmm, is my test or my expectation faulty? I tried this PR through the following way: while :; do {sleep 10 && pkill -HUP -f /home/coder/src/coder/coder/coder} &; ssh coder.w; sleep 0.1; done While following the agent log, the WireGuard peers keep accumulating:
However, sending the HUP to SSH seems to appropriately tear the connection down and there's no accumulation: while :; do {sleep 10 && pkill -HUP ssh} &; ssh coder.w; sleep 0.1; done The difference between the two in agent logs seem to be:
So it seems there are still more situations where we could handle this gracefully? (Given that we handle HUP/INT, I'd expect the shutdown procedure to be proper when it targets the proxy command too.) |
That's #7960 and is unrelated to SSH session termination. It's true that sending SIGHUP to OpenSSH terminates in a slightly different way: OpenSSH sends an explicit |
I don't feel like the reaction is the same. As observed, one accumulates WireGuard peers whereas the other does not. So there's a difference in how the tailscale connection is torn down. Why can't we exit (wrt tailscale connection) as cleanly by sending HUP to coder as we do when SSH sends the disconnect message? Seems like it should be doable. 🤔 |
That's surprising. Can you send me the full logs that show Wireguard peers being accumulated in one case but not the other? |
@spikecurtis alright, I'll see if I can reproduce tomorrow (logs from earlier are already gone). |
8110808
to
e1a3c1e
Compare
Merge activity
|
@spikecurtis I tried to reproduce the behavior today but I was unable to, so not sure what/why I observed what I did yesterday. I'm fairly sure the running agent I was testing against was 7060069, and the client was built from your branch. I can't spot any significant commits that would've changed behavior between tries, so let's forget about this for now. |
It could just have been a timing issue. Inactive peers eventually time out and get reaped if there are new peers coming in, making it appear, just by looking at the total number that maybe they were not accumulating in the second case. |
Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123
The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers, including one that closes the netstack, and thus drop the TCP connection before it closes.