Skip to content

fix: give SSH stdio sessions a chance to close before closing netstack #10815

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 22, 2023

Conversation

spikecurtis
Copy link
Contributor

Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123

The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers, including one that closes the netstack, and thus drop the TCP connection before it closes.

Copy link
Contributor Author

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@mafredri
Copy link
Member

Hmm, is my test or my expectation faulty? I tried this PR through the following way:

while :; do {sleep 10 && pkill -HUP -f /home/coder/src/coder/coder/coder} &; ssh coder.w; sleep 0.1; done

While following the agent log, the WireGuard peers keep accumulating:

2023-11-21 13:05:53.606 [debu]  net.tailnet.net.wgengine: wgengine: Reconfig: configuring userspace WireGuard config (with 15/15 peers)

However, sending the HUP to SSH seems to appropriately tear the connection down and there's no accumulation:

while :; do {sleep 10 && pkill -HUP ssh} &; ssh coder.w; sleep 0.1; done

The difference between the two in agent logs seem to be:

2023-11-21 13:04:12.685 [info]  ssh-server: ssh connection complete  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  error=EOF
2023-11-21 13:04:12.687 [debu]  ssh-server: copy output done  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=856ccabe-edda-4a8c-a04c-be1ec941e7a4  bytes=734  error=<nil>
2023-11-21 13:04:12.688 [info]  ssh-server: ssh session returned  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=856ccabe-edda-4a8c-a04c-be1ec941e7a4  error="signal: killed"
2023-11-21 13:10:57.054 [info]  ssh-server: ssh connection complete  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  error="ssh: disconnect, reason 11: disconnected by user"
2023-11-21 13:10:57.056 [debu]  ssh-server: copy output done  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=0cb4cc40-9bbe-4a91-b5d8-dd7e4ba5ebd6  bytes=1534  error=<nil>
2023-11-21 13:10:57.056 [info]  ssh-server: ssh session returned  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=0cb4cc40-9bbe-4a91-b5d8-dd7e4ba5ebd6  error="signal: killed"

So it seems there are still more situations where we could handle this gracefully? (Given that we handle HUP/INT, I'd expect the shutdown procedure to be proper when it targets the proxy command too.)

@spikecurtis
Copy link
Contributor Author

While following the agent log, the WireGuard peers keep accumulating:

That's #7960 and is unrelated to SSH session termination.


It's true that sending SIGHUP to OpenSSH terminates in a slightly different way: OpenSSH sends an explicit disconnect message over the SSH connection, and sending SIGHUP to coder ssh --stdio results in the SSH connection getting closed (EOF), but the agentssh reacts the same way.

@mafredri
Copy link
Member

It's true that sending SIGHUP to OpenSSH terminates in a slightly different way: OpenSSH sends an explicit disconnect message over the SSH connection, and sending SIGHUP to coder ssh --stdio results in the SSH connection getting closed (EOF), but the agentssh reacts the same way.

I don't feel like the reaction is the same. As observed, one accumulates WireGuard peers whereas the other does not. So there's a difference in how the tailscale connection is torn down. Why can't we exit (wrt tailscale connection) as cleanly by sending HUP to coder as we do when SSH sends the disconnect message? Seems like it should be doable. 🤔

@spikecurtis
Copy link
Contributor Author

I don't feel like the reaction is the same. As observed, one accumulates WireGuard peers whereas the other does not. So there's a difference in how the tailscale connection is torn down.

That's surprising. Can you send me the full logs that show Wireguard peers being accumulated in one case but not the other?

@mafredri
Copy link
Member

@spikecurtis alright, I'll see if I can reproduce tomorrow (logs from earlier are already gone).

@spikecurtis spikecurtis force-pushed the spike/ssh-stdio-graceful-shutdown branch from 8110808 to e1a3c1e Compare November 22, 2023 09:03
@spikecurtis spikecurtis merged commit f20cc66 into main Nov 22, 2023
@spikecurtis spikecurtis deleted the spike/ssh-stdio-graceful-shutdown branch November 22, 2023 09:11
Copy link
Contributor Author

Merge activity

@github-actions github-actions bot locked and limited conversation to collaborators Nov 22, 2023
@mafredri
Copy link
Member

@spikecurtis I tried to reproduce the behavior today but I was unable to, so not sure what/why I observed what I did yesterday.

I'm fairly sure the running agent I was testing against was 7060069, and the client was built from your branch. I can't spot any significant commits that would've changed behavior between tries, so let's forget about this for now.

Copy link
Contributor Author

It could just have been a timing issue. Inactive peers eventually time out and get reaped if there are new peers coming in, making it appear, just by looking at the total number that maybe they were not accumulating in the second case.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants