fix: give SSH stdio sessions a chance to close before closing netstack #10815

spikecurtis · 2023-11-21T12:19:55Z

Man, graceful shutdown is hard. Even after my changes, we were still hitting a graceful shutdown race: https://github.com/coder/coder/runs/18886842123

The problem was that while we attempt a graceful shutdown at the SSH layer by closing the session for writing, we were not giving it a chance to complete before continuing to tear down the stack of closers. That stack includes one that closes the netstack, which drops the TCP connection before the session has finished closing.

spikecurtis · 2023-11-21T12:20:06Z

Current dependencies on/for this PR:

main
- PR fix: give SSH stdio sessions a chance to close before closing netstack #10815 👈

This stack of pull requests is managed by Graphite.

mafredri · 2023-11-21T13:16:22Z

Hmm, is my test or my expectation faulty? I tested this PR as follows:

while :; do {sleep 10 && pkill -HUP -f /home/coder/src/coder/coder/coder} &; ssh coder.w; sleep 0.1; done

While following the agent log, the WireGuard peers keep accumulating:

2023-11-21 13:05:53.606 [debu]  net.tailnet.net.wgengine: wgengine: Reconfig: configuring userspace WireGuard config (with 15/15 peers)

However, sending the HUP to SSH seems to appropriately tear the connection down and there's no accumulation:

while :; do {sleep 10 && pkill -HUP ssh} &; ssh coder.w; sleep 0.1; done

The difference between the two in the agent logs seems to be:

2023-11-21 13:04:12.685 [info]  ssh-server: ssh connection complete  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  error=EOF
2023-11-21 13:04:12.687 [debu]  ssh-server: copy output done  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=856ccabe-edda-4a8c-a04c-be1ec941e7a4  bytes=734  error=<nil>
2023-11-21 13:04:12.688 [info]  ssh-server: ssh session returned  remote_addr=[fd7a:115c:a1e0:4088:981c:3dc5:c7d2:eb60]:48513  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=856ccabe-edda-4a8c-a04c-be1ec941e7a4  error="signal: killed"

2023-11-21 13:10:57.054 [info]  ssh-server: ssh connection complete  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  error="ssh: disconnect, reason 11: disconnected by user"
2023-11-21 13:10:57.056 [debu]  ssh-server: copy output done  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=0cb4cc40-9bbe-4a91-b5d8-dd7e4ba5ebd6  bytes=1534  error=<nil>
2023-11-21 13:10:57.056 [info]  ssh-server: ssh session returned  remote_addr=[fd7a:115c:a1e0:4e3c:beec:a26f:c572:feb3]:43841  local_addr=[fd7a:115c:a1e0:49d6:b259:b7ac:b1b2:48f4]:1  id=0cb4cc40-9bbe-4a91-b5d8-dd7e4ba5ebd6  error="signal: killed"

So it seems there are still more situations where we could handle this gracefully? (Given that we handle HUP/INT, I'd expect the shutdown procedure to be proper when it targets the proxy command too.)

cli/ssh.go

spikecurtis · 2023-11-21T16:36:18Z

> While following the agent log, the WireGuard peers keep accumulating:

That's #7960 and is unrelated to SSH session termination.

It's true that sending SIGHUP to OpenSSH terminates in a slightly different way: OpenSSH sends an explicit disconnect message over the SSH connection, and sending SIGHUP to coder ssh --stdio results in the SSH connection getting closed (EOF), but the agentssh reacts the same way.

mafredri · 2023-11-21T17:04:38Z

> It's true that sending SIGHUP to OpenSSH terminates in a slightly different way: OpenSSH sends an explicit disconnect message over the SSH connection, and sending SIGHUP to coder ssh --stdio results in the SSH connection getting closed (EOF), but the agentssh reacts the same way.

I don't feel like the reaction is the same. As observed, one accumulates WireGuard peers whereas the other does not. So there's a difference in how the tailscale connection is torn down. Why can't we exit (wrt tailscale connection) as cleanly by sending HUP to coder as we do when SSH sends the disconnect message? Seems like it should be doable. 🤔

spikecurtis · 2023-11-21T17:57:14Z

> I don't feel like the reaction is the same. As observed, one accumulates WireGuard peers whereas the other does not. So there's a difference in how the tailscale connection is torn down.

That's surprising. Can you send me the full logs that show WireGuard peers being accumulated in one case but not the other?

mafredri · 2023-11-21T19:21:19Z

@spikecurtis alright, I'll see if I can reproduce tomorrow (logs from earlier are already gone).

spikecurtis · 2023-11-22T09:11:23Z

Merge activity

Nov 22, 4:11 AM: @spikecurtis merged this pull request with Graphite.

mafredri · 2023-11-22T11:12:30Z

@spikecurtis I tried to reproduce the behavior today but was unable to, so I'm not sure why I observed what I did yesterday.

I'm fairly sure the running agent I was testing against was 7060069, and the client was built from your branch. I can't spot any significant commits that would've changed behavior between tries, so let's forget about this for now.

spikecurtis · 2023-11-22T11:15:28Z

It could just have been a timing issue. Inactive peers eventually time out and get reaped if there are new peers coming in, which can make it appear, judging only by the total number, that they were not accumulating in the second case.

github-actions bot assigned spikecurtis Nov 21, 2023

spikecurtis requested review from johnstcn and mafredri November 21, 2023 12:20

mafredri approved these changes Nov 21, 2023

johnstcn approved these changes Nov 21, 2023

fix: give SSH stdio sessions a chance to close before closing netstack

e1a3c1e

spikecurtis force-pushed the spike/ssh-stdio-graceful-shutdown branch from 8110808 to e1a3c1e Compare November 22, 2023 09:03

spikecurtis merged commit f20cc66 into main Nov 22, 2023

spikecurtis deleted the spike/ssh-stdio-graceful-shutdown branch November 22, 2023 09:11

github-actions bot locked and limited conversation to collaborators Nov 22, 2023
