coder ssh hangs if underlying network is down #14712

Closed
spikecurtis opened this issue Sep 18, 2024 · 2 comments · Fixed by coder/tailscale#61 or #14746
Labels
customer-reported Bugs reported by enterprise customers. Only humans may set this. networking Area: networking s2 Broken use cases or features (with a workaround). Only humans may set this.

Comments

@spikecurtis
Contributor

spikecurtis commented Sep 18, 2024

coder ssh hangs when the underlying network is down instead of timing out. This is not about an idle connection: the user types something to send to the workspace while the network is down, and the session never recovers or exits.

We should give the underlying tailnet, say, 15 seconds (3x WireGuard handshake retries) to send the data and get a response. If it takes longer than that, it's safe to conclude the network is dead, and we should exit with an error.
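A minimal sketch of the proposed rule, assuming the session can observe when it sends bytes and when it receives any bytes back. The `stallDetector` type, its methods, `epoch`, and `errNetworkDead` are hypothetical names for illustration, not part of coder or tailnet. It also assumes that any bytes from the peer count as a response, which may not hold for every SSH interaction.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sendTimeout is the 15 s budget proposed above (3 x 5 s WireGuard handshake retries).
const sendTimeout = 15 * time.Second

// errNetworkDead is a hypothetical error value for this sketch.
var errNetworkDead = errors.New("no response within send timeout; network appears to be down")

// epoch is a fixed reference time so the example is deterministic.
var epoch = time.Unix(0, 0)

// stallDetector (hypothetical) tracks whether sent data is still awaiting a response.
type stallDetector struct {
	timeout      time.Duration
	pendingSince time.Time // zero value means nothing is awaiting a response
}

// Sent records that data was sent at time now. Only the oldest
// unanswered send matters, so later sends do not reset the clock.
func (d *stallDetector) Sent(now time.Time) {
	if d.pendingSince.IsZero() {
		d.pendingSince = now
	}
}

// Received records that the peer responded, clearing the pending state.
func (d *stallDetector) Received() {
	d.pendingSince = time.Time{}
}

// Check returns errNetworkDead if sent data has gone unanswered for longer
// than the timeout. An idle connection (nothing pending) never times out.
func (d *stallDetector) Check(now time.Time) error {
	if d.pendingSince.IsZero() {
		return nil
	}
	if now.Sub(d.pendingSince) > d.timeout {
		return errNetworkDead
	}
	return nil
}

func main() {
	d := &stallDetector{timeout: sendTimeout}
	fmt.Println("idle for an hour:", d.Check(epoch.Add(time.Hour)))

	d.Sent(epoch) // the user types while the network is down
	fmt.Println("5s after send:", d.Check(epoch.Add(5*time.Second)))
	fmt.Println("16s after send:", d.Check(epoch.Add(16*time.Second)))
}
```

The detector deliberately ignores idle time, so a quiet but healthy session is never torn down. Only unanswered sends start the 15-second clock.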

[Image: stack trace of coder ssh waiting to send]

Reproduction steps

1. Connect to a workspace with coder ssh
2. Disable WiFi and/or unplug Ethernet
3. Type some stuff at the frozen prompt

@spikecurtis spikecurtis added s2 Broken use cases or features (with a workaround). Only humans may set this. bug networking Area: networking customer-reported Bugs reported by enterprise customers. Only humans may set this. labels Sep 18, 2024
@spikecurtis
Contributor Author

In the attached stack trace, the system looks idle from the SSH standpoint: it is waiting to read from the local stdin and from the SSH Channel stdout. This is likely because the keys typed into the hung terminal were successfully written into the TCP session's send buffer, so the system is waiting for the sent data to be ACK'd.

There is a socket option, TCPUserTimeoutOption, which sets the amount of time to wait for sent data to be ACK'd before tearing down the connection, but gVisor should only be retrying a max of 15 times by default anyway before timing out.

@spikecurtis spikecurtis self-assigned this Sep 19, 2024
@spikecurtis
Contributor Author

> gVisor should only be retrying a max of 15 times by default anyway before timing out.

Did some more testing. gVisor does eventually time out, but the retry timeout increases exponentially to a cap of 120 seconds, so it took about 15 minutes before it actually timed out in my testing.
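The arithmetic behind that figure can be sketched as follows. The 15 retries and the 120-second cap come from the comments above. The starting timeouts (`assumedFastRTO`, `assumedSlowRTO`) are assumptions for illustration, since the issue does not state the initial value. The sketch also assumes each retry waits one full timeout before the next.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxRTO     = 120 * time.Second // cap on the retry timeout, as reported above
	maxRetries = 15                // default retry count, as reported above

	// Hypothetical starting timeouts; the issue does not give the initial value.
	assumedFastRTO = 200 * time.Millisecond
	assumedSlowRTO = 1 * time.Second
)

// totalRetransmitWait sums the waits for `retries` consecutive timeouts,
// doubling the timeout after each one and clamping it at `ceiling`.
func totalRetransmitWait(initial, ceiling time.Duration, retries int) time.Duration {
	var total time.Duration
	rto := initial
	if rto > ceiling {
		rto = ceiling
	}
	for i := 0; i < retries; i++ {
		total += rto
		rto *= 2
		if rto > ceiling {
			rto = ceiling
		}
	}
	return total
}

func main() {
	for _, initial := range []time.Duration{assumedFastRTO, assumedSlowRTO} {
		total := totalRetransmitWait(initial, maxRTO, maxRetries)
		fmt.Printf("initial %v -> total wait %v\n", initial, total)
	}
}
```

Under these assumptions the total comes to roughly 13 minutes for a 200 ms start and roughly 18 minutes for a 1 s start, which brackets the approximately 15 minutes observed. Most of that time is spent in the capped 120-second waits, not in the exponential ramp.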
