Skip to content

fix: use TSMP for pings and checking reachability #11306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 2, 2024
Merged

Conversation

spikecurtis
Copy link
Contributor

@spikecurtis spikecurtis commented Dec 21, 2023

We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998

I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.

The test calls AwaitReachable() but this only tests the disco layer, so it doesn't wait for wireguard to come up.

I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.

This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.

fixes: #11294

Copy link
Contributor Author

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@spikecurtis spikecurtis force-pushed the spike/tsmp-ping branch 2 times, most recently from 5933e85 to 0576b65 Compare January 2, 2024 05:49
Signed-off-by: Spike Curtis <spike@coder.com>
@@ -186,7 +186,7 @@ func TestAgent_Stats_Magic(t *testing.T) {
// If it isn't, it's set to -1.
s.ConnectionMedianLatencyMS >= 0
}, testutil.WaitLong, testutil.IntervalFast,
"never saw stats: %+v", s,
"never saw stats",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives misleading information since the pointer is passed on the initial call to Eventually. It is not updated since <-stats returns a new pointer, rather than updating the data at the old pointer. Changed in favor of logging the relevant info as we get it.

s, ok := <-stats
t.Logf("got stats after disconnect %t, %d",
ok, s.SessionCountJetBrains)
return ok &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSMP ping counts as a "connection" in the agent stats so it never goes to zero. I'm not sure why we were testing it going to zero in the first place @code-asher

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we wanted to make sure ConnectionCount was also updating correctly, but did not realize there would be other connections going on.

@spikecurtis spikecurtis merged commit 520c3a8 into main Jan 2, 2024
@spikecurtis spikecurtis deleted the spike/tsmp-ping branch January 2, 2024 11:53
Copy link
Contributor Author

Merge activity

@github-actions github-actions bot locked and limited conversation to collaborators Jan 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flake: TestPortForward/UDP_OnePort - EOF
4 participants