Skip to content

fix: fix graceful disconnect in DialWorkspaceAgent #11993

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 5, 2024

Conversation

spikecurtis
Copy link
Contributor

@spikecurtis spikecurtis commented Feb 2, 2024

I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.

What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.

So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.

Copy link
Contributor Author

spikecurtis commented Feb 2, 2024

@spikecurtis spikecurtis marked this pull request as ready for review February 2, 2024 09:21
@spikecurtis spikecurtis force-pushed the spike/tailnet-graceful-disconnect branch from 6371570 to 308bad6 Compare February 2, 2024 09:33
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! Some comments below.

Comment on lines +91 to +106
type fakeTailnetConn struct{}

func (*fakeTailnetConn) UpdatePeers([]*proto.CoordinateResponse_PeerUpdate) error {
// TODO implement me
panic("implement me")
}

func (*fakeTailnetConn) SetAllPeersLost() {}

func (*fakeTailnetConn) SetNodeCallback(func(*tailnet.Node)) {}

func (*fakeTailnetConn) SetDERPMap(*tailcfg.DERPMap) {}

func newFakeTailnetConn() *fakeTailnetConn {
return &fakeTailnetConn{}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this live in tailnettest as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because the interface we are faking lives in codersdk, even though the "real" object we are faking lives in tailnet.

Comment on lines +93 to +96
func (*fakeTailnetConn) UpdatePeers([]*proto.CoordinateResponse_PeerUpdate) error {
// TODO implement me
panic("implement me")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call t.Fail() instead of just panicking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic is nice because it gives you a stack trace.

changing to t.Fail() won't give a stack trace, so you'll have to manually chase down how the function could have been called by your test.

<-tac.ctx.Done()
select {
case <-tac.closed:
case <-time.After(time.Second):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Non-blocking) I figure this is a best-effort situation, but will 1 second be enough? Does this need to be a configurable knob?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be plenty, even on a slow connection because we're not waiting for a reply. I definitely don't want to plumb configuration thru.

It is a best effort as you say --- consequence of not doing this is that the agent on the other side will see it as "lost" and possibly still try to handshake with it for up to 15 minutes.

@spikecurtis spikecurtis force-pushed the spike/fake-coordinator-move branch from 02f29b5 to f39414c Compare February 5, 2024 06:45
@spikecurtis spikecurtis force-pushed the spike/tailnet-graceful-disconnect branch from 308bad6 to 73cfe1a Compare February 5, 2024 06:46
@spikecurtis spikecurtis requested a review from johnstcn February 5, 2024 06:46
Copy link
Contributor Author

spikecurtis commented Feb 5, 2024

Merge activity

  • Feb 5, 4:33 AM EST: @spikecurtis started a stack merge that includes this pull request via Graphite.
  • Feb 5, 4:50 AM EST: Graphite rebased this pull request as part of a merge.
  • Feb 5, 5:01 AM EST: @spikecurtis merged this pull request with Graphite.

@spikecurtis spikecurtis force-pushed the spike/fake-coordinator-move branch from f39414c to 19806f6 Compare February 5, 2024 09:33
Base automatically changed from spike/fake-coordinator-move to main February 5, 2024 09:49
@spikecurtis spikecurtis force-pushed the spike/tailnet-graceful-disconnect branch from 73cfe1a to 33e84aa Compare February 5, 2024 09:49
@spikecurtis spikecurtis merged commit e5ba586 into main Feb 5, 2024
@spikecurtis spikecurtis deleted the spike/tailnet-graceful-disconnect branch February 5, 2024 10:01
@github-actions github-actions bot locked and limited conversation to collaborators Feb 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants