-
Notifications
You must be signed in to change notification settings - Fork 889
fix: fix graceful disconnect in DialWorkspaceAgent #11993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @spikecurtis and the rest of your teammates on |
6371570
to
308bad6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! Some comments below.
type fakeTailnetConn struct{} | ||
|
||
func (*fakeTailnetConn) UpdatePeers([]*proto.CoordinateResponse_PeerUpdate) error { | ||
// TODO implement me | ||
panic("implement me") | ||
} | ||
|
||
func (*fakeTailnetConn) SetAllPeersLost() {} | ||
|
||
func (*fakeTailnetConn) SetNodeCallback(func(*tailnet.Node)) {} | ||
|
||
func (*fakeTailnetConn) SetDERPMap(*tailcfg.DERPMap) {} | ||
|
||
func newFakeTailnetConn() *fakeTailnetConn { | ||
return &fakeTailnetConn{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this live in tailnettest
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, because the interface
we are faking lives in codersdk
, even though the "real" object we are faking lives in tailnet
.
func (*fakeTailnetConn) UpdatePeers([]*proto.CoordinateResponse_PeerUpdate) error { | ||
// TODO implement me | ||
panic("implement me") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call t.Fail()
instead of just panicking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panic
is nice because it gives you a stack trace.
changing to t.Fail()
won't give a stack trace, so you'll have to manually chase down how the function could have been called by your test.
<-tac.ctx.Done() | ||
select { | ||
case <-tac.closed: | ||
case <-time.After(time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Non-blocking) I figure this is a best-effort situation, but will 1 second be enough? Does this need to be a configurable knob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be plenty, even on a slow connection because we're not waiting for a reply. I definitely don't want to plumb configuration thru.
It is a best effort as you say --- consequence of not doing this is that the agent on the other side will see it as "lost" and possibly still try to handshake with it for up to 15 minutes.
02f29b5
to
f39414c
Compare
308bad6
to
73cfe1a
Compare
Merge activity
|
f39414c
to
19806f6
Compare
73cfe1a
to
33e84aa
Compare
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.
What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.
So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.