feat: improve coder connect tunnel handling on reconnect #17598
f66d81a to 52f1c2b
I'm lacking context on what this change fixes. The code itself looks fine and the test coverage checks out, but deferring to Dean and Spike.
…te never has them)
@@ -552,6 +554,42 @@ func (u *updater) netStatusLoop() {
}
}

// processFreshState handles the logic for when a fresh state update is received.
You should probably expand this comment to explain why this is necessary. It should mention that we only receive diffs, except for the first packet on any given reconnect to the tailnet API. Without this, we weren't processing deletes for any workspaces or agents deleted while the client was disconnected (e.g. while the computer was asleep).
require.Equal(t, aID2[:], peerUpdate.UpsertedAgents[0].Id)
require.Equal(t, hsTime, peerUpdate.UpsertedAgents[0].LastHandshake.AsTime())

require.Equal(t, aID1[:], peerUpdate.DeletedAgents[0].Id)
You should verify that there's only one upserted workspace, and zero deleted workspaces.
@@ -513,3 +719,152 @@ func setupTunnel(t *testing.T, ctx context.Context, client *fakeClient, mClock q
mgr.start()
return tun, mgr
}

func TestProcessFreshState(t *testing.T) {
Nice!
@@ -1083,6 +1086,7 @@ type WorkspaceUpdate struct {
UpsertedAgents []*Agent
DeletedWorkspaces []*Workspace
DeletedAgents []*Agent
FreshState bool
Not sure about others, but "fresh" doesn't really capture the meaning for me. An update that is not fresh sounds like it is outdated or of dubious validity, which isn't the case. I think the clearest option would be an enum:
UpdateKind: [Snapshot, Diff]
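For illustration, the enum suggested above could look roughly like this in Go. This is only a sketch: the names `UpdateKind`, `Diff`, `Snapshot`, and the `Kind` field are hypothetical and do not come from the PR, and the `WorkspaceUpdate` struct here is trimmed down to the one field under discussion.

```go
package main

import "fmt"

// UpdateKind is a hypothetical replacement for the FreshState bool.
type UpdateKind int

const (
	// Diff is the zero value, so code that never sets the field keeps
	// the existing "apply on top of known state" behavior.
	Diff UpdateKind = iota
	// Snapshot marks an update that carries the complete current state.
	Snapshot
)

// String makes the kind readable in logs and test failures.
func (k UpdateKind) String() string {
	switch k {
	case Diff:
		return "Diff"
	case Snapshot:
		return "Snapshot"
	default:
		return fmt.Sprintf("UpdateKind(%d)", int(k))
	}
}

// WorkspaceUpdate is trimmed to the single illustrative field.
type WorkspaceUpdate struct {
	Kind UpdateKind
}

func main() {
	var unset WorkspaceUpdate
	full := WorkspaceUpdate{Kind: Snapshot}
	fmt.Println(unset.Kind, full.Kind)
}
```

Making `Diff` the zero value is a design choice in this sketch, not something the review asked for; the opposite ordering would work equally well if every producer sets the field explicitly.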
}

cbUpdate := testutil.TryReceive(ctx, t, fUH.ch)
require.Equal(t, initRecvUp, cbUpdate)

// Current state should match initial
// Current state should match initial but shouldn't be a fresh state
initRecvUp.FreshState = false
This seems wrong to me. When we ask for the current state, we are getting a complete snapshot, not a diff, so it should be "fresh" in your terminology.
})
// if the workspace connected to an agent we're deleting,
// is not present in the fresh state, add it to the deleted workspaces
if _, ok := ignoredWorkspaces[agent.WorkspaceID]; !ok {
The assumption here seems to be that every deleted workspace is going to be associated with a deleted agent, which, I think, assumes that every workspace has at least one agent. That's definitely not true of stopped workspaces. I think it also technically doesn't have to be true of started workspaces (although in practice it is).
The consequence is that if a workspace that was stopped is deleted while we are disconnected, then I don't think we'll ever generate a Delete for it on the protocol and Coder Desktop will continue to think it exists. This is a good test case to have!
Basically, right now we're only storing the agents, not the workspaces. So, there is no way for us to notice that a workspace without an agent needs to be deleted. I don't think there is any way around this fact and we need to store the workspaces too in order to do the right thing.
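A rough sketch of what storing the workspaces as well could look like, so that a workspace with no agents still produces a delete. All types and names here (`state`, `Update`, `applySnapshot`, string IDs) are simplified stand-ins invented for illustration and are not the tunnel's actual types.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified, hypothetical stand-ins for the real protocol types.
type Workspace struct{ ID string }
type Agent struct{ ID, WorkspaceID string }

type Update struct {
	UpsertedWorkspaces []Workspace
	UpsertedAgents     []Agent
	DeletedWorkspaces  []Workspace
	DeletedAgents      []Agent
}

// state tracks workspaces and agents independently.
type state struct {
	workspaces map[string]Workspace
	agents     map[string]Agent
}

// applySnapshot adds a delete for everything we knew about that is
// absent from the snapshot, then replaces the stored state with it.
func (s *state) applySnapshot(snap *Update) {
	newWS := make(map[string]Workspace, len(snap.UpsertedWorkspaces))
	for _, w := range snap.UpsertedWorkspaces {
		newWS[w.ID] = w
	}
	newAgents := make(map[string]Agent, len(snap.UpsertedAgents))
	for _, a := range snap.UpsertedAgents {
		newAgents[a.ID] = a
	}
	// Workspaces are checked on their own, not via their agents.
	for id, w := range s.workspaces {
		if _, ok := newWS[id]; !ok {
			snap.DeletedWorkspaces = append(snap.DeletedWorkspaces, w)
		}
	}
	for id, a := range s.agents {
		if _, ok := newAgents[id]; !ok {
			snap.DeletedAgents = append(snap.DeletedAgents, a)
		}
	}
	// Map iteration order is random; sort for deterministic output.
	sort.Slice(snap.DeletedWorkspaces, func(i, j int) bool {
		return snap.DeletedWorkspaces[i].ID < snap.DeletedWorkspaces[j].ID
	})
	sort.Slice(snap.DeletedAgents, func(i, j int) bool {
		return snap.DeletedAgents[i].ID < snap.DeletedAgents[j].ID
	})
	s.workspaces, s.agents = newWS, newAgents
}

func main() {
	s := &state{
		workspaces: map[string]Workspace{"ws-running": {ID: "ws-running"}, "ws-stopped": {ID: "ws-stopped"}},
		agents:     map[string]Agent{"a1": {ID: "a1", WorkspaceID: "ws-running"}},
	}
	// The stopped workspace (no agents) was deleted while disconnected.
	snap := &Update{
		UpsertedWorkspaces: []Workspace{{ID: "ws-running"}},
		UpsertedAgents:     []Agent{{ID: "a1", WorkspaceID: "ws-running"}},
	}
	s.applySnapshot(snap)
	fmt.Println(snap.DeletedWorkspaces, snap.DeletedAgents)
}
```

In this sketch the stopped workspace is reported as deleted even though no agent was ever associated with it, which is the test case described above.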
Closes coder/internal#563
The Coder Connect tunnel receives workspace state from the Coder server over a dRPC stream. When first connecting to this stream, the current state of the user's workspaces is received, with subsequent messages being diffs on top of that state.
However, if the client disconnects from this stream (for example, when the user's device is suspended) and reconnects later, the tunnel has no way to distinguish the message containing the entire initial state from another diff, so that state is incorrectly applied as a diff.
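A toy illustration of the failure mode, assuming a much simpler model than the real tunnel: agents are plain string IDs in a set, and `update`, `applyDiff`, and `applySnapshot` are hypothetical helpers, not functions from this PR.

```go
package main

import "fmt"

// update is a hypothetical, minimal stand-in for a workspace update.
type update struct {
	upserted []string
	deleted  []string
}

// applyDiff layers the update on top of whatever is already known.
func applyDiff(known map[string]bool, u update) {
	for _, id := range u.upserted {
		known[id] = true
	}
	for _, id := range u.deleted {
		delete(known, id)
	}
}

// applySnapshot treats the update as the complete state: anything
// not listed in it is dropped first.
func applySnapshot(known map[string]bool, u update) {
	for id := range known {
		delete(known, id)
	}
	applyDiff(known, u)
}

func main() {
	// agent-1 was deleted while the client was disconnected, so the
	// first message after reconnect only lists agent-2.
	reconnect := update{upserted: []string{"agent-2"}}

	asDiff := map[string]bool{"agent-1": true, "agent-2": true}
	applyDiff(asDiff, reconnect)

	asSnapshot := map[string]bool{"agent-1": true, "agent-2": true}
	applySnapshot(asSnapshot, reconnect)

	fmt.Println("applied as diff:", len(asDiff), "agents; applied as snapshot:", len(asSnapshot), "agents")
}
```

Applied as a diff, the stale `agent-1` entry survives because no delete for it was ever sent; applied as a snapshot, it is removed.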
In practice:
This PR introduces a solution in which tunnelUpdater, when created, sends a FreshState flag with the WorkspaceUpdate type. This flag is handled in the vpn tunnel in the following fashion: