feat: improve coder connect tunnel handling on reconnect #17598

Open · wants to merge 6 commits into main
Conversation

@ibetitsmike ibetitsmike commented Apr 29, 2025

Closes coder/internal#563

The Coder Connect tunnel receives workspace state from the Coder server over a dRPC stream. When first connecting to this stream, the current state of the user's workspaces is received, with subsequent messages being diffs on top of that state.

However, if the client disconnects from this stream (for example, when the user's device is suspended) and later reconnects, the tunnel has no way to distinguish the reconnect message containing the entire initial state from a regular diff, so that state is incorrectly applied as a diff.

In practice:

  • Tunnel connects, receives a workspace update containing all the existing workspaces & agents.
  • Tunnel loses connection, but isn't completely stopped.
  • All the user's workspaces are restarted, producing a new set of agents.
  • Tunnel regains connection, and receives a workspace update containing all the existing workspaces & agents.
  • This initial update is incorrectly applied as a diff, with the Tunnel's state containing both the old & new agents.
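The failure mode above can be illustrated with a toy sketch (all names here are illustrative, not the PR's actual code), assuming a diff is applied by simple upsert: a reconnect snapshot merged as a diff leaves stale agents behind.

```go
package main

import "fmt"

// applyAsDiff merges an update into existing state by upserting only.
// A snapshot carries no deletions, so stale entries are never removed.
func applyAsDiff(state map[string]bool, upsertedAgents []string) {
	for _, id := range upsertedAgents {
		state[id] = true
	}
}

func main() {
	state := map[string]bool{"old-agent": true}
	// After the workspaces restart, the server's initial update on
	// reconnect lists only the new agent.
	applyAsDiff(state, []string{"new-agent"})
	fmt.Println(len(state)) // 2: both old and new agents, which is wrong
}
```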

This PR introduces a solution in which tunnelUpdater, when created, sends a FreshState flag with the WorkspaceUpdate type. The vpn tunnel handles this flag as follows:

  • Preserve existing Agents
  • Remove current Agents in the tunnel that are not present in the WorkspaceUpdate
  • Remove unreferenced Workspaces
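The three steps above amount to treating the update as a full snapshot and reconciling tracked state against it. A minimal sketch of that intent (function and type names are illustrative, not the PR's exact code):

```go
package main

import "fmt"

type Agent struct {
	ID          string
	WorkspaceID string
}

type Workspace struct{ ID string }

// processFreshState treats snapshotAgents as the complete current state:
// agents present in the snapshot are preserved, and any tracked agent or
// workspace absent from it is synthesized as a deletion.
func processFreshState(
	snapshotAgents []*Agent,
	currentAgents map[string]*Agent,
	currentWorkspaces map[string]*Workspace,
) (deletedAgents []*Agent, deletedWorkspaces []*Workspace) {
	freshAgents := make(map[string]bool)
	freshWorkspaces := make(map[string]bool)
	for _, a := range snapshotAgents {
		freshAgents[a.ID] = true
		freshWorkspaces[a.WorkspaceID] = true
	}
	// Remove current agents not present in the WorkspaceUpdate.
	for id, a := range currentAgents {
		if !freshAgents[id] {
			deletedAgents = append(deletedAgents, a)
		}
	}
	// Remove workspaces no longer referenced by any snapshot agent.
	for id, ws := range currentWorkspaces {
		if !freshWorkspaces[id] {
			deletedWorkspaces = append(deletedWorkspaces, ws)
		}
	}
	return deletedAgents, deletedWorkspaces
}

func main() {
	current := map[string]*Agent{"a1": {ID: "a1", WorkspaceID: "w1"}}
	workspaces := map[string]*Workspace{"w1": {ID: "w1"}}
	snapshot := []*Agent{{ID: "a2", WorkspaceID: "w2"}}
	da, dw := processFreshState(snapshot, current, workspaces)
	fmt.Println(len(da), len(dw)) // 1 1: old agent and its workspace deleted
}
```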

@ibetitsmike force-pushed the mike/internal-563-improve-coder-connect-tunnel-handling branch from f66d81a to 52f1c2b on April 29, 2025 11:36
@ibetitsmike changed the title from "Mike/internal 563 improve coder connect tunnel handling" to "feat: improve coder connect tunnel handling on reconnect" on Apr 29, 2025
@johnstcn (Member) left a comment:
I'm lacking context on what this change fixes. The code itself looks fine and the test coverage checks out, but deferring to Dean and Spike.

@@ -552,6 +554,42 @@ func (u *updater) netStatusLoop() {
}
}

// processFreshState handles the logic for when a fresh state update is received.

Member comment:
You should probably expand this comment to explain why this is necessary. Should mention that we only receive diffs except for the first packet on any given reconnect to the tailnet API, which means that without this we weren't processing deletes for any workspaces or agents deleted while the client was disconnected (e.g. while the computer was asleep)

require.Equal(t, aID2[:], peerUpdate.UpsertedAgents[0].Id)
require.Equal(t, hsTime, peerUpdate.UpsertedAgents[0].LastHandshake.AsTime())

require.Equal(t, aID1[:], peerUpdate.DeletedAgents[0].Id)

Member comment:
You should verify that there's only one upserted workspace, and zero deleted workspaces.

@@ -513,3 +719,152 @@ func setupTunnel(t *testing.T, ctx context.Context, client *fakeClient, mClock q
mgr.start()
return tun, mgr
}

func TestProcessFreshState(t *testing.T) {

Member comment:
Nice!

@@ -1083,6 +1086,7 @@ type WorkspaceUpdate struct {
UpsertedAgents []*Agent
DeletedWorkspaces []*Workspace
DeletedAgents []*Agent
FreshState bool

Contributor comment:
Not sure about others, but "fresh" doesn't really capture the meaning for me. An update that is not fresh sounds like it is outdated or of dubious validity, which isn't the case. I think the most clear would be an enum:

UpdateKind: [Snapshot, Diff]
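A hypothetical shape for this suggestion (the names UpdateKind, Diff, and Snapshot are the reviewer's proposal, not code from the PR) would replace the boolean with an explicit kind:

```go
package main

import "fmt"

// UpdateKind makes the semantics of a WorkspaceUpdate explicit instead of
// encoding them in a boolean flag.
type UpdateKind int

const (
	Diff     UpdateKind = iota // applied on top of existing state
	Snapshot                   // replaces existing state entirely
)

type WorkspaceUpdate struct {
	Kind UpdateKind
	// UpsertedWorkspaces, UpsertedAgents, DeletedWorkspaces,
	// DeletedAgents ... (fields elided)
}

func main() {
	u := WorkspaceUpdate{Kind: Snapshot}
	fmt.Println(u.Kind == Snapshot) // true
}
```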

}

cbUpdate := testutil.TryReceive(ctx, t, fUH.ch)
require.Equal(t, initRecvUp, cbUpdate)

// Current state should match initial
// Current state should match initial but shouldn't be a fresh state
initRecvUp.FreshState = false

Contributor comment:
This seems wrong to me. When we ask for the current state, we are getting a complete snapshot, not a diff, so it should be "fresh" in your terminology.

})
// if the workspace connected to an agent we're deleting
// is not present in the fresh state, add it to the deleted workspaces
if _, ok := ignoredWorkspaces[agent.WorkspaceID]; !ok {

Contributor comment:
The assumption here seems to be that every deleted workspace is going to be associated with a deleted agent, which, I think, assumes that every workspace has at least one agent. That's definitely not true of stopped workspaces. I think it also technically doesn't have to be true of started workspaces (although in practice it is).

The consequence is that if a workspace that was stopped is deleted while we are disconnected, then I don't think we'll ever generate a Delete for it on the protocol and Coder Desktop will continue to think it exists. This is a good test case to have!

Basically, right now we're only storing the agents, not the workspaces. So, there is no way for us to notice that a workspace without an agent needs to be deleted. I don't think there is any way around this fact and we need to store the workspaces too in order to do the right thing.
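The reviewer's point can be sketched as follows, assuming the tunnel tracks workspace IDs directly rather than inferring workspaces from agents (all names are illustrative): an agentless (stopped) workspace deleted while disconnected then still produces a delete on the next snapshot.

```go
package main

import "fmt"

// missingWorkspaces returns every tracked workspace ID absent from the
// fresh snapshot. Because it compares workspace IDs directly, a workspace
// with zero agents is still noticed when it disappears.
func missingWorkspaces(known map[string]bool, snapshotWorkspaces []string) []string {
	seen := make(map[string]bool)
	for _, id := range snapshotWorkspaces {
		seen[id] = true
	}
	var deleted []string
	for id := range known {
		if !seen[id] {
			deleted = append(deleted, id)
		}
	}
	return deleted
}

func main() {
	// w2 was stopped (no agents) and deleted while the client was
	// disconnected; it appears nowhere in the fresh snapshot, but we
	// still emit a delete for it.
	known := map[string]bool{"w1": true, "w2": true}
	fmt.Println(missingWorkspaces(known, []string{"w1"})) // [w2]
}
```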

Successfully merging this pull request may close these issues.

Improve Coder Connect tunnel reconnect handling
4 participants