Switch to Tailscale for agent networking #2779

Closed
@kylecarbs

Problem

Users have been reporting connection issues:

Some additional issues that haven't been reported:

  • Throughput is ~30mb/s (improving seems like it'd require substantial effort)
  • The initial connection takes a few hundred milliseconds (I believe @mafredri reported this actually)

And the debt we take on:

  • Maintaining ~3000 lines of networking code that requires very specific domain knowledge to edit
  • Intermittent disconnects have been occurring, and because we lack that domain knowledge we have limited capacity to debug them

Definition of Done

Our current networking stack with WebRTC is not durable, fast, or resilient. To fix all the issues listed, I propose we switch to Tailscale's OSS implementation on top of WireGuard. This will remove a significant portion of the effort we spend on networking, and let us focus on improving the fundamentals where we have higher leverage.

Read this blog post by Tailscale to understand how it works: https://tailscale.com/blog/how-tailscale-works/

A proof of concept and partial implementation are merged into the product under a hidden flag for SSH: coder config-ssh --wireguard. It appears to resolve the problems listed above. The code can be viewed here.

FAQ

How do users upgrade from WebRTC to WireGuard?

Once we remove WebRTC, existing agents will be unable to connect to Coder. Those endpoints will be removed, and workspaces will require a stop -> start to begin working again. This could be eased by doing the removal over a few releases, but I think making it explicit in the release notes would be better.

Does this talk to Tailscale's hosted offering at all?

Nope. A DERP server is embedded in the Coder server, just like how TURN has worked. In the future, we'll allow users to add the Tailscale DERP mapping to their Coder deployment to proxy through Tailscale's relays.

How does this change our architecture?

The architecture diagram here stays exactly the same. DERP is used instead of TURN, and it already uses HTTP(s) as its protocol, so this should simplify and reduce proxying.

Does this still require significant domain knowledge to edit?

Yes, but the amount of code shrinks substantially and the architecture keeps the same shape, so the surface area of the problem should shrink too.

Does this prevent us from using WebRTC from the browser for P2P connections in the future?

Yes, but that hasn't been possible anyway because of our TURN over WebSockets proxying.

How long will we dogfood the Tailscale networking before removing WebRTC?

It's very similar architecturally, so it's not expected we'll run into significant hurdles. We'll make this the default on dogfood over break, and will likely cut a new release containing it as default the week we're back. We've been dogfooding it on dev.coder.com for ~1 week already.

Does WireGuard require anything in the kernel?

Nope. Tailscale uses wireguard-go in userspace.

How do multiple regions work?

Just as Tailscale describes in their post. Users will specify a region name for each of their Coder instances (in an HA deployment), and they will mesh together.

What firewall ports are required?

It's just exposing an HTTP(s) server, nothing extra!
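
As a minimal sketch of why no extra ports are needed: the relay can be mounted as one more route on the same HTTP(s) server that serves the API. The `/api/` and `/derp` paths, the `Upgrade` header check, and the helper names below are assumptions for illustration; the relay handler is a stub, not Tailscale's DERP implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHandler builds one handler that serves both the API and the relay.
// The paths and the upgrade check are assumptions for illustration.
func newHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "api ok")
	})
	mux.HandleFunc("/derp", func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Upgrade") == "" {
			http.Error(w, "relay requires a connection upgrade", http.StatusUpgradeRequired)
			return
		}
		// A real relay would take over the connection here and forward
		// encrypted packets between peers. This stub only marks the spot.
		w.WriteHeader(http.StatusNotImplemented)
	})
	return mux
}

// statusFor sends an in-memory request and reports the response code.
func statusFor(h http.Handler, path, upgrade string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if upgrade != "" {
		req.Header.Set("Upgrade", upgrade)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newHandler()
	fmt.Println(statusFor(h, "/api/health", "")) // plain API request
	fmt.Println(statusFor(h, "/derp", ""))       // relay without upgrade
	fmt.Println(statusFor(h, "/derp", "derp"))   // relay with upgrade (stubbed)
}
```

Because both routes share one listener, a firewall or reverse proxy only needs to allow the port the Coder server already exposes.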

Does this require an external IP address? (if workspaces are running on the same node as Coder)

Nope. Just like the current networking stack, everything should just work locally too.
