Filter out devices in tailnet that have an MTU that is too small #13327

Closed

deansheather opened this issue May 21, 2024 · 7 comments
Labels
networking (Area: networking), s2 (Broken use cases or features, with a workaround)

Comments


deansheather commented May 21, 2024

On devices with an MTU that is too small, a direct connection may be established (i.e. disco packets fit), but every packet carrying app data is dropped, which renders the connection unusable.

This can happen on Cloudflare WARP VPN.

We should block these devices from being used for direct connections:

  • When binding the UDP socket, do not bind to any devices that have an MTU below the minimum
  • Don't share endpoints gathered locally from these devices

Background

Relates to https://github.com/coder/customers/issues/592

Inside the Wireguard tunnel to the workspace, we use IPv6 as the network-layer protocol. IPv6 requires a minimum MTU of 1280 octets for any link that carries it, so we set the virtual Wireguard TUN to an MTU of 1280 octets, and gVisor's tcpip stack keeps the inner IPv6 packets at or below that size. However, these packets then need to get encapsulated in Wireguard and, in the case of direct connections, further encapsulated in UDP & IP and sent over the outer network interface. With this overhead, typical TCP packets come out to 1338 octets. Thus, if the outer network interface has an MTU smaller than this, the packets are dropped.

It is not possible to reconfigure the inner protocols to use a smaller MTU because 1280 is the minimum for IPv6 according to the IPv6 specs.

gVisor is actually a bit conservative in its choice of segment size, leaving some extra room for TCP & IP options, so we may need greater than 1338 minimum to ensure packets aren't dropped.

Additional Requirements:

  • Update our documentation to note this limitation
  • Update our netcheck command to highlight devices which are being skipped
@deansheather added the networking (Area: networking) and bug labels May 21, 2024
@coder-labeler bot added the s3 (Bugs that confuse, annoy, or are purely cosmetic) label May 21, 2024
@deansheather added the s2 (Broken use cases or features, with a workaround) label and removed the s3 label May 21, 2024
@bpmct added and then removed the must-do (Issues that must be completed by the end of the Sprint) label May 31, 2024
@spikecurtis (Contributor)

gVisor is nominally permitted to send packets up to 1280 octets.

  • Wireguard adds 30 bytes (1310)
  • UDP adds 8 bytes (1318)
  • IP adds 20-60 bytes (1338-1378)

So, I think it really needs to be 1378 to be totally safe.

@spikecurtis (Contributor)

Unfortunately, we can't filter out the devices at the bind, since tailscale binds to 0.0.0.0, meaning all interfaces.

We could potentially intercept and squash Disco packets coming from these interfaces, or artificially inflate Disco packets to the maximum size we expect from Wireguard (1310), such that only paths that handle packets that big can get discovered for p2p.

@spikecurtis (Contributor) commented Jun 7, 2024

Disco doesn't have any length parameter built in, so if we padded the messages we'd have to change the protocol. That's a nightmare from a backward-compatibility standpoint.

Alternatively, we could figure out how to predict the length of the ciphertext from the plaintext, and pad before we encrypt with NaCl.

@spikecurtis (Contributor)

Intercepting and squashing Disco packets from particular interfaces is quite difficult because of the way UDP sockets work. For incoming UDP packets we can get their source IP (where they are from), but don't normally get the destination IP or interface. There are some low-level system calls you can make to get this information, but they're not portable, so we'd be handling multiple OS implementations.

The next idea I had was to figure out which interface we'd be sending replies via by looking at the routing table. But again, there is no portable implementation. We'd have to do multiple OS implementations.

@spikecurtis (Contributor)

Looks like we can pad out Disco pings and pongs prior to encrypting with NaCl. (CallMeMaybe is only ever sent over DERP.)

This has some performance implications, but Disco packets don't get sent that often, so it's probably fine.

@spikecurtis (Contributor)

Testing on Linux, I've discovered that with the inflated ping/pong packets, the Don't Fragment (DF) flag only gets set if the packet is smaller than the interface MTU. If the interface MTU is smaller than the packet, then it fragments the packet.

  1. I think this means that the strategy of inflating the ping/pong packets doesn't really work on its own to keep Tailscale from using interfaces with small MTU. It just fragments and reassembles the packet.

  2. It also means that our understanding of the Cloudflare WARP VPN issue might be incomplete. The OS might have been fragmenting the packets --- which is bad for performance, sure, but I don't think it should completely break connectivity. It's possible that other OSes act differently, e.g. dropping UDP datagrams too big for the MTU rather than fragmenting, or that WARP refuses to forward fragmented packets (very odd if true). Or maybe fragmentation hurts performance so badly that it brings TCP to its knees?

So, possible next steps:

There is a DONT_FRAGMENT socket option we could try setting. Not sure how portable it is, but in theory it might convince Linux to drop too-large packets instead of fragmenting them. Setting that on the tailscale UDP socket is basically us saying that we should always prefer DERP to a fragmented direct path, which may or may not be faster.

This whole thing is honestly turning into a boondoggle, IMO. We thought it would be easy, and prevent some customer heartburn by automatically preferring DERP over paths with "bad" MTU, but it's turning out to be very difficult and involved. Customers can already work around the issue by setting DISABLE_DIRECT, so maybe we should cut our losses: drop an MTU check into coder netcheck / coder support bundle and call it a day?

@spikecurtis (Contributor)

@sreya what say we close this issue after #13562 is merged? I don't think there are any tailnet changes we can make that would be worth our time.

spikecurtis added a commit that referenced this issue Jun 13, 2024
re: #13327

Adds local interfaces to `coder netcheck` and checks their MTUs for potential problems.

This is mostly relevant for end-user systems where VPNs are common.  We _could_ also add it to coderd healthcheck, but until I see coderd connecting to workspaces over a VPN in the wild, I don't think it's worth the UX effort.

Netcheck results get the following:

```
  "interfaces": {
    "error": null,
    "severity": "ok",
    "warnings": null,
    "dismissed": false,
    "interfaces": [
      {
        "name": "lo0",
        "mtu": 16384,
        "addresses": [
          "127.0.0.1/8",
          "::1/128",
          "fe80::1/64"
        ]
      },
      {
        "name": "en8",
        "mtu": 1500,
        "addresses": [
          "192.168.50.217/24",
          "fe80::c13:1a92:3fa5:dd7e/64"
        ]
      }
    ]
  }
```

_Technically_ not back compatible if anyone is parsing `coder netcheck` output as JSON, since the original output is now under `"derp"` in the output.