
refactor: add postgres tailnet coordinator #8044


Merged
spikecurtis merged 13 commits into main from spike/ha-coordinator on Jun 21, 2023

Conversation

spikecurtis
Contributor

cf. #7295 and the RFC

This PR adds a postgres-backed, enterprise, distributed tailnet coordinator.

Apologies for the large-ish PR, but this felt like the smallest amount of code that could be coherently understood and reviewed.

This PR adds the new coordinator, but leaves the old, pubsub-only coordinator, and does not wire up the new coordinator to anything. Once this is merged, in a follow-on PR, I plan to introduce the postgres coordinator behind an experiment flag. That will allow me to test the performance impact of the new coordinator in a scale test. Then, I suggest we enable the experiment on our dogfood instance and test for a while.

Other items missing from this PR that will be added later:

  • the ServeHTTPDebug endpoint
  • cleaning out orphaned state in the database from coordinators that crashed/exited before deleting their bindings (a hypothetical sketch of this follows the list)
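
For the second deferred item, one plausible shape of that cleanup is sketched below. This is a hypothetical illustration, not the PR's design: the expiry policy and the reliance on ON DELETE CASCADE foreign keys from a bindings table are assumptions.

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// cleanupExpiredCoordinators is a hypothetical sketch of the deferred
// cleanup item above. Deleting an expired tailnet_coordinators row would
// only remove the orphaned bindings it left behind if the bindings table
// has an ON DELETE CASCADE foreign key to it (an assumption here).
func cleanupExpiredCoordinators(ctx context.Context, db *sql.DB, expiry time.Duration) error {
	_, err := db.ExecContext(ctx, `
		DELETE FROM tailnet_coordinators
		WHERE heartbeat_at < now() - $1::interval`,
		fmt.Sprintf("%d seconds", int(expiry.Seconds())))
	return err
}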

Signed-off-by: Spike Curtis <spike@coder.com>
@cdr-bot cdr-bot bot added the community Pull Requests and issues created by the community. label Jun 15, 2023
@spikecurtis spikecurtis removed the community Pull Requests and issues created by the community. label Jun 15, 2023
Comment on lines 607 to 653
func (q *querier) newConn(c *connIO) {
	q.mu.Lock()
	defer q.mu.Unlock()
	mk := mKey{
		agent: c.agent,
		// if client is Nil, this is an agent connection, and it wants the mappings for all the clients of itself
		clientsOfAgent: c.client == uuid.Nil,
	}
	cm, ok := q.mappers[mk]
	if !ok {
		ctx, cancel := context.WithCancel(q.ctx)
		mpr := newMapper(ctx, q.logger, mk, q.heartbeats)
		cm = &countedMapper{
			mapper: mpr,
			count:  0,
			cancel: cancel,
		}
		q.mappers[mk] = cm
		// we don't have any mapping state for this key yet
		q.workQ.enqueue(mk)
	}
	if err := sendCtx(cm.ctx, cm.add, c); err != nil {
		return
	}
	cm.count++
	go q.waitForConn(c)
}

func (q *querier) waitForConn(c *connIO) {
	// block until the connection's context expires, then clean up
	<-c.ctx.Done()
	q.mu.Lock()
	defer q.mu.Unlock()
	mk := mKey{
		agent: c.agent,
		// if client is Nil, this is an agent connection, and it wants the mappings for all the clients of itself
		clientsOfAgent: c.client == uuid.Nil,
	}
	cm := q.mappers[mk]
	if err := sendCtx(cm.ctx, cm.del, c); err != nil {
		return
	}
	cm.count--
	if cm.count == 0 {
		cm.cancel()
		delete(q.mappers, mk)
	}
}
Member

The locking logic here between newConn and waitForConn is tricky to follow. Is there a particular reason we lock, spawn a new goroutine, and unlock, only to immediately lock and unlock again?

Contributor Author

Maybe a rename would make this clearer --- waitForConn doesn't do anything immediately; it waits for the connIO's context to expire, and then cleans it up. It's on a separate goroutine so that we don't block newConn from returning.

How do you feel about cleanupConn()?

Member

Ah, the key here is <-c.ctx.Done() -- that's easy to skip over.
cleanupConn does make more sense in this context.

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis removed the community Pull Requests and issues created by the community. label Jun 20, 2023
Signed-off-by: Spike Curtis <spike@coder.com>
Member

@kylecarbs kylecarbs left a comment

I haven't had an opportunity to look through pgcoord thoroughly, but this looks good from a brief glance. My only comment is related to duplicated heartbeats!

Comment on lines 3 to 6
CREATE TABLE tailnet_coordinators (
	id uuid NOT NULL PRIMARY KEY,
	heartbeat_at timestamp with time zone NOT NULL
);
Member

Could this be integrated into the replicas table somehow? Feels unnecessary to have both essentially doing the same thing (heartbeating).

Contributor Author

Maybe! I'll take a deeper look at the replicas stuff and see.

Contributor

@coadler coadler Jun 20, 2023

One positive to keeping this out of the replicas table is that each update requires copying the entire row, so the smaller the row, the less garbage there is to clean up.
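
For concreteness, here is a minimal sketch of the heartbeat loop this table implies; the shape is assumed, not the PR's actual code. Each coordinator periodically upserts its own row, and peers treat a row whose heartbeat_at is stale as a dead coordinator.

package main

import (
	"context"
	"database/sql"
	"time"

	"github.com/google/uuid"
)

// heartbeatLoop is an assumed sketch: each coordinator keeps its own row
// fresh. A coordinator that stops ticking will eventually be considered
// expired by its peers.
func heartbeatLoop(ctx context.Context, db *sql.DB, id uuid.UUID, period time.Duration) {
	tick := time.NewTicker(period)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			// upsert our own heartbeat row; errors would be
			// logged and retried in a real coordinator
			_, _ = db.ExecContext(ctx, `
				INSERT INTO tailnet_coordinators (id, heartbeat_at)
				VALUES ($1, now())
				ON CONFLICT (id) DO UPDATE SET heartbeat_at = now()`,
				id)
		}
	}
}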


Member

@mtojek mtojek left a comment

Actually, this is a new feature/implementation of the coordinator rather than a refactoring. Keep in mind that it is hard to give sensible feedback due to the size and complexity of the coordinator, but I did my best.

I went through the RFC first to understand implemented ideas like withdrawals or mappings. I have a few high-level questions:

  1. Which database issues will be harmful to the PG coordinator? Are there any it won't be able to recover from without a restart?
  2. Considering the number of bugs you're fixing, I have a feeling that we should open-source it instead of putting it in the enterprise edition. I assume that it is better than the in-memory implementation, right?
  3. Did you think about the load testing plan to make sure that it works as intended and can handle expected traffic?
  4. Security related: is it possible to sabotage the coordinator by sending fake/invalid queries? I know that this implementation is not purely distributed, so it can't be susceptible to typical Sybil attacks, but could it be possible to disconnect/isolate an agent or force it to connect to a node managed/taken over by an attacker?

}

func (q *querier) listenClient(_ context.Context, msg []byte, err error) {
	if xerrors.Is(err, pubsub.ErrDroppedMessages) {
Member

If listenClient expects only pubsub.ErrDroppedMessages, shouldn't it be extracted to another wrapping function?

Contributor Author

I'm afraid I don't understand what you're suggesting. Can you elaborate?

Member

Something like this to ensure type-safety on the method declaration level:

func (q *querier) listenClientRaw(_ context.Context, msg []byte, err error) {
	if err != nil && !xerrors.Is(err, pubsub.ErrDroppedMessages) {
		return
	}
	droppedMessagesErr := err.(pubsub.ErrDroppedMessages)

	q.listenClient(msg, droppedMessagesErr)
}

func (q *querier) listenClient(msg []byte, err pubsub.ErrDroppedMessages) {
	...
}

Contributor Author

listenClient() is a pubsub listener with an error callback. The pubsub either sends a message or an error: if err is non-nil, then msg is zero length and not an actual pubsub message.

Right now, ErrDroppedMessages is the only error the pubsub can send, and the only one this code handles.
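
For illustration, here is a self-contained sketch of that msg-or-err contract; the names are stand-ins rather than coder's actual pubsub package, and ErrDroppedMessages is assumed to be a sentinel error value (hence errors.Is rather than a type assertion).

package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrDroppedMessages stands in for pubsub.ErrDroppedMessages, assumed here
// to be a sentinel error value.
var ErrDroppedMessages = errors.New("dropped messages")

// listener mirrors the contract described above: on each call, either msg
// holds a real notification payload, or err is non-nil and msg is empty.
func listener(_ context.Context, msg []byte, err error) {
	if errors.Is(err, ErrDroppedMessages) {
		fmt.Println("pubsub fell behind; resync mappings from the database")
		return
	}
	if err != nil {
		// no other errors are expected from the pubsub today
		fmt.Println("unexpected pubsub error:", err)
		return
	}
	fmt.Printf("notification: %s\n", msg)
}

func main() {
	listener(context.Background(), []byte("client updated"), nil)
	listener(context.Background(), nil, ErrDroppedMessages)
}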

@spikecurtis
Contributor Author

  1. Which database issues will be harmful to the PG coordinator? Are there any it won't be able to recover from without a restart?

The usual suspects: data corruption, incompatible schema changes, etc.

In particular, it is designed to recover gracefully from temporary loss of connectivity to the database. When the pubsub reconnects, it triggers a resync of mappings. Errors writing bindings are retried with a backoff.
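
A sketch of that retry behavior, with the shape assumed rather than taken from the PR's code:

package main

import (
	"context"
	"time"
)

// retryWithBackoff retries writeBinding with exponentially increasing
// delays until it succeeds or ctx is canceled. The delays and cap are
// illustrative values, not the PR's actual tuning.
func retryWithBackoff(ctx context.Context, writeBinding func() error) error {
	delay := 100 * time.Millisecond
	const maxDelay = 10 * time.Second
	for {
		if err := writeBinding(); err == nil {
			return nil
		}
		// wait out the backoff, but give up promptly on cancellation
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}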

@spikecurtis
Contributor Author

Considering the number of bugs you're fixing, I have a feeling that we should open-source it instead of putting it in the enterprise. I assume that it is better than the in-memory implementation, right?

I think our current position is that features supporting multiple Coderd replicas are Enterprise licensed.

If you have only one Coderd, then the in-memory coordinator is totally sufficient, and will in fact have less latency to set up a new connection.

@spikecurtis
Contributor Author

Did you think about the load testing plan to make sure that it works as intended and can handle expected traffic?

After merging, I plan to enable it behind an experiment flag and then run some load tests comparing it to the old haCoordinator. I expect it to be a little more chatty with the database, on account of us pushing the state there, but hopefully not too different. The goal here was correctness, not increased performance.

@spikecurtis
Contributor Author

Security related: is it possible to sabotage the coordinator by sending fake/invalid queries? I know that this implementation is not purely distributed, so it can't be susceptible to typical Sybil attacks, but could it be possible to disconnect/isolate an agent or force it to connect to a node managed/taken over by an attacker?

The coordinator facilitates connectivity between clients and agents. The Coderd API authenticates and authorizes connections into the coordinator, so by design we trust the information being provided by clients and agents. It's of course true that an employee or contractor could attempt to act maliciously, as an insider attack. What could such a malicious actor do?

  1. Spam node updates to try to DoS the coordinator or database: they could certainly eat CPU on the coordinator server with us deserializing node updates over and over, but we only allow one DB upsert in flight for each connection, so we limit the pressure on the database (see the sketch after this list).

  2. Send malformed data: this will result in deserialization errors, which trigger the coordinator to disconnect.

  3. Send node updates with spoofed IP addresses: this will cause the peer to send Disco (discovery) packets to the spoofed IPs. But an insider with access to Coder can easily send packets wherever they want by directly opening a socket; there is no need to involve the coordinator. The coordinator itself never sends Disco packets.
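
To illustrate point 1, here is a hypothetical sketch of the one-upsert-in-flight idea. The binder name and channel-based coalescing are illustrative, not the PR's actual implementation: incoming node updates overwrite a one-slot buffer, so a flood of updates coalesces into at most one pending write while an upsert is in flight.

package main

import (
	"context"
	"fmt"
	"time"
)

type binder struct {
	latest chan string // holds at most one pending node update
}

// update replaces any pending value instead of queueing unboundedly.
func (b *binder) update(node string) {
	for {
		select {
		case b.latest <- node:
			return
		default:
		}
		// buffer full: discard the stale pending update and retry
		select {
		case <-b.latest:
		default:
		}
	}
}

// writeLoop performs one upsert at a time, so database pressure is bounded
// by write latency rather than by the rate of incoming node updates.
func (b *binder) writeLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case node := <-b.latest:
			time.Sleep(50 * time.Millisecond) // stand-in for the DB upsert
			fmt.Println("upserted:", node)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	b := &binder{latest: make(chan string, 1)}
	go b.writeLoop(ctx)
	for i := 0; i < 100; i++ {
		b.update(fmt.Sprintf("node-v%d", i)) // most of these coalesce away
	}
	<-ctx.Done()
}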

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@mtojek
Member

mtojek commented Jun 21, 2023

I think our current position is that features supporting multiple Coderd replicas are Enterprise licensed.

I see. It would be awesome to "port" any extra fixes you identified in the existing coordinator to the in-memory one (if you fixed any).

The goal here was correctness, not increased performance.

Hopefully not decreased :)

Send malformed data: This will result in deserialization errors, which triggers the coordinator to disconnect.

Do you think that we should ensure integrity by signing the exchanged data? I agree that this is a topic for an open discussion, but it may be asked during a security review on the customer side.

as an insider attack

I was also thinking about a different vector. Assuming that the attacker has access to the database, are they able to reconfigure agents or coordinators by spoofing mapping entries directly in the database? I know that the attacker can do other bad things, but I was curious whether it is possible to perform a truly stealthy attack against the customer infra. I apologize if this is a silly question, but it is better to be safe than sorry.

@spikecurtis
Contributor Author

Do you think that we should ensure any integrity by signing exchanged data? I agree that this is a topic for an open discussion, but it can be asked during any security review on the customer side.

These data are exchanged over a websocket which, for any serious deployment, will be TLS-protected and therefore integrity-protected.

@spikecurtis
Contributor Author

I was thinking also about a different vector. Assuming that the attacker has access to the database, are they able to reconfigure agents or coordinators by spoofing mapping entries directly in the database? I know that the attacker can do other bad things, but was curious if it is possible to perform a truly stealth attack against the customer infra. I apologize if this is a silly question, but it is better to be safe than sorry.

It's not a silly question, but the answer is that we consider database access as complete compromise of the system. It would be possible to stealthily attack the system in numerous ways that I won't get into here.

@spikecurtis spikecurtis changed the title from "refactor: adds postgres tailnet coordinator" to "refactor: add postgres tailnet coordinator" Jun 21, 2023
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis merged commit cc17d2f into main Jun 21, 2023
@spikecurtis spikecurtis deleted the spike/ha-coordinator branch June 21, 2023 12:21
@github-actions github-actions bot locked and limited conversation to collaborators Jun 21, 2023