Skip to content

fix: use explicit api versions for agent and tailnet #15508

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 15, 2024

Conversation

spikecurtis
Copy link
Contributor

@spikecurtis spikecurtis commented Nov 14, 2024

Bumps the Tailnet and Agent API version 2.3, and creates some extra controls and machinery around these versions.

What happened is that we accidentally shipped two new API features without bumping the version. ScriptCompleted on the Agent API in Coder v2.16 and RefreshResumeToken on the Tailnet API in Coder v2.15.

Since we can't easily retroactively bump the versions, we'll roll these changes into API version 2.3 along with the new WorkspaceUpdates RPC, which hasn't been released yet. That means there is some ambiguity in Coder v2.15-v2.17 about exactly what methods are supported on the Tailnet and Agent APIs. This isn't great, but hasn't caused us major issues because

  1. RefreshResumeToken is considered optional, and clients just log and move on if the RPC isn't supported.
  2. Agents basically never get started talking to a Coderd that is older than they are, since the agent binary is normally downloaded from Coderd at workspace start.

Still it's good to get things squared away in terms of versions for SDK users and possible edge cases around client and server versions.

To mitigate against this thing happening again, this PR also:

  1. adds a CODEOWNERS for the API proto packages, so I'll review changes
  2. defines interface types for different API versions, and has the agent explicitly use a specific version. That way, if you add a new method, and try to use it in the agent without thinking explicitly about versions, it won't compile.

With the protocol controllers stuff, we've sort of already abstracted the Tailnet API such that the interface type strategy won't work, but I'll work on getting the Controller to be version aware, such that it can check the API version it's getting against the controllers it has -- in a later PR.

Copy link
Contributor Author

spikecurtis commented Nov 14, 2024

@spikecurtis spikecurtis requested review from deansheather, johnstcn and mafredri and removed request for deansheather November 14, 2024 09:51
@spikecurtis spikecurtis marked this pull request as ready for review November 14, 2024 09:52
@spikecurtis spikecurtis force-pushed the spike/agent-tailnet-explicit-versions branch from 59c864e to 0dc33fb Compare November 14, 2024 09:57
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments but nothing blocking from my side.

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR cover the case where proto changes are introduced without bumping the version? If it is I didn't quite catch how that results in a build failure. (I.e. is anything stopping me from appending a new method to DRPCAgentClient23?)

Other than that, looks good. If we can automate more of this process in the future that'd be good. 👍🏻

return nil, nil, err
}
return proto.NewDRPCAgentClient(conn), tailnetproto.NewDRPCTailnetClient(conn), nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could some or most of these changes be automated via code-gen? How easy is it for someone unfamiliar with this domain to add a new API version without forgetting something in the process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to automate directly from the .proto because it requires knowing which methods were added in which version.

We could annotate the RPCs somehow, in comments, but I'm hesitant to do that.

  1. As a matter of automation, I don't think the time spent to implement will save us time overall, considering this is something we change a few times per year.
  2. What I'm doing here only covers API changes that add new methods. You could also imagine adding new fields to existing messages in a back-compatible way.
  3. I don't want to entrench us too much into this particular paradigm by building a bunch of infrastructure around it. I'm not totally sure this is the right long-term way to manage the versions, so I want to keep it lightweight. We might change it depending on future changes to the API.

@spikecurtis
Copy link
Contributor Author

Does this PR cover the case where proto changes are introduced without bumping the version? If it is I didn't quite catch how that results in a build failure. (I.e. is anything stopping me from appending a new method to DRPCAgentClient23?)

My assumption is that we just edited agent.proto and tailnet.proto, then ran make gen and forgot/didn't know we needed to care about the version. So, having a separate DRPCAgentClient23 piped around the agent means that when you go to use your fancy new method, it isn't on the interface and your build will fail. You could just add it to the interface manually, but my hope is that the comment on that interface makes it a bit more obvious that this is a versioned interface and you shouldn't do that. Before we were just passing around the drpc.Conn, and constructing a DRPCAgentClient, which is autogenerated from the .proto, so it was very easy not to realize the protocol is versioned.

@spikecurtis spikecurtis force-pushed the spike/14730-set-dns branch 2 times, most recently from 700a82f to 0d7aa15 Compare November 14, 2024 12:16
@spikecurtis spikecurtis force-pushed the spike/agent-tailnet-explicit-versions branch from 0dc33fb to 50f124f Compare November 14, 2024 12:17
@spikecurtis spikecurtis force-pushed the spike/agent-tailnet-explicit-versions branch from 50f124f to 7b37b40 Compare November 14, 2024 13:03
@mafredri
Copy link
Member

mafredri commented Nov 14, 2024

@spikecurtis ok, thanks for the explanation. This got me thinking though, could we instead make the protocol version a type alias so you don't need to change all the code every time you're switching? I.e.

type DRPCAgentClientLatest = DRPCAgentClient23

func (c *Client) ConnectRPCLatest(...) { return c.ConnectRPC23(...) }

Edit: Thinking about this some more, this would essentially automatically bump the protocol in the common case for all consumers when you change the alias. Maybe that's not what we want. I was hoping to avoid the yank renaming across the code-base but may be unavoidable without some rethinking of the API.

@Emyrk
Copy link
Member

Emyrk commented Nov 14, 2024

@spikecurtis ok, thanks for the explanation. This got me thinking though, could we instead make the protocol version a type alias so you don't need to change all the code every time you're switching? I.e.

type DRPCAgentClientLatest = DRPCAgentClient23

func (c *Client) ConnectRPCLatest(...) { return c.ConnectRPC23(...) }

Edit: Thinking about this some more, this would essentially automatically bump the protocol in the common case for all consumers when you change the alias. Maybe that's not what we want. I was hoping to avoid the yank renaming across the code-base but may be unavoidable without some rethinking of the API.

@mafredri I think if you did this, maybe we could do something with golden files and dump a JSON representation of the interface somewhere. If it changes, via a argument field change, new method, etc. The test would fail until you update-golden-files again.

Just an idea to "snapshot" what the API is from a JSON POV and dump it into a test that can detect changes.

@spikecurtis I know you mentioned it might not be worth spending a lot of time on this infrastructure. I'll defer to ya'll on it.

@spikecurtis
Copy link
Contributor Author

maybe we could do something with golden files and dump a JSON representation of the interface somewhere. If it changes, via a argument field change, new method, etc. The test would fail until you update-golden-files again.

Just an idea to "snapshot" what the API is from a JSON POV and dump it into a test that can detect changes.

Look, if we wanted to snapshot the API definition, we could just make a copy of the .proto file. No need to involve JSON.

This isn't like a "golden file" where you're testing that the code gives some complex output that's easier to just compare wholesale rather than asserting individual facts about it.

We could stash a snapshot of the current version in testdata and directly compare, so that if you edited the proto without creating a new version, tests would fail. Personally, I think CODEOWNERS should help here, but if y'all really want another robot to complain to you about stuff, we can do it.

@Emyrk
Copy link
Member

Emyrk commented Nov 14, 2024

we could just make a copy of the .proto file. No need to involve JSON.

Oh yea 🤦. Nah, this is all fine with me 👍

@mafredri
Copy link
Member

Let's go with this for now, we can always revisit if the process is too cumbersome or has other shortcomings 👍🏻.

@spikecurtis spikecurtis changed the base branch from spike/14730-set-dns to graphite-base/15508 November 15, 2024 07:00
@spikecurtis spikecurtis force-pushed the spike/agent-tailnet-explicit-versions branch from 7b37b40 to 7d0d79d Compare November 15, 2024 07:00
@spikecurtis spikecurtis changed the base branch from graphite-base/15508 to main November 15, 2024 07:01
@spikecurtis spikecurtis force-pushed the spike/agent-tailnet-explicit-versions branch from 7d0d79d to 46718e6 Compare November 15, 2024 07:01
@spikecurtis spikecurtis merged commit 4080295 into main Nov 15, 2024
27 checks passed
Copy link
Contributor Author

Merge activity

  • Nov 15, 2:16 AM EST: A user merged this pull request with Graphite.

@spikecurtis spikecurtis deleted the spike/agent-tailnet-explicit-versions branch November 15, 2024 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants