-
Notifications
You must be signed in to change notification settings - Fork 876
feat: add resume support to coordinator connections #14234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Inspired by other real time apps, the coordinator RPC API now has a "RefreshResumeToken" RPC that issues a JWT that can be used on reconnect to persist the same client peer ID.
tailnet/resume.go
Outdated
|
||
type resumeTokenPayload struct { | ||
PeerID uuid.UUID `json:"peer_id"` | ||
Expiry time.Time `json:"expiry"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use industry-standard JWT claims like iss
, sub
, exp
, nbf
and aud
here, and use the library "github.com/golang-jwt/jwt/v4"
(as we already use for licenses) for building and verifying the JWT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also already use go-jose for workspaceapps. We probably should only use one library but I'm not going to fix it in this PR, I'll open a security ticket to remove one or the other
tailnet/resume.go
Outdated
expiry time.Duration | ||
} | ||
|
||
func NewResumeTokenKeyProvider(key [64]byte, expiry time.Duration) ResumeTokenProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be able to accept multiple keys, each identified by a keyID (JOSE header kid
), or we won't be able to rotate keys. There should be one signing key (possibly just document the first key is used for signing by convention), and zero or more additional keys that can be used for verification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to support rotating keys as invalid JWTs just result in reconnects getting assigned a new peer ID. The workspaceapps signing key is similar so we also don't support rotation for those. If people want to regenerate it, they can delete the row and restart coder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Delete the row and restart Coder," without proper key rotation is going to cause the very issue we are trying to fix on a deployment wide scale: every client will have to reconnect to the control plane, have its token rejected by the new key, get a new PeerID, and then drop connection to the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no customers right now who are constantly rotating the workspaceapps key which is another JWT signing key for a similar purpose, so I don't think customers are going to rotate this either. I think we can cross this bridge when we get to it rather than implementing multiple key management and rotation now.
If we supported adding multiple keys we would need some mechanism to add/remove them as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no customers right now who are constantly rotating the workspaceapps key which is another JWT signing key for a similar purpose, so I don't think customers are going to rotate this either.
This logic is circular: we don't have any way for customers to rotate the workspaceapps key (other than DB surgery, and I don't think we document it anywhere), so of course they don't do it.
We have a current customer not only rotating the provisionerd auth key, but asking us if they can have multiple keys so they can rotate them without an outage: https://codercom.slack.com/archives/C014JH42DBJ/p1722966318378149?thread_ts=1722963221.953029&cid=C014JH42DBJ
I don't think we need to implement key rotation in this PR, but this PR needs to be forward-compatible with key rotation, meaning:
- DB config format needs to support multiple keys, even if we only set one for now
- keys need to be identified by a keyID and key algorithm, in the DB and in the JWT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like it may require a relatively significant refactor to properly address this issue for the various JWT usages we have. In order to consider this PR mergeable, can we open a ticket detailing what needs to happen in order to implement the changes described here? Given the size of this PR I don't think we should do it as part of this changeset but we can get started on it once this gets merged.
@deansheather is there any low hanging fruit that makes sense to address now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably do something like storing the key as <uuid>:<key hex>
for now and add the key ID to each JWS. The algo is already stored on the JWT and is hardcoded locally, so it should support anything we do in the future with some migrations to allow for storing multiple.
I think we should probably centralize the key management and signature generation code in a single package for workspaceapps, oauth and resume tokens eventually (maybe even licenses). Would probably be a good refactor/RFC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Store as {<uuid>: {"key_hex": <key>}}
so that we can add additional key metadata (e.g. key algo) without a DB migration. Writing JSON transforms in SQL is a PitA and I'd like to avoid it if we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future I imagine us using a table to store keys, so using json now guarantees we'll be writing JSON transforms in future migrations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it probably does make sense to consolidate signing keys into an actual SQL table. We still need the key ID for migration, so a bit hard to avoid.
We could just hard-code a key ID for now that will represent "whatever key we had prior to adding proper key rotation support" --- put it in the JOSE-header but leave it out of the DB for now. When we migrate we'd move the key (if it exists) to the table with that explicit key ID.
Thanks for catching the test races, I didn't consider that the resume tokens would be identical for the same peer ID with the same timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Inspired by other real time apps, the coordinator RPC API now has a "RefreshResumeToken" RPC that issues a JWT that can be used on reconnect to persist the same client peer ID.
Closes #14051
Relates to #13726