-
Notifications
You must be signed in to change notification settings - Fork 937
feat: add derp mesh health checking in workspace proxies #12222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add derp mesh health checking in workspace proxies #12222
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @deansheather and the rest of your teammates on |
replicaPingSingleflight singleflight.Group | ||
replicaErrMut sync.Mutex | ||
replicaErr string | ||
latestDERPMap atomic.Pointer[tailcfg.DERPMap] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the DERP stuff should be refactored out into a subcomponent of the proxy.
enterprise/wsproxy/wsproxy.go
Outdated
var ( | ||
wg sync.WaitGroup | ||
mu sync.Mutex | ||
failed = []string{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be easier to follow and more idiomatic if you made a chan error
, and then wrote nil if successful and the error if not for each peer.
Then, you read from the channel until you get the expected number of peers and build up your error message. Avoids the mutex and waitgroup as well.
…workspace_proxies
enterprise/wsproxy/wsproxy_test.go
Outdated
}, | ||
}) | ||
|
||
// Register (but don't start) 5 other proxies in the same region. Since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is confusing to me --- it looks like registerBrokenProxy
starts an HTTP server, so in what sense do you mean "Register (but don't start)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The server is spun up but only responds with 500. We need to associate a unique IP:port with each replica, so instead of hoping that the port I picked doesn't have a TCP listener to it we just make a real server but cause all healthchecks to fail which has the same effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Adds missing sibling replica healthcheck code (similar to
replicasync
package in the primary).Reports errors via a callback (for tests), logs, the
healthz-report
endpoint, and registration requests.Related to coder/customers#438