Add multi-replica support for high availability #3227
Comments
All of these technically happen at the same time as soon as we make sure that we use the database for everything and don't store state in memory (at least unless we send notifications to other replicas on changes) and synchronize migrations. To support low-latency web/TURN/DERP connections we will most likely need to recreate Coder Classic's "satellite" feature, but this isn't a showstopper for HA support as it will only affect latency in multi-region deployments. |
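To make the "send notifications to other replicas on changes" idea concrete, here is a minimal in-process sketch in Go. It is not Coder's implementation: `Store` stands in for the shared database, `Bus` stands in for whatever cross-replica notification channel would be used, and all type and method names are hypothetical. Each replica keeps a small in-memory cache and drops an entry whenever any replica announces a write to that key.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the shared database.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func (s *Store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func (s *Store) Set(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
}

// Bus is a hypothetical stand-in for a cross-replica notification channel.
type Bus struct {
	mu   sync.Mutex
	subs []func(key string)
}

func (b *Bus) Subscribe(fn func(key string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, fn)
}

// Publish tells every subscriber that key has changed.
func (b *Bus) Publish(key string) {
	b.mu.Lock()
	subs := append([]func(string){}, b.subs...)
	b.mu.Unlock()
	for _, fn := range subs {
		fn(key)
	}
}

// Replica caches reads in memory and invalidates on notifications.
type Replica struct {
	mu    sync.Mutex
	cache map[string]string
	store *Store
	bus   *Bus
}

func NewReplica(store *Store, bus *Bus) *Replica {
	r := &Replica{cache: map[string]string{}, store: store, bus: bus}
	bus.Subscribe(r.invalidate)
	return r
}

func (r *Replica) invalidate(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cache, key)
}

// Read serves from the cache, falling back to the shared store.
func (r *Replica) Read(key string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.cache[key]; ok {
		return v
	}
	v := r.store.Get(key)
	r.cache[key] = v
	return v
}

// Write persists to the shared store, then notifies all replicas.
func (r *Replica) Write(key, val string) {
	r.store.Set(key, val)
	r.bus.Publish(key)
}

func main() {
	store := &Store{data: map[string]string{}}
	bus := &Bus{}
	a, b := NewReplica(store, bus), NewReplica(store, bus)

	a.Write("motd", "v1")
	fmt.Println(b.Read("motd")) // prints "v1" and caches it on replica b
	a.Write("motd", "v2")
	fmt.Println(b.Read("motd")) // prints "v2": b's cached entry was invalidated
}
```

Without the `Publish` call in `Write`, replica `b` would keep serving the stale cached value, which is the failure mode that in-memory state causes once more than one replica is running.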
This has some relevance to my org because AWS us-east-2 went down today for a few hours and we couldn't develop. It's not a huge priority, though, because our thought process on multi-region is "if something goes so wrong that it took out an AWS region, other stuff we use is probably also down, and hey, we could all use a break every once in a while." |
HA support would be important for me and my org if I were to start using v2 at |
HA would be important for me if it allowed low-latency multi-region deployments, or lower latency for workspaces. |
@bpmct I don't think this is actionable until we decompose the ideas a bit more. Do we want multiple geo-distributed replicas? Or low latency globally? I think they are separate issues. |
My initial impression was that HA was achievable by using an external database and then balancing between multiple Coder servers (or geo-selecting which server requests are directed to). For our case, we would want a Coder server in each of our 3 regions (US East, West, and Europe) to reduce latency. I would also like to minimize disruption when we're deploying a new version of Coder. How would upgrading one Coder server affect operation of the others? Similarly, can I run two Coder servers in each region and "seamlessly" upgrade by removing them from traffic, upgrading, and returning them to service? The service startup time seems extremely fast, so maybe that's not an immediate need, but will it always be? Finally, is there a way to gracefully shut down the server to ensure that any current Terraform runs are not disrupted? |
@kylecarbs @bpmct Yeah, I agree. We should split this issue into "multiple replica support in same region" and "geo-distributed low latency access points" (i.e. satellites from v1) and make it clear that each issue is separate from the other. @mattlqx Multiple replicas are not currently supported; we only support running one coder server instance at a time connected to the database. Upgrades must be performed by turning off the old instance and then starting up the new instance. Once we have multiple replica support, you should be able to upgrade without downtime by using the method you suggested. Graceful shutdowns should work by sending a SIGINT with ctrl+c, AFAIK. |
We've added documentation about how our networking works here, including how to support geo-distributed SSH: https://coder.com/docs/coder-oss/latest/networking We still need to add support for more replicas, which will be an enterprise feature and tracked via this issue. |
Add support for running multiple dashboard replicas, and display a warning in the dashboard when multiple replicas are used without a license attached, as this could lead to dropped connections.
Read more about our networking and HA support here: https://coder.com/docs/coder-oss/latest/networking
Original description
This is a general issue to discuss how features, documentation, and support for high availability are relevant to Coder's architecture. GitLab currently recommends that GitLab deployments with over 3000 users have a highly available architecture.
To our knowledge, there aren't any Coder (OSS) deployments with 3000+ users, or any users asking for HA support. This issue can also serve as a running list of fault-tolerant features that can be added to Coder before it achieves full HA support, the scope of which may still need to be defined.
Examples of fault-tolerant features:
The coder.replicas value is commented out, so once we support multiple replicas we will need to uncomment it.