Add multi-replica support for high availability #3227

Closed
Tracked by #4748
bpmct opened this issue Jul 26, 2022 · 8 comments · Fixed by #4555
Assignees
Labels
enterprise Enterprise-license / premium functionality
Milestone

Comments

@bpmct
Member

bpmct commented Jul 26, 2022

Add support for running multiple dashboard replicas, and display a warning in the dashboard when multiple replicas are used without a license attached, because this could cause dropped connections.

Read more about our networking and HA support here: https://coder.com/docs/coder-oss/latest/networking

Original description

This is a general issue to discuss how features, documentation, and support for high availability are relevant to Coder's architecture. GitLab currently recommends a highly available architecture for GitLab deployments with more than 3000 users.

To our knowledge, there aren't any Coder (OSS) deployments with 3000+ users, or any users asking for HA support. This issue can also serve as a running list of some fault-tolerant features that can be added to Coder before Coder achieves full HA support, which still may require some defining.

Examples of fault-tolerant features:

  • Kubernetes deployment, multiple replicas - (needs issue)
    • Our Helm chart has the coder.replicas value commented out, so once we support multiple replicas we will need to uncomment it.
  • Multi-region support (AWS, GCP, Azure) for control panel - (needs issue)
  • Multi-region support (AWS, GCP, Azure) for workspaces - (needs issue)
  • Multi-region support (AWS, GCP, Azure) for postgres database - (needs issue)
  • Low-latency web/TURN/DERP connections (satellites) - (needs issue)
  • Support for additional provisioner daemons - (needs issue)
  • Something else? Contact us!

Note: If your Coder deployment has over 3000 users and/or HA is important to you, please leave a comment in this issue or contact us.

@deansheather
Member

Technically, all of these become possible at the same time once we use the database for everything, stop storing state in memory (unless we notify the other replicas on changes), and synchronize migrations.

To support low-latency web/TURN/DERP connections we will most likely need to recreate Coder Classic's "satellite" feature, but this isn't a showstopper for HA support as it will only affect latency in multi-region deployments.
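To make the "notify the other replicas on changes" idea concrete, here is a minimal sketch, not Coder's actual implementation. It assumes a shared store (standing in for the database) and a notification channel (standing in for something like Postgres LISTEN/NOTIFY). The names `Store`, `Pubsub`, and `Replica` are hypothetical and exist only for illustration. Each replica may cache reads in memory, but drops a cached entry as soon as any replica announces a write to that key.

```go
package main

import (
	"fmt"
	"sync"
)

// Pubsub is a hypothetical in-process stand-in for a notification channel
// shared by all replicas (an assumption, not Coder's API).
type Pubsub struct {
	mu   sync.Mutex
	subs []func(key string)
}

// Subscribe registers a callback that is invoked for every published key.
func (p *Pubsub) Subscribe(fn func(key string)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.subs = append(p.subs, fn)
}

// Publish tells every subscriber that the given key has changed.
func (p *Pubsub) Publish(key string) {
	p.mu.Lock()
	subs := append([]func(string){}, p.subs...)
	p.mu.Unlock()
	for _, fn := range subs {
		fn(key)
	}
}

// Store is a hypothetical stand-in for the shared database.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

func (s *Store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// Replica caches reads in memory and invalidates entries on notification.
type Replica struct {
	store *Store
	bus   *Pubsub
	mu    sync.Mutex
	cache map[string]string
}

func NewReplica(store *Store, bus *Pubsub) *Replica {
	r := &Replica{store: store, bus: bus, cache: map[string]string{}}
	bus.Subscribe(r.invalidate)
	return r
}

func (r *Replica) invalidate(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cache, key)
}

// Read returns the cached value, falling back to the shared store on a miss.
func (r *Replica) Read(key string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.cache[key]; ok {
		return v
	}
	v := r.store.Get(key)
	r.cache[key] = v
	return v
}

// Write persists to the shared store first, then notifies all replicas.
func (r *Replica) Write(key, value string) {
	r.store.Set(key, value)
	r.bus.Publish(key)
}

func main() {
	store, bus := NewStore(), &Pubsub{}
	a, b := NewReplica(store, bus), NewReplica(store, bus)

	a.Write("template", "v1")
	fmt.Println(b.Read("template")) // b now caches the value
	a.Write("template", "v2")       // the notification evicts b's cached copy
	fmt.Println(b.Read("template"))
}
```

Without the `Publish` call in `Write`, replica `b` would keep serving its stale cached value, which is the failure mode that in-memory state causes once more than one replica is running.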

@catvec

catvec commented Jul 28, 2022

This has some relevance to my org because AWS us-east-2 went down today for a few hours and we couldn't develop. It's not a huge priority, though, because our thought process on multi-region is "if something goes so wrong that it took out an AWS region, other stuff we use is probably also down, and hey, we could all use a break every once in a while."

@denbeigh2000
Contributor

HA support would be important for me and my org if I were to start using v2 at $WORK, primarily for redundancy and to minimise latency for geographically-distributed teams.

@ammario added the enterprise label Aug 10, 2022
@misskniss added the needs decision label Aug 15, 2022
@ammario added this to the EE milestone Aug 22, 2022
@kotx

kotx commented Aug 22, 2022

HA would be important for me if it allowed low-latency multi-region deployments, or lower latency for workspaces.

@kylecarbs changed the title from "High availability (HA) support" to "High availability support" Aug 24, 2022
@kylecarbs
Member

@bpmct I don't think this is actionable until we decompose the ideas a bit more. Do we want multiple geo-distributed replicas? Or low latency globally? I think they are separate issues.

@mattlqx

mattlqx commented Sep 7, 2022

My initial impression was that HA was achievable by using an external database and then balancing between multiple Coder servers (or geo-selecting which server each request is directed to). For our case, we would want a Coder server in each of our 3 regions (US East, West, and Europe) to reduce latency. I would also like to minimize disruption when we're deploying a new version of Coder.

How would upgrading one Coder server affect operation of the others? Similarly, can I run two Coder servers in each region and "seamlessly" upgrade by removing them from traffic, upgrading, and returning them to service? The service startup time seems extremely fast, so maybe that's not an immediate need, but will it always be? Finally, is there a way to gracefully shut down the server to ensure that any current Terraform runs are not disrupted?

@deansheather
Member

@kylecarbs @bpmct Yeah I agree, we should split this issue into "multiple replica support in same region" and "geo-distributed low latency access points" (i.e. satellites from v1) and make it clear that each issue is separate from the other.

@mattlqx Multiple replicas are not currently supported; we only support running one coder server instance at a time connected to the database. Upgrades must be performed by turning off the old instance and then starting up the new instance. Once we have multiple replica support, you should be able to upgrade without downtime by using the method you suggested.

Graceful shutdowns should work by sending a SIGINT with ctrl+c AFAIK.

@kylecarbs changed the title from "High availability support" to "Add multi-replica support for high availability" Sep 9, 2022
@bpmct removed the needs decision label Sep 15, 2022
@bpmct
Member Author

bpmct commented Sep 15, 2022

We've added documentation about how our networking works here, including how to support geo-distributed SSH: https://coder.com/docs/coder-oss/latest/networking

We still need to add support for more replicas, which will be an enterprise feature and tracked via this issue.

10 participants