Coder leaving workspace resources around on failure #5815

Closed
dcarrion87 opened this issue Jan 21, 2023 · 21 comments

@dcarrion87
Contributor

dcarrion87 commented Jan 21, 2023

We're having an issue with Coder leaving resources around when creation fails. Users are naturally lazy and won't try and fix this themselves.

E.g. this scenario:

kubernetes_pod.main[0]: Still creating... [4m50s elapsed]
kubernetes_pod.main[0]: Still creating... [5m0s elapsed]
kubernetes_pod.main[0]: Creation errored after 5m0s

How do we get coder to run a stop automatically when this happens so pods aren't sitting around in pending state forever plus all the rest of the supporting resources?

I'd rather not have to write our own cleanup scripts.

@ElliotG
Contributor

ElliotG commented Jan 22, 2023

I think there are circumstances where the team maintaining Coder might want to look at the pod later for forensics. What are your thoughts on the following:

A: Coder provides a script or CLI that identifies and deletes pods like this, but it's up to you to choose to run it
B: Coder deletes orphaned pods that are more than X hours or days old

Or maybe you have another idea that would work well?

@dcarrion87
Contributor Author

dcarrion87 commented Jan 22, 2023

Hi @ElliotG we would want the option of both terraform delete with stop count and without stop count. There are other resources that come up as part of the workspace start that need to be removed as well.

Something inbuilt into coder would be ideal. If that's a coder script I suppose that's fine too. We could script this now but we want to minimise external services maintaining coder over time and let coder do this itself. At the moment it's an admin doing coder workspace stops across the coder instances we maintain for people that have just left their workspace in failed state indefinitely.

For forensics, we are building out logging and other things to catch why something happened.

Happy for it to be on an optional delay as well.

dcarrion87 changed the title from "Coder leaving pods around" to "Coder leaving workspace resources around on failure" on Jan 22, 2023
@dcarrion87
Contributor Author

dcarrion87 commented Mar 8, 2023

Cleaning up lots this week after having to deprecate a template by putting in a forced failure, because there's no deprecation feature in Coder and the workspaces just fail out. I really do wish there was auto cleanup on failed workspaces.

@Kira-Pilot
Member

Kira-Pilot commented Mar 27, 2023

Related to #6535, #5598, #3380, #4504

@dcarrion87
Contributor Author

dcarrion87 commented Mar 28, 2023

I think a stop on failure option would be the way to go. That allows a user to review the run logs and see that it stopped because of a failure. Or on failure allow an option to trigger a stop and leave in failed state.

I'm doing so much bottom 💩 cleaning right now over not having this available and really don't want to do customs for something like this.

@dcarrion87
Contributor Author

dcarrion87 commented Apr 12, 2023

@Kira-Pilot @bpmct @kylecarbs do you know if this is likely to be worked on in the next 2-4 weeks?

If not, we think it's affecting us enough operationally that we're willing to try implementing this ourselves directly in the code base, e.g. a template or admin option to trigger a stop (apply with counts) on failure.

Following a planning meeting we think it makes more sense to do the implementation straight in coder vs write something to watch or hook coder externally.

We understand that catching the available resources before allowing users to fire is ideal but that's not trivial and there's still a chance that it will fail.

If there's something already in flight that is going to solve this or you think it's not likely to be approved due to current product roadmap then we'll hold off.

Let me know your thoughts.

@bpmct
Member

bpmct commented Apr 13, 2023

We have something in-flight to stop resources in a failed state, but it probably will not be shipped within 2-4 weeks. This is on our roadmap though and we will have a built-in solution for this. The feature will optionally clean up/delete inactive workspaces as well. Time-frame is TBD but would probably be ~2 months until this is shipped.

> Following a planning meeting we think it makes more sense to do the implementation straight in coder vs write something to watch or hook coder externally.

I'm curious why this is? Is this for better user experience (e.g. users can "recover" from a failed state before churning) or does it just "feel" more appropriate and saves you from spinning up and maintaining extra infra?

I know this sounds silly, but if y'all were to temporarily set up an hourly CRON that queries all workspaces and stops any failed builds, what functionality are you missing? As I said, we plan on creating a native feature for this in Coder that is "smarter" than this but wanted to get a better sense of how a native feature could help you, so that we can make sure we're building the right thing :)
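The hourly sweep suggested above could look roughly like the following. This is only a sketch under assumptions, not an official Coder tool: it assumes `coder list --all --output json` prints a JSON array of workspaces, that each entry carries `owner_name`, `name`, and a `latest_build` object with `status` and `transition` fields, and that `coder stop <owner>/<name> --yes` skips the confirmation prompt. Check all of these against the Coder version you run. The function names are made up for illustration.

```python
import json
import subprocess


def failed_start_workspaces(workspaces):
    """Return "owner/name" for every workspace whose latest build is a failed start.

    The field names (latest_build, status, transition, owner_name, name) are
    assumptions about the JSON emitted by `coder list`; verify them first.
    """
    targets = []
    for ws in workspaces:
        build = ws.get("latest_build") or {}
        if build.get("status") == "failed" and build.get("transition") == "start":
            targets.append(f'{ws["owner_name"]}/{ws["name"]}')
    return targets


def list_workspaces():
    # Assumption: this command prints a JSON array describing all workspaces.
    result = subprocess.run(
        ["coder", "list", "--all", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def stop_workspace(identifier):
    # Assumption: --yes bypasses the interactive confirmation prompt.
    subprocess.run(["coder", "stop", identifier, "--yes"], check=True)


def sweep(dry_run=True):
    """Stop every failed-start workspace; with dry_run=True only report them."""
    targets = failed_start_workspaces(list_workspaces())
    for identifier in targets:
        if dry_run:
            print(f"would stop {identifier}")
        else:
            stop_workspace(identifier)
    return targets
```

Calling `sweep(dry_run=False)` from an hourly cron job, authenticated as an admin, would approximate the behaviour described. The filter is kept separate from the CLI calls so it can be checked against a saved JSON dump before anything is stopped.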

@dcarrion87
Contributor Author

dcarrion87 commented Apr 13, 2023

> I'm curious why this is? Is this for better user experience (e.g. users can "recover" from a failed state before churning) or does it just "feel" more appropriate and saves you from spinning up and maintaining extra infra?

A bit of both. It doesn't make sense to us why coder would leave the workspace in a broken state on failed starts. If it failed starting, roll back half baked infra in the same action and let the user know that's what happened. Sweeping through failed starts every hour just doesn't feel like the right way to handle it to us.

If there's one thing I've learnt over time, it's that nothing comes with little maintenance and overhead. Something you think might be a quick job turns into more than that: tweaking, maintaining and then eventual unravelling. We're already doing similar for other things and have regretted it so far, having to remember that those "gap handler" apps are there and need to be managed. We're just trying to minimise those. But we may just bite the bullet on this one. Or try and think of another way to handle the infra provisioning to support a kind of rollback on failure.

Understanding the timing and knowing something is in the works does help us in deciding what to do. Thanks for that.
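The "roll back half baked infra in the same action" idea from this comment can be illustrated with a small generic sketch. It is not how Coder implements builds: `start` and `stop` are hypothetical callables standing in for the start and stop transitions, and `StartResult` is an invented type for reporting what happened to the user.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StartResult:
    started: bool
    rolled_back: bool
    error: Optional[str] = None


def start_with_rollback(start: Callable[[], None], stop: Callable[[], None]) -> StartResult:
    """Run start(); if it raises, immediately run stop() to tear down partial resources."""
    try:
        start()
    except Exception as start_exc:
        try:
            stop()
        except Exception as stop_exc:
            # Rollback itself failed: resources may still be lying around.
            return StartResult(False, False, f"{start_exc}; rollback also failed: {stop_exc}")
        # Start failed but cleanup succeeded; keep the error so the user can see why.
        return StartResult(False, True, str(start_exc))
    return StartResult(True, False)
```

The original error is kept in the result so the user can still see why the start failed, which is the behaviour asked for earlier in the thread (stop on failure, but leave the failure visible).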

@bpmct
Member

bpmct commented May 8, 2023

@dcarrion87 here's a sneak peek at the options we're adding. You can also set a fraction of a day via decimals. Open to feedback though :)

image
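As a quick illustration of "a fraction of a day via decimals", the arithmetic works out as below. This only shows the conversion; how Coder itself parses or rounds the field is not stated in the thread, and `days_to_duration` is a made-up helper name.

```python
from datetime import timedelta


def days_to_duration(days: float) -> timedelta:
    """Convert a decimal number of days into a duration."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return timedelta(days=days)
```

For example, `days_to_duration(0.5)` is 12 hours and `days_to_duration(0.25)` is 6 hours.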

@dcarrion87
Contributor Author

This looks great thank you 🙇

matifali added and then removed the bug and s3 labels May 10, 2023
bpmct added the feature label May 10, 2023
@dcarrion87
Contributor Author

@bpmct curious, is that forming part of 0.24.X or later? Just for planning our scheduled upgrades.

@dcarrion87
Contributor Author

@bpmct just wanted to check if this was scheduled for 0.24.x?

@dcarrion87
Contributor Author

Hi @Kira-Pilot @bpmct @kylecarbs

Just wanting to get an update from anyone regarding when this is likely to drop: #5815 (comment)

@sreya
Collaborator

sreya commented Jun 14, 2023

Hey @dcarrion87 sorry for failing to respond earlier...we're aiming to have it released in approximately 3 weeks.

@dcarrion87
Contributor Author

Thanks @sreya. Just to double check, is this feature going to be gated by an Enterprise licence or will it be available to everyone?

@Kira-Pilot
Member

Hi @dcarrion87! This feature will be available under the enterprise -> advanced template scheduling license.

@dcarrion87
Contributor Author

dcarrion87 commented Jun 20, 2023

> Hi @dcarrion87! This feature will be available under the enterprise -> advanced template scheduling license.

@Kira-Pilot @bpmct I really wish you made that clear earlier on. I was under the impression it was coming in the standard version given you already do auto stop in it. I will need to go back to the business to determine if it's justified to go to Enterprise for this. I'm not sure why cleaning up failures constitutes an Enterprise licence.

@bpmct
Member

bpmct commented Jun 20, 2023

@dcarrion87 We emailed you roughly 30 minutes ago with an enterprise license that is entitled to use this feature, since you've been an early user of Coder v2 and gave us detailed feedback on this feature. I'll aim to do a better job communicating the feature levels of upcoming work on GitHub. Regarding why this is an enterprise feature, we currently follow a consistent philosophy across all our features:

  • Open Source: Any features that enable scale (airgap support, OIDC, VS Code Extension, templates, user roles) in an organization. It should be entirely possible for a 50-200 user deployment to use Coder assuming they have relaxed controls.
  • Enterprise Edition: Any features that govern/manage scale (group sync, template permissions, browser-only, bulk actions, scheduling permissions, automated cleanup, etc).

Our enterprise customers typically deploy Coder for use across multiple teams, have hundreds of workspaces, and naturally need ways to make it easier to operate/manage/clean up workspaces and users when they are not in use. For us, making this a feature exclusively for our enterprise customers was the right fit.

Again, this doesn't justify the lack of communication around our OSS philosophy and what to expect for upcoming features. Sorry about that!

@matifali
Member

matifali commented Sep 3, 2023

@bpmct I guess this is now available under the workspace_actions experiment name.

@bpmct
Member

bpmct commented Sep 5, 2023

Yep! You can try this now with an enterprise license as long as you enable the experiment :) We'll close the issue once it's alpha!

@sreya
Collaborator

sreya commented Jan 17, 2024

Closing this since workspace actions was recently promoted to GA
