Coder leaving workspace resources around on failure #5815
I think there are circumstances where the team maintaining Coder might want to look at the pod later for forensics. What are your thoughts on the following: Coder provides a script or CLI that identifies and deletes pods like this, but it's up to you to choose to run it. Or maybe you have another idea that would work well?
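For illustration only, here is a minimal sketch of what such an "identify and delete" script might look like for a Kubernetes-backed template. The label selector is an assumption, not anything Coder guarantees — substitute whatever labels your template actually sets on workspace pods.

```python
#!/usr/bin/env python3
"""Sketch: find workspace pods stuck in Pending and optionally delete them.

Assumption: workspace pods carry a label we can select on. The selector
below is hypothetical -- replace it with the labels your template sets.
"""
import json
import subprocess
import sys

SELECTOR = "app.kubernetes.io/managed-by=coder"  # hypothetical label
DELETE = "--delete" in sys.argv[1:]

pods = json.loads(subprocess.run(
    ["kubectl", "get", "pods", "--all-namespaces",
     "--selector", SELECTOR, "--output", "json"],
    check=True, capture_output=True, text=True,
).stdout)

for pod in pods["items"]:
    if pod["status"].get("phase") != "Pending":
        continue  # only target pods that never got scheduled/started
    ns = pod["metadata"]["namespace"]
    name = pod["metadata"]["name"]
    print(f"stuck pod: {ns}/{name}")
    if DELETE:
        subprocess.run(
            ["kubectl", "delete", "pod", "--namespace", ns, name],
            check=True,
        )
```

Run without arguments it only reports, which keeps the pods around for forensics; pass `--delete` when you actually want them gone.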
Hi @ElliotG, we would want the option of both a terraform delete with stop counts and without. There are other resources that come up as part of the workspace start that need to be removed as well. Something built into Coder would be ideal; if that's a Coder-provided script, I suppose that's fine too. We could script this now, but we want to minimise the external services maintaining Coder over time and let Coder do this itself. At the moment it's an admin doing coder workspace stops across the Coder instances we maintain, for people that have just left their workspace in a failed state indefinitely. For forensics, we are building out logging and other things to catch why something happened. Happy for it to be on an optional delay as well.
Cleaning up a lot this week after having to deprecate a template by putting in a forced failure, because there's no deprecation feature in Coder and workspaces just fail out. I really do wish there was auto cleanup on failed workspaces.
I think a stop-on-failure option would be the way to go. That allows a user to review the run logs and see that it stopped because of a failure. Or, on failure, allow an option to trigger a stop and leave the workspace in a failed state. I'm doing so much bottom 💩 cleaning right now over not having this available, and I really don't want to build something custom for a problem like this.
@Kira-Pilot @bpmct @kylecarbs do you know if this is likely to be worked on in the next 2-4 weeks? If not, we think it's affecting us enough operationally that we're willing to try to implement this ourselves directly in the code base, e.g. a template or admin option to trigger a stop (apply with counts) on failure. Following a planning meeting, we think it makes more sense to do the implementation straight in Coder than to write something that watches or hooks Coder externally. We understand that checking available resources before letting users fire off a build is the ideal fix, but that's not trivial and there's still a chance a build will fail. If there's something already in flight that is going to solve this, or you think it's not likely to be approved given the current product roadmap, then we'll hold off. Let me know your thoughts.
We have something in flight to stop resources in a failed state, but it probably will not ship within 2-4 weeks. This is on our roadmap, though, and we will have a built-in solution for it. The feature will optionally clean up/delete inactive workspaces as well. The timeframe is TBD, but it would probably be ~2 months until this ships.
I'm curious why that is. Is it for a better user experience (e.g. users can "recover" from a failed state before churning), or does it just "feel" more appropriate and save you from spinning up and maintaining extra infra? I know this sounds silly, but if y'all were to temporarily set up an hourly cron job that queries all workspaces and stops any failed builds, what functionality would you be missing? As I said, we plan on creating a native feature for this in Coder that is "smarter" than that, but I wanted to get a better sense of how a native feature could help you, so that we can make sure we're building the right thing :)
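To make the hourly-cron idea concrete, here is a minimal sketch of such a job using the coder CLI. The JSON field names (`latest_build.status`, `owner_name`) are assumptions about the shape of `coder list --output json` and should be verified against your Coder version before relying on this.

```python
#!/usr/bin/env python3
"""Sketch of the hourly-cron workaround: stop every workspace whose latest
build failed. Field names are assumptions -- check them against the actual
output of `coder list --output json` on your deployment.
"""
import json
import subprocess

workspaces = json.loads(subprocess.run(
    ["coder", "list", "--all", "--output", "json"],
    check=True, capture_output=True, text=True,
).stdout)

for ws in workspaces:
    if ws.get("latest_build", {}).get("status") != "failed":
        continue
    target = f'{ws["owner_name"]}/{ws["name"]}'
    print(f"stopping failed workspace: {target}")
    # --yes skips the interactive confirmation prompt
    subprocess.run(["coder", "stop", "--yes", target], check=True)
```

Scheduled via something like `0 * * * * /usr/local/bin/stop-failed-workspaces.py`, this approximates the behaviour being asked for until a native option ships.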
A bit of both. It doesn't make sense to us why Coder would leave the workspace in a broken state on a failed start. If it failed starting, roll back the half-baked infra in the same action and let the user know that's what happened. Sweeping through failed starts every hour just doesn't feel like the right way to handle it to us. If there's one thing I've learnt over time, it's that nothing comes with little maintenance overhead. Something you think might be a quick job turns into more than that: tweaking, maintaining, and eventually unravelling it. We're already doing similar things elsewhere and have regretted it so far, remembering that those "gap handler" apps are there and need to be managed. We're just trying to minimise those. But we may just bite the bullet on this one, or try to think of another way to handle the infra provisioning to support a kind of rollback on failure. Understanding the timing and knowing something is in the works does help us decide what to do. Thanks for that.
@dcarrion87 here's a sneak peek at the options we're adding. You can also set a fraction of a day via decimals. Open to feedback though :)
This looks great, thank you 🙇
@bpmct curious whether that's forming part of 0.24.x or later? Just for planning our scheduled upgrades.
@bpmct just wanted to check if this was scheduled for 0.24.x?
Hi @Kira-Pilot @bpmct @kylecarbs, just wanting to get an update from anyone on when this is likely to drop: #5815 (comment)
Hey @dcarrion87, sorry for failing to respond earlier... we're aiming to have it released in approximately 3 weeks.
Thanks @sreya. Just to double-check: is this feature going to be gated by an Enterprise licence, or will it be available to everyone?
Hi @dcarrion87! This feature will be available under the enterprise -> advanced template scheduling license.
@Kira-Pilot @bpmct I really wish you had made that clear earlier on. I was under the impression it was coming in the standard version, given you already do auto-stop in it. I will need to go back to the business to determine whether it's justified to go to Enterprise for this. I'm not sure why cleaning up failures warrants an Enterprise licence.
@dcarrion87 We emailed you roughly 30 minutes ago with an enterprise license that is entitled to use this feature, since you've been an early user of Coder v2 and gave us detailed feedback on this feature. I'll aim to do a better job communicating the feature levels of upcoming work on GitHub. Regarding why this is an enterprise feature: we currently follow a consistent philosophy across all our features.
Our enterprise customers typically deploy Coder for use across multiple teams, have hundreds of workspaces, and naturally need ways to make it easier to operate/manage/clean up workspaces and users when they are not in use. For us, making this a feature exclusively for our enterprise customers was the right fit. Again, this doesn't justify the lack of communication around our OSS philosophy and what to expect for upcoming features. Sorry about that!
@bpmct I guess this is now available under the workspace actions experiment?
Yep! You can try this now with an enterprise license as long as you enable the experiment :) We'll close the issue once it's alpha!
Closing this since workspace actions was recently promoted to GA |
We're having an issue with Coder leaving resources around when workspace creation fails. Users are naturally lazy and won't try to fix this themselves.
E.g. this scenario:
How do we get Coder to run a stop automatically when this happens, so pods (plus all the rest of the supporting resources) aren't sitting around in a Pending state forever?
I'd rather not have to write our own cleanup scripts.