FR: Immediate Workspace Timeout When GPU Resources Are Unavailable #17105
bjornrobertsson started this conversation in Feature Requests
Replies: 2 comments
-
I don't think it's a good idea to handle AWS specifically. We should fall back on a more general, Terraform-centric solution. Here is an example showing timeouts configuration for an AWS resource.
-
The particular pain point here is not the Terraform create/destroy operations, which do have working timeouts.
-
Problem
When GPU resources (specifically AWS g5.48xlarge instances) are unavailable from the cloud provider, Coder workspaces keep trying to provision them for up to 60 minutes before timing out. This is a poor user experience: users wait unnecessarily for resources that cannot be allocated. They need an immediate notification when resources are unavailable rather than waiting out a long timeout.
Proposed Solution
Add configuration options to customize timeout behavior when resources are unavailable:
Configurable resource availability timeout - Allow administrators to set custom timeout periods specifically for resource availability issues (separate from other timeout categories)
Improved error messaging in UI - Display specific resource unavailability errors to users, including:
Resource type-specific timeouts - Allow different timeout settings for different resource types:
AWS-specific detection mechanism - Implement integration with AWS API to detect true resource unavailability vs. temporary scheduling issues
Implementation Details
Based on examination of the codebase, the following components would need modification:
In provisioner/terraform/terraform.go, modify the resource creation logic to detect specific AWS resource unavailability errors and fail faster
Update the workspace state model in coderd/workspaces.go to include more granular status information about resource allocation failures
Add new configuration fields to coderd/parameter.go to support customizable timeout settings for different resource types
Enhance the UI components to display more detailed error information when resources are unavailable
Expected Outcome
Related Documentation
These changes would require updates to:
Potential Implementation Challenges