Hi,
I've recently upgraded from 5.20.0 to 6.5.10 so I could use ephemeral runners (I needed the recent SSM parameter tier fix).
Recent attempt - ephemeral runners
We use ARM instances in our company, and these tend to have low spot capacity in our region (spread across 6 AZs). We would rather have our workflows wait in the queue until spot capacity allows them to run than fall back to on-demand instances.
The problem is that, when using the default job_retry configuration, I see the queued workflows being ignored by the scale-up Lambda after a single attempt:
Job retry is disabled or max attempts reached, skipping retry
Our workflows can run for a long time (up to an hour). I cannot set job_retry.delay_in_seconds to such a high value, since I would lose the retry attempts while spot capacity is unavailable.
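To illustrate the trade-off, here is a rough sketch in Python. It assumes, hypothetically, that each retry waits the base delay multiplied by a backoff factor raised to the attempt index. I have not confirmed that this is how the module schedules retries, and the function name and the example numbers are made up for illustration.

```python
def retry_window_seconds(delay_in_seconds: int, max_attempts: int, backoff: float = 1.0) -> float:
    """Total time covered by all retries under an ASSUMED geometric backoff.

    This is not taken from the module's source; it only models the idea that
    the queue is re-checked max_attempts times with growing delays.
    """
    return sum(delay_in_seconds * backoff ** attempt for attempt in range(max_attempts))


# Hypothetical numbers, not the module defaults.
print(retry_window_seconds(300, 1))       # a single attempt with a 300 s delay
print(retry_window_seconds(300, 5, 2.0))  # five attempts, delay doubling each time
```

Under this assumption, a single attempt covers only one delay period, so the only way to cover an hour-long wait is either a very large delay or more attempts.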
Previous attempt - non ephemeral runners
Before changing the configuration to use ephemeral runners, we kept hitting the following errors:
The job was not acquired by Runner of type self-hosted even after multiple attempts
--
Internal server error. Correlation ID: b764a....
I noticed that the terminated runners were never removed from the list of GitHub runners, and I assumed that this is what was causing GitHub Actions to terminate our workflows before they actually started.
Thanks!