
Retries exhausted when jobs are still queued #4653

@avni-ef

Description


Hi,
I've recently upgraded from 5.20.0 to 6.5.10 so I could use ephemeral runners (needed the recent SSM parameter tier fix).

Recent attempt - ephemeral runners

We use ARM instances in our company, and these tend to have low spot capacity in our region (spread across 6 AZs). We would rather have our workflows wait in the queue until spot capacity allows them to run than fall back to on-demand instances.
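For context, a minimal sketch of the relevant part of our configuration, assuming the philips-labs/terraform-aws-github-runner module inputs as I understand them (enable_ephemeral_runners, instance_target_capacity_type, instance_types); the instance types and the omitted required inputs are illustrative:

```hcl
module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "6.5.10"
  # ... required inputs (github_app, vpc_id, subnet_ids, ...) omitted ...

  # Ephemeral, spot-only ARM runners: no on-demand fallback, so jobs should
  # simply wait in the queue until spot capacity is available again.
  enable_ephemeral_runners      = true
  instance_target_capacity_type = "spot"
  instance_types                = ["m7g.2xlarge", "c7g.2xlarge"] # illustrative Graviton types
}
```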

The problem is that, with the default job_retry configuration, queued workflows are ignored by the scale-up Lambda after a single attempt:
Job retry is disabled or max attempts reached, skipping retry

Our workflows can run for a long time (up to an hour), and I cannot set job_retry.delay_in_seconds to such a high value, since I would exhaust the retry attempts while spot capacity is unavailable.
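To make the arithmetic concrete, a hedged sketch of the job_retry block: delay_in_seconds is the field named above, enable and max_attempts are inferred from the "Job retry is disabled or max attempts reached" log line, and the numbers are illustrative.

```hcl
  # Inside the module "runners" block above. Even generous values cover only
  # ~25 minutes (5 attempts x 300 s), far short of an hour-long spot shortage.
  # SQS message delays max out at 900 s, so if the retry check rides an SQS
  # delay queue (an assumption), delay_in_seconds cannot approach 3600 anyway.
  job_retry = {
    enable           = true # inferred from the log line above
    max_attempts     = 5    # illustrative
    delay_in_seconds = 300  # illustrative value
  }
```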

Previous attempt - non-ephemeral runners

Before changing the configuration to use ephemeral runners, we kept hitting the following errors:

The job was not acquired by Runner of type self-hosted even after multiple attempts
Internal server error. Correlation ID: b764a....

I noticed that the terminated runners were never removed from GitHub's runner list, and I assumed that this was what caused GitHub Actions to terminate our workflows before they actually started.

Thanks!
