
Retries exhausted when jobs are still queued #4653

@avni-ef

Description


Hi,
I've recently upgraded from 5.20.0 to 6.5.10 so I could use ephemeral runners (needed the recent SSM parameter tier fix).

Recent attempt - ephemeral runners

We use ARM instances in our company, and these tend to have low spot capacity in our region (spread across 6 AZs). We would rather have our workflows wait in the queue until spot capacity allows them to run than fall back to on-demand instances.
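For context, a minimal sketch of the relevant part of our configuration, assuming the philips-labs/terraform-aws-github-runner module inputs as I understand them (enable_ephemeral_runners, instance_target_capacity_type, instance_types); the instance types and the omitted required inputs are illustrative:

```hcl
module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "6.5.10"
  # ... required inputs (github_app, vpc_id, subnet_ids, ...) omitted ...

  # Ephemeral, spot-only ARM runners: no on-demand fallback, so jobs should
  # simply wait in the queue until spot capacity is available again.
  enable_ephemeral_runners      = true
  instance_target_capacity_type = "spot"
  instance_types                = ["m7g.2xlarge", "c7g.2xlarge"] # illustrative Graviton types
}
```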

The problem is that, with the default job_retry configuration, queued workflows are ignored by the scale-up Lambda after a single attempt:
Job retry is disabled or max attempts reached, skipping retry

Our workflows can run for a long time (up to an hour), and I cannot set job_retry.delay_in_seconds to such a high value, since I would exhaust the retry attempts while spot capacity is unavailable.
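To make the arithmetic concrete, a hedged sketch of the job_retry block: delay_in_seconds is the field named above, enable and max_attempts are inferred from the "Job retry is disabled or max attempts reached" log line, and the numbers are illustrative.

```hcl
  # Inside the module "runners" block above. Even generous values cover only
  # ~25 minutes (5 attempts x 300 s), far short of an hour-long spot shortage.
  # SQS message delays max out at 900 s, so if the retry check rides an SQS
  # delay queue (an assumption), delay_in_seconds cannot approach 3600 anyway.
  job_retry = {
    enable           = true # inferred from the log line above
    max_attempts     = 5    # illustrative
    delay_in_seconds = 300  # illustrative value
  }
```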

Previous attempt - non-ephemeral runners

Before changing the configuration to use ephemeral runners, we kept hitting the following errors:

The job was not acquired by Runner of type self-hosted even after multiple attempts
Internal server error. Correlation ID: b764a....

I noticed that the terminated runners were never removed from GitHub's runner list, and I assumed that this was what caused GitHub Actions to terminate our workflows before they actually started.

Thanks!
