Conversation

kaovilai (Member) commented Sep 3, 2025

Description

This PR implements exponential backoff for the CloudStorage controller using Kubernetes-native libraries. The controller leverages controller-runtime's built-in workqueue backoff mechanism for automatic retries with increasing delays when bucket operations fail.

Key Changes

Exponential Backoff Implementation

  • Leverages controller-runtime's workqueue: Returns errors on failures to trigger automatic exponential backoff (5ms initial → 1000s maximum); see the sketch after this list
  • Per-item backoff: Each CloudStorage CR gets independent retry timing through the workqueue
  • Self-healing behavior: Continues retrying with exponentially increasing delays instead of giving up
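
A minimal sketch of that pattern, assuming a trimmed-down reconciler (the `ensureBucket` helper and struct fields here are illustrative stand-ins, not the controller's actual code):

```go
// Simplified sketch: returning an error from Reconcile is what drives the
// workqueue's per-item exponential backoff.
package controllers

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type CloudStorageReconciler struct {
	client.Client
}

func (r *CloudStorageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ok, err := r.ensureBucket(ctx, req) // hypothetical helper wrapping the bucket client
	if err != nil {
		// A non-nil error requeues this CloudStorage with the workqueue's default
		// exponential backoff (5ms initial, capped at 1000s), independently per item.
		return ctrl.Result{}, err
	}
	if !ok {
		// No fixed RequeueAfter: returning an error keeps retries on the backoff curve.
		return ctrl.Result{}, fmt.Errorf("bucket not ready yet")
	}
	return ctrl.Result{}, nil
}

// ensureBucket stands in for the real exists/create logic.
func (r *CloudStorageReconciler) ensureBucket(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}
```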

Status Conditions for Observability

  • Added Conditions field to CloudStorageStatus for standard Kubernetes status reporting
  • Implemented BucketReady condition with multiple reason constants (see the sketch after this list):
    • ReasonBucketCreated: Bucket successfully created
    • ReasonBucketReady: Bucket exists and is ready
    • ReasonBucketCreationFailed: Bucket creation failed (triggers backoff)
    • ReasonBucketCheckError: Error checking bucket status (triggers backoff)
    • ReasonSTSSecretError: STS secret operation failed (triggers backoff)
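
A minimal sketch of how such a condition might be recorded with `meta.SetStatusCondition`; the constant values and helper below are assumptions, only the condition and reason names come from this PR:

```go
// Sketch of setting the BucketReady condition on a failed bucket creation.
// Constant values and the helper signature are assumptions for illustration.
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	ConditionBucketReady       = "BucketReady"
	ReasonBucketCreationFailed = "BucketCreationFailed"
)

// markBucketCreationFailed records the failure on the CR's status; the caller
// then returns the error so the workqueue applies exponential backoff.
func markBucketCreationFailed(conditions *[]metav1.Condition, err error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    ConditionBucketReady,
		Status:  metav1.ConditionFalse,
		Reason:  ReasonBucketCreationFailed,
		Message: err.Error(),
	})
}
```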

Improved Testing

  • Created mockBucketClient for dependency injection in tests
  • Added BucketClientFactory to CloudStorageReconciler for test mocking (sketched after this list)
  • Comprehensive test coverage for exponential backoff behavior
  • Fixed fake client configuration with status subresource support
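
A rough sketch of the test plumbing described in this list; the interface shape, factory signature, and import paths are assumptions rather than the operator's exact code:

```go
// Hypothetical, trimmed-down sketch: a mock bucket client behind a factory,
// plus a fake client with the CloudStorage status subresource enabled.
package controllers

import (
	"errors"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	oadpv1alpha1 "github.com/openshift/oadp-operator/api/v1alpha1"
)

// bucketClient stands in for the operator's bucket client interface.
type bucketClient interface {
	Exists() (bool, error)
	Create() (bool, error)
}

// BucketClientFactory lets tests inject a mock instead of a real cloud client.
type BucketClientFactory func() (bucketClient, error)

// mockBucketClient returns canned results so tests can force the error paths
// that should trigger exponential backoff.
type mockBucketClient struct {
	exists    bool
	existsErr error
	createErr error
}

func (m *mockBucketClient) Exists() (bool, error) { return m.exists, m.existsErr }
func (m *mockBucketClient) Create() (bool, error) {
	if m.createErr != nil {
		return false, m.createErr
	}
	return true, nil
}

// failingFactory wires a creation failure into the reconciler under test.
var failingFactory BucketClientFactory = func() (bucketClient, error) {
	return &mockBucketClient{createErr: errors.New("simulated bucket creation failure")}, nil
}

// newFakeClient builds a fake client whose Status().Update calls succeed for
// CloudStorage, matching the status-subresource fix mentioned above.
func newFakeClient(objs ...client.Object) client.Client {
	scheme := runtime.NewScheme()
	_ = oadpv1alpha1.AddToScheme(scheme)
	return fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(objs...).
		WithStatusSubresource(&oadpv1alpha1.CloudStorage{}).
		Build()
}
```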

Benefits

  1. Self-healing: System continues retrying with exponentially increasing delays
  2. Resource efficiency: Avoids overwhelming the API with rapid retries
  3. Better observability: Status conditions provide clear visibility into reconciliation state
  4. Standard pattern: Uses Kubernetes-native retry mechanisms
  5. Resilient operation: Handles transient failures gracefully with automatic retries

Testing

  • All unit tests pass
  • Exponential backoff behavior verified through mock testing
  • Status conditions properly set and updated
  • Linting checks pass

Related Issues

  • Fixes OADP-6653: CloudStorage controller should implement exponential backoff for retries

Note

Responses generated with Claude

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2025

openshift-ci bot commented Sep 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2025
@kaovilai kaovilai changed the title CloudStorage LimitedRetries OADP-6653: CloudStorage stop retrying after 3 errors Sep 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 3, 2025
openshift-ci-robot commented Sep 3, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this:

Why the changes were made

How to test the changes made

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

kaovilai (Member, Author) commented Sep 4, 2025

From scrum: back off for transient/unknown 500 errors; if it's a known issue, use a much longer backoff. This would also cover cases where there is nothing to watch, e.g. STS tokens whose rotation does not result in Secret updates.

kaovilai force-pushed the CloudStorage-LimitedRetries branch 2 times, most recently from 3a9b0b4 to bb041c9, on September 5, 2025 14:53
@kaovilai kaovilai changed the title OADP-6653: CloudStorage stop retrying after 3 errors OADP-6653: CloudStorage exponential backoff using Kubernetes libraries Sep 5, 2025
openshift-ci-robot commented Sep 5, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this:

(the updated PR description, quoted in full; identical to the description at the top of this conversation)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

	if updateErr := b.Client.Status().Update(ctx, &bucket); updateErr != nil {
		logger.Error(updateErr, "failed to update CloudStorage status")
	}
	return ctrl.Result{}, err
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +179 to +183
	// Return error to trigger exponential backoff
	if err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, fmt.Errorf("bucket creation failed")
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +208 to +209
	// Return error to trigger exponential backoff
	return ctrl.Result{}, err
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

kaovilai (Member, Author) commented Sep 5, 2025

/hold to rebase against #1936

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2025
@kaovilai kaovilai marked this pull request as ready for review September 5, 2025 15:15
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2025
kaovilai (Member, Author) commented Sep 5, 2025

Reviewers are asked to review only the second commit of this PR.

@kaovilai kaovilai changed the title OADP-6653: CloudStorage exponential backoff using Kubernetes libraries OADP-6653: CloudStorage exponential backoff by removing requeueAfter Sep 5, 2025
@kaovilai kaovilai changed the title OADP-6653: CloudStorage exponential backoff by removing requeueAfter OADP-6653: CloudStorage exponential backoff by removing RequeueAfter Sep 5, 2025
sseago previously approved these changes Sep 5, 2025
openshift-ci bot commented Sep 5, 2025

@kaovilai: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/4.19-e2e-test-aws | bb041c9 | link | true | /test 4.19-e2e-test-aws
ci/prow/4.20-e2e-test-hcp-aws | bb041c9 | link | false | /test 4.20-e2e-test-hcp-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
- Add Conditions field to CloudStorageStatus for better observability
- Implement exponential backoff by returning errors on bucket operations
- Controller-runtime automatically handles retries (5ms to 1000s max)
- Add condition constants for type-safe reason strings
- Create mock bucket client for improved testing
- Add comprehensive tests for backoff behavior and conditions

Key improvements:
- Standard Kubernetes pattern using built-in workqueue backoff
- Self-healing: continues retrying with increasing delays
- Better observability through status conditions
- Per-item backoff: each CloudStorage CR gets independent retry timing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
kaovilai force-pushed the CloudStorage-LimitedRetries branch from bb041c9 to 9cd79a3 on September 8, 2025 17:02
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2025
openshift-ci bot commented Sep 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment

kaovilai (Member, Author) commented Sep 8, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2025
	}

	// Update status with updated value
	bucket.Status.LastSynced = &metav1.Time{Time: time.Now()}
	bucket.Status.Name = bucket.Spec.Name

-	b.Client.Status().Update(ctx, &bucket)
+	if err := b.Client.Status().Update(ctx, &bucket); err != nil {
+		logger.Error(err, "failed to update CloudStorage status")
A contributor commented:
Should we also use a backoff here? Otherwise, could this result in the condition remaining not ready?

Copilot AI added a commit to kaovilai/oadp-operator that referenced this pull request Sep 9, 2025
- Return error instead of just logging when final status update fails
- Add documentation test explaining the change
- Ensures controller-runtime's exponential backoff is triggered for status update failures

Addresses PR comment openshift#1937 discussion_r2330918689

Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>
* Initial plan

* Add exponential backoff for status update failures

- Return error instead of just logging when final status update fails
- Add documentation test explaining the change
- Ensures controller-runtime's exponential backoff is triggered for status update failures

Addresses PR comment openshift#1937 discussion_r2330918689

Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>
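
A minimal sketch of the pattern those commits describe (the helper name and types below are illustrative, not the operator's exact code): the status-update error is logged and returned, so a failed status write also lands on the workqueue's backoff curve instead of silently leaving the condition stale.

```go
// Illustrative only: return the status-update error so controller-runtime
// requeues the CloudStorage with exponential backoff.
package controllers

import (
	"context"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func finishReconcile(ctx context.Context, c client.Client, obj client.Object, logger logr.Logger) (ctrl.Result, error) {
	if err := c.Status().Update(ctx, obj); err != nil {
		logger.Error(err, "failed to update CloudStorage status")
		return ctrl.Result{}, err // previously this error was only logged
	}
	return ctrl.Result{}, nil
}
```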