Conversation

kaovilai (Member) commented Sep 3, 2025

Description

This PR implements exponential backoff for the CloudStorage controller using Kubernetes-native libraries. The controller leverages controller-runtime's built-in workqueue backoff mechanism for automatic retries with increasing delays when bucket operations fail.

Key Changes

Exponential Backoff Implementation

  • Leverages controller-runtime's workqueue: Returns errors on failures to trigger automatic exponential backoff (5ms initial → 1000s maximum); see the sketch after this list
  • Per-item backoff: Each CloudStorage CR gets independent retry timing through the workqueue
  • Self-healing behavior: Continues retrying with exponentially increasing delays instead of giving up
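
A minimal sketch of that pattern, assuming a trimmed-down reconciler (the `ensureBucket` helper and struct fields here are illustrative stand-ins, not the controller's actual code):

```go
// Simplified sketch: returning an error from Reconcile is what drives the
// workqueue's per-item exponential backoff.
package controllers

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type CloudStorageReconciler struct {
	client.Client
}

func (r *CloudStorageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ok, err := r.ensureBucket(ctx, req) // hypothetical helper wrapping the bucket client
	if err != nil {
		// A non-nil error requeues this CloudStorage with the workqueue's default
		// exponential backoff (5ms initial, capped at 1000s), independently per item.
		return ctrl.Result{}, err
	}
	if !ok {
		// No fixed RequeueAfter: returning an error keeps retries on the backoff curve.
		return ctrl.Result{}, fmt.Errorf("bucket not ready yet")
	}
	return ctrl.Result{}, nil
}

// ensureBucket stands in for the real exists/create logic.
func (r *CloudStorageReconciler) ensureBucket(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}
```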

Status Conditions for Observability

  • Added Conditions field to CloudStorageStatus for standard Kubernetes status reporting
  • Implemented BucketReady condition with multiple reason constants (see the sketch after this list):
    • ReasonBucketCreated: Bucket successfully created
    • ReasonBucketReady: Bucket exists and is ready
    • ReasonBucketCreationFailed: Bucket creation failed (triggers backoff)
    • ReasonBucketCheckError: Error checking bucket status (triggers backoff)
    • ReasonSTSSecretError: STS secret operation failed (triggers backoff)
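
A minimal sketch of how such a condition might be recorded with `meta.SetStatusCondition`; the constant values and helper below are assumptions, only the condition and reason names come from this PR:

```go
// Sketch of setting the BucketReady condition on a failed bucket creation.
// Constant values and the helper signature are assumptions for illustration.
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	ConditionBucketReady       = "BucketReady"
	ReasonBucketCreationFailed = "BucketCreationFailed"
)

// markBucketCreationFailed records the failure on the CR's status; the caller
// then returns the error so the workqueue applies exponential backoff.
func markBucketCreationFailed(conditions *[]metav1.Condition, err error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    ConditionBucketReady,
		Status:  metav1.ConditionFalse,
		Reason:  ReasonBucketCreationFailed,
		Message: err.Error(),
	})
}
```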

Improved Testing

  • Created mockBucketClient for dependency injection in tests
  • Added BucketClientFactory to CloudStorageReconciler for test mocking (sketched after this list)
  • Comprehensive test coverage for exponential backoff behavior
  • Fixed fake client configuration with status subresource support
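
A rough sketch of the test plumbing described in this list; the interface shape, factory signature, and import paths are assumptions rather than the operator's exact code:

```go
// Hypothetical, trimmed-down sketch: a mock bucket client behind a factory,
// plus a fake client with the CloudStorage status subresource enabled.
package controllers

import (
	"errors"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	oadpv1alpha1 "github.com/openshift/oadp-operator/api/v1alpha1"
)

// bucketClient stands in for the operator's bucket client interface.
type bucketClient interface {
	Exists() (bool, error)
	Create() (bool, error)
}

// BucketClientFactory lets tests inject a mock instead of a real cloud client.
type BucketClientFactory func() (bucketClient, error)

// mockBucketClient returns canned results so tests can force the error paths
// that should trigger exponential backoff.
type mockBucketClient struct {
	exists    bool
	existsErr error
	createErr error
}

func (m *mockBucketClient) Exists() (bool, error) { return m.exists, m.existsErr }
func (m *mockBucketClient) Create() (bool, error) {
	if m.createErr != nil {
		return false, m.createErr
	}
	return true, nil
}

// failingFactory wires a creation failure into the reconciler under test.
var failingFactory BucketClientFactory = func() (bucketClient, error) {
	return &mockBucketClient{createErr: errors.New("simulated bucket creation failure")}, nil
}

// newFakeClient builds a fake client whose Status().Update calls succeed for
// CloudStorage, matching the status-subresource fix mentioned above.
func newFakeClient(objs ...client.Object) client.Client {
	scheme := runtime.NewScheme()
	_ = oadpv1alpha1.AddToScheme(scheme)
	return fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(objs...).
		WithStatusSubresource(&oadpv1alpha1.CloudStorage{}).
		Build()
}
```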

Benefits

  1. Self-healing: System continues retrying with exponentially increasing delays
  2. Resource efficiency: Avoids overwhelming the API with rapid retries
  3. Better observability: Status conditions provide clear visibility into reconciliation state
  4. Standard pattern: Uses Kubernetes-native retry mechanisms
  5. Resilient operation: Handles transient failures gracefully with automatic retries

Testing

  • All unit tests pass
  • Exponential backoff behavior verified through mock testing
  • Status conditions properly set and updated
  • Linting checks pass

Related Issues

  • Fixes OADP-6653: CloudStorage controller should implement exponential backoff for retries

Note

Responses generated with Claude

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2025

openshift-ci bot commented Sep 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2025
@kaovilai kaovilai changed the title CloudStorage LimitedRetries OADP-6653: CloudStorage stop retrying after 3 errors Sep 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 3, 2025
openshift-ci-robot commented Sep 3, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this:

Why the changes were made

How to test the changes made

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

kaovilai (Member, Author) commented Sep 4, 2025

From scrum: back off for transient/unknown 500 errors; if it's a known issue, use a much longer backoff. This would also cover cases where there is nothing to watch, e.g. STS tokens whose rotation does not result in Secret updates.

kaovilai force-pushed the CloudStorage-LimitedRetries branch 2 times, most recently from 3a9b0b4 to bb041c9, on September 5, 2025 14:53
@kaovilai kaovilai changed the title OADP-6653: CloudStorage stop retrying after 3 errors OADP-6653: CloudStorage exponential backoff using Kubernetes libraries Sep 5, 2025
openshift-ci-robot commented Sep 5, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this:

(the updated PR description, quoted in full; identical to the description at the top of this conversation)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

	if updateErr := b.Client.Status().Update(ctx, &bucket); updateErr != nil {
		logger.Error(updateErr, "failed to update CloudStorage status")
	}
	return ctrl.Result{}, err
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +179 to +183
	// Return error to trigger exponential backoff
	if err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, fmt.Errorf("bucket creation failed")
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +208 to +209
	// Return error to trigger exponential backoff
	return ctrl.Result{}, err
kaovilai (Member, Author) commented:
return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

kaovilai (Member, Author) commented Sep 5, 2025

/hold to rebase against #1936

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2025
@kaovilai kaovilai marked this pull request as ready for review September 5, 2025 15:15
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2025
kaovilai (Member, Author) commented Sep 5, 2025

Reviewers are asked to review only the second commit of this PR.

@kaovilai kaovilai changed the title OADP-6653: CloudStorage exponential backoff using Kubernetes libraries OADP-6653: CloudStorage exponential backoff by removing requeueAfter Sep 5, 2025
@kaovilai kaovilai changed the title OADP-6653: CloudStorage exponential backoff by removing requeueAfter OADP-6653: CloudStorage exponential backoff by removing RequeueAfter Sep 5, 2025
sseago previously approved these changes Sep 5, 2025
openshift-ci bot commented Sep 5, 2025

@kaovilai: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/4.19-e2e-test-aws | bb041c9 | link | true | /test 4.19-e2e-test-aws
ci/prow/4.20-e2e-test-hcp-aws | bb041c9 | link | false | /test 4.20-e2e-test-hcp-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
- Add Conditions field to CloudStorageStatus for better observability
- Implement exponential backoff by returning errors on bucket operations
- Controller-runtime automatically handles retries (5ms to 1000s max)
- Add condition constants for type-safe reason strings
- Create mock bucket client for improved testing
- Add comprehensive tests for backoff behavior and conditions

Key improvements:
- Standard Kubernetes pattern using built-in workqueue backoff
- Self-healing: continues retrying with increasing delays
- Better observability through status conditions
- Per-item backoff: each CloudStorage CR gets independent retry timing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
kaovilai force-pushed the CloudStorage-LimitedRetries branch from bb041c9 to 9cd79a3 on September 8, 2025 17:02
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2025
openshift-ci bot commented Sep 8, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment

kaovilai (Member, Author) commented Sep 8, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2025
	}

	// Update status with updated value
	bucket.Status.LastSynced = &metav1.Time{Time: time.Now()}
	bucket.Status.Name = bucket.Spec.Name

-	b.Client.Status().Update(ctx, &bucket)
+	if err := b.Client.Status().Update(ctx, &bucket); err != nil {
+		logger.Error(err, "failed to update CloudStorage status")
A contributor commented:
Should we also use a backoff here? Otherwise, could this result in the condition remaining not ready?

Copilot AI added a commit to kaovilai/oadp-operator that referenced this pull request Sep 9, 2025
- Return error instead of just logging when final status update fails
- Add documentation test explaining the change
- Ensures controller-runtime's exponential backoff is triggered for status update failures

Addresses PR comment openshift#1937 discussion_r2330918689

Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>
* Initial plan

* Add exponential backoff for status update failures

- Return error instead of just logging when final status update fails
- Add documentation test explaining the change
- Ensures controller-runtime's exponential backoff is triggered for status update failures

Addresses PR comment openshift#1937 discussion_r2330918689

Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kaovilai <11228024+kaovilai@users.noreply.github.com>
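
A minimal sketch of the pattern those commits describe (the helper name and types below are illustrative, not the operator's exact code): the status-update error is logged and returned, so a failed status write also lands on the workqueue's backoff curve instead of silently leaving the condition stale.

```go
// Illustrative only: return the status-update error so controller-runtime
// requeues the CloudStorage with exponential backoff.
package controllers

import (
	"context"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func finishReconcile(ctx context.Context, c client.Client, obj client.Object, logger logr.Logger) (ctrl.Result, error) {
	if err := c.Status().Update(ctx, obj); err != nil {
		logger.Error(err, "failed to update CloudStorage status")
		return ctrl.Result{}, err // previously this error was only logged
	}
	return ctrl.Result{}, nil
}
```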