
Conversation

@kaovilai (Member) commented Sep 3, 2025

Description

This PR implements exponential backoff for the CloudStorage controller using Kubernetes-native libraries. The controller leverages controller-runtime's built-in workqueue backoff mechanism for automatic retries with increasing delays when bucket operations fail.

Key Changes

Exponential Backoff Implementation

  • Leverages controller-runtime's workqueue: Returns errors on failures to trigger automatic exponential backoff (5ms initial → 1000s maximum); see the sketch after this list
  • Per-item backoff: Each CloudStorage CR gets independent retry timing through the workqueue
  • Self-healing behavior: Continues retrying with exponentially increasing delays instead of giving up
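
A minimal sketch of the error-return pattern (hypothetical reconciler helpers, not the PR's exact code):

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch: returning a non-nil error from Reconcile puts the request back on
// the workqueue via AddRateLimited, i.e. per-item exponential backoff
// (5ms base, 1000s cap with controller-runtime's default rate limiter).
func (r *CloudStorageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	exists, err := r.bucketExists(ctx, req) // hypothetical helper
	if err != nil {
		return ctrl.Result{}, err // backoff: 5ms, 10ms, 20ms, ... up to 1000s
	}
	if !exists {
		if err := r.createBucket(ctx, req); err != nil { // hypothetical helper
			return ctrl.Result{}, err
		}
	}
	// A clean return calls Forget on the item, resetting its backoff.
	return ctrl.Result{}, nil
}
```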

Status Conditions for Observability

  • Added Conditions field to CloudStorageStatus for standard Kubernetes status reporting
  • Implemented BucketReady condition with multiple reason constants (see the sketch after this list):
    • ReasonBucketCreated: Bucket successfully created
    • ReasonBucketReady: Bucket exists and is ready
    • ReasonBucketCreationFailed: Bucket creation failed (triggers backoff)
    • ReasonBucketCheckError: Error checking bucket status (triggers backoff)
    • ReasonSTSSecretError: STS secret operation failed (triggers backoff)
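
A minimal sketch of the condition-setting pattern (constant values assumed; the helper is hypothetical):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Condition type plus a subset of the reason constants listed above.
	ConditionBucketReady       = "BucketReady"
	ReasonBucketCreated        = "BucketCreated"
	ReasonBucketCreationFailed = "BucketCreationFailed"
)

// setBucketReady is a hypothetical helper; meta.SetStatusCondition takes care
// of LastTransitionTime and de-duplicates by condition Type.
func setBucketReady(conditions *[]metav1.Condition, status metav1.ConditionStatus, reason, message string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    ConditionBucketReady,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```

On a creation failure the controller would set Status=False with ReasonBucketCreationFailed, update the status subresource, and return the error so the workqueue backs off.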

Improved Testing

  • Created mockBucketClient for dependency injection in tests
  • Added BucketClientFactory to CloudStorageReconciler for test mocking (sketched after this list)
  • Comprehensive test coverage for exponential backoff behavior
  • Fixed fake client configuration with status subresource support
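
A sketch of what that injection seam could look like (interface shape assumed, not taken from the PR):

```go
package controller

import "context"

// BucketClient abstracts the bucket operations the controller needs.
type BucketClient interface {
	Exists(ctx context.Context) (bool, error)
	Create(ctx context.Context) error
}

// BucketClientFactory is the injection point: production code returns a real
// cloud client, tests return a mock.
type BucketClientFactory func(bucketName string) (BucketClient, error)

// mockBucketClient scripts results for tests.
type mockBucketClient struct {
	exists    bool
	existsErr error
	createErr error
}

func (m *mockBucketClient) Exists(ctx context.Context) (bool, error) { return m.exists, m.existsErr }
func (m *mockBucketClient) Create(ctx context.Context) error         { return m.createErr }
```

A test can then set existsErr to a transient failure and assert that Reconcile returns a non-nil error, which is exactly what drives the workqueue backoff.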

Benefits

  1. Self-healing: System continues retrying with exponentially increasing delays
  2. Resource efficiency: Avoids overwhelming the API with rapid retries
  3. Better observability: Status conditions provide clear visibility into reconciliation state
  4. Standard pattern: Uses Kubernetes-native retry mechanisms
  5. Resilient operation: Handles transient failures gracefully with automatic retries

Testing

  • All unit tests pass
  • Exponential backoff behavior verified through mock testing
  • Status conditions properly set and updated
  • Linting checks pass

Related Issues

  • Fixes OADP-6653: CloudStorage controller should implement exponential backoff for retries

Note

Responses generated with Claude

The operator was repeatedly logging "Secret already exists, updating"
and "Following standardized STS workflow, secret created successfully"
even when the secret content hadn't changed. This was happening because
the CloudStorage controller calls STSStandardizedFlow() on every
reconciliation, which always attempted to create the secret first,
then caught the AlreadyExists error and performed an update.

Changed the approach to:
- First check if the secret exists
- Compare existing data with desired data
- Only update when there are actual differences
- Skip updates and avoid logging when content is identical
- Changed the CloudStorage controller to log at Debug level with a more
  accurate message when the STS secret is available (not necessarily newly created)

This eliminates unnecessary API calls to the Kubernetes cluster and
reduces noise in the operator logs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
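
A minimal sketch of that check-then-compare flow (hypothetical helper; the PR's STSStandardizedFlow internals may differ):

```go
package controller

import (
	"context"
	"reflect"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureSecret creates the secret if absent and updates it only when the
// data actually differs, avoiding no-op API calls and log noise.
func ensureSecret(ctx context.Context, c client.Client, desired *corev1.Secret) error {
	existing := &corev1.Secret{}
	key := types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}
	if err := c.Get(ctx, key, existing); err != nil {
		if apierrors.IsNotFound(err) {
			return c.Create(ctx, desired) // secret does not exist yet
		}
		return err
	}
	if reflect.DeepEqual(existing.Data, desired.Data) {
		return nil // content identical: skip the update and the log line
	}
	existing.Data = desired.Data
	return c.Update(ctx, existing)
}
```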
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2025

openshift-ci bot commented Sep 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2025
@kaovilai kaovilai changed the title from "CloudStorage LimitedRetries" to "OADP-6653: CloudStorage stop retrying after 3 errors" Sep 3, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 3, 2025
@openshift-ci-robot commented Sep 3, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this:

Why the changes were made

How to test the changes made

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kaovilai (Member, Author) commented Sep 4, 2025

From scrum: back off on transient/unknown 500 errors; if it is a known issue, use a much longer backoff. That would also cover cases where there is nothing to watch at all, e.g. STS tokens whose rotation does not result in secret updates.
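
One way to express that split (a sketch assuming a recent controller-runtime with typed workqueues; this is not code from the PR) is to override the controller's rate limiter so the backoff ceiling is much higher:

```go
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	"github.com/openshift/oadp-operator/api/v1alpha1" // assumed API package path
)

func (r *CloudStorageReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.CloudStorage{}).
		WithOptions(controller.Options{
			// Same exponential shape as the default, but with a much higher
			// cap, so persistently failing items go quiet between retries.
			RateLimiter: workqueue.NewTypedItemExponentialFailureRateLimiter[reconcile.Request](
				5*time.Millisecond, 1*time.Hour),
		}).
		Complete(r)
}
```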

@kaovilai kaovilai force-pushed the CloudStorage-LimitedRetries branch from dbbfdab to 3a9b0b4 on September 5, 2025 14:10
- Add Conditions field to CloudStorageStatus for better observability
- Implement exponential backoff by returning errors on bucket operations
- Controller-runtime automatically handles retries (5ms to 1000s max)
- Add condition constants for type-safe reason strings
- Create mock bucket client for improved testing
- Add comprehensive tests for backoff behavior and conditions

Key improvements:
- Standard Kubernetes pattern using built-in workqueue backoff
- Self-healing: continues retrying with increasing delays
- Better observability through status conditions
- Per-item backoff: each CloudStorage CR gets independent retry timing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@kaovilai kaovilai force-pushed the CloudStorage-LimitedRetries branch from 3a9b0b4 to bb041c9 on September 5, 2025 14:53
@kaovilai kaovilai changed the title from "OADP-6653: CloudStorage stop retrying after 3 errors" to "OADP-6653: CloudStorage exponential backoff using Kubernetes libraries" Sep 5, 2025
@openshift-ci-robot commented Sep 5, 2025

@kaovilai: This pull request references OADP-6653 which is a valid jira issue.

In response to this: the updated PR description, quoted above.

if updateErr := b.Client.Status().Update(ctx, &bucket); updateErr != nil {
	logger.Error(updateErr, "failed to update CloudStorage status")
}
return ctrl.Result{}, err
@kaovilai (Member, Author) commented:

return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +179 to +183
// Return error to trigger exponential backoff
if err != nil {
	return ctrl.Result{}, err
}
return ctrl.Result{}, fmt.Errorf("bucket creation failed")
@kaovilai (Member, Author) commented:

return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

Comment on lines +208 to +209
// Return error to trigger exponential backoff
return ctrl.Result{}, err
@kaovilai (Member, Author) commented:

return ctrl.Result{}, err makes it requeue with exponential backoff if err != nil.

@kaovilai (Member, Author) commented Sep 5, 2025

/hold to rebase against #1936

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 5, 2025
@kaovilai kaovilai marked this pull request as ready for review September 5, 2025 15:15
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 5, 2025
@kaovilai (Member, Author) commented Sep 5, 2025

Reviewers are asked to review only the second commit of this PR.

@kaovilai kaovilai changed the title from "OADP-6653: CloudStorage exponential backoff using Kubernetes libraries" to "OADP-6653: CloudStorage exponential backoff by removing requeueAfter" Sep 5, 2025
@kaovilai kaovilai changed the title from "OADP-6653: CloudStorage exponential backoff by removing requeueAfter" to "OADP-6653: CloudStorage exponential backoff by removing RequeueAfter" Sep 5, 2025

openshift-ci bot commented Sep 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai, sseago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
