-
Notifications
You must be signed in to change notification settings - Fork 82
OADP-6653: CloudStorage stop retrying after 3 errors #1937
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: oadp-dev
Are you sure you want to change the base?
Conversation
The operator was repeatedly logging "Secret already exists, updating" and "Following standardized STS workflow, secret created successfully" even when the secret content hadn't changed. This was happening because the CloudStorage controller calls STSStandardizedFlow() on every reconciliation, which always attempted to create the secret first, then caught the AlreadyExists error and performed an update. Changed the approach to: - First check if the secret exists - Compare existing data with desired data - Only update when there are actual differences - Skip updates and avoid logging when content is identical - Changed CloudStorage controller to use Debug level and more accurate message when STS secret is available (not necessarily created) This eliminates unnecessary API calls to the Kubernetes cluster and reduces noise in the operator logs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kaovilai The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@kaovilai: This pull request references OADP-6653 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
from scrum backoff for transient/unknown 500 errs |
dbbfdab
to
3a9b0b4
Compare
- Add Conditions field to CloudStorageStatus for better observability - Implement exponential backoff by returning errors on bucket operations - Controller-runtime automatically handles retries (5ms to 1000s max) - Add condition constants for type-safe reason strings - Create mock bucket client for improved testing - Add comprehensive tests for backoff behavior and conditions Key improvements: - Standard Kubernetes pattern using built-in workqueue backoff - Self-healing: continues retrying with increasing delays - Better observability through status conditions - Per-item backoff: each CloudStorage CR gets independent retry timing 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
3a9b0b4
to
bb041c9
Compare
Why the changes were made
How to test the changes made