-
Notifications
You must be signed in to change notification settings - Fork 13
Fix auth transition on edge-cases #321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
MCK 1.3.0 Release NotesBug Fixes
Other Changes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion
, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.
Key changes:
- Added authentication transition detection in the automation status checker
- Introduced logic to wait for auth-related moves to complete before marking clusters as ready
- Added comprehensive test coverage for authentication transition scenarios
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
controllers/om/automation_status.go | Added authentication transition detection and isAuthenticationTransitionMove helper function |
controllers/om/automation_status_test.go | Added comprehensive test cases for authentication transition scenarios |
changelog/20250808_fix_fixing_auth_transition_edge_cases.md | Added changelog entry documenting the fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes an edge case in authentication transitions where nodes could be marked as "Running" too early, causing premature restarts and authentication failures. Previously, the operator considered processes ready once they reached the goal version, even if authentication transitions were still in progress.
- Enhanced automation status logic to detect ongoing authentication transitions
- Added checks for auth-related moves (
UpdateAuth
,WaitAuthUpdate
) in process plans - Prevents premature cluster operations during authentication state changes
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
controllers/om/automation_status.go |
Implements logic to detect ongoing auth transitions and prevent premature readiness |
controllers/om/automation_status_test.go |
Adds comprehensive tests for auth transition detection and validation |
changelog/20250808_fix_fixing_auth_transition_edge_cases.md |
Documents the fix for auth transition edge cases |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
Summary
Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and we generally return ready for all wait steps, regardless of whether auth is fully transitioned. Example status:
Why implemented in the operator and not readinessProbe:
I didn't fix the readinessProbe but rather the operator
operator does
The core idea:
What happened in our tests:
node-0
completed its auth transition (now uses scram, instead of x509)Config server
hasn't finished its auth transition yetnod e-0
node-0
restarted with the old X509 config (see below comment from the agent code)tl;dr: first
node-0
moved to new auth,config
not yet,node-0
restarted and during the restartconfig
transitioned to the new auth whilenode-0
is again running old authProof of Work
Checklist
skip-changelog
label if not needed