Skip to content

Fix auth transition on edge-cases #321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Fix auth transition on edge-cases #321

wants to merge 12 commits into from

Conversation

nammn
Copy link
Collaborator

@nammn nammn commented Aug 7, 2025

Summary

Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and we generally return ready for all wait steps, regardless of whether auth is fully transitioned. Example status:

{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}

Why implemented in the operator and not readinessProbe:
I didn't fix the readinessProbe but rather the operator

  • if the readinessProbe blocks new nodes are not coming up
  • we want new nodes coming up
  • but we also want to block new configurations being applied, which the automation_status check in the
    operator does

The core idea:

  • Configuration applied ≠ transition fully complete.

What happened in our tests:

  • we update auth via CR x509 -> scram
  • node-0 completed its auth transition (now uses scram, instead of x509)
  • Config server hasn't finished its auth transition yet
  • We hit a race condition where clusters were marked as "Running" too early and thus continued the rolling restart of nod e-0
  • node-0 restarted with the old X509 config (see below comment from the agent code)
  • The X509 process couldn’t access the SCRAM automation user
  • Leads to Error: "process...doesn't have the automation user"
  • in the mms-automation there is also a comment; that indicates thats they are handling the edge-case if an auth transition was not successful, they start the process with old auth to "finish" it. But this is exactly what causes our race condition
	// If a process went down unexpectedly in the middle of an auth transition,
	// we want to restart it with the old auth args.
	// Otherwise, it could be upgraded to the new auth state too soon,
	// and not be able to communicate with other shard members.

tl;dr: first node-0 moved to new auth, config not yet, node-0 restarted and during the restart config transitioned to the new auth while node-0 is again running old auth

Proof of Work

  • auth change tests are passing multiple times in a row: Link - the most flaky auth tests + Link2 - from the patch

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

Copy link

github-actions bot commented Aug 7, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

  • This change fixes the current complex and difficult-to-maintain architecture for stateful set containers, which relies on an "agent matrix" to map operator and agent versions which led to a sheer amount of images.
  • We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
  • subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

@nammn nammn changed the title Fix scram 222 Fix auth transition on edge-cases Aug 8, 2025
@nammn nammn marked this pull request as ready for review August 8, 2025 08:59
@nammn nammn requested a review from a team as a code owner August 8, 2025 08:59
@nammn nammn requested review from m1kola, anandsyncs and Copilot August 8, 2025 08:59
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.

Key changes:

  • Added authentication transition detection in the automation status checker
  • Introduced logic to wait for auth-related moves to complete before marking clusters as ready
  • Added comprehensive test coverage for authentication transition scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
controllers/om/automation_status.go Added authentication transition detection and isAuthenticationTransitionMove helper function
controllers/om/automation_status_test.go Added comprehensive test cases for authentication transition scenarios
changelog/20250808_fix_fixing_auth_transition_edge_cases.md Added changelog entry documenting the fix

Copy link
Member

@mircea-cosbuc mircea-cosbuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a nit

Copilot

This comment was marked as outdated.

@nammn nammn requested a review from viveksinghggits August 13, 2025 08:09
@anandsyncs anandsyncs requested a review from Copilot August 13, 2025 08:18
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an edge case in authentication transitions where nodes could be marked as "Running" too early, causing premature restarts and authentication failures. Previously, the operator considered processes ready once they reached the goal version, even if authentication transitions were still in progress.

  • Enhanced automation status logic to detect ongoing authentication transitions
  • Added checks for auth-related moves (UpdateAuth, WaitAuthUpdate) in process plans
  • Prevents premature cluster operations during authentication state changes

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
controllers/om/automation_status.go Implements logic to detect ongoing auth transitions and prevent premature readiness
controllers/om/automation_status_test.go Adds comprehensive tests for auth transition detection and validation
changelog/20250808_fix_fixing_auth_transition_edge_cases.md Documents the fix for auth transition edge cases

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants