Fix auth transition on edge-cases #321

nammn · 2025-08-07T15:04:22Z

Summary

Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic so that it correctly reflects the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down while an auth transition is still in progress.
The problem was that we returned "ready" during a wait step (WaitAuthUpdate), because we generally return ready for all wait steps, regardless of whether auth has fully transitioned. Example status:

{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}
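
To make the example status above concrete, here is a minimal sketch of how such a step could be decoded and checked. The `StepStatus` type and the `authWaitPending` helper are hypothetical names chosen for illustration; only the JSON keys and the `WaitAuthUpdate` step name come from the example, and the real operator code may model this differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// StepStatus mirrors the JSON keys of the example status.
// The Go type itself is an assumption made for this sketch.
type StepStatus struct {
	Step          string     `json:"step"`
	StepDoc       string     `json:"stepDoc"`
	IsWaitStep    bool       `json:"isWaitStep"`
	Started       *time.Time `json:"started"`
	Attempts      int        `json:"attempts"`
	LatestAttempt *time.Time `json:"latestAttempt"`
	Completed     *time.Time `json:"completed"` // null while the step is still running
	Result        string     `json:"result"`
}

// authWaitPending reports whether the step is the auth wait step
// and has not completed yet (hypothetical helper).
func authWaitPending(s StepStatus) bool {
	return s.Step == "WaitAuthUpdate" && s.IsWaitStep && s.Completed == nil
}

// parseStep decodes a single step status from JSON.
func parseStep(raw string) (StepStatus, error) {
	var s StepStatus
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

const exampleStatus = `{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}`

func main() {
	s, err := parseStep(exampleStatus)
	if err != nil {
		panic(err)
	}
	// The step is a wait step with completed == null, so it is still pending.
	fmt.Println(authWaitPending(s)) // prints true
}
```

A step like this is exactly the case where "wait step, therefore ready" is the wrong conclusion: the step is waiting, but what it waits for is the auth transition itself.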

Why implemented in the operator and not the readinessProbe:
I fixed this in the operator rather than in the readinessProbe:

- If the readinessProbe blocks, new nodes do not come up.
- We want new nodes to come up.
- We also want to block new configurations from being applied, which the automation_status check in the operator does.

The core idea:

Configuration applied ≠ transition fully complete.
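
The core idea can be sketched as a readiness predicate that requires both conditions. This is a simplified illustration, not the operator's actual code: `ProcessState`, its fields, and `clusterReady` are hypothetical names, and the goal version 7 is an arbitrary example value.

```go
package main

import "fmt"

// ProcessState is a hypothetical, simplified view of one process
// as seen through the automation status.
type ProcessState struct {
	Name                    string
	LastGoalVersionAchieved int
	AuthTransitionPending   bool // true while an auth-related step is still running
}

// clusterReady returns true only if every process has reached the goal
// version AND no process is still in the middle of an auth transition.
func clusterReady(goalVersion int, procs []ProcessState) bool {
	for _, p := range procs {
		if p.LastGoalVersionAchieved != goalVersion {
			return false // configuration not applied yet
		}
		if p.AuthTransitionPending {
			return false // configuration applied, but transition not complete
		}
	}
	return true
}

func main() {
	procs := []ProcessState{
		{Name: "node-0", LastGoalVersionAchieved: 7},
		{Name: "config-0", LastGoalVersionAchieved: 7, AuthTransitionPending: true},
	}
	fmt.Println(clusterReady(7, procs)) // prints false

	procs[1].AuthTransitionPending = false
	fmt.Println(clusterReady(7, procs)) // prints true
}
```

With only the goal-version comparison, the first call would have returned true even though one process was still transitioning.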

What happened in our tests:

- We update auth via the CR: x509 -> scram.
- node-0 completed its auth transition (it now uses SCRAM instead of X509).
- The config server hasn't finished its auth transition yet.
- We hit a race condition where clusters were marked as "Running" too early, so the rolling restart of node-0 continued.
- node-0 restarted with the old X509 config (see the comment from the agent code below).
- The X509 process couldn't access the SCRAM automation user.
- This leads to the error: "process...doesn't have the automation user"

In mms-automation there is also a comment indicating that they handle the edge case where an auth transition was not successful: they start the process with the old auth to "finish" it. But this is exactly what causes our race condition:

	// If a process went down unexpectedly in the middle of an auth transition,
	// we want to restart it with the old auth args.
	// Otherwise, it could be upgraded to the new auth state too soon,
	// and not be able to communicate with other shard members.

tl;dr: node-0 moved to the new auth first, while the config server had not yet. node-0 was then restarted, and during that restart the config server transitioned to the new auth while node-0 was running the old auth again.

Proof of Work

The auth change tests pass multiple times in a row: Link (the most flaky auth tests) + Link2 (from the patch)

Checklist

Have you linked a Jira ticket and/or is the ticket in the title?
Have you checked whether your Jira ticket required DOCSP changes?
Have you added a changelog file?
- use skip-changelog label if not needed
- refer to Changelog files and Release Notes section in CONTRIBUTING.md for more details

github-actions · 2025-08-07T15:05:17Z

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

This change fixes the current complex and difficult-to-maintain architecture for stateful set containers, which relies on an "agent matrix" to map operator and agent versions and which led to a large number of images.
We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

Copilot

Pull Request Overview

This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.

Key changes:

Added authentication transition detection in the automation status checker
Introduced logic to wait for auth-related moves to complete before marking clusters as ready
Added comprehensive test coverage for authentication transition scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
controllers/om/automation_status.go	Added authentication transition detection and `isAuthenticationTransitionMove` helper function
controllers/om/automation_status_test.go	Added comprehensive test cases for authentication transition scenarios
changelog/20250808_fix_fixing_auth_transition_edge_cases.md	Added changelog entry documenting the fix

mircea-cosbuc

LGTM with a nit

Copilot

Pull Request Overview

This PR fixes an edge case in authentication transitions where nodes could be marked as "Running" too early, causing premature restarts and authentication failures. Previously, the operator considered processes ready once they reached the goal version, even if authentication transitions were still in progress.

Enhanced automation status logic to detect ongoing authentication transitions
Added checks for auth-related moves (UpdateAuth, WaitAuthUpdate) in process plans
Prevents premature cluster operations during authentication state changes
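
As a rough illustration of checking a process plan for auth-related moves, the sketch below scans a plan and collects the auth moves that have not completed. The `Move` type, `isAuthMove`, and `pendingAuthMoves` are hypothetical stand-ins and not the helper from the PR; the move names `UpdateAuth` and `WaitAuthUpdate` come from the list above, while `Start` is a made-up placeholder.

```go
package main

import "fmt"

// Move is a hypothetical, simplified plan entry:
// a move name plus whether it has completed.
type Move struct {
	Name      string
	Completed bool
}

// authMoveNames lists the move names treated as part of an auth transition.
var authMoveNames = map[string]bool{
	"UpdateAuth":     true,
	"WaitAuthUpdate": true,
}

// isAuthMove reports whether a move name belongs to an auth transition.
func isAuthMove(name string) bool {
	return authMoveNames[name]
}

// pendingAuthMoves returns the names of auth-related moves in the plan
// that have not completed yet, in plan order.
func pendingAuthMoves(plan []Move) []string {
	var pending []string
	for _, m := range plan {
		if isAuthMove(m.Name) && !m.Completed {
			pending = append(pending, m.Name)
		}
	}
	return pending
}

func main() {
	plan := []Move{
		{Name: "Start", Completed: true}, // placeholder, not a real move name
		{Name: "UpdateAuth", Completed: true},
		{Name: "WaitAuthUpdate", Completed: false},
	}
	fmt.Println(pendingAuthMoves(plan)) // prints [WaitAuthUpdate]
}
```

A non-empty result would be the signal to keep the resource out of "Running" until the remaining auth moves finish.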

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`controllers/om/automation_status.go`	Implements logic to detect ongoing auth transitions and prevent premature readiness
`controllers/om/automation_status_test.go`	Adds comprehensive tests for auth transition detection and validation
`changelog/20250808_fix_fixing_auth_transition_edge_cases.md`	Documents the fix for auth transition edge cases

nammn added 3 commits August 7, 2025 17:01

try out auth detection

d0c2fb5

refactor auth

a1ae41d

refactor auth

9d42913

linter

54dd7d2

nammn changed the title ~~Fix scram 222~~ Fix auth transition on edge-cases Aug 8, 2025

fix auth transition

2894667

nammn marked this pull request as ready for review August 8, 2025 08:59

nammn requested a review from a team as a code owner August 8, 2025 08:59

nammn requested review from m1kola, anandsyncs and Copilot August 8, 2025 08:59

Copilot AI reviewed Aug 8, 2025

nammn requested review from mircea-cosbuc, viveksinghggits, Julien-Ben and lucian-tosa August 8, 2025 09:12

nammn added 2 commits August 11, 2025 10:16

try update auth

b4fc956

fix unit test

f75b01c

mircea-cosbuc requested changes Aug 11, 2025

controllers/om/automation_status.go

controllers/om/automation_status.go (outdated)

viveksinghggits reviewed Aug 11, 2025

controllers/om/automation_status.go

nammn added 2 commits August 12, 2025 16:19

address feedback

e551f2b

Merge branch 'master' into fix-scram-222

bf8256c

nammn requested review from viveksinghggits, mircea-cosbuc and Copilot August 12, 2025 14:23

viveksinghggits reviewed Aug 12, 2025

changelog/20250808_fix_fixing_auth_transition_edge_cases.md (outdated)

update rn

6121c47

nammn added 2 commits August 13, 2025 10:04

Merge branch 'fix-scram-222' of github.com:mongodb/mongodb-kubernetes…

6edeb6f

… into fix-scram-222

Merge branch 'master' into fix-scram-222

48bd635

nammn requested a review from viveksinghggits August 13, 2025 08:09

anandsyncs requested a review from Copilot August 13, 2025 08:18

Copilot AI reviewed Aug 13, 2025

viveksinghggits approved these changes Aug 13, 2025

mircea-cosbuc approved these changes Aug 13, 2025
