Add opt in initial check run #14087

SimonUnge · 2025-06-16T18:14:58Z

Proposed Changes

This PR introduces a new optional feature to help detect potential data loss scenarios during RabbitMQ node startup. The feature adds a verify_initial_run configuration option that defaults to false. When enabled, nodes will create a marker file called node_initialized.marker on their first startup and use this to verify data consistency on subsequent restarts.
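For illustration, opting in through rabbitmq.conf would presumably look like the fragment below. The key name is taken from the schema mapping quoted in the review further down (verify_initial_run). The review asks for it to be renamed to prevent_startup_if_node_was_reset, so treat the key shown here as the name proposed in this PR, not necessarily the one that shipped.

```ini
# rabbitmq.conf (sketch; key name as proposed in this PR, later renamed in review)
# Defaults to false, so the check is skipped unless explicitly enabled.
verify_initial_run = true
```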

The implementation works by adding a new boot step called initial_run_check that runs after recovery but before the existing empty_db_check step.

If the marker file exists but database tables are empty, this indicates a potential data loss scenario such as corruption, accidental database resets, or split-brain recovery issues. In these cases, the node will fail to start with a specific error, cluster_already_initialized_but_tables_empty, giving operators a clear indication that manual intervention may be required instead of silently starting with empty data.
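The decision described above can be summarised as a small pure function. This is only a sketch, not the code in this PR: the module name initial_run_sketch and the function decide/2 are hypothetical, the marker and table state are passed in as booleans instead of being read from disk, and the behaviour when no marker exists (treat it as a first startup and create the marker) is an assumption based on the description.

```erlang
-module(initial_run_sketch).
-export([decide/2]).

%% Hypothetical helper, not part of RabbitMQ. The real boot step would
%% look up the marker file and the table state itself.
-spec decide(MarkerExists :: boolean(), TablesEmpty :: boolean()) ->
    ok | create_marker | {error, cluster_already_initialized_but_tables_empty}.
decide(true, true) ->
    %% The node has booted before, yet its tables are empty: refuse to start.
    {error, cluster_already_initialized_but_tables_empty};
decide(true, false) ->
    %% Marker and data are both present: an ordinary restart.
    ok;
decide(false, _TablesEmpty) ->
    %% Assumption: no marker means first startup, so record it.
    create_marker.
```

Only the first clause prevents startup. Every other combination lets the node boot, which keeps the check from interfering with nodes that have never run with the option enabled.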

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

In the testing section, I fairly aggressively delete schema files to force rabbit_table:needs_default_data() to return true. There might be more elegant solutions?

SimonUnge · 2025-06-16T19:09:00Z

@michaelklishin it seems I can no longer open PRs directly against rabbitmq/rabbitmq-server, so I'm unsure of the best way to submit a PR that makes the tests run.

michaelklishin · 2025-06-24T15:04:52Z

@SimonUnge I had to re-add you as a collaborator. You should have an invite.

Zerpet · 2025-06-24T15:58:56Z

I like this idea at a high level. What do you foresee users doing when this check fails?

In other words, what can a user do about e.g. accidental database resets, when they observe this error? I think most (if not all) of the time, a user would want rabbit to start, and let it sync from its peers.

lukebakken · 2025-06-24T16:26:34Z

> accidental database resets, when they observe this error? I think most (if not all) of the time, a user would want rabbit to start, and let it sync from its peers.

The problem this change helps to address is the scenario in which a peer tries to sync from a reset node. Yep, it can happen.

michaelklishin · 2025-06-24T16:37:59Z

Making the node fail, then decommissioning and re-creating it, is a valid approach.

michaelklishin · 2025-06-24T17:09:51Z

deps/rabbit/priv/schema/rabbit.schema

+%% and refuse to start if the marker exists but tables are empty.
+%%
+
+{mapping, "verify_initial_run", "rabbit.verify_initial_run", [


Can we rename this to prevent_startup_if_node_was_reset? I have considered many names, such as

track_initial_run_and_seeding

enforce_initialized_database

and a few others but the most specific option is the one that describes the end goal: prevent a previously reset node from booting.

Yep, will do

michaelklishin · 2025-06-24T17:10:17Z

deps/rabbit/src/rabbit.erl

@@ -199,10 +199,16 @@
                    {requires,    [core_initialized]},
                    {enables,     routing_ready}]}).

+-rabbit_boot_step({initial_run_check,


I'd also rename this to prevent_startup_if_node_was_reset.

michaelklishin · 2025-06-24T17:12:30Z

deps/rabbit/src/rabbit.erl

+
+-spec check_initial_run() -> 'ok' | no_return().
+
+check_initial_run() ->


Same recommendation as above.

I'll update the naming!

michaelklishin · 2025-06-25T12:26:25Z

Moved to #14125.

mergify bot added the make label Jun 16, 2025

Add opt in initial check run

2d2c70c

SimonUnge force-pushed the su_aws/initial_run_check branch from f25f581 to 2d2c70c on June 17, 2025 20:33

SimonUnge marked this pull request as ready for review June 20, 2025 18:21

michaelklishin reviewed Jun 24, 2025

michaelklishin requested changes Jun 24, 2025

Rename

77cec49

SimonUnge force-pushed the su_aws/initial_run_check branch from e72b695 to 77cec49 on June 24, 2025 20:32

michaelklishin added this to the 4.2.0 milestone Jun 25, 2025

michaelklishin added the backport-v4.1.x label Jun 25, 2025

michaelklishin added a commit that referenced this pull request Jun 25, 2025

More renaming #14087, add new test suite to a parallel CT group

5f1ab14

michaelklishin added a commit that referenced this pull request Jun 25, 2025

More renaming #14087, add new test suite to a parallel CT group

7810b4e

(cherry picked from commit 5f1ab14)

michaelklishin mentioned this pull request Jun 25, 2025

Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past #14125

Merged

michaelklishin closed this Jun 25, 2025

michaelklishin removed this from the 4.2.0 milestone Jun 25, 2025

michaelklishin added a commit that referenced this pull request Jun 25, 2025

Don't list a test suite twice in parallel CT suite groups #14087 #14125

74c4ec8

michaelklishin added a commit that referenced this pull request Jun 25, 2025

Merge pull request #14125 from rabbitmq/rabbitmq-server-14087-take-2

b75fc23

Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past

mergify bot pushed a commit that referenced this pull request Jun 25, 2025

More renaming #14087, add new test suite to a parallel CT group

c727c88

(cherry picked from commit 5f1ab14) (cherry picked from commit 7810b4e)

mergify bot pushed a commit that referenced this pull request Jun 25, 2025

Don't list a test suite twice in parallel CT suite groups #14087 #14125

921ec24

(cherry picked from commit 74c4ec8)

mergify bot mentioned this pull request Jun 25, 2025

Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past (backport #14125) #14129

Merged

LoisSotoLopez pushed a commit to cloudamqp/rabbitmq-server that referenced this pull request Jul 16, 2025

More renaming rabbitmq#14087, add new test suite to a parallel CT group

c042505

(cherry picked from commit 5f1ab14)

LoisSotoLopez pushed a commit to cloudamqp/rabbitmq-server that referenced this pull request Jul 16, 2025

Don't list a test suite twice in parallel CT suite groups rabbitmq#14087

1846a15

rabbitmq#14125
