Add opt in initial check run #14087

Closed
wants to merge 2 commits into from

SimonUnge
Collaborator

@SimonUnge SimonUnge commented Jun 16, 2025

Proposed Changes

This PR introduces a new optional feature to help detect potential data loss scenarios during RabbitMQ node startup. The feature adds a verify_initial_run configuration option that defaults to false. When enabled, nodes will create a marker file called node_initialized.marker on their first startup and use this to verify data consistency on subsequent restarts.

The implementation works by adding a new boot step called initial_run_check that runs after recovery but before the existing empty_db_check step.

If the marker file exists but the database tables are empty, this indicates a potential data loss scenario such as corruption, an accidental database reset, or a split-brain recovery issue. In these cases, the node will fail to start with a specific error, cluster_already_initialized_but_tables_empty, giving operators a clear indication that manual intervention may be required, rather than silently starting with empty data.
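The decision logic described above can be illustrated with a small, language-neutral sketch. This is not the PR's Erlang code: the function name, the exception class, the `tables_empty` and `enabled` parameters, and the returned strings are all hypothetical, and the sketch assumes the marker is created whenever it is missing. Only the marker file name and the error atom are taken from the description.

```python
import tempfile
from pathlib import Path

MARKER_NAME = "node_initialized.marker"  # marker file name from the PR description


class NodeWasResetError(RuntimeError):
    """Hypothetical stand-in for the cluster_already_initialized_but_tables_empty error."""


def initial_run_check(data_dir: Path, tables_empty: bool, enabled: bool = False) -> str:
    """Return what the check decided; raise if the node looks like it was reset.

    `tables_empty` is supplied by the caller in this sketch. How the real
    boot step determines it is not shown here.
    """
    if not enabled:
        # The feature is opt-in: with the option off, nothing is checked or written.
        return "skipped"
    marker = data_dir / MARKER_NAME
    if marker.exists():
        if tables_empty:
            # The node was initialized before but has no data now: refuse to start.
            raise NodeWasResetError("cluster_already_initialized_but_tables_empty")
        return "verified"
    # Assumption: a missing marker means first startup, so record it.
    marker.touch()
    return "marker_created"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        print(initial_run_check(data_dir, tables_empty=True, enabled=True))   # first boot
        print(initial_run_check(data_dir, tables_empty=False, enabled=True))  # normal restart
        try:
            initial_run_check(data_dir, tables_empty=True, enabled=True)      # reset node
        except NodeWasResetError as err:
            print("refusing to start:", err)
```

The key point of the design is that an empty database is ambiguous on its own: it is expected on a first boot and suspicious on any later one. The marker file is what lets the check tell the two cases apart.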

Types of Changes


  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist


  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

In the testing section, I fairly aggressively delete schema files to force rabbit_table:needs_default_data() to return true. There might be a more elegant solution?

@SimonUnge
Collaborator Author

@michaelklishin it seems I can no longer open PRs directly against rabbitmq/rabbitmq-server, so I'm unsure of the best way to open a PR so that the tests run.

@mergify mergify bot added the make label Jun 16, 2025
@SimonUnge SimonUnge force-pushed the su_aws/initial_run_check branch from f25f581 to 2d2c70c Compare June 17, 2025 20:33
@SimonUnge SimonUnge marked this pull request as ready for review June 20, 2025 18:21
@michaelklishin
Collaborator

@SimonUnge I had to re-add you as a collaborator. You should have an invite.

@Zerpet
Member

Zerpet commented Jun 24, 2025

I like this idea at a high level. What do you foresee users doing when this check fails?

In other words, what can a user do about e.g. accidental database resets, when they observe this error? I think most (if not all) of the time, a user would want rabbit to start, and let it sync from its peers.

@lukebakken
Collaborator

accidental database resets, when they observe this error? I think most (if not all) of the time, a user would want rabbit to start, and let it sync from its peers.

The problem this change helps to address is the scenario in which a peer tries to sync from a reset node. Yep, it can happen.

@michaelklishin
Collaborator

Making the node fail, then decommissioning and re-creating it, is a valid approach.

%% and refuse to start if the marker exists but tables are empty.
%%

{mapping, "verify_initial_run", "rabbit.verify_initial_run", [
Collaborator

Can we rename this to prevent_startup_if_node_was_reset? I have considered many names, such as

  • track_initial_run_and_seeding
  • enforce_initialized_database

and a few others but the most specific option is the one that describes the end goal: prevent a previously reset node from booting.

Collaborator Author

Yep, will do

@@ -199,10 +199,16 @@
{requires, [core_initialized]},
{enables, routing_ready}]}).

-rabbit_boot_step({initial_run_check,
Collaborator

I'd also rename this to prevent_startup_if_node_was_reset.


-spec check_initial_run() -> 'ok' | no_return().

check_initial_run() ->
Collaborator

Same recommendation as above.

Collaborator Author

I'll update the naming!

@michaelklishin
Collaborator

Moved to #14125.

@michaelklishin michaelklishin removed this from the 4.2.0 milestone Jun 25, 2025
michaelklishin added a commit that referenced this pull request Jun 25, 2025
Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past
mergify bot pushed a commit that referenced this pull request Jun 25, 2025
(cherry picked from commit 5f1ab14)
(cherry picked from commit 7810b4e)
mergify bot pushed a commit that referenced this pull request Jun 25, 2025
michaelklishin added a commit that referenced this pull request Jun 25, 2025
Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past (backport #14125)
LoisSotoLopez pushed a commit to cloudamqp/rabbitmq-server that referenced this pull request Jul 16, 2025
LoisSotoLopez pushed a commit to cloudamqp/rabbitmq-server that referenced this pull request Jul 16, 2025

4 participants