Add opt in initial check run #14087
@michaelklishin it seems I can no longer open PRs directly against rabbitmq/rabbitmq-server, so I'm unsure how best to submit PRs so that the tests run.
@SimonUnge I had to re-add you as a collaborator. You should have an invite.
I like this idea at a high level. I'm wondering what you foresee users doing when this check fails. In other words, what can a user do about, e.g., an accidental database reset when they observe this error? I think most (if not all) of the time, a user would want rabbit to start and let it sync from its peers.
The problem this change helps to address is the scenario in which a peer tries to sync from a reset node. Yep, it can happen.
Making the node fail, then decommissioning and re-creating it, is a valid approach.
%% and refuse to start if the marker exists but tables are empty.
%%

{mapping, "verify_initial_run", "rabbit.verify_initial_run", [
Can we rename this to prevent_startup_if_node_was_reset? I have considered many names, such as track_initial_run_and_seeding, enforce_initialized_database and a few others, but the most specific option is the one that describes the end goal: prevent a previously reset node from booting.
Yep, will do
@@ -199,10 +199,16 @@
{requires, [core_initialized]},
{enables, routing_ready}]}).

-rabbit_boot_step({initial_run_check,
I'd also rename this to prevent_startup_if_node_was_reset.
deps/rabbit/src/rabbit.erl
Outdated
-spec check_initial_run() -> 'ok' | no_return().

check_initial_run() ->
Same recommendation as above.
I'll update the naming!
(cherry picked from commit 5f1ab14)
Moved to #14125.
Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past
(cherry picked from commit 74c4ec8)
Re-submit #14087 by @SimonUnge: introduce an opinionated, opt-in way to prevent a node from booting if it's been reset in the past (backport #14125)
(cherry picked from commit 5f1ab14)
Proposed Changes
This PR introduces a new optional feature to help detect potential data loss scenarios during RabbitMQ node startup. The feature adds a verify_initial_run configuration option that defaults to false. When enabled, nodes will create a marker file called node_initialized.marker on their first startup and use this to verify data consistency on subsequent restarts.
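For illustration, opting in would presumably look like the fragment below in rabbitmq.conf. The key name is taken from the `{mapping, "verify_initial_run", ...}` line in this revision of the diff; the review discussion asks for it to be renamed, so treat the exact key as an assumption tied to this revision.

```ini
# Assumed key name, as it appears in this revision of the schema mapping.
# The feature is off (false) unless explicitly enabled.
verify_initial_run = true
```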
The implementation works by adding a new boot step called initial_run_check that runs after recovery but before the existing empty_db_check step.
If the marker file exists but the database tables are empty, this indicates a potential data loss scenario such as corruption, an accidental database reset, or a split-brain recovery issue. In these cases, the node will fail to start with a specific error, cluster_already_initialized_but_tables_empty, giving operators a clear indication that manual intervention may be required rather than silently starting with empty data.
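The decision described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the module name, the `check/2` function, and the use of `exit/1` are hypothetical. The caller is assumed to supply the marker file path and a boolean saying whether the tables are empty (for example, the result of `rabbit_table:needs_default_data()`).

```erlang
%% Hypothetical sketch of the marker-file check; names are illustrative only.
-module(initial_run_sketch).
-export([check/2]).

%% MarkerPath:  where node_initialized.marker lives (location chosen by the caller).
%% TablesEmpty: true if the node's database tables hold no data.
-spec check(file:filename_all(), boolean()) -> ok | no_return().
check(MarkerPath, TablesEmpty) ->
    case {filelib:is_regular(MarkerPath), TablesEmpty} of
        {true, true} ->
            %% The node booted before but now has no data: refuse to start.
            exit(cluster_already_initialized_but_tables_empty);
        {true, false} ->
            %% Marker and data both present: a normal restart.
            ok;
        {false, _} ->
            %% No marker yet: treat this as the first run and create it.
            ok = file:write_file(MarkerPath, <<>>),
            ok
    end.
```

Calling `initial_run_sketch:check(Path, true)` on a fresh directory creates the marker and returns `ok`; calling it again with the same arguments exits with `cluster_already_initialized_but_tables_empty`, which mirrors the "marker exists but tables are empty" case.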
Further Comments
In the testing section, I fairly aggressively delete schema files to force rabbit_table:needs_default_data() to return true. There might be more elegant solutions?