Skip to content

Safe database swap #2122

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Safe database swap #2122

wants to merge 4 commits into from

Conversation

m04f
Copy link
Contributor

@m04f m04f commented Jun 22, 2025

This PR implements safe database swap functionality to address the data loss issue described in #2103.

Testing Note

The recovery test uses syscall.Mkfifo() to create a file that passes initial SQLite header validation but fails during database opening. This simulates a swap failure scenario, though I think won't run on Windows platforms. (This is the only way I had in mind to pass the IsValidSQLiteFile guard)

Closes #2103

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @m04f, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reliability of database swap operations by implementing a robust, transactional-like mechanism. It directly addresses a potential data loss issue by ensuring that if a new database fails to be properly integrated, the system can automatically recover the original, functional database, thereby preventing data loss and maintaining system availability.

Highlights

  • Safe Database Swap Implementation: The core Swap method in SwappableDB has been re-engineered to prevent data loss during database replacement operations.
  • Transactional Swap with Rollback: The old database is now temporarily backed up to a .bak file before the new database is moved into its place. If the new database fails to open after being moved, the system attempts to automatically restore the original database from this backup, ensuring data integrity.
  • New Error Type for Data Loss: A specific ErrDbDataLost error has been introduced to clearly indicate when both the primary swap operation and the subsequent recovery attempt of the original database have failed, signaling an unrecoverable state.
  • Comprehensive Recovery Testing: A new test case, Test_SwapRecoveryOnFailure, has been added. This test specifically validates the new recovery mechanism by simulating a scenario where a new database file passes initial SQLite header validation but fails during the actual database opening, ensuring the rollback logic is robust.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a safer database swap mechanism to prevent data loss during the operation. The new implementation backs up the existing database before attempting to swap, and restores it if the swap fails. The changes are well-reasoned and include a thorough test case for the recovery path. My review includes suggestions to improve error handling by using fmt.Errorf with %w for better error inspection, and to make the new test more robust and portable.

Comment on lines 61 to 62
if err := os.Rename(s.db.Path(), oldDbPath); err != nil {
return fmt.Errorf("failed to rename old database: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It's a good practice to use %w format verb when wrapping errors with fmt.Errorf[^1]. This preserves the original error in the error chain, allowing callers to inspect it using errors.Is or errors.As.

return fmt.Errorf("failed to rename old database: %w", err)

Comment on lines 72 to 73
if err != nil {
return fmt.Errorf("open SQLite file failed: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using %w to wrap this error will improve error handling capabilities, as it preserves the underlying error for inspection[^1].

return fmt.Errorf("open SQLite file failed: %w", err)

return fmt.Errorf("open SQLite file failed: %s", err)
if err := replaceDb(path, fkConstraints, walEnabled); err != nil {
if recoveryErr := replaceDb(oldDbPath, oldFkConstraints, oldWalEnabled); recoveryErr != nil {
return fmt.Errorf("%w: failed to swap database (%v) and failed to recover original database (%v)", ErrDbDataLost, err, recoveryErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider using %w to wrap the errors here to preserve the error chain[^1].

return fmt.Errorf("%w: failed to swap database (%v) and failed to recover original database (%v)", ErrDbDataLost, err, recoveryErr)

Comment on lines +176 to +178
// Test_SwapRecoveryOnFailure tests that the Swap function recovers the original database
// when the swap operation fails after the old database has been moved.
func Test_SwapRecoveryOnFailure(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test uses syscall.Mkfifo, which is not available on Windows. As noted in the PR description, this will cause the test to fail on Windows platforms. To ensure portability of the test suite, it's best to skip this test on Windows.

You can do this by adding a check at the beginning of the test function:

import "runtime"

// ... in Test_SwapRecoveryOnFailure ...
if runtime.GOOS == "windows" {
		t.Skip("Skipping test on Windows because it uses syscall.Mkfifo")
}

Alternatively, you can use a build tag //go:build !windows at the top of the file.

Comment on lines +204 to +205
os.WriteFile(swapPath, sqliteHeader, 0666)
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error returned by os.WriteFile is not checked. If this write fails, the test might fail in a confusing way later on. It's better to handle this error.

Since this runs in a separate goroutine, you can't use t.Fatalf directly. A common pattern is to use a channel to send the error back to the main test goroutine, which can then fail the test.

For example:

errCh := make(chan error, 1)
go func() {
    errCh <- os.WriteFile(swapPath, sqliteHeader, 0666)
}()

// ... perform swap ...

// After swap, check for error from goroutine
if err := <-errCh; err != nil {
    t.Fatalf("writing to fifo failed: %v", err)
}

defer os.Remove(swapPath)

// Avoid blocking the thread
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this? Why a goroutine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing to a fifo blocks until the other end is opened for reading. Without the goroutine, the test would hang indefinitely at the write, since Swap() (which triggers the read attempt) hasn't been called yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Well, as per my other comment if we can improve this function such that if the passed in database doesn't open we just put the original DB back in, it might be OK to remove the SQLite check.

swapPath := mustTempPath()
defer os.Remove(swapPath)

err = syscall.Mkfifo(swapPath, 0666)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we modify Swap so that it always leaves the original database place if the swap fails then we can remove the SQLite validity check from the implementation. That was only there to try to protect us because Swap() was a one-way operation. But if we fix that we can remove the check.

@m04f
Copy link
Contributor Author

m04f commented Jun 22, 2025

Currently, the Swap method works by renaming the existing database files to a backup location, then renaming the new database into place, and finally attempting to open it. If opening the new database fails, it attempts to restore the original files. However, this backup-and-restore mechanism is itself fallible -- for example, the rename of WAL might fail after the database file has been renamed, potentially leading to data loss.

A potentially safer approach would be to open and validate the new database file before performing any file renames. This would guarantee that we only touch the original database files if we’re confident the new database is valid. However, this would mean that the active database path would change after the swap.

Would it be acceptable for the database path to change in this case? Based on a quick review of Swap's call sites, it doesn't seem like anything depends on the path staying the same, but I wanted to confirm.

@otoolep
Copy link
Member

otoolep commented Jun 23, 2025

No, you can't change the database path. It's a core assumption of rqlite that the path, once set, is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Swapping a database should not delete the files unless swapping worked
2 participants