PG-1866 Reset wal cache on shmem init #542


Conversation

@AndersAstrand (Collaborator) commented Aug 18, 2025

It seems there are cases where, after the postmaster has "restarted"
following a backend crash, the WAL key cache inherited from the
postmaster is wrong.

I'm not entirely sure how and why this happens, but this patch fixes
the bug and allows recovery/013_crash_restart to pass with WAL
encryption enabled.
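
For orientation, a minimal sketch of the idea, assuming the shmem-init placement shown in the review diff further down (the actual patch may differ in detail):

#include "postgres.h"
#include "storage/lwlock.h"

/* Declared in pg_tde's WAL key code; the exact header is assumed here. */
extern void pg_tde_free_wal_key_cache(void);

static void
TDEXLogShmemInit(void)
{
	Assert(LWLockHeldByMeInMode(AddinShmemInitLock, LW_EXCLUSIVE));

	/*
	 * Drop whatever WAL key cache this process inherited from the
	 * postmaster via fork(), so a crash-restarted cluster does not keep
	 * using a stale cache.
	 */
	pg_tde_free_wal_key_cache();

	/* ... the rest of the shared-memory setup continues as before ... */
}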

@AndersAstrand force-pushed the tde/reset-wal-key-cache-on-shmem-init branch from 78215a0 to 2f3f7dc on August 18, 2025 16:28
@codecov-commenter commented Aug 18, 2025

Codecov Report

❌ Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (TDE_REL_17_STABLE@621c3f8).

Additional details and impacted files
@@                 Coverage Diff                  @@
##             TDE_REL_17_STABLE     #542   +/-   ##
====================================================
  Coverage                     ?   82.46%           
====================================================
  Files                        ?       25           
  Lines                        ?     3228           
  Branches                     ?      511           
====================================================
  Hits                         ?     2662           
  Misses                       ?      455           
  Partials                     ?      111           
Components Coverage Δ
access 84.62% <70.00%> (?)
catalog 87.65% <ø> (?)
common 77.77% <ø> (?)
encryption 72.97% <ø> (?)
keyring 73.21% <ø> (?)
src 94.18% <ø> (?)
smgr 96.53% <ø> (?)
transam ∅ <ø> (?)

@AndersAstrand force-pushed the tde/reset-wal-key-cache-on-shmem-init branch from 2f3f7dc to 46daae4 on August 18, 2025 17:32
@AndersAstrand marked this pull request as ready for review on August 18, 2025 17:33
@AndersAstrand force-pushed the tde/reset-wal-key-cache-on-shmem-init branch from 46daae4 to 767a59e on August 18, 2025 18:08
@AndersAstrand changed the title from "Reset wal cache on shmem init" to "PG-1866 Reset wal cache on shmem init" on Aug 18, 2025
@artemgavrilov (Collaborator) left a comment

It's interesting that some memory survives restarts. What is the reason to restart, then?

@dutow (Collaborator) left a comment

I'm not sure if this is the best place for this fix - see #530, and also how the principal key cache initialization works similarly to that.

I assume that we also rerun _PG_init during a restart; maybe we should detect the situation there instead?

And while it doesn't cause bugs now, shouldn't we also clear the principal key cache?

@AndersAstrand (Collaborator, Author) commented Aug 18, 2025

> I'm not sure if this is the best place for this fix - see #530, and also how the principal key cache initialization works similarly to that.

The principal key cache lives in shared memory and not in the backend, so that one is fine.

> I assume that we also rerun _PG_init during a restart; maybe we should detect the situation there instead?

I believe we don't do that when the postmaster does a "soft" restart after a backend crash, since the shared preload libraries are already loaded in the postmaster.

> And while it doesn't cause bugs now, shouldn't we also clear the principal key cache?

Again, the principal key cache lives in shared memory, which is cleared by the postmaster and then reinitialized before any other backends are launched (except on Windows), I believe.

I do, however, think we need to go through all of our assumptions about what happens at startup with the lens of how the postmaster does a soft restart. I couldn't find anything glaringly obvious, though.

This is not at all related to #530, as that one is about shared memory. The issue here is global memory in each backend, which they inherit as they fork.
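
To make that distinction concrete, here is an illustrative sketch; the types and names below are hypothetical, not pg_tde's actual symbols:

#include "postgres.h"
#include "storage/shmem.h"

/* Hypothetical types, for illustration only. */
typedef struct WalKeyCacheEntry WalKeyCacheEntry;
typedef struct PrincipalKeyState { int nkeys; } PrincipalKeyState;

/*
 * Backend-local ("global" only in the C sense): each child receives a copy
 * of this pointer at fork() time, so after a crash restart it can still
 * point at stale data unless every process resets it explicitly.
 */
static WalKeyCacheEntry *wal_key_cache = NULL;

/*
 * Shared memory: recreated by the postmaster during a crash restart before
 * new backends are launched, so no per-backend reset is needed.
 */
static PrincipalKeyState *principal_key_state = NULL;

static void
example_shmem_startup(void)
{
	bool		found;

	/* In real code this runs under AddinShmemInitLock. */
	principal_key_state = ShmemInitStruct("example principal key state",
										  sizeof(PrincipalKeyState),
										  &found);
	if (!found)
		principal_key_state->nkeys = 0;

	/* The backend-local cache has to be cleared by hand. */
	wal_key_cache = NULL;
}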

@dutow (Collaborator) commented Aug 18, 2025

> The principal key cache lives in shared memory and not in the backend, so that one is fine.

I see - since this call is in ShmemInit in the PR, I forgot that the WAL key cache isn't itself in shmem. Maybe put it into SmgrInitWrite instead? That also gets called from the shmem hook, but at least it isn't shmem-related in name.

@AndersAstrand (Collaborator, Author) commented Aug 18, 2025

> The principal key cache lives in shared memory and not in the backend, so that one is fine.

> I see - since this call is in ShmemInit in the PR, I forgot that the WAL key cache isn't itself in shmem. Maybe put it into SmgrInitWrite instead? That also gets called from the shmem hook, but at least it isn't shmem-related in name.

We already do initialization of backend-global WAL key stuff in the shmem routines. That cleanup is for the future, I believe, because the whole thing is quite janky.

@dutow (Collaborator) left a comment

> Because the whole thing is quite janky.

That's why it would be nice to also keep this in InitWrite, where we initialize the cache, so as not to increase the chaos even more.

But I agree that this is mostly just organization and we have to refactor this, so we can merge it anyway.

@@ -160,6 +160,8 @@ TDEXLogShmemInit(void)

Assert(LWLockHeldByMeInMode(AddinShmemInitLock, LW_EXCLUSIVE));

pg_tde_free_wal_key_cache();
A collaborator commented:

This feels very scary. Is this really the only thing we need to reinitialize?

@AndersAstrand (Collaborator, Author) commented Aug 19, 2025

The one I'm unsure about is the ddlEventStack, but we're focusing on WAL right now so I didn't look into it. The rest looked fine to me.

They're either initialized to values that don't change, or are already initialized in the shmem_startup_hook. I may have missed something, though.

@AndersAstrand (Collaborator, Author) commented

> Because the whole thing is quite janky.

> That's why it would be nice to also keep this in InitWrite, where we initialize the cache, so as not to increase the chaos even more.

I moved it there and added a comment. I agree that it probably fits better there.
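
For reference, a rough sketch of the InitWrite-side placement discussed above; the function name here is a placeholder, not the patch's actual symbol:

/* Assumed to be declared in pg_tde's WAL key code. */
extern void pg_tde_free_wal_key_cache(void);

/* Placeholder for the WAL smgr init-write routine mentioned above. */
static void
example_xlog_smgr_init_write(void)
{
	/*
	 * Clear any WAL key cache inherited from the postmaster before the
	 * cache is (re)initialized, so a crash restart cannot leave a backend
	 * working from stale keys.
	 */
	pg_tde_free_wal_key_cache();

	/* ... cache initialization follows ... */
}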

@AndersAstrand force-pushed the tde/reset-wal-key-cache-on-shmem-init branch from 767a59e to a3d2a24 on August 19, 2025 06:31
@AndersAstrand merged commit aed49c0 into percona:TDE_REL_17_STABLE on Aug 19, 2025
19 checks passed
@AndersAstrand deleted the tde/reset-wal-key-cache-on-shmem-init branch on August 19, 2025 08:30