PG-1866 Reset wal cache on shmem init #542
Conversation
Force-pushed from 78215a0 to 2f3f7dc
Codecov Report
❌ Patch coverage is

@@          Coverage Diff           @@
##  TDE_REL_17_STABLE    #542  +/- ##
======================================
  Coverage         ?   82.46%
======================================
  Files            ?       25
  Lines            ?     3228
  Branches         ?      511
======================================
  Hits             ?     2662
  Misses           ?      455
  Partials         ?      111
Force-pushed from 2f3f7dc to 46daae4
Force-pushed from 46daae4 to 767a59e
It's interesting that some memory survives restarts. What is the reason to restart then?
I'm not sure this is the best place for the fix - see #530, and note how the principal key cache initialization works similarly. I assume we also rerun _PG_init during a restart; maybe we should detect the situation there instead? And while it doesn't cause bugs now, shouldn't we also clear the principal key cache?
The principal key cache lives in shared memory and not in the backend, so that one is fine.
I believe we don't rerun _PG_init when the postmaster does a "soft" restart after a backend crash, since the shared preload libraries are already loaded in the postmaster.
Again, the principal key cache lives in shared memory, which is cleared by the postmaster and then reinitialized before any other backends are launched (except on Windows, I believe). I do, however, think we need to look through all of our assumptions about what's done at startup with the lens of how the postmaster does a soft restart; I couldn't find anything glaringly obvious, though. This is not at all related to #530, as that one is about shared memory. The issue here is global memory in each backend, which they inherit from the postmaster as they fork.
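(For context, a minimal sketch of the distinction being discussed: shared memory is re-created by the postmaster on a crash restart, while ordinary process-global variables are inherited by new backends through fork() and keep whatever the postmaster last had in them. The names below, `my_ext_wal_key_cache` and `my_ext_shmem_init`, are hypothetical and not taken from pg_tde.)

```c
/*
 * Illustrative sketch only; names are hypothetical, not from pg_tde.
 * Shared memory starts fresh after the postmaster's crash restart, but
 * process-global variables like the cache below do not.
 */
#include "postgres.h"
#include "nodes/pg_list.h"
#include "storage/shmem.h"

static List *my_ext_wal_key_cache = NIL;	/* process-local, NOT in shmem */

static void
my_ext_shmem_init(void)
{
	bool		found;

	/* Shared state: re-created by the postmaster, so it starts clean. */
	(void) ShmemInitStruct("my_ext_state", 128, &found);

	/*
	 * Process-local state: not re-created, so clear it here, in the same
	 * path the postmaster runs while reinitializing, before any new
	 * backends are forked.
	 */
	list_free_deep(my_ext_wal_key_cache);
	my_ext_wal_key_cache = NIL;
}
```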
I see, I forgot that the wal key cache isn't also in shmem because this call is in
We already do the initialization of the backend-global wal key state in the shmem routines. This cleanup is for the future, I believe. Because the whole thing is quite janky.
> Because the whole thing is quite janky.

That's why it would be nice to keep this also in InitWrite, where we initialize the cache, so as not to increase the chaos even more. But I agree that this is mostly just organization and we have to refactor this, so we can merge it either way.
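(A rough sketch of the alternative placement suggested here, i.e. resetting the cache in the same routine that initializes it so ownership stays in one place. The names are hypothetical, not the actual pg_tde functions.)

```c
/* Hypothetical sketch; not the actual pg_tde code. */
#include "postgres.h"
#include "nodes/pg_list.h"

static List *my_ext_wal_key_cache = NIL;

static void
my_ext_xlog_init_write(void)
{
	/*
	 * Drop any cache inherited from the postmaster before (re)building it,
	 * so the reset lives next to the code that owns the cache rather than
	 * in the generic shmem-init path.
	 */
	list_free_deep(my_ext_wal_key_cache);
	my_ext_wal_key_cache = NIL;

	/* ... cache (re)population would follow here ... */
}
```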
@@ -160,6 +160,8 @@ TDEXLogShmemInit(void)

    Assert(LWLockHeldByMeInMode(AddinShmemInitLock, LW_EXCLUSIVE));

    pg_tde_free_wal_key_cache();
This feels very scary. Is this really the only thing we need to reinitialize?
The one I'm unsure about is the ddlEventStack, but we're focusing on WAL right now so I didn't look into it. The rest looked fine to me: they're either initialized to values that don't change, or are initialized in the shmem_startup_hook already. I may have missed something, though.
It seems like there are cases when the postmaster has "restarted" after a backend crash where the wal cache inherited from the postmaster is wrong. I'm not at all sure exactly how and why this happens, but this patch fixes a bug with this and allows recovery/013_crash_restart to pass with WAL encryption enabled.
I moved it there and added a comment. I agree that it probably fits better there.
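(The hunk above only shows the call site. As a point of reference, a reset routine of this kind typically just walks a process-local structure and releases it; the sketch below assumes a simple singly linked list, which may well differ from the real pg_tde_free_wal_key_cache.)

```c
/*
 * Hypothetical sketch of a WAL key cache reset; the real pg_tde layout
 * and pg_tde_free_wal_key_cache() may differ.
 */
#include "postgres.h"

typedef struct WalKeyCacheEntry
{
	struct WalKeyCacheEntry *next;
	/* decrypted key material would live here */
} WalKeyCacheEntry;

static WalKeyCacheEntry *wal_key_cache_head = NULL;

static void
wal_key_cache_reset(void)
{
	WalKeyCacheEntry *entry = wal_key_cache_head;

	/*
	 * Free every entry and leave the cache empty, so a postmaster that is
	 * reinitializing after a crash hands a clean slate to new backends.
	 */
	while (entry != NULL)
	{
		WalKeyCacheEntry *next = entry->next;

		pfree(entry);
		entry = next;
	}
	wal_key_cache_head = NULL;
}
```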
Force-pushed from 767a59e to a3d2a24
It seems like there are cases when the postmaster has "restarted" after a backend crash where the wal cache inherited from the postmaster is wrong. I'm not at all sure exactly how and why this happens, but this patch fixes a bug with this and allows recovery/013_crash_restart to pass with WAL encryption enabled.
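(For reproducing this, assuming an in-tree PostgreSQL build with TAP tests enabled and pg_tde loaded with WAL encryption turned on, the failing test can usually be run in isolation with something like `make -C src/test/recovery check PROVE_TESTS='t/013_crash_restart.pl'`; pg_tde's own test harness may wrap this differently.)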