Commit 7587286

Improve coding around the fsync request queue.
In all branches back to 8.3, this patch fixes a questionable assumption in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are no uninitialized pad bytes in the request queue structs. This would only cause trouble if (a) there were such pad bytes, which could happen in 8.4 and up if the compiler makes enum ForkNumber narrower than 32 bits, but otherwise would require not-currently-planned changes in the widths of other typedefs; and (b) the kernel has not uniformly initialized the contents of shared memory to zeroes. Still, it seems a tad risky, and we can easily remove any risk by pre-zeroing the request array for ourselves. In addition to that, we need to establish a coding rule that struct RelFileNode can't contain any padding bytes, since such structs are copied into the request array verbatim. (There are other places that are assuming this anyway, it turns out.)

In 9.1 and up, the risk was a bit larger because we were also effectively assuming that struct RelFileNodeBackend contained no pad bytes, and with fields of different types in there, that would be much easier to break. However, there is no good reason to ever transmit fsync or delete requests for temp files to the bgwriter/checkpointer, so we can revert the request structs to plain RelFileNode, getting rid of the padding risk and saving some marginal number of bytes and cycles in fsync queue manipulation while we are at it. The savings might be more than marginal during deletion of a temp relation, because the old code transmitted an entirely useless but nonetheless expensive-to-process ForgetRelationFsync request to the background process, and also had the background process perform the file deletion even though that can safely be done immediately.

In addition, make some cleanup of nearby comments and small improvements to the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
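
To make the pad-byte hazard concrete, here is a minimal standalone sketch (illustrative only; the struct and field names are invented, not the commit's). When a struct is hashed or compared as raw bytes, as dynahash's tag_hash keying effectively does, two structs with identical field values can still differ in their padding:

    #include <stdio.h>
    #include <string.h>

    /*
     * Illustrative struct with a likely padding hole: if "forknum" is
     * narrower than 4 bytes (say, an enum the compiler shrinks to 1 byte),
     * the compiler typically inserts pad bytes before "segno".
     */
    typedef struct
    {
        unsigned int  rnode;     /* stand-in for RelFileNode */
        unsigned char forknum;   /* stand-in for a narrowed enum ForkNumber */
        unsigned int  segno;
    } Request;

    int
    main(void)
    {
        Request a, b;

        /* Fill with different garbage, then assign identical field values. */
        memset(&a, 0xAA, sizeof(a));
        memset(&b, 0x55, sizeof(b));
        a.rnode = b.rnode = 1234;
        a.forknum = b.forknum = 0;
        a.segno = b.segno = 7;

        /*
         * Field-wise the structs are equal, but a byte-wise comparison
         * (which is what hashing the raw struct amounts to) can disagree
         * because the pad bytes differ.  Pre-zeroing the structs, as the
         * patch does for the request array, removes the hazard.
         */
        printf("memcmp says %s\n",
               memcmp(&a, &b, sizeof(a)) == 0 ? "equal" : "NOT equal");
        return 0;
    }
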
1 parent f81d50d commit 7587286

File tree

3 files changed: +54 -25 lines changed


src/backend/postmaster/bgwriter.c

Lines changed: 41 additions & 22 deletions
@@ -885,17 +885,22 @@ BgWriterShmemSize(void)
 void
 BgWriterShmemInit(void)
 {
+    Size        size = BgWriterShmemSize();
     bool        found;
 
     BgWriterShmem = (BgWriterShmemStruct *)
         ShmemInitStruct("Background Writer Data",
-                        BgWriterShmemSize(),
+                        size,
                         &found);
 
     if (!found)
     {
-        /* First time through, so initialize */
-        MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
+        /*
+         * First time through, so initialize.  Note that we zero the whole
+         * requests array; this is so that CompactBgwriterRequestQueue
+         * can assume that any pad bytes in the request structs are zeroes.
+         */
+        MemSet(BgWriterShmem, 0, size);
         SpinLockInit(&BgWriterShmem->ckpt_lck);
         BgWriterShmem->max_requests = NBuffers;
     }
@@ -1111,25 +1116,27 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 
 /*
  * CompactBgwriterRequestQueue
- *        Remove duplicates from the request queue to avoid backend fsyncs.
+ *        Remove duplicates from the request queue to avoid backend fsyncs.
+ *        Returns "true" if any entries were removed.
  *
  * Although a full fsync request queue is not common, it can lead to severe
  * performance problems when it does happen.  So far, this situation has
  * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint. Without this
+ * and especially during the "sync" phase of a checkpoint.  Without this
  * logic, each backend begins doing an fsync for every block written, which
  * gets very expensive and can slow down the whole system.
  *
  * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries. But should be vanishingly rare in
+ * aren't any removable entries.  But that should be vanishingly rare in
  * practice: there's one queue entry per shared buffer.
  */
 static bool
-CompactBgwriterRequestQueue()
+CompactBgwriterRequestQueue(void)
 {
-    struct BgWriterSlotMapping {
-        BgWriterRequest request;
-        int slot;
+    struct BgWriterSlotMapping
+    {
+        BgWriterRequest request;
+        int         slot;
     };
 
     int         n,
@@ -1142,20 +1149,22 @@ CompactBgwriterRequestQueue()
     /* must hold BgWriterCommLock in exclusive mode */
     Assert(LWLockHeldByMe(BgWriterCommLock));
 
+    /* Initialize skip_slot array */
+    skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
     /* Initialize temporary hash table */
     MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(BgWriterRequest);
     ctl.entrysize = sizeof(struct BgWriterSlotMapping);
     ctl.hash = tag_hash;
+    ctl.hcxt = CurrentMemoryContext;
+
     htab = hash_create("CompactBgwriterRequestQueue",
                        BgWriterShmem->num_requests,
                        &ctl,
-                       HASH_ELEM | HASH_FUNCTION);
+                       HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
 
-    /* Initialize skip_slot array */
-    skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
-
     /*
      * The basic idea here is that a request can be skipped if it's followed
      * by a later, identical request.  It might seem more sensible to work
      * backwards from the end of the queue and check whether a request is
@@ -1164,23 +1173,32 @@ CompactBgwriterRequestQueue()
      * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
      * we do it this way.  It would be possible to be even smarter if we made
     * the code below understand the specific semantics of such requests (it
-     * could blow away preceding entries that would end up being cancelled
+     * could blow away preceding entries that would end up being canceled
      * anyhow), but it's not clear that the extra complexity would buy us
     * anything.
     */
-    for (n = 0; n < BgWriterShmem->num_requests; ++n)
+    for (n = 0; n < BgWriterShmem->num_requests; n++)
     {
         BgWriterRequest *request;
         struct BgWriterSlotMapping *slotmap;
-        bool found;
+        bool        found;
 
+        /*
+         * We use the request struct directly as a hashtable key.  This
+         * assumes that any padding bytes in the structs are consistently the
+         * same, which should be okay because we zeroed them in
+         * BgWriterShmemInit.  Note also that RelFileNode had better
+         * contain no pad bytes.
+         */
         request = &BgWriterShmem->requests[n];
         slotmap = hash_search(htab, request, HASH_ENTER, &found);
         if (found)
         {
+            /* Duplicate, so mark the previous occurrence as skippable */
             skip_slot[slotmap->slot] = true;
-            ++num_skipped;
+            num_skipped++;
         }
+        /* Remember slot containing latest occurrence of this request value */
         slotmap->slot = n;
     }
 
@@ -1195,15 +1213,16 @@ CompactBgwriterRequestQueue()
     }
 
     /* We found some duplicates; remove them. */
-    for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+    preserve_count = 0;
+    for (n = 0; n < BgWriterShmem->num_requests; n++)
     {
         if (skip_slot[n])
             continue;
         BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
     }
     ereport(DEBUG1,
-            (errmsg("compacted fsync request queue from %d entries to %d entries",
-                    BgWriterShmem->num_requests, preserve_count)));
+       (errmsg("compacted fsync request queue from %d entries to %d entries",
+               BgWriterShmem->num_requests, preserve_count)));
     BgWriterShmem->num_requests = preserve_count;
 
     /* Cleanup. */
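
For context on how this function is reached, compaction is attempted from ForwardFsyncRequest when the queue fills, and if nothing could be removed the backend performs the fsync itself. The following is a paraphrased sketch of that call site, not a hunk from this diff, and details vary across branches:

    /* Paraphrased sketch of the ForwardFsyncRequest() call site. */
    LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
    if (BgWriterShmem->bgwriter_pid == 0 ||
        (BgWriterShmem->num_requests >= BgWriterShmem->max_requests &&
         !CompactBgwriterRequestQueue()))
    {
        /* Queue full and nothing removable: caller must fsync for itself. */
        LWLockRelease(BgWriterCommLock);
        return false;
    }
    request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
    request->rnode = rnode;
    request->forknum = forknum;
    request->segno = segno;
    LWLockRelease(BgWriterCommLock);
    return true;
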

src/backend/storage/smgr/md.c

Lines changed: 9 additions & 3 deletions
@@ -126,7 +126,7 @@ static MemoryContext MdCxt; /* context for all md.c allocations */
 typedef struct
 {
     RelFileNode rnode;          /* the targeted relation */
-    ForkNumber  forknum;
+    ForkNumber  forknum;        /* which fork */
     BlockNumber segno;          /* which segment */
 } PendingOperationTag;
 
@@ -1212,7 +1212,7 @@ mdpostckpt(void)
  * If there is a local pending-ops table, just make an entry in it for
  * mdsync to process later.  Otherwise, try to pass off the fsync request
  * to the background writer process.  If that fails, just do the fsync
- * locally before returning (we expect this will not happen often enough
+ * locally before returning (we hope this will not happen often enough
  * to be a performance problem).
 */
 static void
@@ -1239,6 +1239,9 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 /*
  * register_unlink() -- Schedule a file to be deleted after next checkpoint
  *
+ * We don't bother passing in the fork number, because this is only used
+ * with main forks.
+ *
  * As with register_dirty_segment, this could involve either a local or
  * a remote pending-ops table.
 */
@@ -1349,6 +1352,9 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
         MemoryContext oldcxt = MemoryContextSwitchTo(MdCxt);
         PendingUnlinkEntry *entry;
 
+        /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+        Assert(forknum == MAIN_FORKNUM);
+
         entry = palloc(sizeof(PendingUnlinkEntry));
         entry->rnode = rnode;
         entry->cycle_ctr = mdckpt_cycle_ctr;
@@ -1398,7 +1404,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 }
 
 /*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a rel
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
 */
 void
 ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
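
The same no-pad-bytes discipline matters wherever PendingOperationTag is used as a dynahash key in md.c. A defensive pattern consistent with the commit's coding rule is to zero the whole key before filling its fields; this is an illustrative sketch, not a hunk from this diff:

    PendingOperationTag key;
    bool        found;

    /*
     * Zero the entire tag so any compiler-inserted pad bytes are
     * deterministic before the struct is hashed and compared byte-wise
     * by the hash table.
     */
    MemSet(&key, 0, sizeof(key));
    key.rnode = rnode;
    key.forknum = forknum;
    key.segno = segno;

    (void) hash_search(pendingOpsTable, &key, HASH_ENTER, &found);
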

src/include/storage/relfilenode.h

Lines changed: 4 additions & 0 deletions
@@ -65,6 +65,10 @@ typedef enum ForkNumber
  * Note: in pg_class, relfilenode can be zero to denote that the relation
  * is a "mapped" relation, whose current true filenode number is available
  * from relmapper.c.  Again, this case is NOT allowed in RelFileNodes.
+ *
+ * Note: various places use RelFileNode in hashtable keys.  Therefore,
+ * there *must not* be any unused padding bytes in this struct.  That
+ * should be safe as long as all the fields are of type Oid.
  */
 typedef struct RelFileNode
 {
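
If one wanted to enforce the new rule mechanically rather than by convention (not something this commit does), a C89-compatible compile-time check could assert that RelFileNode is exactly the size of its three Oid fields:

    /*
     * Illustrative compile-time guard, not part of this commit: the array
     * size becomes negative (a compile error) if padding ever creeps into
     * the struct, i.e. if sizeof(RelFileNode) != 3 * sizeof(Oid).
     */
    typedef char relfilenode_has_no_padding
                [sizeof(RelFileNode) == 3 * sizeof(Oid) ? 1 : -1];
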
