Skip to content

Commit 85d8b30

Browse files
committed
Apply a better fix to mdunlinkfork().
Replace the stopgap fix I made in 0e758ae with a cleaner one. The real problem with 4ab5dae is that it contorted this function's logic substantially, by introducing a third code path that required different behavior in the function's main loop. That seems quite unnecessary on closer inspection: the new IsBinaryUpgrade case can just share the behavior of the other immediate-unlink cases. Hence, revert 4ab5dae and most of 0e758ae (keeping the latter's save/restore errno fix), and add IsBinaryUpgrade to the set of conditions tested to choose immediate unlink. Also fix some additional places with sloppy handling of errno, to ensure we have an invariant that we always continue processing after any non-ENOENT failure of do_truncate. I doubt that that's fixing any bug of field importance, so I don't feel it necessary to back-patch; but we might as well get it right while we're here. Also improve the comments, which had drifted a bit from what the code actually does, and neglected to mention some important considerations. Back-patch to v15, not because this is fixing any bug but because it doesn't seem like a good idea for v15's mdunlinkfork logic to be significantly different from both v14 and v16. Discussion: https://postgr.es/m/3797575.1667924888@sss.pgh.pa.us
1 parent ff0d8f2 commit 85d8b30

File tree

1 file changed

+45
-37
lines changed
  • src/backend/storage/smgr

1 file changed

+45
-37
lines changed

src/backend/storage/smgr/md.c

+45-37
Original file line numberDiff line numberDiff line change
@@ -257,11 +257,23 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
257257
* next checkpoint, we prevent reassignment of the relfilenumber until it's
258258
* safe, because relfilenumber assignment skips over any existing file.
259259
*
260+
* Additional segments, if any, are truncated and then unlinked. The reason
261+
* for truncating is that other backends may still hold open FDs for these at
262+
* the smgr level, so that the kernel can't remove the file yet. We want to
263+
* reclaim the disk space right away despite that.
264+
*
260265
* We do not need to go through this dance for temp relations, though, because
261266
* we never make WAL entries for temp rels, and so a temp rel poses no threat
262267
* to the health of a regular rel that has taken over its relfilenumber.
263268
* The fact that temp rels and regular rels have different file naming
264-
* patterns provides additional safety.
269+
* patterns provides additional safety. Other backends shouldn't have open
270+
* FDs for them, either.
271+
*
272+
* We also don't do it while performing a binary upgrade. There is no reuse
273+
* hazard in that case, since after a crash or even a simple ERROR, the
274+
* upgrade fails and the whole cluster must be recreated from scratch.
275+
* Furthermore, it is important to remove the files from disk immediately,
276+
* because we may be about to reuse the same relfilenumber.
265277
*
266278
* All the above applies only to the relation's main fork; other forks can
267279
* just be removed immediately, since they are not needed to prevent the
@@ -274,6 +286,9 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
274286
* for later, since during redo there's no possibility of creating a
275287
* conflicting relation.
276288
*
289+
* Note: we currently just never warn about ENOENT at all. We could warn in
290+
* the main-fork, non-isRedo case, but it doesn't seem worth the trouble.
291+
*
277292
* Note: any failure should be reported as WARNING not ERROR, because
278293
* we are usually not in a transaction anymore when this is called.
279294
*/
@@ -319,19 +334,19 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
319334
{
320335
char *path;
321336
int ret;
322-
BlockNumber segno = 0;
337+
int save_errno;
323338

324339
path = relpath(rlocator, forknum);
325340

326341
/*
327-
* Delete or truncate the first segment.
342+
* Truncate and then unlink the first segment, or just register a request
343+
* to unlink it later, as described in the comments for mdunlink().
328344
*/
329-
if (isRedo || forknum != MAIN_FORKNUM || RelFileLocatorBackendIsTemp(rlocator))
345+
if (isRedo || IsBinaryUpgrade || forknum != MAIN_FORKNUM ||
346+
RelFileLocatorBackendIsTemp(rlocator))
330347
{
331348
if (!RelFileLocatorBackendIsTemp(rlocator))
332349
{
333-
int save_errno;
334-
335350
/* Prevent other backends' fds from holding on to the disk space */
336351
ret = do_truncate(path);
337352

@@ -344,63 +359,56 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
344359
ret = 0;
345360

346361
/* Next unlink the file, unless it was already found to be missing */
347-
if (ret == 0 || errno != ENOENT)
362+
if (ret >= 0 || errno != ENOENT)
348363
{
349364
ret = unlink(path);
350365
if (ret < 0 && errno != ENOENT)
366+
{
367+
save_errno = errno;
351368
ereport(WARNING,
352369
(errcode_for_file_access(),
353370
errmsg("could not remove file \"%s\": %m", path)));
354-
segno++;
371+
errno = save_errno;
372+
}
355373
}
356374
}
357375
else
358376
{
359377
/* Prevent other backends' fds from holding on to the disk space */
360378
ret = do_truncate(path);
361379

362-
/*
363-
* Except during a binary upgrade, register request to unlink first
364-
* segment later, rather than now.
365-
*
366-
* If we're performing a binary upgrade, the dangers described in the
367-
* header comments for mdunlink() do not exist, since after a crash or
368-
* even a simple ERROR, the upgrade fails and the whole new cluster
369-
* must be recreated from scratch. And, on the other hand, it is
370-
* important to remove the files from disk immediately, because we may
371-
* be about to reuse the same relfilenumber.
372-
*/
373-
if (!IsBinaryUpgrade)
374-
{
375-
register_unlink_segment(rlocator, forknum, 0 /* first seg */ );
376-
segno++;
377-
}
380+
/* Register request to unlink first segment later */
381+
save_errno = errno;
382+
register_unlink_segment(rlocator, forknum, 0 /* first seg */ );
383+
errno = save_errno;
378384
}
379385

380386
/*
381-
* Delete any remaining segments (we might or might not have dealt with
382-
* the first one above).
387+
* Delete any additional segments.
388+
*
389+
* Note that because we loop until getting ENOENT, we will correctly
390+
* remove all inactive segments as well as active ones. Ideally we'd
391+
* continue the loop until getting exactly that errno, but that risks an
392+
* infinite loop if the problem is directory-wide (for instance, if we
393+
* suddenly can't read the data directory itself). We compromise by
394+
* continuing after a non-ENOENT truncate error, but stopping after any
395+
* unlink error. If there is indeed a directory-wide problem, additional
396+
* unlink attempts wouldn't work anyway.
383397
*/
384-
if (ret >= 0)
398+
if (ret >= 0 || errno != ENOENT)
385399
{
386400
char *segpath = (char *) palloc(strlen(path) + 12);
401+
BlockNumber segno;
387402

388-
/*
389-
* Note that because we loop until getting ENOENT, we will correctly
390-
* remove all inactive segments as well as active ones.
391-
*/
392-
for (;; segno++)
403+
for (segno = 1;; segno++)
393404
{
394-
if (segno == 0)
395-
strcpy(segpath, path);
396-
else
397-
sprintf(segpath, "%s.%u", path, segno);
405+
sprintf(segpath, "%s.%u", path, segno);
398406

399407
if (!RelFileLocatorBackendIsTemp(rlocator))
400408
{
401409
/*
402410
* Prevent other backends' fds from holding on to the disk
403-
* space.
411+
* space. We're done if we see ENOENT, though.
404412
*/
405413
if (do_truncate(segpath) < 0 && errno == ENOENT)
406414
break;

0 commit comments

Comments
 (0)