Commit bae868c
Don't trust unvalidated xl_tot_len.
xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse.

Since 70b4f82, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail.

We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic.

The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged.

This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk.

With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work.

Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl.

Back-patch to 12. We decided not to put this change into the final release of 11.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code)
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
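To make the re-ordering concrete, here is a minimal illustrative sketch, not the committed code, of the cross-checks that make a boundary-spanning xl_tot_len trustworthy before any record-sized allocation. The function and parameter names are invented for this example; xlp_pageaddr, xlp_rem_len, xlp_info and XLP_FIRST_IS_CONTRECORD are the real page-header fields and flag from access/xlog_internal.h, and the committed patch performs the equivalent checks inline in xlogreader.c's page validation and in XLogDecodeNextRecord().

#include "postgres.h"
#include "access/xlog_internal.h"

/*
 * Illustrative sketch only: can the xl_tot_len value read from a record
 * whose header spans a page boundary be trusted?  "total_len" is the
 * xl_tot_len from the first fragment, "gotlen" is that fragment's length,
 * and "expected_pageaddr" is where the next page must start.
 */
static bool
continuation_validates_total_len(XLogPageHeader next_page,
								 XLogRecPtr expected_pageaddr,
								 uint32 total_len,
								 uint32 gotlen)
{
	/* The next page must really live at the address we expect ... */
	if (next_page->xlp_pageaddr != expected_pageaddr)
		return false;

	/* ... and must claim to continue a record from the previous page ... */
	if ((next_page->xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
		return false;

	/* ... with a remaining length that agrees with xl_tot_len. */
	if (next_page->xlp_rem_len == 0 ||
		total_len != next_page->xlp_rem_len + gotlen)
		return false;

	return true;
}

Only once checks of this kind have passed does the patched reader call allocate_recordbuf() for anything larger than the two pages' worth of buffer it always keeps on hand.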
1 parent 755eb44 commit bae868c

File tree

4 files changed: +569 / -56 lines changed

src/backend/access/transam/xlogreader.c

Lines changed: 67 additions & 56 deletions
@@ -192,6 +192,9 @@ XLogReaderFree(XLogReaderState *state)
  * XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
  * with. (That is enough for all "normal" records, but very large commit or
  * abort records might need more space.)
+ *
+ * Note: This routine should *never* be called for xl_tot_len until the header
+ * of the record has been fully validated.
  */
 static bool
 allocate_recordbuf(XLogReaderState *state, uint32 reclength)
@@ -201,25 +204,6 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
-#ifndef FRONTEND
-
-	/*
-	 * Note that in much unlucky circumstances, the random data read from a
-	 * recycled segment can cause this routine to be called with a size
-	 * causing a hard failure at allocation. For a standby, this would cause
-	 * the instance to stop suddenly with a hard failure, preventing it to
-	 * retry fetching WAL from one of its sources which could allow it to move
-	 * on with replay without a manual restart. If the data comes from a past
-	 * recycled segment and is still valid, then the allocation may succeed
-	 * but record checks are going to fail so this would be short-lived. If
-	 * the allocation fails because of a memory shortage, then this is not a
-	 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
-	 */
-	if (!AllocSizeIsValid(newSize))
-		return false;
-
-#endif
-
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
@@ -669,41 +653,26 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	}
 	else
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: expected at least %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
-			goto err;
-		}
+		/* We'll validate the header once we have the next page. */
 		gotheader = false;
 	}
 
 	/*
-	 * Find space to decode this record. Don't allow oversized allocation if
-	 * the caller requested nonblocking. Otherwise, we *have* to try to
-	 * decode the record now because the caller has nothing else to do, so
-	 * allow an oversized record to be palloc'd if that turns out to be
-	 * necessary.
+	 * Try to find space to decode this record, if we can do so without
+	 * calling palloc. If we can't, we'll try again below after we've
+	 * validated that total_len isn't garbage bytes from a recycled WAL page.
 	 */
 	decoded = XLogReadRecordAlloc(state,
 								  total_len,
-								  !nonblocking /* allow_oversized */ );
-	if (decoded == NULL)
+								  false /* allow_oversized */ );
+	if (decoded == NULL && nonblocking)
 	{
 		/*
-		 * There is no space in the decode buffer. The caller should help
-		 * with that problem by consuming some records.
+		 * There is no space in the circular decode buffer, and the caller is
+		 * only reading ahead. The caller should consume existing records to
+		 * make space.
 		 */
-		if (nonblocking)
-			return XLREAD_WOULDBLOCK;
-
-		/* We failed to allocate memory for an oversized record. */
-		report_invalid_record(state,
-							  "out of memory while trying to decode a record of length %u", total_len);
-		goto err;
+		return XLREAD_WOULDBLOCK;
 	}
 
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
@@ -718,16 +687,11 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	assembled = true;
 
 	/*
-	 * Enlarge readRecordBuf as needed.
+	 * We always have space for a couple of pages, enough to validate a
+	 * boundary-spanning record header.
 	 */
-	if (total_len > state->readRecordBufSize &&
-		!allocate_recordbuf(state, total_len))
-	{
-		/* We treat this as a "bogus data" condition */
-		report_invalid_record(state, "record length %u at %X/%X too long",
-							  total_len, LSN_FORMAT_ARGS(RecPtr));
-		goto err;
-	}
+	Assert(state->readRecordBufSize >= XLOG_BLCKSZ * 2);
+	Assert(state->readRecordBufSize >= len);
 
 	/* Copy the first fragment of the record from the first page. */
 	memcpy(state->readRecordBuf,
@@ -824,8 +788,35 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 				goto err;
 			gotheader = true;
 		}
-	} while (gotlen < total_len);
 
+		/*
+		 * We might need a bigger buffer. We have validated the record
+		 * header, in the case that it split over a page boundary. We've
+		 * also cross-checked total_len against xlp_rem_len on the second
+		 * page, and verified xlp_pageaddr on both.
+		 */
+		if (total_len > state->readRecordBufSize)
+		{
+			char		save_copy[XLOG_BLCKSZ * 2];
+
+			/*
+			 * Save and restore the data we already had. It can't be more
+			 * than two pages.
+			 */
+			Assert(gotlen <= lengthof(save_copy));
+			Assert(gotlen <= state->readRecordBufSize);
+			memcpy(save_copy, state->readRecordBuf, gotlen);
+			if (!allocate_recordbuf(state, total_len))
+			{
+				/* We treat this as a "bogus data" condition */
+				report_invalid_record(state, "record length %u at %X/%X too long",
+									  total_len, LSN_FORMAT_ARGS(RecPtr));
+				goto err;
+			}
+			memcpy(state->readRecordBuf, save_copy, gotlen);
+			buffer = state->readRecordBuf + gotlen;
+		}
+	} while (gotlen < total_len);
 	Assert(gotheader);
 
 	record = (XLogRecord *) state->readRecordBuf;
@@ -867,6 +858,28 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
 	}
 
+	/*
+	 * If we got here without a DecodedXLogRecord, it means we needed to
+	 * validate total_len before trusting it, but by now we've done that.
+	 */
+	if (decoded == NULL)
+	{
+		Assert(!nonblocking);
+		decoded = XLogReadRecordAlloc(state,
+									  total_len,
+									  true /* allow_oversized */ );
+		if (decoded == NULL)
+		{
+			/*
+			 * We failed to allocate memory for an oversized record. As
+			 * above, we currently treat this as a "bogus data" condition.
+			 */
+			report_invalid_record(state,
+								  "out of memory while trying to decode a record of length %u", total_len);
+			goto err;
+		}
+	}
+
 	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
 	{
 		/* Record the location of the next record. */
@@ -895,8 +908,6 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 		state->decode_queue_head = decoded;
 		return XLREAD_SUCCESS;
 	}
-	else
-		return XLREAD_FAIL;
 
 err:
 	if (assembled)

src/test/perl/PostgreSQL/Test/Utils.pm

Lines changed: 41 additions & 0 deletions
@@ -71,6 +71,7 @@ our @EXPORT = qw(
   chmod_recursive
   check_pg_config
   dir_symlink
+  scan_server_header
   system_or_bail
   system_log
   run_log
@@ -702,6 +703,46 @@ sub chmod_recursive
 
 =pod
 
+=item scan_server_header(header_path, regexp)
+
+Returns an array that stores all the matches of the given regular expression
+within the PostgreSQL installation's C<header_path>. This can be used to
+retrieve specific value patterns from the installation's header files.
+
+=cut
+
+sub scan_server_header
+{
+	my ($header_path, $regexp) = @_;
+
+	my ($stdout, $stderr);
+	my $result = IPC::Run::run [ 'pg_config', '--includedir-server' ], '>',
+	  \$stdout, '2>', \$stderr
+	  or die "could not execute pg_config";
+	chomp($stdout);
+	$stdout =~ s/\r$//;
+
+	open my $header_h, '<', "$stdout/$header_path" or die "$!";
+
+	my @match = undef;
+	while (<$header_h>)
+	{
+		my $line = $_;
+
+		if (@match = $line =~ /^$regexp/)
+		{
+			last;
+		}
+	}
+
+	close $header_h;
+	die "could not find match in header $header_path\n"
+	  unless @match;
+	return @match;
+}
+
+=pod
+
 =item check_pg_config(regexp)
 
 Return the number of matches of the given regular expression
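As a usage illustration for the new helper (not taken from the committed TAP test), a script could read a server constant out of an installed header like this; the header path, macro name and regexp below are assumptions chosen for the example.

use strict;
use warnings;
use PostgreSQL::Test::Utils;

# Hypothetical example: read XLOG_PAGE_MAGIC from the installed server headers,
# e.g. so a test can write plausible-looking WAL page headers.  Like the helper
# itself, this requires pg_config to be on the PATH.
my ($XLOG_PAGE_MAGIC) = scan_server_header(
	'access/xlog_internal.h',
	'#define\s+XLOG_PAGE_MAGIC\s+(\w+)');

print "XLOG_PAGE_MAGIC is $XLOG_PAGE_MAGIC\n";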

src/test/recovery/meson.build

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ tests += {
       't/036_truncated_dropped.pl',
       't/037_invalid_database.pl',
       't/038_save_logical_slots_shutdown.pl',
+      't/039_end_of_wal.pl',
     ],
   },
 }
