Skip to content

Commit dfc7977

Browse files
committed
Fix two issues in TOAST decompression.
pglz_maximum_compressed_size() potentially underestimated the amount of compressed data required to produce N bytes of decompressed data; this is a fault in commit 11a078c. Separately from that, pglz_decompress() failed to protect itself against corrupt compressed data, particularly off == 0 in a match tag. Commit c60e520 turned such a situation into an infinite loop, where before it'd just have resulted in garbage output. The combination of these two bugs seems like it may explain bug #16694 from Tom Vijlbrief, though it's impossible to be quite sure without direct inspection of the failing session. (One needs to assume that the pglz_maximum_compressed_size() bug caused us to fail to fetch the second byte of a match tag, and what happened to be there instead was a zero. The reported infinite loop is hard to explain without off == 0, though.) Aside from fixing the bugs, rewrite associated comments for more clarity. Back-patch to v13 where both these commits landed. Discussion: https://postgr.es/m/16694-f107871e499ec114@postgresql.org
1 parent 7f42350 commit dfc7977

File tree

1 file changed

+66
-35
lines changed

1 file changed

+66
-35
lines changed

src/common/pg_lzcompress.c

+66-35
Original file line numberDiff line numberDiff line change
@@ -710,7 +710,6 @@ pglz_decompress(const char *source, int32 slen, char *dest,
710710

711711
for (ctrlc = 0; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
712712
{
713-
714713
if (ctrl & 1)
715714
{
716715
/*
@@ -732,41 +731,62 @@ pglz_decompress(const char *source, int32 slen, char *dest,
732731
len += *sp++;
733732

734733
/*
735-
* Now we copy the bytes specified by the tag from OUTPUT to
736-
* OUTPUT (copy len bytes from dp - off to dp). The copied
737-
* areas could overlap, to preven possible uncertainty, we
738-
* copy only non-overlapping regions.
734+
* Check for corrupt data: if we fell off the end of the
735+
* source, or if we obtained off = 0, we have problems. (We
736+
* must check this, else we risk an infinite loop below in the
737+
* face of corrupt data.)
738+
*/
739+
if (sp > srcend || off == 0)
740+
break;
741+
742+
/*
743+
* Don't emit more data than requested.
739744
*/
740745
len = Min(len, destend - dp);
746+
747+
/*
748+
* Now we copy the bytes specified by the tag from OUTPUT to
749+
* OUTPUT (copy len bytes from dp - off to dp). The copied
750+
* areas could overlap, so to avoid undefined behavior in
751+
* memcpy(), be careful to copy only non-overlapping regions.
752+
*
753+
* Note that we cannot use memmove() instead, since while its
754+
* behavior is well-defined, it's also not what we want.
755+
*/
741756
while (off < len)
742757
{
743-
/*---------
744-
* When offset is smaller than length - source and
745-
* destination regions overlap. memmove() is resolving
746-
* this overlap in an incompatible way with pglz. Thus we
747-
* resort to memcpy()-ing non-overlapping regions.
748-
*
749-
* Consider input: 112341234123412341234
750-
* At byte 5 here ^ we have match with length 16 and
751-
* offset 4. 11234M(len=16, off=4)
752-
* We are decoding first period of match and rewrite match
753-
* 112341234M(len=12, off=8)
754-
*
755-
* The same match is now at position 9, it points to the
756-
* same start byte of output, but from another position:
757-
* the offset is doubled.
758-
*
759-
* We iterate through this offset growth until we can
760-
* proceed to usual memcpy(). If we would try to decode
761-
* the match at byte 5 (len=16, off=4) by memmove() we
762-
* would issue memmove(5, 1, 16) which would produce
763-
* 112341234XXXXXXXXXXXX, where series of X is 12
764-
* undefined bytes, that were at bytes [5:17].
765-
*---------
758+
/*
759+
* We can safely copy "off" bytes since that clearly
760+
* results in non-overlapping source and destination.
766761
*/
767762
memcpy(dp, dp - off, off);
768763
len -= off;
769764
dp += off;
765+
766+
/*----------
767+
* This bit is less obvious: we can double "off" after
768+
* each such step. Consider this raw input:
769+
* 112341234123412341234
770+
* This will be encoded as 5 literal bytes "11234" and
771+
* then a match tag with length 16 and offset 4. After
772+
* memcpy'ing the first 4 bytes, we will have emitted
773+
* 112341234
774+
* so we can double "off" to 8, then after the next step
775+
* we have emitted
776+
* 11234123412341234
777+
* Then we can double "off" again, after which it is more
778+
* than the remaining "len" so we fall out of this loop
779+
* and finish with a non-overlapping copy of the
780+
* remainder. In general, a match tag with off < len
781+
* implies that the decoded data has a repeat length of
782+
* "off". We can handle 1, 2, 4, etc repetitions of the
783+
* repeated string per memcpy until we get to a situation
784+
* where the final copy step is non-overlapping.
785+
*
786+
* (Another way to understand this is that we are keeping
787+
* the copy source point dp - off the same throughout.)
788+
*----------
789+
*/
770790
off += off;
771791
}
772792
memcpy(dp, dp - off, len);
@@ -820,21 +840,32 @@ pglz_decompress(const char *source, int32 slen, char *dest,
820840
int32
821841
pglz_maximum_compressed_size(int32 rawsize, int32 total_compressed_size)
822842
{
823-
int32 compressed_size;
843+
int64 compressed_size;
824844

825845
/*
826-
* pglz uses one control bit per byte, so we need (rawsize * 9) bits. We
827-
* care about bytes though, so we add 7 to make sure we include the last
828-
* incomplete byte (integer division rounds down).
846+
* pglz uses one control bit per byte, so if the entire desired prefix is
847+
* represented as literal bytes, we'll need (rawsize * 9) bits. We care
848+
* about bytes though, so be sure to round up not down.
829849
*
830-
* XXX Use int64 to prevent overflow during calculation.
850+
* Use int64 here to prevent overflow during calculation.
851+
*/
852+
compressed_size = ((int64) rawsize * 9 + 7) / 8;
853+
854+
/*
855+
* The above fails to account for a corner case: we could have compressed
856+
* data that starts with N-1 or N-2 literal bytes and then has a match tag
857+
* of 2 or 3 bytes. It's therefore possible that we need to fetch 1 or 2
858+
* more bytes in order to have the whole match tag. (Match tags earlier
859+
* in the compressed data don't cause a problem, since they should
860+
* represent more decompressed bytes than they occupy themselves.)
831861
*/
832-
compressed_size = (int32) ((int64) rawsize * 9 + 7) / 8;
862+
compressed_size += 2;
833863

834864
/*
835865
* Maximum compressed size can't be larger than total compressed size.
866+
* (This also ensures that our result fits in int32.)
836867
*/
837868
compressed_size = Min(compressed_size, total_compressed_size);
838869

839-
return compressed_size;
870+
return (int32) compressed_size;
840871
}

0 commit comments

Comments
 (0)