Commit 5fc1ac1

Fix pg_dump for hash partitioning on enum columns.
Hash partitioning on an enum is problematic because the hash codes are derived from the OIDs assigned to the enum values, which will almost certainly be different after a dump-and-reload than they were before. This means that some rows probably end up in different partitions than before, causing restore to fail because of partition constraint violations. (pg_upgrade dodges this problem by using hacks to force the enum values to keep the same OIDs, but that's not possible nor desirable for pg_dump.)

Users can work around that by specifying --load-via-partition-root, but since that's a dump-time not restore-time decision, one might find out the need for it far too late. Instead, teach pg_dump to apply that option automatically when dealing with a partitioned table that has hash-on-enum partitioning.

Also deal with a pre-existing issue for --load-via-partition-root mode: in a parallel restore, we try to TRUNCATE target tables just before loading them, in order to enable some backend optimizations. This is bad when using --load-via-partition-root because (a) we're likely to suffer deadlocks from restore jobs trying to restore rows into other partitions than they came from, and (b) if we miss getting a deadlock we might still lose data due to a TRUNCATE removing rows from some already-completed restore job.

The fix for this is conceptually simple: just don't TRUNCATE if we're dealing with a --load-via-partition-root case. The tricky bit is for pg_restore to identify those cases. In dumps using COPY commands we can inspect each COPY command to see if it targets the nominal target table or some ancestor. However, in dumps using INSERT commands it's pretty impractical to examine the INSERTs in advance. To provide a solution for that going forward, modify pg_dump to mark TABLE DATA items that are using --load-via-partition-root with a comment. (This change also responds to a complaint from Robert Haas that the dump output for --load-via-partition-root is pretty confusing.) pg_restore checks for the special comment as well as checking the COPY command if present.

This will fail to identify the combination of --load-via-partition-root and --inserts in pre-existing dump files, but that should be a pretty rare case in the field. If it does happen you will probably get a deadlock failure that you can work around by not using parallel restore, which is the same as before this bug fix.

Having done this, there seems no remaining reason for the alarmism in the pg_dump man page about combining --load-via-partition-root with parallel restore, so remove that warning.

Patch by me; thanks to Julien Rouhaud for review. Back-patch to v11 where hash partitioning was introduced.

Discussion: https://postgr.es/m/1376149.1675268279@sss.pgh.pa.us
1 parent a9b716c commit 5fc1ac1

7 files changed: +287 −52 lines

doc/src/sgml/ref/pg_dump.sgml

Lines changed: 0 additions & 10 deletions

@@ -863,16 +863,6 @@ PostgreSQL documentation
     and the two systems have different definitions of the collation used
     to sort the partitioning column.
    </para>
-
-   <para>
-    It is best not to use parallelism when restoring from an archive made
-    with this option, because <application>pg_restore</application> will
-    not know exactly which partition(s) a given archive data item will
-    load data into.  This could result in inefficiency due to lock
-    conflicts between parallel jobs, or perhaps even restore failures due
-    to foreign key constraints being set up before all the relevant data
-    is loaded.
-   </para>
   </listitem>
  </varlistentry>

doc/src/sgml/ref/pg_dumpall.sgml

Lines changed: 0 additions & 4 deletions

@@ -360,10 +360,6 @@ PostgreSQL documentation
     and the two systems have different definitions of the collation used
     to sort the partitioning column.
    </para>
-
-   <!-- Currently, we don't need pg_dump's warning about parallelism here,
-        since parallel restore from a pg_dumpall script is impossible.
-   -->
   </listitem>
  </varlistentry>

src/bin/pg_dump/common.c

Lines changed: 10 additions & 8 deletions

@@ -228,6 +228,9 @@ getSchemaData(Archive *fout, int *numTablesPtr)
 	pg_log_info("flagging inherited columns in subtables");
 	flagInhAttrs(fout->dopt, tblinfo, numTables);
 
+	pg_log_info("reading partitioning data");
+	getPartitioningInfo(fout);
+
 	pg_log_info("reading indexes");
 	getIndexes(fout, tblinfo, numTables);
 
@@ -280,7 +283,6 @@ static void
 flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
 			  InhInfo *inhinfo, int numInherits)
 {
-	DumpOptions *dopt = fout->dopt;
 	int			i,
 				j;
 
@@ -296,18 +298,18 @@ flagInhTables(Archive *fout, TableInfo *tblinfo, int numTables,
 			continue;
 
 		/*
-		 * Normally, we don't bother computing anything for non-target tables,
-		 * but if load-via-partition-root is specified, we gather information
-		 * on every partition in the system so that getRootTableInfo can trace
-		 * from any given leaf partition all the way up to the root.  (We
-		 * don't need to mark them as interesting for getTableAttrs, though.)
+		 * Normally, we don't bother computing anything for non-target tables.
+		 * However, we must find the parents of non-root partitioned tables in
+		 * any case, so that we can trace from leaf partitions up to the root
+		 * (in case a leaf is to be dumped but its parents are not).  We need
+		 * not mark such parents interesting for getTableAttrs, though.
 		 */
 		if (!tblinfo[i].dobj.dump)
 		{
 			mark_parents = false;
 
-			if (!dopt->load_via_partition_root ||
-				!tblinfo[i].ispartition)
+			if (!(tblinfo[i].relkind == RELKIND_PARTITIONED_TABLE &&
+				  tblinfo[i].ispartition))
 				find_parents = false;
 		}
 

src/bin/pg_dump/pg_backup_archiver.c

Lines changed: 64 additions & 9 deletions

@@ -91,6 +91,7 @@ static RestorePass _tocEntryRestorePass(TocEntry *te);
 static bool _tocEntryIsACL(TocEntry *te);
 static void _disableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
 static void _enableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te);
+static bool is_load_via_partition_root(TocEntry *te);
 static void buildTocEntryArrays(ArchiveHandle *AH);
 static void _moveBefore(TocEntry *pos, TocEntry *te);
 static int	_discoverArchiveFormat(ArchiveHandle *AH);
@@ -878,6 +879,8 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 			}
 			else
 			{
+				bool		use_truncate;
+
 				_disableTriggersIfNecessary(AH, te);
 
 				/* Select owner and schema as necessary */
@@ -889,13 +892,24 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 
 				/*
 				 * In parallel restore, if we created the table earlier in
-				 * the run then we wrap the COPY in a transaction and
-				 * precede it with a TRUNCATE.  If archiving is not on
-				 * this prevents WAL-logging the COPY.  This obtains a
-				 * speedup similar to that from using single_txn mode in
-				 * non-parallel restores.
+				 * this run (so that we know it is empty) and we are not
+				 * restoring a load-via-partition-root data item then we
+				 * wrap the COPY in a transaction and precede it with a
+				 * TRUNCATE.  If wal_level is set to minimal this prevents
+				 * WAL-logging the COPY.  This obtains a speedup similar
+				 * to that from using single_txn mode in non-parallel
+				 * restores.
+				 *
+				 * We mustn't do this for load-via-partition-root cases
+				 * because some data might get moved across partition
+				 * boundaries, risking deadlock and/or loss of previously
+				 * loaded data.  (We assume that all partitions of a
+				 * partitioned table will be treated the same way.)
 				 */
-				if (is_parallel && te->created)
+				use_truncate = is_parallel && te->created &&
+					!is_load_via_partition_root(te);
+
+				if (use_truncate)
 				{
 					/*
 					 * Parallel restore is always talking directly to a
@@ -936,7 +950,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 				AH->outputKind = OUTPUT_SQLCMDS;
 
 				/* close out the transaction started above */
-				if (is_parallel && te->created)
+				if (use_truncate)
 					CommitTransaction(&AH->public);
 
 				_enableTriggersIfNecessary(AH, te);
@@ -1028,6 +1042,43 @@ _enableTriggersIfNecessary(ArchiveHandle *AH, TocEntry *te)
 					 fmtQualifiedId(te->namespace, te->tag));
 }
 
+/*
+ * Detect whether a TABLE DATA TOC item is performing "load via partition
+ * root", that is the target table is an ancestor partition rather than the
+ * table the TOC item is nominally for.
+ *
+ * In newer archive files this can be detected by checking for a special
+ * comment placed in te->defn.  In older files we have to fall back to seeing
+ * if the COPY statement targets the named table or some other one.  This
+ * will not work for data dumped as INSERT commands, so we could give a false
+ * negative in that case; fortunately, that's a rarely-used option.
+ */
+static bool
+is_load_via_partition_root(TocEntry *te)
+{
+	if (te->defn &&
+		strncmp(te->defn, "-- load via partition root ", 27) == 0)
+		return true;
+	if (te->copyStmt && *te->copyStmt)
+	{
+		PQExpBuffer copyStmt = createPQExpBuffer();
+		bool		result;
+
+		/*
+		 * Build the initial part of the COPY as it would appear if the
+		 * nominal target table is the actual target.  If we see anything
+		 * else, it must be a load-via-partition-root case.
+		 */
+		appendPQExpBuffer(copyStmt, "COPY %s ",
+						  fmtQualifiedId(te->namespace, te->tag));
+		result = strncmp(te->copyStmt, copyStmt->data, copyStmt->len) != 0;
+		destroyPQExpBuffer(copyStmt);
+		return result;
+	}
+	/* Assume it's not load-via-partition-root */
+	return false;
+}
+
 /*
  * This is a routine that is part of the dumper interface, hence the 'Archive*' parameter.
  */
@@ -2956,8 +3007,12 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 		res = res & ~REQ_DATA;
 	}
 
-	/* If there's no definition command, there's no schema component */
-	if (!te->defn || !te->defn[0])
+	/*
+	 * If there's no definition command, there's no schema component.  Treat
+	 * "load via partition root" comments as not schema.
+	 */
+	if (!te->defn || !te->defn[0] ||
+		strncmp(te->defn, "-- load via partition root ", 27) == 0)
 		res = res & ~REQ_SCHEMA;
 
 	/*

src/bin/pg_dump/pg_dump.c

Lines changed: 130 additions & 21 deletions

@@ -302,6 +302,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
+static bool forcePartitionRootLoad(const TableInfo *tbinfo);
 
 
 int
@@ -2217,11 +2218,13 @@ dumpTableData_insert(Archive *fout, const void *dcontext)
 	insertStmt = createPQExpBuffer();
 
 	/*
-	 * When load-via-partition-root is set, get the root table name
-	 * for the partition table, so that we can reload data through the
-	 * root table.
+	 * When load-via-partition-root is set or forced, get the root
+	 * table name for the partition table, so that we can reload data
+	 * through the root table.
 	 */
-	if (dopt->load_via_partition_root && tbinfo->ispartition)
+	if (tbinfo->ispartition &&
+		(dopt->load_via_partition_root ||
+		 forcePartitionRootLoad(tbinfo)))
 		targettab = getRootTableInfo(tbinfo);
 	else
 		targettab = tbinfo;
@@ -2419,6 +2422,35 @@ getRootTableInfo(const TableInfo *tbinfo)
 	return parentTbinfo;
 }
 
+/*
+ * forcePartitionRootLoad
+ *     Check if we must force load_via_partition_root for this partition.
+ *
+ * This is required if any level of ancestral partitioned table has an
+ * unsafe partitioning scheme.
+ */
+static bool
+forcePartitionRootLoad(const TableInfo *tbinfo)
+{
+	TableInfo  *parentTbinfo;
+
+	Assert(tbinfo->ispartition);
+	Assert(tbinfo->numParents == 1);
+
+	parentTbinfo = tbinfo->parents[0];
+	if (parentTbinfo->unsafe_partitions)
+		return true;
+	while (parentTbinfo->ispartition)
+	{
+		Assert(parentTbinfo->numParents == 1);
+		parentTbinfo = parentTbinfo->parents[0];
+		if (parentTbinfo->unsafe_partitions)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * dumpTableData -
  *	  dump the contents of a single table
@@ -2433,34 +2465,40 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	PQExpBuffer copyBuf = createPQExpBuffer();
 	PQExpBuffer clistBuf = createPQExpBuffer();
 	DataDumperPtr dumpFn;
+	char	   *tdDefn = NULL;
 	char	   *copyStmt;
 	const char *copyFrom;
 
 	/* We had better have loaded per-column details about this table */
 	Assert(tbinfo->interesting);
 
+	/*
+	 * When load-via-partition-root is set or forced, get the root table name
+	 * for the partition table, so that we can reload data through the root
+	 * table.  Then construct a comment to be inserted into the TOC entry's
+	 * defn field, so that such cases can be identified reliably.
+	 */
+	if (tbinfo->ispartition &&
+		(dopt->load_via_partition_root ||
+		 forcePartitionRootLoad(tbinfo)))
+	{
+		TableInfo  *parentTbinfo;
+
+		parentTbinfo = getRootTableInfo(tbinfo);
+		copyFrom = fmtQualifiedDumpable(parentTbinfo);
+		printfPQExpBuffer(copyBuf, "-- load via partition root %s",
+						  copyFrom);
+		tdDefn = pg_strdup(copyBuf->data);
+	}
+	else
+		copyFrom = fmtQualifiedDumpable(tbinfo);
+
 	if (dopt->dump_inserts == 0)
 	{
 		/* Dump/restore using COPY */
 		dumpFn = dumpTableData_copy;
-
-		/*
-		 * When load-via-partition-root is set, get the root table name for
-		 * the partition table, so that we can reload data through the root
-		 * table.
-		 */
-		if (dopt->load_via_partition_root && tbinfo->ispartition)
-		{
-			TableInfo  *parentTbinfo;
-
-			parentTbinfo = getRootTableInfo(tbinfo);
-			copyFrom = fmtQualifiedDumpable(parentTbinfo);
-		}
-		else
-			copyFrom = fmtQualifiedDumpable(tbinfo);
-
 		/* must use 2 steps here 'cause fmtId is nonreentrant */
-		appendPQExpBuffer(copyBuf, "COPY %s ",
+		printfPQExpBuffer(copyBuf, "COPY %s ",
 						  copyFrom);
 		appendPQExpBuffer(copyBuf, "%s FROM stdin;\n",
 						  fmtCopyColumnList(tbinfo, clistBuf));
@@ -2488,6 +2526,7 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 								  .owner = tbinfo->rolname,
 								  .description = "TABLE DATA",
 								  .section = SECTION_DATA,
+								  .createStmt = tdDefn,
 								  .copyStmt = copyStmt,
 								  .deps = &(tbinfo->dobj.dumpId),
 								  .nDeps = 1,
@@ -7196,6 +7235,76 @@ getInherits(Archive *fout, int *numInherits)
 	return inhinfo;
 }
 
+/*
+ * getPartitioningInfo
+ *	  get information about partitioning
+ *
+ * For the most part, we only collect partitioning info about tables we
+ * intend to dump.  However, this function has to consider all partitioned
+ * tables in the database, because we need to know about parents of partitions
+ * we are going to dump even if the parents themselves won't be dumped.
+ *
+ * Specifically, what we need to know is whether each partitioned table
+ * has an "unsafe" partitioning scheme that requires us to force
+ * load-via-partition-root mode for its children.  Currently the only case
+ * for which we force that is hash partitioning on enum columns, since the
+ * hash codes depend on enum value OIDs which won't be replicated across
+ * dump-and-reload.  There are other cases in which load-via-partition-root
+ * might be necessary, but we expect users to cope with them.
+ */
+void
+getPartitioningInfo(Archive *fout)
+{
+	PQExpBuffer query;
+	PGresult   *res;
+	int			ntups;
+
+	/* hash partitioning didn't exist before v11 */
+	if (fout->remoteVersion < 110000)
+		return;
+	/* needn't bother if schema-only dump */
+	if (fout->dopt->schemaOnly)
+		return;
+
+	query = createPQExpBuffer();
+
+	/*
+	 * Unsafe partitioning schemes are exactly those for which hash enum_ops
+	 * appears among the partition opclasses.  We needn't check partstrat.
+	 *
+	 * Note that this query may well retrieve info about tables we aren't
+	 * going to dump and hence have no lock on.  That's okay since we need not
+	 * invoke any unsafe server-side functions.
+	 */
+	appendPQExpBufferStr(query,
+						 "SELECT partrelid FROM pg_partitioned_table WHERE\n"
+						 "(SELECT c.oid FROM pg_opclass c JOIN pg_am a "
+						 "ON c.opcmethod = a.oid\n"
+						 "WHERE opcname = 'enum_ops' "
+						 "AND opcnamespace = 'pg_catalog'::regnamespace "
+						 "AND amname = 'hash') = ANY(partclass)");
+
+	res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+
+	ntups = PQntuples(res);
+
+	for (int i = 0; i < ntups; i++)
+	{
+		Oid			tabrelid = atooid(PQgetvalue(res, i, 0));
+		TableInfo  *tbinfo;
+
+		tbinfo = findTableByOid(tabrelid);
+		if (tbinfo == NULL)
+			fatal("failed sanity check, table OID %u appearing in pg_partitioned_table not found",
+				  tabrelid);
+		tbinfo->unsafe_partitions = true;
+	}
+
+	PQclear(res);
+
+	destroyPQExpBuffer(query);
+}
+
 /*
  * getIndexes
  *	  get information about every index on a dumpable table
