Commit 9c08aea

Add new block-by-block strategy for CREATE DATABASE.

Because this strategy logs changes on a block-by-block basis, it avoids the need to checkpoint before and after the operation. However, because it logs each changed block individually, it might generate a lot of extra write-ahead logging if the template database is large. Therefore, the older strategy remains available via a new STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy option to createdb.

Somewhat controversially, this patch assembles the list of relations to be copied to the new database by reading the pg_class relation of the template database. Cross-database access like this isn't normally possible, but it can be made to work here because there can't be any connections to the database being copied, nor can it contain any in-doubt transactions. Even so, we have to use lower-level interfaces than normal, since the table scan and relcache interfaces will not work for a database to which we're not connected. The advantage of this approach is that we do not need to rely on the filesystem to determine what ought to be copied, but instead on PostgreSQL's own knowledge of the database structure. This avoids, for example, copying stray files that happen to be located in the source database directory.

Dilip Kumar, with a fairly large number of cosmetic changes by me. Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor, Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian, Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.

Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com

1 parent bf902c1 commit 9c08aea
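
The trade-off described in the commit message can be made concrete with a rough cost model. The sketch below is illustrative only: every `Sketch*` name and the per-record overhead constant are assumptions invented for this example, not values taken from the patch. Only the 8 kB block size reflects a PostgreSQL default. It models WAL_LOG as one logged record per copied block with no forced checkpoints, and FILE_COPY as one small record per tablespace plus a checkpoint before and after.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the overhead figure is hypothetical. */
#define SKETCH_BLOCK_SIZE      8192u  /* PostgreSQL's default block size */
#define SKETCH_RECORD_OVERHEAD 64u    /* assumed per-record header cost */

typedef enum { SKETCH_WAL_LOG, SKETCH_FILE_COPY } SketchStrategy;

typedef struct
{
    uint64_t wal_bytes;    /* rough WAL volume generated by the copy */
    int      checkpoints;  /* checkpoints forced by the strategy */
} SketchCost;

/* Estimate the cost of copying nblocks blocks spread over ntablespaces. */
SketchCost
sketch_create_db_cost(SketchStrategy strategy, uint64_t nblocks,
                      uint32_t ntablespaces)
{
    SketchCost cost = {0, 0};

    assert(ntablespaces > 0);

    if (strategy == SKETCH_WAL_LOG)
    {
        /* Each block is logged separately; no checkpoint is needed. */
        cost.wal_bytes = nblocks * (SKETCH_BLOCK_SIZE + SKETCH_RECORD_OVERHEAD);
        cost.checkpoints = 0;
    }
    else
    {
        /* One small record per tablespace, plus two checkpoints. */
        cost.wal_bytes = (uint64_t) ntablespaces * SKETCH_RECORD_OVERHEAD;
        cost.checkpoints = 2;
    }
    return cost;
}
```

Under this model the WAL volume of WAL_LOG grows linearly with the template size, while FILE_COPY stays nearly constant but always pays for two checkpoints, which is why the commit keeps both strategies available.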

28 files changed: +1081 -157 lines changed

contrib/bloom/blinsert.c

+1 -1
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
      * Write the page and log it. It might seem that an immediate sync would
      * be sufficient to guarantee that the file exists on disk, but recovery
      * itself might remove it while replaying, for example, an
-     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
+     * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record. Therefore, we need
      * this even when wal_level=minimal.
      */
     PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);

doc/src/sgml/monitoring.sgml

+4
@@ -1502,6 +1502,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
       <entry><literal>TwophaseFileWrite</literal></entry>
       <entry>Waiting for a write of a two phase state file.</entry>
      </row>
+     <row>
+      <entry><literal>VersionFileWrite</literal></entry>
+      <entry>Waiting for the version file to be written while creating a database.</entry>
+     </row>
      <row>
       <entry><literal>WALBootstrapSync</literal></entry>
       <entry>Waiting for WAL to reach durable storage during

doc/src/sgml/ref/create_database.sgml

+22
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
            [ LC_CTYPE [=] <replaceable class="parameter">lc_ctype</replaceable> ]
@@ -118,6 +119,27 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
+      <term><replaceable class="parameter">strategy</replaceable></term>
+      <listitem>
+       <para>
+        Strategy to be used in creating the new database. If
+        the <literal>WAL_LOG</literal> strategy is used, the database will be
+        copied block by block and each block will be separately written
+        to the write-ahead log. This is the most efficient strategy in
+        cases where the template database is small, and therefore it is the
+        default. The older <literal>FILE_COPY</literal> strategy is also
+        available. This strategy writes a small record to the write-ahead log
+        for each tablespace used by the target database. Each such record
+        represents copying an entire directory to a new location at the
+        filesystem level. While this does reduce the write-ahead
+        log volume substantially, especially if the template database is large,
+        it also forces the system to perform a checkpoint both before and
+        after the creation of the new database. In some situations, this may
+        have a noticeable negative impact on overall system performance.
+       </para>
+      </listitem>
+     </varlistentry>
      <varlistentry>
       <term><replaceable class="parameter">locale</replaceable></term>
       <listitem>

doc/src/sgml/ref/createdb.sgml

+11
@@ -177,6 +177,17 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>-S <replaceable class="parameter">strategy</replaceable></option></term>
+      <term><option>--strategy=<replaceable class="parameter">strategy</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the database creation strategy. See
+        <xref linkend="create-database-strategy" /> for more details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-T <replaceable class="parameter">template</replaceable></option></term>
       <term><option>--template=<replaceable class="parameter">template</replaceable></option></term>
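
The documentation hunks above add a `STRATEGY [=] strategy` clause to `CREATE DATABASE` and a matching `--strategy` option to createdb. As a rough illustration of how a client program might assemble such a statement, here is a minimal sketch. The helper names are hypothetical and this is not how createdb itself is implemented; for simplicity it accepts only plain names made of letters, digits, and underscores and rejects everything else, instead of performing real identifier quoting.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if s is a non-empty run of letters, digits, and underscores. */
static int
sketch_is_simple_name(const char *s)
{
    if (s == NULL || *s == '\0')
        return 0;
    for (; *s != '\0'; s++)
    {
        if (!isalnum((unsigned char) *s) && *s != '_')
            return 0;
    }
    return 1;
}

/*
 * Writes "CREATE DATABASE <dbname> STRATEGY = <strategy>" into buf.
 * Returns 0 on success, or -1 if an argument is not a simple name or
 * the buffer is too small to hold the statement.
 */
int
sketch_build_create_db(char *buf, size_t buflen,
                       const char *dbname, const char *strategy)
{
    int n;

    if (!sketch_is_simple_name(dbname) || !sketch_is_simple_name(strategy))
        return -1;

    n = snprintf(buf, buflen, "CREATE DATABASE %s STRATEGY = %s",
                 dbname, strategy);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Calling `sketch_build_create_db(buf, sizeof buf, "sales", "FILE_COPY")` would select the older directory-copy behavior described in the documentation, at the cost of the two checkpoints it mentions.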

src/backend/access/heap/heapam_handler.c

+3 -3
@@ -593,15 +593,15 @@ heapam_relation_set_new_filenode(Relation rel,
      */
     *minmulti = GetOldestMultiXactId();
 
-    srel = RelationCreateStorage(*newrnode, persistence);
+    srel = RelationCreateStorage(*newrnode, persistence, true);
 
     /*
      * If required, set up an init fork for an unlogged table so that it can
      * be correctly reinitialized on restart. An immediate sync is required
      * even if the page has been logged, because the write did not go through
      * shared_buffers and therefore a concurrent checkpoint may have moved the
      * redo pointer past our xlog record. Recovery may as well remove it
-     * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+     * while replaying, for example, XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
      * record. Therefore, logging is necessary even if wal_level=minimal.
      */
     if (persistence == RELPERSISTENCE_UNLOGGED)
@@ -645,7 +645,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
-    RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence);
+    RelationCreateStorage(*newrnode, rel->rd_rel->relpersistence, true);
 
     /* copy main fork */
     RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,

src/backend/access/nbtree/nbtree.c

+1 -1
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
      * Write the page and log it. It might seem that an immediate sync would
      * be sufficient to guarantee that the file exists on disk, but recovery
      * itself might remove it while replaying, for example, an
-     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
+     * XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE record. Therefore, we need
      * this even when wal_level=minimal.
      */
     PageSetChecksumInplace(metapage, BTREE_METAPAGE);

src/backend/access/rmgrdesc/dbasedesc.c

+16 -4
@@ -24,14 +24,23 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
     char *rec = XLogRecGetData(record);
     uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-    if (info == XLOG_DBASE_CREATE)
+    if (info == XLOG_DBASE_CREATE_FILE_COPY)
     {
-        xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) rec;
+        xl_dbase_create_file_copy_rec *xlrec =
+        (xl_dbase_create_file_copy_rec *) rec;
 
         appendStringInfo(buf, "copy dir %u/%u to %u/%u",
                          xlrec->src_tablespace_id, xlrec->src_db_id,
                          xlrec->tablespace_id, xlrec->db_id);
     }
+    else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+    {
+        xl_dbase_create_wal_log_rec *xlrec =
+        (xl_dbase_create_wal_log_rec *) rec;
+
+        appendStringInfo(buf, "create dir %u/%u",
+                         xlrec->tablespace_id, xlrec->db_id);
+    }
     else if (info == XLOG_DBASE_DROP)
     {
         xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
@@ -51,8 +60,11 @@ dbase_identify(uint8 info)
 
     switch (info & ~XLR_INFO_MASK)
     {
-        case XLOG_DBASE_CREATE:
-            id = "CREATE";
+        case XLOG_DBASE_CREATE_FILE_COPY:
+            id = "CREATE_FILE_COPY";
+            break;
+        case XLOG_DBASE_CREATE_WAL_LOG:
+            id = "CREATE_WAL_LOG";
             break;
         case XLOG_DBASE_DROP:
             id = "DROP";
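
The `dbase_identify` hunk above splits the single `CREATE` record type into two, one per strategy. The following standalone sketch mirrors that dispatch outside the PostgreSQL tree. The `SKETCH_*` names and their numeric values are placeholders assumed for this example; the real definitions live in PostgreSQL's headers and should be consulted there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values assumed for this sketch only. */
#define SKETCH_XLR_INFO_MASK          0x0F
#define SKETCH_DBASE_CREATE_FILE_COPY 0x00
#define SKETCH_DBASE_CREATE_WAL_LOG   0x10
#define SKETCH_DBASE_DROP             0x20

/*
 * Map a record's info byte to a short name, ignoring the low bits
 * covered by the mask. Returns NULL for an unrecognized record type.
 */
const char *
sketch_dbase_identify(uint8_t info)
{
    const char *id = NULL;

    switch (info & ~SKETCH_XLR_INFO_MASK)
    {
        case SKETCH_DBASE_CREATE_FILE_COPY:
            id = "CREATE_FILE_COPY";
            break;
        case SKETCH_DBASE_CREATE_WAL_LOG:
            id = "CREATE_WAL_LOG";
            break;
        case SKETCH_DBASE_DROP:
            id = "DROP";
            break;
    }
    return id;
}
```

Because the low bits are masked off before the switch, flag bits set in that part of the info byte do not affect which name is returned.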

src/backend/access/transam/xlogutils.c

+3 -3
@@ -484,7 +484,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
     {
         /* page exists in file */
         buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-                                           mode, NULL);
+                                           mode, NULL, true);
     }
     else
     {
@@ -509,7 +509,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
                 ReleaseBuffer(buffer);
             }
             buffer = ReadBufferWithoutRelcache(rnode, forknum,
-                                               P_NEW, mode, NULL);
+                                               P_NEW, mode, NULL, true);
         }
         while (BufferGetBlockNumber(buffer) < blkno);
         /* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -519,7 +519,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
             LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
             ReleaseBuffer(buffer);
             buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-                                               mode, NULL);
+                                               mode, NULL, true);
         }
     }
 

src/backend/catalog/heap.c

+1 -1
@@ -387,7 +387,7 @@ heap_create(const char *relname,
                                             relpersistence,
                                             relfrozenxid, relminmxid);
         else if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
-            RelationCreateStorage(rel->rd_node, relpersistence);
+            RelationCreateStorage(rel->rd_node, relpersistence, true);
         else
             Assert(false);
     }

src/backend/catalog/storage.c

+22 -12
@@ -112,12 +112,14 @@ AddPendingSync(const RelFileNode *rnode)
  * modules that need them.
  *
  * This function is transactional. The creation is WAL-logged, and if the
- * transaction aborts later on, the storage will be destroyed.
+ * transaction aborts later on, the storage will be destroyed. A caller
+ * that does not want the storage to be destroyed in case of an abort may
+ * pass register_delete = false.
  */
 SMgrRelation
-RelationCreateStorage(RelFileNode rnode, char relpersistence)
+RelationCreateStorage(RelFileNode rnode, char relpersistence,
+                      bool register_delete)
 {
-    PendingRelDelete *pending;
     SMgrRelation srel;
     BackendId backend;
     bool needs_wal;
@@ -149,15 +151,23 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     if (needs_wal)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
-    /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
-    pending->relnode = rnode;
-    pending->backend = backend;
-    pending->atCommit = false; /* delete if abort */
-    pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->next = pendingDeletes;
-    pendingDeletes = pending;
+    /*
+     * Add the relation to the list of stuff to delete at abort, if we are
+     * asked to do so.
+     */
+    if (register_delete)
+    {
+        PendingRelDelete *pending;
+
+        pending = (PendingRelDelete *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = false; /* delete if abort */
+        pending->nestLevel = GetCurrentTransactionNestLevel();
+        pending->next = pendingDeletes;
+        pendingDeletes = pending;
+    }
 
     if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
     {
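
The `register_delete` flag added to `RelationCreateStorage` above lets a caller opt out of having the new storage queued for removal on abort. The effect can be modeled in isolation with a toy pending-delete list. Everything in this sketch (the `Sketch*` types, the integer relation id, the use of `malloc`) is a simplification assumed for illustration and does not reproduce the backend's memory contexts, nesting levels, or WAL logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for the backend's pending-delete entry. */
typedef struct SketchPendingDelete
{
    unsigned relnode;                  /* simplified relation identifier */
    bool     at_commit;                /* false means: delete on abort */
    struct SketchPendingDelete *next;
} SketchPendingDelete;

static SketchPendingDelete *sketch_pending = NULL;

/* "Create" storage; queue it for abort-time deletion only if asked. */
void
sketch_create_storage(unsigned relnode, bool register_delete)
{
    /* The actual storage creation is elided in this sketch. */
    if (register_delete)
    {
        SketchPendingDelete *p = malloc(sizeof *p);

        assert(p != NULL);
        p->relnode = relnode;
        p->at_commit = false;
        p->next = sketch_pending;
        sketch_pending = p;
    }
}

/* Simulate an abort: drain the list and return how many entries it held. */
int
sketch_abort(void)
{
    int ndeleted = 0;

    while (sketch_pending != NULL)
    {
        SketchPendingDelete *next = sketch_pending->next;

        free(sketch_pending);
        sketch_pending = next;
        ndeleted++;
    }
    return ndeleted;
}
```

In this model, storage created with `register_delete = false` is simply invisible to the abort path, so the caller takes on responsibility for cleaning it up by some other means.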
