Commit 04c8634

pg_createsubscriber: Only --recovery-timeout controls the end of recovery process
pg_createsubscriber used to check whether the target server was connected to the primary server (which sends the required WAL), so that it could react quickly when the process was not going to succeed. That check is not enough to guarantee that the recovery process will complete: there is a window between the walreceiver shutdown and pg_is_in_recovery() returning false during which the check can reach NUM_CONN_ATTEMPTS attempts and fail. Instead, rely only on the --recovery-timeout option to give up after the specified number of seconds.

This should help with buildfarm failures on slow machines.

Author: Euler Taveira <euler.taveira@enterprisedb.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://www.postgresql.org/message-id/776c5cac-5ef5-4001-b1bc-5b698bc0c62a%40app.fastmail.com
1 parent 8f1888e commit 04c8634
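
The control flow described in the commit message can be sketched in isolation. This is a minimal illustration, not the code from pg_createsubscriber.c: the names `wait_for_end_recovery_sketch`, `in_recovery_fn`, `fake_server` and `fake_in_recovery` are hypothetical, the callback stands in for a query against the target server, and the loop advances a counter instead of sleeping between polls. It assumes, as the diff's comment states, that a timeout of zero means waiting forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for a pg_is_in_recovery() query. */
typedef bool (*in_recovery_fn) (void *state);

/* Possible outcomes of the wait loop in this sketch. */
typedef enum
{
    WAIT_RECOVERY_DONE,
    WAIT_RECOVERY_TIMEOUT
} wait_result;

/*
 * Poll until recovery ends or the timeout expires. Only the timeout can
 * end the wait early; there is no check of the walreceiver connection.
 * A recovery_timeout of zero or less means "wait forever".
 */
static wait_result
wait_for_end_recovery_sketch(in_recovery_fn in_recovery, void *state,
                             int recovery_timeout)
{
    int timer = 0;

    assert(in_recovery != NULL);

    for (;;)
    {
        /* Recovery finished: the only successful exit. */
        if (!in_recovery(state))
            return WAIT_RECOVERY_DONE;

        /* Give up once the configured number of polls has elapsed. */
        if (recovery_timeout > 0 && timer >= recovery_timeout)
            return WAIT_RECOVERY_TIMEOUT;

        /* A real tool would sleep here; the sketch only counts polls. */
        timer++;
    }
}

/* Fake server that reports "in recovery" for a fixed number of polls. */
typedef struct
{
    int polls_left;
} fake_server;

static bool
fake_in_recovery(void *state)
{
    fake_server *server = state;

    if (server->polls_left > 0)
    {
        server->polls_left--;
        return true;
    }
    return false;
}
```

Calling the sketch with a `fake_server` that needs 3 polls and a timeout of 10 ends with `WAIT_RECOVERY_DONE`, while a server that needs 100 polls and a timeout of 5 ends with `WAIT_RECOVERY_TIMEOUT`. A slow machine therefore fails only when the caller's own limit is exceeded, not because of a fixed number of connection checks.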

3 files changed, with 5 additions and 33 deletions.

doc/src/sgml/ref/pg_createsubscriber.sgml

Lines changed: 0 additions & 7 deletions
@@ -325,13 +325,6 @@ PostgreSQL documentation
     connections to the target server should fail.
    </para>
 
-   <para>
-    During the recovery process, if the target server disconnects from the
-    source server, <application>pg_createsubscriber</application> will check a
-    few times if the connection has been reestablished to stream the required
-    WAL. After a few attempts, it terminates with an error.
-   </para>
-
    <para>
     Since DDL commands are not replicated by logical replication, avoid
     executing DDL commands that change the database schema while running

src/bin/pg_basebackup/pg_createsubscriber.c

Lines changed: 3 additions & 26 deletions
@@ -1360,24 +1360,23 @@ stop_standby_server(const char *datadir)
  *
  * If recovery_timeout option is set, terminate abnormally without finishing
  * the recovery process. By default, it waits forever.
+ *
+ * XXX Is the recovery process still in progress? When recovery process has a
+ * better progress reporting mechanism, it should be added here.
  */
 static void
 wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions *opt)
 {
     PGconn *conn;
     int status = POSTMASTER_STILL_STARTING;
     int timer = 0;
-    int count = 0; /* number of consecutive connection attempts */
-
-#define NUM_CONN_ATTEMPTS 10
 
     pg_log_info("waiting for the target server to reach the consistent state");
 
     conn = connect_database(conninfo, true);
 
     for (;;)
     {
-        PGresult *res;
         bool in_recovery = server_is_in_recovery(conn);
 
         /*
@@ -1391,28 +1390,6 @@ wait_for_end_recovery(const char *conninfo, const struct CreateSubscriberOptions
             break;
         }
 
-        /*
-         * If it is still in recovery, make sure the target server is
-         * connected to the primary so it can receive the required WAL to
-         * finish the recovery process. If it is disconnected try
-         * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
-         */
-        res = PQexec(conn,
-                     "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
-        if (PQntuples(res) == 0)
-        {
-            if (++count > NUM_CONN_ATTEMPTS)
-            {
-                stop_standby_server(subscriber_dir);
-                pg_log_error("standby server disconnected from the primary");
-                break;
-            }
-        }
-        else
-            count = 0; /* reset counter if it connects again */
-
-        PQclear(res);
-
         /* Bail out after recovery_timeout seconds if this option is set */
         if (opt->recovery_timeout > 0 && timer >= opt->recovery_timeout)
         {

src/bin/pg_basebackup/t/040_pg_createsubscriber.pl

Lines changed: 2 additions & 0 deletions
@@ -264,6 +264,7 @@
 command_ok(
     [
         'pg_createsubscriber', '--verbose',
+        '--recovery-timeout', "$PostgreSQL::Test::Utils::timeout_default",
         '--dry-run', '--pgdata',
         $node_s->data_dir, '--publisher-server',
         $node_p->connstr('pg1'), '--socket-directory',
@@ -301,6 +302,7 @@
 command_ok(
     [
         'pg_createsubscriber', '--verbose',
+        '--recovery-timeout', "$PostgreSQL::Test::Utils::timeout_default",
         '--verbose', '--pgdata',
         $node_s->data_dir, '--publisher-server',
         $node_p->connstr('pg1'), '--socket-directory',
