Commit a075c84

Fix race conditions in newly-added test.
Buildfarm has been failing sporadically on the new test. I was able to reproduce this by adding a random 0-10 s delay in the walreceiver, just before it connects to the primary. There is a race condition where node_3 is promoted before it has fully caught up with node_1, leading to diverged timelines. When node_1 is later reconfigured as a standby following node_3, it fails to catch up:

LOG:  primary server contains no more WAL on requested timeline 1
LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/30000A0

That's the situation where you'd need to use pg_rewind, but here it happens while we are still setting up the actual pg_rewind scenario we want to test. Change the test so that it waits until node_3 is connected and fully caught up before promoting it, so that we get a clean, controlled failover.

Also rewrite some of the comments for clarity. The existing comments detailed what each step in the test did, but didn't give a good overview of the situation the steps were trying to create.

For reasons I don't understand, the test setup had to be written slightly differently in 9.6 and 9.5 than in later versions. The 9.5/9.6 version needed node 1 to be reinitialized from backup, whereas in later versions it could be shut down and reconfigured to be a standby. But even 9.5 should support a "clean switchover", where the primary makes sure that pending WAL is replicated to the standby on shutdown. It would be nice to figure out what's going on there, but that's independent of pg_rewind and the scenario that this test tests.

Discussion: https://www.postgresql.org/message-id/b0a3b95b-82d2-6089-6892-40570f8c5e60%40iki.fi
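For reference, the heart of the fix is the wait-for-catchup step visible in the diff below. As a standalone Perl sketch (mirroring the added hunk, using the pre-10 pg_current_xlog_location()/pg_last_xlog_replay_location() function names these back branches use):

# Wait-for-catchup pattern added by this commit (mirrors the diff below).
# Ask the primary for its current WAL write position, then poll the standby
# until it has replayed at least up to that LSN, before promoting it.
my $until_lsn =
  $node_1->safe_psql('postgres', "SELECT pg_current_xlog_location();");
my $caughtup_query =
  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
$node_3->poll_query_until('postgres', $caughtup_query)
  or die "Timed out while waiting for standby to catch up";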
1 parent 89cdf1b commit a075c84

1 file changed: +34, -15 lines
src/bin/pg_rewind/t/008_min_recovery_point.pl

@@ -34,6 +34,7 @@
 use Test::More tests => 3;
 
 use File::Copy;
+use File::Path qw(rmtree);
 
 my $tmp_folder = TestLib::tempdir;
 
@@ -50,63 +51,81 @@
 $node_1->safe_psql('postgres', 'CREATE TABLE public.bar (t TEXT)');
 $node_1->safe_psql('postgres', "INSERT INTO public.bar VALUES ('in both')");
 
-
-# Take backup
+#
+# Create node_2 and node_3 as standbys following node_1
+#
 my $backup_name = 'my_backup';
 $node_1->backup($backup_name);
 
-# Create streaming standby from backup
 my $node_2 = get_new_node('node_2');
 $node_2->init_from_backup($node_1, $backup_name,
 	has_streaming => 1);
 $node_2->start;
 
-# Create streaming standby from backup
 my $node_3 = get_new_node('node_3');
 $node_3->init_from_backup($node_1, $backup_name,
 	has_streaming => 1);
 $node_3->start;
 
-# Stop node_1
+# Wait until node 3 has connected and caught up
+my $until_lsn =
+  $node_1->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+my $caughtup_query =
+  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_3->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to catch up";
 
+#
+# Swap the roles of node_1 and node_3, so that node_1 follows node_3.
+#
 $node_1->stop('fast');
-
-# Promote node_3
 $node_3->promote;
 
-# node_1 rejoins node_3
+# reconfigure node_1 as a standby following node_3
+rmtree $node_1->data_dir;
+$node_1->init_from_backup($node_1, $backup_name);
 
 my $node_3_connstr = $node_3->connstr;
-
-unlink($node_2->data_dir . '/recovery.conf');
+unlink($node_1->data_dir . '/recovery.conf');
 $node_1->append_conf('recovery.conf', qq(
 standby_mode=on
-primary_conninfo='$node_3_connstr'
+primary_conninfo='$node_3_connstr application_name=node_1'
 recovery_target_timeline='latest'
 ));
 $node_1->start();
 
-# node_2 follows node_3
-
+# also reconfigure node_2 to follow node_3
 unlink($node_2->data_dir . '/recovery.conf');
 $node_2->append_conf('recovery.conf', qq(
 standby_mode=on
-primary_conninfo='$node_3_connstr'
+primary_conninfo='$node_3_connstr application_name=node_2'
 recovery_target_timeline='latest'
 ));
 $node_2->restart();
 
-# Promote node_1
+#
+# Promote node_1, to create a split-brain scenario.
+#
+
+# make sure node_1 is full caught up with node_3 first
+$until_lsn =
+  $node_3->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+$caughtup_query =
+  "SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_1->poll_query_until('postgres', $caughtup_query)
+  or die "Timed out while waiting for standby to catch up";
 
 $node_1->promote;
 
 # Wait until nodes 1 and 3 have been fully promoted.
 $node_1->poll_query_until('postgres', "SELECT pg_is_in_recovery() <> true");
 $node_3->poll_query_until('postgres', "SELECT pg_is_in_recovery() <> true");
 
+#
 # We now have a split-brain with two primaries. Insert a row on both to
 # demonstratively create a split brain. After the rewind, we should only
 # see the insert on 1, as the insert on node 3 is rewound away.
+#
 $node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('keep this')");
 
 # Insert more rows in node 1, to bump up the XID counter. Otherwise, if
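For orientation only, here is a hedged sketch of the kind of pg_rewind step this setup feeds into: node_2, which has been following node_3, is resynchronized with node_1 so that node_3's diverged insert is rewound away. The node roles are inferred from the comments above; the exact invocation is illustrative, not the committed test code.

# Illustrative sketch, not the committed test code: stop both clusters and
# run an offline pg_rewind, resynchronizing node_2's data directory with
# node_1 (the surviving primary in the split-brain scenario above).
$node_1->stop('fast');
$node_2->stop('fast');
command_ok(
	[
		'pg_rewind',
		'--source-pgdata=' . $node_1->data_dir,
		'--target-pgdata=' . $node_2->data_dir
	],
	'pg_rewind runs');

After such a rewind, node_2 would be reconfigured as a standby of node_1 and restarted, at which point the 'keep this' row should be visible while the row inserted on node_3 should be gone.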
