Commit beb6b45

Fix more race conditions in the newly-added pg_rewind test.
pg_rewind looks at the control file to check what timeline a server is on. But promotion doesn't immediately write a checkpoint, it merely writes an end-of-recovery WAL record. If pg_rewind runs immediately after promotion, before the checkpoint has completed, it will think that the server is still on the earlier timeline. We ran into this issue a long time ago already, see commit 484a848.

It's a bit bogus that pg_rewind doesn't determine the timeline correctly until the end-of-recovery checkpoint has completed. We probably should fix that. But for now work around it by waiting for the checkpoint to complete before running pg_rewind, like we did in commit 484a848.

In passing, tidy up the new test a little bit. Reorder the INSERTs so that the comments make more sense, remove a spurious CHECKPOINT call after pg_rewind has already run, and add the --debug option, so that if this fails again, we'll have more data.

Per buildfarm failure at https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2020-12-06%2018%3A32%3A19&stg=pg_rewind-check. Backpatch to all supported versions.

Discussion: https://www.postgresql.org/message-id/1713707e-e318-761c-d287-5b6a4aa807e8@iki.fi
1 parent 1dd608b commit beb6b45
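
The race described in the commit message can be illustrated with a small polling helper. This is only a sketch, not code from the commit: the name wait_for_control_file_timeline, its parameters, and the stand-in for a server are all hypothetical. The commit itself simply issues a synchronous CHECKPOINT after promotion, which makes polling unnecessary.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical helper: poll $get_timeline (a coderef returning the timeline
# ID currently recorded in the control file) until it reaches $expected_tli
# or the attempts run out. Returns 1 on success, 0 on timeout.
sub wait_for_control_file_timeline
{
	my ($get_timeline, $expected_tli, $max_attempts, $delay) = @_;
	$max_attempts //= 100;
	$delay        //= 0.1;

	for my $attempt (1 .. $max_attempts)
	{
		my $tli = $get_timeline->();
		return 1 if defined $tli && $tli >= $expected_tli;
		sleep($delay);
	}
	return 0;
}

# Stand-in for a real server (an assumption for illustration): reports
# timeline 1 for the first two calls, then timeline 2, as if the
# end-of-recovery checkpoint had just completed.
my $calls = 0;
my $fake_control_file = sub { return ++$calls < 3 ? 1 : 2 };

my $ok = wait_for_control_file_timeline($fake_control_file, 2, 10, 0.01);
print $ok ? "control file is on timeline 2\n" : "timed out\n";
```

In a TAP test the coderef might wrap something like $node->safe_psql('postgres', 'SELECT timeline_id FROM pg_control_checkpoint()'), assuming a server version that provides pg_control_checkpoint() (PostgreSQL 9.6 or later). Forcing the checkpoint, as this commit does, is simpler and avoids the polling loop altogether.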

1 file changed: +15 -7 lines changed

src/bin/pg_rewind/t/008_min_recovery_point.pl

Lines changed: 15 additions & 7 deletions
@@ -80,6 +80,13 @@
 #
 $node_1->stop('fast');
 $node_3->promote;
+# Force a checkpoint after the promotion. pg_rewind looks at the control
+# file to determine what timeline the server is on, and that isn't updated
+# immediately at promotion, but only at the next checkpoint. When running
+# pg_rewind in remote mode, it's possible that we complete the test steps
+# after promotion so quickly that when pg_rewind runs, the standby has not
+# performed a checkpoint after promotion yet.
+$node_3->safe_psql('postgres', "checkpoint");
 
 # reconfigure node_1 as a standby following node_3
 rmtree $node_1->data_dir;
@@ -116,6 +123,8 @@
 or die "Timed out while waiting for standby to catch up";
 
 $node_1->promote;
+# Force a checkpoint after promotion, like earlier.
+$node_1->safe_psql('postgres', "checkpoint");
 
 # Wait until nodes 1 and 3 have been fully promoted.
 $node_1->poll_query_until('postgres', "SELECT pg_is_in_recovery() <> true");
@@ -127,6 +136,9 @@
 # see the insert on 1, as the insert on node 3 is rewound away.
 #
 $node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('keep this')");
+# 'bar' is unmodified in node 1, so it won't be overwritten by replaying the
+# WAL from node 1.
+$node_3->safe_psql('postgres', "INSERT INTO public.bar (t) VALUES ('rewind this')");
 
 # Insert more rows in node 1, to bump up the XID counter. Otherwise, if
 # rewind doesn't correctly rewind the changes made on the other node,
@@ -135,10 +147,6 @@
 $node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('and this')");
 $node_1->safe_psql('postgres', "INSERT INTO public.foo (t) VALUES ('and this too')");
 
-# Also insert a row in 'bar' on node 3. It is unmodified in node 1, so it won't get
-# overwritten by replaying the WAL from node 1.
-$node_3->safe_psql('postgres', "INSERT INTO public.bar (t) VALUES ('rewind this')");
-
 # Wait for node 2 to catch up
 $node_2->poll_query_until('postgres',
 q|SELECT COUNT(*) > 1 FROM public.bar|, 't');
@@ -160,9 +168,10 @@
 [
 'pg_rewind',
 "--source-server=$node_1_connstr",
-"--target-pgdata=$node_2_pgdata"
+"--target-pgdata=$node_2_pgdata",
+"--debug"
 ],
-'pg_rewind detects rewind needed');
+'run pg_rewind');
 
 # Now move back postgresql.conf with old settings
 move(
@@ -174,7 +183,6 @@
 # Check contents of the test tables after rewind. The rows inserted in node 3
 # before rewind should've been overwritten with the data from node 1.
 my $result;
-$result = $node_2->safe_psql('postgres', 'checkpoint');
 $result = $node_2->safe_psql('postgres', 'SELECT * FROM public.foo');
 is($result, qq(keep this
 and this
