Skip to content

Commit abfd192

Browse files
committed
Allow a streaming replication standby to follow a timeline switch.
Before this patch, streaming replication would refuse to start replicating if the timeline in the primary doesn't exactly match the standby. The situation where it doesn't match is when you have a master, and two standbys, and you promote one of the standbys to become new master. Promoting bumps up the timeline ID, and after that bump, the other standby would refuse to continue. There's significantly more timeline related logic in streaming replication now. First of all, when a standby connects to primary, it will ask the primary for any timeline history files that are missing from the standby. The missing files are sent using a new replication command TIMELINE_HISTORY, and stored in standby's pg_xlog directory. Using the timeline history files, the standby can follow the latest timeline present in the primary (recovery_target_timeline='latest'), just as it can follow new timelines appearing in an archive directory. START_REPLICATION now takes a TIMELINE parameter, to specify exactly which timeline to stream WAL from. This allows the standby to request the primary to send over WAL that precedes the promotion. The replication protocol is changed slightly (in a backwards-compatible way although there's little hope of streaming replication working across major versions anyway), to allow replication to stop when the end of timeline reached, putting the walsender back into accepting a replication command. Many thanks to Amit Kapila for testing and reviewing various versions of this patch.
1 parent 5276687 commit abfd192

23 files changed

+1396
-370
lines changed

doc/src/sgml/high-availability.sgml

+3-4
Original file line numberDiff line numberDiff line change
@@ -912,10 +912,9 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
912912
</para>
913913

914914
<para>
915-
Promoting a cascading standby terminates the immediate downstream replication
916-
connections which it serves. This is because the timeline becomes different
917-
between standbys, and they can no longer continue replication. The
918-
affected standby(s) may reconnect to reestablish streaming replication.
915+
If an upstream standby server is promoted to become new master, downstream
916+
servers will continue to stream from the new master if
917+
<varname>recovery_target_timeline</> is set to <literal>'latest'</>.
919918
</para>
920919

921920
<para>

doc/src/sgml/protocol.sgml

+67-10
Original file line numberDiff line numberDiff line change
@@ -1018,14 +1018,21 @@
10181018
</para>
10191019

10201020
<para>
1021-
There is another Copy-related mode called Copy-both, which allows
1021+
There is another Copy-related mode called copy-both, which allows
10221022
high-speed bulk data transfer to <emphasis>and</> from the server.
10231023
Copy-both mode is initiated when a backend in walsender mode
10241024
executes a <command>START_REPLICATION</command> statement. The
10251025
backend sends a CopyBothResponse message to the frontend. Both
10261026
the backend and the frontend may then send CopyData messages
1027-
until the connection is terminated. See <xref
1028-
linkend="protocol-replication">.
1027+
until either end sends a CopyDone message. After the client
1028+
sends a CopyDone message, the connection goes from copy-both mode to
1029+
copy-out mode, and the client may not send any more CopyData messages.
1030+
Similarly, when the server sends a CopyDone message, the connection
1031+
goes into copy-in mode, and the server may not send any more CopyData
1032+
messages. After both sides have sent a CopyDone message, the copy mode
1033+
is terminated, and the backend reverts to the command-processing mode.
1034+
See <xref linkend="protocol-replication"> for more information on the
1035+
subprotocol transmitted over copy-both mode.
10291036
</para>
10301037

10311038
<para>
@@ -1350,19 +1357,69 @@ The commands accepted in walsender mode are:
13501357
</varlistentry>
13511358

13521359
<varlistentry>
1353-
<term>START_REPLICATION <replaceable>XXX</>/<replaceable>XXX</></term>
1360+
<term>TIMELINE_HISTORY <replaceable class="parameter">tli</replaceable></term>
1361+
<listitem>
1362+
<para>
1363+
Requests the server to send over the timeline history file for timeline
1364+
<replaceable class="parameter">tli</replaceable>. Server replies with a
1365+
result set of a single row, containing two fields:
1366+
</para>
1367+
1368+
<para>
1369+
<variablelist>
1370+
<varlistentry>
1371+
<term>
1372+
filename
1373+
</term>
1374+
<listitem>
1375+
<para>
1376+
Filename of the timeline history file, e.g 00000002.history.
1377+
</para>
1378+
</listitem>
1379+
</varlistentry>
1380+
1381+
<varlistentry>
1382+
<term>
1383+
content
1384+
</term>
1385+
<listitem>
1386+
<para>
1387+
Contents of the timeline history file.
1388+
</para>
1389+
</listitem>
1390+
</varlistentry>
1391+
1392+
</variablelist>
1393+
</para>
1394+
</listitem>
1395+
</varlistentry>
1396+
1397+
<varlistentry>
1398+
<term>START_REPLICATION <replaceable class="parameter">XXX/XXX</> TIMELINE <replaceable class="parameter">tli</></term>
13541399
<listitem>
13551400
<para>
13561401
Instructs server to start streaming WAL, starting at
1357-
WAL position <replaceable>XXX</>/<replaceable>XXX</>.
1402+
WAL position <replaceable class="parameter">XXX/XXX</> on timeline
1403+
<replaceable class="parameter">tli</>.
13581404
The server can reply with an error, e.g. if the requested section of WAL
13591405
has already been recycled. On success, server responds with a
13601406
CopyBothResponse message, and then starts to stream WAL to the frontend.
1361-
WAL will continue to be streamed until the connection is broken;
1362-
no further commands will be accepted. If the WAL sender process is
1363-
terminated normally (during postmaster shutdown), it will send a
1364-
CommandComplete message before exiting. This might not happen during an
1365-
abnormal shutdown, of course.
1407+
</para>
1408+
1409+
<para>
1410+
If the client requests a timeline that's not the latest, but is part of
1411+
the history of the server, the server will stream all the WAL on that
1412+
timeline starting from the requested startpoint, up to the point where
1413+
the server switched to another timeline. If the client requests
1414+
streaming at exactly the end of an old timeline, the server responds
1415+
immediately with CommandComplete without entering COPY mode.
1416+
</para>
1417+
1418+
<para>
1419+
After streaming all the WAL on a timeline that is not the latest one,
1420+
the server will end streaming by exiting the COPY mode. When the client
1421+
acknowledges this by also exiting COPY mode, the server responds with a
1422+
CommandComplete message, and is ready to accept a new command.
13661423
</para>
13671424

13681425
<para>

src/backend/access/transam/timeline.c

+83
Original file line numberDiff line numberDiff line change
@@ -410,6 +410,89 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
410410
XLogArchiveNotify(histfname);
411411
}
412412

413+
/*
414+
* Writes a history file for given timeline and contents.
415+
*
416+
* Currently this is only used in the walreceiver process, and so there are
417+
* no locking considerations. But we should be just as tense as XLogFileInit
418+
* to avoid emplacing a bogus file.
419+
*/
420+
void
421+
writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
422+
{
423+
char path[MAXPGPATH];
424+
char tmppath[MAXPGPATH];
425+
int fd;
426+
427+
/*
428+
* Write into a temp file name.
429+
*/
430+
snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
431+
432+
unlink(tmppath);
433+
434+
/* do not use get_sync_bit() here --- want to fsync only at end of fill */
435+
fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL,
436+
S_IRUSR | S_IWUSR);
437+
if (fd < 0)
438+
ereport(ERROR,
439+
(errcode_for_file_access(),
440+
errmsg("could not create file \"%s\": %m", tmppath)));
441+
442+
errno = 0;
443+
if ((int) write(fd, content, size) != size)
444+
{
445+
int save_errno = errno;
446+
447+
/*
448+
* If we fail to make the file, delete it to release disk space
449+
*/
450+
unlink(tmppath);
451+
/* if write didn't set errno, assume problem is no disk space */
452+
errno = save_errno ? save_errno : ENOSPC;
453+
454+
ereport(ERROR,
455+
(errcode_for_file_access(),
456+
errmsg("could not write to file \"%s\": %m", tmppath)));
457+
}
458+
459+
if (pg_fsync(fd) != 0)
460+
ereport(ERROR,
461+
(errcode_for_file_access(),
462+
errmsg("could not fsync file \"%s\": %m", tmppath)));
463+
464+
if (CloseTransientFile(fd))
465+
ereport(ERROR,
466+
(errcode_for_file_access(),
467+
errmsg("could not close file \"%s\": %m", tmppath)));
468+
469+
470+
/*
471+
* Now move the completed history file into place with its final name.
472+
*/
473+
TLHistoryFilePath(path, tli);
474+
475+
/*
476+
* Prefer link() to rename() here just to be really sure that we don't
477+
* overwrite an existing logfile. However, there shouldn't be one, so
478+
* rename() is an acceptable substitute except for the truly paranoid.
479+
*/
480+
#if HAVE_WORKING_LINK
481+
if (link(tmppath, path) < 0)
482+
ereport(ERROR,
483+
(errcode_for_file_access(),
484+
errmsg("could not link file \"%s\" to \"%s\": %m",
485+
tmppath, path)));
486+
unlink(tmppath);
487+
#else
488+
if (rename(tmppath, path) < 0)
489+
ereport(ERROR,
490+
(errcode_for_file_access(),
491+
errmsg("could not rename file \"%s\" to \"%s\": %m",
492+
tmppath, path)));
493+
#endif
494+
}
495+
413496
/*
414497
* Returns true if 'expectedTLEs' contains a timeline with id 'tli'
415498
*/

src/backend/access/transam/xlog.c

+30-25
Original file line numberDiff line numberDiff line change
@@ -153,6 +153,7 @@ static XLogRecPtr LastRec;
153153

154154
/* Local copy of WalRcv->receivedUpto */
155155
static XLogRecPtr receivedUpto = 0;
156+
static TimeLineID receiveTLI = 0;
156157

157158
/*
158159
* During recovery, lastFullPageWrites keeps track of full_page_writes that
@@ -6366,6 +6367,12 @@ StartupXLOG(void)
63666367
xlogctl->SharedRecoveryInProgress = false;
63676368
SpinLockRelease(&xlogctl->info_lck);
63686369
}
6370+
6371+
/*
6372+
* If there were cascading standby servers connected to us, nudge any
6373+
* wal sender processes to notice that we've been promoted.
6374+
*/
6375+
WalSndWakeup();
63696376
}
63706377

63716378
/*
@@ -7626,7 +7633,7 @@ CreateRestartPoint(int flags)
76267633
XLogRecPtr endptr;
76277634

76287635
/* Get the current (or recent) end of xlog */
7629-
endptr = GetStandbyFlushRecPtr(NULL);
7636+
endptr = GetStandbyFlushRecPtr();
76307637

76317638
KeepLogSeg(endptr, &_logSegNo);
76327639
_logSegNo--;
@@ -9087,22 +9094,17 @@ do_pg_abort_backup(void)
90879094
/*
90889095
* Get latest redo apply position.
90899096
*
9090-
* Optionally, returns the current recovery target timeline. Callers not
9091-
* interested in that may pass NULL for targetTLI.
9092-
*
90939097
* Exported to allow WALReceiver to read the pointer directly.
90949098
*/
90959099
XLogRecPtr
9096-
GetXLogReplayRecPtr(TimeLineID *targetTLI)
9100+
GetXLogReplayRecPtr(void)
90979101
{
90989102
/* use volatile pointer to prevent code rearrangement */
90999103
volatile XLogCtlData *xlogctl = XLogCtl;
91009104
XLogRecPtr recptr;
91019105

91029106
SpinLockAcquire(&xlogctl->info_lck);
91039107
recptr = xlogctl->lastReplayedEndRecPtr;
9104-
if (targetTLI)
9105-
*targetTLI = xlogctl->RecoveryTargetTLI;
91069108
SpinLockRelease(&xlogctl->info_lck);
91079109

91089110
return recptr;
@@ -9111,18 +9113,15 @@ GetXLogReplayRecPtr(TimeLineID *targetTLI)
91119113
/*
91129114
* Get current standby flush position, ie, the last WAL position
91139115
* known to be fsync'd to disk in standby.
9114-
*
9115-
* If 'targetTLI' is not NULL, it's set to the current recovery target
9116-
* timeline.
91179116
*/
91189117
XLogRecPtr
9119-
GetStandbyFlushRecPtr(TimeLineID *targetTLI)
9118+
GetStandbyFlushRecPtr(void)
91209119
{
91219120
XLogRecPtr receivePtr;
91229121
XLogRecPtr replayPtr;
91239122

9124-
receivePtr = GetWalRcvWriteRecPtr(NULL);
9125-
replayPtr = GetXLogReplayRecPtr(targetTLI);
9123+
receivePtr = GetWalRcvWriteRecPtr(NULL, NULL);
9124+
replayPtr = GetXLogReplayRecPtr();
91269125

91279126
if (XLByteLT(receivePtr, replayPtr))
91289127
return replayPtr;
@@ -9611,7 +9610,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
96119610
* archive and pg_xlog before failover.
96129611
*/
96139612
if (CheckForStandbyTrigger())
9613+
{
9614+
ShutdownWalRcv();
96149615
return false;
9616+
}
96159617

96169618
/*
96179619
* If primary_conninfo is set, launch walreceiver to try to
@@ -9626,8 +9628,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
96269628
if (PrimaryConnInfo)
96279629
{
96289630
XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
9629-
9630-
RequestXLogStreaming(ptr, PrimaryConnInfo);
9631+
TimeLineID tli = tliOfPointInHistory(ptr, expectedTLEs);
9632+
9633+
if (curFileTLI > 0 && tli < curFileTLI)
9634+
elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
9635+
(uint32) (ptr >> 32), (uint32) ptr,
9636+
tli, curFileTLI);
9637+
curFileTLI = tli;
9638+
RequestXLogStreaming(curFileTLI, ptr, PrimaryConnInfo);
96319639
}
96329640
/*
96339641
* Move to XLOG_FROM_STREAM state in either case. We'll get
@@ -9653,10 +9661,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
96539661
*/
96549662
/*
96559663
* Before we leave XLOG_FROM_STREAM state, make sure that
9656-
* walreceiver is not running, so that it won't overwrite
9657-
* any WAL that we restore from archive.
9664+
* walreceiver is not active, so that it won't overwrite
9665+
* WAL that we restore from archive.
96589666
*/
9659-
if (WalRcvInProgress())
9667+
if (WalRcvStreaming())
96609668
ShutdownWalRcv();
96619669

96629670
/*
@@ -9749,7 +9757,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
97499757
/*
97509758
* Check if WAL receiver is still active.
97519759
*/
9752-
if (!WalRcvInProgress())
9760+
if (!WalRcvStreaming())
97539761
{
97549762
lastSourceFailed = true;
97559763
break;
@@ -9772,8 +9780,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
97729780
{
97739781
XLogRecPtr latestChunkStart;
97749782

9775-
receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
9776-
if (XLByteLT(RecPtr, receivedUpto))
9783+
receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
9784+
if (XLByteLT(RecPtr, receivedUpto) && receiveTLI == curFileTLI)
97779785
{
97789786
havedata = true;
97799787
if (!XLByteLT(RecPtr, latestChunkStart))
@@ -9888,8 +9896,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
98889896

98899897
/*
98909898
* Check to see whether the user-specified trigger file exists and whether a
9891-
* promote request has arrived. If either condition holds, request postmaster
9892-
* to shut down walreceiver, wait for it to exit, and return true.
9899+
* promote request has arrived. If either condition holds, return true.
98939900
*/
98949901
static bool
98959902
CheckForStandbyTrigger(void)
@@ -9904,7 +9911,6 @@ CheckForStandbyTrigger(void)
99049911
{
99059912
ereport(LOG,
99069913
(errmsg("received promote request")));
9907-
ShutdownWalRcv();
99089914
ResetPromoteTriggered();
99099915
triggered = true;
99109916
return true;
@@ -9917,7 +9923,6 @@ CheckForStandbyTrigger(void)
99179923
{
99189924
ereport(LOG,
99199925
(errmsg("trigger file found: %s", TriggerFile)));
9920-
ShutdownWalRcv();
99219926
unlink(TriggerFile);
99229927
triggered = true;
99239928
return true;

src/backend/access/transam/xlogfuncs.c

+2-2
Original file line numberDiff line numberDiff line change
@@ -226,7 +226,7 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
226226
XLogRecPtr recptr;
227227
char location[MAXFNAMELEN];
228228

229-
recptr = GetWalRcvWriteRecPtr(NULL);
229+
recptr = GetWalRcvWriteRecPtr(NULL, NULL);
230230

231231
if (recptr == 0)
232232
PG_RETURN_NULL();
@@ -248,7 +248,7 @@ pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
248248
XLogRecPtr recptr;
249249
char location[MAXFNAMELEN];
250250

251-
recptr = GetXLogReplayRecPtr(NULL);
251+
recptr = GetXLogReplayRecPtr();
252252

253253
if (recptr == 0)
254254
PG_RETURN_NULL();

0 commit comments

Comments
 (0)