Commit 6c6d6ba

Committed by Amit Kapila
Fix the Drop Database hang.

The drop database command waits for the logical replication sync worker to accept ProcSignalBarrier and the worker's slot creation waits for the drop database to finish which leads to a deadlock. This happens because the tablesync worker holds interrupts while creating a slot.

We prevent cancel/die interrupts while creating a slot in the table sync worker because it is possible that before the server finishes this command, a concurrent drop subscription happens which would complete without removing this slot and that leads to the slot existing until the end of walsender. However, the slot will eventually get dropped at the walsender exit time, so there is no danger of the dangling slot.

This patch reallows cancel/die interrupts while creating a slot and modifies the test to wait for slots to become zero to prevent finding an ephemeral slot.

The reported hang doesn't happen in PG14 as the drop database starts to wait for ProcSignalBarrier with PG15 (commits 4eb2176 and e2f65f4) but it is good to backpatch this till PG14 as it is not a good idea to prevent interrupts during a network call that could block indefinitely.

Reported-by: Lakshmi Narayanan Sreethar
Diagnosed-by: Andres Freund
Author: Hou Zhijie
Reviewed-by: Vignesh C, Amit Kapila
Backpatch-through: 14, where it was introduced in commit 6b67d72
Discussion: https://postgr.es/m/CA+kvmZELXQ4ZD3U=XCXuG3KvFgkuPoN1QrEj8c-rMRodrLOnsg@mail.gmail.com
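
The deadlock described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not PostgreSQL code: the Worker struct and the functions hold_interrupts, resume_interrupts, check_for_interrupts and create_slot_completes are hypothetical stand-ins, and the real dependency between slot creation and DROP DATABASE is reduced to a single flag. It shows why a worker that holds interrupts for the whole slot-creation wait never absorbs the pending barrier, so neither side can make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a tablesync worker; not the real backend state. */
typedef struct Worker
{
    int  holdoff_count;    /* > 0 means interrupt processing is deferred */
    bool barrier_pending;  /* DROP DATABASE has emitted a barrier */
    bool barrier_absorbed; /* the worker has acknowledged the barrier */
} Worker;

static void hold_interrupts(Worker *w)   { w->holdoff_count++; }
static void resume_interrupts(Worker *w) { w->holdoff_count--; }

/* Interrupt check: nothing is processed while interrupts are held. */
static void check_for_interrupts(Worker *w)
{
    if (w->holdoff_count != 0)
        return;
    if (w->barrier_pending)
    {
        w->barrier_pending = false;
        w->barrier_absorbed = true;
    }
}

/*
 * Model the slot-creation wait for at most max_ticks iterations.
 * Assumption: slot creation can only finish once DROP DATABASE has
 * finished, and DROP DATABASE only finishes once the worker has
 * absorbed the barrier. Returns true if slot creation completes.
 */
bool create_slot_completes(bool hold_during_create, int max_ticks)
{
    Worker w = {0, true, false};
    bool   created = false;

    assert(max_ticks >= 0);

    if (hold_during_create)
        hold_interrupts(&w);

    for (int tick = 0; tick < max_ticks && !created; tick++)
    {
        check_for_interrupts(&w);   /* the wait loop polls for interrupts */
        if (w.barrier_absorbed)
            created = true;         /* DROP DATABASE done, slot created */
    }

    if (hold_during_create)
        resume_interrupts(&w);

    return created;
}
```

In this model, create_slot_completes(false, 10) returns true, while create_slot_completes(true, 10) returns false no matter how many ticks are allowed, which mirrors the hang that the patch removes by dropping the hold around slot creation.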
1 parent 728f86f commit 6c6d6ba

2 files changed: 7 additions, 10 deletions

src/backend/replication/logical/tablesync.c

Lines changed: 0 additions & 7 deletions
@@ -1396,17 +1396,10 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
      * Create a new permanent logical decoding slot. This slot will be used
      * for the catchup phase after COPY is done, so tell it to use the
      * snapshot to make the final data consistent.
-     *
-     * Prevent cancel/die interrupts while creating slot here because it is
-     * possible that before the server finishes this command, a concurrent
-     * drop subscription happens which would complete without removing this
-     * slot leading to a dangling slot on the server.
      */
-    HOLD_INTERRUPTS();
     walrcv_create_slot(LogRepWorkerWalRcvConn,
                        slotname, false /* permanent */ , false /* two_phase */ ,
                        CRS_USE_SNAPSHOT, origin_startpos);
-    RESUME_INTERRUPTS();
 
     /*
      * Setup replication origin tracking. The purpose of doing this before the

src/test/subscription/t/004_sync.pl

Lines changed: 7 additions & 3 deletions
@@ -163,9 +163,13 @@
 # subscriber is stuck on data copy for constraint violation.
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 
-$result = $node_publisher->safe_psql('postgres',
-    "SELECT count(*) FROM pg_replication_slots");
-is($result, qq(0),
+# When DROP SUBSCRIPTION tries to drop the tablesync slot, the slot may not
+# have been created, which causes the slot to be created after the DROP
+# SUSCRIPTION finishes. Such slots eventually get dropped at walsender exit
+# time. So, to prevent being affected by such ephemeral tablesync slots, we
+# wait until all the slots have been cleaned.
+ok( $node_publisher->poll_query_until(
+        'postgres', 'SELECT count(*) = 0 FROM pg_replication_slots'),
     'DROP SUBSCRIPTION during error can clean up the slots on the publisher');
 
 $node_subscriber->stop('fast');