Re-create pods only if all replicas are running #903

sdudoladov · 2020-04-07T07:45:30Z

The operator currently kills pods too eagerly, for example while a re-init is still running.

Here is a sequence of events leading to the problem:

1. Re-init starts on the replica.
2. A rolling update comes.
3. The operator kills the replica pod that is being re-initialized almost immediately, because it selects all pods belonging to a PG cluster for recreation (called from here) and later skips only the master.

Re-doing the re-init on the same pod later suffers from the same problem, because the rolling upgrade may never complete. This happens, for instance, when the replica on pod-0 is constantly being killed while pod-1 is the yet-to-be-updated master.

The current workaround for this problem is to manually shut down the operator, let the re-init complete and then start the operator again.

The PR fixes this by asking Patroni to confirm that there are no PG replicas in the creating-replica state before deleting any pod. If that check fails, pod re-creation is postponed until the next Sync.
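The check described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `memberStatus`, `parseState` and `safeToRecreate` are hypothetical, and the state string `"creating replica"` is an assumption about what Patroni reports for a member that is being (re-)initialized.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memberStatus holds the single field of the Patroni status document
// that this sketch cares about (hypothetical type, not from the PR).
type memberStatus struct {
	State string `json:"state"`
}

// stateCreatingReplica is the state string assumed to mark a member
// that is still being (re-)initialized.
const stateCreatingReplica = "creating replica"

// parseState extracts the "state" field from a Patroni JSON response body.
func parseState(body []byte) (string, error) {
	var s memberStatus
	if err := json.Unmarshal(body, &s); err != nil {
		return "", fmt.Errorf("could not unmarshal Patroni response: %v", err)
	}
	return s.State, nil
}

// safeToRecreate reports whether none of the given members (pod name -> state)
// is currently being created, i.e. whether pods may be deleted now.
func safeToRecreate(states map[string]string) bool {
	for _, state := range states {
		if state == stateCreatingReplica {
			return false
		}
	}
	return true
}

func main() {
	state, err := parseState([]byte(`{"state": "creating replica", "role": "replica"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	states := map[string]string{"pod-0": state, "pod-1": "running"}
	if !safeToRecreate(states) {
		fmt.Println("postponing pod re-creation until the next Sync")
	}
}
```

Returning a plain boolean keeps the decision to postpone in the caller, which matches the described behaviour of simply retrying on the next Sync rather than failing the update.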

@FxKu can you please link related issues?

sdudoladov · 2020-04-07T15:44:32Z

As of commit 84dc06bfb755482c41ee1ca096f52436b92e0ac7, this patch prevents incorrect termination of replicas in the following scenario (tested manually for now):

1. Get some data to make re-init long enough:
    pgbench -i -s 1000 -n &> pgbench.log # approx. 17 GB
2. Start re-init on the replica:
    patronictl reinit $SCOPE replica-name
3. Cause the rolling upgrade, for example by updating resources.

Under such circumstances the patched operator will report update success without actually re-creating the pods. The first Sync after the re-init terminates will detect the incomplete rolling update thanks to annotations and complete it.

Note that Patroni returns status code 503 for a GET on replica_ip:8008, so the presence of

 http get response: &{Status:503 Service Unavailable StatusCode:503 Proto:HTTP/1.0

in the logs is not an error in this case.
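To illustrate why a 503 need not be treated as a failure, here is a small sketch of a client that still reads the member state from such a response. It is not the operator's code: `fetchState`, `patroniStatus` and `newFakeReplica` are hypothetical names, the fake server stands in for a real replica, and the sketch assumes the 503 response carries a JSON body with a `state` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// patroniStatus holds the one field this sketch reads (hypothetical type).
type patroniStatus struct {
	State string `json:"state"`
}

// fetchState issues a GET and decodes the member state. Both 200 and 503
// are accepted, on the assumption that a replica answers 503 but still
// sends its status document in the body.
func fetchState(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", fmt.Errorf("could not reach Patroni: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusServiceUnavailable {
		return "", fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}

	var st patroniStatus
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		return "", fmt.Errorf("could not decode Patroni response: %v", err)
	}
	return st.State, nil
}

// newFakeReplica starts a local test server that mimics a replica:
// it always answers 503 with a small JSON status document.
func newFakeReplica(state string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintf(w, `{"state": %q, "role": "replica"}`, state)
	}))
}

func main() {
	srv := newFakeReplica("creating replica")
	defer srv.Close()

	state, err := fetchState(srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("state:", state)
}
```

Logging the raw response, as in the log line quoted above, remains useful for debugging; the sketch only shows that the status code alone does not decide whether the call succeeded.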

FxKu · 2020-04-20T12:54:02Z

👍

sdudoladov · 2020-04-20T13:11:52Z

👍

Sergey Dudoladov added 3 commits April 7, 2020 08:53

exclude extra dir for generated code

eb3f7fc

add a call to Patroni Api to fetch node state

8811dfd

add comments

af30a55

sdudoladov added bug in-progress discussion zalando labels Apr 7, 2020

sdudoladov added this to the 1.5 milestone Apr 7, 2020

sdudoladov requested review from avaczi, CyberDem0n, erthalion, FxKu, Jan-M and RafiaSabih as code owners April 7, 2020 07:45

sdudoladov assigned sdudoladov and FxKu Apr 7, 2020

Sergey Dudoladov added 2 commits April 7, 2020 09:54

fix logging format

ac46c95

fix json unmarshalling

84dc06b

Sergey Dudoladov added 4 commits April 8, 2020 08:25

make check condition more specific

7a37693

simplify json unmarshalling

5ae17aa

fix log message

3814e89

fix error description

3e63967

sdudoladov changed the title ~~[WIP] Re-create pods only if all replicas are running~~ Re-create pods only if all replicas are running Apr 9, 2020

Sergey Dudoladov and others added 6 commits April 9, 2020 10:40

add skeleton for e2e test

e9db2f0

disable e2e tests temporarliy

db30ab3

remove test skeleton

48c3d6a

fix flake8 violations

4b78ff5

Merge branch 'master' into safe-pod-delete

3b382f6

rename

0c5dfea

sdudoladov removed discussion in-progress labels Apr 20, 2020

remove dead code

5e35ea7

sdudoladov merged commit 3c91bde into master Apr 20, 2020

FxKu mentioned this pull request Apr 20, 2020

double check before recreate #923

Closed

FxKu mentioned this pull request May 4, 2020

Consider Postgres health during rolling updates #486

Closed
