Commit cc635a0

sdudoladov (Sergey Dudoladov) and FxKu authored
Lazy upgrade of the Spilo image (zalando#859)
* initial implementation
* describe forcing the rolling upgrade
* make parameter name more descriptive
* add missing pieces
* address review
* address review
* fix bug in e2e tests
* fix cluster name label in e2e test
* raise test timeout
* load spilo test image
* use available spilo image
* delete replica pod for lazy update test
* fix e2e
* fix e2e with a vengeance
* lets wait for another 30m
* print pod name in error msg
* print pod name in error msg 2
* raise timeout, comment other tests
* subsequent updates of config
* add comma
* fix e2e test
* run unit tests before e2e
* remove conflicting dependency
* Revert "remove conflicting dependency"
  This reverts commit 65fc090.
* improve cdp build
* dont run unit before e2e tests
* Revert "improve cdp build"
  This reverts commit e2a8fa1.

Co-authored-by: Sergey Dudoladov <sergey.dudoladov@zalando.de>
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
1 parent 0016ebf commit cc635a0

19 files changed, +220 -42 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -97,4 +97,4 @@ test:
 	GO111MODULE=on go test ./...

 e2e: docker # build operator image to be tested
-	cd e2e; make tools test clean
+	cd e2e; make tools e2etest clean

charts/postgres-operator/crds/operatorconfigurations.yaml

Lines changed: 2 additions & 0 deletions
@@ -62,6 +62,8 @@ spec:
   type: string
 enable_crd_validation:
   type: boolean
+enable_lazy_spilo_upgrade:
+  type: boolean
 enable_shm_volume:
   type: boolean
 etcd_host:

charts/postgres-operator/values-crd.yaml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ configTarget: "OperatorConfigurationCRD"
 configGeneral:
   # choose if deployment creates/updates CRDs with OpenAPIV3Validation
   enable_crd_validation: true
+  # update only the statefulsets without immediately doing the rolling update
+  enable_lazy_spilo_upgrade: false
   # start any new database pod without limitations on shm memory
   enable_shm_volume: true
   # etcd connection string for Patroni. Empty uses K8s-native DCS.

charts/postgres-operator/values.yaml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ configTarget: "ConfigMap"
 configGeneral:
   # choose if deployment creates/updates CRDs with OpenAPIV3Validation
   enable_crd_validation: "true"
+  # update only the statefulsets without immediately doing the rolling update
+  enable_lazy_spilo_upgrade: "false"
   # start any new database pod without limitations on shm memory
   enable_shm_volume: "true"
   # etcd connection string for Patroni. Empty uses K8s-native DCS.
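
Note on the two values files: with `configTarget: "ConfigMap"` the flag is written as a quoted string, while with `configTarget: "OperatorConfigurationCRD"` it is a plain boolean. The sketch below shows one way to produce the right representation from a single Python boolean. The function name `lazy_upgrade_setting` is hypothetical and is not part of the chart or the operator.

```python
def lazy_upgrade_setting(enabled, config_target):
    """Return the enable_lazy_spilo_upgrade entry formatted for a config target.

    Hypothetical helper for illustration only. The two target names mirror
    the configTarget values used in the chart's values files.
    """
    key = "enable_lazy_spilo_upgrade"
    if config_target == "ConfigMap":
        # ConfigMap data values are strings, hence "true" / "false".
        return {key: "true" if enabled else "false"}
    if config_target == "OperatorConfigurationCRD":
        # The CRD schema declares the field as a boolean.
        return {key: bool(enabled)}
    raise ValueError("unknown config target: {!r}".format(config_target))
```

The e2e test in this commit patches the ConfigMap, which is why it passes `"true"` and `"false"` as strings.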

docs/administrator.md

Lines changed: 11 additions & 0 deletions
@@ -458,6 +458,17 @@ from numerous escape characters in the latter log entry, view it in CLI with
 `PodTemplate` used by the operator is yet to be updated with the default values
 used internally in K8s.

+The operator also supports lazy updates of the Spilo image. That means the pod
+template of a PG cluster's stateful set is updated immediately with the new
+image, but no rolling update follows. This feature saves you a switchover, and
+hence downtime, when you know pods will be restarted later anyway, for instance
+due to node rotation. To force a rolling update, disable this mode by setting
+`enable_lazy_spilo_upgrade` to `false` in the operator configuration and
+restarting the operator pod. With the standard eager rolling updates, the
+operator checks during Sync that all pods run the images specified in their
+respective statefulsets. The operator triggers a rolling upgrade for PG
+clusters that violate this condition.
+
 ## Logical backups

 The operator can manage K8s cron jobs to run logical backups of Postgres
docs/reference/operator_parameters.md

Lines changed: 4 additions & 0 deletions
@@ -75,6 +75,10 @@ Those are top-level keys, containing both leaf keys and groups.
   [OpenAPI v3 schema validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation)
   The default is `true`.

+* **enable_lazy_spilo_upgrade**
+  Instructs the operator to update only the statefulsets with the new image without immediately performing a rolling update. The assumption is that pods will be restarted later with the new image, for example due to node rotation.
+  The default is `false`.
+
 * **etcd_host**
   Etcd connection string for Patroni defined as `host:port`. Not required when
   Patroni native Kubernetes support is used. The default is empty (use

e2e/Makefile

Lines changed: 1 addition & 1 deletion
@@ -44,5 +44,5 @@ tools: docker
 	# install pinned version of 'kind'
 	GO111MODULE=on go get sigs.k8s.io/kind@v0.5.1

-test:
+e2etest:
 	./run.sh

e2e/tests/test_e2e.py

Lines changed: 70 additions & 10 deletions
@@ -142,15 +142,6 @@ def test_enable_disable_connection_pooler(self):
                 })
             k8s.wait_for_pods_to_stop(pod_selector)

-            k8s.api.custom_objects_api.patch_namespaced_custom_object(
-                'acid.zalan.do', 'v1', 'default',
-                'postgresqls', 'acid-minimal-cluster',
-                {
-                    'spec': {
-                        'enableConnectionPooler': True,
-                    }
-                })
-            k8s.wait_for_pod_start(pod_selector)
         except timeout_decorator.TimeoutError:
             print('Operator log: {}'.format(k8s.get_operator_log()))
             raise
@@ -204,6 +195,66 @@ def test_enable_load_balancer(self):
         self.assertEqual(repl_svc_type, 'ClusterIP',
                          "Expected ClusterIP service type for replica, found {}".format(repl_svc_type))

+    @timeout_decorator.timeout(TEST_TIMEOUT_SEC)
+    def test_lazy_spilo_upgrade(self):
+        '''
+        Test lazy upgrade for the Spilo image: operator changes a stateful set but lets pods run with the old image
+        until they are recreated for reasons other than operator's activity. That works because the operator configures
+        stateful sets to use "OnDelete" pod update policy.
+
+        The test covers:
+        1) enabling lazy upgrade in existing operator deployment
+        2) forcing the normal rolling upgrade by changing the operator configmap and restarting its pod
+        '''
+
+        k8s = self.k8s
+
+        # update docker image in config and enable the lazy upgrade
+        conf_image = "registry.opensource.zalan.do/acid/spilo-cdp-12:1.6-p114"
+        patch_lazy_spilo_upgrade = {
+            "data": {
+                "docker_image": conf_image,
+                "enable_lazy_spilo_upgrade": "true"
+            }
+        }
+        k8s.update_config(patch_lazy_spilo_upgrade)
+
+        pod0 = 'acid-minimal-cluster-0'
+        pod1 = 'acid-minimal-cluster-1'
+
+        # restart the pod to get a container with the new image
+        k8s.api.core_v1.delete_namespaced_pod(pod0, 'default')
+        time.sleep(60)
+
+        # lazy update works if the restarted pod and older pods run different Spilo versions
+        new_image = k8s.get_effective_pod_image(pod0)
+        old_image = k8s.get_effective_pod_image(pod1)
+        self.assertNotEqual(new_image, old_image, "Lazy update failed: pods have the same image {}".format(new_image))
+
+        # sanity check
+        assert_msg = "Image {} of a new pod differs from {} in operator conf".format(new_image, conf_image)
+        self.assertEqual(new_image, conf_image, assert_msg)
+
+        # clean up
+        unpatch_lazy_spilo_upgrade = {
+            "data": {
+                "enable_lazy_spilo_upgrade": "false",
+            }
+        }
+        k8s.update_config(unpatch_lazy_spilo_upgrade)
+
+        # at this point operator will complete the normal rolling upgrade
+        # so we additionally test if disabling the lazy upgrade - forcing the normal rolling upgrade - works
+
+        # XXX there is no easy way to wait until the end of Sync()
+        time.sleep(60)
+
+        image0 = k8s.get_effective_pod_image(pod0)
+        image1 = k8s.get_effective_pod_image(pod1)
+
+        assert_msg = "Disabling lazy upgrade failed: pods still have different images {} and {}".format(image0, image1)
+        self.assertEqual(image0, image1, assert_msg)
+
     @timeout_decorator.timeout(TEST_TIMEOUT_SEC)
     def test_logical_backup_cron_job(self):
         '''
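
Note on the fixed `time.sleep(60)` calls in `test_lazy_spilo_upgrade`: the test sleeps because, as its own comment says, there is no easy way to wait for the end of Sync(). One possible alternative, sketched here under assumptions, is to poll the observable condition (the pod images) with a deadline. The helper `wait_until` and its parameters are hypothetical and do not exist in the test suite.

```python
import time


def wait_until(predicate, timeout_sec=300, interval_sec=5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns True if the predicate succeeded, False on timeout. The sleep and
    clock arguments are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_sec
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_sec)
```

In the test this could be used as `wait_until(lambda: k8s.get_effective_pod_image(pod0) == k8s.get_effective_pod_image(pod1))` before the final assertion. A predicate used right after deleting a pod would also have to tolerate the pod being briefly absent, since `get_effective_pod_image` indexes `items[0]`.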
@@ -594,7 +645,7 @@ def get_pg_nodes(self, pg_cluster_name, namespace='default'):

     def wait_for_operator_pod_start(self):
         self. wait_for_pod_start("name=postgres-operator")
-        # HACK operator must register CRD / add existing PG clusters after pod start up
+        # HACK operator must register CRD and/or Sync existing PG clusters after start up
         # for local execution ~ 10 seconds suffices
         time.sleep(60)

@@ -724,6 +775,15 @@ def create_with_kubectl(self, path):
             stdout=subprocess.PIPE,
             stderr=subprocess.PIPE)

+    def get_effective_pod_image(self, pod_name, namespace='default'):
+        '''
+        Get the Spilo image pod currently uses. In case of lazy rolling updates
+        it may differ from the one specified in the stateful set.
+        '''
+        pod = self.api.core_v1.list_namespaced_pod(
+            namespace, label_selector="statefulset.kubernetes.io/pod-name=" + pod_name)
+        return pod.items[0].spec.containers[0].image
+

 if __name__ == '__main__':
     unittest.main()
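
Note on `get_effective_pod_image`: the helper takes the first pod returned by the label selector and the first container in its spec. That is enough for this test, but it raises a bare `IndexError` when no pod matches, and it depends on container order if the pod has sidecars. The sketch below shows a more defensive variant working on the same object shape (`items`, `spec.containers`, `name`, `image`). The function name `effective_image` and the container name `"postgres"` in the demo are assumptions, and `SimpleNamespace` objects stand in for the Kubernetes client's response.

```python
from types import SimpleNamespace


def effective_image(pod_list, container_name=None):
    """Return the image of a container in the first pod of pod_list.

    With container_name=None the first container is used, as in the test
    helper. Otherwise the container is selected by name.
    """
    if not pod_list.items:
        raise LookupError("no pod matched the selector")
    containers = pod_list.items[0].spec.containers
    if container_name is None:
        return containers[0].image
    for container in containers:
        if container.name == container_name:
            return container.image
    raise LookupError("no container named {!r}".format(container_name))


# Stand-in for the API response, for illustration only.
demo_pods = SimpleNamespace(items=[SimpleNamespace(spec=SimpleNamespace(containers=[
    SimpleNamespace(name="sidecar", image="example/sidecar:1"),
    SimpleNamespace(name="postgres", image="example/spilo:2"),
]))])
print(effective_image(demo_pods, "postgres"))
```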

go.mod

Lines changed: 10 additions & 6 deletions
@@ -4,16 +4,20 @@ go 1.14

 require (
 	github.com/aws/aws-sdk-go v1.29.33
+	github.com/emicklei/go-restful v2.9.6+incompatible // indirect
+	github.com/evanphx/json-patch v4.5.0+incompatible // indirect
+	github.com/googleapis/gnostic v0.3.0 // indirect
 	github.com/lib/pq v1.3.0
 	github.com/motomux/pretty v0.0.0-20161209205251-b2aad2c9a95d
 	github.com/r3labs/diff v0.0.0-20191120142937-b4ed99a31f5a
 	github.com/sirupsen/logrus v1.5.0
 	github.com/stretchr/testify v1.4.0
-	golang.org/x/tools v0.0.0-20200326210457-5d86d385bf88 // indirect
+	golang.org/x/tools v0.0.0-20200426102838-f3a5411a4c3b // indirect
 	gopkg.in/yaml.v2 v2.2.8
-	k8s.io/api v0.18.0
-	k8s.io/apiextensions-apiserver v0.18.0
-	k8s.io/apimachinery v0.18.0
-	k8s.io/client-go v0.18.0
-	k8s.io/code-generator v0.18.0
+	k8s.io/api v0.18.2
+	k8s.io/apiextensions-apiserver v0.18.2
+	k8s.io/apimachinery v0.18.2
+	k8s.io/client-go v11.0.0+incompatible
+	k8s.io/code-generator v0.18.2
+	sigs.k8s.io/kind v0.5.1 // indirect
 )
