databases keep restarting after the upgrade from 5.3 to 5.7.2 #4069

Open · batulziiy opened this issue Jan 13, 2025 · 7 comments

Comments

@batulziiy

Questions

I upgraded postgres-operator from 5.3 to 5.7.2 yesterday and it seemed to be working fine after the upgrade; I was able to connect to the database and run queries. However, this morning I found that the clusters keep restarting after running normally for ~1-2 minutes. I did the exact same upgrade on 2 different environments and one works fine while the other keeps restarting.

The only difference I found is that it works fine on the kube cluster with version v1.26.3+k3s1, while we have the issue on the cluster with version v1.25.4+k3s1. Not sure if that makes a great difference.

In the pgo pod log, I can see that the error below is generated:

time="2025-01-13T09:11:21Z" level=error msg="Reconciler error" PostgresCluster=postgres-operator/hippo controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="Operation cannot be fulfilled on Pod \"hippo-instance1-f9rv-0\": the ResourceVersion in the precondition (1043902053) does not match the ResourceVersion in record (1043902088). The object might have been modified" file="internal/controller/postgrescluster/instance.go:879" func="postgrescluster.(*Reconciler).rolloutInstance" name=hippo namespace=postgres-operator reconcileID=6f692192-9b74-4921-891e-ac91a86b00ea version=5.7.2-0

Also in the pod log, I got an exit code 137, which is a bit strange as I have enough memory resources on the cluster.

2025-01-13 09:19:49.116 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:19:59.121 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:20:09.127 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 09:20:14.758 UTC [98] LOG:  checkpoint complete: wrote 90256 buffers (2.9%); 0 WAL file(s) added, 0 removed, 58 recycled; write=159.616 s, sync=0.007 s, total=159.711 s; sync files=70, longest=0.004 s, average=0.001 s; distance=695145 kB, estimate=695145 kB
2025-01-13 09:20:14.759 UTC [98] LOG:  checkpoint starting: immediate force wait
2025-01-13 09:20:15.001 UTC [98] LOG:  checkpoint complete: wrote 11836 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.233 s, sync=0.005 s, total=0.243 s; sync files=10, longest=0.003 s, average=0.001 s; distance=100283 kB, estimate=635659 kB
2025-01-13 09:20:15.417 UTC [95] LOG:  received fast shutdown request
2025-01-13 09:20:15.418 UTC [95] LOG:  aborting any active transactions
2025-01-13 09:20:15.418 UTC [752] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [211] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [210] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [181] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [187] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [179] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [175] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [177] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [173] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [169] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [171] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [167] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [161] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [115] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.418 UTC [163] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.419 UTC [165] FATAL:  terminating connection due to administrator command
2025-01-13 09:20:15.420 UTC [95] LOG:  background worker "logical replication launcher" (PID 131) exited with exit code 1
2025-01-13 09:20:15.427 UTC [98] LOG:  shutting down
2025-01-13 09:20:15.456 UTC [98] LOG:  checkpoint starting: shutdown immediate
2025-01-13 09:20:15.477 UTC [98] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.015 s, sync=0.001 s, total=0.022 s; sync files=0, longest=0.000 s, average=0.000 s; distance=7386 kB, estimate=572832 kB
2025-01-13 09:20:16.278 UTC [95] LOG:  database system is shut down
command terminated with exit code 137
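As a side note, exit code 137 is 128 + 9, i.e. the container was killed with SIGKILL; that can come from the OOM killer, but also from the kubelet force-stopping the container (for example during a rollout), so it does not necessarily indicate memory pressure. One way to check what Kubernetes recorded as the last termination reason, reusing the pod name from the logs above:

# "OOMKilled" would point at memory; other reasons point at an external kill
kubectl get pod hippo-instance1-f9rv-0 -n postgres-operator \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl describe pod hippo-instance1-f9rv-0 -n postgres-operator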

Have you ever had the same experience? I would appreciate it if you could share your thoughts here, thanks.

Environment

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: v1.25.4+k3s1
  • PGO Image Tag: postgres-operator:ubi8-5.7.2-0
  • Postgres Version: 15
  • Storage: longhorn
@batulziiy (Author)

Even after upgrading the k3s version to 1.26.3, nothing has changed and the databases still keep restarting.

@batulziiy (Author)

Now it at least restarts randomly after running for 5-40 minutes, and it prints out the message below before it gets rebooted. I assume something is trying to apply changes to the database that require a restart, and the database then restarts once they have been applied. The interesting part is that it doesn't say what kind of configuration change is going to be applied.

2025-01-13 13:59:48.732 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 13:59:58.738 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 14:00:08.743 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 14:00:18.748 UTC [95] LOG:  received SIGHUP, reloading configuration files
2025-01-13 14:00:28.753 UTC [95] LOG:  received SIGHUP, reloading configuration files
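For context: PGO runs Postgres under Patroni inside the instance pod, and these periodic SIGHUP reloads are most likely Patroni re-applying configuration. When a changed parameter needs a full restart rather than a reload, Patroni flags the member as pending a restart. A sketch of how to check that, assuming the default database container name and that patronictl can find its config inside the image:

# Members flagged with a pending restart have a queued change that requires a restart
kubectl exec -n postgres-operator hippo-instance1-f9rv-0 -c database -- patronictl list

# Show the dynamic configuration Patroni is currently applying
kubectl exec -n postgres-operator hippo-instance1-f9rv-0 -c database -- patronictl show-config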

@batulziiy (Author)

@benjaminjb do you have any idea what could be the cause?

@benjaminjb (Contributor)

Hey @batulziiy, sorry you're running into a problem. I am really curious why one k8s cluster has errors and the other is fine, but just to check that I understand the situation:

  • You had two K8s clusters, one running 1.25 and one running 1.26
  • You had PGO 5.3 running fine on both
  • You upgraded to PGO 5.7
  • The Postgres on 1.25 started having errors, the 1.26 Postgreses are fine
  • BUT when you upgraded the 1.25 cluster to 1.26, you still saw some errors (though maybe the amount of time was different).

Do I have that right?

I have to say, I am curious that the original 1.26 K8s cluster was working fine with PGO 5.7, because according to our support matrix, 5.7 is compatible with 1.28 - 1.32.

So, for hopefully a quick fix, can you upgrade your k8s clusters from 1.26 to 1.28?
For a potentially more involved fix: if I summarized the situation correctly, then I am curious why your 1.26 K8s cluster worked while your 1.25-to-1.26 cluster has errors. Are there no other differences? (Was the 1.26 cluster that worked always a 1.26 cluster?)

@batulziiy (Author)

Thanks @benjaminjb for the quick response. You understood it correctly and all the bullet items you wrote above are correct. I'm still wondering why it runs on one cluster while it doesn't on the other. Both k3s clusters have been running stably for more than 2 years now, but one was running on 1.25 until yesterday.
Actually, I saw that support matrix, but since pgo version 5.7 was running normally on my test cluster with kube version 1.26, I figured it should work.
I tried to compare everything between those 2 clusters, including workload, network settings, db settings, and storage settings. They're literally just mirrors of each other.
For your last question (Was the 1.26 cluster that worked always a 1.26 cluster?): yes, it's been 1.26 from the beginning and I haven't touched it until now.
Do you think it's possible to downgrade back to pgo version 5.3 from 5.7? I don't want to roll back the change, but if that's the last option, I guess I need to have it available.

@batulziiy (Author)

Now I've paused the pg cluster reconciliation process by running the command below, and I haven't received the 'received SIGHUP' message since then. But it's not a permanent solution, as the operator loses the ability to manage the database.
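The exact command isn't reproduced above; as a sketch, pausing reconciliation in PGO v5 is done through the spec.paused field on the PostgresCluster (cluster name and namespace assumed from the logs earlier in this issue):

# Suspend PGO's reconciliation of this PostgresCluster
kubectl patch postgrescluster hippo -n postgres-operator \
  --type merge --patch '{"spec":{"paused":true}}'

# Resume reconciliation later
kubectl patch postgrescluster hippo -n postgres-operator \
  --type merge --patch '{"spec":{"paused":false}}'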

@batulziiy (Author)

So far, these are the differences I've found between the two clusters:

  • On the problematic cluster we have fluxcd version v0.38.3, while v2.0.0-rc.1 is running on the other cluster (where everything works fine). For some reason we store the pgo repo in a local git repository and the PGO upgrade was done by fluxcd. However, v0.38.3 does support server-side apply (a sketch of how to check which field managers are writing to the PostgresCluster is included after this list).
  • The underlying Longhorn storage versions differ slightly: v1.4.0 on the problematic cluster and v1.4.1 on the other one.
  • On the problematic cluster we have both local-path and longhorn storageClasses, but only the longhorn storageClass is set as default.

Other than that I don't see any difference between them; they look quite identical.
@benjaminjb Do you have any suggestions other than upgrading the k3s version? Since we have pgo running normally on one cluster, the k3s version is probably not the issue.
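Regarding the server-side apply point in the list above, one way to see whether two controllers (e.g. Flux and PGO) are writing to the same fields is to inspect the field managers recorded on the object; a sketch, with the cluster name and namespace assumed from the logs earlier in this issue:

# Show which field managers (flux, postgres-operator, kubectl, ...) own which fields
kubectl get postgrescluster hippo -n postgres-operator \
  --show-managed-fields -o yaml | less

# If the flux CLI is installed, check what Flux reconciles and whether it reports conflicts
flux get kustomizations -A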
