@@ -60,79 +60,13 @@ the availability of any of the PostgreSQL clusters that it manages, as the
PostgreSQL Operator is only maintaining the definitions of what should be in the
cluster (e.g. how many instances in the cluster, etc.).

- Each HA PostgreSQL cluster maintains its availability using concepts that come
- from the [Raft algorithm](https://raft.github.io/) to achieve distributed
- consensus. The Raft algorithm ("Reliable, Replicated, Redundant,
- Fault-Tolerant") was developed for systems that have one "leader" (i.e. a
- primary) and one-to-many followers (i.e. replicas) to provide the same fault
- tolerance and safety as the PAXOS algorithm while being easier to implement.
-
- For the PostgreSQL cluster group to achieve distributed consensus on who the
- primary (or leader) is, each PostgreSQL cluster leverages the distributed etcd
- key-value store that is bundled with Kubernetes. After it is elected as the
- leader, a primary will place a lock in the distributed etcd cluster to indicate
- that it is the leader. The "lock" serves as the method for the primary to
- provide a heartbeat: the primary will periodically update the lock with the
- latest time it was able to access the lock. As long as each replica sees that
- the lock was updated within the allowable automated failover time, the replicas
- will continue to follow the leader.
-
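As a rough illustration of the lock-as-heartbeat idea described above, the sketch below models a leader periodically renewing a timestamped lock and a replica checking whether the lock is still fresh. The in-memory dictionary, key name, and timeout value are illustrative stand-ins, not the actual etcd schema or defaults the operator uses.

```python
import time

FAILOVER_TIMEOUT = 30  # seconds; illustrative, not the operator's real default

# A plain dict stands in for the distributed etcd key-value store.
store = {}

def renew_leader_lock(store, member_name):
    """Called periodically by the primary: refresh the lock with the current time."""
    store["leader-lock"] = {"holder": member_name, "renewed_at": time.time()}

def leader_is_healthy(store):
    """Called by each replica: is the lock still within the failover timeout?"""
    lock = store.get("leader-lock")
    if lock is None:
        return False
    return (time.time() - lock["renewed_at"]) < FAILOVER_TIMEOUT

# Primary heartbeat: renew the lock, then a replica checks it.
renew_leader_lock(store, "pg-primary-0")
print(leader_is_healthy(store))  # True while the lock is fresh
```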
- The "log replication" portion that is defined in the Raft algorithm is handled
- by PostgreSQL in two ways. First, the primary instance will replicate changes to
- each replica based on the rules set up in the provisioning process. For
- PostgreSQL clusters that leverage "synchronous replication," a transaction is
- not considered complete until all changes from those transactions have been sent
- to all replicas that are subscribed to the primary.
-
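A minimal sketch of how you might observe this from the primary, assuming psycopg2 and a superuser connection (the connection string is a placeholder): it lists connected replicas and whether each is synchronous, using PostgreSQL's standard pg_stat_replication view.

```python
# Sketch: list replicas connected to the primary and their sync/replay status.
# Host and user are placeholders; pg_stat_replication is standard in PG 10+.
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
with conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, sync_state, sent_lsn, replay_lsn
        FROM pg_stat_replication
    """)
    for name, state, sync_state, sent_lsn, replay_lsn in cur.fetchall():
        print(f"{name}: {state}, sync_state={sync_state}, "
              f"sent={sent_lsn}, replayed={replay_lsn}")
conn.close()
```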
- In the above section, note the key word: transactions are "sent" to each
- replica. The replicas will acknowledge receipt of a transaction, but it may not
- be replayed immediately. We will address how we handle this further down in
- this section.
-
- During this process, each replica keeps track of how far along in the recovery
- process it is using a "log sequence number" (LSN), a built-in PostgreSQL serial
- representation of how many logs have been replayed on each replica. For the
- purposes of HA, there are two LSNs that need to be considered: the LSN for the
- last log received by the replica, and the LSN for the changes replayed by the
- replica. The LSN for the latest changes received can be compared amongst the
- replicas to determine which one has received the most changes, which is an
- important part of the automated failover process.
-
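The two LSNs mentioned above can be inspected directly on a replica. The following sketch (the connection string is a placeholder) queries the standard functions pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn(), and reports the gap between them in bytes with pg_wal_lsn_diff().

```python
# Sketch: inspect a replica's received vs. replayed WAL position.
# Host/user are placeholders; the functions are standard in PostgreSQL 10+.
import psycopg2

conn = psycopg2.connect("host=replica.example.com dbname=postgres user=postgres")
with conn.cursor() as cur:
    cur.execute("""
        SELECT pg_last_wal_receive_lsn(),
               pg_last_wal_replay_lsn(),
               pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                               pg_last_wal_replay_lsn()) AS replay_lag_bytes
    """)
    received, replayed, lag_bytes = cur.fetchone()
    print(f"received={received} replayed={replayed} lag={lag_bytes} bytes")
conn.close()
```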
- The replicas periodically check in on the lock to see if it has been updated by
- the primary within the allowable automated failover timeout. Each replica checks
- in at a randomly set interval, which is a key part of the Raft algorithm that
- helps to ensure consensus during an election process. If a replica believes that
- the primary is unavailable, it becomes a candidate and initiates an election,
- voting for itself as the new primary. A candidate must receive a majority of
- votes in a cluster in order to be elected as the new primary.
-
- There are several cases for how the election can occur. If a replica believes
- that a primary is down and starts an election, but the primary is actually not
- down, the replica will not receive enough votes to become a new primary and will
- go back to following and replaying the changes from the primary.
-
- In the case where the primary is down, the first replica to notice this starts
- an election. Per the Raft algorithm, each available replica compares which one
- has the latest changes available, based upon the LSN of the latest logs
- received. The replica with the latest LSN receives the votes of the other
- replicas, and the replica with the majority of the votes wins. In the event
- that two replicas' logs have the same LSN, the tie goes to the replica that
- initiated the voting request.
-
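A stripped-down sketch of the comparison rule described above: given each candidate's latest received WAL position (represented here as plain byte offsets rather than real PostgreSQL LSN strings), pick the one with the most data, breaking ties in favor of the replica that initiated the vote. This is an illustration of the rule only, not the actual election implementation.

```python
def pick_new_primary(candidates, initiator):
    """candidates: dict of member name -> latest received WAL position (bytes).
    Ties on the highest position go to the replica that initiated the election."""
    best_lsn = max(candidates.values())
    tied = [name for name, lsn in candidates.items() if lsn == best_lsn]
    return initiator if initiator in tied else tied[0]

# Example: replica-2 initiated the vote and is tied for the latest position.
members = {"replica-1": 120_000_000, "replica-2": 120_000_000, "replica-3": 90_000_000}
print(pick_new_primary(members, initiator="replica-2"))  # replica-2
```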
- Once an election is decided, the winning replica is immediately promoted to be a
- primary and takes a new lock in the distributed etcd cluster. If the new primary
- has not finished replaying all of its transaction logs, it must do so in order
- to reach the desired state based on the LSN. Once the logs are finished being
- replayed, the primary is able to accept new queries.
-
- At this point, any existing replicas are updated to follow the new primary.
-
- When the old primary tries to become available again, it realizes that it has
- been deposed as the leader and must be healed. The old primary determines what
- kind of replica it should be based upon the CRD, which allows it to set itself
- up with the appropriate attributes. It is then restored from the pgBackRest
- backup archive using the "delta restore" feature, which heals the instance and
- makes it ready to follow the new primary. This process is known as "auto
- healing."
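For reference, a delta restore is invoked with pgBackRest's restore command and its --delta option, which reuses files already present in the data directory and only rewrites what has changed. The sketch below shells out to pgbackrest with a hypothetical stanza name; in practice this step is driven automatically rather than run by hand.

```python
# Sketch: invoke a pgBackRest delta restore for a deposed primary.
# The stanza name "db" is a placeholder; normally this is run automatically.
import subprocess

subprocess.run(
    ["pgbackrest", "--stanza=db", "--delta", "restore"],
    check=True,
)
```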
+ Each HA PostgreSQL cluster maintains its availability by using Patroni to manage
+ failover when the primary becomes compromised. Patroni stores the primary's ID in
+ annotations on a Kubernetes `Endpoints` object, which acts as a lease. The primary
+ must periodically renew the lease to signal that it's healthy. If the primary
+ misses its deadline, replicas compare their WAL positions to see who has the most
+ up-to-date data. Instances with the latest data try to overwrite the ID on the lease.
+ The first to succeed becomes the new primary, and all others follow the new primary.
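To see which member currently holds the lease, you can inspect the annotations Patroni writes on that Endpoints object. The sketch below uses the official Kubernetes Python client; the cluster name "hippo", the namespace "pgo", and the exact annotation keys (the holder is typically recorded under a key such as `leader`, alongside renewal timestamps) are assumptions to verify against your own deployment.

```python
# Sketch: read the Patroni leader lease annotations from the Endpoints object.
# "hippo" and "pgo" are placeholder cluster/namespace names; annotation keys
# such as "leader" and "renewTime" are assumptions to verify in your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

endpoints = v1.read_namespaced_endpoints(name="hippo", namespace="pgo")
annotations = endpoints.metadata.annotations or {}
print("current leader:", annotations.get("leader"))
print("lease renewed at:", annotations.get("renewTime"))
```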

## How The Crunchy PostgreSQL Operator Uses Pod Anti-Affinity