5 Transaction Processing

The document outlines the principles of distributed database systems, covering topics such as transaction processing, concurrency control, reliability, and data replication. It discusses various algorithms for managing transactions, including locking-based and timestamp ordering methods, and addresses issues like deadlocks and multiversion concurrency control. The content is aimed at providing a comprehensive understanding of how distributed databases maintain consistency and reliability in the face of concurrent operations and failures.


Principles of Distributed Database Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
 Introduction
 Distributed and parallel database design
 Distributed data control
 Distributed Transaction Processing
 Data Replication
 Database Integration – Multidatabase Systems
 Parallel Database Systems
 Peer-to-Peer Data Management
 Big Data Processing
 NoSQL, NewSQL and Polystores
 Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
 Distributed Transaction Processing
 Distributed Concurrency Control
 Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 3


Transaction
 A database is in a consistent state if it obeys all of the
consistency (integrity) constraints defined over it.
 Transaction consistency, on the other hand, refers to the
operations of concurrent transactions: the database should
remain in a consistent state even when a number of user
requests are concurrently accessing (reading or updating) it.
 Reliability refers to both the resiliency of a system to various
types of failures and its capability to recover from them.
© 2020, M.T. Özsu & P. Valduriez 4
Transaction
Reliability refers to both the resiliency of a system to various
types of failures and its capability to recover from them.

 A resilient system is tolerant of system failures and can
continue to provide services even when failures occur.
 A recoverable DBMS is one that can get to a consistent state
(by moving back to a previous consistent state or forward to a
new consistent state) following various types of failures.

© 2020, M.T. Özsu & P. Valduriez 5


Transaction

 A transaction is a collection of actions that make consistent
transformations of system states while preserving system
consistency.
 concurrency transparency
 failure transparency

© 2020, M.T. Özsu & P. Valduriez 6


Transaction Characterization
Begin_transaction
  Read
  Read
  Write
  Read
Commit
 Read set (RS)
 The set of data items that are read by a transaction
 Write set (WS)
 The set of data items whose values are changed by this transaction
 Base set (BS)
 RS ∪ WS

© 2020, M.T. Özsu & P. Valduriez 7


Principles of Transactions

ATOMICITY
 all or nothing

CONSISTENCY
 no violation of integrity constraints

ISOLATION
 concurrent changes invisible → serializable

DURABILITY
 committed updates persist

© 2020, M.T. Özsu & P. Valduriez 8


Transactions

 Atomic and reliable execution in the presence of failures

 Correct execution in the presence of multiple user
accesses

 Correct management of replicas (if they support it)

© 2020, M.T. Özsu & P. Valduriez 9


Distributed TM Architecture

© 2020, M.T. Özsu & P. Valduriez 10


Outline
 Distributed Transaction Processing
 Distributed Concurrency Control

© 2020, M.T. Özsu & P. Valduriez 11


5.2 Distributed Concurrency Control
(p.188)
 The problem of synchronizing concurrent transactions
such that the consistency of the database is maintained
while, at the same time, maximum degree of concurrency
is achieved.
 Enforce isolation property
 Anomalies:
 Lost updates
 The effects of some transactions are not reflected on the database.
 Inconsistent retrievals
 A transaction, if it reads the same data item more than once, should
always read the same value.

© 2020, M.T. Özsu & P. Valduriez 12


Serializability in Distributed DBMS

 Two histories have to be considered:
 local histories
 global history
 For global transactions (i.e., global history) to be
serializable, two conditions are necessary:
 Each local history should be serializable → local serializability
 Two conflicting operations should be in the same relative order
in all of the local histories where they appear together →
global serializability

© 2020, M.T. Özsu & P. Valduriez 13


Global Non-serializability (p.188)
Example 5.1
T1: Read(x)           T2: Read(x)
    x ← x - 100           Read(y)
    Write(x)              Commit
    Read(y)
    y ← y + 100
    Write(y)
    Commit

 x stored at Site 1, y stored at Site 2


H1 = {R1(x), W1(x), R2(x)}   T1 -> T2
H2 = {R1(y), W1(y), R2(y)}   T1 -> T2
 H1 and H2 are individually serializable; since the conflicting operations
are in the same relative order at both sites, the two transactions are also
globally serializable.
© 2020, M.T. Özsu & P. Valduriez 14
Global Non-serializability (p.188)

H'1 = {R1(x), W1(x), R2(x)}   T1 -> T2
H'2 = {R2(y), R1(y), W1(y)}   T2 -> T1

 H’1, H’2 are individually serializable (in fact serial), but the
two transactions are not globally serializable.

© 2020, M.T. Özsu & P. Valduriez 15


Concurrency Control Algorithms

 Pessimistic
 Two-Phase Locking-based (2PL)
 Centralized (primary site) 2PL
 Primary copy 2PL
 Distributed 2PL
 Timestamp Ordering (TO)
 Basic TO
 Multiversion TO
 Conservative TO
 Optimistic
 Locking-based
 Timestamp ordering-based

© 2020, M.T. Özsu & P. Valduriez 16


5.2.1 Locking-Based Algorithms (p.189)

 Transactions indicate their intentions by requesting locks
from the scheduler (called lock manager).
 Locks are either read lock (rl) [also called shared lock] or
write lock (wl) [also called exclusive lock].
 Read locks and write locks conflict (because Read and
Write operations are incompatible):

        rl    wl
  rl   yes    no
  wl    no    no

 Locking works nicely to allow concurrent processing of
transactions (a minimal lock-table sketch follows below).
© 2020, M.T. Özsu & P. Valduriez 17
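
To make the compatibility matrix concrete, here is a minimal lock-table sketch in Python (not from the book); the LockManager class, its method names, and the strict release-at-end behavior are illustrative assumptions.

# Minimal lock-table sketch for the rl/wl compatibility matrix above (Python).
# Names (LockManager, acquire, release_all) are illustrative, not the book's.
from collections import defaultdict

COMPATIBLE = {("rl", "rl"): True, ("rl", "wl"): False,
              ("wl", "rl"): False, ("wl", "wl"): False}

class LockManager:
    def __init__(self):
        self.granted = defaultdict(list)   # item -> [(txn, mode), ...]

    def acquire(self, txn, item, mode):
        """Grant the lock only if it is compatible with locks held by others."""
        for holder, held in self.granted[item]:
            if holder != txn and not COMPATIBLE[(held, mode)]:
                return False               # caller must wait (or restart)
        self.granted[item].append((txn, mode))
        return True

    def release_all(self, txn):
        """Strict 2PL: release all of txn's locks at commit/abort time."""
        for item in self.granted:
            self.granted[item] = [(t, m) for t, m in self.granted[item] if t != txn]

lm = LockManager()
print(lm.acquire("T1", "x", "rl"), lm.acquire("T2", "x", "rl"))  # True True
print(lm.acquire("T3", "x", "wl"))                               # False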
5.2.1.1 Centralized 2PL (p.190)
 There is only one 2PL scheduler in the distributed system.
 Lock requests are issued to the central scheduler.

© 2020, M.T. Özsu & P. Valduriez 18


Centralized 2PL

For reference:
 Algorithm 5.1: Centralized 2PL Transaction Manager (C2PL-TM) (p.191)
 Algorithm 5.2: Centralized 2PL Lock Manager (C2PL-LM) (p.192)
 Algorithm 5.3: Data Processor (DP) (p.193)

© 2020, M.T. Özsu & P. Valduriez 19


5.2.1.2 Distributed 2PL (p.193)

 2PL schedulers are placed at each site. Each scheduler
handles lock requests for data at that site.
 A transaction may read any of the replicated copies of
item x, by obtaining a read lock on one of the copies of x.
Writing into x requires obtaining write locks for all copies
of x.

© 2020, M.T. Özsu & P. Valduriez 20
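
A minimal sketch of the read-one/write-all rule described above, assuming a copies map and a request_lock helper that stands in for a message to the remote 2PL scheduler (both are illustrative, not the book's interface):

# Read-one/write-all (ROWA) locking sketch for a replicated item (Python).
# The copies map and request_lock() stand in for messages to remote schedulers.
copies = {"x": ["site1", "site2", "site3"]}    # assumed replica placement

def request_lock(site, txn, item, mode):
    # Placeholder for a lock request sent to the 2PL scheduler at `site`;
    # assume it eventually answers True when the lock is granted.
    return True

def read_lock(txn, item):
    # A read lock on any single copy is enough.
    return request_lock(copies[item][0], txn, item, "rl")

def write_lock(txn, item):
    # A write lock must be obtained on every copy.
    return all(request_lock(s, txn, item, "wl") for s in copies[item])

print(read_lock("T1", "x"), write_lock("T2", "x"))   # True True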


Distributed 2PL Execution

© 2020, M.T. Özsu & P. Valduriez 21


Distributed 2PL Execution
The distributed 2PL transaction management algorithm is
similar to C2PL-TM, with two major modifications:
 The messages that are sent to the central site lock
manager in C2PL-TM are sent to the lock managers at all
participating sites in D2PL-TM.
 The operations are not passed to the data processors by
the coordinating transaction manager, but by the
participating lock managers.
=> This means that the coordinating transaction
manager does not wait for a “lock request granted”
message. The participating data processors send the
“end of operation” messages to the coordinating TM.
© 2020, M.T. Özsu & P. Valduriez 22


5.2.1.3 Deadlock (p.194)
 A transaction is deadlocked if it is blocked and will
remain blocked until there is intervention.
 Locking-based concurrency control algorithms may
cause deadlocks.
 TO-based algorithms that involve waiting may cause
deadlocks.
 Wait-for graph (WFG)
If transaction Ti waits for another transaction Tj to
release a lock on an entity, then Ti → Tj in WFG.

Ti → Tj

© 2020, M.T. Özsu & P. Valduriez 23


Local versus Global WFG
Example 5.2 (p.194)
 T1 and T2 run at site 1, T3 and T4 run at site 2.
 T3 waits for a lock held by T4 which waits for a lock held by T1 which
waits for a lock held by T2 which, in turn, waits for a lock held by T3.

Local WFG

Global WFG

© 2020, M.T. Özsu & P. Valduriez 24


Deadlock Detection

 Transactions are allowed to wait freely.
 Wait-for graphs and cycles.
 Topologies for deadlock detection algorithms
 Centralized
 Distributed
 Hierarchical

© 2020, M.T. Özsu & P. Valduriez 25
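
Deadlock detection on a WFG reduces to cycle detection. The sketch below is an assumption about representation only (the WFG as an adjacency map); it finds one cycle with a depth-first search.

# Deadlock detection sketch (Python): find a cycle in a wait-for graph.
# The WFG is an adjacency map Ti -> set of transactions that Ti waits for.
def find_cycle(wfg):
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wfg}

    def dfs(t, path):
        color[t] = GREY
        path.append(t)
        for u in wfg.get(t, ()):
            if color.get(u, WHITE) == GREY:          # back edge => cycle
                return path[path.index(u):] + [u]
            if color.get(u, WHITE) == WHITE:
                found = dfs(u, path)
                if found:
                    return found
        color[t] = BLACK
        path.pop()
        return None

    for t in list(wfg):
        if color[t] == WHITE:
            cycle = dfs(t, [])
            if cycle:
                return cycle
    return None

# Four transactions waiting in a ring, as in the global WFG example:
print(find_cycle({"T1": {"T2"}, "T2": {"T3"}, "T3": {"T4"}, "T4": {"T1"}}))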


Centralized Deadlock Detection

 One site is designated as the deadlock detector for the
system. Each scheduler periodically sends its local WFG
to the central site which merges them to a global WFG to
determine cycles.
 How often to transmit?
 Too often ⇒ higher communication cost but lower delays due to
undetected deadlocks
 Too late ⇒ higher delays due to deadlocks, but lower
communication cost
 Would be a reasonable choice if the concurrency control
algorithm is also centralized (centralized 2PL)
 Proposed for Distributed INGRES

© 2020, M.T. Özsu & P. Valduriez 26


Hierarchical Deadlock Detection

 Build a hierarchy of detectors.
 Deadlocks that are local to a single site would be detected
at that site using the LWFG.
 Each site also sends its LWFG to the deadlock detector at
the next level.

© 2020, M.T. Özsu & P. Valduriez 27


Distributed Deadlock Detection
Distributed deadlock detection algorithms delegate the responsibility of
detecting deadlocks to individual sites. Local deadlock detectors at each
site that communicate their LWFGs with one another (in fact, only the
potential deadlock cycles are transmitted)
 Sites cooperate in detection of deadlocks.
 One example:
 Form local WFGs at each site, modified as follows:
1) Potential deadlock cycles from other sites are
added as edges
2) Join these with edges in LWFGs depicting that
remote transactions are waiting for local ones

 Each local deadlock detector:
 looks for a cycle that does not involve the external edge. If it exists,
there is a local deadlock which can be handled locally.
 looks for a cycle involving the external edge. If it exists, it indicates a
potential global deadlock. Pass on the information to the next site.

© 2020, M.T. Özsu & P. Valduriez 28


5.2.2 Timestamp Ordering (p.197)
 Transaction Ti is assigned a globally unique timestamp ts(Ti).
 The transaction manager attaches the timestamp to all
operations issued by the transaction.
 Each data item is assigned a write timestamp (wts) and a
read timestamp (rts):
 rts(x) = largest timestamp of any read on x
 wts(x) = largest timestamp of any write (update) on x
 Conflicting operations are resolved by timestamp order.
© 2020, M.T. Özsu & P. Valduriez 29
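
One common way to obtain globally unique timestamps, consistent with the slide but not prescribed by it, is to pair a monotonically increasing local counter with the site identifier; the TimestampGenerator class below is an illustrative sketch.

# Globally unique timestamp sketch (Python): a (local counter, site id) pair.
# Ties on the counter are broken by the site id, so no two sites ever
# generate the same timestamp. Class and method names are illustrative.
import itertools

class TimestampGenerator:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count(1)    # monotonically increasing locally

    def next_ts(self):
        return (next(self.counter), self.site_id)

site1, site2 = TimestampGenerator(1), TimestampGenerator(2)
print(site1.next_ts(), site2.next_ts())   # (1, 1) (1, 2): distinct and ordered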
Basic Timestamp Ordering
Two conflicting operations Oij of Ti and Okl of Tk → Oij executed before
Okl iff ts(Ti) < ts(Tk).
 Ti is called older transaction
 Tk is called younger transaction

 Basic TO algorithm never causes operations to wait, but
instead, restarts them.
© 2020, M.T. Özsu & P. Valduriez 30
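
A minimal sketch of the basic TO accept/restart rules, assuming in-memory rts/wts dictionaries and a caller that restarts the transaction with a new timestamp whenever "restart" is returned:

# Basic TO sketch (Python): accept an operation only if it is not "too late"
# with respect to rts(x)/wts(x); otherwise the transaction is restarted
# (with a new timestamp, which is left to the caller here).
rts, wts = {}, {}        # read / write timestamps per data item

def read(ts, x):
    if ts < wts.get(x, float("-inf")):
        return "restart"            # a younger transaction already wrote x
    rts[x] = max(rts.get(x, float("-inf")), ts)
    return "accept"

def write(ts, x):
    if ts < rts.get(x, float("-inf")) or ts < wts.get(x, float("-inf")):
        return "restart"            # a younger transaction already read or wrote x
    wts[x] = ts
    return "accept"

print(write(5, "x"), read(3, "x"), read(7, "x"), write(6, "x"))
# -> accept restart accept restart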
5.2.2.2 Conservative Timestamp
Ordering (p.201)
 Basic timestamp ordering tries to execute an operation
as soon as it receives it
 progressive
 too many restarts since there is no delaying
 Conservative timestamping delays each operation until
there is an assurance that it will not be restarted
 Assurance?
 No other operation with a smaller timestamp can arrive at the
scheduler
 Note that the delay may result in the formation of deadlocks

© 2020, M.T. Özsu & P. Valduriez 31


5.2.3 Multiversion Concurrency Control
(MVCC)

 Each version of a data item that is created is labeled with
the timestamp of the transaction that creates it
-> Do not modify the values in the database; create
new values instead.
 Typically timestamp-based implementation
 Implemented in a number of systems: IBM DB2, Oracle,
SQL Server, SAP HANA, BerkeleyDB, PostgreSQL

© 2020, M.T. Özsu & P. Valduriez 32


MVCC Reads

 A Ri(x) is translated into a read on one version of x:
 Find a version of x (say xv) such that ts(xv) is the largest
timestamp less than ts(Ti).

© 2020, M.T. Özsu & P. Valduriez 33
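
A minimal sketch of this version-selection rule, assuming the versions of x are kept as a list of (write timestamp, value) pairs sorted by timestamp (an illustrative in-memory structure, not the book's implementation):

# MVCC read sketch (Python): Ri(x) reads the version of x whose timestamp is
# the largest one smaller than ts(Ti).
import bisect

versions_of_x = [(2, "a"), (5, "b"), (9, "c")]   # (wts of version, value)

def mvcc_read(ts_ti):
    idx = bisect.bisect_left(versions_of_x, (ts_ti,)) - 1
    if idx < 0:
        return None                  # no version older than the reading txn
    return versions_of_x[idx][1]

print(mvcc_read(7))   # 'b': the version with wts 5, the largest timestamp < 7
print(mvcc_read(1))   # None: there is no version with timestamp < 1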


MVCC Writes

 A Wi(x) is translated into Wi(xw) and accepted if the
scheduler has not yet processed any Rj(xr) such that
ts(Ti) < ts(xr) < ts(Tj)
 If accepted, a new version xw with timestamp ts(Ti) is created.

© 2020, M.T. Özsu & P. Valduriez 34


5.2.4 Optimistic Concurrency Control
Algorithms (p.205)
Optimistic algorithms assume that transaction conflicts and
contention for data will not be predominant.

 Each transaction follows five phases: read (R), execute
(E), write (W), validate (V), and commit (C); the commit
phase becomes abort if the transaction is not validated.
 This algorithm assigns timestamps to transactions at the
beginning of their validation step rather than at the
beginning of transactions as is done with (pessimistic) TO
algorithms.
© 2020, M.T. Özsu & P. Valduriez 35
Optimistic Concurrency Control
Algorithms
 Transaction execution model: divide Ti into subtransactions,
each of which executes at a site
 T_i^s: subtransaction of Ti that executes at site s
 At the beginning of the validation phase a timestamp is
assigned to the transaction (and is also the timestamp of its
subtransactions)
 The local validation of T_i^s is performed according to the
following rules:

© 2020, M.T. Özsu & P. Valduriez 36


Optimistic CC Validation Test

 Rule 1: At each site s, if all transactions T_k^s where
ts(T_k^s) < ts(T_i^s) have completed their write phase
before T_i^s has started its read phase (Fig. 5.9a),
validation succeeds, because the transaction executions
are in serial order.

© 2020, M.T. Özsu & P. Valduriez 37


Optimistic CC Validation Test

 Rule 2: At each site s, if there is any transaction T_k^s such that
ts(T_k^s) < ts(T_i^s), and which completes its write phase
(commit) while T_i^s is in its read phase (Fig. 5.9b), the
validation succeeds if WS(T_k^s) ∩ RS(T_i^s) = ∅.
 The updates of T_i^s will not be overwritten by the
updates of T_k^s.
© 2020, M.T. Özsu & P. Valduriez 38
Optimistic CC Validation Test

 Rule 3: At each site s, if there is any transaction T_k^s such that
ts(T_k^s) < ts(T_i^s), and which completes its read phase
before T_i^s completes its read phase (Fig. 5.9c), the
validation succeeds if WS(T_k^s) ∩ RS(T_i^s) = ∅ and
WS(T_k^s) ∩ WS(T_i^s) = ∅.
 This simply requires that the updates of T_k^s not affect the read
phase or the write phase of T_i^s.

© 2020, M.T. Özsu & P. Valduriez 39
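
A sketch of the three local validation rules above, assuming each (sub)transaction is described by its read/write sets and the boundary times of its read and write phases (the dictionary layout is illustrative):

# Sketch of the three local validation rules (Python). Each (sub)transaction
# is described by its read/write sets and the boundary times of its phases;
# this representation is an assumption made only for illustration.
def validate(ti, older):
    """Validate ti against every older transaction tk with ts(tk) < ts(ti)."""
    for tk in older:
        if tk["write_end"] <= ti["read_start"]:
            continue                                   # Rule 1: serial order
        if tk["write_end"] <= ti["read_end"]:          # Rule 2: tk commits while
            if tk["WS"] & ti["RS"]:                    # ti is in its read phase
                return False
        elif tk["read_end"] <= ti["read_end"]:         # Rule 3: read phases overlap
            if (tk["WS"] & ti["RS"]) or (tk["WS"] & ti["WS"]):
                return False
        else:
            return False                               # ordering cannot be guaranteed
    return True

ti = {"RS": {"x"}, "WS": {"y"}, "read_start": 10, "read_end": 20, "write_end": 25}
tk = {"RS": {"z"}, "WS": {"x"}, "read_start": 5,  "read_end": 12, "write_end": 18}
print(validate(ti, [tk]))   # False: tk wrote x while ti was reading x (Rule 2)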


5.3 Snapshot Isolation (SI)

 Each transaction “sees” a consistent snapshot of the
database when it starts, and reads/writes this snapshot
 Repeatable reads, but not serializable isolation
 Read-only transactions proceed without significant
synchronization overhead
 Centralized SI-based CC (a commit-test sketch follows below)
1) Ti starts, obtains a begin timestamp tsb(Ti)
2) Ti ready to commit, obtains a commit timestamp tsc(Ti) that is greater
than any of the existing tsb or tsc
3) Ti commits if there is no other Tj such that tsc(Tj) ∈ [tsb(Ti), tsc(Ti)];
otherwise it is aborted (first committer wins)
4) When Ti commits, its changes become visible to all Tk where tsb(Tk) > tsc(Ti)

© 2020, M.T. Özsu & P. Valduriez 40
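
A sketch of the centralized SI commit test; the write-set overlap check used here is the usual refinement behind "first committer wins" and is an assumption not spelled out on the slide.

# Centralized SI commit-test sketch (Python), "first committer wins".
import itertools

clock = itertools.count(1)
committed = []               # (tsb, tsc, write_set) of committed transactions

def begin():
    return next(clock)                        # tsb(Ti)

def try_commit(tsb_i, write_set):
    tsc_i = next(clock)                       # tsc(Ti), larger than all existing
    for tsb_j, tsc_j, ws_j in committed:
        if tsb_i <= tsc_j <= tsc_i and ws_j & write_set:
            return None                       # abort: a concurrent writer won
    committed.append((tsb_i, tsc_i, write_set))
    return tsc_i

t1, t2 = begin(), begin()
print(try_commit(t1, {"x"}))   # a commit timestamp: T1 commits
print(try_commit(t2, {"x"}))   # None: T1 committed x within [tsb(T2), tsc(T2)]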


Distributed CC with SI
 Computing a consistent distributed snapshot is hard
 Similar rules to serializability
 Each local history should be SI
 Global history is SI → commitment orders at each site are the same
 Dependence relationship: Ti at site s (T_i^s) is dependent on T_j^s
(dependent(T_i^s, T_j^s)) iff
(RS(T_i^s) ∩ WS(T_j^s) ≠ ∅) ∨ (WS(T_i^s) ∩ RS(T_j^s) ≠ ∅) ∨ (WS(T_i^s) ∩ WS(T_j^s) ≠ ∅)
 Conditions
1) dependent(Ti, Tj) ∧ tsb(T_i^s) < tsc(T_j^s) ⟹ tsb(T_i^t) < tsc(T_j^t) at every
site t where Ti and Tj execute together
2) dependent(Ti, Tj) ∧ tsc(T_i^s) < tsc(T_j^s) ⟹ tsc(T_i^t) < tsb(T_j^t) at every
site t where Ti and Tj execute together
3) tsc(T_i^s) < tsc(T_j^s) ⟹ tsc(T_i^t) < tsb(T_j^t) at every site t where Ti and Tj
execute together
© 2020, M.T. Özsu & P. Valduriez 41
Distributed CC with SI – Executing Ti
[Fig.: commit processing between the coordinating TM and each site s]
 Each site s checks whether the first two conditions hold, answers positive
or negative, and updates its event clock to max(own, coordinating TM).
 The coordinating TM waits for the answers, checks whether any is negative,
and updates its event clock to the max of the event clocks of all sites s.
 If the decision is global commit, each site persists Ti's updates and
updates its event clock.
© 2020, M.T. Özsu & P. Valduriez 42
Outline
 Distributed Transaction Processing

 Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 43


Reliability

Problem: How to maintain the
 atomicity
 durability
properties of transactions

© 2020, M.T. Özsu & P. Valduriez 44


Types of Failures

 Transaction failures
 Transaction aborts (unilaterally or due to deadlock)
 System (site) failures
 Failure of processor, main memory, power supply, …
 Main memory contents are lost, but secondary storage contents
are safe
 Partial vs. total failure
 Media failures
 Failure of secondary storage devices → stored data is lost
 Head crash/controller failure
 Communication failures
 Lost/undeliverable messages
 Network partitioning
© 2020, M.T. Özsu & P. Valduriez 45
Distributed Reliability Protocols

 Commit protocols
 How to execute commit command for distributed transactions.
 Issue: how to ensure atomicity and durability?
 Termination protocols
 If a failure occurs, how can the remaining operational sites deal with it.
 Non-blocking: the occurrence of failures should not force the sites to
wait until the failure is repaired to terminate the transaction.
 Recovery protocols
 When a failure occurs, how do the sites where the failure occurred deal
with it.
 Independent: a failed site can determine the outcome of a transaction
without having to obtain remote information.
 Independent recovery ⇒ non-blocking termination

© 2020, M.T. Özsu & P. Valduriez 46


Two-Phase Commit (2PC)

 Phase 1: The coordinator gets the participants ready to
write the results into the database
 Phase 2: Everybody writes the results into the database
 Coordinator: The process at the site where the transaction
originates and which controls the execution
 Participant: The process at the other sites that participate in
executing the transaction
Global Commit Rule:
 The coordinator aborts a transaction if and only if at least one
participant votes to abort it.
 The coordinator commits a transaction if and only if all of the
participants vote to commit it.

© 2020, M.T. Özsu & P. Valduriez 47
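
A coordinator-side sketch of the global commit rule above, with message exchange replaced by direct function calls to participants (an illustrative simplification; logging, timeouts, and the termination protocol are omitted):

# Coordinator-side sketch of the 2PC global commit rule (Python).
def two_phase_commit(participants):
    # Phase 1: send "prepare" and collect the votes.
    votes = [p("prepare") for p in participants]
    decision = "global-commit" if all(v == "vote-commit" for v in votes) \
               else "global-abort"
    # Phase 2: broadcast the decision; participants then apply or undo.
    for p in participants:
        p(decision)
    return decision

always_yes = lambda msg: "vote-commit" if msg == "prepare" else "ack"
always_no  = lambda msg: "vote-abort"  if msg == "prepare" else "ack"
print(two_phase_commit([always_yes, always_yes]))   # global-commit
print(two_phase_commit([always_yes, always_no]))    # global-abort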


State Transitions in 2PC

Coordinator / Participant state transition diagrams (Fig. 5.10)

© 2020, M.T. Özsu & P. Valduriez 48


Centralized 2PC

Fig. 5.12
C: Coordinator, P: Participant
© 2020, M.T. Özsu & P. Valduriez 49
2PC Protocol Actions

centralized 2PC

Fig. 5.11

© 2020, M.T. Özsu & P. Valduriez 50


Linear 2PC

V-C: Vote-Commit, V-A: Vote-Abort, G-C: Global-commit, G-A: Global-abort

© 2020, M.T. Özsu & P. Valduriez 51


Distributed 2PC
Fig. 5.14 Distributed 2PC communication structure
 Each participant sends its vote to all other participants (and the
coordinator), so every site can determine the global decision on its own;
there is no need for the second phase of the protocol.


© 2020, M.T. Özsu & P. Valduriez 52
Variations of 2PC

To improve performance:
1) Reduce the number of messages between coordinator &
participants
2) Reduce the number of times logs are written
 Presumed Abort 2PC
 Participant polls coordinator about transaction’s outcome
 No information → abort the transaction
 Presumed Commit 2PC
 Participant polls coordinator about transaction’s outcome
 No information → assume transaction is committed
 Not an exact dual of presumed abort 2PC

© 2020, M.T. Özsu & P. Valduriez 53


Site Failures - 2PC Termination
The termination protocols serve the timeouts for both the
coordinator and the participant processes.

Coordinator Timeouts

 Timeout in WAIT
 Cannot unilaterally commit
 Can unilaterally abort
 Timeout in ABORT or COMMIT
 Stay blocked and wait for the acks

© 2020, M.T. Özsu & P. Valduriez 54


Site Failures - 2PC Termination

Participant
Participant Timeouts

 Timeout in INITIAL
 Coordinator must have failed
in INITIAL state
 Unilaterally abort
 Timeout in READY
 Stay blocked

© 2020, M.T. Özsu & P. Valduriez 55


Site Failures - 2PC Recovery

Coordinator
 Failure in INITIAL
 Start the commit process upon
recovery
 Failure in WAIT
 Restart the commit process upon
recovery
 Failure in ABORT or COMMIT
 Nothing special if all the acks have
been received
 Otherwise the termination protocol is
involved

© 2020, M.T. Özsu & P. Valduriez 56


Site Failures - 2PC Recovery

Participant
 Failure in INITIAL
 Unilaterally abort upon recovery
 Failure in READY
 The coordinator has been
informed about the local decision
 Treat as timeout in READY state
and invoke the termination
protocol
 Failure in ABORT or COMMIT
 Nothing special needs to be done

© 2020, M.T. Özsu & P. Valduriez 57


2PC Recovery Protocols –
Additional Cases
Arise due to non-atomicity of log and message send
actions
 Coordinator site fails after writing “begin_commit” log
and before sending “prepare” command
 treat it as a failure in WAIT state; send “prepare” command
 Participant site fails after writing “ready” record in log but
before “vote-commit” is sent
 treat it as failure in READY state
 alternatively, can send “vote-commit” upon recovery
 Participant site fails after writing “abort” record in log but
before “vote-abort” is sent
 no need to do anything upon recovery
© 2020, M.T. Özsu & P. Valduriez 58
2PC Recovery Protocols –
Additional Case

 Coordinator site fails after logging its final decision


record but before sending its decision to the participants
 coordinator treats it as a failure in COMMIT or ABORT state
 participants treat it as timeout in the READY state
 Participant site fails after writing “abort” or “commit”
record in log but before acknowledgement is sent
 participant treats it as failure in COMMIT or ABORT state
 coordinator will handle it by timeout in COMMIT or ABORT state

© 2020, M.T. Özsu & P. Valduriez 59


Problem With 2PC

 Blocking
 Ready implies that the participant waits for the coordinator
 If coordinator fails, site is blocked until recovery
 Blocking reduces availability
 Independent recovery is not possible
 However, it is known that:
 Independent recovery protocols exist only for single site failures;
no independent recovery protocol exists which is resilient to
multiple-site failures.
 So we search for these protocols – 3PC

© 2020, M.T. Özsu & P. Valduriez 60


Three-Phase Commit

 3PC is non-blocking.
 A commit protocol is non-blocking iff
 it is synchronous within one state transition, and
 its state transition diagram contains
 no state which is “adjacent” to both a commit and an abort state,
and
 no non-committable state which is “adjacent” to a commit state
 Adjacent: possible to go from one state to another with a
single state transition
 Committable: all sites have voted to commit a
transaction
 e.g.: COMMIT state

© 2020, M.T. Özsu & P. Valduriez 61


State Transitions in 3PC

Coordinator / Participant state transition diagrams

3PC Protocol Actions

© 2020, M.T. Özsu & P. Valduriez 63


Network Partitioning

 Simple partitioning
 Only two partitions
 Multiple partitioning
 More than two partitions
 Formal bounds:
 There exists no non-blocking protocol that is resilient to a
network partition if messages are lost when partition occurs.
 There exist non-blocking protocols which are resilient to a single
network partition if all undeliverable messages are returned to
sender.
 There exists no non-blocking protocol which is resilient to a
multiple partition.

© 2020, M.T. Özsu & P. Valduriez 64


Independent Recovery Protocols for
Network Partitioning

 No general solution possible
 allow one group to terminate while the other is blocked
 improve availability
 How to determine which group to proceed?
 The group with a majority
 How does a group know if it has majority?
 Centralized
 Whichever partition contains the central site should terminate the
transaction
 Voting-based (quorum)

© 2020, M.T. Özsu & P. Valduriez 65


Quorum Protocols

 The network partitioning problem is handled by the
commit protocol.
 Every site is assigned a vote Vi.
 Total number of votes in the system V
 Abort quorum Va, commit quorum Vc
 Va + Vc > V where 0 ≤ Va , Vc ≤ V
 Before a transaction commits, it must obtain a commit quorum Vc
 Before a transaction aborts, it must obtain an abort quorum Va

© 2020, M.T. Özsu & P. Valduriez 66
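
A sketch of the quorum condition above, assuming equal vote weights per site; because Va + Vc > V, commit and abort quorums can never be formed in two disjoint partitions at the same time.

# Quorum check sketch (Python), with assumed equal vote weights per site.
site_votes = {"s1": 1, "s2": 1, "s3": 1, "s4": 1, "s5": 1}   # assumed weights
V = sum(site_votes.values())                                  # V = 5
Va, Vc = 3, 3                                                 # Va + Vc = 6 > V

def can_commit(reachable_sites):
    return sum(site_votes[s] for s in reachable_sites) >= Vc

def can_abort(reachable_sites):
    return sum(site_votes[s] for s in reachable_sites) >= Va

print(can_commit({"s1", "s2", "s3"}))   # True: majority partition may commit
print(can_abort({"s4", "s5"}))          # False: minority partition stays blocked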


Paxos Consensus Protocol

 General problem: how to reach an agreement
(consensus) among TMs about the fate of a transaction
 2PC and 3PC are special cases
 General idea: If a majority reaches a decision, the global
decision is reached (like voting)
 Roles:
 Proposer: recommends a decision
 Acceptor: decides whether to accept the proposed decision
 Learner: discovers the agreed-upon decision by asking or it is
pushed

© 2020, M.T. Özsu & P. Valduriez 67


Paxos & Complications

 Naïve Paxos: one proposer
 Operates like a 2PC
 Complications
 Multiple proposers can exist at the same time; acceptor has to
choose
 Attach a ballot number
 Multiple proposals may result in split votes with no majority
 Run multiple consensus rounds → performance implication
 Choose a leader
 Some acceptors fail after they accept a decision; the remaining
acceptors may not constitute a majority
 Use ballot numbers

© 2020, M.T. Özsu & P. Valduriez 68


Basic Paxos – No Failures
[Fig.: message flow between the Proposer (or Leader) and the Acceptors]
 The proposer sends prepare(bal) to the acceptors.
 An acceptor records bal and replies with an Ack if bal is greater than any
ballot it has previously received; the Ack carries any previously accepted
proposal (bal', val'). Otherwise it ignores the message.
 Once the proposer has Acks from a majority: if no Ack reports a previous
proposal, it sets val to the value it wants and nbal ← bal; otherwise it
sets val ← val' and nbal ← bal.
 The proposer then sends accept(nbal, val); an acceptor records
accepted(nbal, val) if nbal still matches the ballot it acknowledged,
and ignores the message otherwise.
© 2020, M.T. Özsu & P. Valduriez 69
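
A minimal single-decree acceptor sketch matching the prepare/accept exchange described above; the message tuples and method names are illustrative assumptions, and the proposer side (choosing the value from the highest-ballot Ack of a majority) is omitted.

# Single-decree Paxos acceptor sketch (Python) for the prepare/accept flow above.
class Acceptor:
    def __init__(self):
        self.promised_bal = -1     # highest ballot acknowledged in prepare
        self.accepted = None       # (ballot, value) of the last accepted proposal

    def prepare(self, bal):
        if bal > self.promised_bal:
            self.promised_bal = bal
            return ("ack", self.accepted)    # report any earlier accepted proposal
        return ("ignore", None)

    def accept(self, bal, val):
        if bal >= self.promised_bal:         # still the ballot we promised
            self.promised_bal = bal
            self.accepted = (bal, val)
            return "accepted"
        return "ignore"

a = Acceptor()
print(a.prepare(1))            # ('ack', None)
print(a.accept(1, "commit"))   # 'accepted'
print(a.prepare(0))            # ('ignore', None): an older ballot is ignored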


Basic Paxos with Failures

 Some acceptors fail but there is quorum
 Not a problem
 Enough acceptors fail to eliminate quorum
 Run a new ballot
 Proposer/leader fails
 Choose a new leader and start a new ballot

© 2020, M.T. Özsu & P. Valduriez 70
