Tính Toán Phân Tán
Tính Toán Phân Tán
Tính Toán Phân Tán
CUP 2008
1 / 79
Nodes: network nodes, routers, all end hosts (whether participating or not)
Edges: all LAN, WAN links, direct edges between end hosts
E.g., Fig. 5.1(a) topology + all routers and links in WANs
CUP 2008
2 / 79
Topology Abstractions
WAN
WAN
WAN
WAN
participating process(or)
(a)
CUP 2008
3 / 79
Application execution vs. control algorithm execution, each with own events
I
Control algorithm:
F
F
F
F
for monitoring and auxiliary functions, e.g., creating a ST, MIS, CDS, reaching
consensus, global state detection (deadlock, termination etc.), checkpointing
superimposed on application execution, but does not interfere
its send, receive, internal events are transparent to application execution
a.k.a. protocol
CUP 2008
4 / 79
Anonymous algorithm: process ids or processor ids are not used to make any
execution (run-time) decisions
I
CUP 2008
5 / 79
CUP 2008
6 / 79
I
I
I
F
I
CUP 2008
7 / 79
Synchronous:
F
F
F
CUP 2008
8 / 79
I
I
I
Communication channels
I
CUP 2008
9 / 79
I
I
CUP 2008
10 / 79
CUP 2008
11 / 79
I
I
I
CUP 2008
12 / 79
Program Structure
Communicating Sequential Processes (CSP) like:
[ G1 CL1 || G2 CL2 || || Gk CLk ]
The repetitive command * denotes an infinite loop.
Inside it, the alternative command || is over guarded commands.
Specifies execution of exactly one of its constituent guarded commands.
Guarded command syntax: G CL
guard G is boolean expression,
CL is list of commands to be executed if G is true.
Guard may check for message arrival from another process.
Alternative command fails if all the guards fail; if > 1 guard is true, one is
nondeterministically chosen for execution.
Gm CLm : CLm and Gm atomically executed.
CUP 2008
13 / 79
MST, sync
MST, async
Synchronizers: simple, , ,
Async DFS ST
CDS
CUP 2008
14 / 79
= root then
visited 1;
depth 0;
send QUERY to Neighbors;
round = 1 to diameter do
if visited = 0 then
if any QUERY messages arrive then
parent randomly select a node from which QUERY was received;
visited 1;
depth round;
send QUERY to Neighbors \ {senders of QUERYs received in this round};
delete any QUERY messages that arrived in this round.
CUP 2008
15 / 79
B (2)
(1)
(3)
(2)
(1)
C
QUERY
(3)
(3)
F (2)
E (3)
CUP 2008
16 / 79
Complexity:
Local space: O(degree)
P
Global space: O( local space)
Local time: O(degree + diameter )
Message time complexity: d rounds or message hops
Message complexity: 1, 2 messages/edge. Thus, [l, 2l]
Spanning tree: analogous to breadth-first search
CUP 2008
17 / 79
CUP 2008
18 / 79
CUP 2008
19 / 79
CUP 2008
20 / 79
(1)
B (4)
(1)
C
(3)
QUERY
(5)
(3)
(2)
CUP 2008
21 / 79
B
D
Algorithm:
CUP 2008
22 / 79
CUP 2008
23 / 79
CUP 2008
24 / 79
Handle concurrent initiators just as for the non-DFS algorithm, just examined
When QUERY, ACCEPT, or REJECT arrives: actions depend on whether
myroot < = newroot
Termination: only successful root detects termination. Informs others using
ST edges.
Time complexity: O(l)
Message complexity: O(nl)
CUP 2008
25 / 79
CUP 2008
26 / 79
tree edge
crossedge
initiated by leaves
root
convergecast
initiated by root
broadcast
backedge
CUP 2008
27 / 79
CUP 2008
28 / 79
CUP 2008
29 / 79
(local variables)
int length
int parent
set of int Neighbors set of neighbors
set of int {weighti,j , weightj,i | j Neighbors} the known values of the weights of incident links
(message types)
UPDATE
(1) if i = i0 then length 0;
(2) for round = 1 to n 1 do
(3)
send UPDATE(i, length) to all neighbors;
(4)
await UPDATE(j, lengthj ) from each j Neighbors;
(5)
for each j Neighbors do
(6)
if (length > (lengthj + weightj,i ) then
(7)
length lengthj + weightj,i ; parent j.
CUP 2008
30 / 79
CUP 2008
31 / 79
CUP 2008
32 / 79
(local variables)
int length
set of int Neighbors set of neighbors
set of int {weighti,j , weightj,i | j Neighbors} the known values of the weights of incident links
(message types)
UPDATE
(1) if i = i0 then
(1a)
length 0;
(1b)
send UPDATE(i0 , 0) to all neighbours; terminate.
(2) When UPDATE(i0 , lengthj ) arrives from j:
(2a)
if (length > (lengthj + weightj,i )) then
(2b)
length lengthj + weightj,i ; parent j;
(2c)
send UPDATE(i0 , length) to all neighbors;
CUP 2008
33 / 79
LENGTH[s,t]
LENGTH[s,pivot]
VIA(VIA(s,t), t)
LENGTH[pivot,t]
pivot
VIA(s,t)
s
(b)
(a)
Figure 5.6: (a) Triangle inequality for Floyd-Warshall algorithm. (b) VIA
relationships along a branch of the sink tree for a given (s, t) pair
CUP 2008
34 / 79
Invariant
LENGTH[i, j] is the shortest path going through intermediate nodes from the set
{i, . . . , pivot}. VIA[i, j] is the corresponding first hop.
(1) for pivot = 1 to n do
(2) for s = 1 to n do
(3)
for t = 1 to n do
(4)
if LENGTH[s, pivot] + LENGTH[pivot, t] < LENGTH[s, t] then
(5)
LENGTH[s, t] LENGTH[s, pivot] + LENGTH[pivot, t];
(6)
VIA[s, t] VIA[s, pivot].
CUP 2008
35 / 79
CUP 2008
36 / 79
(local variables)
array of int LEN[1..n]
CUP 2008
37 / 79
CUP 2008
38 / 79
CUP 2008
39 / 79
B
NOT_IN_TREE(pivot)
NOT_IN_TREE(pivot)
NOT_IN_TREE(pivot)
NOT_IN_TREE(pivot)
IN_TREE(pivot)
IN_TREE(pivot)
CUP 2008
40 / 79
CUP 2008
41 / 79
(local variables)
array of int SEQNO[1..n] 0
set of int Neighbors set of neighbors
(message types)
UPDATE
(1) To send a message M:
(1a) if i = root then
(1b)
SEQNO[i] SEQNO[i] + 1;
(1c)
send UPDATE(M, i, SEQNO[i]) to each j Neighbors.
(2) When UPDATE(M, j, seqnoj ) arrives from k:
(2a) if SEQNO[j] < seqnoj then
(2b)
Process the message M;
(2c)
SEQNO[j] seqnoj ;
(2d)
send UPDATE(M, j, seqnoj ) to Neighbors/{k}
(2e) else discard the message.
CUP 2008
42 / 79
CUP 2008
43 / 79
I
I
Prims MST:
I
I
CUP 2008
44 / 79
after k iterations,
n
2k
CUP 2008
45 / 79
(a)
(b)
Figure 5.8: Merging of MWOE components. (a) Cycle len = 2 possible. (b) Cycle
len > 2 not possible.
Observation 5.1
For any spanning forest {(Ni , Li ) | i = 1 . . . k} of graph G , consider any
component (Nj , Lj ). Denote by j , the edge having the smallest weight among
those that are incident on only one node in Nj . Then an MST for G that includes
all the edges in each Li in the spanning forest, must also include edge i .
CUP 2008
46 / 79
MST Example
54
88
112
16
43
27
21
87
44
13
14
34
tree edge
cross edge
11
(MWOE)
16
outedge
root of component
CUP 2008
47 / 79
(message types:)
SEARCH MWOE(leader )
EXAMINE(leader )
REPLY MWOES(local ID, remote ID)
ADD MWOE(local ID, remote ID)
NEW LEADER(leader )
CUP 2008
48 / 79
if leader = i then
broadcast SEARCH MWOE(leader ) along marked edges of tree.
Each process i (including leader ) sends an EXAMINE message along unmarked (i.e., non-tree) edges to determine if
the other end of the edge is in the same component (i.e., whether its leader is the same).
From among all incident edges at i, for which the other end belongs to a different component, process i picks its
incident MWOE(localID,remoteID).
The leaf nodes in the MST within the component initiate the convergecast using REPLY MWOEs, informing their parent of
their MWOE(localID,remoteID). All the nodes participate in this convergecast.
if leader = i then
await convergecast replies along marked edges.
Select the minimum MWOE(localID,remoteID) from all the replies.
broadcast ADD MWOE(localID,remoteID) along marked edges of tree.
// To ask process localID to mark the (localID, remoteID) edge,
// i.e., include it in MST of component.
if an MWOE edge gets marked by both the components on which it is incident then
1
2
3
Define new leader as the process with the larger ID on which that MWOE is incident (i.e., process whose ID is
max(localID, remoteID)).
new leader identifies itself as the leader for the next round.
new leader broadcasts NEW LEADER in the newly formed component along the marked edges announcing itself as
the leader for the next round.
CUP 2008
49 / 79
GHS: Complexity
log n rounds (synchronous)
Time complexity: O(n log n)
Message complexity:
I
I
CUP 2008
50 / 79
MST (asynchronous)
Send EXAMINE when all nbhs along unmarked edges in same round
l log n extra messages overall
CUP 2008
51 / 79
Synchronizers
Definition
Class of transformation algorithms that allow a synchronous program (designed for
a synchronous system) to run on asynchronous systems.
Assumption: failure-free system
Designing tailor-made async algo from scratch may be more efficient than
using synchronizer
Process safety
Process i is safe in round r if all messages sent by i have been received.
Implementation key: signal to each process when it is safe to go to next round,
i.e., when all msgs to be received have arrived
CUP 2008
52 / 79
Synchronizers: Notation
(1)
(2)
CUP 2008
53 / 79
Synchronizers: Complexity
Minit
Tinit
Mround
Tround
Simple synchronizer
synchronizer
synchronizer
synchronizer
0
d
2|L|
1
0
0
O(|L|)
O(1)
O(kn2 )
n log (n)/log (k)
O(Lc ) ( O(kn))
O(hc ) ( O(log (n)/log (k)))
CUP 2008
54 / 79
Simple Synchronizer
CUP 2008
55 / 79
Synchronizer
Pi in round r moves to r + 1 if all neighbors are safe for round r .
When neighbor Pj receives ack for each message it sent, it informs Pi (and
its other neighbors) that it is safe.
B
E2
3
3
3
3
D
execution message
D
acknowledgement
(a)
"safe"
(b)
Figure 5.10: Example. (a) Execution msgs (1) and acks (2). (b) I am safe msgs (3).
CUP 2008
56 / 79
Synchronizer: Complexity
Complexity:
I
I
CUP 2008
57 / 79
Synchronizer
Initialization: rooted spanning tree, O(n log n + |L|) messages, O(n) time.
Operation:
Safe nodes initiate convergecast (CvgC)
intermediate nodes propagate CvgC when their subtree is safe.
When root becomes safe and receives CvgC from all children, initiates tree
broadcast to inform all to move to next round.
Complexity: l 0 acks for free, due to transport layer.
Mround = 2(n 1)
Tround = 2 log n average; 2n worst case
CUP 2008
58 / 79
Synchronizer: Clusters
Set of clusters; each cluster has a spanning tree
Intra-cluster: synchronizer over tree edges
Inter-cluster: synchronizer over designated inter-cluster edges. (For 2
neighboring clusters, 1 inter-cluster edge is designated.)
A
tree edge
designated (intercluster) edge
root
Figure 5.11: Cluster organization. Only tree edges and inter-cluster designated edges are shown.
CUP 2008
59 / 79
CUP 2008
60 / 79
Synchronizer: Code
(message types)
Subtree safe
This cluster safe
My cluster safe
Neighboring cluster safe
Next round
1
2
Using the spanning tree, leaves initiate the convergecast of the Subtree safe message towards the root of the cluster.
After the convergecast completes, the root initiates a broadcast of This cluster safe on the spanning tree within the
cluster.
(Embedded synchronizer:)
1
During this broadcast in the tree, as the nodes get engaged, the nodes also send My cluster safe messages
on any incident designated inter-cluster edges.
Each node also awaits My cluster safe messages along any such incident designated edges.
(Convergecast and broadcast phase:) This phase aims to detect when all neighboring clusters are safe, and to inform every
node within this cluster.
1
(Convergecast:)
1
After the broadcast of the earlier phase (1.2) completes, the leaves initiate a convergecast using
Neighboring cluster safe messages once they receive any expected My cluster safe messages (step (1.3))
on all the designated incident edges.
An intermediate node propagates the convergecast once it receives the Neighboring cluster safe message
from all its children, and also any expected My cluster safe message (as per step (1.3)) along designated
edges incident on it.
(Broadcast:) Once the convergecast completes at the root of the cluster, a Next round message is broadcast in the
clusters tree to inform all the tree nodes to move to the next round.
CUP 2008
61 / 79
CUP 2008
62 / 79
Iteratively:
Nodes pick random nos, exchange with nbhs
Lowest number in neighborhood wins (selected in MIS)
If neighbor is selected, I am eliminated ( safety)
Only neighbors of selected nodes are eliminated ( correctness)
Complexity:
In each iteration, 1 selected, 1 eliminated n/2 iterations.
Expected # iterations O(log , n) due to randomized nature.
CUP 2008
63 / 79
// set of neighbours
// random number from a sufficiently large range
// becomes true when Pi is included in the MIS
// becomes true when Pi is eliminated from the candidate set
// a random number is sent
// whether sender was selected in MIS
// whether sender was removed from candidates
(1a) repeat
(1b) if Neighbours = then
(1c)
selectedi true; exit();
(1d) randomi a random number;
(1e) send RANDOM(randomi ) to each neighbour;
(1f) await RANDOM(randomj ) from each neighbour j Neighbours;
(1g) if randomi < randomj (j Neighbours) then
(1h)
send SELECTED(i, true) to each j Neighbours;
(1i)
selectedi true; exit(); // in MIS
(1j) else
(1k)
send SELECTED(i, false) to each j Neighbours;
(1l)
await SELECTED(j, ?) from each j Neighbours;
(1m)
if SELECTED(j, true) arrived from some j Neighbours then
(1n)
for each j Neighbours from which SELECTED(?, false) arrived do
(1o)
send SELECTED(i, true) to j;
(1p)
eliminatedi true; exit(); // not in MIS
(1q)
else
(1r)
send ELIMINATED(i, false) to each j Neighbours;
(1s)
await ELIMINATED(j, ?) from each j Neighbours;
(1t)
for all j Neighbours do
(1u)
if ELIMINATED(j, true) arrived then
(1v)
Neighbours Neighbours \ {j};
(1w) forever.
CUP 2008
64 / 79
B
6
2
D
1
J
0
I
D
9
J
K
1
(b)
(a)
Figure 5.12: (a) Winners and losers in round 1. (b) Winners up to round 1, losers in round 2.
CUP 2008
65 / 79
CUP 2008
66 / 79
4
13
47
11
57
27
14
6
55
33
42
64
77
16
CUP 2008
67 / 79
Interval routing:
I
I
I
I
Prefix routing: Node, channel labels from same domain, view as strings
I
CUP 2008
68 / 79
CUP 2008
69 / 79
Leader Election
CUP 2008
70 / 79
CUP 2008
71 / 79
CUP 2008
72 / 79
P
Evaluate all choices for Nr to identify min( iNk ,ri Nr disti,ri ), where disti,ri is
the cost from node i to ri , the replica nearest to i.
If Read accesses from each user in Nk have a certain frequency (or weight),
the minimization function changes.
Address BW of each edge.
Assume user access is a Read with prob. x, and an Update with prob. 1 x.
Update requires all replicas to be updated.
I
CUP 2008
73 / 79
iV
iV
CUP 2008
74 / 79
CUP 2008
75 / 79
Rfringe
C
A
E
D
Rneighbour
Rneighbour
and
Rfringe
C is R-fringe
A, E are R-fringe and R-neighbour
D is R-neighbour
CUP 2008
76 / 79
r+w
w
(a)
j
i
r
(b)
r+w
(c)
CUP 2008
77 / 79
CUP 2008
78 / 79
CUP 2008
79 / 79