INF3380: Floyd's algorithm


Floyd's algorithm


Overview
Chapter 6 from Michael J. Quinn, Parallel Programming in C with MPI and
OpenMP

Floyd's algorithm: solving the all-pairs shortest-path problem


Finding shortest paths


Starting point: a graph of vertices and weighted edges

Each edge has a direction and a length:
- if there is a path from vertex i to j, there may not be a path from vertex j to i
- the path length from vertex i to j may differ from the path length from vertex j to i

Objective: find the shortest path between every pair of vertices (i, j)

Application: a table of driving distances between city pairs


Adjacency matrix
There are n vertices. The direct path length from vertex i to vertex j is stored as a[i, j]. An n × n adjacency matrix a keeps the entire connectivity info:

         0    1    2    3    4    5
    0    0    2    5    ∞    ∞    ∞
    1    ∞    0    7    1    ∞    8
    2    ∞    ∞    0    4    ∞    ∞
    3    ∞    ∞    ∞    0    3    ∞
    4    ∞    ∞    2    ∞    0    3
    5    ∞    5    ∞    2    4    0

If a[i, j] is ∞, it means there is no direct path from vertex i to vertex j.
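As a concrete illustration (not on the original slide), the matrix above could be hard-coded in C, with a large sentinel value INF standing in for ∞; INF is an assumption here, chosen small enough that INF plus any edge weight cannot overflow:

#include <limits.h>

#define INF (INT_MAX / 2)   /* "infinity": large, but still safe to add to */

/* The 6 x 6 adjacency matrix from the slide; INF marks a missing edge. */
int a[6][6] = {
    {  0,   2,   5, INF, INF, INF},
    {INF,   0,   7,   1, INF,   8},
    {INF, INF,   0,   4, INF, INF},
    {INF, INF, INF,   0,   3, INF},
    {INF, INF,   2, INF,   0,   3},
    {INF,   5, INF,   2,   4,   0},
};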


Example of all-pairs shortest path


For the adjacency matrix given on the previous slide, the solution of the all-pairs shortest-path problem is as follows:

         0    1    2    3    4    5
    0    0    2    5    3    6    9
    1    ∞    0    6    1    4    7
    2    ∞   15    0    4    7   10
    3    ∞   11    5    0    3    6
    4    ∞    8    2    5    0    3
    5    ∞    5    6    2    4    0

Table of shortest path lengths. For example, the shortest path from vertex 0 to vertex 5 is 0 → 1 → 3 → 4 → 5, of total length 2 + 1 + 3 + 3 = 9; no vertex can reach vertex 0, hence the ∞ entries in column 0.


Floyd's algorithm
Input: n (number of vertices), a (adjacency matrix)
Output: transformed a that contains the shortest path lengths

for k ← 0 to n−1
    for i ← 0 to n−1
        for j ← 0 to n−1
            a[i, j] ← min(a[i, j], a[i, k] + a[k, j])
        endfor
    endfor
endfor
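A direct C translation of this pseudocode might look as follows (a sketch; the function name floyd and the flat row-by-row array layout are illustrative choices, not from the slides):

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* In-place Floyd's algorithm on an n x n matrix stored row by row. */
void floyd(int *a, int n)
{
    int i, j, k;
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                a[i*n + j] = MIN(a[i*n + j], a[i*n + k] + a[k*n + j]);
}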


Some observations
Floyd's algorithm is an exhaustive and incremental approach:
- the entries of the a-matrix are updated in n rounds
- a[i, j] is compared with all n possibilities, that is, against a[i, k] + a[k, j] for 0 ≤ k ≤ n−1
- n^3 comparisons in total


Source of parallelism
During the k-th iteration, the work is (in C syntax):
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k] + a[k][j]);

Can all the entries of a be updated concurrently? Yes, because the k-th row and the k-th column remain the same during the k-th iteration:
- a[i][k] = MIN(a[i][k], a[i][k] + a[k][k]) gives back a[i][k], since a[k][k] = 0
- a[k][j] = MIN(a[k][j], a[k][k] + a[k][j]) gives back a[k][j]
The updates within one iteration are therefore independent of each other, as the sketch below illustrates.
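Because of this independence, the two inner loops can for instance be parallelized with OpenMP (a sketch, not from the slides; it assumes a, n, k and the MIN macro are already in scope):

/* All (i,j) updates of iteration k are independent and can run in parallel. */
#pragma omp parallel for private(j)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k] + a[k][j]);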


Design of a parallel algorithm


Using Foster's design methodology:

Partitioning: each a[i, j] is a primitive task

Communication: during the k-th iteration, updating a[i, j] requires the values of a[i, k] and a[k, j], so
- broadcast a[k, j] to a[0, j], a[1, j], ..., a[n−1, j]
- broadcast a[i, k] to a[i, 0], a[i, 1], ..., a[i, n−1]


Agglomeration and mapping


Let one MPI process be responsible for a piece of the a matrix; the memory storage of a is divided accordingly. The division can in principle be arbitrary, as long as the a[i, j] entries are divided evenly among the processes. However, a row-wise block division of the data is very convenient:
- 2D arrays in C are row-major, so it is easy to send/receive an entire row of a
We therefore choose to assign each MPI process a number of consecutive rows of a.


Communication pattern
Recall the update in the k-th iteration: a[i, j] ← min(a[i, j], a[i, k] + a[k, j])
- Since the entries of a are divided into row-wise blocks, a[i, k] is in the memory of the same MPI process that owns a[i, j]
- However, a[k, j] is probably in another MPI process's memory
- Communication is therefore needed: before the k-th iteration, the MPI process that owns the k-th row of the a matrix must broadcast this row to everyone else


Recap: creating 2D arrays in C


To create a 2D array with m rows and n columns:

int **B, *Bstorage, i;
...
Bstorage = (int*)malloc(m*n*sizeof(int));
B = (int**)malloc(m*sizeof(int*));
for (i = 0; i < m; i++)
    B[i] = &Bstorage[i*n];
The underlying storage is contiguous, making it possible to send and receive an entire 2D array.
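For completeness (not on the original slide), the matching deallocation releases both allocations:

free(B);          /* the array of row pointers      */
free(Bstorage);   /* the contiguous element storage */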


Global index vs. local index


Suppose a matrix (2D array) is divided into row-wise blocks and distributed among p MPI processes. Process i only allocates storage for its assigned row block, from row (i·n)/p of matrix a until row ((i+1)·n)/p − 1 (integer division).

We need to know: which global row does a local row correspond to?

Mapping: local index → global index. On the process with rank proc_id:

    global_index = BLOCK_LOW(proc_id, p, n) + local_index
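BLOCK_LOW and the related macros used in the code on the next slide come from Quinn's book (the MyMPI.h utilities); they are defined along these lines:

#define BLOCK_LOW(id,p,n)   ((id)*(n)/(p))                /* first global row owned by process id */
#define BLOCK_HIGH(id,p,n)  (BLOCK_LOW((id)+1,p,n)-1)     /* last global row owned by process id  */
#define BLOCK_SIZE(id,p,n)  (BLOCK_HIGH(id,p,n)-BLOCK_LOW(id,p,n)+1)  /* rows owned by process id */
#define BLOCK_OWNER(k,p,n)  (((p)*((k)+1)-1)/(n))         /* rank of process owning global row k  */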


Main work of parallel Floyd's algorithm


void compute_shortest_paths (int id, int p, dtype **a, int n)
{
    int i, j, k;
    int offset;    /* Local index of broadcast row         */
    int root;      /* Process controlling row to be bcast  */
    dtype *tmp;    /* Holds the broadcast row              */

    tmp = (dtype *) malloc (n * sizeof(dtype));
    for (k = 0; k < n; k++) {
        root = BLOCK_OWNER(k,p,n);
        if (root == id) {
            offset = k - BLOCK_LOW(id,p,n);
            for (j = 0; j < n; j++)
                tmp[j] = a[offset][j];
        }
        MPI_Bcast (tmp, n, MPI_TYPE, root, MPI_COMM_WORLD);
        for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
            for (j = 0; j < n; j++)
                a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
    }
    free (tmp);
}


Matrix input
Recall that each MPI process only stores a part of the a matrix. When reading a from a file, we can let only process p−1 do the input:
- once the rows needed by process i have been read in, they are sent from process p−1 to process i using MPI_Send
- process i must issue a matching MPI_Recv
This simple strategy is not parallel; truly parallel I/O can be done using MPI-2 commands. A sketch of the strategy follows below.
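A minimal sketch of the distribution loop, under illustrative assumptions: id is this process's rank, f is an already opened file, astorage is the contiguous local storage, buffer is a scratch buffer large enough for any block, and read_rows is a hypothetical helper that reads a given number of n-entry rows:

if (id == p-1) {
    for (i = 0; i < p-1; i++) {
        read_rows(f, buffer, BLOCK_SIZE(i,p,n), n);  /* rows destined for process i */
        MPI_Send(buffer, BLOCK_SIZE(i,p,n)*n, MPI_INT, i, 0, MPI_COMM_WORLD);
    }
    read_rows(f, astorage, BLOCK_SIZE(p-1,p,n), n);  /* finally, its own rows */
} else {
    MPI_Recv(astorage, BLOCK_SIZE(id,p,n)*n, MPI_INT, p-1, 0,
             MPI_COMM_WORLD, &status);
}

Letting the last process do the reading means the blocks are consumed in file order and the final block is the reader's own, so one scratch buffer suffices.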


Matrix output
For example, we let only process 0 do the output. Each process needs to send its part of a to process 0. To avoid many processes sending their entire sub-data to process 0 at the same time:
- process 0 communicates with the other processes in turn
- each process waits for a hint (a short message) from process 0 before sending its data (a large message)
A sketch follows below.
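A sketch of this handshake, under the same illustrative assumptions as the input sketch (print_rows is a hypothetical helper that writes rows to the output; dummy is an int used only as the hint):

if (id == 0) {
    print_rows(astorage, BLOCK_SIZE(0,p,n), n);             /* own rows first */
    for (i = 1; i < p; i++) {
        MPI_Send(&dummy, 1, MPI_INT, i, 0, MPI_COMM_WORLD); /* the hint */
        MPI_Recv(buffer, BLOCK_SIZE(i,p,n)*n, MPI_INT, i, 0,
                 MPI_COMM_WORLD, &status);
        print_rows(buffer, BLOCK_SIZE(i,p,n), n);
    }
} else {
    MPI_Recv(&dummy, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* wait for hint */
    MPI_Send(astorage, BLOCK_SIZE(id,p,n)*n, MPI_INT, 0, 0, MPI_COMM_WORLD);
}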


Deadlock
Typical deadlock example 1

if (rank==0) {
    MPI_Recv(&b, 1, MPI_INT, 1, tag_b, MPI_COMM_WORLD, &status);
    MPI_Send(&a, 1, MPI_INT, 1, tag_a, MPI_COMM_WORLD);
}
else if (rank==1) {
    MPI_Recv(&a, 1, MPI_INT, 0, tag_a, MPI_COMM_WORLD, &status);
    MPI_Send(&b, 1, MPI_INT, 0, tag_b, MPI_COMM_WORLD);
}

Typical deadlock example 2


if (rank==0) {
    MPI_Send(&a, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Recv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
}
else if (rank==1) {
    MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
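In example 1 both processes block in MPI_Recv before either reaches its MPI_Send; in example 2 both MPI_Send calls may block for messages too large to be buffered (and the tags do not match either). One common remedy, sketched here with the tags from example 1 (not from the slides), is MPI_Sendrecv, which pairs the send and the receive safely:

if (rank==0) {
    MPI_Sendrecv(&a, 1, MPI_INT, 1, tag_a,   /* what we send   */
                 &b, 1, MPI_INT, 1, tag_b,   /* what we expect */
                 MPI_COMM_WORLD, &status);
}
else if (rank==1) {
    MPI_Sendrecv(&b, 1, MPI_INT, 0, tag_b,
                 &a, 1, MPI_INT, 0, tag_a,
                 MPI_COMM_WORLD, &status);
}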

Analysis
Serial algorithm time usage: n^3

Parallel algorithm:
- non-communication time usage: n^2 · ⌈n/p⌉
- communication (broadcast) time usage: n · ⌈log2 p⌉ · (λ + 4n/β)
  - assuming each entry of matrix a needs 4 bytes
  - assuming λ is the communication latency
  - assuming β is the communication bandwidth (number of bytes per second)

Read Section 6.7 for a more detailed analysis that allows overlap between computation and communication.


Exercises
1. Write an MPI program that uses p processes to produce a JPEG picture of n × n pixels. The picture should have a white background and a black circle (of radius n/4) in the middle. (The existing C code collection
http://heim.ifi.uio.no/xingca/inf-verk3830/simple-jpeg.tar.gz

can be used.)

2. Implement the complete Floyd's algorithm and try it on a large enough adjacency matrix.

