Programming project
Dijkstra's algorithm

Summer school: Introduction to High Performance Computing
Center for High Performance Computing & KTH School of Computer Science and Communication

Salvator Gkalea
Department ICT/ES, KTH
salvator@kth.se

Dionysios Zelios
Department of Physics, Stockholm University
dionisis.zel@gmail.com

Tutor: Stefano Markidis
Date: October 2014
Contents
Abstract
Introduction
Dijkstra's Parallelization
Simulation results and analysis
References
Abstract
Dijkstra's algorithm is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge costs, producing a shortest path tree. The algorithm is often used in routing and as a subroutine in other graph algorithms. In this project we investigate the parallelization of this algorithm and its speedup against the sequential version. The Message Passing Interface (MPI) is used to parallelize the single-source shortest path algorithm.
Introduction
Dijkstra's algorithm determines the shortest path from a single source vertex to every other vertex. It can also be used to find the cost of the shortest path from a single vertex to a single destination vertex, by stopping the algorithm once the shortest path to the destination vertex has been determined. For instance, if the vertices of the graph represent cities and the edge costs represent driving distances between pairs of cities connected by a direct road, Dijkstra's algorithm can be used to find the shortest route between one city and all other cities.
A pseudo-code representation of Dijkstra's sequential single-source shortest-paths algorithm is given below. The algorithm proceeds as follows:
1.  function Dijkstra(V, E, w, s)
2.  begin
3.      VT := {s};
4.      for all v in (V - VT) do
5.          if (s, v) exists then l[v] := w(s, v);
6.          else l[v] := infinity;
        end for
7.      while VT != V do
8.      begin
9.          find a vertex u such that l[u] := min{ l[v] | v in (V - VT) };
10.         VT := VT union {u};
11.         for all v in (V - VT) do
12.             l[v] := min{ l[v], l[u] + w(u, v) };
13.     end while
14. end Dijkstra
Suppose we are given a weighted graph G = (V, E, w), where V is the set of vertices, E the set of edges and w the edge-weight function. The pseudo-code above computes the shortest path from the source vertex s to every other vertex v. For every vertex v in (V - VT), the algorithm stores in the l[] array the minimum known cost to reach v from s. In lines 4-6 the procedure initialises l[] with the weight of the edge from s to each vertex (or infinity if no such edge exists). In the remaining lines, the algorithm repeatedly finds the vertex u closest to s, based on these weights, among all vertices not yet examined, and then updates the minimum distances in the l[] array.
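For reference, the sketch below is a minimal sequential C version of this procedure, assuming the graph is stored as an n x n adjacency matrix in row-major order with a large sentinel value INF for missing edges; the function and variable names are illustrative and are not part of the project code.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Sequential Dijkstra from source s on an n-vertex graph stored as a
   row-major adjacency matrix w (w[u*n + v] = weight of edge (u, v),
   INF if the edge does not exist). l[v] receives the shortest distance. */
static void dijkstra_seq(int n, const unsigned *w, int s, unsigned *l, unsigned INF)
{
    int *done = calloc(n, sizeof(int));
    int u, v, iter;
    for (v = 0; v < n; v++)              /* lines 4-6: l[v] := w(s, v) or infinity */
        l[v] = w[s * n + v];
    l[s] = 0;
    done[s] = 1;                         /* line 3: VT := {s} */
    for (iter = 1; iter < n; iter++) {
        u = -1;                          /* line 9: closest unfinished vertex */
        for (v = 0; v < n; v++)
            if (!done[v] && (u == -1 || l[v] < l[u]))
                u = v;
        if (u == -1 || l[u] == INF)
            break;                       /* remaining vertices are unreachable */
        done[u] = 1;                     /* line 10: VT := VT union {u} */
        for (v = 0; v < n; v++)          /* lines 11-12: relax edges out of u */
            if (!done[v] && w[u * n + v] != INF && l[u] + w[u * n + v] < l[v])
                l[v] = l[u] + w[u * n + v];
    }
    free(done);
}

int main(void)
{
    const unsigned INF = UINT_MAX / 2;   /* large sentinel, safe against overflow */
    enum { N = 4 };
    unsigned w[N * N], l[N];
    int i, v;
    for (i = 0; i < N * N; i++) w[i] = INF;
    for (i = 0; i < N; i++) w[i * N + i] = 0;
    /* small undirected example: 0-1 (2), 0-2 (5), 1-2 (1), 2-3 (3) */
    w[0 * N + 1] = w[1 * N + 0] = 2;
    w[0 * N + 2] = w[2 * N + 0] = 5;
    w[1 * N + 2] = w[2 * N + 1] = 1;
    w[2 * N + 3] = w[3 * N + 2] = 3;
    dijkstra_seq(N, w, 0, l, INF);
    for (v = 0; v < N; v++)
        printf("dist(0, %d) = %u\n", v, l[v]);   /* expected: 0 2 3 6 */
    return 0;
}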
The main idea behind parallelizing this algorithm is that every process can find, among the vertices it owns, the one that is closest to the source. A reduction is then performed to determine the globally closest vertex, and this selection is broadcast to all the other processes. Finally, each process updates its local part of the distance array ( l[] array ).
The parallel computation can be achieved with a few basic MPI communication functions. These functions are described below and are used to parallelize the single-source shortest-paths algorithm.
MPI_Init        Initializes the parallel environment
MPI_Finalize    Ends the parallel environment and releases system resources
MPI_Comm_size   Returns the number of parallel computing units (processes)
MPI_Comm_rank   Returns the identification number (rank) of the calling process
MPI_Reduce      Reduces values on all processes to a single value
MPI_Bcast       Broadcasts a message from the process with rank "root" to all other processes of the communicator
MPI_Gather      Gathers together values from a group of processes
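To illustrate how the reduction and broadcast fit together before looking at the full program, the small self-contained example below (with illustrative values standing in for the local closest-vertex distances) lets every process propose a {distance, vertex} pair, combines the pairs with MPI_Reduce and MPI_MINLOC, and broadcasts the winner with MPI_Bcast.

#include <stdio.h>
#include <mpi.h>

/* Minimal demonstration of the reduce-then-broadcast step of the parallel
   algorithm: every process proposes a {value, index} pair, MPI_MINLOC keeps
   the pair with the smallest value, and the winner is sent to everyone. */
int main(int argc, char **argv)
{
    int rank, localMin[2], globalMin[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    localMin[0] = (rank * 7 + 3) % 11;   /* stand-in for the local minimum distance */
    localMin[1] = rank;                  /* stand-in for the vertex that attains it */

    /* MPI_2INT + MPI_MINLOC: reduce (value, index) pairs, keeping the minimum */
    MPI_Reduce(localMin, globalMin, 1, MPI_2INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
    /* make the selected pair known to every process */
    MPI_Bcast(globalMin, 1, MPI_2INT, 0, MPI_COMM_WORLD);

    printf("process %d: global minimum %d proposed by process %d\n",
           rank, globalMin[0], globalMin[1]);

    MPI_Finalize();
    return 0;
}

The same step could also be expressed with a single MPI_Allreduce using MPI_MINLOC, which makes the result available on all processes without a separate broadcast; the program in the next section keeps the Reduce-plus-Bcast pair described in the text.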
Dijkstra's Parallelization
The pseudo-code above can be parallelized around line 9, where the minimum distance among all remaining vertices is computed, or around lines 11-12, where the algorithm relaxes the distances and stores them in the local minimum-distance array.
The classic approach is to break the data into segments, so that each node is responsible for processing and computing one segment. At the end, the results from all segments are gathered on one node. The C code below demonstrates this approach.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mpi.h>

int numV,            /* number of vertices                             */
    *todo,           /* vertices still to be analysed (1 = pending)    */
    numNodes,        /* number of nodes (MPI processes)                */
    chunk,           /* number of vertices handled by every node       */
    start, end,      /* first and last vertex owned by this node       */
    myNode;          /* ID (rank) of this node                         */
unsigned maxInt,     /* "infinity" sentinel                            */
    *graph,          /* local rows of the weight matrix (chunk x numV) */
    *minDistance;    /* minimum tentative distances                    */
int localMin[2],     /* [0]: local min distance,  [1]: its vertex; int to match MPI_2INT */
    globalMin[2];    /* [0]: global min distance, [1]: its vertex                        */
double T1, T2;       /* start and end time                             */

/* Generation of the graph: every node fills in the rows of the weight
   matrix for the vertices it owns with random weights in 1..20. Since
   each node draws its rows independently, the graph is effectively a
   random directed graph (symmetry is not enforced across nodes). */
void init(int ac, char **av) {
    int i, j;
    numV = atoi(av[1]);
    MPI_Init(&ac, &av);
    MPI_Comm_size(MPI_COMM_WORLD, &numNodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myNode);
    chunk = numV / numNodes;        /* numV is assumed divisible by numNodes */
    start = myNode * chunk;
    end = start + chunk - 1;
    maxInt = INT_MAX;               /* large enough to act as infinity      */
    graph = malloc((size_t)chunk * numV * sizeof(unsigned));
    minDistance = malloc(numV * sizeof(unsigned));
    todo = malloc(numV * sizeof(int));
    srand(myNode + 1);
    for (i = 0; i < chunk; i++) {
        for (j = 0; j < numV; j++) {
            if (start + i == j)
                graph[(size_t)i * numV + j] = 0;             /* no self-loops */
            else
                graph[(size_t)i * numV + j] = rand() % 20 + 1;
        }
    }
    for (i = 0; i < numV; i++) {
        todo[i] = 1;
        minDistance[i] = maxInt;
    }
    minDistance[0] = 0;             /* vertex 0 is the source */
}

int main(int ac, char **av) {
    int i, j, step;
    /* Initialization process */
    init(ac, av);
    if (myNode == 0)
        T1 = MPI_Wtime();
    for (step = 0; step < numV; step++) {
        /* Find the local minimum distance among this node's vertices */
        localMin[0] = maxInt;
        localMin[1] = start;
        for (i = start; i <= end; i++) {
            if (todo[i] && minDistance[i] < (unsigned)localMin[0]) {
                localMin[0] = minDistance[i];
                localMin[1] = i;
            }
        }
        /* Combine the local minima into the global minimum on node 0,
           then broadcast the selected vertex to every node */
        MPI_Reduce(localMin, globalMin, 1, MPI_2INT, MPI_MINLOC,
                   0, MPI_COMM_WORLD);
        MPI_Bcast(globalMin, 1, MPI_2INT, 0, MPI_COMM_WORLD);
        /* Mark that vertex as processed */
        todo[globalMin[1]] = 0;
        /* Update (relax) the local part of the minimum-distance array,
           reading the weight to the selected vertex from the local row */
        if ((unsigned)globalMin[0] < maxInt) {
            for (j = 0; j < chunk; j++) {
                unsigned cand = globalMin[0]
                              + graph[(size_t)j * numV + globalMin[1]];
                if (cand < minDistance[start + j])
                    minDistance[start + j] = cand;
            }
        }
    }
    /* Collect every node's part of the distance array on node 0 */
    if (myNode == 0)
        MPI_Gather(MPI_IN_PLACE, chunk, MPI_UNSIGNED, minDistance,
                   chunk, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
    else
        MPI_Gather(minDistance + start, chunk, MPI_UNSIGNED, NULL,
                   chunk, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
    if (myNode == 0) {
        T2 = MPI_Wtime();
        /* optional check of the result (disabled for the timing runs)
        for (i = 1; i < numV; i++)
            printf("node[%d] = %u\n", i, minDistance[i]);
        */
        printf("elapsed: %f\n", T2 - T1);
    }
    MPI_Finalize();
    return 0;
}
Node 0 acts as the master: the local minima found by all the nodes are combined on node 0 (MPI_Reduce with MPI_MINLOC), and node 0 then broadcasts (MPI_Bcast) the resulting global minimum distance to all the other nodes, which act as receivers waiting for it. As mentioned before, each node is responsible for processing one segment of the whole data set and for checking the broadcast minimum against its local results, storing the smaller of the two. At the end, the results from all the nodes are gathered (MPI_Gather) on node 0.
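Assuming the source file is saved as dijkstra.c, it can, for example, be compiled with mpicc -O2 dijkstra.c -o Dijkstra.out and launched with mpirun -np 8 ./Dijkstra.out 2000 (or with aprun on the Cray system used below); here 2000 is an illustrative vertex count, and the number of vertices must be divisible by the number of processes because chunk = numV / numNodes has no remainder handling.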
Simulation results and analysis
The experiments were conducted on Milner with the Intel compiler.
Time measurements
First, we ran our algorithm for different mesh sizes. Each time we varied the number of cores and measured the time needed for the algorithm to run.
The measurements are shown in the table below. The runs were executed as
aprun -n Nodes ./Dijkstra.out mesh
where Nodes = 1..32 and mesh is the size of the graph (e.g. mesh = 4*10^6 gives the number of vertices).
Running time (sec), Nodes = 1..32:

Nodes            1          2          4          8          16         32
mesh = 4*10^6    0.009083   0.010213   0.007118   0.008694   0.015567   0.034436
mesh = 64*10^6   0.14132    0.150809   0.092254   0.062077   0.105193   0.1638634
mesh = 256*10^6  0.564806   0.597349   0.352130   0.202681   0.189242   0.200911
mesh = 900*10^6  1.984156   2.077290   1.213774   0.874406   0.85756    1.056482
Plotting the values above in the same diagram shows the following behaviour. For the larger graphs the running time decreases as the number of cores increases until it reaches a minimum; beyond that point the running time grows again because of the communication latency. For medium-sized graphs the reduction in running time is less pronounced, and for the smallest graph the running time actually increases with the number of cores, because the communication latency outweighs the benefit of using more cores.
Speedup measurements
In the next step, we investigate the speedup of our algorithm. To obtain it, we divide the running time on one node by the running time measured on multiple nodes. The measurements and the corresponding plot are given below:
Speedup:

Nodes            1    2          4          8          16         32
mesh = 4*10^6    1    0.889356   1.276060   1.044743   0.583477   0.263764
mesh = 64*10^6   1    0.9370793  1.531857   2.27652    1.3434353  0.862425
mesh = 256*10^6  1    0.945520   1.60397    2.78667    2.98457    2.81122
mesh = 900*10^6  1    0.9551     1.63469    2.269147   2.3137     1.8780
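The speedup values above are obtained as S(N) = T(1) / T(N), the single-node time divided by the time on N nodes. For example, for mesh = 4*10^6 and N = 4: S(4) = 0.009083 / 0.007118 ≈ 1.276, which is the value listed in the table.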
For each graph size, the speedup increases with the number of cores until it reaches a maximum value and then decreases. The decrease occurs because, beyond that point, the communication latency outweighs the benefit of using more cores.
As the graph size increases, the number of cores needed to reach the maximum speedup also increases.
Cost measurements
In addition, we investigate the cost as a function of the number of cores used. The cost is obtained by multiplying the number of cores by the execution time measured with that number of cores. The measurements and the corresponding diagram are presented below:
Cost (sec):

Nodes            1          2          4          8          16         32
mesh = 4*10^6    1.87807    0.020426   0.028472   0.069552   0.249072   1.101952
mesh = 64*10^6   0.14132    0.301618   0.369016   0.496616   1.683088   5.2436288
mesh = 256*10^6  0.564806   1.194698   1.40852    1.621448   3.027872   6.429152
mesh = 900*10^6  1.984156   4.15458    4.855096   6.995248   13.72096   36.97687
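Each entry above is the parallel cost C(N) = N * T(N), i.e. the number of cores multiplied by the running time measured on that many cores. For example, for mesh = 4*10^6 and N = 32: C(32) = 32 * 0.034436 = 1.101952, matching the table.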
The cost increases with the number of cores because the speedup (the benefit of a reduced running time) does not grow fast enough to compensate for the additional cores being used.
References
1) Wikipedia: http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
2) Lecture notes by Erwin Laure: http://agenda.albanova.se/conferenceDisplay.py?confId=4384
3) Han, Xiao Gang, Qin Lei Sun, and Jiang Wei Fan. "Parallel Dijkstra's Algorithm Based on Multi-Core and MPI." Applied Mechanics and Materials 441 (2014): 750-753.
4) http://www.mpich.org/
5) http://www.open-mpi.org/
6) Crauser, Andreas, et al. "A parallelization of Dijkstra's shortest path algorithm." Mathematical Foundations of Computer Science 1998. Springer Berlin Heidelberg, 1998. 722-731.
7) Meyer, Ulrich, and Peter Sanders. "Δ-stepping: a parallel single source shortest path algorithm." Algorithms - ESA '98. Springer Berlin Heidelberg, 1998. 393-404.
8) http://www.inf.ed.ac.uk/publications/thesis/online/IM040172.pdf