Parallel Methods For PDEs
Partial differential equations are widely used in various scientific and technical fields. Unfortunately, an analytical solution of these equations is possible only in some special cases, so approximate numerical methods are usually applied. The amount of computations required by such methods is usually significant, and the use of high-performance systems is traditional for this area of computational mathematics. The numerical solution of partial differential equations is a subject of intensive research (see, for instance, Fox, et al. (1988)).
Let us consider the numerical solution of the Dirichlet problem for the Poisson equation. It is defined as the problem of finding a function u = u(x, y) that satisfies in the domain D the following equation:
$$
\begin{cases}
\dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2} = f(x, y), & (x, y) \in D, \\
u(x, y) = g(x, y), & (x, y) \in D^0,
\end{cases}
$$
and takes the value g(x, y) at the boundary D^0 of the domain D, where the continuous functions f and g are given. Such a model can be used for describing steady liquid flow, stationary thermal fields, heat transfer processes with internal heat sources, elastic plate deformation, etc. This problem is often used as a training problem for demonstrating various aspects of parallel computations (see, for instance, Fox, et al. (1988), Pfister (1998)).
For simplicity, hereinafter the unit square

$$ D = \{ (x, y) : 0 \le x, y \le 1 \} $$

will be taken as the problem domain.
This Section has been written based essentially on the teaching materials given in Pfister (1995).
12.1. Sequential Method
The finite difference method (the grid method) is the most widely used approach to the numerical solution of differential equations (see, for instance, Pfister (1995)). Following this approach, the domain D is replaced by a discrete (as a rule, uniform) set (grid) of points (nodes). Thus, for instance, the rectangular grid in the domain D can be given as (Figure 12.1)

$$ D_h = \{ (x_i, y_j) : x_i = i h, \; y_j = j h, \; 0 \le i, j \le N+1, \; h = 1/(N+1) \}, $$

where the value N defines the number of grid nodes for each coordinate in the domain D.
Let us denote by u_ij the values of the function u(x, y) at the points (x_i, y_j). Then, using the five-point template (see Figure 12.1), we can present the Poisson equation in the finite difference form:
$$ \frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{ij}}{h^2} = f_{ij}. $$
This equation can be rewritten in the following form:
$$ u_{ij} = 0.25\,( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - h^2 f_{ij} ). $$
This makes it possible to determine the value u_ij from the given values of the function u(x, y) in the neighboring nodes of the grid. This result can be used as a basis for developing various iteration schemes for solving the Dirichlet problem. At the beginning of calculations in these schemes an initial approximation of the values u_ij is formed, and then these values are sequentially recalculated in accordance with the given formula. Thus, for instance, the Gauss-Seidel method uses the following rule for updating the approximations:
$$ u_{ij}^{k} = 0.25\,( u_{i-1,j}^{k} + u_{i+1,j}^{k-1} + u_{i,j-1}^{k} + u_{i,j+1}^{k-1} - h^2 f_{ij} ), $$
according to which the current (k-th) approximation of the value u_ij is calculated from the latest (k-th) approximations of the values u_{i-1,j} and u_{i,j-1} and the previous ((k-1)-th) approximations of the values u_{i+1,j} and u_{i,j+1}. Calculations continue until the variations of the values u_ij, which are evaluated at the end of each iteration, become smaller than a certain given value (the required accuracy). The convergence of the described procedure (i.e. obtaining the solution with any desired accuracy) is the subject of intensive investigation (see, for instance, Pfister (1995)). It should also be noted that the sequence of approximations obtained by this method converges to the Dirichlet problem solution, while the solution error is of order h^2.
Figure 12.1. Rectangular grid in the region D (the dark points represent the internal nodes, the
nodes are numbered from left to right in rows and top to bottom in columns)
The algorithm considered above (the Gauss-Seidel method) can be presented in the following way in pseudocode based on the C++ language:
// Algorithm 12.1
// Sequential Gauss-Seidel method
do {
  dmax = 0; // maximal variation of the values u
  for ( i=1; i<N+1; i++ )
    for ( j=1; j<N+1; j++ ) {
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      dm = fabs(temp-u[i][j]);
      if ( dmax < dm ) dmax = dm;
    }
} while ( dmax > eps );
(recall that the values u_ij with the indexes i, j = 0 and i, j = N+1 are the boundary values, which are defined in the problem statement and do not vary in the course of calculations).
As an example, Figure 12.2 shows the function u(x, y) obtained for the Dirichlet problem with the following boundary conditions:
$$
\begin{cases}
f(x, y) = 0, & (x, y) \in D, \\
g(x, y) = 100 - 200x, & y = 0, \\
g(x, y) = 100 - 200y, & x = 0, \\
g(x, y) = -100 + 200x, & y = 1, \\
g(x, y) = -100 + 200y, & x = 1.
\end{cases}
$$
The total number of the Gauss-Seidel method iterations is 210 for the solution accuracy eps = 0.1 and N = 100 (values generated by a random-number generator in the range [-100, 100] were used as the initial approximation of the values u_ij).
A new approach to parallel program development has recently been worked out. In this approach the instructions of the program developer concerning parallel computations are added to the program with the help of some extralinguistic means of the programming language: for instance, they may be added as directives or comments which are processed by a special preprocessor before the program is compiled. In this case the original program code remains the same and, if there is no preprocessor, the compiler can still use this code to construct the original sequential executable program. The preprocessor, if it is used, transforms the parallelism directives into some additional program code (as a rule, in the form of calls to the procedures of a parallel library).
The approach discussed above forms the basis of the OpenMP technology (see, for instance, Chandra, et al. (2000)). This technology is currently widely used for developing parallel programs for multiprocessor systems with shared memory. The parallel directives are used for indicating parallel regions, in which the executable code can be divided into several separate instruction threads; these threads can then be executed in parallel on different processors of the computer system. As a result of this approach the program is represented as a set of sequential single-thread and parallel multi-thread fragments of code (see Figure 12.3). This principle of implementing parallelism is called fork-join, or pulsating, parallelism. More information on the OpenMP technology can be found in various resources (see, for instance, Chandra, et al. (2000)) or in Internet information resources. In this section we will discuss only those OpenMP capabilities that are necessary to demonstrate possible ways of developing parallel programs for solving the Dirichlet problem.
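As a minimal illustration of the fork-join model (this fragment is not part of the algorithms discussed below and is only a sketch assuming a standard OpenMP installation), the following program creates one parallel region in which every thread of the team executes the same block of code:

#include <omp.h>
#include <cstdio>

int main() {
  printf("sequential part: one thread\n");
  // the parallel directive forks a team of threads (fork); at the end of the
  // region the team is joined back into the single initial thread (join)
  #pragma omp parallel
  {
    int num = omp_get_thread_num();     // index of the current thread
    int total = omp_get_num_threads();  // size of the thread team
    printf("parallel part: thread %d of %d\n", num, total);
  }
  printf("sequential part again: one thread\n");
  return 0;
}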
// Algorithm 12.2
// Parallel Gauss-Seidel method (first variant)
omp_lock_t dmax_lock;
omp_init_lock(&dmax_lock);
do {
  dmax = 0; // maximum variation of the values u
  #pragma omp parallel for shared(u,N,dmax) \
    private(i,temp,d)
  for ( i=1; i<N+1; i++ ) {
    #pragma omp parallel for shared(u,N,dmax) \
      private(j,temp,d)
    for ( j=1; j<N+1; j++ ) {
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-u[i][j]);
      omp_set_lock(&dmax_lock);
      if ( dmax < d ) dmax = d;
      omp_unset_lock(&dmax_lock);
    } // end of inner parallel region
  } // end of outer parallel region
} while ( dmax > eps );
It should be noted that the program is derived from the original sequential code by adding the directives and the calls to the OpenMP library functions (these additional lines are highlighted in the original layout; the backslash at the end of a directive means that the directive is continued on the following line of the program).
As it follows from the program code, the parallel regions in the given example, set by the directive parallel for, are nested and include the for loop operators. The compiler supporting the OpenMP technology divides the execution of the loop iterations among several program threads; the number of threads is usually the same as the number of processors in the computational system. The parameters (clauses) shared and private of the directives define the data accessibility in the program threads: the variables described as shared are common to all the threads, while for the variables described as private a separate copy is created for each thread, and these copies may be used in the threads independently of each other.
The availability of shared data provides the possibility of thread interaction. In this respect the shared variables may be considered as resources shared by the threads, and definite rules of mutual exclusion must therefore be observed in their use (that is, access of a thread for modifying shared data has to block access to these data for the other threads). The shared resource in our example is the value dmax; the access of the threads to this value is regulated by a variable of a special type (a semaphore) dmax_lock and by the functions omp_set_lock (access blocking) and omp_unset_lock (access unblocking). Such a parallel program implementation guarantees the uniqueness of thread access for changing the shared data. The fragments of the program code (the blocks between the calls of the functions omp_set_lock and omp_unset_lock), for which mutual exclusion is provided, are usually called critical sections.
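The same mutual exclusion can also be expressed with the OpenMP critical directive instead of the explicit lock calls; the following fragment is only an illustrative sketch of this alternative notation (d and dmax have the same meaning as in algorithm 12.2) and is not part of the algorithms of this section:

// d is the variation computed by the current thread for its node
#pragma omp critical
{
  // only one thread at a time executes this block (a critical section)
  if ( dmax < d ) dmax = d;
}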
The results of the computational experiments are given in Table 12.1 (hereinafter a four-processor server of the Nizhni Novgorod State University cluster with Pentium III 700 MHz processors and 512 MB RAM was used for the parallel programs developed with the OpenMP technology).
Table 12.1. The results of computational experiments for the Gauss-Seidel parallel algorithm (p=4)

| Grid size | alg. 12.1 k | alg. 12.1 T | alg. 12.2 k | alg. 12.2 T | alg. 12.2 S | alg. 12.3 k | alg. 12.3 T | alg. 12.3 S |
|---|---|---|---|---|---|---|---|---|
| 100 | 210 | 0.06 | 210 | 1.97 | 0.03 | 210 | 0.03 | 2.03 |
| 200 | 273 | 0.34 | 273 | 11.22 | 0.03 | 273 | 0.14 | 2.43 |
| 300 | 305 | 0.88 | 305 | 29.09 | 0.03 | 305 | 0.36 | 2.43 |
| 400 | 318 | 3.78 | 318 | 54.20 | 0.07 | 318 | 0.64 | 5.90 |
| 500 | 343 | 6.00 | 343 | 85.84 | 0.07 | 343 | 1.06 | 5.64 |
| 600 | 336 | 8.81 | 336 | 126.38 | 0.07 | 336 | 1.50 | 5.88 |
| 700 | 344 | 12.11 | 344 | 178.30 | 0.07 | 344 | 2.42 | 5.00 |
| 800 | 343 | 16.41 | 343 | 234.70 | 0.07 | 343 | 8.08 | 2.03 |
| 900 | 358 | 20.61 | 358 | 295.03 | 0.07 | 358 | 11.03 | 1.87 |
| 1000 | 351 | 25.59 | 351 | 366.16 | 0.07 | 351 | 13.69 | 1.87 |
| 2000 | 367 | 106.75 | 367 | 1585.84 | 0.07 | 367 | 56.63 | 1.89 |
| 3000 | 370 | 243.00 | 370 | 3598.53 | 0.07 | 370 | 128.66 | 1.89 |

(algorithm 12.1 is the sequential Gauss-Seidel method; k – number of iterations, T – time in sec., S – speed-up)
Let us discuss the obtained results. The developed parallel algorithm is correct, i.e. it provides the solution of the problem just as the sequential algorithm does. The approach used guarantees achieving practically the maximum possible parallelism: up to N^2 processors can be used for program execution. Nevertheless, the result cannot be called satisfactory, since the program will work more slowly than the original sequential implementation. The main reason for this is the excessively intense synchronization of the parallel program threads: in our example each parallel thread, after averaging the value u_ij, must check (and possibly change) the value dmax. Permission to use this variable can be obtained by one thread only; the other threads must be blocked.
[Figure 12.4: execution timeline for processors (threads) 1-8, showing the periods of parallel execution, waiting, and sequential execution]
After the shared variable is released, the next thread may get control, and so on. As the access synchronization is necessary, the multithreaded parallel program practically turns into sequentially executed code, which is, moreover, less efficient than the original sequential variant, since the synchronization leads to additional computational costs (see Figure 12.4). It should be noted that, despite the ideal distribution of the computational load among the processors, for the combination of parallel and sequential computations shown in Figure 12.4 no more than two processors can be doing useful work at any given moment of time. This effect of parallelism degradation caused by intense synchronization of parallel threads is usually called serialization.
As the above discussion shows, the way to achieve efficient parallel computations is to decrease the necessary synchronization of the parallel program fragments. Thus, in our case we may parallelize only the single outer for loop. Besides, to decrease possible blocking, we may use a multilevel scheme for estimating the maximum error: let each parallel thread first form its private estimate dm only for its own processed data (one or several grid rows); then, as the calculations are completed, each thread compares its estimate dm with the shared general error estimate dmax.
The new variant of the Gauss-Seidel parallel algorithm looks as follows:
// Algorithm 12.3
// Parallel Gauss-Seidel method (second variant)
omp_lock_t dmax_lock;
omp_init_lock(&dmax_lock);
do {
  dmax = 0; // maximum variation of the values u
  #pragma omp parallel for shared(u,N,dmax) \
    private(i,j,temp,d,dm)
  for ( i=1; i<N+1; i++ ) {
    dm = 0;
    for ( j=1; j<N+1; j++ ) {
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-u[i][j]);
      if ( dm < d ) dm = d;
    }
    omp_set_lock(&dmax_lock);
    if ( dmax < dm ) dmax = dm;
    omp_unset_lock(&dmax_lock);
  } // end of parallel region
} while ( dmax > eps );
As a result of these modifications of the computational scheme, the number of accesses to the shared variable dmax decreases from N^2 to N per iteration, which should lead to a considerable decrease in the thread synchronization costs and weaken the computation serialization effect. The results of the experiments with this variant of the parallel algorithm, given in Table 12.1, show a remarkable change of the situation: the speedup in a number of experiments appears to be even greater than the number of processors used (this superlinear speedup effect is achieved due to the fact that each processor of the computational server has its own fast cache memory). It should also be noted that the improvement of the parallel algorithm characteristics is achieved by decreasing the maximum possible parallelism (no more than N processors may be used for executing the program).
In our case, depending on the execution conditions, different values of the neighboring vertical nodes (from the previous or from the current iteration) may be used for calculating the new values u_ij. Thus, the number of iterations performed before the accuracy requirements are met and, what is more important, the obtained problem solution may differ between repeated executions of the program. The obtained estimates of the values u_ij will correspond to the precise problem solution within the range of the required accuracy, but nevertheless they may be different. Computational schemes of grid algorithms based on such calculations are known as the chaotic relaxation method.
In the general case the indeterminacy of the results is not desirable, as it violates one of the main foundations of complicated software development, namely the repeatability of the results of program executions with the same initial data. As a result, seeking calculation determinacy, we can state an important principle of parallel programming: the time dynamics of the parallel thread execution should not influence the results of the parallel program.
Figure 12.6. Deadlock situation at accessing rows of the grid (the thread 1 possesses row 1 and
requests row 2, the thread 2 possesses row 2 and requests row 1)
12.2.5. Indeterminacy Elimination
The approach discussed in 12.2.4 decreases the effect of thread races, but it does not guarantee the uniqueness of the solution at repeated calculations. To achieve determinacy it is necessary to use additional computational schemes.
A possible method, widely used in practice, is to use separate memory for storing the computation results of the previous and the current iterations of the grid method. The scheme of this approach may be represented as follows:
// Algorithm 12.4
// Parallel Gauss-Jacobi method
omp_lock_t dmax_lock;
omp_init_lock(&dmax_lock);
do {
  dmax = 0; // maximum change of the values u
  #pragma omp parallel for shared(u,un,N,dmax) \
    private(i,j,temp,d,dm)
  for ( i=1; i<N+1; i++ ) {
    dm = 0;
    for ( j=1; j<N+1; j++ ) {
      temp = u[i][j];
      un[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-un[i][j]);
      if ( dm < d ) dm = d;
    }
    omp_set_lock(&dmax_lock);
    if ( dmax < dm ) dmax = dm;
    omp_unset_lock(&dmax_lock);
  } // end of parallel region
  for ( i=1; i<N+1; i++ ) // data updating
    for ( j=1; j<N+1; j++ )
      u[i][j] = un[i][j];
} while ( dmax > eps );
As it follows from the given algorithm, the results of the previous iteration are stored in the array u, and the newly calculated values are stored in the additional array un. As a result, the values u_ij from the previous iteration are always used for the calculations regardless of the order in which the calculations are executed. This scheme of implementing the algorithm is usually called the Gauss-Jacobi method. It guarantees the determinacy of the results regardless of the parallelization method, but it should be noted that this method requires a large amount of additional memory and possesses a lower (in comparison to the Gauss-Seidel algorithm) rate of convergence. The results of the calculations with the sequential and parallel variants of the method are given in Table 12.2.
Table 12.2. Results of computational experiments for the Gauss-Jacobi algorithm (p=4)

| Mesh size | alg. 12.4 k | alg. 12.4 T | parallel k | parallel T | parallel S |
|---|---|---|---|---|---|
| 100 | 5257 | 1.39 | 5257 | 0.73 | 1.90 |
| 200 | 23067 | 23.84 | 23067 | 11.00 | 2.17 |
| 300 | 26961 | 226.23 | 26961 | 29.00 | 7.80 |
| 400 | 34377 | 562.94 | 34377 | 66.25 | 8.50 |
| 500 | 56941 | 1330.39 | 56941 | 191.95 | 6.93 |
| 600 | 114342 | 3815.36 | 114342 | 2247.95 | 1.70 |
| 700 | 64433 | 2927.88 | 64433 | 1699.19 | 1.72 |
| 800 | 87099 | 5467.64 | 87099 | 2751.73 | 1.99 |
| 900 | 286188 | 22759.36 | 286188 | 11776.09 | 1.93 |
| 1000 | 152657 | 14258.38 | 152657 | 7397.60 | 1.93 |
| 2000 | 337809 | 134140.64 | 337809 | 70312.45 | 1.91 |
| 3000 | 655210 | 247726.69 | 655210 | 129752.13 | 1.91 |

(algorithm 12.4 is the sequential Gauss-Jacobi method; the parallel method is developed by analogy with algorithm 12.3; k – the number of iterations, T – time in sec., S – speed up)
Another possible approach to eliminating the mutual dependences of the parallel threads is to apply the red/black row alternation scheme. In this scheme each iteration of the grid method is subdivided into two sequential stages: at the first stage only the rows with even numbers are processed, and at the second stage the rows with odd numbers (see Figure 12.7; a code sketch of the row alternation variant is given after the figure). The scheme may be generalized to the simultaneous alternation of rows and columns (the checkerboard decomposition scheme) of the calculation domain. The discussed scheme of row alternation does not require any additional memory in comparison to the Jacobi method and guarantees the determinacy of the solution at multiple program executions. But it should be noted that both approaches considered in this subsection can produce results that do not coincide with the Dirichlet problem solution obtained by means of the sequential algorithm. Besides, these computational schemes have a smaller convergence domain and a worse convergence rate than the original variant of the Gauss-Seidel method.
[Figure 12.7: the red/black row alternation scheme; the left panel shows the values after stage 1 of the current iteration, the right panel the values after stage 2 of the current iteration]
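A minimal sketch of one iteration of the red/black row alternation scheme is shown below; it follows the style of the previous algorithms and assumes the same variables u, f, h, N, dmax, eps and the semaphore dmax_lock as in algorithm 12.3 (the splitting of the row range by parity is the only new element):

// one iteration of the red/black row alternation scheme (illustrative sketch)
dmax = 0;
for ( stage=0; stage<2; stage++ ) {  // stage 0: even rows, stage 1: odd rows
  #pragma omp parallel for shared(u,N,stage,dmax) private(i,j,temp,d,dm)
  for ( i=2-stage; i<N+1; i+=2 ) {   // rows of one parity are mutually independent
    dm = 0;
    for ( j=1; j<N+1; j++ ) {
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-u[i][j]);
      if ( dm < d ) dm = d;
    }
    omp_set_lock(&dmax_lock);        // the same error reduction as in algorithm 12.3
    if ( dmax < dm ) dmax = dm;
    omp_unset_lock(&dmax_lock);
  }
}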
such calculations were called wavefront or hyperplane methods. It should be noted that the wave size (the degree of possible parallelism) changes dynamically in the course of calculations: the wave reaches its peak and then decays while approaching the right lower grid node.
Table 12.3. The results of the experiments for the parallel Gauss-Seidel algorithm with the wavefront computational scheme (p=4)

| Grid size | alg. 12.1 k | alg. 12.1 T | alg. 12.5 k | alg. 12.5 T | alg. 12.5 S | alg. 12.6 k | alg. 12.6 T | alg. 12.6 S |
|---|---|---|---|---|---|---|---|---|
| 100 | 210 | 0.06 | 210 | 0.30 | 0.21 | 210 | 0.16 | 0.40 |
| 200 | 273 | 0.34 | 273 | 0.86 | 0.40 | 273 | 0.59 | 0.58 |
| 300 | 305 | 0.88 | 305 | 1.63 | 0.54 | 305 | 1.53 | 0.57 |
| 400 | 318 | 3.78 | 318 | 2.50 | 1.51 | 318 | 2.36 | 1.60 |
| 500 | 343 | 6.00 | 343 | 3.53 | 1.70 | 343 | 4.03 | 1.49 |
| 600 | 336 | 8.81 | 336 | 5.20 | 1.69 | 336 | 5.34 | 1.65 |
| 700 | 344 | 12.11 | 344 | 8.13 | 1.49 | 344 | 10.00 | 1.21 |
| 800 | 343 | 16.41 | 343 | 12.08 | 1.36 | 343 | 12.64 | 1.30 |
| 900 | 358 | 20.61 | 358 | 14.98 | 1.38 | 358 | 15.59 | 1.32 |
| 1000 | 351 | 25.59 | 351 | 18.27 | 1.40 | 351 | 19.30 | 1.33 |
| 2000 | 367 | 106.75 | 367 | 69.08 | 1.55 | 367 | 65.72 | 1.62 |
| 3000 | 370 | 243.00 | 370 | 149.36 | 1.63 | 370 | 140.89 | 1.72 |

(algorithm 12.1 is the sequential Gauss-Seidel method; k – the number of iterations, T – time in sec., S – speed up)
The possible scheme of the parallel method, which is based on the wavefront computational scheme, can be
given in the following way:
// Algorithm 12.5
// Parallel Gauss-Seidel method (wavefront computational scheme)
omp_lock_t dmax_lock;
omp_init_lock(&dmax_lock);
do {
  dmax = 0; // maximum change of the values u
  // growing wave (nx – the wave size)
  for ( nx=1; nx<N+1; nx++ ) {
    dm[nx] = 0;
    #pragma omp parallel for shared(u,nx,dm) \
      private(i,j,temp,d)
    for ( i=1; i<nx+1; i++ ) {
      j = nx + 1 - i;
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-u[i][j]);
      if ( dm[i] < d ) dm[i] = d;
    } // end of parallel region
  }
  // decaying wave
  for ( nx=N-1; nx>0; nx-- ) {
    #pragma omp parallel for shared(u,nx,dm) \
      private(i,j,temp,d)
    for ( i=N-nx+1; i<N+1; i++ ) {
      j = 2*N - nx - i + 1;
      temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      d = fabs(temp-u[i][j]);
      if ( dm[i] < d ) dm[i] = d;
    } // end of parallel region
  }
  #pragma omp parallel for shared(N,dm,dmax) \
    private(i)
  for ( i=1; i<N+1; i++ ) {
    omp_set_lock(&dmax_lock);
    if ( dmax < dm[i] ) dmax = dm[i];
    omp_unset_lock(&dmax_lock);
  } // end of parallel region
} while ( dmax > eps );
Algorithm 12.5. The Gauss-Seidel parallel algorithm with the wavefront computational scheme
As can be noted, the estimate of the solution error may be calculated for each separate row (the array dm). This array is shared by all the executed threads; however, no synchronization for accessing its elements is required, as the threads always use different array elements (the calculation wave front contains only one node of each grid row).
After all the wave elements have been processed, the maximum error of the computational iteration is found among the elements of the array dm. However, this final part of the calculations may appear to be the least efficient due to the high additional synchronization cost. The situation may be improved by increasing the size of the sequential fragments and thus decreasing the necessary interactions of the parallel threads.
A possible variant of implementing this approach may be the following:
chunk = 200; // sequential part size
#pragma omp parallel for shared(N,dm,dmax) \
  private(i,j,d)
for ( i=1; i<N+1; i+=chunk ) {
  d = 0;
  for ( j=i; j<i+chunk && j<N+1; j++ )
    if ( d < dm[j] ) d = dm[j];
  omp_set_lock(&dmax_lock);
  if ( dmax < d ) dmax = d;
  omp_unset_lock(&dmax_lock);
} // end of parallel region
This technique of enlarging sequential computation fragments for decreasing the synchronization overhead
is called chunking. The experimental results for this parallel computation variant are given in Table 12.3.
One more aspect of analyzing the efficiency of the developed parallel algorithm is worth mentioning. The calculation wave front does not agree well with the rules of efficient cache use, i.e. of the fast memory used to store copies of the most frequently used main memory areas. Proper use of the cache may increase the computation speed considerably, by tens of times. Data allocation in the cache may occur either in advance (when some algorithm is used for predicting data needs) or at the moment when the data are read from the main memory. It should be noted that caching is performed not by single elements but by small sequential memory blocks (cache lines); the typical cache line sizes are 32, 64, 128 or 256 bytes. As a result, the effect of cache presence will be observed if the performed calculations use the same data repeatedly (data processing locality) and access memory elements with sequentially increasing addresses (sequential access).
In the considered algorithm data are allocated rowwise, but the calculation wave front is located along a grid diagonal, which leads to low efficiency of cache use. A possible way to remedy the situation is to enlarge the computational operations: following this intention, the procedure of processing some rectangular calculation subdomain (block) can be considered as the operation that is distributed among the processors – see Figure 12.9.
[Figure 12.9: block representation of the calculation domain (border values marked)]
In the most general form the computational method developed on the basis of this approach may be described as follows (the calculation domain is represented as a block grid of size NB×NB):
// Algorithm 12.6
// Parallel Gauss-Seidel method (block wavefront computational scheme)
do { // NB is the size of a row/column of blocks
  // growing wave (the wave size is nx+1)
  for ( nx=0; nx<NB; nx++ ) {
    #pragma omp parallel for shared(nx) private(i,j)
    for ( i=0; i<nx+1; i++ ) {
      j = nx - i;
      // <processing the block with coordinates (i,j)>
    } // end of parallel region
  }
  // decaying wave
  for ( nx=NB-2; nx>-1; nx-- ) {
    #pragma omp parallel for shared(nx) private(i,j)
    for ( i=NB-nx-1; i<NB; i++ ) {
      j = 2*(NB-1) - nx - i;
      // <processing the block with coordinates (i,j)>
    } // end of parallel region
  }
  // <calculation error estimation>
} while ( dmax > eps );
The calculations in the suggested algorithm are performed in accordance with the wavefront data processing scheme: first the calculations are performed only in the upper left block with the coordinates (0,0); then the blocks with the coordinates (0,1) and (1,0) become available for processing, and so on. The experimental results for this parallel method are given in Table 12.3.
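The step "<processing the block with coordinates (i,j)>" is not detailed above; a possible sketch of such a routine is the ordinary Gauss-Seidel sweep restricted to one block (the name ProcessBlock, the block size BS in nodes and the array dm of error estimates per block row are assumptions introduced here only for illustration):

// a hypothetical block-processing routine: the Gauss-Seidel sweep over the
// block with the block coordinates (iBlock, jBlock); BS is the block size in nodes
void ProcessBlock ( double **u, double **f, double h, int BS,
                    int iBlock, int jBlock, double *dm ) {
  int i0 = 1 + iBlock*BS, j0 = 1 + jBlock*BS;  // the first grid node of the block
  double dmax_block = 0;
  for ( int i = i0; i < i0 + BS; i++ )
    for ( int j = j0; j < j0 + BS; j++ ) {
      double temp = u[i][j];
      u[i][j] = 0.25*(u[i-1][j]+u[i+1][j]+
        u[i][j-1]+u[i][j+1]-h*h*f[i][j]);
      double d = fabs(temp-u[i][j]);
      if ( dmax_block < d ) dmax_block = d;
    }
  if ( dm[iBlock] < dmax_block ) dm[iBlock] = dmax_block;  // error estimate of the block row
}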
The block-oriented approach to the wavefront data processing method changes the situation considerably: the nodes to be processed can be arranged rowwise, access to the data is performed sequentially along the memory addresses, and the values transferred to the cache are used repeatedly. Besides, as the blocks processed on different processors are mutually disjoint in data, there will be no additional costs for maintaining the cache coherence of the different processors.
The best cache utilization will be achieved if the cache has enough space for holding at least three rows of a block, since exactly the data of three block rows are used simultaneously when one block row is processed. Thus, for a given cache size it is possible to define the recommended maximum block size: for instance, if the cache is 8 Kbytes and the size of a data value is 8 bytes, this size is approximately 300 (8192/3/8 ≈ 341). It is also possible to define the minimum allowable block size from the condition that a cache line and a block row have the same size: if the cache line size is 256 bytes and the size of a data value is 8 bytes, the block size must be divisible by 32.
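These estimates can be expressed as a small helper program; the variable names and the rounding below are illustrative assumptions and not part of the original text:

#include <cstdio>

int main() {
  const int cacheSize = 8 * 1024;  // cache size in bytes (the example from the text)
  const int lineSize  = 256;       // cache line size in bytes
  const int elemSize  = 8;         // size of one grid value in bytes

  // three block rows must fit into the cache simultaneously
  int maxBlockSize = cacheSize / (3 * elemSize);   // = 341, i.e. "approximately 300"
  // a block row should occupy a whole number of cache lines
  int minBlockStep = lineSize / elemSize;          // = 32

  printf("recommended maximum block size: %d\n", maxBlockSize);
  printf("the block size should be divisible by: %d\n", minBlockStep);
  return 0;
}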
The last remark concerns the transmission of the border block nodes between the processors. Taking the border transmission into consideration, it is appropriate to process the neighboring blocks on the same processor. Otherwise it is possible to choose the block sizes in such a way that the amount of the border data transmitted between the processors is minimal. Thus, if the cache line size is 256 bytes and the size of a data value is 8 bytes (i.e. 32 values per cache line), then for a 64×64 block the transmitted borders occupy 132 cache lines (each border row of 64 values takes 2 lines, while each border column takes 64 lines, one line per value), whereas for a 128×32 block the amount is only 72 cache lines. Such optimization is of principal importance for the slow operations of transmitting data between the processor caches, particularly in the case of nonuniform memory access (NUMA) systems.
PutBlock(pBlock->pDown);
omp_unset_lock(&(pBlock->pDown->Lock));
} // end of computations, as the queue is empty
Algorithm 12.7. Block-oriented wavefront computational scheme with the use of job queue
To describe the grid node blocks, a block descriptor with the following fields is used in the algorithm:
- Lock – the semaphore that synchronizes access to the block descriptor,
- pNext – the pointer to the right neighboring block,
- pDown – the pointer to the lower neighboring block,
- Count – the counter of the block readiness for computations (the number of available block borders).
The operations of selecting a block from the queue and inserting a block into the queue are provided by the functions GetBlock and PutBlock correspondingly.
As it follows from the scheme, a processor gets a block for processing from the queue, performs all the necessary calculations for the block and marks the availability of its borders for its right and lower neighboring blocks. If it appears that a neighboring block has both its borders available, the processor transfers this block into the job queue (a sketch of the corresponding data structures is given below).
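A possible form of the block descriptor and of the step that makes a neighboring block available is sketched below; the field types and the helper MarkNeighbor are assumptions based only on the description above:

#include <omp.h>

// the block descriptor described above (illustrative sketch)
struct BlockDescriptor {
  omp_lock_t       Lock;    // synchronizes access to the block descriptor
  BlockDescriptor *pNext;   // the right neighboring block
  BlockDescriptor *pDown;   // the lower neighboring block
  int              Count;   // number of borders already available for the block
};

void PutBlock ( BlockDescriptor *pBlock );  // the queue insertion operation from the text

// after processing a block, mark the availability of one border of a neighboring block;
// when both borders (the upper and the left ones) are available, the neighbor
// becomes ready for processing and is inserted into the job queue
void MarkNeighbor ( BlockDescriptor *pNeighbor ) {
  omp_set_lock(&pNeighbor->Lock);
  pNeighbor->Count++;
  if ( pNeighbor->Count == 2 )   // both borders are available
    PutBlock(pNeighbor);
  omp_unset_lock(&pNeighbor->Lock);
}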
The use of the job queue makes it possible to solve practically all the remaining problems of developing parallel computations for shared memory multiprocessor systems. Further development of this approach may provide for refining the rule of selecting jobs from the queue in accordance with the processor states (it is appropriate to process the neighboring blocks on the same processors); in some cases it is also useful to create several job queues, etc. Additional information on these problems may be found, for instance, in Xu and Hwang (1998), Buyya (1999).
Figure 12.10. Grid block-striped distribution among processors (border nodes are marked by
circles)
For the Dirichlet problem discussed here there are two basic data distribution methods: the one-dimensional or block-striped scheme (see Figure 12.10) and the two-dimensional or checkerboard decomposition of the computational grid (see also Section 7). Our discussion will be based on the first method; the checkerboard scheme will be briefly discussed further on.
In the case of block-striped decomposition the computational grid domain is divided into horizontal or vertical stripes (without loss of generality we will further consider only horizontal stripes). The number of stripes can be defined to be equal to the number of processors, and the stripe sizes are usually chosen equal. The horizontal boundary nodes (the first and the last rows) are included into the first and the last stripes correspondingly. The stripes are then distributed among the processors for processing.
The essential aspect of parallel computations with block-striped data distribution is that the border rows of the previous and the next computational grid stripes should be duplicated on the processor which processes a given stripe of grid nodes. These border rows can be considered as the result of stripe expansion (the extended stripes are shown in Figure 12.10 by dotted frames). The duplicated border rows are used only for calculations; the recalculation of these rows is performed in the stripes of their initial location. Hence, border row duplication has to be executed before the beginning of each iteration.
Figure 12.11. Border row transferring among the neighboring processors
The procedure of border row exchange between the neighboring processors may be divided into two sequential stages. During the first stage each processor transmits its lower border row to the following processor and receives a similar row from the previous processor (see Figure 12.11). The second transmission stage is performed in the reverse order: the processors transmit their upper border rows to the previous neighbors and receive similar rows from the following neighboring processors.
In a general way, such data transmission operations may be represented as follows (for simplicity, let us discuss only the first stage of the exchange procedure):
// transmission of the lowest border row to the following processor
// and receiving the border row from the previous processor
if ( ProcNum != NP-1 )Send(u[M][*],N+2,NextProc);
if ( ProcNum != 0 )Receive(u[0][*],N+2,PrevProc);
(a format close to MPI (see Section 4 and, for instance, Gropp, et al. (1994)) is used for the communication functions Send and Receive; the first and the second parameters of the functions represent the transmitted data and their amount, and the third parameter defines the addressee (for the function Send) or the source (for the function Receive)).
Two different mechanisms may be employed for the data transmission. With the first one, the execution of the program that has initiated a transmission is suspended until the completion of all the data transmission operations (i.e. until the moment when the receiving processor gets all the transmitted data); communication operations implemented in this way are usually called synchronous or blocking. With the other approach (asynchronous or nonblocking communications), the functions of this type only initiate the transmission process and then immediately return, so the program can continue the computational operations without waiting for the completion of the long-lasting communication operations; when necessary, the program may check the availability of the transmitted data. These two alternatives are widely used in parallel computations and have their advantages and disadvantages: the synchronous communication procedures are, as a rule, simpler to use and more reliable, while the nonblocking operations allow combining data transmission and computations but usually increase the programming complexity. Taking the above discussion into account, we will use the blocking communication operations in all the following examples.
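For illustration only (the examples in the text keep the simplified Send/Receive notation), a nonblocking exchange of the lower border row might look as follows in actual MPI; the buffer names follow the fragment above, while the helper name and the tag value are assumptions:

#include <mpi.h>

// nonblocking variant of the first exchange stage (illustrative sketch);
// u[M] is the lower border row, u[0] receives the row of the previous processor
void ExchangeLowerBorder ( double **u, int M, int N,
                           int ProcNum, int NP, int NextProc, int PrevProc ) {
  MPI_Request req[2];
  int count = 0;
  if ( ProcNum != NP-1 )  // initiate the send and return immediately
    MPI_Isend(u[M], N+2, MPI_DOUBLE, NextProc, 0, MPI_COMM_WORLD, &req[count++]);
  if ( ProcNum != 0 )     // initiate the receive of the row from the previous processor
    MPI_Irecv(u[0], N+2, MPI_DOUBLE, PrevProc, 0, MPI_COMM_WORLD, &req[count++]);
  // computations that do not depend on the border rows could be placed here
  MPI_Waitall(count, req, MPI_STATUSES_IGNORE);  // wait for both operations to complete
}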
The above sequence of blocking communication operations (first the function Send, then the function Receive) leads to a strictly sequential scheme of row transmissions, as all the processors call the operation Send simultaneously and pass into the waiting mode. The processor NP-1 will be the first one ready to receive the transmitted data; as a result, the processor NP-2 will perform the operation of sending its border row and pass over to receiving a row from the processor NP-3, and so on. The total number of such steps is equal to NP-1. The second stage of the procedure (transmitting the upper border rows) is executed analogously (see Figure 12.11).
The sequential execution of the described data transmission operations is caused by the selected implementation method. To improve the situation, a send-receive alternation scheme can be applied to the processors with even and odd numbers:
// sending the lowest border row to the following processor and
// receiving the transmitted row from the previous processor
if ( ProcNum % 2 == 1 ) { // odd processor
if ( ProcNum != NP-1 )Send(u[M][*],N+2,NextProc);
if ( ProcNum != 0 )Receive(u[0][*],N+2,PrevProc);
}
else { // even processor
if ( ProcNum != 0 )Receive(u[0][*],N+2,PrevProc);
if ( ProcNum != NP-1 )Send(u[M][*],N+2,NextProc);
}
This technique makes it possible to perform all the necessary data transmission operations in two sequential steps. During the first step all odd processors send data and the even processors receive them; during the second step the processors change their roles: the even processors perform the function Send, and the odd processors perform the function Receive.
The described sequence of send-receive operations for the interaction of neighboring processors is widely used in the practice of parallel computations. As a result, the procedures supporting such operations are available in many basic parallel programming libraries; thus, the MPI standard (see Section 4 and, for instance, Gropp, et al. (1994)) provides the function Sendrecv, which allows writing the above part of the program code more concisely:
// sending the lowest border row to the following processor and
// receiving the transmitted row from the previous processor
Sendrecv(u[M][*],N+2,NextProc,u[0][*],N+2,PrevProc);
The integrated function Sendrecv is usually used to provide both correct operation on the boundary processors (where there is no need to perform all the send-receive operations) and the execution of the transmission alternation scheme on the processors for escaping deadlock situations; the implementation of the function also provides the possibility of executing all the necessary data communications in parallel.
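In actual MPI the call has a few more parameters than the simplified notation above; a possible form of the same exchange (the buffer names are those of the previous fragments, while the tag values and the communicator are chosen here only for illustration) is:

// border row exchange with the real MPI_Sendrecv call (illustrative sketch);
// on the boundary processors NextProc or PrevProc can be set to MPI_PROC_NULL,
// which turns the corresponding part of the exchange into an empty operation
MPI_Sendrecv(u[M], N+2, MPI_DOUBLE, NextProc, 0,  // the row sent to the following processor
             u[0], N+2, MPI_DOUBLE, PrevProc, 0,  // the row received from the previous one
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);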
Table 12.4. Experimental results for the Gauss-Seidel parallel method (distributed memory, block-striped distribution, p=4)

| Grid size | alg. 12.1 k | alg. 12.1 T | alg. 12.8 k | alg. 12.8 T | alg. 12.8 S | wavefront k | wavefront T | wavefront S |
|---|---|---|---|---|---|---|---|---|
| 100 | 210 | 0.06 | 210 | 0.54 | 0.11 | 210 | 1.27 | 0.05 |
| 200 | 273 | 0.35 | 273 | 0.86 | 0.41 | 273 | 1.37 | 0.26 |
| 300 | 305 | 0.92 | 305 | 0.92 | 1.00 | 305 | 1.83 | 0.50 |
| 400 | 318 | 1.69 | 318 | 1.27 | 1.33 | 318 | 2.53 | 0.67 |
| 500 | 343 | 2.88 | 343 | 1.72 | 1.68 | 343 | 3.26 | 0.88 |
| 600 | 336 | 4.04 | 336 | 2.16 | 1.87 | 336 | 3.66 | 1.10 |
| 700 | 344 | 5.68 | 344 | 2.52 | 2.25 | 344 | 4.64 | 1.22 |
| 800 | 343 | 7.37 | 343 | 3.32 | 2.22 | 343 | 5.65 | 1.30 |
| 900 | 358 | 9.94 | 358 | 4.12 | 2.41 | 358 | 7.53 | 1.32 |
| 1000 | 351 | 11.87 | 351 | 4.43 | 2.68 | 351 | 8.10 | 1.46 |
| 2000 | 367 | 50.19 | 367 | 15.13 | 3.32 | 367 | 27.00 | 1.86 |
| 3000 | 364 | 113.17 | 364 | 37.96 | 2.98 | 364 | 55.76 | 2.03 |

(algorithm 12.1 is the sequential Gauss-Seidel method; the last three columns refer to the parallel algorithm based on the wavefront calculation scheme (see 12.3.4); k – the number of iterations, T – time in sec., S – speed up)
Finally, let us analyze the possibilities of parallel computations that would provide the same Dirichlet problem solution as the initial sequential Gauss-Seidel method. As has been mentioned previously, this result may be obtained by the wavefront calculation scheme. To form a calculation wave, let us logically represent each processor stripe as a set of blocks (the block size may be assumed to be equal to the stripe width) and process the stripes blockwise in sequence (see Figure 12.12). To follow the sequential algorithm, the calculations may be started only for the first block of the first processor stripe; after this block has been processed, two more blocks become available for calculations: block 2 of the first stripe and block 1 of the second stripe (to process a block of the second stripe it is necessary to transmit the border block row of the first stripe). After these blocks are processed, three more blocks become available for computations, and so we observe the familiar process of wavefront data processing (see the experimental results in Table 12.4).
Figure 12.12. Wavefront computational scheme in case of block-striped data distribution among the processors
An interesting aspect of developing such a parallel computation scheme may be the attempt to combine the border row transmission operations with the operations on data block processing.
// data transmission to the right processor
Send(u[*][M+1],M+2,RightProc); // right column
Send(dmax,1, RightProc); // calculation error
}
// synchronization and error dmax distribution
Barrier();
Broadcast(dmax,NP-1);
} while ( dmax > eps ); // eps –accuracy
(the function Barrier is the collective synchronization operation: each concurrently executed part of the program is suspended at the moment of calling this function, and the calculations can be resumed only after all the parts of the parallel program have called the function).
It should be noted that at the initial moment all the processors (except processor 0) must be in the state of waiting for their border nodes (the upper row and the left column). The calculations must be started by the processor holding the left upper block; after the block has been processed, the recalculated data of the right column and the lower row of the block should be transmitted to the right and the lower grid processors correspondingly. These actions unblock the processors of the second processor diagonal (see the left part of Figure 12.13), and so on.
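The order of operations described above can be summarized for one processor of the checkerboard distribution by the following sketch; it is not the complete algorithm of the text (whose beginning is not reproduced here), the Send, Receive, Barrier and Broadcast notation is the simplified one used above, and ProcessOwnBlock as well as the ProcRow and ProcCol block coordinates are assumptions introduced only for illustration:

// one iteration on the processor holding the block (ProcRow, ProcCol): a sketch
do {
  // wait for the border data produced by the upper and the left neighbors
  if ( ProcRow != 0 ) Receive(u[0][*],   M+2, UpperProc);  // upper border row
  if ( ProcCol != 0 ) Receive(u[*][0],   M+2, LeftProc);   // left border column
  dmax = ProcessOwnBlock();  // the Gauss-Seidel sweep over the own block, returns the local error
  // unblock the lower and the right neighbors (the accumulation of dmax along
  // the processors, shown in the fragment above, is omitted here for brevity)
  if ( ProcRow != NB-1 ) Send(u[M][*],   M+2, LowerProc);  // lower border row
  if ( ProcCol != NB-1 ) Send(u[*][M+1], M+2, RightProc);  // right border column
  // synchronization and distribution of the error dmax
  Barrier();
  Broadcast(dmax, NP-1);
} while ( dmax > eps );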
The analysis of the wavefront computation efficiency on distributed memory multicomputer systems (see Table 12.5) demonstrates a considerable decrease of processor utilization, because the processors are involved in data processing only while their blocks belong to the calculation wave front. In the case of distributed memory, load balancing (data redistribution) is very complicated, as it requires transmitting large amounts of data between the processors. A possible way to improve the situation is to allow processing a multiple calculation wave: according to this approach, after processing the current calculation wave the processors may start processing the wave of the following iteration. Thus, for instance, processor 0 (see Figure 12.13), after processing its block and transmitting the border data that initiate the computations on processors 1 and 4, may start calculating the next iteration of the Gauss-Seidel method. After the blocks of the first wave (processors 1 and 4) and of the second wave (processor 0) have been processed, the following processor groups appear to be ready for computations: for the first wave the processors 2, 5, 8, and for the second wave the processors 1 and 4; correspondingly, at this moment processor 0 will again be ready to start processing the next data wave. After NB such steps, NB iterations of the Gauss-Seidel method will be processed simultaneously, and all the processors will be involved in the calculations. This computational scheme makes it possible to consider the processor grid as a computational pipeline of the Gauss-Seidel method iterations. As previously, the calculations are terminated when the maximum computation error meets the accuracy requirement (estimating the maximum computation error should begin only when the pipeline is completely loaded, i.e. after the first NB iterations have been started). It should be noted that the Dirichlet problem solution obtained after the accuracy conditions are met will contain the grid node values from different method iterations and will not coincide with the solution obtained by means of the original sequential algorithm.
Figure 12.13. Wavefront computational scheme in case of the checkerboard block distribution
Figure 12.14. Data communications for the Gauss-Seidel method in case of distributed memory
− the accumulation of the calculation error, finding the maximum calculation error and its distribution among the processors correspond to the basic communication operation of data transmission from all the processors to all the processors (the function MPI_Allreduce; see the sketch after this list);
− collecting the problem solution (all the stripes or grid blocks) on one processor corresponds to the basic operation of data transmission from all the processors to one processor (the function MPI_Gather).
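The two collective operations named above might look as follows in actual MPI; dm denotes the local error of a processor, uStripe its stripe of M rows of N+2 values, and uAll a buffer allocated on the root processor only (these names are assumptions introduced for this sketch):

// global maximum of the local errors, the result is available on every processor
MPI_Allreduce(&dm, &dmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

// collecting the whole solution on processor 0, stripe by stripe
MPI_Gather(uStripe, M*(N+2), MPI_DOUBLE,  // the data contributed by each processor
           uAll,    M*(N+2), MPI_DOUBLE,  // the receive count refers to one contribution
           0, MPI_COMM_WORLD);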
12.4. Summary
This section describes parallel methods for solving partial differential equations, with the Dirichlet problem for the Poisson equation taken as the example under consideration. The method of finite differences is one of the most widely used approaches to solving differential equations, and the amount of computations it involves is considerable. Possible ways of organizing parallel computations on multiprocessor computer systems with shared and distributed memory are considered.
In Subsection 12.1 the statement of the Dirichlet problem is given, and the finite difference methods (in particular the Gauss-Seidel algorithm) for solving the problem are described.
In Subsection 12.2 the problems of parallel programming for shared memory multiprocessor systems are discussed. It is noted that one of the most widely used approaches in this case is the OpenMP technology, which makes it possible to develop parallel software on the basis of existing sequential programs by adding directives that indicate the program fragments that can be executed in parallel. As a result, the first prototypes of parallel programs can be obtained quite easily, but in the course of improving the efficiency of a parallel program the typical problems of parallel computations appear immediately. Some of these problems are discussed, namely excessive synchronization of computations, possible deadlocks, and computation indeterminacy in the case of race conditions. To overcome these problems, various computational schemes for solving the Dirichlet problem are given (the red/black row alternation scheme, the wavefront computation scheme, etc.). Finally, the job queue technique is described as a way to provide processor load balancing.
In Subsection 12.3 the subject is parallel programming for distributed memory multicomputer systems. The discussion is concentrated on the problems of data distribution and information communication among the processors. The block-striped decomposition scheme is considered in detail, and a general description is given for the checkerboard data partition. For the communication operations, synchronous and asynchronous implementation variants are discussed. Finally, the data communication complexity is estimated.
12.5. References
Additional information on the problems of the numerical solution of partial differential equations and on the method of finite differences can be found in Fox, et al. (1988).
Memory organization and cache usage are described in detail in Culler, et al. (1998).
Information on the problem of processor load balancing may be found in Xu and Hwang (1998).
12.6. Discussions
1. How is the Dirichlet problem for Poisson equation defined?
2. How is the method of finite differences applied for solving the Dirichlet problem?
3. What approaches determine the parallel computations for the grid methods on shared memory
multiprocessor systems?
4. What is the essence of parallel computation synchronization?
5. What are the characteristics of the race condition of the parallel program threads?
6. What is the essence of the deadlock problem?
7. Which method guarantees the grid method determinacy but requires a great additional memory for
implementation?
8. How does the computation amount change in case of using the wavefront processing methods?
9. How is chunking applied to reduce computation synchronization?
10. What are the ways to increase the efficiency of wavefront data processing methods?
11. In which way does the job queue provide for computational load balancing?
12. What are the problems to be solved in the course of parallel computations on distributed memory
systems?
13. What mechanisms can be involved in the process of data transmission?
14. In which way can the multiple wave calculations be applied for increasing the wave computation
efficiency in distributed memory systems?
12.7. Exercises
1. Develop the first and the second variants of the Gauss-Seidel parallel algorithm. Carry out the
computational experiments and compare the execution time.
2. Implement the parallel algorithm based on the wavefront computation scheme. Develop the
implementation of the parallel algorithm, in which the block-oriented approach to the wavefront processing
method is applied. Carry out the computational experiments and compare the execution time.
3. Develop the implementation of the parallel computation job queue for shared memory systems. It is
necessary to provide processing the neighboring blocks on the same processors.
References
Buyya, R. (Ed.) (1999). High Performance Cluster Computing. Volume 1: Architectures and Systems. Volume 2: Programming and Applications. Prentice Hall PTR, Prentice-Hall Inc.
Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., Menon, R. (2000). Parallel Programming in OpenMP. Morgan Kaufmann Publishers.
Culler, D., Singh, J.P., Gupta, A. (1998). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann.
Fox, G.C., et al. (1988). Solving Problems on Concurrent Processors. Prentice Hall, Englewood Cliffs, NJ.
Gropp, W., Lusk, E., Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press.
Pfister, G.P. (1995). In Search of Clusters. Prentice Hall PTR, Upper Saddle River, NJ (2nd edn., 1998).
Roosta, S.H. (2000). Parallel Processing and Parallel Algorithms: Theory and Computation. Springer-Verlag, NY.
Tanenbaum, A. (2001). Modern Operating Systems. 2nd edn. Prentice Hall.
Xu, Z., Hwang, K. (1998). Scalable Parallel Computing: Technology, Architecture, Programming. Boston: McGraw-Hill.