PARALLEL FACTORIZATIONS IN NUMERICAL ANALYSIS∗
arXiv:0912.3678v1 [math.NA] 18 Dec 2009
PIERLUIGI AMODIO† AND LUIGI BRUGNANO‡
Abstract. In this paper we review the parallel solution of sparse linear systems, usually arising
from the discretization of ODE-IVPs or ODE-BVPs. The approach is based on the concept of parallel
factorization of a (block) tridiagonal matrix. This allows us to obtain efficient parallel extensions of
many known matrix factorizations, and to derive, as a by-product, a unifying approach to the parallel
solution of ODEs.
Key words. Ordinary differential equations (ODEs), initial value problems (IVPs), boundary
value problems (BVPs), parallel factorizations, linear systems, sparse matrices, parallel solution,
“Parareal” algorithm.
AMS subject classifications. 65F05, 15A09, 15A23.
1. Introduction. The numerical solution of ODEs requires the solution of sparse
and structured linear systems. The parallel solution of these problems may be obtained in two ways: for BVPs, since the size of the associated linear system is large,
we need to develop parallel solvers for the resulting linear systems; for IVPs, we need
to define appropriate numerical methods that lead to parallelizable linear systems.
In both cases, the main problem can then be reduced to the solution of special
sparse linear systems, whose solution is here approached through the use of parallel
factorizations, originally introduced for deriving efficient parallel tridiagonal solvers [2,
9], and subsequently generalized to block tridiagonal, Almost Block Diagonal (ABD),
and Bordered Almost Block Diagonal (BABD) systems [10, 11, 15, 16, 17, 19].
With this premise, the structure of the paper is the following: in Section 2 the
main facts about parallel factorizations and their extensions are briefly recalled; then,
in Section 3 their application for solving ODE problems is sketched; finally, in Section 4
we show that this approach also encompasses the so-called “Parareal” algorithm,
recently introduced in [23, 24].
2. Parallel factorizations. In this section we consider several parallel algorithms in the class of partition methods for the solution of linear systems,

$$Ax = f, \qquad (2.1)$$

where A is an n × n sparse and structured matrix, and x and f are vectors of length n.
We will investigate the parallel solution of (2.1) on p processors, supposing p ≪ n in
order for the number of sequential operations to be much smaller than that of parallel
ones.
The coefficient matrices A here considered are (block) banded, tridiagonal, bidiagonal, or even Almost Block Diagonal (ABD). All these structures may be rearranged
∗ Work developed within the project “Numerical methods and software for differential equations”.
† Dipartimento di Matematica, Università di Bari, Bari, Italy (amodio@dm.uniba.it).
‡ Dipartimento di Matematica, Università di Firenze, Firenze, Italy (luigi.brugnano@unifi.it).
Fig. 2.1. Partitioning of a banded matrix. Each point represents a (block) entry of the matrix.
in the form

$$A = \begin{pmatrix}
A^{(1)} & c_1^{(1)} & & & & \\
b_1^{(1)T} & a^{(1)} & c_0^{(2)T} & & & \\
 & b_0^{(2)} & A^{(2)} & c_1^{(2)} & & \\
 & & b_1^{(2)T} & a^{(2)} & \ddots & \\
 & & & \ddots & \ddots & \\
 & & & & a^{(p-1)} & c_0^{(p)T} \\
 & & & & b_0^{(p)} & A^{(p)}
\end{pmatrix}, \qquad (2.2)$$
where the diagonal blocks are square, and the superscript (i) indicates that the block
is handled only by processor i. The size of the blocks a^{(i)}, A^{(i)}, b_j^{(i)}, and c_j^{(i)} is in
general independent of both i and j, and only depends on the sparsity structure of the
coefficient matrix A. In particular, the size of the blocks a^{(i)} is quite important, since
the sequential section of the algorithm is proportional to it. Therefore, the blocks
a^{(i)} should be as small as possible. As an example, if A is (block) tridiagonal, a^{(i)}
reduces to a single (block) entry. Conversely, in the case of banded (block) matrices, the
(block) size of a^{(i)} equals max(s, r), where s and r denote the number of lower and
upper off (block) diagonals (see Figure 2.1), respectively. In the case of ABD matrices,
a^{(i)} is a block of size equal to the number of rows in each block row of the coefficient
matrix (see Figure 2.2). Since row and column permutations inside each block do
not destroy the sparsity structure of the coefficient matrix, in ABD matrices we may
permute the elements inside a^{(i)} to improve stability properties. The blocks A^{(i)} have the
same sparsity structure as the original matrix, and are locally handled by using any
suitable sequential algorithm.
In order to describe all such parallel algorithms within a unified framework, we consider the following factorization [2, 9]:

$$A = F\,T\,G, \qquad (2.3)$$
Fig. 2.2. Partitioning of an ABD matrix. Each point represents an entry of the matrix.
where

$$F = \begin{pmatrix}
N^{(1)} & & & & & \\
v^{(1)T} & I & w^{(2)T} & & & \\
 & o & N^{(2)} & & & \\
 & & v^{(2)T} & I & w^{(3)T} & \\
 & & & o & N^{(3)} & \\
 & & & & v^{(3)T} & \ddots & \\
 & & & & & \ddots & w^{(p)T} \\
 & & & & & & N^{(p)}
\end{pmatrix}, \qquad (2.4)$$

$$T = \begin{pmatrix}
\hat I & o & & & & & \\
o^T & \alpha^{(1)} & o^T & \gamma^{(2)} & & & \\
 & o & \hat I & o & & & \\
 & \beta^{(2)} & o^T & \alpha^{(2)} & o^T & \gamma^{(3)} & \\
 & & & o & \hat I & o & \\
 & & & \beta^{(3)} & o^T & \alpha^{(3)} & \ddots \\
 & & & & & \ddots & \ddots & \\
 & & & & & & \alpha^{(p-1)} & o^T \\
 & & & & & & o & \hat I
\end{pmatrix}, \qquad (2.5)$$
$$G = \begin{pmatrix}
S^{(1)} & y^{(1)} & & & & \\
o^T & I & o^T & & & \\
 & z^{(2)} & S^{(2)} & y^{(2)} & & \\
 & & o^T & I & o^T & \\
 & & & z^{(3)} & S^{(3)} & y^{(3)} \\
 & & & & & \ddots \\
 & & & & z^{(p)} & S^{(p)}
\end{pmatrix}, \qquad (2.6)$$
I, Î, and o are identity and null matrices of appropriate sizes, and N^{(i)} S^{(i)} is any
suitable factorization of the block A^{(i)}. The remaining entries of F, T, and G can be
derived from (2.3) by direct identification.
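As a concrete illustration, the direct identification of the entries of F, T, and G can be checked numerically on a small scalar tridiagonal matrix with p = 2 partitions. The sketch below uses hypothetical test data, takes N^{(i)} S^{(i)} to be a plain LU factorization without pivoting (safe here, since the blocks are diagonally dominant), and verifies that the assembled factors reproduce A:

```python
import numpy as np

def lu_nopivot(B):
    """Doolittle LU without pivoting (B is diagonally dominant here)."""
    msize = B.shape[0]
    Lf, Uf = np.eye(msize), B.astype(float).copy()
    for kk in range(msize - 1):
        for r in range(kk + 1, msize):
            Lf[r, kk] = Uf[r, kk] / Uf[kk, kk]
            Uf[r, :] -= Lf[r, kk] * Uf[kk, :]
    return Lf, Uf

rng = np.random.default_rng(0)
k = 4                        # block size on each of the p = 2 processors
n = 2 * k + 1                # two diagonal blocks plus one separator entry a
d = 4.0 + rng.random(n)
lo = -1.0 + 0.2 * rng.random(n - 1)
up = -1.0 + 0.2 * rng.random(n - 1)
A = np.diag(d) + np.diag(lo, -1) + np.diag(up, 1)

# partitioning (2.2): A1 = A[:k,:k], separator a = A[k,k], A2 = A[k+1:,k+1:]
A1, A2, a = A[:k, :k], A[k+1:, k+1:], A[k, k]
c1 = A[:k, k]                # c1: last column of the first block row
b1 = A[k, :k]                # b1^T: separator row over A1
c0 = A[k, k+1:]              # c0^T: separator row over A2
b0 = A[k+1:, k]              # b0: first column of the second block row

# local factorizations N S = A^{(i)} (here LU), and the remaining entries
N1, S1 = lu_nopivot(A1)
N2, S2 = lu_nopivot(A2)
y1 = np.linalg.solve(N1, c1)        # N1 y1 = c1
v1 = np.linalg.solve(S1.T, b1)      # v1^T S1 = b1^T
w2 = np.linalg.solve(S2.T, c0)      # w2^T S2 = c0^T  (fill-in vector)
z2 = np.linalg.solve(N2, b0)        # N2 z2 = b0      (fill-in vector)
alpha = a - v1 @ y1 - w2 @ z2       # the (here 1x1) reduced matrix T_p

# assemble F, T, G as in (2.4)-(2.6) and check that A = F T G
F = np.zeros((n, n)); T = np.eye(n); G = np.zeros((n, n))
F[:k, :k] = N1;  F[k, :k] = v1;  F[k, k] = 1.0
F[k, k+1:] = w2; F[k+1:, k+1:] = N2
T[k, k] = alpha
G[:k, :k] = S1;  G[:k, k] = y1;  G[k, k] = 1.0
G[k+1:, k] = z2; G[k+1:, k+1:] = S2
assert np.allclose(F @ T @ G, A)
```

Note that z2 and w2 are the fill-in vectors produced by the local triangular solves, exactly as discussed for the LU case below.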
Factorization (2.3) may be computed in parallel on the p processors. For simplicity, we analyze the factorization of the sub-matrix identified by the superscript (i)
(with obvious differences for i = 1 and i = p):
$$M^{(i)} = \begin{pmatrix}
0 & c_0^{(i)T} & 0 \\
b_0^{(i)} & A^{(i)} & c_1^{(i)} \\
0 & b_1^{(i)T} & a^{(i)}
\end{pmatrix}. \qquad (2.7)$$
Following (2.2)–(2.6), we have

$$M^{(i)} = \begin{pmatrix}
I & w^{(i)T} & \\
o & N^{(i)} & o \\
 & v^{(i)T} & I
\end{pmatrix}
\begin{pmatrix}
\alpha_1^{(i)} & o^T & \gamma^{(i)} \\
o & \hat I & o \\
\beta^{(i)} & o^T & \alpha_2^{(i)}
\end{pmatrix}
\begin{pmatrix}
I & o^T & \\
z^{(i)} & S^{(i)} & y^{(i)} \\
 & o^T & I
\end{pmatrix}, \qquad (2.8)$$

where

$$\alpha_2^{(i)} + \alpha_1^{(i+1)} = \alpha^{(i)},$$

and the other entries are the same as defined in (2.4)–(2.6).
Consequently, by considering any given factorization for A^{(i)}, it is possible to
derive corresponding parallel extensions of such factorizations, which cover most of
the parallel algorithms in the class of partition methods. In particular, the following
ones easily derive for matrices with well-conditioned sub-blocks A^{(i)} (this means, for
example, that pivoting is unnecessary or does not destroy the sparsity structure):
• LU factorization, by setting in (2.8) N^{(i)} = L^{(i)} and S^{(i)} = U^{(i)}, where
L^{(i)} U^{(i)} is the LU factorization of the matrix A^{(i)}. In this case, the (block)
vectors y^{(i)} and v^{(i)} maintain the same sparsity structure as that of c_1^{(i)} and
b_1^{(i)}, respectively, while the vectors z^{(i)} and w^{(i)} are non-null fill-in (block)
vectors, obtained by solving two triangular systems.
• LUD factorization (which derives from the Gauss-Jordan elimination algorithm), by setting in (2.8) S^{(i)} = D^{(i)}, a diagonal matrix, and

$$\begin{pmatrix}
I & w^{(i)T} & \\
o & N^{(i)} & o \\
 & v^{(i)T} & I
\end{pmatrix}
= \begin{pmatrix}
I & o^T & \\
o & L^{(i)} & o \\
 & v^{(i)T} & I
\end{pmatrix}
\begin{pmatrix}
I & w^{(i)T} & \\
o & U^{(i)} & o \\
 & o^T & I
\end{pmatrix},$$
where L^{(i)} and U^{(i)} are lower and upper triangular matrices, respectively, with
unit diagonal. Therefore, v^{(i)} and w^{(i)} maintain the same sparsity structure
as that of b_1^{(i)} and c_0^{(i)}, respectively, while z^{(i)} and y^{(i)} are non-null fill-in
(block) vectors.
• cyclic reduction algorithm [2, 9] (see also [1, 12, 15, 16]), which is one of
the best known parallel algorithms but which, in its original form, requires a
synchronization at each reduction step. In fact, the idea of this algorithm is
to perform several reductions that, at each step, halve the size of the system.
On the other hand, to obtain a factorization in the form (2.8), it is possible
to consider cyclic reduction as a sequential algorithm to be applied locally,

$$M^{(i)} = \bigl(\hat P_1^{(i)} \hat L_1^{(i)} \hat P_2^{(i)} \hat L_2^{(i)} \cdots\bigr)\, \hat D^{(i)}\, \bigl(\cdots \hat U_2^{(i)} \hat P_2^{(i)T} \hat U_1^{(i)} \hat P_1^{(i)T}\bigr),$$

where the \hat P_j^{(i)} are suitable permutation matrices that maintain the first and last
row in the reduced matrix. The computational cost, which is much higher if
the algorithm is applied to A on a sequential computer, becomes comparable
to that of the previous local factorizations, since this algorithm does not compute fill-in block vectors. As a consequence, the corresponding parallel factorization
algorithm turns out to be one of the most effective.
• Alternate row and column elimination [25], which is an algorithm suitable for
ABD matrices. In fact, for such matrices alternate row and column permutations always guarantee stability without fill-in. This feature extends to the
parallel algorithm, by taking into account that row permutations between the
first block row of A^{(i)} and the block containing c_0^{(i)} (see (2.7)) still make the
parallel algorithm stable without introducing fill-in. Such a parallel factorization is defined by setting N^{(i)} = P^{(i)} L^{(i)} and S^{(i)} = U^{(i)} Q^{(i)}, where P^{(i)} and
Q^{(i)} are permutation matrices and L^{(i)} and U^{(i)}, after a suitable reordering
of the rows and of the columns, are 2 × 2 block triangular matrices (see [10]
for full details). Finally, the (block) vectors y^{(i)} and z^{(i)} maintain the same
sparsity structure as that of b_1^{(i)} and c_0^{(i)}, respectively, whereas w^{(i)} and v^{(i)}
are fill-in (block) vectors.
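As an aside, the classical (sequential) cyclic reduction scheme mentioned above can be sketched for a scalar tridiagonal system with n = 2^k − 1 unknowns; the following is a minimal illustration under that size assumption with hypothetical test data, not the local block variant used in the parallel factorization:

```python
import numpy as np

def cyclic_reduction(a, b, c, f):
    """Solve the tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side f (all length-n arrays; a[0] and
    c[-1] are ignored), for n = 2**k - 1 unknowns.  Each forward sweep
    combines every second equation with its two neighbours, halving the
    system; back-substitution then recovers the eliminated unknowns."""
    n = len(b)
    a = a.astype(float).copy(); a[0] = 0.0
    c = c.astype(float).copy(); c[-1] = 0.0
    b = b.astype(float).copy(); f = f.astype(float).copy()
    step = 1
    while step < n:                          # forward reduction
        for i in range(2 * step - 1, n, 2 * step):
            lo, hi = i - step, i + step
            al, be = -a[i] / b[lo], -c[i] / b[hi]
            b[i] += al * c[lo] + be * a[hi]
            f[i] += al * f[lo] + be * f[hi]
            a[i], c[i] = al * a[lo], be * c[hi]
        step *= 2
    x = np.zeros(n)
    step = (n + 1) // 2
    while step >= 1:                         # back-substitution
        for i in range(step - 1, n, 2 * step):
            xi = f[i]
            if i - step >= 0:
                xi -= a[i] * x[i - step]
            if i + step < n:
                xi -= c[i] * x[i + step]
            x[i] = xi / b[i]
        step //= 2
    return x

# usage: a diagonally dominant test system with n = 2**5 - 1 unknowns
rng = np.random.default_rng(1)
n = 31
b = 4.0 + rng.random(n)
a = -np.ones(n); c = -np.ones(n)
f = rng.random(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x = cyclic_reduction(a, b, c, f)
assert np.allclose(A @ x, f)
```

In the original parallel formulation, each of these reduction sweeps requires a global synchronization, which is precisely what the local use of the algorithm inside (2.8) avoids.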
Concerning the solution of the systems associated with the previous parallel
factorizations, much parallelism is available. The solution of the systems with the
matrices F and G may proceed in parallel on the different processors. Conversely, the
solution of the system with the matrix T requires a sequential part, consisting in the
solution of a reduced system with the (block) tridiagonal reduced matrix

$$T_p = \begin{pmatrix}
\alpha^{(1)} & \gamma^{(2)} & & \\
\beta^{(2)} & \alpha^{(2)} & \ddots & \\
 & \ddots & \ddots & \gamma^{(p-1)} \\
 & & \beta^{(p-1)} & \alpha^{(p-1)}
\end{pmatrix}. \qquad (2.9)$$
We observe that the (block) size of Tp only depends on p and is independent of n.
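The whole solution process (local factorizations in parallel, a reduced tridiagonal system (2.9) of size p − 1, and parallel back-substitution) can be sketched as follows for a scalar tridiagonal matrix. The partition sizes and test data below are hypothetical, the local solves are written with dense linear algebra for clarity, and the loops over the blocks are the parts that would run concurrently on the p processors:

```python
import numpy as np

def parallel_factorization_solve(A, f, p):
    """Solve the tridiagonal system A x = f via the partitioning (2.2):
    independent local solves on p blocks, a reduced tridiagonal system of
    size p-1 in the separator unknowns, and parallel back-substitution."""
    n = A.shape[0]
    k = (n - (p - 1)) // p                 # block size (assumed to divide evenly)
    assert p * k + p - 1 == n
    blocks = [slice(i * (k + 1), i * (k + 1) + k) for i in range(p)]
    seps = [i * (k + 1) + k for i in range(p - 1)]

    g, pv, qv = [], [], []                 # local solutions and fill-in vectors
    for i, blk in enumerate(blocks):       # fully parallel across processors
        Ai = A[blk, blk]
        g.append(np.linalg.solve(Ai, f[blk]))
        # pv[i]: coupling to the left separator, qv[i]: to the right one
        pv.append(np.linalg.solve(Ai, A[blk, seps[i - 1]]) if i > 0 else None)
        qv.append(np.linalg.solve(Ai, A[blk, seps[i]]) if i < p - 1 else None)

    # reduced tridiagonal system (cf. (2.9)) in the p-1 separator unknowns
    Tp = np.zeros((p - 1, p - 1)); r = np.zeros(p - 1)
    for i, s in enumerate(seps):
        bl, br = blocks[i], blocks[i + 1]
        Tp[i, i] = A[s, s] - A[s, bl] @ qv[i] - A[s, br] @ pv[i + 1]
        r[i] = f[s] - A[s, bl] @ g[i] - A[s, br] @ g[i + 1]
        if i > 0:
            Tp[i, i - 1] = -A[s, bl] @ pv[i]
        if i < p - 2:
            Tp[i, i + 1] = -A[s, br] @ qv[i + 1]
    xs = np.linalg.solve(Tp, r)            # the sequential section

    # parallel back-substitution for the interior unknowns
    x = np.zeros(n)
    x[seps] = xs
    for i, blk in enumerate(blocks):
        xi = g[i].copy()
        if i > 0:
            xi -= pv[i] * xs[i - 1]
        if i < p - 1:
            xi -= qv[i] * xs[i]
        x[blk] = xi
    return x

# usage: p = 4 partitions of size k = 5, hence n = 4*5 + 3 = 23
rng = np.random.default_rng(2)
n, p = 23, 4
A = np.diag(4.0 + rng.random(n)) \
    + np.diag(-np.ones(n - 1), -1) + np.diag(-np.ones(n - 1), 1)
f = rng.random(n)
x = parallel_factorization_solve(A, f, p)
assert np.allclose(A @ x, f)
```

The only step whose cost grows with p is the (p − 1) × (p − 1) reduced solve, in agreement with the observation that the sequential section is proportional to the size of the blocks a^{(i)}.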
For a matrix A with singular or ill-conditioned sub-blocks A^{(i)}, the local factorizations may be unstable or even undefined. Consequently, it is necessary to slightly
modify the factorization (2.8), in order to obtain stable parallel algorithms. The basic idea is that factorization (2.8) may produce more than two entries per block row in the reduced
system. In other words, the factorization of A^{(i)} is stopped when the considered
sub-block is ill-conditioned (or the local factorization would have a singular factor). As a
consequence, the size of the reduced system is increased, as sketched below. Let then
$$M^{(i)} = \hat L_1^{(i)} \hat D_1^{(i)} \hat U_1^{(i)},$$

where

$$\hat L_1^{(i)} = \begin{pmatrix}
I & w_1^{(i)T} & & \\
o & N_1^{(i)} & o & \\
 & v_1^{(i)T} & I & \\
 & & & \hat I
\end{pmatrix}, \qquad
\hat U_1^{(i)} = \begin{pmatrix}
I & o^T & & \\
z_1^{(i)} & S_1^{(i)} & y_1^{(i)} & \\
 & o^T & I & \\
 & & & \hat I
\end{pmatrix},$$

$$\hat D_1^{(i)} = \begin{pmatrix}
\alpha_1^{(i)} & o^T & \gamma_1^{(i)} & & \\
o & \hat I & o & & \\
\beta_1^{(i)} & o^T & \alpha_2^{(i)} & c_2^{(i)T} & \\
 & & b_2^{(i)} & A_2^{(i)} & c_3^{(i)} \\
 & & & b_3^{(i)T} & \alpha_3^{(i)}
\end{pmatrix},$$
when the sub-block A_1^{(i)} of A^{(i)},

$$A_1^{(i)} = \begin{pmatrix}
N_1^{(i)} & \\
v_1^{(i)T} & \alpha_2^{(i)}
\end{pmatrix}
\begin{pmatrix}
S_1^{(i)} & y_1^{(i)} \\
o^T & I
\end{pmatrix},$$

is singular, because the block α_2^{(i)} is singular (i.e., α_2^{(i)} = 0, in the scalar case). Then,
α_2^{(i)} is introduced in the reduced system. By iterating this procedure on \hat D_1^{(i)}, we
obtain again the factorization (2.3), with the only difference that now the reduced
matrix T_p may be of (block) size larger than p − 1 (compare with (2.9)). However,
it may be shown that its size still depends only on p, whereas it is independent of n [3, 4].
Consequently, the sequential section of the whole algorithm is still negligible when n ≫ p.
The parallel algorithms that fall in this class are [3, 4]:
• LU factorization with partial pivoting, defined by setting N_1^{(i)} = (P_1^{(i)})^T L_1^{(i)}
and S_1^{(i)} = U_1^{(i)}, where P_1^{(i)} is a permutation matrix such that L_1^{(i)} U_1^{(i)} is
the LU factorization of P_1^{(i)} A_1^{(i)}. The remaining (block) vectors are defined
similarly as in the case of the LU factorization previously described.
• QR factorization, defined by setting N_1^{(i)} = Q_1^{(i)} and S_1^{(i)} = R_1^{(i)}. In this case
both w_1^{(i)} and z_1^{(i)} are fill-in (block) vectors, while v_1^{(i)} and y_1^{(i)} maintain the
same sparsity structure as that of the corresponding (block) vectors in M^{(i)}.
Factorization (2.3)–(2.6), and the corresponding parallel algorithms mentioned
above, are easily generalized to matrices with additional non-null elements in the
right-lower and/or left-upper corners. This is the case, for example, of Bordered ABD
(BABD) matrices (see Figure 2.3) and matrices with a circulant-like structure (see
Figure 2.4). Supposing the non-null elements are located in the right-upper corner
Fig. 2.3. Partitioning of a BABD matrix. Each point represents an entry of the matrix.
(this is always possible by means of suitable permutations), the coefficient matrix
is partitioned in the form
$$A = \begin{pmatrix}
a^{(0)} & c_0^{(1)T} & & & & & & b^T \\
b_0^{(1)} & A^{(1)} & c_1^{(1)} & & & & & \\
 & b_1^{(1)T} & a^{(1)} & c_0^{(2)T} & & & & \\
 & & b_0^{(2)} & A^{(2)} & c_1^{(2)} & & & \\
 & & & b_1^{(2)T} & a^{(2)} & \ddots & & \\
 & & & & \ddots & a^{(p-1)} & c_0^{(p)T} & \\
 & & & & & b_0^{(p)} & A^{(p)} & c_1^{(p)} \\
 & & & & & & b_1^{(p)T} & a^{(p)}
\end{pmatrix}, \qquad (2.10)$$
where b is the smallest rectangular block containing all the corner elements.
A factorization similar to that in (2.3)–(2.6) (the obvious differences are related
to the first and last (block) rows) produces a corresponding reduced system with the
reduced matrix

$$T_p = \begin{pmatrix}
\alpha^{(0)} & \gamma^{(1)} & & & \beta^{(0)} \\
\beta^{(1)} & \alpha^{(1)} & \gamma^{(2)} & & \\
 & \beta^{(2)} & \alpha^{(2)} & \ddots & \\
 & & \ddots & \ddots & \gamma^{(p)} \\
 & & & \beta^{(p)} & \alpha^{(p)}
\end{pmatrix}. \qquad (2.11)$$
We observe that, for the very important classes of BABD and circulant-like matrices (the latter, after a suitable row permutation, see Figure 2.5), both the matrix
Fig. 2.4. Partitioning of a matrix with a circulant-like structure. Each point represents a
(block) entry of the matrix.
(2.10) and the reduced matrix (2.11) have the form of a lower block bidiagonal matrix
(i.e., c_j^{(i)} = 0 and γ^{(i)} = 0 for all i and j) with an additional right-upper corner block:

$$A = \begin{pmatrix}
a^{(0)} & & & & & & b^T \\
b_0^{(1)} & A^{(1)} & & & & & \\
 & b_1^{(1)T} & a^{(1)} & & & & \\
 & & b_0^{(2)} & \ddots & & & \\
 & & & \ddots & a^{(p-1)} & & \\
 & & & & b_0^{(p)} & A^{(p)} & \\
 & & & & & b_1^{(p)T} & a^{(p)}
\end{pmatrix}$$

and

$$T_p = \begin{pmatrix}
\alpha^{(0)} & & & \beta^{(0)} \\
\beta^{(1)} & \alpha^{(1)} & & \\
 & \ddots & \ddots & \\
 & & \beta^{(p)} & \alpha^{(p)}
\end{pmatrix}.$$
We also note that, for this kind of matrices, the overall computational
cost of a parallel factorization algorithm increases only slightly. On a sequential
machine, supposing we maintain the same partitioning of the matrix as on p > 1 processors, the computational cost is similar to that of any efficient sequential
algorithm (the corner block b implies the construction of a fill-in (block) vector), but
with better stability properties (see [26]). For this reason, a method that is widely
and efficiently applied to matrices in the form (2.10), also on a sequential computer, is
cyclic reduction (see [11, 17, 19], where cyclic reduction is applied to BABD matrices).
Fig. 2.5. Partitioning of the matrix in Fig. 2.4 after row permutation. Each point represents a
(block) entry of the matrix.
3. Parallel solution of differential equations. The numerical solution of
ODE-BVPs leads to the solution of large and sparse linear systems of equations, which
represent the most expensive part of a BVP code. The sparsity structure of the
obtained problem depends on the methods implemented. In general, one-step methods
lead to ABD or BABD matrices, depending on the boundary conditions (separated
or not, respectively), while multistep methods lead to block banded systems (with
additional corner blocks in the case of non-separated boundary conditions) [5, 6, 22]. The
parallel algorithms previously described are well suited to this kind of systems. For
this reason, we do not investigate ODE-BVPs further.
Conversely, we shall now consider the application of parallel factorizations for
deriving parallel algorithms for numerically solving ODE-IVPs, which we assume, for the
sake of simplicity, to be linear and in the form

$$y' = Ly + g(t), \qquad t \in [t_0, T], \qquad y(t_0) = y_0 \in \mathbb{R}^m, \qquad (3.1)$$

which is, however, sufficient to grasp the main features of the approach [5, 6, 7, 8, 20,
21, 22].
Let us consider a suitable coarse mesh, defined by the following partition of the
integration interval in (3.1):

$$t_0 \equiv \tau_0 < \tau_1 < \cdots < \tau_p \equiv T. \qquad (3.2)$$
Suppose, for simplicity, that inside each sub-interval we apply a given method
with a constant stepsize

$$h_i = \frac{\tau_i - \tau_{i-1}}{N}, \qquad i = 1, \dots, p, \qquad (3.3)$$
to approximate the problem

$$y' = Ly + g(t), \qquad t \in [\tau_{i-1}, \tau_i], \qquad y(\tau_{i-1}) = y_{0i}, \qquad i = 1, \dots, p. \qquad (3.4)$$

If y(t) denotes the solution of problem (3.1), and we denote by

$$y_{ni} \approx y(\tau_{i-1} + nh_i), \qquad n = 0, \dots, N, \qquad i = 1, \dots, p, \qquad (3.5)$$

the entries of the discrete approximation, then, in order for the numerical solutions
of (3.1) and (3.4) to be equivalent, we require that (see (3.2) and (3.5))

$$y_{01} = y_0, \qquad y_{0i} \equiv y_{N,i-1}, \qquad i = 2, \dots, p. \qquad (3.6)$$

By convention, we also set

$$y_{01} \equiv y_{N0}. \qquad (3.7)$$
Let us now suppose that the numerical approximations to the solutions of (3.4) are
obtained by solving discrete problems in the form

$$M_i y_i = v_i y_{0i} + g_i, \qquad y_i = (y_{1i}, \dots, y_{Ni})^T, \qquad i = 1, \dots, p, \qquad (3.8)$$

where the matrices M_i ∈ R^{mN×mN} and v_i ∈ R^{mN×m}, and the vector g_i ∈ R^{mN}, depend on the chosen method (see, e.g., [5, 6], for the case of block BVMs) and on the
problems (3.4). Clearly, this is a quite general framework, which encompasses most
of the currently available methods for solving ODE-IVPs. By taking into account all
the above facts, one obtains that the global approximation to the solution of (3.1)
is obtained by solving a discrete problem in the form (hereafter, I_r will denote the
identity matrix of dimension r):
$$M y \equiv \begin{pmatrix}
I_m & & & & \\
-v_1 & M_1 & & & \\
 & -V_2 & M_2 & & \\
 & & \ddots & \ddots & \\
 & & & -V_p & M_p
\end{pmatrix}
\begin{pmatrix} y_{N0} \\ y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
= \begin{pmatrix} y_0 \\ g_1 \\ g_2 \\ \vdots \\ g_p \end{pmatrix}, \qquad (3.9)$$

$$V_i = [\,O \mid v_i\,] \in \mathbb{R}^{mN \times mN}, \qquad i = 2, \dots, p.$$
Obviously, this problem may be solved in a sequential fashion, by means of the
iteration (see (3.6)–(3.7)):

$$y_{N0} = y_0, \qquad M_i y_i = g_i + v_i y_{N,i-1}, \qquad i = 1, \dots, p.$$
Nevertheless, by following arguments similar to those in the previous section, we
consider the factorization:

$$M = \begin{pmatrix}
I_m & & & \\
 & M_1 & & \\
 & & \ddots & \\
 & & & M_p
\end{pmatrix}
\begin{pmatrix}
I_m & & & & \\
-w_1 & I_{mN} & & & \\
 & -W_2 & I_{mN} & & \\
 & & \ddots & \ddots & \\
 & & & -W_p & I_{mN}
\end{pmatrix},$$

where (see (3.9))

$$W_i = [\,O \mid w_i\,] \in \mathbb{R}^{mN \times mN}, \qquad w_i = M_i^{-1} v_i \in \mathbb{R}^{mN \times m}. \qquad (3.10)$$
Consequently, at first we solve, in parallel, the systems

$$M_i z_i = g_i, \qquad z_i = (z_{1i}, \dots, z_{Ni})^T, \qquad i = 1, \dots, p, \qquad (3.11)$$

and, then, (see (3.10) and (3.6)) recursively update the local solutions,

$$y_1 = z_1 + w_1 y_{01}, \qquad
y_i = z_i + W_i y_{i-1} \equiv z_i + w_i y_{0i}, \qquad i = 2, \dots, p. \qquad (3.12)$$
The latter recursion, however, still has much parallelism. Indeed, if we consider
the partitionings (see (3.8), (3.11), and (3.10))

$$y_i = \begin{pmatrix} \hat y_i \\ y_{Ni} \end{pmatrix}, \qquad
z_i = \begin{pmatrix} \hat z_i \\ z_{Ni} \end{pmatrix}, \qquad
w_i = \begin{pmatrix} \hat w_i \\ w_{Ni} \end{pmatrix}, \qquad
w_{Ni} \in \mathbb{R}^{m \times m}, \qquad (3.13)$$
then (3.12) is equivalent to solving, at first, the reduced system

$$\begin{pmatrix}
I_m & & & \\
-w_{N1} & I_m & & \\
 & \ddots & \ddots & \\
 & & -w_{N,p-1} & I_m
\end{pmatrix}
\begin{pmatrix} y_{01} \\ y_{02} \\ \vdots \\ y_{0p} \end{pmatrix}
= \begin{pmatrix} y_0 \\ z_{N1} \\ \vdots \\ z_{N,p-1} \end{pmatrix}, \qquad (3.14)$$

i.e.,

$$y_{01} = y_0, \qquad y_{0,i+1} = z_{Ni} + w_{Ni} y_{0i}, \qquad i = 1, \dots, p-1, \qquad (3.15)$$

after which performing the p parallel updates

$$\hat y_i = \hat z_i + \hat w_i y_{0i}, \qquad i = 1, \dots, p-1, \qquad
y_p = z_p + w_p y_{0p}. \qquad (3.16)$$

We observe that:
• the parallel solution of the p systems in (3.11) is equivalent to computing the
approximate solutions of the following p ODE-IVPs,

$$z' = Lz + g(t), \qquad t \in [\tau_{i-1}, \tau_i], \qquad z(\tau_{i-1}) = 0, \qquad i = 1, \dots, p, \qquad (3.17)$$

in place of the corresponding ones in (3.4);
• the solution of the reduced system (3.14)–(3.15) consists in computing the
proper initial values {y_{0i}} for the previous ODE-IVPs;
• the parallel updates (3.16) update the approximate solutions of the ODE-IVPs (3.17) to those of the corresponding ODE-IVPs in (3.4).
Remark 1. Clearly, the solution of the first (parallel) system in (3.11) and the
first (parallel) update in (3.12) (see also (3.16)) can be executed together, by solving
the linear system (see (3.6))

$$M_1 y_1 = g_1 + v_1 y_0, \qquad (3.18)$$

thus directly providing the final discrete approximation on the first processor; indeed,
this is possible since the initial condition y_0 is given.
We end this section by emphasizing that one obtains an almost perfect parallel
speed-up, if p processors are used, provided that the cost for the solution of the reduced
system (3.14) and of the parallel updates (3.16) is small with respect to that of (3.11)
(see [5, 6] for more details). This is, indeed, the case when the parameter N in (3.3)
is large enough and the coarse partition (3.2) can be supposed to be given a priori.
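For a linear problem (3.1), the steps (3.11), (3.15), and (3.16) can be sketched with, e.g., the implicit Euler method as the underlying scheme; this choice, and the matrix L and data below, are hypothetical illustrations. By linearity, the result coincides, up to round-off, with the sequential integration:

```python
import numpy as np

m, p, N = 2, 4, 200
L = np.array([[-1.0, 8.0], [-8.0, -1.0]])   # hypothetical test matrix
y0 = np.array([1.0, 0.0])
g = lambda t: np.array([np.sin(t), np.cos(t)])
t0, T = 0.0, 2.0
tau = np.linspace(t0, T, p + 1)             # coarse mesh (3.2)
h = (tau[1] - tau[0]) / N                   # constant stepsize (3.3)

I = np.eye(m)
E = np.linalg.inv(I - h * L)                # one implicit Euler step

def local_sweep(i, y_init):
    """N implicit Euler steps on [tau[i], tau[i+1]] starting from y_init;
    returns the local trajectory (the rows y_1i, ..., y_Ni of (3.5))."""
    ys, y = [], y_init
    for n in range(1, N + 1):
        y = E @ (y + h * g(tau[i] + n * h))
        ys.append(y)
    return np.array(ys)

# (3.11)/(3.17): local solves with zero initial values -- fully parallel
z = [local_sweep(i, np.zeros(m)) for i in range(p)]
# w_{Ni}: propagation of the initial value across sub-interval i (here E^N)
wN = np.linalg.matrix_power(E, N)

# (3.15): the reduced (block bidiagonal) system for the initial values y_{0i}
y0i = [y0]
for i in range(p - 1):
    y0i.append(z[i][-1] + wN @ y0i[i])

# (3.16): parallel updates; here only the final value on the last processor
y_parallel = z[-1][-1] + wN @ y0i[-1]

# reference: a single sequential sweep over the whole interval
y_seq = y0
for i in range(p):
    y_seq = local_sweep(i, y_seq)[-1]
assert np.allclose(y_parallel, y_seq)
```

The p local sweeps for the z_i are independent, and only the short recursion for the p initial values y_{0i} is sequential, which is the source of the almost perfect speed-up mentioned above.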
4. Connections with the “Parareal” algorithm. We now briefly describe
the “Parareal” algorithm introduced in [23, 24], showing the existing connections with
the parallel method previously described. This method, originally defined for solving
PDE problems, for example linear or quasi-linear parabolic problems, can be directly
cast into the ODE setting via the semi-discretization of the space variables, that is,
by using the method of lines. In more detail, let us consider the problem

$$\frac{\partial}{\partial t} y = L\, y, \qquad t \in [t_0, T], \qquad y(t_0) = y_0, \qquad (4.1)$$

where L is an operator from a Hilbert space V into V'. Let us consider again the
partition (3.2) of the time interval, and consider the problems

$$\frac{\partial}{\partial t} y = L\, y, \qquad t \in [\tau_{i-1}, \tau_i], \qquad y(\tau_{i-1}) = y_{0i}, \qquad i = 1, \dots, p. \qquad (4.2)$$
Clearly, in order for (4.1) and (4.2) to be equivalent, one must require that

$$y_{0i} = y(\tau_{i-1}), \qquad i = 1, \dots, p. \qquad (4.3)$$

The initial data (4.3) are then formally related by means of suitable propagators F_i,
such that

$$y_{0,i+1} = F_i\, y_{0i}, \qquad i = 1, \dots, p-1. \qquad (4.4)$$
The previous relations can be cast in matrix form as (I now denotes the identity
operator)

$$F y \equiv \begin{pmatrix}
I & & & \\
-F_1 & I & & \\
 & \ddots & \ddots & \\
 & & -F_{p-1} & I
\end{pmatrix}
\begin{pmatrix} y_{01} \\ y_{02} \\ \vdots \\ y_{0p} \end{pmatrix}
= \begin{pmatrix} y_0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \equiv \eta. \qquad (4.5)$$

For solving (4.5), the authors essentially define the splitting

$$F = (F - G) + G, \qquad
G = \begin{pmatrix}
I & & & \\
-G_1 & I & & \\
 & \ddots & \ddots & \\
 & & -G_{p-1} & I
\end{pmatrix},$$

with coarse propagators

$$G_i \approx F_i, \qquad i = 1, \dots, p-1,$$
and consider the iterative procedure

$$G y^{(k+1)} = (G - F)\, y^{(k)} + \eta, \qquad k = 0, 1, \dots,$$

with an obvious meaning of the upper index. This is equivalent to solving the problems

$$y_{01}^{(k+1)} = y_0, \qquad
y_{0,i+1}^{(k+1)} = G_i\, y_{0i}^{(k+1)} + (F_i - G_i)\, y_{0i}^{(k)}, \qquad i = 1, \dots, p-1, \qquad (4.6)$$

thus providing good parallel features, if we can assume that the coarse operators G_i
are “cheap” enough. The iteration (4.6) defines the “Parareal” algorithm, which is
iterated until the differences

$$\bigl\| y_{0i}^{(k+1)} - y_{0i}^{(k)} \bigr\|, \qquad i = 2, \dots, p,$$
are suitably small. In practice, in the case of linear operators, problem (4.1) becomes,
via the method of lines, an ODE in the form (3.1), with L a huge and very sparse
matrix. Consequently, problems (4.2) take the form (3.4). Similarly, the propagator F_i consists in the application of a suitable discrete method for approximating
the solution of the corresponding ith problem in (3.4), and the coarse propagator G_i
describes the application of a much cheaper method for solving the same problem.
As a consequence, if the discrete problems corresponding to the propagators {F_i} are
in the form (3.8), then the discrete version of the recurrence (4.4) becomes exactly
(3.15), and the discrete counterpart of the matrix form (4.5) becomes (3.14).
We can then conclude that the “Parareal” algorithm in [23, 24] exactly coincides
with the iterative solution of the reduced system (3.14), induced by a suitable splitting.
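For a linear ODE, the iteration (4.6) can be sketched by taking as fine propagator F_i several implicit Euler steps and as coarse propagator G_i a single step; both choices, and the test data, are hypothetical illustrations. In exact arithmetic the iteration reproduces the fine sequential values after at most p − 1 sweeps:

```python
import numpy as np

m, p, Nf = 2, 8, 64
L = np.array([[0.0, 1.0], [-4.0, -0.1]])    # damped oscillator (hypothetical)
y0 = np.array([1.0, 0.0])
tau = np.linspace(0.0, 4.0, p + 1)
H = tau[1] - tau[0]
I = np.eye(m)

# fine propagator F_i: Nf implicit Euler steps; coarse G_i: a single step
F = np.linalg.matrix_power(np.linalg.inv(I - (H / Nf) * L), Nf)
G = np.linalg.inv(I - H * L)

# reference: sequential fine propagation of the initial values (4.4)
ref = [y0]
for i in range(p - 1):
    ref.append(F @ ref[i])

# "Parareal" iteration (4.6), started from a constant prediction
Y = [y0.copy() for _ in range(p)]
for k in range(p - 1):
    Ynew = [y0.copy()]
    for i in range(p - 1):
        # G is applied to the new iterate, (F - G) to the old one;
        # the expensive F-applications are independent across i
        Ynew.append(G @ Ynew[i] + (F - G) @ Y[i])
    Y = Ynew

# after p - 1 sweeps the fine values have been propagated exactly
assert all(np.allclose(Y[i], ref[i]) for i in range(p))
```

The fine applications (F − G) y^{(k)}_{0i} use only the previous iterate and can therefore run concurrently; only the cheap coarse sweep with G is sequential, mirroring the role of the reduced system (3.14).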
We observe that the previous iterative procedure may be very appropriate when
the matrix L is large and sparse since, in this case, the computation of the block
vectors {w_i} in (3.10), and then of the matrices {w_{Ni}} (see (3.13)), would clearly be
impractical. Moreover, it can be considerably improved by observing that

$$w_{Ni}\, y_{0i} \approx e^{(\tau_i - \tau_{i-1}) L}\, y_{0i}.$$

Consequently, by considering a suitable approximation to the matrix exponential, the
corresponding parallel algorithm turns out to become semi-iterative and potentially
very effective, as recently shown in [8].
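The observation above can be checked numerically: for implicit Euler with stepsize h = Δτ/N, the propagation block is w_{Ni} = (I − hL)^{−N}, which approximates e^{Δτ L} to first order in h. A small hypothetical test matrix is used below, with the exponential computed via the eigendecomposition:

```python
import numpy as np

L = np.array([[-2.0, 1.0], [0.0, -3.0]])    # small hypothetical test matrix
dtau, N = 0.5, 1000
h = dtau / N
I = np.eye(2)
# implicit Euler propagation block over one sub-interval: (I - hL)^{-N}
wN = np.linalg.matrix_power(np.linalg.inv(I - h * L), N)

# e^{dtau L} via the eigendecomposition (L is diagonalizable here)
lam, V = np.linalg.eig(L)
expL = (V * np.exp(dtau * lam)) @ np.linalg.inv(V)
assert np.allclose(wN, expL.real, atol=1e-3)
```

Replacing the action of w_{Ni} by such an approximation of the matrix exponential is what makes the resulting algorithm semi-iterative.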
REFERENCES
[1] P. Amodio. Optimized cyclic reduction for the solution of linear tridiagonal systems on parallel
computers. Comput. Math. Appl. 26 (1993) 45–53.
[2] P. Amodio and L. Brugnano. Parallel factorizations and parallel solvers for tridiagonal linear
systems. Linear Algebra Appl. 172 (1992) 347–364.
[3] P. Amodio and L. Brugnano, The parallel QR factorization algorithm for tridiagonal linear
systems. Parallel Comput. 21 (1995) 1097–1110.
[4] P. Amodio and L. Brugnano. Stable Parallel Solvers for General Tridiagonal Linear Systems. Z.
Angew. Math. Mech. 76, suppl. 1 (1996) 115–118.
[5] P. Amodio and L. Brugnano. Parallel implementation of block boundary value methods for
ODEs, J. Comput. Appl. Math. 78 (1997) 197–211.
[6] P. Amodio and L. Brugnano. Parallel ODE solvers based on block BVMs, Adv. Comput. Math.
7 (1997) 5–26.
[7] P. Amodio and L. Brugnano. ParalleloGAM: a Parallel Code for ODEs, Appl. Numer. Math.
28 (1998) 95–106.
[8] P. Amodio and L. Brugnano. Parallel solution in time of ODEs: some achievements and perspectives. Appl. Numer. Math. 59 (2009) 424–435.
[9] P. Amodio, L. Brugnano and T. Politi. Parallel factorizations for tridiagonal matrices. SIAM J.
Numer. Anal. 30 (1993) 813–823.
[10] P. Amodio, J.R. Cash, G. Roussos, R.W. Wright, G. Fairweather, I. Gladwell, G.L. Kraut and
M. Paprzycki. Almost block diagonal linear systems: sequential and parallel solution techniques, and applications. Numer. Linear Algebra Appl. 7, no. 5 (2000) 275–317.
[11] P. Amodio, I. Gladwell and G. Romanazzi. Numerical solution of general Bordered ABD linear
systems by cyclic reduction. JNAIAM J. Numer. Anal. Ind. Appl. Math. 1, no. 1 (2006)
5–12.
[12] P. Amodio, N. Mastronardi. A parallel version of the cyclic reduction algorithm on a Hypercube.
Parallel Comput. 19 (1993) 1273–1281.
[13] P. Amodio and F. Mazzia. A parallel Gauss-Seidel method for block tridiagonal linear systems.
SIAM J. Sci. Comput. 16 (1995) 1451–1461.
[14] P. Amodio and F. Mazzia. Parallel iterative solvers for banded linear systems. Lecture Notes in
Comput. Sci. 1196, (Numerical Analysis and its Applications. L. Vulkov, J. Wàsniewski,
and P. Yalamov editors, Springer, Berlin), 17–24, 1997.
[15] P. Amodio and M. Paprzycki. Parallel solution of Almost Block Diagonal systems on a Hypercube. Linear Algebra Appl. 241-243 (1996) 85–103.
[16] P. Amodio and M. Paprzycki. On the parallel solution of Almost Block Diagonal systems. Control Cybernet. 25, no. 3 (1996) 645–656.
[17] P. Amodio and M. Paprzycki. A cyclic reduction approach to the numerical solution of boundary
value ODEs. SIAM J. Sci. Comput. 18, no. 1 (1997) 56–68.
[18] P. Amodio, M. Paprzycki and T. Politi. A survey of parallel direct methods for block bidiagonal
linear systems on distributed memory computers. Comput. Math. Appl. 31 (1996) 111–127.
[19] P. Amodio and G. Romanazzi. BABDCR: a Fortran 90 package for the solution of Bordered
ABD systems. ACM Trans. Math. Software 32, no. 4 (2006) 597–608. (Available on the
url: http://www.netlib.org/toms/859 )
[20] L. Brugnano and D. Trigiante. On the potentiality of sequential and parallel codes based on
extended trapezoidal rules (ETRs), Appl. Numer. Math. 25 (1997) 169–184.
[21] L. Brugnano and D. Trigiante. Parallel implementation of block boundary value methods on
nonlinear problems: theoretical results, Appl. Numer. Math. 78 (1997) 197–211.
[22] L. Brugnano and D. Trigiante. Solving Differential Problems by Multistep Initial and Boundary
Value Methods, Gordon and Breach, Amsterdam, 1998.
[23] J.L. Lions, Y. Maday and G. Turinici. Résolution d’EDP par un schéma en temps “pararéel”,
C. R. Acad. Sci. Paris, Série I 332 (2001) 661–668.
[24] Y. Maday and G. Turinici. A parareal in time procedure for the control of partial differential
equations, C. R. Acad. Sci. Paris, Série I 335 (2002) 387–392.
[25] J.M. Varah. Alternate row and column elimination for solving certain linear systems. SIAM J.
Numer. Anal. 13 (1976) 71–75.
[26] S.J. Wright. A collection of problems for which Gaussian elimination with partial pivoting is
unstable. SIAM J. Sci. Stat. Comput. 14 (1993) 231–238.