F10 - Parallelizing Compilers

The document discusses the challenges and methodologies for automatically parallelizing sequential programs, emphasizing the importance of data dependence analysis. It outlines various types of data dependences and their implications for parallelization, particularly in loop structures, and introduces concepts such as perfect loop nests and dependence distances. The lecture focuses on practical approaches to achieve parallelization, including the use of compilers and OpenMP directives.

Fully Automatic Parallelization

There is a huge amount of source code that is sequential.


Using OpenMP is semi-automatic
For the last 50 years or so, there has been a quest for automatically
parallelizing sequential programs.
An approach to parallelizing source code:
First try a parallelizing compiler and see what happens.
If it fails, look at the compiler feedback and see if you can modify the source.
If that is not useful, try OpenMP (see the sketch below).
If that is not useful, parallelize manually.
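As a minimal illustration of the OpenMP step (a sketch, not from the lecture; the function scale and its parameters are hypothetical), an independent loop can be annotated like this and compiled with e.g. cc -fopenmp:

void scale(int n, double a[], double k)
{
    /* Every iteration writes a distinct a[i]: no data dependences. */
    #pragma omp parallel for
    for (int i = 0; i < n; i += 1)
        a[i] = k * a[i];
}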



Safety of Parallelization

Does the parallel program produce the same output?


The parallelization is invalid if data races are created, obviously.
When a for-loop is parallelized, the iterations are run in an unpredictable order.
Note: changing the iteration order can cause numerical problems, since floating-point arithmetic is not associative; see the example below.
The note above also applies to sequential programs.
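A small illustration of such a numerical effect (my example, not from the lecture):

#include <stdio.h>

int main(void)
{
    float big = 1.0e8f;
    float small = 1.0f;

    /* (big + small) loses small in float precision; reordering keeps it. */
    printf("%f\n", big + small - big);  /* typically prints 0.000000 */
    printf("%f\n", big - big + small);  /* prints 1.000000 */
    return 0;
}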



From Simple to Hard Parallelization Problems

Easiest case: loops with matrix computations and with known loop
bounds and array indexes that are linear functions of the loop variables
We will be more precise shortly
Very complicated case: code with dynamically allocated data
structures with many pointers
It would be very hard to automatically parallelize Lab 0
This lecture focuses on matrix computations



Inner vs Outer Loop Parallelization

In the course EDAN75 Optimizing Compilers we learn about inner loop parallelization, which is used e.g. for automatic SIMD vectorization and software pipelining.
Here the focus instead is on automatic parallelization for multicores, i.e. outer loop parallelization.
The foundations for inner and outer loop parallelization are similar, since they both rely on data dependence analysis.


True data dependences

A true dependence:
S1: x = a + b;
S2: y = x + 1;
It is written $S_1\ \delta^t\ S_2$.
S1 must execute before S2 in any transformed program.


Data Dependences at Different Levels

Data dependences can be at several different levels:


Instructions
Statements
Loop iterations
Functions
Threads
Parallelizing compilers usually find parallelism between different iterations of a loop.
If the compiler can determine that there are no dependences between loop iterations then it can either:
produce parallel machine code, or
produce source code with OpenMP #pragma omp parallel for directives.
If there are dependences, it may still be possible to execute the loop in parallel, since perhaps the loop iterations are not totally ordered; see the example below.
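For instance (my illustration, not from the lecture), the loop

for (i = 2; i < n; i += 1)
    a[i] = a[i-2] + x;

has a loop carried dependence with distance 2, so the iterations are only partially ordered: the even iterations form one sequential chain, the odd iterations another, and the two chains can execute in parallel.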



Total vs Partial Order and Loop Iterations

Integers are totally ordered since we can determine which of a and b is greater if a ≠ b.
Consider a directed acyclic graph. In topological sorting you can process any node u if all predecessors of u have already been processed.
Obviously, we should not execute a loop iteration before its input data has been computed.
In executing a loop in parallel we perform a topological sort of the loop iterations.
Conceptually, topological sorting is the major work in parallelization.
No topological sort is actually performed at compile time or run time to determine which iterations can be executed, though.
Instead, new loops are computed (i.e. created) by the compiler.
If the iterations are totally ordered, no parallelization can be done.

Three more data dependences

In an anti dependence, written $I_1\ \delta^a\ I_2$, $I_1$ reads a memory location later overwritten by $I_2$.
In an output dependence, written $I_1\ \delta^o\ I_2$, $I_1$ writes a memory location later overwritten by $I_2$.
In an input dependence, written $I_1\ \delta^i\ I_2$, both $I_1$ and $I_2$ read the same memory location.
The first three types of dependences create partial orderings among the iterations, which parallelizing compilers exploit by ordering iterations to improve performance.
Input dependences can give a hint to the compiler that some data will be used again, so it can try to keep it in the cache (by reordering iterations in a suitable way).
The small example below illustrates the dependence types.
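An illustration (my example, not from the lecture) with statements on scalar variables:

S1: y = x + 1;  /* reads x                                          */
S2: x = a + b;  /* S1 delta^a S2: x is read by S1, then overwritten */
S3: x = c + d;  /* S2 delta^o S3: x is written twice                */
S4: z = c + 1;  /* S3 delta^i S4: both S3 and S4 read c             */
S5: w = z + 1;  /* S4 delta^t S5: z is written by S4, then read     */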



Loop Level Data Dependences

In the loop
for (i = 3; i < 100; i += 1)
    a[i] = a[i-3] + x;
there is a true dependence from iteration i to iteration i + 3.
Iteration i = 3 writes to a[3], which is read in iteration i = 6.
A loop level true dependence means that one iteration writes to a memory location which a later iteration reads.


Perfect Loop Nests

A perfect loop nest L is a nest of m nested for loops $L_1, L_2, \ldots, L_m$ such that the body of $L_i$, $i < m$, consists of $L_{i+1}$, and the body of $L_m$ consists of a sequence of assignment statements.
For $1 < r \le m$, $p_r$ and $q_r$ are linear functions of $I_1, \ldots, I_{r-1}$.

for (I1 = p1; I1 <= q1; I1 += 1) {
    for (I2 = p2; I2 <= q2; I2 += 1) {
        ...
        for (Im = pm; Im <= qm; Im += 1) {
            h(I1, I2, ..., Im);
        }
    }
}


Example Perfect Loop Nest

All assignments, except those to the loop index variables, are in the innermost loop.
There may be any number of assignment statements in the innermost loop.

for (i = 0; i < 100; i += 1) {
    for (j = 3 + i; j < 2 * i + 10; j += 1) {
        for (k = i - j; k < j - i; k += 1) {
            a[i][j][k] += b[k][j][i];
        }
    }
}


Loop Bounds

The lower bound for $I_1$ is $p_{10} \le I_1$.
The lower bound for $I_2$ is
$$I_2 \ge p_{20} + p_{21} I_1 \iff p_{20} \le -p_{21} I_1 + I_2.$$
The lower bound for $I_3$ is
$$I_3 \ge p_{30} + p_{31} I_1 + p_{32} I_2 \iff p_{30} \le -p_{31} I_1 - p_{32} I_2 + I_3,$$
and so forth. We represent this in matrix form as $p_0 \le IP$; see the next slide.


Loop Bounds in Matrix Form

$$P = \begin{pmatrix} 1 & -p_{21} & -p_{31} & \cdots & -p_{m1} \\ 0 & 1 & -p_{32} & \cdots & -p_{m2} \\ 0 & 0 & 1 & \cdots & -p_{m3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \quad \text{and} \quad p_0 = (p_{10}, p_{20}, \ldots, p_{m0}).$$

Similarly, the upper bounds are represented as $IQ \le q_0$.
The loop bounds, thus, are represented by the system:
$$p_0 \le IP, \qquad IQ \le q_0.$$


Example Non-Perfect Loop Nest

The assignment to c[i][j] before the innermost loop makes this a non-perfect loop nest.
Sometimes a non-perfect loop nest can be split up, or distributed, into perfect loop nests.
See the next slide.

for (i = 0; i < 100; i += 1) {
    for (j = 0; j < 100; j += 1) {
        c[i][j] = 0;
        for (k = 0; k < 100; k += 1) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}


Loop Distribution

Result of loop distribution.

for (i = 0; i < 100; i += 1)
    for (j = 0; j < 100; j += 1)
        c[i][j] = 0;

for (i = 0; i < 100; i += 1)
    for (j = 0; j < 100; j += 1)
        for (k = 0; k < 100; k += 1)
            c[i][j] += a[i][k] * b[k][j];
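Both distributed nests can now run in parallel over i, since no dependence crosses two i iterations. A sketch with OpenMP annotations (mine, not from the lecture), assuming i, j and k are declared as in the original code:

#pragma omp parallel for private(j)
for (i = 0; i < 100; i += 1)
    for (j = 0; j < 100; j += 1)
        c[i][j] = 0;

#pragma omp parallel for private(j, k)
for (i = 0; i < 100; i += 1)
    for (j = 0; j < 100; j += 1)
        for (k = 0; k < 100; k += 1)
            c[i][j] += a[i][k] * b[k][j];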



Some Terminology

The index vector $I = (I_1, I_2, \ldots, I_m)$ is the vector of index variables.
The index values of L are the values of $(I_1, I_2, \ldots, I_m)$.
The index space of L is the subspace of $\mathbb{Z}^m$ consisting of all the index values.
An affine array reference is an array reference in which all subscripts are linear functions of the loop index variables.


Easy non-affine references

Data dependence analysis is normally restricted to affine array references.
In practice, however, subscripts often contain symbolic constants, as shown below in test s171 from the C version of the Argonne Test Suite for Vectorising Compilers.
There is no dependence between the iterations in this test.

for (i = 0; i < n; i++)
    a[i*n] = a[i*n] + b[i];


Problematic Non-Affine Index Functions

In the loop
scanf("%d", &x);
for (i = 3; i < 100; i += 1) {
S1:     a[i] = a[x] + 1;
S2:     b[i] = b[c[i-1]] + 2;
S3:     d[i] = d[2*i*i*i - 3*i*i] + 3;
}
Some compilers do runtime testing to take care of S1, but it may cause too much overhead if many variables must be checked.
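The runtime test idea for S1, as a sketch (my code, not an actual compiler's output): the writes in S1 touch a[3..99], so the read a[x] conflicts only when x is in that range:

void s1_loop(int a[], int x)
{
    if (x < 3 || x > 99) {
        /* a[x] cannot conflict with the writes to a[3..99]:  */
        /* the iterations are independent.                    */
        #pragma omp parallel for
        for (int i = 3; i < 100; i += 1)
            a[i] = a[x] + 1;
    } else {
        /* possible dependence through a[x]: stay sequential. */
        for (int i = 3; i < 100; i += 1)
            a[i] = a[x] + 1;
    }
}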



Representing Array References

Let X be an n-dimensional array. Then an affine reference has the form:
$$X[a_{11} i_1 + a_{21} i_2 + \cdots + a_{m1} i_m + a_{01}] \cdots [a_{1n} i_1 + a_{2n} i_2 + \cdots + a_{mn} i_m + a_{0n}]$$
This is conveniently represented as a matrix and a vector, $X[IA + a_0]$, where
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \quad \text{and} \quad a_0 = (a_{01}, a_{02}, \ldots, a_{0n}).$$
We will refer to A and $a_0$ as the coefficient matrix and the constant term, respectively.


An Example

for (i = 0; i < 100; i += 1)
    for (j = 2*i + 4; j < i + 40; j += 1)
        a[2*i - 3*j - 1][2*i + j - 3] = f(a[-3*i + 4*j + 1][-i + 2*j + 7]);

The above loop nest has the following two array reference representations:
$$A = \begin{pmatrix} 2 & 2 \\ -3 & 1 \end{pmatrix} \text{ and } a_0 = (-1, -3), \qquad B = \begin{pmatrix} -3 & -1 \\ 4 & 2 \end{pmatrix} \text{ and } b_0 = (1, 7).$$


The Data Dependence Equation

For two references $X[IA + a_0]$ and $X[IB + b_0]$ to refer to the same array element, there must be two index values, i and j, such that $iA + a_0 = jB + b_0$, which we can write as $iA - jB = b_0 - a_0$.
This system of Diophantine equations has n (the dimension of the array X) scalar equations and 2m variables, where m is the nesting depth of the loop.
It can also be written in the following form:
$$(i; j) \begin{pmatrix} A \\ -B \end{pmatrix} = b_0 - a_0.$$
We solve the system of linear Diophantine equations above using a method presented shortly.


Dependence Distances

Let $\prec_\ell$ be a relation in $\mathbb{Z}^m$ such that $i \prec_\ell j$ if $i_1 = j_1$, $i_2 = j_2$, ..., $i_{\ell-1} = j_{\ell-1}$, and $i_\ell < j_\ell$.
For example: $(1, 3, 4) \prec_3 (1, 3, 9)$.
The lexicographic order $\prec$ in $\mathbb{Z}^m$ is the union of all the relations $\prec_\ell$: $i \prec j$ iff $i \prec_\ell j$ for some $\ell$ in $1 \le \ell \le m$.
The sequential execution of the iterations of a loop nest follows the lexicographic order.
Assume that (i; j) is a solution and that $i \prec j$. Then $d = j - i$ is the dependence distance of the dependence.


Uniform Dependence Distance

If a dependence distance d is a constant vector then the dependence is said to be uniform.
Examples:
$d = (1, 2)$ is uniform; uniform distances are required by the unimodular transformations used later.
$d = (1, t_2)$, with $t_2$ unknown, is nonuniform and cannot be handled by those transformations.
All unique d are put in a matrix as rows, but the row order does not matter since the matrix really is just the set of all d.


Loop Independent and Loop Carried Dependences

A loop independent dependence is a dependence such that $d = j - i = (0, \ldots, 0)$.
A loop independent dependence does not prevent concurrent execution of different iterations of a loop. Rather, it constrains the scheduling of instructions in the loop body.
A loop carried dependence is a dependence which is not loop independent, or, in other words, the dependence is between two different iterations of a loop nest.
A dependence has level $\ell$ if in $d = j - i$ we have $d_1 = 0$, $d_2 = 0$, ..., $d_{\ell-1} = 0$, and $d_\ell > 0$.
Only a loop carried dependence has a level, and it is only the loop at that level which needs to be executed sequentially.


The GCD Test

The GCD test was invented at Texas Instruments and first described in 1973.
Consider the loop
for (i = lb; i <= ub; ++i)
    x[a1 * i + c1] = x[a2 * i + c2] + y;
To prove independence, we must show that the Diophantine equation
$$a_1 i_1 - a_2 i_2 = c_2 - c_1$$
has no solutions.
We compute the gcd of $a_1$ and $a_2$ and check whether it divides $c_2 - c_1$; if it does not, there is no solution and we have proved independence, otherwise we must use another test.
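A runnable sketch of the test for this one-dimensional case (my code, not from the lecture):

#include <stdlib.h>

static int gcd(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 if independence is proved, 0 if a dependence is possible. */
int gcd_test(int a1, int a2, int c1, int c2)
{
    int g = gcd(a1, a2);

    if (g == 0)
        return c2 - c1 != 0;    /* both coefficients are zero */
    return (c2 - c1) % g != 0;  /* g does not divide the rhs  */
}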



Weaknesses of The GCD Test

There are two weaknesses of the GCD test:
1. It does not exploit knowledge about the loop bounds.
2. Most often the gcd is one.
The first weakness means the GCD Test might be unable to prove independence even though the solution actually lies outside the index space of the loop.
The second weakness means independence usually cannot be proved.


GCD Test for Nested Loops and Multidimensional Arrays

The GCD Test can be extended to cover nested loops and multidimensional arrays.
The solution is then a vector and it usually contains unknowns.
The Fourier-Motzkin Test, described shortly, takes the solution vector from this GCD Test and checks whether the solution lies within the loop bounds.
Next we will look at unimodular matrices and the Fourier elimination used by the Fourier-Motzkin Test.


Unimodular Matrices

An integer square matrix A is unimodular if its determinant $\det(A) = \pm 1$.
If A and B are unimodular, then $A^{-1}$ exists and is itself unimodular, and $A \times B$ is unimodular.
$I_m$ is the $m \times m$ identity matrix.


Elementary Row Operations

The operations
reversal: multiply a row by −1,
interchange: interchange two rows, and
skewing: add an integer multiple of one row to another row,
are called the elementary row operations.
With each elementary row operation, there is a corresponding
elementary matrix.



Performing Elementary Row Operations

To perform an elementary row operation on a matrix A, we can premultiply it with the corresponding elementary matrix.
Assume we wish to interchange rows 1 and 3 in a $3 \times 3$ matrix A. The resulting matrix is formed by
$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \times A.$$
The elementary matrices are all unimodular.


3 × 3 Reversal Matrices

$$\begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \text{and} \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix}.$$


3 × 3 Interchange Matrices

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \text{and} \quad \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.$$


3 × 3 Upper Skewing Matrices

$$\begin{pmatrix} 1 & z & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & z \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \text{and} \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & z \\ 0 & 0 & 1 \end{pmatrix}.$$


3 × 3 Lower Skewing Matrices

$$\begin{pmatrix} 1 & 0 & 0 \\ z & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ z & 0 & 1 \end{pmatrix}, \quad \text{and} \quad \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & z & 1 \end{pmatrix}.$$


Echelon Matrices

Let $l_i$ denote the column number of the first nonzero element of matrix row i.
A given $m \times n$ matrix A is an echelon matrix if the following are satisfied for some integer $\rho$ in $0 \le \rho \le m$:
rows 1 through $\rho$ are nonzero rows,
rows $\rho + 1$ through m are zero rows,
for $1 \le i \le \rho$, each element in column $l_i$ below row i is zero, and
$l_1 < l_2 < \ldots < l_\rho$.
The following are examples of echelon matrices:
$$\begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \end{pmatrix} \quad \begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & 4 \\ 0 & 0 & 0 \end{pmatrix} \quad \begin{pmatrix} 1 & 2 & 3 \\ 0 & 4 & 5 \\ 0 & 0 & 6 \\ 0 & 0 & 0 \end{pmatrix}$$


Echelon Reduction
Given an m × n matrix A, Echelon reduction finds two matrices U and
S such that U × A = S, where U is unimodular and S is echelon.
U remains unimodular since we only apply elementary row operations.

function echelon_reduce(A)
    U ← I_m
    S ← A
    i0 ← 0
    for (j ← 1; j ≤ n; j ← j + 1) {
        if (there is a nonzero s_ij with i0 < i ≤ m) {
            i0 ← i0 + 1
            i ← m
            while (i ≥ i0 + 1) {
                while (s_ij ≠ 0) {
                    σ ← sign(s_(i-1)j × s_ij)
                    z ← ⌊ |s_(i-1)j| / |s_ij| ⌋
                    subtract σz × (row i) from (row i − 1) in (U; S)
                    interchange rows i and i − 1 in (U; S)
                }
                i ← i − 1
            }
        }
    }
    return U and S
end
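A runnable C sketch of the same algorithm (my code, not from the lecture; the dimensions are fixed for the 4 × 2 example that follows):

#include <stdlib.h>

enum { M = 4, N = 2 };  /* dimensions of A in the example */

static int sign(int x) { return (x > 0) - (x < 0); }

static void swap_rows(int cols, int t[][cols], int a, int b)
{
    for (int j = 0; j < cols; j += 1) {
        int tmp = t[a][j];
        t[a][j] = t[b][j];
        t[b][j] = tmp;
    }
}

/* row a <- row a - f * row b */
static void sub_rows(int cols, int t[][cols], int a, int b, int f)
{
    for (int j = 0; j < cols; j += 1)
        t[a][j] -= f * t[b][j];
}

/* On return u*a = s; u is unimodular since only interchanges and */
/* skewings (each with determinant +-1) are applied; s is echelon. */
void echelon_reduce(const int a[M][N], int u[M][M], int s[M][N])
{
    int i0 = 0;  /* number of pivot rows found so far */

    for (int i = 0; i < M; i += 1)
        for (int j = 0; j < M; j += 1)
            u[i][j] = i == j;         /* u starts as the identity */
    for (int i = 0; i < M; i += 1)
        for (int j = 0; j < N; j += 1)
            s[i][j] = a[i][j];        /* s starts as a copy of a  */

    for (int j = 0; j < N && i0 < M; j += 1) {
        int any = 0;

        for (int i = i0; i < M; i += 1)
            if (s[i][j] != 0)
                any = 1;
        if (!any)
            continue;
        /* Zero out column j below pivot row i0 using a Euclid-like */
        /* sequence of skewings and interchanges, bottom to top.    */
        for (int i = M - 1; i > i0; i -= 1) {
            while (s[i][j] != 0) {
                int sg = sign(s[i-1][j]) * sign(s[i][j]);
                int z = abs(s[i-1][j]) / abs(s[i][j]);

                sub_rows(N, s, i-1, i, sg * z);
                sub_rows(M, u, i-1, i, sg * z);
                swap_rows(N, s, i, i-1);
                swap_rows(M, u, i, i-1);
            }
        }
        i0 += 1;
    }
}

On the example matrix A below, this produces the final (U; S) shown on the coming slides, except that the slides additionally flip the sign of row 1 at the very end.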



Example Echelon Reduction

We will now show how one can echelon reduce the following matrix:
$$A = \begin{pmatrix} 2 & 2 \\ -3 & 1 \\ 3 & 1 \\ -4 & -2 \end{pmatrix}.$$
We start with $U = I_4$ and $S = A$, which we write as:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 1 & 0 & 3 & 1 \\ 0 & 0 & 0 & 1 & -4 & -2 \end{pmatrix}.$$
Then we will eliminate the nonzero elements in S, starting with $s_{41}$, $s_{31}$, $s_{21}$, $s_{42}$ and so on.


Example Echelon Reduction

$j = 1$, $i_0 = 1$, $i = 4$. We always wish to eliminate $s_{ij}$, which currently means $s_{41}$.
$\sigma \leftarrow -1$ and $z \leftarrow 0$. Nothing is subtracted from row 3.
Then rows 3 and 4 are interchanged in (U; S), resulting in:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 0 & 1 & -4 & -2 \\ 0 & 0 & 1 & 0 & 3 & 1 \end{pmatrix}.$$


Example Echelon Reduction

We continue the inner while loop and find that $\sigma \leftarrow -1$ and $z \leftarrow 1$.
Then $-1 \times$ row 4 is subtracted from row 3, resulting in:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 0 & 1 & 0 & 3 & 1 \end{pmatrix}.$$
Then rows 3 and 4 are interchanged, resulting in:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 1 & 0 & 3 & 1 \\ 0 & 0 & 1 & 1 & -1 & -1 \end{pmatrix}.$$


Example Echelon Reduction

$s_{41}$ is still nonzero, so the inner while loop continues with $\sigma \leftarrow -1$ and $z \leftarrow 3$. Then $-3 \times$ row 4 is subtracted from row 3:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 4 & 3 & 0 & -2 \\ 0 & 0 & 1 & 1 & -1 & -1 \end{pmatrix}.$$
Then rows 3 and 4 are interchanged, resulting in:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & 0 & 0 & -3 & 1 \\ 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$
Now $s_{41}$ has become zero and i is decremented.


Example Echelon Reduction

$j = 1$, $i_0 = 1$, $i = 3$. We now wish to eliminate $s_{31}$. $\sigma \leftarrow +1$ and $z \leftarrow 3$. Then $3 \times$ row 3 is subtracted from row 2:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 1 & -3 & -3 & 0 & 4 \\ 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$
Then rows 2 and 3 are interchanged, resulting in:
$$(U; S) = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 & 2 \\ 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 1 & -3 & -3 & 0 & 4 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$


Example Echelon Reduction

$j = 1$, $i_0 = 1$, $i = 2$. We now wish to eliminate $s_{21}$. $\sigma \leftarrow -1$ and $z \leftarrow 2$. Then $-2 \times$ row 2 is subtracted from row 1:
$$(U; S) = \begin{pmatrix} 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 1 & -3 & -3 & 0 & 4 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$
Interchanging rows 2 and 1 results in:
$$(U; S) = \begin{pmatrix} 0 & 0 & 1 & 1 & -1 & -1 \\ 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 1 & -3 & -3 & 0 & 4 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$


Example Echelon Reduction

$j = 2$, $i_0 = 2$, $i = 4$. We now wish to eliminate $s_{42}$. $\sigma \leftarrow -1$ and $z \leftarrow 2$. Then $-2 \times$ row 4 is subtracted from row 3:
$$(U; S) = \begin{pmatrix} 0 & 0 & 1 & 1 & -1 & -1 \\ 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 1 & 5 & 3 & 0 & 0 \\ 0 & 0 & 4 & 3 & 0 & -2 \end{pmatrix}.$$
Interchanging rows 4 and 3 results in:
$$(U; S) = \begin{pmatrix} 0 & 0 & 1 & 1 & -1 & -1 \\ 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 0 & 4 & 3 & 0 & -2 \\ 0 & 1 & 5 & 3 & 0 & 0 \end{pmatrix}.$$


Example Echelon Reduction
$j = 2$, $i_0 = 2$, $i = 3$. We now wish to eliminate $s_{32}$. $\sigma \leftarrow 0$ and $z \leftarrow 0$. Nothing is subtracted from row 2, but rows 3 and 2 are interchanged:
$$(U; S) = \begin{pmatrix} 0 & 0 & 1 & 1 & -1 & -1 \\ 0 & 0 & 4 & 3 & 0 & -2 \\ 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 1 & 5 & 3 & 0 & 0 \end{pmatrix}.$$
At this point S is an echelon matrix and the algorithm stops (the outer while loop exits since $i = i_0$). As will turn out to be convenient later, we prefer a positive value of $s_{11}$ and therefore multiply row 1 by $-1$, finally resulting in:
$$(U; S) = \begin{pmatrix} 0 & 0 & -1 & -1 & 1 & 1 \\ 0 & 0 & 4 & 3 & 0 & -2 \\ 1 & 0 & 2 & 2 & 0 & 0 \\ 0 & 1 & 5 & 3 & 0 & 0 \end{pmatrix}.$$

Solving a dependence equation

Two references to the same variable of an n-dimensional array give a matrix with n columns.
With m/2 for-loops there are m loop index variables in total (i, j, k, etc. for each of the two references); that is, each reference uses the loop index variables $i_1, i_2, \ldots, i_{m/2}$.
The dependence equation is
$$xA = c$$
where x is a $1 \times m$ integer matrix, A is an $m \times n$ integer matrix, and c is a $1 \times n$ integer matrix.
We find U and S such that $UA = S$.
Then we try to solve $tS = c$.
If there is a solution, then $c = tS = tUA$, so $x = tU$.

An example

Consider $xA = c$ with
$$\begin{pmatrix} x_1 & x_2 & x_3 & x_4 \end{pmatrix} \begin{pmatrix} 2 & 2 \\ -3 & 1 \\ 3 & 1 \\ -4 & -2 \end{pmatrix} = \begin{pmatrix} 2 & 4 \end{pmatrix}.$$
First we use echelon reduction to find the matrices U and S.
Then we solve $tS = c$:
$$\begin{pmatrix} t_1 & t_2 & t_3 & t_4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & -2 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 2 & 4 \end{pmatrix}.$$
We find that $t = (2, -1, t_3, t_4)$, where $t_3$ and $t_4$ are arbitrary integers.


Linear Diophantine Equations

We then find x:
$$x = tU = \begin{pmatrix} 2 & -1 & t_3 & t_4 \end{pmatrix} \begin{pmatrix} 0 & 0 & -1 & -1 \\ 0 & 0 & 4 & 3 \\ 1 & 0 & 2 & 2 \\ 0 & 1 & 5 & 3 \end{pmatrix} = (t_3,\ t_4,\ 2t_3 + 5t_4 - 6,\ 2t_3 + 3t_4 - 5).$$
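As a quick check (this step is not spelled out on the slides), substituting x into $xA = c$ gives
$$2t_3 - 3t_4 + 3(2t_3 + 5t_4 - 6) - 4(2t_3 + 3t_4 - 5) = 2, \qquad 2t_3 + t_4 + (2t_3 + 5t_4 - 6) - 2(2t_3 + 3t_4 - 5) = 4,$$
since the $t_3$ and $t_4$ terms cancel and the constants sum to $-18 + 20 = 2$ and $-6 + 10 = 4$.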



Fourier Elimination

Suppose we find an integer solution x to $xA = c$.
The next question is whether the solution is within the loop bounds.
Unfortunately, the problem of solving a system of linear inequalities over the integers is NP-complete.
Instead the compiler looks for a rational solution, and only if no rational solution within the loop bounds exists does it ignore that pair of array references.


Fourier Elimination

In 1827 Fourier published a method for solving linear inequalities in the real case.
This is sometimes called Fourier-Motzkin elimination.
Utpal Banerjee, a leading compiler researcher at Intel who has written a very good book series about parallelization, calls it Fourier's method of elimination.


Fourier Elimination

An interesting question is how frequently Fourier elimination finds a real solution when there is no integer solution. Some special cases can be exploited.
For instance, if a variable $x_i$ must satisfy $2.2 \le x_i \le 2.8$ then there is no integer solution.
Otherwise, if we find e.g. that $2.2 \le x_i \le 4.8$ then we may try the two cases of setting $x_i = 3$ and $x_i = 4$, and see if there still is a real solution.
It is easiest to understand Fourier elimination if we first look at an example.


Fourier Elimination

Assume we wish to solve the following system of linear inequalities:
$$2x_1 - 11x_2 \le 3$$
$$-3x_1 + 2x_2 \le -5$$
$$x_1 + 3x_2 \le 4$$
$$-2x_1 \le -3$$
We will first eliminate $x_2$ from the system, and then check whether the remaining inequalities can be satisfied. To eliminate $x_2$, we start out by sorting the rows with respect to the coefficients of $x_2$:
$$-3x_1 + 2x_2 \le -5$$
$$x_1 + 3x_2 \le 4$$
$$2x_1 - 11x_2 \le 3$$
$$-2x_1 \le -3$$


Fourier Elimination

First we want the rows with positive coefficients of $x_2$, then negative, and lastly zero coefficients.
Next we divide each row by its coefficient of $x_2$ (if it is nonzero):
$$-\tfrac{3}{2}x_1 + x_2 \le -\tfrac{5}{2}$$
$$\tfrac{1}{3}x_1 + x_2 \le \tfrac{4}{3}$$
$$-\tfrac{2}{11}x_1 + x_2 \ge -\tfrac{3}{11}$$
Of course, the $\le$ becomes $\ge$ when dividing by a negative coefficient.
We can now rearrange the system to isolate $x_2$:
$$x_2 \le \tfrac{3}{2}x_1 - \tfrac{5}{2}$$
$$x_2 \le -\tfrac{1}{3}x_1 + \tfrac{4}{3}$$
$$\tfrac{2}{11}x_1 - \tfrac{3}{11} \le x_2$$


Fourier Elimination

At this point, we make a record of the minimum and maximum values that $x_2$ can have, expressed as functions of $x_1$. We have:
$$b_2(x_1) \le x_2 \le B_2(x_1)$$
where
$$b_2(x_1) = \tfrac{2}{11}x_1 - \tfrac{3}{11}, \qquad B_2(x_1) = \min\left(\tfrac{3}{2}x_1 - \tfrac{5}{2},\ -\tfrac{1}{3}x_1 + \tfrac{4}{3}\right).$$


Fourier Elimination

To eliminate $x_2$ from the system, we simply combine the inequalities which had positive coefficients of $x_2$ with those which had negative coefficients (i.e., one with a positive coefficient is combined with one with a negative coefficient):
$$\tfrac{2}{11}x_1 - \tfrac{3}{11} \le \tfrac{3}{2}x_1 - \tfrac{5}{2}$$
$$\tfrac{2}{11}x_1 - \tfrac{3}{11} \le -\tfrac{1}{3}x_1 + \tfrac{4}{3}$$
These are simplified and the inequality with the zero coefficient of $x_2$ is brought back:
$$-\tfrac{29}{22}x_1 \le -\tfrac{49}{22}$$
$$\tfrac{17}{33}x_1 \le \tfrac{53}{33}$$
$$-2x_1 \le -3$$


Fourier Elimination

We can now repeat parts of the procedure above:
$$x_1 \le \tfrac{53}{17}$$
$$x_1 \ge \tfrac{49}{29}$$
$$x_1 \ge \tfrac{3}{2}$$
We find that
$$b_1() = \max(49/29,\ 3/2) = 49/29, \qquad B_1() = 53/17.$$
The solution to the system is $\tfrac{49}{29} \le x_1 \le \tfrac{53}{17}$ together with $b_2(x_1) \le x_2 \le B_2(x_1)$ for each such value of $x_1$.
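As a sanity check (not on the slides): take $x_1 = 2$, which lies in $[49/29, 53/17]$. Then $b_2(2) = \tfrac{4}{11} - \tfrac{3}{11} = \tfrac{1}{11}$ and $B_2(2) = \min(\tfrac{1}{2}, \tfrac{2}{3}) = \tfrac{1}{2}$, so e.g. $(x_1, x_2) = (2, \tfrac{1}{4})$ should satisfy the original system, and indeed $4 - \tfrac{11}{4} \le 3$, $-6 + \tfrac{1}{2} \le -5$, $2 + \tfrac{3}{4} \le 4$, and $-4 \le -3$.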



Fourier Elimination

procedure fourier_motzkin_elimination(x, A, c)
    r ← m, s ← n, T ← A, q ← c
    while (1) {
        n1 ← number of inequalities with positive t_rj
        n2 ← n1 + number of inequalities with negative t_rj
        Sort the inequalities so that the n1 with t_rj > 0 come first,
        then the n2 − n1 with t_rj < 0 come next,
        and the ones with t_rj = 0 come last.
        for (j ← 1; j ≤ n2; j ← j + 1) {
            for (i ← 1; i ≤ r − 1; i ← i + 1)
                t_ij ← t_ij / t_rj
            q_j ← q_j / t_rj
        }
        if (n2 > n1)
            b_r(x_1, x_2, ..., x_(r−1)) = max over n1+1 ≤ j ≤ n2 of (− Σ_(i=1..r−1) t_ij x_i + q_j)
        else
            b_r ← −∞
        if (n1 > 0)
            B_r(x_1, x_2, ..., x_(r−1)) = min over 1 ≤ j ≤ n1 of (− Σ_(i=1..r−1) t_ij x_i + q_j)
        else
            B_r ← ∞
        if (r = 1)
            return make_solution()


Fourier Elimination
        /* We will now eliminate x_r. */
        s′ ← s − n2 + n1(n2 − n1)
        if (s′ = 0) {
            /* We have not discovered any inconsistency and */
            /* we have no more inequalities to check.       */
            /* The system has a solution.                   */
            The solution set consists of all real vectors (x_1, x_2, ..., x_m),
            where x_1, x_2, ..., x_(r−1) are chosen arbitrarily, and
            x_r, x_(r+1), ..., x_m must satisfy
            b_i(x_1, x_2, ..., x_(i−1)) ≤ x_i ≤ B_i(x_1, x_2, ..., x_(i−1)) for r ≤ i ≤ m.
            return solution set.
        }
        /* There are now s′ inequalities in r − 1 variables. */
        The new system of inequalities is made of two parts:
            Σ_(i=1..r−1) (t_ik − t_ij) x_i ≤ q_k − q_j for 1 ≤ k ≤ n1, n1+1 ≤ j ≤ n2
            Σ_(i=1..r−1) t_ij x_i ≤ q_j for n2+1 ≤ j ≤ s
        and becomes, by setting r ← r − 1 and s ← s′:
            Σ_(i=1..r) t_ij x_i ≤ q_j for 1 ≤ j ≤ s
    }
end

function make_solution()
    /* We have come to the last variable x_1. */
    if (b_1 > B_1 or (there is a q_j < 0 for n2+1 ≤ j ≤ s))
        return there is no solution
    The solution set consists of all real vectors (x_1, x_2, ..., x_m)
    such that b_i(x_1, x_2, ..., x_(i−1)) ≤ x_i ≤ B_i(x_1, x_2, ..., x_(i−1)) for 1 ≤ i ≤ m.
    return solution set.
end


Summary
In the case of a loop nest of depth m and an n-dimensional array, we use the matrix representation of the references, $iA + a_0 = jB + b_0$, or equivalently:
$$(i; j) \begin{pmatrix} A \\ -B \end{pmatrix} = b_0 - a_0,$$
where A and B have m rows and n columns.
We find a $2m \times 2m$ unimodular matrix U and a $2m \times n$ echelon matrix S such that
$$U \begin{pmatrix} A \\ -B \end{pmatrix} = S.$$
If there is a 2m-vector t which satisfies $tS = b_0 - a_0$, then the GCD test cannot exclude dependence, and if so the computed t will be input to the Fourier-Motzkin Test.

The Fourier-Motzkin Test
If the GCD Test found a solution vector t to $tS = c$, these solutions will be tested to see if they are within the loop bounds.
Recall we wrote
$$x \begin{pmatrix} A \\ -B \end{pmatrix} = b_0 - a_0, \quad \text{where } x = (i; j).$$
We find x from:
$$x = (i; j) = tU.$$
With $U_1$ being the left half of U and $U_2$ the right half, we have:
$$i = tU_1, \qquad j = tU_2.$$
These should be used in the loop bounds constraints.

The Fourier-Motzkin Test

Recall the original loop bounds:
$$p_0 \le IP, \qquad IQ \le q_0.$$
The solution vector t must satisfy:
$$p_0 \le tU_1 P, \qquad tU_1 Q \le q_0,$$
$$p_0 \le tU_2 P, \qquad tU_2 Q \le q_0.$$
If there is no integer solution to this system, there is no dependence.
Recall, however, that the system is solved with real or rational numbers, so the Fourier-Motzkin Test may fail to prove independence.


After Data Dependence Analysis

When we have performed data dependence analysis of all pairs of references to the same arrays, we have a dependence matrix, denoted D.
Some rows will be due to one array and other rows due to other arrays.
It is the dependence matrix that determines which transformations we can do.
As mentioned, in the optimizing compilers course inner loop transformations are studied for SIMD vectorization and software pipelining.
We will look at outer loop parallelization.


Unimodular Transformations

A unimodular transformation is a loop transformation completely expressed as a unimodular matrix U.
A loop nest L is changed to a new loop nest $L_U$ with loop index variables:
$$K = IU, \qquad I = KU^{-1}.$$
The same iterations are executed but in a different order.
A new iteration order might make parallel execution possible.
Before generating code for the new loop, the loop bounds for K must be computed from the original bounds:
$$p_0 \le IP, \qquad IQ \le q_0.$$


Computing the New Index Variables

With
$$p_0 \le IP, \qquad IQ \le q_0, \qquad I = KU^{-1},$$
we use Fourier elimination also to find the loop bounds from
$$p_0 \le KU^{-1}P, \qquad KU^{-1}Q \le q_0.$$
The bounds are found starting with $k_1$, $k_2$, etc.
This is the reason why we want to have an invertible transformation matrix.

New Array References

All array references are rewritten to use the new index variables.
Conceptually we could calculate, at the beginning of each loop iteration,
$$I = KU^{-1}$$
and then use this vector I in the original references, of the form:
$$x[IA + a_0]$$
We don't do that, of course, and instead replace each reference with
$$x[KU^{-1}A + a_0]$$
Here the matrix $U^{-1}A$ can be calculated at compile time.


The Distance Matrix

The set of all dependence distance vectors is represented by the distance matrix D.
We are free to swap the rows of D since it really is a set of dependences.
Unimodular transformations require that all dependences are uniform, i.e. with known constant distances.
Consider a uniform dependence vector $d = j - i$.
With index variables $K = IU$ the new distance becomes $d_U = jU - iU = dU$.
Therefore, given a dependence matrix D and a unimodular transformation U, the dependences in the new loop $L_U$ become:
$$D_U = DU$$


Valid Distance Matrices

The sign, lexicographically, of a vector is the sign of its first nonzero element.
A distance vector can never be lexicographically negative, since that would mean that some iteration depends on a future iteration.
Therefore no row in the new distance matrix $D_U = DU$ may be lexicographically negative.
If we discover a lexicographically negative row in $D_U$, that loop transformation is invalid, such as the second row of the following $D_U$:
$$D_U = \begin{pmatrix} 1 & 2 \\ -1 & 1 \end{pmatrix}$$
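A concrete instance (my illustration, not from the lecture): loop interchange of a two-level nest corresponds to $U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Applied to a distance $d = (1, -1)$ it gives $dU = (-1, 1)$, which is lexicographically negative, so the interchange is invalid for a loop with that dependence.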



Outer Loop Parallelization

By outer loops we mean all loops starting with the outermost loop.
While we can always find a unimodular matrix through which we can parallelize the inner loops, this is not the case for the outer loops.
To parallelize the inner loops, we need to ensure that all loop carried dependences are carried by the outermost loop.
In other words, the leftmost column of the distance matrix $D_U$ simply should consist only of positive numbers!
For outer loop parallelization, $D_U$ instead should have leading zero columns.


Rank of a Matrix

A column of a matrix is linearly independent if it cannot be expressed as a linear combination of the other columns.
The rank of a matrix is the number of linearly independent columns.
For instance, an identity matrix $I_m$ with m columns has $\mathrm{rank}(I_m) = m$.
Any unimodular $m \times m$ matrix U has $\mathrm{rank}(U) = m$.
A matrix with zero columns must have a rank less than the number of columns.
So, since $D_U = DU$ and U has full rank, if $D_U$ is to have rank less than m, it must be D which contributes that rank deficiency.


Outer Loop Parallelization Example

Assume we have the distance matrix D defined as:
$$D = \begin{pmatrix} 6 & 4 & 2 \\ 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix}$$
With this distance matrix, only the innermost loop can be executed in parallel.
We want a $D_U$ with lexicographically positive rows and zero columns to the left.
For example:
$$D_U = \begin{pmatrix} 0 & ? & ? \\ 0 & ? & ? \\ 0 & ? & ? \end{pmatrix} = \begin{pmatrix} 6 & 4 & 2 \\ 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix} U$$
If $\mathrm{rank}(D) = 3$ then such a U cannot exist.
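Here, as a check (not spelled out on the slide), $\det(D) = 6(1 + 0) - 4(0 + 1) + 2(0 - 1) = 0$; indeed row 1 equals $4 \times$ row 2 $+\ 6 \times$ row 3, so $\mathrm{rank}(D) = 2 < 3$ and a suitable U can exist.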



Steps towards Finding U

We start by transposing D:
$$D^t = \begin{pmatrix} 6 & 0 & 1 \\ 4 & 1 & 0 \\ 2 & -1 & 1 \end{pmatrix}$$
Using the echelon reduction algorithm, we compute:
a unimodular matrix V
an echelon matrix S
such that $VD^t = S$, e.g.
$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & -2 \\ 1 & -1 & -1 \end{pmatrix} D^t = \begin{pmatrix} 2 & -1 & 1 \\ 0 & 3 & -2 \\ 0 & 0 & 0 \end{pmatrix}$$


More Steps towards Finding U
We have $VD^t = S$:
$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & -2 \\ 1 & -1 & -1 \end{pmatrix} \begin{pmatrix} 6 & 0 & 1 \\ 4 & 1 & 0 \\ 2 & -1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & -1 & 1 \\ 0 & 3 & -2 \\ 0 & 0 & 0 \end{pmatrix}$$
Assume we wish to find n = 1 parallel outer loops.
Then we find an $m \times (n + 1)$ matrix A such that DA has n zero columns and then a column with elements greater than zero.
This A will be used to find U.
How can we find A?
Multiplying the last row of V with the columns of $D^t$ produces the zero row in S.
So let A have that last row as its first column:
$$DA = \begin{pmatrix} 6 & 4 & 2 \\ 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & ? \\ -1 & ? \\ -1 & ? \end{pmatrix} = \begin{pmatrix} 0 & ? \\ 0 & ? \\ 0 & ? \end{pmatrix}$$

Finding the Rest of A

Finding the last column of A is easy. Denote it u.
$$DA = \begin{pmatrix} 6 & 4 & 2 \\ 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & u_1 \\ -1 & u_2 \\ -1 & u_3 \end{pmatrix} = \begin{pmatrix} 0 & \ge 1 \\ 0 & \ge 1 \\ 0 & \ge 1 \end{pmatrix}$$
Multiplying each row of D with u should produce a positive number:
$$6u_1 + 4u_2 + 2u_3 \ge 1$$
$$u_2 - u_3 \ge 1$$
$$u_1 + u_3 \ge 1$$
We find u to be e.g. $u = (1, 1, 0)$.
$$A = \begin{pmatrix} 1 & 1 \\ -1 & 1 \\ -1 & 0 \end{pmatrix}$$


Computing U

Given a matrix A, using a variant of the algorithm for echelon reduction, we can find a unimodular matrix U such that
$$A = UT$$
i.e.
$$A = \begin{pmatrix} 1 & 1 \\ -1 & 1 \\ -1 & 0 \end{pmatrix} = UT = \begin{pmatrix} -1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} -1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$$


Computing $L_U$

With this loop transformation matrix U, we get the following new dependence matrix $D_U = DU$, i.e.
$$D_U = \begin{pmatrix} 0 & 10 & 6 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix} = DU = \begin{pmatrix} 6 & 4 & 2 \\ 0 & 1 & -1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} -1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$$
The compiler does not actually need to compute $D_U$, but it is a nice internal check to verify that no row is lexicographically negative.
The new loop $L_U$ is constructed as explained before: a loop nest L is changed to a new loop nest $L_U$ with loop index variables $K = IU$.
New array references and new loop bounds must be computed.
We have already seen both of these, but repeat them for convenience on the next two slides.

Recall: Computing the New Index Variables

With
$$p_0 \le IP, \qquad IQ \le q_0, \qquad I = KU^{-1},$$
we use Fourier elimination to find the loop bounds from
$$p_0 \le KU^{-1}P, \qquad KU^{-1}Q \le q_0.$$
The bounds are found starting with $k_1$, $k_2$, etc.


Recall: New Array References

All array references are rewritten to use the new index variables.
Conceptually we could calculate, at the beginning of each loop iteration,
$$I = KU^{-1}$$
and then use this vector I in the original references, of the form:
$$x[IA + a_0]$$
We don't do that, of course, and instead replace each reference with
$$x[KU^{-1}A + a_0]$$
Here the matrix $U^{-1}A$ can be calculated at compile time.


Summary

Using linear algebra it is sometimes possible to automatically parallelize for-loops.
Optimizing compilers rewrite loops with while or gotos into for-loops when possible.
All these transformations can be expressed in a matrix which is then used to generate a new loop (this belongs to the category of elegant computer science).
