

DESIGN OF 2-D FILTERS USING A PARALLEL PROCESSOR ARCHITECTURE

Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha


Midwestern State University University of Notre Dame
Wichita Falls, TX 76308 Notre Dame, IN, 46556

ABSTRACT

Two-dimensional filters are usually part of the implementation of digital image processing applications. These filters process recursive sets of instructions and require high computational speed. Optimized implementations of these filters depend on the use of Application Specific Integrated Circuits (ASICs). A system with multiple parallel processing units is a feasible design option able to achieve the required computational performance. In this paper, a loop transformation algorithm that allows the efficient utilization of a parallel multiprocessor system is presented. Uniform nested loops representing the image filters and the available processors are modeled as multi-dimensional data flow graphs. A new loop structure is generated so that an arbitrary number of processors available in the system can run in parallel. An example and a description of the algorithm are presented in the paper.

Keywords: retiming, nested loops, filter, multiprocessor, image processing.

1. INTRODUCTION

Image enhancement and edge detection are well-known digital image signal processing applications that may require two-dimensional (2-D) filter-like computational solutions. These applications usually depend on computation-intensive code sections consisting of the repetition of sequences of operations. They are also characterized by the multi-dimensionality of the data involved. An effective technique for improving the computing performance of such applications has been the design and use of Application Specific Integrated Circuits (ASICs). This paper presents a new technique applicable to the design of a 2-D filter system using multiple parallel processors. A multi-dimensional retiming algorithm embedded in this new technique provides the fully parallel utilization of the available processors, thus reducing the overall execution time of the filter function.

Parallel architectures are an important tool in ASIC design. However, these architectures require a careful partitioning of the problem in order to improve the utilization of the parallel processors [2, 17, 24]. During the circuit design phase, nested loop structures can be coded using hardware description languages, such as VHDL, in order to reduce the design time. However, in VHDL, the loop control indices represent the number of times a section of the circuit will be replicated in the final synthesis, under the assumption that there are no physical or cost constraints on the circuit implementation [23]. In this paper, a multi-dimensional retiming technique is used to transform the loop in such a way as to produce a parallel solution to the problem for a given number of processing units. Such a solution can then be implemented on a standard multiprocessor architecture.

Retiming was originally proposed by Leiserson and Saxe, focusing on improving the cycle time of one-dimensional problems [13]. Most work done in this area is subject to limitations imposed by the number of delays (memory elements) existing in a cycle of a data flow graph representing the problem [3, 6, 10, 11, 12, 16, 22, 25]. Other methods focus on multi-processor scheduling and are also applicable to one-dimensional problems [7, 8, 14, 16, 18]. This study focuses on the parallelism inherent in multi-dimensional applications, which is ignored by the one-dimensional methods.

Retiming and other loop transformations have since been applied in areas such as scheduling and parallel processing, with the main goal of exploiting fine-grain parallelism in the loop body [4, 15]. Due to their different focus in obtaining parallelism, those techniques are not aimed at improving the execution of parallel iterations on multiprocessor systems. Research by Passos and Sha extended the retiming concept to multi-dimensional (MD) applications [19]. The multi-dimensional retiming concept is used in this paper to model the partitioning of the loop among the available processors. Multi-dimensional retiming brings some advantages to the process, since it is readily applicable to the multi-dimensional fields considered, eliminating the need for a loop transformation that converts the original problem to one dimension. Another significant advantage of MD retiming is that there are no restrictions on its applicability, since it is not constrained by the characteristics of the one-dimensional methods.
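Before introducing the graph model, it may help to fix the class of computation targeted here. Below is a minimal sequential C sketch of a first-quadrant 2-D IIR filter of the kind discussed in this paper; the coefficient names and the zero-padded border handling are illustrative assumptions, not taken from the paper:

```c
#define N 8   /* illustrative image size */

/* Sequential first-quadrant 2-D IIR filter:
   y(i,j) = x(i,j) + g01*y(i,j-1) + g10*y(i-1,j) + g11*y(i-1,j-1).
   References falling outside the image are treated as zero. */
static void iir2d(double x[N][N], double y[N][N],
                  double g01, double g10, double g11) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double acc = x[i][j];
            if (j > 0)          acc += g01 * y[i][j-1];
            if (i > 0)          acc += g10 * y[i-1][j];
            if (i > 0 && j > 0) acc += g11 * y[i-1][j-1];
            y[i][j] = acc;
        }
}
```

Note that every output row depends on the previous one through the y(i-1, .) terms; this is precisely the recursive dependence that complicates a naive row-per-processor parallelization.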
As an example, consider a filter with transfer function:

H(z1, z2) = 1 / (1 - Σ(n1=0..1) Σ(n2=0..1) g(n1, n2) * z1^(-n1) * z2^(-n2))

for n1, n2 not simultaneously zero. A multi-dimensional data flow graph (MDFG) can be used to represent this problem, as seen in figure 1(a). Figure 1(b) shows a digital circuit design with multiple functional units (multipliers and adders) and memory elements comprising a single-processor system designed to solve the filter problem represented in figure 1(a).

Figure 1. (a) MDFG (b) circuit implementation

The 2-D delay (1,1) in the MDFG is usually implemented by a FIFO structure equivalent to a serial implementation of z1^(-1) and z2^(-1) elements. The delay (0,1) is equivalent to z2^(-1) and the delay (1,0) to z1^(-1). The sequential execution time for this design is equivalent to the serial execution of three additions and one multiplication. The nested loop representation of the filter requires a row-wise computation, where the z2^(-1) element represents only one delay or storage element. Assuming that two identical processors, with the internal design shown in figure 1(b), are available for parallel execution of this loop, an optimization problem arises. The z2^(-1) delay becomes a direct dependency between two consecutive iterations being executed in the two processors. The same delay also produces a one-delay dependency between two consecutive utilizations of the two-processor system. This implies a non-parallel execution of the two processors, which the 1-D retiming technique cannot change, due to the constancy of the number of delays in the cycle involving the two processors and containing the z2^(-1) delay. Using multi-dimensional retiming on the MDFG representing the parallel implementation of the circuit, that constraint can be eliminated. In this paper, the focus is on obtaining parallelism between separate iterations of the loop that can be run on different processors, rather than the fine-grain parallelism that optimizes the execution on each individual processing element. As a result, the graph to be optimized will be representative of the iterations being executed in parallel.

As an example, figure 2(a) shows the graph for the filter described earlier, implemented on a two-processor system but still subject to a sequential execution. After applying the MD retiming operation, the graph will appear as in figure 2(b), where the delays between processors guarantee full parallelism.

Figure 2. (a) Multiprocessor graph (b) Retimed graph

This paper presents the new loop transformation method, based on the multi-dimensional retiming technique, and its applicability to the design of filters implemented on multiprocessor systems. It starts in the next section with an introduction to the basic principles involved, including an overview of multi-dimensional retiming. Section 3 shows the theoretical aspects of the processor allocation technique, followed by a description of the utilization of the method in a more complex example. A final section summarizes the concepts introduced in the paper.

2. BASIC PRINCIPLES

A multi-dimensional data flow graph (MDFG) G = (V, E, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E is the set of dependence edges equivalent to the circuit data paths, d is a function representing the MD delay between two nodes, and t is a function representing the computation time of each node.

An iteration is the execution of each node in V exactly once. Iterations are described by a vector w, equivalent to a multi-dimensional index, starting at (0,…,0). Iteration dependencies are represented by vector-weighted edges. An edge e from u to v with delay vector d(e) means that the computation of node v at iteration w depends on the execution of node u at iteration w - d(e). Thus, an edge with delay (0,…,0) represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e. the summation of the delay vectors along any cycle cannot be (0,…,0).
The iteration space of a loop is defined to be the region in the MD discrete Cartesian space whose points correspond one-to-one to the iteration indices. This paper considers loops that present the characteristic of constant dependence vectors, i.e. their data dependencies are at a constant distance in the iteration space. These loops are called uniform loops.
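For the filter recurrence used throughout this paper, the dependence vectors can be read off mechanically from the index offsets of the right-hand-side references. The sketch below (illustrative only, not part of the paper's algorithms) records them and checks the non-negativity of the outermost component, the condition that Property 3.2 later formalizes:

```c
/* Dependence vectors of the filter recurrence
   y(i,j) = c1*y(i-1,j-1) + c2*y(i,j-1) + c3*y(i-1,j) + x(i,j):
   each right-hand-side reference y(i-a, j-b) contributes the constant
   vector (a, b), independent of (i, j) -- so the loop is uniform. */
typedef struct { int x, y; } DepVec;

static const DepVec deps[] = { {1, 1}, {0, 1}, {1, 0} };

/* For loops written with regular (lexicographically ordered) constructs,
   the outermost component of every dependence vector is non-negative. */
static int outermost_nonnegative(const DepVec *d, int n) {
    for (int k = 0; k < n; k++)
        if (d[k].x < 0) return 0;
    return 1;
}
```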

Reexamining the example presented in the previous section, the 2-D filter for an image of N by N pixels can be represented by the nested loop below:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
  {
    y(i,j) = c1*y(i-1,j-1) + c2*y(i,j-1) + c3*y(i-1,j) + x(i,j)
  }

In the MDFG shown in figure 1, the nodes A, B and C represent the three required multiplications, while the remaining nodes show the necessary additions.

A multi-dimensional retiming function r: V → Z^n applied to a graph G = (V, E, d, t) redistributes the nodes of the original iteration along the iteration space of the MDFG G. The result of this process is a new graph, Gr = (V, E, dr, t), in which each iteration still contains one instance of every node in G. This transformation is equivalent to a redistribution of the delay elements in the graph. The multi-dimensional retiming function r(u) for a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. This offset change implies a corresponding change in the delay vectors (edge weights) associated with the MDFG, so that the original dependencies are preserved. Thus, for an edge e between nodes u and v, dr(e) = d(e) + r(u) - r(v).

3. PROCESSOR ALLOCATION

Given an iteration space representing the execution of a two-dimensional nested loop, the allocation of multiple processors is done along a line parallel to the i-axis (outermost index of the loop). A processor dependency graph (PDG) P, defined below, shows the dependencies among the different processors used in the execution of the nested loop.

Definition For a given MDFG G = (V, E, d, t) and k processors, a PDG P = (Π, ε, δ) is a node-weighted and edge-weighted directed graph, where Π is the set of processors, with |Π| = k, ε is the set of dependence edges between processors, and δ is a function representing the MD delay between two processors.

An MDFG is transformed into a processor dependency graph (PDG) according to the number of available processors established by the filter designer. The PDG shows the dependency edges among the various processors, assuming they are running consecutive iterations of the original loop code. Figure 3(a) shows the inter-iteration dependencies originated in processor P0 and required in the example seen in figure 1. Figure 3(b) shows the PDG representation of the same graph in the two-processor arrangement.

Figure 3: (a) Inter-iteration dependencies (b) PDG representation

In the allocation of processors, and implicitly in the application of the retiming technique, the following properties of MDFGs that model loops written with regular constructs need to be noticed [9, 20].

Property 3.1 Given an MDFG G = (V, E, d, t), a retiming function r = (rx, ry), a node u ∈ V and a self-loop edge e ∈ E from u to u, applying the retiming function r to u, r(u), does not change d(e).

Property 3.2 Given a nested loop of the form:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
  {
    a[i,j] = f(b[i',j'])
    b[i,j] = g(a[i'',j''])
  }

with dependence vectors (x,y), where x indicates the iteration distance according to the outermost loop and y represents the innermost loop, then x ≥ 0.

Property 3.3 Given a nested loop as the one shown in property 3.2, represented by a PDG P = (Π, ε, δ), and a pair of processors p, q ∈ Π assigned to iteration instances (i, j) and (i+1, j) respectively, then for any edge e = p → q, δ(e) > (0,0).

As a consequence of property 3.2, a dependence vector such as (-2,3) will not be in the set of dependence vectors associated with the loop. Using these properties, a transformation algorithm can be applied to the PDG for optimal performance. In this algorithm, the memory access time is assumed to be uniform and the loop structures under consideration have constant
dependencies. With these assumptions, the MDFG representing the original problem is translated into a PDG according to the lemma below:

Lemma 3.4 A processor dependency graph PDG P = (Π, ε, δ) is such that, given an MDFG G = (V, E, d, t) and k processors, then for every e ∈ E connecting nodes u, v ∈ V with d(e) = (x, y) there exists e' ∈ ε connecting processor n, which contains node u, to processor m, which contains node v, with m, n ∈ Π, 0 ≤ m, n < k, and:

m = (n + x) mod k
δ(e') = (x', y') = (int((n + x) / k), y)

In order to guarantee the parallelism of the multiprocessor system, a multi-dimensional retiming function r(u) = (0, ry) is applied to the PDG to change the dependence edges, eliminating sequential processing among different processors. It can be proven that (0, ry) is always a valid multi-dimensional retiming vector in a multiprocessor environment and, therefore, a convenient retiming function such as (0,1) can be applied and will result in the elimination of direct dependencies from the PDG, as stated in the theorem below:

Theorem 3.5 Given a PDG P = (Π, ε, δ) with at least one edge e ∈ ε such that δ(e) = (0,0), there exists a retiming function r of the form (0, ry(u)) for all u ∈ Π that, applied to P, creates Pr = (Π, ε, δr) such that for any e ∈ ε, δr(e) > (0,0).

This theorem can be proven by analyzing three possible cases and considering properties 3.1, 3.2 and 3.3. The three cases are:

• δ(e) = (0,0), which will require the application of the retiming function r, resulting in a dependence δr(e) = δ(e) + r > (0,0).

• δ(e) = (0,y), which, according to properties 3.1, 3.2 and 3.3, will after retiming result in δr(e) > (0,0).

• δ(e) = (x,y), with x > 0, which would produce δr(e) = (x, y - ry). In this case, considering that x > 0, δr(e) > (0,0).

Just as in the case of MDFGs, full parallelism is achieved when all edges between any two processors become non-zero-delay edges [19]. The loop bodies are then modified according to the retiming function applied to the PDG. Figure 2(b) shows the inter-processor dependencies after the retiming technique was applied to the PDG in figure 3. A synchronization function is now needed to trigger the execution of each processor after an initialization stage, usually called a prologue, which is inherent to retiming.

The synchronization function call is controlled by the values of the indices of the loop and the number of the processing unit. When the processor has executed the instructions comprising its mandatory prologue and has already initialized the data required for the parallel execution, the function is activated. The general format of the final code of the loop body for each processor is shown next:

/* code for processor #n (k = number of processors) */
for (i = n; i < N; i += k)
  for (j = 0; j < N; j++)
  {
    if ((i == n) && (j == (ry(pn) - ry(pn+1))))
      SYNC(pn+1)
    y(i,j) = c1*y(i-1,j-1) + c2*y(i,j-1) + c3*y(i-1,j) + x(i,j)
  }

The entire process is then summarized in the algorithm MRA, which is a modified version of the OPTIMUS algorithm presented in [21]:

Multiprocessor Retiming Algorithm (MRA):
Input: MDFG G = (V, E, d, t), number of processors k
Output: retimed G

PDG P = (Π, ε, δ) ← transform(G, k)
rv ← (0,1)
MC(u) ← 0 for all u ∈ Π
MCmax ← 0
QueueΠ ← ∅
/* remove original edges with positive delays */
ε ← ε - {e ∈ ε s.t. δ(e) > (0,…,0)}
/* queue non-dependent nodes */
QueueΠ ← QueueΠ ∪ {u ∈ Π s.t. indegree(u) = 0}
while QueueΠ ≠ ∅
  get(u, QueueΠ)
  /* check if u needs to be retimed */
  if ∃ v ∈ Π s.t. δ(v → u) = (0,0)
    /* adjust the MC(u) value */
    MC(u) ← MC(u) + 1
    MCmax ← max{MC(u), MCmax}
  endif
  /* propagate the values to the successor nodes of u */
  repeat ∀ v ∈ Π such that u → v
    indegree(v) ← indegree(v) - 1
    MC(v) ← max{MC(v), MC(u)}
    /* check for new non-dependent nodes */
    if indegree(v) = 0
      QueueΠ ← QueueΠ ∪ {v}
    endif
  endrepeat
endwhile
/* compute the multi-dimensional retiming */
∀ u ∈ Π
  r(u) ← (MCmax - MC(u)) * rv
End
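The MRA traversal can be prototyped directly. The sketch below is an illustrative reconstruction, not the paper's code: it represents the PDG as a fixed-size edge list, strips the positive-delay edges, levels the remaining zero-delay chains in topological order (reading the retiming test on the incoming zero-delay edges, which reproduces the worked three-processor result r(P0) = (0,2), r(P1) = (0,1), r(P2) = (0,0) of Section 4), and returns the ry component of each r(u) = (0, (MCmax - MC(u)) * 1):

```c
#define MAXP 16   /* illustrative bound on the number of processors */
#define MAXE 64   /* illustrative bound on the number of PDG edges  */

/* One PDG edge with its multi-dimensional delay delta = (dx, dy). */
typedef struct { int from, to, dx, dy; } Edge;

/* Compute ry[u] such that r(u) = (0, ry[u]), following rv = (0,1). */
static void mra_retime(int k, const Edge *edges, int ne, int ry[]) {
    Edge zero[MAXE];
    int nz = 0, indeg[MAXP] = {0}, mc[MAXP] = {0};
    int queue[MAXP], head = 0, tail = 0, mcmax = 0;

    /* remove original edges with positive delays */
    for (int e = 0; e < ne; e++)
        if (edges[e].dx == 0 && edges[e].dy == 0) {
            zero[nz++] = edges[e];
            indeg[edges[e].to]++;
        }

    /* queue the non-dependent processors */
    for (int u = 0; u < k; u++)
        if (indeg[u] == 0) queue[tail++] = u;

    while (head < tail) {
        int u = queue[head++];
        /* u sits one level deeper than the zero-delay chain feeding it */
        for (int e = 0; e < nz; e++)
            if (zero[e].to == u) { mc[u]++; break; }
        if (mc[u] > mcmax) mcmax = mc[u];
        /* propagate the level to the successors of u */
        for (int e = 0; e < nz; e++)
            if (zero[e].from == u) {
                int v = zero[e].to;
                if (mc[u] > mc[v]) mc[v] = mc[u];
                if (--indeg[v] == 0) queue[tail++] = v;
            }
    }
    for (int u = 0; u < k; u++) ry[u] = mcmax - mc[u];
}
```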
The algorithm MRA will produce the necessary retiming values for each node (processor) represented in the graph, allowing the transformation of the loop into its parallel format.

4. EXAMPLE

In this section, the processor allocation algorithm is applied to the IIR section of a 2-D filter design initially presented in [5]. Its transfer function is given by:

H(z1, z2) = [ Σ(i=0..2) Σ(j=0..2) f(i,j) * z1^(-i) * z2^(-j) ] / [ Σ(i=0..2) Σ(j=0..2) g(i,j) * z1^(-i) * z2^(-j) ]

This filter requires a loop with iterations containing eight multiplications and seven additions. The corresponding PDG for a three-processor system implementing this filter is shown in figure 4.

Figure 4. Processor Dependency Graph for the IIR problem

After applying the multi-dimensional retiming to the processor dependency graph, the PDG changes to the structure presented in figure 5. In this PDG, P0 has been retimed twice using the retiming function (0,1), resulting in an overall retiming value r(P0) = (0,2). P1 has been retimed once, with the overall retiming r(P1) = (0,1). P2 did not undergo retiming and therefore its retiming value is r(P2) = (0,0). These retiming values introduce multi-dimensional delays in all edges representing dependencies between processors. These new delays represent stored data that allow the parallel execution of the tasks assigned to the different processing elements.

Figure 5. Retimed PDG for the IIR problem

The code for processors 0 and 1 under the stated conditions becomes:

/* k = 3 (number of processors) */

/* processor 0 */
for (i = 0; i < N; i += k)
  for (j = 0; j < N; j++)
    if ((i == 0) && (j == 1))
      SYNC(p1)
    endif
    filter operations
  endfor
endfor

/* processor 1 */
for (i = 1; i < N; i += k)
  for (j = 0; j < N; j++)
    if ((i == 1) && (j == 1))
      SYNC(p2)
    endif
    filter operations
  endfor
endfor

The SYNC command sends a signal to the named processor, informing it that its required data has been computed and allowing it to initiate its execution sequence. Processor P0 triggers processor P1 after a first iteration has been executed, while processor P1 will trigger P2 after two iterations of P0 (or one of P1). The non-existence of (0,0) delays in the resulting graph shows that the iterations can be run in parallel on the three-processor system.

5. CONCLUSION

This paper has introduced an algorithm based on multi-dimensional retiming that allows an arbitrary number of processors to work in parallel on applications that make use of nested loop constructs. In particular, this paper demonstrated the application of this new method to the case of a two-dimensional image filter. The loops are modeled in the form of multi-dimensional dependence graphs, which are transformed into multi-processor equivalent versions. Such loops are retimed in order to guarantee fully parallel execution of the nested loop. After retiming, the modified code for each processor is easily generated. A synchronization signal sent between processors guarantees the availability of initial data and correct execution of the iterations distributed to the different processors. The theory and basis for the algorithm were presented, and an example was shown illustrating the use of the algorithm.

6. ACKNOWLEDGEMENTS

This work was partially supported by the National Science Foundation under Grants No. MIP 95-01006 and MIP 97-04276.
REFERENCES

[1] A. Aiken and A. Nicolau, "Optimal Loop Parallelization," Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pp. 308-317, June 1988.
[2] W. P. Burleson, "The Partitioning Problem on VLSI Arrays: I/O and Local Memory Complexity," Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 1217-1220, 1991.
[3] L.-F. Chao, A. LaPaugh, and E. H.-M. Sha, "Rotation Scheduling: A Loop Pipelining Algorithm," Proceedings of the 30th ACM/IEEE Design Automation Conference, Dallas, TX, pp. 566-572, June 1993.
[4] A. Darte and Y. Robert, "Constructive Methods for Scheduling Uniform Loop Nests," IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 8, pp. 814-822, 1994.
[5] R. Gnanasekaran, "2-D Filter Implementation for Real-Time Signal Processing," IEEE Transactions on Circuits and Systems, Vol. 35, No. 5, pp. 587-590, 1988.
[6] G. Goosens, J. Wandewalle, and H. de Man, "Loop Optimization in Register Transfer Scheduling for DSP Systems," Proceedings of the ACM/IEEE Design Automation Conference, pp. 826-831, 1989.
[7] R. Gupta, "Loop Displacement: An Approach for Transforming and Scheduling Loops for Parallel Execution," Proceedings of the International Conference on Supercomputing, pp. 388-397, November 1990.
[8] P. Held, P. Dewilde, E. Deprettere, and P. Wielage, "HIFI: From Parallel Algorithm to Fixed-Size VLSI Processor Array," in Application-Driven Architecture Synthesis, ed. F. Catthoor and L. Svensson, Norwell, Massachusetts: Kluwer Academic Publishers, pp. 71-94, 1993.
[9] L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM, 17(2), pp. 82-93, February 1974.
[10] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 318-328, 1988.
[11] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. Lin, "Performance Optimization of Pipelined Circuits," Proceedings of the International Conference on Computer-Aided Design, pp. 410-413, November 1990.
[12] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. Lin, "An Effective Methodology for Functional Pipelining," Proceedings of the International Conference on Computer-Aided Design, pp. 230-233, December 1992.
[13] C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," Algorithmica, 6, pp. 5-35, 1991.
[14] H. Li, S. Tandri, M. Stumm and K. C. Sevcik, "Locality and Loop Scheduling on NUMA Multiprocessors," Proceedings of the 1993 International Conference on Parallel Processing, Vol. II, pp. 140-147, 1993.
[15] L. S. Liu, C. W. Ho, and J. P. Sheu, "On the Parallelism of Nested For-Loops Using Index Shift Method," Proceedings of the International Conference on Parallel Processing, pp. 119-123, 1990.
[16] L. E. Lucke and K. K. Parhi, "Generalized ILP Scheduling and Allocation for High-Level DSP Synthesis," Proceedings of the Custom Integrated Circuits Conference, pp. 5.4.1-5.4.4, 1993.
[17] D. I. Moldovan and J. A. B. Fortes, "Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays," IEEE Transactions on Computers, Vol. C-35, pp. 1-12, January 1986.
[18] K. K. Parhi and D. G. Messerschmitt, "Fully-Static Rate-Optimal Scheduling of Iterative Data-Flow Programs Via Optimum Unfolding," Proceedings of the International Conference on Parallel Processing, Vol. I, pp. 209-216, 1989.
[19] N. L. Passos and E. H.-M. Sha, "Achieving Full Parallelism Using Multidimensional Retiming," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 11, pp. 1150-1163, November 1996.
[20] N. L. Passos, A. Leonardi and E. H.-M. Sha, "Nested Loops Optimization for Multiprocessor Architecture Design," Proceedings of the 1998 Midwest Symposium on Circuits and Systems, Notre Dame, IN, pp. 415-418, August 1998.
[21] N. L. Passos and E. H.-M. Sha, "Scheduling of Uniform Multi-Dimensional Systems under Resource Constraints," IEEE Transactions on VLSI Systems, Vol. 6, No. 4, pp. 719-730, December 1998.
[22] R. Potasman, J. Lis, A. Nicolau, and D. Gajski, "Percolation Based Scheduling," Proceedings of the ACM/IEEE Design Automation Conference, pp. 444-449, 1990.
[23] D. L. Perry, VHDL, McGraw-Hill, New York, 1994.
[24] W. Shang and J. A. B. Fortes, "Independent Partitioning of Algorithms with Uniform Dependencies," IEEE Transactions on Computers, Vol. 41, No. 2, pp. 190-206, February 1992.
[25] C.-Y. Wang and K. K. Parhi, "High Level DSP Synthesis Using the MARS Design System," Proceedings of the International Symposium on Circuits and Systems, pp. 164-167, 1992.
