Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha Midwestern State University
Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha Midwestern State University
Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha Midwestern State University
(1,1) (1,0)
(1,0) (1,-1)
(0,1) (0,1) (0,1) The SYNC command sends a signal to the named
processor, informing that its required data has been
computed and allowing it to initiate its execution
P0 P1 P2 sequence. Processor P0 triggers processor P1 after a first
iteration has been executed, while processor P1 will
(1,0) trigger P2, after 2 iterations of P0 (or one of P1). The non-
existence of (0,0) delays in the resulting graph shows that
the iterations can be run in parallel on the three-processor
system.
Figure 4. Processor Dependency Graph for the IIR
problem 5. CONCLUSION
(0,1) (0,1) (0,1)
This paper has introduced an algorithm based on
multi-dimensional retiming that allows an arbitrary
(0,1) (0,1) number of processors to work in parallel on applications
P0 P1 P2 that make use of nested loop constructs. In particular, this
paper demonstrated the application of this new method to
the case of a two-dimensional image filter. The loops are
(1,-2) modeled in the form of multi-dimensional dependence
graphs, which are transformed to multi-processor
equivalent versions. Such loops are retimed in order to
Figure 5. Retimed PDG for the IIR problem. guarantee fully parallel execution of the nested loop.
After retiming, the modified code for each processor is
After applying the multi-dimensional retiming to the easily generated. A synchronization signal sent between
processor dependency graph, the PDG changes to the processors guarantees the availability of initial data and
structure presented in figure 5. In this PDG, P0 has been correct execution of the iterations distributed in different
retimed twice using the retiming function (0,1), resulting processors. The theory and basis for the algorithm were
in an overall retiming value r(P0) = (0,2). P1 has been presented and an example was shown illustrating the use
retimed once, with the overall retiming r(P1) = (0,1). P2 of the algorithm.
did not undergo retiming and therefore its retiming value
is r(P2) = (0,0). These retiming values introduce multi-
dimensional delays in all edges representing dependencies
6. ACKNOWLEDGEMENTS
between processors. These new delays represent stored
This work was partially supported by the National
data that allow the parallel execution of the tasks assigned
Science Foundation under Grants No. MIP 95-01006 and
to the different processing elements. The code for
MIP 97-04276.
processors 0 and 1 under the stated conditions becomes:
REFERENCES [13] C. E. Leiserson and J. B. Saxe, “Retiming
Synchronous Circuitry,” Algorithmica, 6, pp. 5-35,
[1] A. Aiken and A. Nicolau, “Optimal Loop 1991.
Parallelization,” Proceedings of the SIGPLAN [14] H. Li, S. Tandri, M. Stumm and K. C. Sevcik,
Conference on Programming Language Design and “Locality and Loop Scheduling on NUMA
Implementation, pp. 308-317, June, 1988. Multiprocessors,” Proceedings of the 1993
[2] W. P. Burleson, “The Partitioning Problem on VLSI International Conference on Parallel Processing,
Arrays: I/O and Local Memory Complexity,” Vol. II, pp. 140-147, 1993.
Proceedings of the International Conference on [15] L. S. Liu, C. W. Ho, and J. P. Sheu, “On the
Acoustics, Speech and Signal Processing, pp. 1217- Parallelism of Nested For-Loops Using Index Shift
1220, 1991. Method,” Proceedings of International Conference
[3] L.-F. Chao, A. LaPaugh, and E. H.-M. Sha, on Parallel Processing, pp. 119-123, 1990.
“Rotation Scheduling: A Loop Pipelining [16] L. E. Lucke and K. K. Parhi, “Generalized ILP
Algorithm,” Proceedings of the 30th ACM/IEEE Scheduling and Allocation for High-Level DSP
Design Automation Conference, Dallas, TX, pp. Synthesis,” Proceedings of the Custom Integrated
566-572, June, 1993. Circuits Conference, pp. 5.4.1-5.4.4, 1993.
[4] A. Darte and Y. Robert, “Constructive Methods for [17] D. I. Moldovan and J. A. B. Fortes, “Partitioning
Scheduling Uniform Loop Nests,” IEEE and Mapping Algorithms into Fixed Size Systolic
Transactions on Parallel and Distributed Systems, Arrays,” IEEE Transactions on Computers, Vol. C-
Vol. 5, no. 8, pp. 814-822, 1994. 35, pp. 1-12, January, 1986.
[5] R. Gnanasekaran, “2-D Filter Implementation for [18] K. K. Parhi and D. G. Messerschmitt, “Fully-Static
real-time Signal Processing,” IEEE Transactions on Rate-Optimal Scheduling of Iterative Data-Flow
Circuits and Systems, v. 35, n. 5, pp. 587-590, 1988. Programs Via Optimum Unfolding,” Proceedings of
[6] G. Goosens, J. Wandewalle, and H. de Man, “Loop the International Conference on Parallel
Optimization in Register Transfer Scheduling for Processing, Vol. I, pp. 209-216, 1989.
DSP Systems,” Proceedings of the ACM/IEEE [19] N. L. Passos and E. H. -M. Sha, “Achieving Full
Design Automation Conference, pp. 826-831, 1989. Parallelism Using Multidimensional Retiming,”
[7] R. Gupta, “Loop Displacement: An Approach for IEEE Transactions on Parallel and Distributed
Transforming and Scheduling Loops for Parallel Systems, Vol. 7, No. 11, pp. 1150-1163, November,
Execution,” Proceedings of the International 1996.
Conference on Supercomputing, pp. 388-397, [20] N. L. Passos, A. Leonardi and E. H. -M. Sha,
November, 1990. “Nested Loops Optimization for Multiprocessor
[8] P. Held, P. Dewilde, E. Deprettere, and P. Wielage, Architecture Design,” in the Proceedings of the
“HIFI: From Parallel Algorithm to Fixed-Size VLSI 1998 Midwest Symposium on Circuits and Systems,
Processor Array,” in Application-Driven Notre Dame, IN, pp. 415-418, August, 1998.
Architecture Synthesis, ed. F. Catthoor and L. [21] N. L. Passos and E. H. -M. Sha, “Scheduling of
Svensson, Norwell, Massachusetts: Kluwer Uniform Multi-Dimensional Systems under
Academic Publishers, pp. 71-94, 1993. Resource Constraints, ” in the IEEE Transactions on
[9] L. Lamport, “The Parallel Execution of DO Loops,” VLSI Systems, Volume 6, Number 4, pp. 719-730,
Communications of the ACM SIGPLAN, 17(2), pp. December, 1998.
82-93, February, 1974. [22] R. Potasman, J. Lis, A. Nicolau, and D. Gajski,
[10] M. Lam, “Software Pipelining: An Effective “Percolation Based Scheduling,” Proceedings of the
Scheduling for VLIW Machines,” ACM SIGPLAN ACM/IEEE Design Automation Conference, pp.
Conference on Programming Language Design and 444-449, 1990.
Implementation, pp. 318-328, 1988. [23] D. L. Perry, VHDL, McGraw-Hill Inc., New York,
[11] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. New York, 1994.
Lin, “Performance Optimization of Pipelined [24] W. Shang and J. A. B. Fortes, “Independent
Circuits,” Proceedings of the International Partitioning of Algorithms with Uniform
Conference on Computer Aided Design, pp. 410- Dependencies,” IEEE Transactions on Computers,
413, November, 1990. Vol. 41, February, pp. 190-206, 1992.
[12] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. [25] C.-Y. Wang and K. K. Parhi, “High Level DSP
Lin, “An Effective Methodology for Functional Synthesis Using the MARS Design System,”
Pipelining,” Proceedings of the International Proceedings of the International Symposium on
Conference on Computer Aided Design, pp. 230- Circuits and Systems, pp. 164-167, 1992.
233, December, 1992.