STATIC SCHEDULING OF SPLIT-NODE DATA-FLOW GRAPHS
Timothy W. O’Neil
Computer Science Dept.
University of Akron
Akron, OH 44325-4002
email: toneil@cs.uakron.edu
Edwin H.-M. Sha
Computer Science Dept.
Univ. of Texas at Dallas
Richardson, TX 75083-0688
email: edsha@utdallas.edu
ABSTRACT
Many computation-intensive or recursive applications commonly found in digital signal processing and image processing can be represented by data-flow
graphs (DFGs). In our previous work, we proposed a new
technique, extended retiming, which can be combined with
minimal unfolding to transform a DFG into one which is
rate-optimal. The result, however, is a DFG with split
nodes, a concise representation for pipelined schedules.
This model and the extraction of the pipelined schedule it
represents have heretofore not been explored. In this paper, we construct scheduling algorithms for such graphs
and demonstrate our methods on specific examples.
KEY WORDS
Parallel and Distributed Compilers, Task Scheduling.
1 Introduction
Because the most time-critical parts of real-time or
computation-intensive applications are loops, we must explore the parallelism embedded in the repetitive pattern of
a loop. A loop can be represented as a data-flow graph
(DFG) [1]. The nodes of a DFG depict tasks, while
edges between nodes symbolize data dependencies among
tasks. Each edge may contain a number of delays (i.e.
loop-carried dependencies). This model is widely used in
many fields, including circuitry [2], digital signal processing (DSP) [3] and program descriptions [4].
In our previous work [5–8], we proposed an efficient
algorithm, extended retiming, which transforms a DFG into
an equivalent graph with maximum parallelism. Indeed, we
have demonstrated that extended retiming, when combined
with minimum unfolding, achieves rate optimality, the first
method we are aware of for which this can be said. The effectiveness of extended retiming was further demonstrated
via experimentation in [6]. In all cases explored, we were
able to achieve better results by using extended retiming,
getting an optimal clock period while requiring less unfolding.
While the usefulness of this new transformation is
clear, the result of extended retiming is a graph containing split nodes. This is not to say that we are physically
altering the DFG by placing registers inside of functional
units. Rather, we are describing an abstraction for a graph which provides a feasible rate-optimal schedule with loop pipelining. The split-node graph is simply the most compact means for expressing this best schedule. This combination of reduced size and extensive implanted system information potentially makes the split-node DFG an attractive archetype for research into DSP on parallel embedded systems, where concision is key. However, before we get to that point, much basic work remains to be completed. The properties of split-node graphs and the means by which they can be manipulated in order to draw out the pipelined schedule represented therein have not been explored in the literature. By using a split-node DFG to characterize a situation, we are conveying not only that a schedule is to be pipelined, but also giving specific clues as to how it is to be pipelined. Thus, while many scheduling algorithms for traditional data-flow graphs exist throughout the literature [9–13], we must make specific modifications to existing methods so that they apply to this new model and produce an optimal schedule which obeys the additional rules regarding pipelining that the split-node graph dictates.

In this paper, we develop the first method designed specifically to efficiently schedule the system represented by a split-node graph. To that end, we formally define a split-node data-flow graph and redefine the terminology of scheduling to fit this new paradigm. We develop scheduling algorithms for split-node graphs, and explain how to achieve a rate-optimal schedule by applying these algorithms. Finally, we demonstrate our methods on specific examples.
2 Background
Before proceeding to our primary results, we first introduce
our basic models. We then review previously established
results pertinent to our task.
We define a split-node data-flow graph (SDG) with splitting degree δ to be a finite, directed, weighted graph G = (V, E, d, t), where V is the vertex set; E ⊆ V × V is the edge set, representing precedence relations among the nodes; d: E → Z is a function with d(e) the delay count for edge e; and t: V → Z^δ is a function with t(v) the δ-tuple representing the computation times of v’s pieces. Broadly speaking, δ is the maximum number of pieces any node of G is split into. (Trivially, if δ = 1 the SDG is simply a data-flow graph as defined previously.) If a node v is not split, t(v) is an integer rather than a δ-tuple. For example, in the SDG of Figure 1, t(A) = (3, 4, 2, 1) while t(B) = t(C) = 2. We will use the notation T(v) to refer to the sum of the elements of v’s δ-tuple if v is split, and to t(v) otherwise. In this example, T(A) = 10 and T(B) = T(C) = 2.
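To make the model concrete, the following minimal Python sketch encodes an SDG; it is our own illustration, not code from the paper, and the names (SDG, fig1 and so on) are invented. It assumes a split node carries one internal delay between consecutive pieces, so the derived measures T(v) and D(e) used later in this section fall out directly.

    from dataclasses import dataclass

    @dataclass
    class SDG:
        """A split-node data-flow graph G = (V, E, d, t) (hypothetical encoding)."""
        vertices: set[str]
        delays: dict[tuple[str, str], int]   # edge (u, v) -> delay count d(e)
        times: dict[str, tuple[int, ...]]    # node -> piece times; length 1 if unsplit

        def is_split(self, v: str) -> bool:
            return len(self.times[v]) > 1

        def T(self, v: str) -> int:
            # Sum of the pieces if split, plain t(v) otherwise.
            return sum(self.times[v])

        def D(self, e: tuple[str, str]) -> int:
            # d(e) plus the internal delays of the source node (pieces - 1).
            return self.delays[e] + len(self.times[e[0]]) - 1

    # Figure 1: A is split into pieces (3, 4, 2, 1); B and C are unsplit.
    fig1 = SDG(
        vertices={"A", "B", "C"},
        delays={("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1},
        times={"A": (3, 4, 2, 1), "B": (2,), "C": (2,)},
    )
    assert fig1.T("A") == 10 and fig1.D(("A", "B")) == 3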
Figure 1. A sample SDG. (Node A is split into four pieces with computation times 3, 4, 2 and 1, separated by three internal delays; nodes B and C each take 2 time units; the edge from C to A carries one delay.)
In our model, delays may be contained either along an
edge or within a node. As we have stated, the execution of
all nodes in G once is an iteration. Delays contained along
an edge represent precedence relations across iterations; for
example, the one-delay edge between C and A in Figure 1 indicates that the execution of C in the current iteration must terminate before A can begin in the next iteration. On the other hand, delays within a node convey information regarding the pipelined execution of a node. For example, the three delays inside of A tell us that up to four copies of the node may be executing simultaneously in a pipelined schedule of tasks. Furthermore, the positions of the delays inside of a node indicate the form of the schedule of tasks for a graph. In the case of Figure 1 we can build a schedule in such a way that the first iteration contains the beginning of A’s first copy; the next iteration includes only the part of this copy taking 4 time units to execute; the next iteration covers the part lasting only 2 time units; and the next iteration includes the remaining piece of A’s first copy. After that, we schedule the copies of B and C as best we can around the borders of the iterations, making sure that the copy of C in this iteration precedes the copy of A in the next iteration, and that the current copy of B starts upon termination of the corresponding copy of A. We see that it is straightforward to derive a schedule from a simple SDG purely through observation. Our purpose is to formalize this method so that it may be applied automatically to more complex examples.
Given an edge e: u → v in a SDG G, we will use the traditional notation d(e) to refer to the number of delays on the edge, not including delays within end nodes. We will further define D(e) as d(e) plus the number of delays within the source node u. Referring to Figure 1, we observe that d(e) = D(e) for the edges leaving B and C, while d(e) = 0 and D(e) = 3 for the edge from A to B.

An integral time schedule or integral schedule is a function s: V × N → Z where the starting time of node v in the j-th iteration is given by s(v, j). It is a legal schedule if s(u, j) + T(u) ≤ s(v, j + D(e)) for all edges e: u → v and iterations j, while a legal schedule is a repeating schedule for cycle period c and unfolding factor f if s(v, j + f) = s(v, j) + c for all nodes v and iterations j. Such a schedule can be represented by its first f iterations, since a new occurrence of this partial schedule can be started at the beginning of every interval of c clock ticks to form the complete legal schedule.

As we have stated, an iteration is simply an execution of all nodes in a data-flow graph (DFG) once. The average computation time of an iteration is called the iteration period of the DFG. If a DFG contains a loop, then this iteration period is bounded from below by the iteration bound [14] of G, which is denoted B(G) and is the maximum time-to-delay ratio of all cycles in G. For example, there are two loops in Figure 1: the outer A → B → C → A loop, with total computation time 14 and delay count 4 (one delay on the edge from C to A plus the three delays inside A); and the cycle formed by the pieces of A itself, with time 10 and delay count 3. The larger of these ratios comes from the outer loop, and so B(G) = 14/4 = 7/2 in this case. When the iteration period of the schedule equals the iteration bound of the DFG, we say that the schedule is rate-optimal. Clearly, if we have a legal schedule for G with cycle period c and unfolding factor f, its iteration period is c/f, and since B(G) is a lower bound for the iteration period, we must have c/f ≥ B(G).
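As a sanity check on these definitions, here is a brute-force Python sketch of B(G) that continues the hypothetical SDG encoding from above. It assumes, as in the discussion of Figure 1, that a split node contributes its internal delays to any cycle through it and that its pieces form a cycle of their own; the exponential cycle enumeration is only suitable for toy graphs.

    from fractions import Fraction
    from itertools import permutations

    def iteration_bound(sdg: SDG) -> Fraction:
        """B(G): the maximum time-to-delay ratio over all cycles."""
        best = Fraction(0)
        verts = sorted(sdg.vertices)
        for r in range(1, len(verts) + 1):
            for order in permutations(verts, r):
                hops = list(zip(order, order[1:] + (order[0],)))
                if all(h in sdg.delays for h in hops):
                    time = sum(sdg.T(v) for v in order)
                    # Edge delays plus internal delays of split nodes on the cycle.
                    dly = (sum(sdg.delays[h] for h in hops)
                           + sum(len(sdg.times[v]) - 1 for v in order))
                    if dly:
                        best = max(best, Fraction(time, dly))
        for v in verts:
            # A split node's pieces form a cycle by themselves.
            if sdg.is_split(v):
                best = max(best, Fraction(sdg.T(v), len(sdg.times[v]) - 1))
        return best

    print(iteration_bound(fig1))   # 7/2, matching the computation above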
3 SDG Scheduling Algorithms
We now discuss a method for scheduling a split-node
graph, based on the as-early-as-possible (AEAP) scheduling algorithm [9]. We begin by constructing a related sequencing graph based on our SDG. This sequencing graph
is designed to model all intra-iteration dependencies. To
this end, we remove all edges from the SDG with non-zero
delay count. We also replace any split node by its head
and tail and re-route all zero-delay edges involving the split
node. Edges leaving a split node must leave the tail of the
node in the sequencing graph, and those entering a split
node must now go to the head. Finally, a dummy source
node and edges from it to all other nodes are added. The
procedure for constructing this graph appears as Algorithm
1, with the sequencing graph for Figure 1 given as Figure
2(a).
We can now produce a forward schedule using the sequencing graph. We assume that the dummy node v0 executes at time step zero and takes no time to execute. Since the sequencing graph is acyclic, we may apply a modified version of the single-source shortest-paths algorithm from [15] to find the lengths of the longest paths from v0 to every other vertex. We begin by sorting all of the vertices, with u preceding v in the sorted list if there is an edge from u to v in the sequencing graph. Taking the vertices in sorted order, we now find the longest paths to each vertex from v0.
The length of the longest path is the starting time for the
node in the first iteration; repeating this iteration gives us
the complete schedule. However, since we must execute a
split node in order from start to finish, we schedule only the
head of any split node. This complete procedure appears as
Algorithm 2.
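To ground the procedure, the sketch below renders Algorithms 1 and 2 together in Python, reusing the hypothetical SDG encoding from Section 2; it is our reading of the method described above, not the authors’ code. graphlib.TopologicalSorter performs the sort, and a single longest-path sweep then yields the starting times for iteration zero.

    from graphlib import TopologicalSorter

    def forward_schedule(sdg: SDG, c: int) -> dict[str, int]:
        """AEAP starting times for iteration zero; the repeating schedule
        is then s(v, j) = start[v] + j * c."""
        # Algorithm 1: sequencing-graph vertices (split nodes become head/tail).
        nodes = {"v0": 0}
        for v in sdg.vertices:
            if sdg.is_split(v):
                nodes[v + "_head"] = sdg.times[v][0]
                nodes[v + "_tail"] = sdg.times[v][-1]
            else:
                nodes[v] = sdg.T(v)
        # Keep only zero-delay edges, re-routed out of tails and into heads;
        # the dummy source v0 precedes every other vertex.
        preds = {n: ({"v0"} if n != "v0" else set()) for n in nodes}
        for (u, v), d in sdg.delays.items():
            if d == 0:
                src = u + "_tail" if sdg.is_split(u) else u
                dst = v + "_head" if sdg.is_split(v) else v
                preds[dst].add(src)
        # Algorithm 2: longest paths from v0, taken in topological order.
        path: dict[str, int] = {}
        for n in TopologicalSorter(preds).static_order():
            path[n] = max((path[p] + nodes[p] for p in preds[n]), default=0)
        # Schedule only the head of a split node, the whole node otherwise.
        return {v: path[v + "_head" if sdg.is_split(v) else v]
                for v in sdg.vertices}

On fig1 this returns A at step 0, B at step 1 and C at step 3, in line with the worked example below.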
Figure 2. (a) The sequencing graph for Figure 1; (b) The forward schedule for this graph; (c) The backward schedule.
Algorithm 1 Building the sequencing graph for a split-node DFG
Input: A split-node DFG G = (V, E, d, t)
Output: An acyclic sequencing graph G′ = (V′, E′, t′)
V′ ← ∅; E′ ← ∅
for all v ∈ V do
  if t(v) is an n-tuple with n > 1 then
    /* Replace all split nodes by a head and tail. */
    V′ ← V′ ∪ {v_head, v_tail}
    t′(v_head) ← first element of t(v); t′(v_tail) ← last element of t(v)
  else
    /* Any non-split node is retained as-is. */
    V′ ← V′ ∪ {v}; t′(v) ← t(v)
  end if
end for
for all e: u → v ∈ E with d(e) = 0 do
  /* Edges from a split node now leaving tail. */
  if u is a split node then u′ ← u_tail else u′ ← u end if
  /* Edges into a split node routed into head. */
  if v is a split node then v′ ← v_head else v′ ← v end if
  E′ ← E′ ∪ {u′ → v′}
end for
/* Add dummy source node and edges from it to all other nodes. */
V′ ← V′ ∪ {v0}; t′(v0) ← 0
for all v ∈ V′ with v ≠ v0 do
  E′ ← E′ ∪ {v0 → v}
end for

Algorithm 2 Forward scheduling
Input: A split-node DFG G = (V, E, d, t) with clock period c
Output: A forward repeating schedule s
G′ ← (V′, E′, t′)  /* Apply Algorithm 1 to G. */
/* Find longest paths to all nodes in seq. graph. */
Topologically sort the vertices of G′; p(v) ← 0 for all v ∈ V′
for all v ∈ V′ taken in sorted order do
  for all u ∈ V′ adjacent to v do
    if p(v) + t′(v) > p(u) then
      p(u) ← p(v) + t′(v)
    end if
  end for
end for
for all v ∈ V do
  if v is a split node then
    /* Schedule only the heads of split nodes. */
    s(v, 0) ← p(v_head)
  else
    /* Schedule the complete node if not split. */
    s(v, 0) ← p(v)
  end if
end for
for all v ∈ V and integers j > 0 do
  /* Repeat to derive complete schedule. */
  s(v, j) ← s(v, 0) + jc
end for
As an example, return to the sequencing graph in Figure 2(a). The longest paths to both A_head and A_tail are zero, while those to B and C are 1 and 3, respectively. Adopting a near-optimal clock period of 5, we schedule copies of A at steps 5k and copies of B and C at 5k + 1 and 5k + 3 for k = 0, 1, 2, …, as shown in Figure 2(b).
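With the hypothetical forward_schedule sketch from above, this repeating schedule unrolls directly:

    start = forward_schedule(fig1, c=5)            # {'A': 0, 'B': 1, 'C': 3}
    for j in range(2):
        print({v: s + 5 * j for v, s in sorted(start.items())})
    # {'A': 0, 'B': 1, 'C': 3}
    # {'A': 5, 'B': 6, 'C': 8}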
By a similar process, we may construct a SDG’s backward schedule. First of all, when constructing the sequencing graph, we reverse the direction of all edges inherited from the original graph. (In the case of Figure 2(a), this means that the edges A_tail → B and B → C become the edges B → A_tail and C → B, respectively.) With such a construction, the longest path lengths may be subtracted from the designated clock period to derive the finishing times for the nodes. We must then subtract a node’s execution time to find the starting time. For example, returning to our modified version of Figure 2(a), we would compute path lengths of 0 for A_head and C, 2 for B and 4 for A_tail. Assuming again a clock period of 5, we would schedule A to start execution at step 2, B to begin at 1, and C to commence at 3. This ALAP schedule is pictured in Figure 2(c).
4 The Clock Period of an SDG
One detail passed over in Algorithm 2 is the specification
of a clock period for our SDG. While we could use almost
anything sufficiently large for an input parameter and a legal schedule would still result, we are interested in minimizing this so as to produce the best possible schedule. To
this end, the minimum clock period is formally defined as
the length of the longest zero-delay path (i.e. connected
sequence of nodes and edges) in a graph. Informally, the
clock period represents the “maximum amount of propagation delay through which any signal must ripple between
clock ticks” [2].
In an SDG, “paths” are either internal pieces of split
nodes, or zero-delay edge sequences from the tail of a split
node through unsplit nodes into the head of another split
node. We thus calculate the clock period of an SDG in
two phases. First, we determine the computation time of
the biggest piece of any split node. (Of course, in the case
of unsplit nodes, this is the total computation time.) Next,
we use the sequencing graph to find sums of computation
times along paths from split-node tails to split-node heads.
The maximum over this combined data set is the minimum
clock period. A formalized method for this, based on the
method from [2] for traditional DFGs, appears as Algorithm 3.
Reconsider the graph in Figure 1. One topological sort of the nodes in its sequencing graph (Figure 2(a)) yields the order v0, A_head, A_tail, B and C. The longest paths into each of these nodes were noted above. We now compute the Δ-values by adding the lengths of the paths into these nodes to the computation times of the nodes themselves (since any signal must pass through the end node as well). Thus the Δ-values for A_tail, B and C are 1, 3 and 5, respectively. A is the only split node and has a computation time of 4 for its largest piece; the longest zero-delay path into A is the edge from v0 into its head, which has length zero as noted. Thus Δ(A) = 4, and the maximum over this combined data set is 5, the clock period we have been using all along.
5 Analysis
The key thing to realize in constructing the sequencing
graph in Algorithm 1 is that any node in the original SDG
is replaced by either one or two new nodes in the sequencing graph, depending on whether or not the original node is
split. Thus the first and last loops execute in O(|V|) time steps, while the loop in between them at worst requires O(|E|) steps. It is therefore clear that Algorithm 1 is of complexity O(|V| + |E|).

Similarly, the time complexities of Algorithms 2 and 3 are O(|V| + |E|), due to the topological sorts. The loop immediately following the sort in Algorithm 2 requires at worst O(|E|) time steps since the SDG is a directed graph, while the remaining loops take O(|V|) time. The first loop in Algorithm 3 has time complexity at worst O(|E|), while the second takes O(|V|) time to run.

We can therefore conclude that the time complexity of our overall scheduling process for a split-node graph is at worst O(|V| + |E|). Not only have we constructed an
initial method for scheduling a SDG, we have constructed
one that is no more complex than similar existing methods
for traditional DFGs [9, 10].
Algorithm 3 Determining the clock period of a split-node DFG
Input: A split-node DFG G = (V, E, d, t)
Output: Its clock period c
/* Compute zero-delay paths. */
G′ ← (V′, E′, t′)  /* the sequencing graph from Algorithm 1 */
Topologically sort the vertices of G′; Δ(v) ← 0 for all v ∈ V′
for all v ∈ V′ taken in sorted order do
  Δ(v) ← t′(v) + max{Δ(u) : u → v ∈ E′}  /* max is 0 if v has no predecessor */
end for
for all v ∈ V do
  if t(v) is an n-tuple with n > 1 then
    /* Find time of split node’s largest piece. */
    Δ(v) ← max(largest element of t(v), Δ(v_head))
  end if
end for
c ← max{Δ(v) : v ∈ V}
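In the same hypothetical Python encoding used earlier, Algorithm 3 can be sketched as follows: rebuild the zero-delay sequencing structure, sweep the Δ-values in topological order, and fold in the largest piece of every split node.

    from graphlib import TopologicalSorter

    def clock_period(sdg: SDG) -> int:
        """Minimum clock period: the larger of the biggest split-node piece
        and the longest zero-delay ripple through the sequencing graph."""
        nodes = {"v0": 0}
        for v in sdg.vertices:
            if sdg.is_split(v):
                nodes[v + "_head"] = sdg.times[v][0]
                nodes[v + "_tail"] = sdg.times[v][-1]
            else:
                nodes[v] = sdg.T(v)
        preds = {n: set() for n in nodes}
        for (u, v), d in sdg.delays.items():
            if d == 0:
                src = u + "_tail" if sdg.is_split(u) else u
                dst = v + "_head" if sdg.is_split(v) else v
                preds[dst].add(src)
        delta: dict[str, int] = {}
        for n in TopologicalSorter(preds).static_order():
            delta[n] = nodes[n] + max((delta[p] for p in preds[n]), default=0)
        largest_piece = max(max(sdg.times[v]) for v in sdg.vertices)
        return max(largest_piece, max(delta.values()))

    print(clock_period(fig1))   # 5, as in the worked example above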
6 Achieving Rate Optimality via Unfolding
While our scheduling algorithm appears effective, this first example demonstrates its weakness. Because our algorithms require an integral clock period, the best we could accomplish was to create a schedule for Figure 1 with a near-optimal period of 5. We previously determined that the iteration bound for Figure 1 is the smaller 7/2. Indeed, the fact that our AEAP and ALAP schedules differ noticeably tells us that there is room for improvement. The question is then what to do with a fractional iteration period.

Unfolding [16] transforms a data-flow graph by scheduling multiple iterations simultaneously. For our particular example, if we schedule two iterations together during 7 clock cycles, we would achieve an average iteration period equal to our lower bound. In other words, if we unfold Figure 3(a) twice and then schedule with a clock
period of 7, we would achieve rate-optimality. In [17], we presented an unfolding algorithm for the split-node model. Applying this algorithm to unfold our initial example twice yields the graph in Figure 3(b). For clarity, the nodes comprising iteration one are shaded, while those in iteration zero are not. With unfolding complete, we now attempt to schedule the unfolded graph with a clock period of 7 time units. Algorithm 1 produces the sequencing graph in Figure 4(a). From the information contained in this graph, we can derive the time schedule for the zeroth iteration seen in Figure 4(b) via Algorithm 2. Note that this iteration is the zeroth iteration for the unfolded graph and contains iterations zero and one of the original graph.
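The arithmetic behind the choice of unfolding factor is worth making explicit. If B(G) = p/q in lowest terms, a rate-optimal repeating schedule needs c/f = p/q, and the minimal integral choice is f = q and c = p; a quick check for Figure 1, reusing the hypothetical sketches above:

    from fractions import Fraction

    bound = iteration_bound(fig1)               # Fraction(7, 2)
    f, c = bound.denominator, bound.numerator   # unfold twice, period 7
    assert Fraction(c, f) == bound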
Finally, we can use the information from the table in Figure 4(b) to construct a first schedule. Now, as in [18], we can optimize this schedule to arrive at the final rate- and processor-optimal schedule for Figure 1 pictured in Figure 5.

Figure 3. (a) Our original example; (b) The graph unfolded by a factor of 2.

Figure 4. (a) Sequencing graph for Figure 3(b); (b) The time schedule for iteration zero.

Figure 5. Rate- and processor-optimal schedule for Figure 1.

7 Example

We now review our methods by applying them to an additional example, the graph pictured in Figure 6(a). Applying Algorithm 1 to this split-node graph produces the sequencing graph in Figure 6(b). The length of the longest path in this graph becomes our clock period, and the longest paths from v0 into the remaining vertices give their starting times in the first iteration. We can thus produce the time schedule for the zeroth iteration seen in Figure 6(c) via Algorithm 2. Propagating these values across time and optimizing for best processor assignment as in [18] yields the final time- and processor-optimal schedule in Figure 7.

Figure 6. (a) Another sample SDG; (b) Its sequencing graph; (c) The time schedule for iteration zero.

Figure 7. Final schedule for Figure 6(a).

8 Conclusion

In this paper, we have formally defined a split-node data-flow graph and redefined the terminology of scheduling to fit this new paradigm. We have developed scheduling algorithms for split-node graphs, and demonstrated how to achieve a rate-optimal schedule by applying these algorithms. Finally, we have demonstrated our methods on specific examples.

Acknowledgement

This work was partially supported by NSF grants MIP-9501006 and MIP-9704276, and by the A.J. Schmitt Foundation while the authors were with the University of Notre Dame. It was also supported by the University of Akron, NSF grants ETA-0103709 and CCR-0309461, Texas ARP grant 009741-0028-2001 and the TI University Program.
References

[1] L.-F. Chao and E. H.-M. Sha. Scheduling data-flow graphs via retiming and unfolding. IEEE Trans. Parallel & Distributed Syst., 8:1259–1267, 1997.

[2] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5–35, 1991.

[3] S.Y. Kung, J. Whitehouse, and T. Kailath. VLSI and Modern Signal Processing. Prentice Hall, 1985.

[4] L.-F. Chao and E. H.-M. Sha. Retiming and unfolding data-flow graphs. In Proc. Int. Conf. Parallel Process., pages II 33–40, 1992.

[5] T.W. O’Neil, S. Tongsima, and E. H.-M. Sha. Extended retiming: Optimal retiming via a graph-theoretical approach. In Proc. ICASSP-99, volume 4, pages 2001–2004, 1999.

[6] T.W. O’Neil and E. H.-M. Sha. Rate-optimal graph transformation via extended retiming and unfolding. In Proc. IASTED 11th Int. Conf. Parallel & Distributed Computing & Syst., volume 10, pages 764–769, 1999.

[7] T.W. O’Neil, S. Tongsima, and E. H.-M. Sha. Optimal scheduling of data-flow graphs using extended retiming. In Proc. ISCA 12th Int. Conf. Parallel & Distributed Computing Syst., pages 292–297, 1999.

[8] T.W. O’Neil and E. H.-M. Sha. Optimal graph transformation using extended retiming with minimal unfolding. In Proc. IASTED 12th Int. Conf. Parallel & Distributed Computing & Syst., volume I, pages 128–133, 2000.

[9] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, Inc., 1994.

[10] L.-F. Chao and E. H.-M. Sha. Static scheduling for synthesis of DSP algorithms on various models. J. VLSI Signal Process., 10:207–223, 1995.

[11] F. Gasperoni and U. Schwiegelshohn. Generating close to optimum loop schedules on parallel processors. Parallel Process. Letters, 4:391–403, 1994.

[12] F. Gasperoni and U. Schwiegelshohn. Transforming cyclic scheduling problems into acyclic ones. In Scheduling Theory and Its Applications, pages 241–258. John Wiley & Sons, 1995.

[13] C. Hanen and A. Munier. Cyclic scheduling on parallel processors: An overview. In Scheduling Theory and Its Applications, pages 194–226. John Wiley & Sons, 1995.

[14] M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Trans. Circuits & Syst., CAS-28:196–202, 1981.

[15] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw-Hill, Inc., 1991.

[16] K.K. Parhi and D.G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Comput., 40:178–195, 1991.

[17] T.W. O’Neil and E. H.-M. Sha. Unfolding a split-node data-flow graph. In Proc. IASTED 14th Int. Conf. Parallel & Distributed Computing & Syst., pages 717–722, 2002.

[18] T.W. O’Neil and E. H.-M. Sha. Minimizing resources in a repeating schedule for a split-node data-flow graph. In Proc. IEEE/ACM 12th Great Lakes Symposium on VLSI, pages 136–141, 2002.