C Compiler Design For A Network Processor

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/3224767

C Compiler Design for a Network Processor

Article  in  IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems · December 2001
DOI: 10.1109/43.959859 · Source: IEEE Xplore

CITATIONS READS
29 1,237

2 authors, including:

Jens Wagner
Hochschule für Technik, Wirtschaft und Kultur Leipzig
36 PUBLICATIONS   112 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

TrustNode/RE research router for IoT View project

Dezibot View project

All content following this page was uploaded by Jens Wagner on 17 September 2015.

The user has requested enhancement of the downloaded file.


1302 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2001

C Compiler Design for a Network Processor


J. Wagner and R. Leupers

Abstract—One important problem in code generation for em- have already been developed. These include code generation
bedded processors is the design of efficient compilers for target for irregular data paths [3]–[7], address code optimization
machines with application-specific architectures. This paper out- for DSPs [8]–[11], and exploitation of multimedia instruction
lines the design of a C compiler for an industrial application-spe-
cific instruction-set processor (ASIP) for telecom applications. The sets [12]–[14]. It has been shown experimentally, that such
target ASIP is a network processor with special instructions for highly machine-specific techniques are a promising approach
bit-level access to data registers, which is required for packet-ori- to generate high-quality machine code, whose quality often
ented communication protocol processing. From a practical view- comes close to handwritten assembly code. Naturally, this has
point, we describe the main challenges in exploiting these applica- to be paid with increased compilation times in many cases.
tion-specific features in a C compiler and we show how a compiler
backend has been designed that accommodates these features by While partially impressive results have been achieved in
means of compiler intrinsics and a dedicated register allocator. The code optimization for ASIPs in the DSP area, less emphasis has
compiler is fully operational and first experimental results indicate been so far on a new and important class of ASIPs for bit-serial
that C-level programming of the ASIP leads to good code quality protocol processing, which are called network processors
without the need for time-consuming assembly programming.
(NPs). The design of NPs has been motivated by the growing
Index Terms—Bit-level addressing, code generation, compilers, need for new high bandwidth communication equipment in net-
network processors. works (e.g., Internet routers and Ethernet adapters) as well as in
telecommunication (e.g., ISDN and xDSL). The corresponding
I. INTRODUCTION communication protocols mostly employ bit-stream-oriented
data formats. The bit streams consist of packets of different

T HE USE of application-specific instruction-set processors


(ASIPs) in embedded system design has become quite
common. ASIPs are located between standard “off-the-shelf”
length, i.e., there are variable length header packets and
(typically longer) payload packets. Typical packet processing
requirements include decoding, compression, encryption, or
programmable processors and custom application-specific inte- routing.
grated circuits (ASICs). Hence, ASIPs represent the frequently A major system design problem in this area is that the re-
needed compromise between high efficiency of ASICs and quired high bandwidth leaves only a very short time frame (as
low development effort associated with standard processors or low as a few nanoseconds) for processing each bit packet ar-
cores. While being tailored toward certain application areas, riving at a network node [15]. Even contemporary high-end pro-
ASIPs still offer programmability and, hence, high flexibility grammable processors can hardly keep pace with the required
for debugging or upgrading. Industrial examples for ASIPs real-time performance, not to mention the issue of computa-
are Tensilica’s configurable Xtensa reduced instruction-set tional efficiency, e.g., with respect to power consumption.
computer (RISC) processor [1] and the configurable Gepard
There are several approaches in ASIC design that deal with
digital-signal-processor (DSP) core from Austria Micro Sys-
efficient bit-level processing, such as [16]–[19]. However, all
tems [2].
these solutions require highly application-specific hardware. On
Like in the case of standard processors, compiler support
the other hand, the design of hardwired ASICs is frequently not
for ASIPs is very desirable. Compilers are urgently required
desirable due to the high design effort and low flexibility. As
to avoid time-consuming and error-prone assembly program-
a special class of ASIPs, NPs represent a promising solution
ming of embedded software, so that fast time-to-market and
to this problem, since their instruction sets are tailored toward
dependability requirements for embedded systems can be
efficient communication protocol processing. The advantage of
met. However, due to the specialized architectures of ASIPs,
this is illustrated in the following.
classical compiler technology is often insufficient, but fully ex-
Since the memories of transmitters and receivers normally
ploiting the processor capabilities demands for more dedicated
show a fixed word length (e.g., 8 or 16 bits), relatively expen-
code generation and optimization techniques.
sive processing may be required on both sides when using stan-
A number of such code generation techniques, intended to
dard processors (Fig. 1): at the beginning of a communication,
meet the high code quality demands of embedded systems,
the packets to be transmitted are typically aligned at the word
boundaries of the transmitter. For storing these words into the
Manuscript received April 9, 2001; revised June 29, 2001. This work was send buffer, they have to be packed into the bit stream format
supported by Informatik Centrum Dortmund, by Infineon Technologies, and by
Agilent Technologies. This paper was recommended by Guest Editor P. Mar- required by the network protocol. After transmission over the
wedel. communication channel, the packets have to be extracted again
The authors are with the University of Dortmund, Department of Computer at the receiver side, so as to align them at the receiver word
Science XII, Embedded Systems Group, 44221 Dortmund, Germany (e-mail:
wagner@icd.de; leupers@icd.de). length, which may even be different from the transmitter word
Publisher Item Identifier S 0278-0070(01)09994-8. length.
0278–0070/01$10.00 © 2001 IEEE
WAGNER AND LEUPERS: C COMPILER DESIGN FOR A NETWORK PROCESSOR 1303

Fig. 2. Infineon NP architecture.

Fig. 3. Processing of variable-length bit packets.


Fig. 1. Communication via bit-stream-oriented protocols.

Obviously, this data conversion overhead reduces the bene-


fits of the bit-stream-oriented protocol. In contrast, NPs may be
designed to be capable of directly processing bit packets of vari-
able length, i.e., in the form they are stored in the receive buffer.
This feature largely reduces the data transport overhead.
NPs are relatively new on the semiconductor market. There
are only a few standard chips (e.g., from Intel and IBM) and sev-
eral in-house designs (see the overview in [15], which also de-
scribes NP development efforts at STMicroelectronics). In this
paper, we focus on a specific machine, the Infineon Technolo-
gies NP [20].
Fig. 4. Data layout in memory and registers. Bit packets are not aligned at
Efficient compiler design for NPs is at least as challenging as memory or register word lengths.
for DSPs, since the dedicated bit-packet-oriented instructions
are not easily generated from a high-level language like C. In
contrast to the approach taken in [15], which is based on the II. TARGET ARCHITECTURE
retargetable FlexWare tool suite [21], we decided to develop a Fig. 2 shows the overall architecture of our target machine,
nearly full custom compiler backend. This is essentially moti- the Infineon NP [20]. The NP core shows a 16-bit RISC-like
vated by the need to incorporate C language extensions and a basic architecture with 12 general-purpose registers and special
dedicated register allocator, which will become clear later. An- extensions for bit-level data access. This principle is illustrated
other approach related to our work is the Valen-C compiler [22], in Fig. 3.
a retargetable compiler that allows the specification of arbitrary The NP instruction set permits performing arithmetic logic
bit widths of C variables. However, there is no direct support for unit (ALU) computations on bit packets that are not aligned
NP applications. at the processor word length. A packet may be stored in any
The purpose of this paper is to show how an efficient C com- bit index subrange of a register and a packet may even span
piler for an advanced NP architecture has been implemented and up to two different registers. In this way, protocol processing
to describe the required machine-specific code generation tech- can be adapted to the required variable packet lengths instead
niques. More specifically, we show how bit packet processing of the fixed machine word length. However, this packet-level
is made available to the programmer at the C level and how the addressing is only possible within registers, not within memory.
register allocator needs to be designed to handle variable-length Therefore, partial bit streams have to be loaded from memory
bit packets in registers, which is not directly possible by clas- into registers before processing on the ALU can take place
sical techniques. (Fig. 4). The size and position of the different bit fields are
The remainder of this paper is structured as follows. In Sec- statically known from the C source code for each specific
tion II, the Infineon NP architecture and its instruction set are de- application.
scribed in more detail. Section III outlines the problems associ- In order to enable packet-level addressing of unaligned data,
ated with modeling bit-level processing in the C language. Sec- the NP instruction set permits the specification of offsets and
tion IV describes the compiler design, with focus on the backend operand lengths within registers. The general instruction format
components. Experimental results are presented in Section V. is as follows:
Finally, we give conclusions and mention directions for future
work in Section VI.
1304 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2001

Fig. 5. Packet-level addressing within registers.

“CMD” denotes the assembly command, “reg1.off” and


“reg2.off” denote argument registers with a corresponding
offset, and “width” is the bit width of the operation to be
performed. The use is shown in Fig. 5: any bit packet is
addressed by means of the corresponding register number, its
offset within the register, and the packet bit width. If offset
plus width are larger than the register word length (16 bits),
then the packet spans over two registers (without increasing
the access latency, though). This feature is very challenging Fig. 6. Array of bit packets.
especially from a compiler designer’s viewpoint. The width of
argument and result packets must be identical and one of the Even though this expression may still be simplified some-
two argument registers is also the result location of any ALU what by standard compiler optimization techniques, e.g., con-
operation. Therefore, two offsets and one width parameter per stant folding, it would translate into a relatively large instruc-
instruction are sufficient. tion sequence on a standard processor. In contrast, the NP can
implement the entire assignment within a single instruction. For
III. BIT-PACKET PROCESSING IN C this purpose, we introduce a packet access (PA) CKF of the fol-
lowing form:
Although possible, the description of bit packet-level ad-
dressing in the C language is inconvenient, since it can only
be expressed by means of a rather complex shift and masking
scheme. The code readability (and, thus, maintainability) is
poor and furthermore the masking constants might make the Parameter “op” denotes the operator, “var1” and “var2” are
code machine-dependent in case they depend on the processor the operands, “off1” and “off2” are the operand packet offsets,
word length. and “width” denotes the packet bit width. The CKF directly re-
flects the packet-level NP instructions illustrated in Fig. 5. The
A. Use of Compiler-Known Functions
“op” parameter selects the operation (e.g., ADD, SUB, SHIFT,
As outlined in Section II, the NP instruction set allows to ) to be performed on the arguments “var1” and “var2”. In ad-
avoid costly shift and mask operations by means of special dition, the required intraregister offsets and the packet bit width
instructions for packet-level addressing. In the C compiler, are passed to function PA. Using function PA, the above ex-
bit-packet manipulation is made visible to the programmer by ample can be expressed very simply in C as follows:
means of compiler-known functions (CKFs) or compiler intrin-
sics. The compiler maps calls to CKFs not into regular function
calls, but into fixed instructions or instruction sequences.
Thus, CKFs can be considered as C-level macros without any
calling overhead. The CKF approach has also been used in
several C compilers for DSPs, e.g., for the Texas Instruments
Incorporated C62xx.
Parameter (selecting an ADD instruction) is spec-
Using CKFs, the programmer still has to have detailed knowl-
ified as a constant in a C header file. The scalar variables and
edge about the underlying target processor, but readability of the
are mapped to registers in the assembly code by the register
code is improved significantly. In addition, by providing a suit-
allocator.
able set of simulation functions for the CKFs, C code written for
the NP is no longer machine-dependent, but can also be com- B. Bit-Packet Access in Loops
piled to other host machines for debugging purposes.
We illustrate the use of CKFs with a simple example. Con- As exemplified above, CKFs provide an elegant way to re-
sider a case where we would like to add the constant 2 to a duce the description effort of bit-packet processing in C. Most
7-bit-wide packet stored in bits 3–9 of some register denoted urgently, CKFs are required in case of bit-packet arrays, which
by the C variable . In standard C, this can only be expressed are indirectly accessed within loops, so as to avoid unacceptable
by means of a complex assignment as follows: quality of compiler-generated code.
Consider the example in Fig. 6, where we have to process an
array of eight packets, each of 10-bit length. The bit-packet array
WAGNER AND LEUPERS: C COMPILER DESIGN FOR A NETWORK PROCESSOR 1305

If the number of physical registers is lower than the number


of simultaneously required register variables, spill code will be
inserted. The register allocator uses the names of the pointer
registers in such a case to identify the register arrays that have
to be loaded into the register file for a given indirect bit-packet
operation. The INIT CKF translates to assembly code as the load
of a constant into a register. In this constant the name of the
register, the offset, and the width of the bit packet where the
pointer points to is encoded. In the example from Fig. 8, the
pointer PR2 points to the first element of the bit-packet array.
The name of the register where the bit packet is located is not
Fig. 7. Bit-packet array access in a loop.
known before register allocation. Therefore, the backend works
on symbolic addresses before register allocation. The symbolic
addresses are automatically replaced after the register allocation
phase. is the CKF for a single indirect addition,
like in the C expression “ .” The backend creates a single
machine operation for this CKF. In order to keep the number
of CKFs low, we specify the arithmetic operation as the first
parameter of the CKF instead of having a dedicated CKF for
each operation.
The CKF INC denotes the increment of a bit-packet pointer.
Like the increment of a pointer in ANSI C, the pointer will
point to the next array element after the call to INC. Because
the NP supports bit-packet pointer arithmetic in hardware, this
Fig. 8. Bit-packet array access with CKFs. requires again only a single machine instruction, independent of
whether or not advancing the pointer requires crossing a register
in turn is stored in an array of five 16-bit registers (R0–R4). As a boundary.
consequence, the bit packets are not aligned at the register word Obviously, the source code in the example is specified in a
length and some packets even cross register boundaries. very low-level programming style. However, the programmer
Suppose, we want to compute the sum over the bit packets still gets a significant benefit from use of the compiler. First
within a loop. In standard C, this would require code as shown in of all, it permits the use of high-level language constructs, e.g.,
Fig. 7. Due to the unaligned bit packets, the register file pointer for control code and loops. In addition, address generation and
elem must not be incremented in every loop iteration, but only if register allocation are performed by the compiler, which keeps
a bit packet crosses a register boundary. Therefore, control code the code reusable and saves development time.
is required within the loop, which is obviously highly undesir-
able with respect to code quality.
In order to avoid such overhead, the NP instruction-set ar- IV. COMPILER DESIGN
chitecture provides means for indirect access to unaligned bit-
packet arrays via a bit-packet pointer register. In the compiler, Like most other compilers, the NP C compiler is subdivided
we again exploit this feature by CKFs. The modified C code into a frontend and a backend part. The frontend is responsible
with CKFs for the above sum computation example is given in for source code analysis, generation of an intermediate represen-
Fig. 8. The array is declared with the C-type attribute reg- tation (IR), and machine-independent optimizations, while the
ister. This attribute instructs our compiler backend to assign the backend maps the machine-independent IR into machine-spe-
whole array to the register file. In contrast, a regular C com- cific assembly code.
piler would store the array (like other complex data structures) As a frontend, we use the LANCE compiler system developed
in memory. This concept of register arrays is required, since the at the University of Dortmund [28]. LANCE is a machine-inde-
NP machine operations using packet-level addressing only work pendent optimizing ANSI C frontend. There is no support for
on registers, not on memory. We assume that register arrays al- automatic retargeting, but LANCE comprises a backend inter-
ways start a register boundary. face that allows for direct coupling between the generated three
The variables PR1 and PR2 are pointers. By being operands address code IR and the data-flow tree format used by contem-
of the CKF INIT, they are introduced to the backend as pointers porary code generator tools.
to bit packets. The compiler checks that pointers to bit packets The C compiler backend is subdivided into code selection and
are assigned exclusively within CKFs, since otherwise incorrect register allocation modules. The code selector maps data flow
code might result. The backend exploits the knowledge about trees as generated by the LANCE C frontend and its backend
which pointer belongs to which element/array during life time interface into NP assembly instructions. As in many compilers
analysis in the register allocation. In the example, PR2 is used for RISCs, during this phase an infinite number of virtual regis-
to traverse the different bit packets of array , while PR1 con- ters are assumed, which are later folded to the available amount
stantly points to the sum variable. of physical registers by the register allocator.
1306 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2001

A. Code Selection
The code selector uses the widespread technique of tree pat-
tern matching with dynamic programming [23] for mapping
data flow trees (DFTs) into assembly code. The basic idea in
this approach is to represent the target machine instruction set
in the form of a cost-attributed tree grammar, and parsing each
given DFT with respect to that grammar. As a result, an optimum
derivation for the given cost metric, and hence an optimal code
selection, are obtained. The runtime complexity is only linear in
the DFT size.
For the implementation, we used the OLIVE tool (an exten-
sion of IBURG [24] contained in the SPAM compiler [25]),
that generates code selector C source code for a given instruc- Fig. 9. Construction of the SIG. Virtual register sets fR1; R2g and
fR3; R4; R5g represent two register arrays, while R6 refers to a scalar
tion set represented by a tree grammar. Specifying the instruc- variable.
tion set with OLIVE is convenient, since the tool permits to at-
tach action functions to the instruction patterns, which facilitates
book-keeping and assembly code emission. some register pointer are assumed to be live while is in use.
The LANCE frontend splits each C function into a set of basic Liveness of is determined by inspecting the pointer initializa-
blocks, each of which contains an ordered list of DFTs. The tions in the calls to CKF INIT (see Fig. 8).
DFTs, which are directly generated in the format required for Due to the allocation constraints imposed by register arrays,
OLIVE, are passed to the generated code selector and are trans- the mapping of virtual registers to physical registers is based
lated into assembly code sequences one after another. During on a special multilevel graph coloring algorithm. Physical reg-
this phase, also the calls to CKFs are detected and are directly isters are assigned to those virtual registers first that belong to
transformed into the corresponding NP assembly instructions. register arrays. This is necessary since register arrays present
This step is rather straightforward, since CKFs are simply iden- higher pressure for the register allocator than scalar registers.
tified by their name. However, the code selector, in cooperation First, any node set in the original interference graph that be-
with the register allocator, is still responsible for a correct reg- longs to a certain register array is merged into a supernode.
ister mapping, since CKFs are called with symbolic C variables Then, the interference graph is transformed into a superinter-
instead of register names. The result of the code selection phase ference graph (SIG), while deleting all edges internal to each
is symbolic assembly code with references to virtual registers. supernode and all scalar virtual register nodes and their incident
This code is passed to the register allocator described in the fol- edges (Fig. 9).
lowing. Next, a weight is assigned to each supernode , which is equal
to the number of internal virtual registers of plus the maximum
number of internal virtual registers of ’s neighbors in the SIG.
B. Register Allocation
The supernodes are mapped to physical registers according to
Although the NP shows a RISC-like basic architecture, the descending weights. This heuristic is motivated by the fact that
classical graph coloring approach to global register allocation supernodes of a lower weight are generally easier to allocate be-
[26] cannot be directly used. The reason is the need to handle cause they cause less lifetime conflicts. Furthermore, in case of
register arrays. As explained in Section III (see also Figs. 6 a conflict, it is cheaper to spill/reload a smaller array instead of a
and 8), register arrays arise from indirect addressing in C pro- larger one. For any supernode with internal virtual registers,
grams, where unaligned bit packets are traversed within loops. a contiguous range in the register file is assigned. Since there
As a consequence, virtual registers containing (fragments of) may be multiple such windows available at a certain point of
bit-packet arrays have to be assigned to contiguous windows in time, the selection of this range is based on the best fit strategy
the physical register file. in order to ensure a tight packing of register arrays in the reg-
In order to achieve this, the register allocator maintains two ister file, i.e., in order to avoid too many spills.
sets of virtual registers: one for scalar values and one for reg- In our approach, any element of a register array can be ac-
ister arrays. All virtual registers are indexed by a unique number, cessed in two different ways: first by direct addressing (e.g.,
where each register array gets a dedicated, unique, and con- ) or indirectly by the use of a bit-packet pointer. In case of
tiguous index range. As usual, register allocation starts with a insufficient physical registers using indirect access, spill code is
lifetime analysis of virtual registers. Potential conflicts in the generated for all virtual registers within a register array; other-
form of overlapping life ranges are represented in an interfer- wise, only the particular virtual register is spilled. After register
ence graph, where each node represents a virtual register, and allocation, all symbolic addresses for bit packets have to be re-
each edge denotes a lifetime overlap. The lifetime analysis is calculated because now they specify a physical register within
based on a standard def-use analysis of virtual registers [27]. the register file instead of the name of a virtual register.
During lifetime analysis, special attention has to be paid to bit The permissible size of bit-packet arrays is constrained by
packets indirectly addressed via register pointers, whose values the size of the physical register file. Therefore, no register al-
might not be known at compile time. In order to ensure program location can be done for register arrays that are larger than the
correctness, all register array elements potentially pointed to by register file and the compiler needs to reject such code with cor-
WAGNER AND LEUPERS: C COMPILER DESIGN FOR A NETWORK PROCESSOR 1307

TABLE I arrays, respectively. Column 4 shows the performance gain in


EXPERIMENTAL PERFORMANCE RESULTS percent.
The use of packet-level addressing resulted in an average per-
formance gain of 28% over the original C reference implemen-
tations without CKFs and register arrays. Naturally, this mainly
has to be attributed to the NP hardware itself. However, from
a system-level perspective, it has been very important to prove
that this performance gain can also be achieved by means of
compiled C programs instead of handwritten assembly code.
As a result of our evaluation, we believe that the introduc-
responding error messages. Note that for such code, it would be tion of CKFs and register arrays represents a reasonable com-
at least very difficult to find a valid register allocation even man- promise between programming effort and code quality. CKFs
ually. For certain source programs, one solution to this problem give the programmer direct access to dedicated instructions,
is to split large register arrays into smaller pieces already in the which are important for optimizing the “hot spots” in a C ap-
C source code and to perform computations on a set of small ar- plication program, while the compiler still performs the other-
rays instead of a single large one. However, this only works for wise time-consuming task of register allocation. For noncritical
programs in which the subarrays required at each program point program parts, where high performance bit-level operations are
are already known at compile time. Eventually, a solution can al- hardly an issue, the productivity gain offered by the compiler
ways be found by mapping bit-packet arrays to memory just like versus assembly programming clearly compensates the poten-
regular data arrays in which case the advantages of packet-level tial loss in code quality.
addressing are naturally lost in exchange for a safe fallback po-
sition.
After register allocation for the supernodes has been VI. CONCLUSION AND FUTURE WORK
performed, all remaining virtual registers in the original inter-
ference graph are mapped to physical registers by traditional We have outlined compiler challenges encountered for NPs, a
graph coloring [26], while inserting spill code whenever new class of ASIPs, that allow for efficient protocol processing
required. by means of packet-level addressing. We have described the im-
plementation of a C compiler for a real-life industrial NP. In
order to make packet-level addressing accessible at the C lan-
V. RESULTS guage level, the main concepts in this compiler are the use of
CKFs and a special register allocation technique. Experimental
The C compiler for the NP described in the previous sections results indicate that these techniques work in practice, so that the
is fully operational. The performance of the generated code has processor features are well exploited. Although the detailed im-
been measured by means of a cycle-true NP instruction-set sim- plementation is machine-specific, we believe that the main tech-
ulator for a set of test programs. niques can be easily ported to similar NPs for which a growing
As may be expected, the quality of compiler-generated code compiler demand may be expected in the future. An example
as compared to handwritten assembly code largely depends on is the Intel i960, whose instruction set also partially supports
the clever use of the CKFs and the underlying register array con- bit-packet addressing.
cept. When using CKFs without specific knowledge of the ap- Improved versions of the NP C compiler are already planned.
plication, the performance overhead of compiled code may be Ongoing work deals with gradually replacing the pragmatic ap-
several hundred percent, which is clearly not acceptable for the proach of CKFs with more sophisticated code selection tech-
intended application domain. This is due to a massive increase niques, capable of directly mapping complex bit masking op-
in register pressure when too many register arrays are simultane- erations into single machine instructions. This will be enabled
ously live, which naturally implies a huge amount of spill code. by the use of special tree grammars that model the instruction
On the other hand, a careful use of CKFs, as derived from de- set for the code selector. In addition, we plan to include a tech-
tailed application knowledge, generally leads to a small perfor- nique similar to register pipelining [29] in order to reduce the
mance overhead in the order of only 10%. We observed that this register-memory traffic for multiregister bit packets and several
overhead can be even reduced further by means of instruction peephole optimizations are being developed in order to further
scheduling techniques to reduce register lifetimes (and thereby close the quality gap between compiled code and handwritten
spill code), as well as by peephole optimizations, which so far assembly code.
have not been implemented.
It is also interesting to consider the improvement offered
by CKFs as compared to regular C code. Table I shows the REFERENCES
performance for six small test routines. These mainly include
[1] Xtensa RISC processor, Tensilica Inc. [Online]. Available:
packet-oriented arithmetic operations on bit streams, as well http://www.tensilica.com
as masking, counting, and checksum computations. The C [2] Gepard DSP core, Austria Mikro Systeme International. (2000). [On-
programs have been compiled into NP machine code. Columns line]. Available: http://asic.amsint.com/databooks/digital/gepard.html
[3] B. Wess, “Automatic instruction code generation based on trellis dia-
2 and 3 give the simulated performance (clock cycles) of the grams,” in Proc. IEEE Int. Symp. Circuits and Systems, May 1992, pp.
compiled code without and with the use of CKFs and register 645–648.
1308 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2001

[4] G. Araujo and S. Malik, “Optimal code generation for embedded [24] C. W. Fraser, D. R. Hanson, and T. A. Proebsting, “Engineering a simple,
memory nonhomogeneous register architectures,” in Proc. 8th Int. efficient code generator generator,” ACM Lett. Program. Lang. Syst., vol.
Symp. System Synthesis, Sept. 1995, pp. 36–41. 1, no. 3, pp. 213–226, 1992.
[5] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, “Code opti- [25] A. Sudarsanam, “Code optimization libraries for retargetable compila-
mization techniques for embedded DSP microprocessors,” in Proc. 32nd tion for embedded digital signal processors,” Ph.D. dissertation, Dept.
Design Automation Conf., June 1995, pp. 599–604. Elect. Eng., Princeton Univ., Princeton, NJ, 1998.
[6] A. Timmer, M. Strik, J. van Meerbergen, and J. Jess, “Conflict modeling [26] P. Briggs, “Register allocation via graph coloring,” Ph.D. dissertation,
and instruction scheduling in code generation for in-house DSP cores,” Dept. Comput. Sci., Rice Univ., Houston, TX, 1992.
in Proc. 32nd Design Automation Conf., Dec. 1995, pp. 593–598. [27] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers—Principles, Tech-
[7] S. Bashford and R. Leupers, “Constraint driven code selection for fixed- niques, and Tools. Reading: Addison-Wesley, MA 1986.
point DSPs,” in Proc. 36th Design Automation Conf., June 1999, pp. [28] R. Leupers, Code Optimization Techniques for Embedded Proces-
817–822. sors. Norwell, MA: Kluwer, 2000.
[8] D. H. Bartley, “Optimizing stack frame accesses for processors with re- [29] D. Callahan, S. Carr, and K. Kennedy, “Improving register allocation
stricted addressing modes,” Software—Practice and Experience, vol. 22, for subscripted variables,” in Proc. ACM SIGPLAN Conf. Programming
no. 2, pp. 101–110, Feb. 1992. Language Design and Implementation, June 1990, pp. 53–65.
[9] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, “Storage
assignment to decrease code size,” in Proc. ACM SIGPLAN Conf.
Programming Language Design and Implementation, June 1995, pp.
186–195.
[10] R. Leupers and F. David, “A uniform optimization technique for offset
J. Wagner received the B.S. degree in electrical
assignment problems,” in Proc. 11th Int. System Synthesis Symp., Dec.
engineering from the Engineer School of Eisleben,
1998, pp. 3–8.
Eisleben, Germany, in 1993, the B.S. degree in
[11] E. Eckstein and A. Krall, “Minimizing cost of local variables access
computer science from the Bolton Institute, Bolton,
for DSP processors,” in Proc. ACM SIGPLAN Workshop Languages,
U.K., in 1996, and the Diploma degree in computing
Compilers, and Tools for Embedded Systems, May 1999, pp. 20–27.
from Leipzig University of Applied Sciences,
[12] R. J. Fisher and H. G. Dietz, “Compiling for SIMD within a register,”
Leipzig, Germany, in 1997. He is currently working
in Proc. 11th Annu. Workshop Languages and Compilers for Parallel
toward the Ph.D. degree in computer science at the
Computing, Aug. 1998, pp. 290–304.
University of Dortmund, Dortmund, Germany.
[13] R. Leupers, “Code selection for media processors with SIMD instruc-
He has been working in electronics and the IT in-
tions,” in Proc. Design Automation and Test Eur., 2000, pp. 4–8.
dustry for six years and joined the Embedded Sys-
[14] S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism
tems Group at the University of Dortmund in 1999 as a Ph.D. student. He is
with multimedia instruction sets,” in Proc. ACM SIGPLAN Conf.
currently with the technology transfer company Informatik Centrum Dortmund
Programming Language Design and Implementation, June 2000, pp.
(ICD), Dortmund, Germany, where he designs customized compilers for appli-
145–156.
cation-specific processors. His current research interests include assembly code
[15] P. Paulin, “Network processors: A perspective on market requirements,
optimization techniques, retargetable compilers, and hardware–software code-
processors architectures, and embedded S/W tools,” in Proc. Design Au-
sign.
tomation and Test Eur., 2001, pp. 420–427.
[16] D. Brooks and M. Martonosi, “Dynamically exploiting narrow width
operands to improve processor power and performance,” in Proc. 5th
Int. Conf. High-Performance Computer Architecture, Jan. 1999.
[17] M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth analysis with
application to silicon compilation,” in Proc. ACM SIGPLAN Conf. Pro- R. Leupers received the Dipl.-Inform. and Ph.D.
gram Language Design and Implementation, June 2000, pp. 108–120. degrees in computer science with distinction from
[18] S. C. Goldstein, H. Schmidt, M. Moe, M. Budiu, S. Cadambi, R. R. the University of Dortmund, Dortmund, Germany,
Taylor, and R. Laufer, “PipeRench: A coprocessor for streaming mul- in 1992 and 1997, respectively.
timedia acceleration,” in Proc. 26th Annu. Int. Symp. Computer Archi- He is currently a Senior Researcher and Lecturer
tecture, May 1999, pp. 28–39. with the Computer Science Department of the
[19] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. University of Dortmund. Since 1993, he has been a
Taylor, “PipeRench: A reconfigurable architecture and compiler,” IEEE Member of the Embedded Systems Group at Dort-
Comput., vol. 33, pp. 70–77, Apr. 2000. mund, where he is responsible for research projects
[20] X. Nie, L. Gazsi, F. Engel, and G. Fettweis, “A new network processor in the area of retargetable and optimizing compilers
architecture for high-speed communications,” in Proc. IEEE Workshop for embedded processors. In addition to his research
Signal Processing Systems, 1999, pp. 548–557. and teaching activities, he also heads the embedded software tools development
[21] C. Liem, Retargetable Compilers for Embedded Core Proces- team at the technology transfer company Informatik Centrum Dortmund
sors. Norwell, MA: Kluwer, 1997. (ICD), Dortmund, Germany. He has been a Co-Organizer of the Software
[22] A. Inoue, H. Tomiyama, H. Okuma, H. Kanbara, and H. Yasuura, “Lan- and Compilers for Embedded Systems workshop series and he serves on the
guage and compiler for optimizing datapath widths of embedded sys- program committees of several design automation, digital signal processor, and
tems,” IEICE Trans. Fundam., vol. E81-A, no. 12, pp. 2595–2604, Dec. compiler conferences. He authored Retargetable Code Generation for Digital
1998. Signal Processors (Norwell, MA: Kluwer, 1997) and Code Optimization
[23] A. V. Aho, M. Ganapathi, and S. W. K. Tjiang, “Code generation using Techniques for Embedded Processors (Norwell, MA: Kluwer, 2000).
tree matching and dynamic programming,” ACM Trans. Program. Lang. Dr. Leupers received a Best Paper Award at the 2000 Design Automation and
Syst., vol. 11, no. 4, pp. 491–516, 1989. Test in Europe Conference.

View publication stats

You might also like