
Generic Netlist Representation

for System and PE Level Design Exploration


Bita Gorjiara, Mehrdad Reshadi, Pramod Chandraiah, Daniel Gajski
Center for Embedded Computer Systems, University of California, Irvine
{bgorjiar, reshadi, pramodc, gajski}@cecs.uci.edu

ABSTRACT
Designer productivity and design predictability are vital factors for successful embedded system design. Shrinking time-to-market and increasing complexity of these systems require more productive design approaches starting from high-level languages such as C. On the other hand, tight constraints of embedded systems require careful design exploration at system level (coarse grained exploration) and at the processing-element (PE) level (fine grained exploration).
In this paper we present GNR, a formal modeling approach developed to improve the productivity of designing systems and processing elements, the same way that traditional ADLs improved productivity for designing processors. The GNR is an order of magnitude shorter than state-of-the-art ADLs with RTL generation capabilities and yet can capture any structural details that affect the implementation quality. Using a relatively short GNR description, we explored several designs for implementing an MP3 decoder and achieved a speedup of 3.25 compared to a MicroBlaze processor. We have also developed a web-based interface for our tools, so that users can upload and evaluate new architectures described in GNR. Our toolset and GNR are an intermediate step towards synthesis of TLM to RTL.

Categories and Subject Descriptors
B.5.2 [Design Aids] Automatic synthesis; C.0 [General] Systems specification methodology, Modeling of computer architecture.

General Terms
Design, Performance, Languages.

Keywords
Architecture Description Language, application-specific processor, system design, modeling, synthesis, NISC, GNR.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CODES+ISSS'06, October 22–25, 2006, Seoul, Korea.
Copyright 2006 ACM 1-59593-370-0/06/0010...$5.00.

1. INTRODUCTION
Designer productivity and design predictability are vital factors for successful embedded system design. Shrinking time-to-market and increasing complexity of these systems require more productive design approaches. Hence, embedded systems are increasingly designed using software (high-level languages such as C) rather than directly implementing them in RTL. Tight constraints of embedded systems require careful design exploration at system level (coarse grained exploration) and at the processing-element (PE) level (fine grained exploration). Such explorations can result in considerable improvement in terms of performance, power consumption, area, and manufacturability. Furthermore, we believe that design flows that give more control to the designers over the final implementation will generate more predictable results.
Architecture Description Languages (ADLs) have been proven to be productive for design of Application Specific Instruction-set Processors (ASIP). The ADL captures the behavior or structure of the processor and is used by the tools that compile the application and simulate the results. A few approaches have also offered automated or semi-automated RTL synthesis of the processor, which can improve the designer’s productivity. It is desirable to extend the ADL-based approaches to capture entire systems as well. However, ADL-based design flows always assume that the architecture has a predefined instruction-set. This assumption creates three problems: (a) they cannot be used for dedicated hardware executing a fixed application (IP), where instructions impose unnecessary overhead, or for the entire system, where no instruction-set can be defined; (b) such ADLs are lengthy and complex because they contain either a behavioral description of all instances of instructions, or a structural description of the instruction decoder; (c) RTL generated from instruction behaviors has unpredictable quality.
To address the above issues, in this paper we present a Generic Netlist Representation (GNR) that can be used for generating programmable and dedicated custom pipelined IPs from a high level C description of the application. It can capture a single IP or a system composed of several IPs. In contrast to ASIP approaches, our target processing elements (PEs), called No-Instruction-Set-Computers (NISC), do not have a predefined instruction-set. In our approach, the accurate netlist of the datapath components is described in GNR. Using this GNR, a cycle-accurate compiler compiles the C code of the application directly on the input datapath and generates the control words for each clock cycle. The outputs of this compiler and the input GNR are used to generate the simulatable and synthesizable RTL code of the PE. Generally, most of the designer’s experience, skill and innovation go into the design of the datapath. Our approach improves design predictability by giving the designer complete control over the datapath. On the other hand, design of the controller is a tedious, time-consuming and error-prone process. By automating this process and by allowing reuse of previously designed datapaths and components, designer productivity is also significantly improved in our approach.
The GNR can also capture a system containing several communicating custom IPs. It can be used as the output of TLM-based synthesis tools. After modeling and verifying a system at the transaction level, it can be converted to GNR for synthesis. Each low level TLM communication command (e.g. send/receive) is mapped to an intrinsic C function representing a communication component at the hardware level. In this paper, we present a formalism for modeling a system and its components including programmable and dedicated custom pipelined IPs. The GNR is formal and hence it allows checking rules and reducing semantic errors in the design. It provides support for third-party cores, and the same GNR description is used for compilation, simulation and RTL generation. Since the designer does not describe the controller in our approach, the GNR descriptions are much shorter than other ADLs. We have developed a web-based interface for our toolset, so that users can upload and evaluate new architectures described in GNR. Our compiler supports various architectural features such as controller/datapath pipelining, multi-cycle/pipelined units, and heterogeneous forwarding paths. The compilation algorithm and the datapath optimizations have been discussed in [9] and [10], respectively.
The rest of the paper is organized as follows. Sections 2 and 3 explain the GNR modeling approach and its syntax. Section 4 discusses the details of GNR using several examples. Section 5 presents the flow of our tools, followed by experimental results in Section 6. Section 7 presents related works and Section 8 concludes the paper.

2. GNR MODELING APPROACH
GNR models a system as a hierarchical description of components (objects) and their connections (composition). GNR contains a set of predefined components and port types. These types are used for enforcing the composition rules. A typical system consists of several RTL components and processing elements (PEs). The behavior of each PE is captured in C language. In GNR, the PEs are represented by components of type behavioralIP. A behavioralIP may contain a custom datapath that is captured by a component of type NiscArchitecture. The NiscArchitecture contains basic RTL components that are used by our compiler. Figure 1 shows a simple example of a system with two PEs (BIP1, BIP2), a bus, and an arbiter. BIP2 is implemented by a programmable NISC and has a control memory (Cmem) and data memory (Dmem). In the rest of this section, we present the details of the GNR objects and composition rules.

Figure 1- A sample system in GNR.

2.1 GNR formalism
In GNR, a component x is represented by (τx, Px, Cx, Lx, Ax), where τx is the component’s type, Px is the set of ports, Cx is the set of components inside x, Lx is the set of its internal point-to-point connections, and Ax is the list of aspects that describe the behavior of x for different tools in the toolset. Component type τx is defined as follows:
τx ∈ T, T = {register, register-file, bus, mux, tri-state buffer, functional-unit, memory-proxy, controller, NiscArchitecture, behavioralIP, module, system}
where NiscArchitecture, behavioralIP, module, system, and controller are hierarchical components and contain an internal netlist, while the others are basic RTL components with no internal netlist.
Each port p in Px has a bit-width βp, and a type θp defined as follows:
θp ∈ {clkPort, ctrlPort, inPort, outPort, cwPort}
Type clkPort shows the port is a clock, and type ctrlPort shows the port is used to control the component. For example, a register has one port of each type clkPort, inPort, outPort, and ctrlPort (i.e. load enable). Type cwPort means the port is a control-word port and is used to drive the control ports of the components in the NiscArchitecture (see Section 2.2).
The set of connections Lx is defined between a bit-slice of a port p1 and a similarly sized bit-slice of port p2 as follows:
\[ L_x = \{ (p_1, p_2, s_1, e_1, s_2, e_2) \mid p_1, p_2 \in ( P_x \cup \bigcup_{y \in C_x} P_y ),\ 0 \le s_1 \le e_1 < \beta_{p_1},\ 0 \le s_2 \le e_2 < \beta_{p_2},\ e_1 - s_1 = e_2 - s_2 \} \]
where s1 and s2 are the start indices of the bit-slices of p1 and p2, and e1 and e2 are their end indices.
Ax is a list of aspects required by different tools for processing component x. Aspects are defined based on component types. Currently, in our toolset, each component has three aspects: compilation aspect CAx, simulation aspect MAx, and synthesis aspect NAx. The compilation aspect usually captures the relation between the component’s behavior and the C-language operations, or application functions. Simulation and synthesis aspects usually contain the description of the component in an HDL, or the information required for generating a hardwired core (e.g. memory, divider, etc.). For some component types, if an aspect is not specified by the designer, the toolset will generate it automatically. For example, the simulation/synthesis aspects of hierarchical components can either be generated automatically from their internal components, or be explicitly specified by the designer. This feature allows modeling of third party cores and pre-laid-out components that have special technology or manufacturability considerations. Aspects are also used in defining proxy components in a NiscArchitecture. A proxy component is a component that resides outside of the IP block but is controlled by the IP. For example, a memory proxy represents a memory or cache hierarchy that resides outside of the IP. The HDL implementation of a proxy may be as simple as input to output wirings. However, its compiler aspect captures the information for controlling the external component. The NiscArchitecture and behavioralIP component types have additional properties as follows:
NiscArchitecture: The NiscArchitecture represents our target architecture, which does not have an instruction-set and whose control words are generated by the cycle-accurate compiler. The compiler aspect of a NiscArchitecture ξ is modeled by CAξ=(freqξ, CNSTξ, Γξ, sPtξ, fPtξ). The freqξ specifies the clock frequency of the NiscArchitecture and is used by the compiler to generate the proper control words considering the component delays. A control word contains the control values of components as well as a set of constant fields CNSTξ. The constant fields are used for jump and other operations with a constant operand. Each constant field f in CNSTξ has a bit-width or size denoted by βf. The Γξ is a function that defines the ordering of the constant and control fields in the control word. This ordering is used by the compiler to generate the correct control words. The sPtξ and fPtξ are storage components used for the stack pointer and frame pointer. The storage components can be separate registers or registers in a register file.
BehavioralIP: A behavioralIP is a component whose behavior is specified in C language, and is handled by our cycle-accurate compiler, a traditional compiler, or a high-level synthesis (HLS) tool. The compiler aspect of the behavioralIP specifies the set of application files (e.g. header files and C files) that execute on that IP. In our approach, the netlist of a behavioralIP contains a NiscArchitecture and, if necessary, a memory subsystem (Figure 1). The cycle-accurate compiler compiles the application C code directly on the datapath of the NiscArchitecture. The behavioralIP can cover instruction-set based general-purpose or custom processors as well, where the synthesis aspect is usually a third-party core.
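
The component tuple (τx, Px, Cx, Lx, Ax) and the bit-slice constraint on Lx from Section 2.1 can be mirrored in a few lines of code. The following Python sketch is an illustration only: the names Port, Component, visible_ports and connection_is_well_formed are hypothetical and are not part of the GNR toolset, child ports are addressed by an assumed "child.port" naming scheme, and aspects are reduced to an opaque dictionary.

```python
from dataclasses import dataclass, field

# Port types (theta_p) and hierarchical component types as listed in Section 2.1.
PORT_TYPES = {"clkPort", "ctrlPort", "inPort", "outPort", "cwPort"}
HIERARCHICAL = {"NiscArchitecture", "behavioralIP", "module", "system", "controller"}


@dataclass
class Port:
    name: str
    width: int  # beta_p, the bit-width of the port
    kind: str   # theta_p, the port type

    def __post_init__(self):
        if self.kind not in PORT_TYPES:
            raise ValueError(f"unknown port type: {self.kind}")


@dataclass
class Component:
    name: str
    ctype: str                                    # tau_x
    ports: dict = field(default_factory=dict)     # P_x, name -> Port
    children: dict = field(default_factory=dict)  # C_x, name -> Component
    links: list = field(default_factory=list)     # L_x, (p1, p2, s1, e1, s2, e2)
    aspects: dict = field(default_factory=dict)   # A_x, opaque in this sketch

    @property
    def is_hierarchical(self):
        return self.ctype in HIERARCHICAL

    def visible_ports(self):
        """P_x united with the ports of every direct child (assumed 'child.port' keys)."""
        found = dict(self.ports)
        for child in self.children.values():
            for p in child.ports.values():
                found[f"{child.name}.{p.name}"] = p
        return found


def connection_is_well_formed(x, link):
    """Check the membership condition of L_x for one candidate connection."""
    p1, p2, s1, e1, s2, e2 = link
    ports = x.visible_ports()
    if p1 not in ports or p2 not in ports:
        return False
    return (0 <= s1 <= e1 < ports[p1].width
            and 0 <= s2 <= e2 < ports[p2].width
            and e1 - s1 == e2 - s2)


# A two-component module: a 32-bit register output feeding a 32-bit ALU input.
reg = Component("R", "register", ports={"q": Port("q", 32, "outPort")})
alu = Component("ALU", "functional-unit", ports={"a": Port("a", 32, "inPort")})
dp = Component("DP", "module", children={"R": reg, "ALU": alu})

ok = connection_is_well_formed(dp, ("R.q", "ALU.a", 0, 31, 0, 31))   # equal-size slices
bad = connection_is_well_formed(dp, ("R.q", "ALU.a", 0, 31, 0, 15))  # slice sizes differ
```

In this sketch the first connection is accepted and the second is rejected, because the two bit-slices have different sizes.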

2.2 GNR Rules
Our formal and typed description allows us to define rules to validate the correctness of the given netlist. Enforcing such rules significantly improves the productivity of the designer by identifying most of the problems without simulation. Depending on the component type, the rules can restrict the number and types of the ports, the instantiated components, and their connectivity. There are two groups of rules: general rules, and NISC-specific rules.
General rules:
• Clock ports can only connect to clock ports:
∀(p1,p2, …)∈Lx, θp1=clkPort if and only if θp2=clkPort
• Connections in Lx are defined between source ports (i.e. outPort) and destination ports (i.e. inPort). For boundary connections (i.e. the connections that involve ports in Px), the input ports of Px must be the source and its output ports must be the destination.
• A maximum of one connection is allowed to any bit of any destination port. The only exception is for input ports of bus-type components, where multiple connections are valid. In digital design, connecting several output ports to a single input port is not valid, unless through tri-state buffers.
∀(p1,p2,s1,e1,s2,e2), (p3,p4,s3,e3,s4,e4)∈Lx, if p2=p4, then (p2∈Px and τx=bus) or (s2>e4) or (s4>e2)
NISC-specific rules:
• Each NiscArchitecture ξ has one and only one component of type controller:
∃! x∈Cξ, where τx=controller
• Only a component x with τx=controller can have a port of type cwPort, and it must have exactly one:
∃! p∈Px and θp=cwPort if and only if τx = controller
• Each NiscArchitecture ξ has at least one component of type register-file:
∃ x∈Cξ, where τx=register-file
• In NiscArchitecture ξ, the bit-width of the cw port of the controller component must be equal to the sum of the bit-widths of all control ports, plus the sum of the bit-widths of all constant fields in CNSTξ.
\[ \forall cw \in P_c:\ \theta_{cw} = \text{cwPort} \Rightarrow \beta_{cw} = \sum_{p \in CP_\xi} \beta_p + \sum_{f \in CNST_\xi} \beta_f, \quad \text{where } CP_\xi = \{ p \mid p \in \bigcup_{x \in C_\xi} P_x \text{ and } \theta_p = \text{ctrlPort} \} \]
• Control connections in NiscArchitecture ξ are defined between the cw port and the control ports of components in Cξ.
∀(p1,p2,s1,e1,s2,e2)∈Lx, if p2∈CPξ, then θp1=cwPort and s2=0 and e2=(e1−s1)=βp2−1

3. GNR SYNTAX
We use XML language [12] to describe IP models in GNR. We define GNR syntax in XML Schema [13] to enforce syntax and semantics checking on the given input model. The Schema can also be used for code completion, which further increases the productivity of the designers. Figure 2 shows the partial block diagram of the Schema for modeling a custom IP (NiscArchitecture). The IP has several children tags including: <Ports>, <Components>, <Connections>, <CwFields>, <Compiler-aspect>, <Simulation-aspect>, and <Synthesis-aspect>, representing Pξ, Cξ, Lξ, Γξ, CAξ, MAξ, and NAξ, respectively. All components in GNR have a <Params> tag that parameterizes that component. For example, the delay or bit-width of the component can be specified as parameters.

Figure 2- Block diagram of GNR schema for NiscArchitecture.

4. EXAMPLE GNR MODELS
In this section, we discuss modeling IPs in more detail using several examples. We first explain how a simple component, namely an ALU, is defined in GNR. Then, we explain how components are integrated to form a simple IP that can execute C code. Finally, we show how this IP is extended for use in a system.

4.1 Modeling a custom ALU
An ALU is a component of type functional-unit. Figure 3 shows the GNR description of a custom ALU that executes three operations: Add, Sub, Not. The component has two parameters: BIT_WIDTH and DELAY. The parameters are initialized during the instantiation of a component in a datapath. This ALU has two input ports, one output port and a control port. Since this ALU executes three operations, the size of the ctrl port is at least two bits. The simulatable and synthesizable code of the ALU are described in the <Simulation-aspect> and <Synthesis-aspect> (not shown in the figure). For some components, it is also possible to generate the HDL description automatically from the component entity information and compiler aspect.

Figure 3- Partial description of a custom ALU in GNR.

In <Compiler-aspect> the operations that the ALU executes are described in detail. Each operation has a name and a delay attribute: the name is selected from the list of valid C operations, and the delay is specified in terms of number of cycles or nanoseconds, according to the selected target technology. Each operation has a set of input ports and at most one output port. An operation may also require a specific value on one or more control ports. The values are specified using the <Ctrl> tag. Using this modeling approach, new functional units can be described and added to the library.
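
The sizing argument for the ALU control port can be made concrete with a small model. The Python sketch below is a hypothetical illustration, not the content of Figure 3: the control encoding (0, 1, 2), the one-cycle delays, and the helper names min_ctrl_width, encode and alu are all assumptions chosen for the example.

```python
# Hypothetical compiler-aspect table: operation name -> control value and delay in cycles.
ALU_OPS = {
    "Add": {"ctrl": 0, "delay": 1},
    "Sub": {"ctrl": 1, "delay": 1},
    "Not": {"ctrl": 2, "delay": 1},
}


def min_ctrl_width(num_ops):
    """Smallest number of bits that gives each operation a distinct code."""
    return max(1, (num_ops - 1).bit_length())


def encode(op, width):
    """Binary string driven on the control port to select the given operation."""
    value = ALU_OPS[op]["ctrl"]
    if value >= 2 ** width:
        raise ValueError(f"{op} does not fit in a {width}-bit control port")
    return format(value, f"0{width}b")


def alu(op, a, b=0, bit_width=32):
    """Behavioral model of the three operations, truncated to BIT_WIDTH bits."""
    mask = (1 << bit_width) - 1
    if op == "Add":
        return (a + b) & mask
    if op == "Sub":
        return (a - b) & mask
    if op == "Not":
        return ~a & mask
    raise KeyError(op)


ctrl_width = min_ctrl_width(len(ALU_OPS))  # three operations need two control bits
```

With three operations the computed width is two bits, which matches the lower bound stated in the text; a fourth operation would still fit in the same port.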

Some functional units are more complex than others. For example, some of them are pipelined, or may require instantiation of hardwired cores provided by a third party. In case of a pipelined unit, a netlist of the main functional unit and the pipeline registers is defined as a module in GNR. Most of today’s synthesis tools apply retiming to the netlist, and generate a proper pipelined functional unit. In case of hardwired cores, the information of the third party tool that must be called for core generation is specified in <Synthesis-aspect>.

4.2 Modeling a simple IP
Figure 4(a) shows the block diagram of a simple NiscArchitecture that can execute simple C code. The architecture consists of a controller, a register file (RF), a data memory proxy, an ALU, a comparator, and a few multiplexers. The bus-width of the IP is 32 bits. The register file has 32 registers, two read ports and one write port.

Figure 4- Block diagram of a simple IP (a) without, (b) with communication interface.

In this IP, suppose that a constant field of 10 bits is used for operations with a constant operand. Figure 5 shows the GNR description of the IP. The IP has one clock port, a reset port, and several IO ports for communicating with the data memory unit. The <Netlist> tag shows the components and connections of the IP. For each instantiated component the proper parameters such as BIT_WIDTH and REG_COUNT are initialized. Thirty-four connections are defined for this IP. Each connection determines the source component src, source port sPort, destination component dest, and destination port dPort. Among these connections, 19 are shown in Figure 5, and the rest are clock and control connections.
In <Compiler-aspect> the ordering of the control fields is specified by listing the fields in tag <CwFields>. This information is used by the compiler for generating the control words. In this architecture, the total bit-width of the control ports is 35 bits, and the constant width is 10 bits. Therefore, the bit-width of the control words is 45 bits.

4.2.1 Automatic generation of control and clock connections
In order to further simplify the datapath description, if the control connections are not explicitly specified, we generate them automatically by analyzing the components added to the architecture. This improves the productivity significantly because adding the control connections is very error-prone. Our modeling approach allows automatic generation of control connections and control fields, because we distinguish the control ports from other types of ports. Similarly, the clock connections can be added automatically. In this architecture, automatically adding the control connections and control fields reduces the description size by 25%, while reducing the design and validation time by more than a factor of two.
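
One possible way to derive control fields and control connections from the typed ports is sketched below in Python. It is an assumption-laden illustration rather than the toolset's algorithm: the function name generate_control_fields, the port names and widths of the small datapath, the controller port name CTRL.cw, and the chosen field ordering (control fields first, from bit 0 upward, followed by constant fields) are all hypothetical.

```python
def generate_control_fields(components, const_fields):
    """Lay out control-word fields and create cw-to-ctrl-port connections.

    components:   list of (component name, [(ctrl port name, width), ...])
    const_fields: list of (field name, width)
    Returns (fields, connections, cw_width), where each connection is a tuple
    (p1, p2, s1, e1, s2, e2) in the style of the set L_x.
    """
    fields, connections = [], []
    offset = 0
    for comp, ctrl_ports in components:
        for port, width in ctrl_ports:
            s1, e1 = offset, offset + width - 1
            fields.append((f"{comp}.{port}", s1, e1))
            # A slice of the cw port drives the whole control port: s2 = 0, e2 = width - 1.
            connections.append(("CTRL.cw", f"{comp}.{port}", s1, e1, 0, width - 1))
            offset += width
    for name, width in const_fields:
        fields.append((name, offset, offset + width - 1))
        offset += width
    return fields, connections, offset


# Hypothetical datapath fragment: a 32-entry register file and a three-operation ALU.
datapath = [
    ("RF", [("raddr1", 5), ("raddr2", 5), ("waddr", 5), ("wen", 1)]),
    ("ALU", [("op", 2)]),
]
fields, connections, cw_width = generate_control_fields(datapath, [("const", 10)])
```

For this fragment the control word is 16 + 2 + 10 = 28 bits wide. Applying the same sum to the architecture in the text, with 35 bits of control ports and a 10-bit constant field, gives the 45-bit control word stated above.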

4.2.2 Expanding the IP for communication


In order to use the simple IP of Figure 4(a) in a system we need to add
communication capability to it. For example, to connect the component
to a double-handshake bus protocol in message-passing mode, we need
to add an interrupt unit (IU) and a proper communication-interface unit
(CI) to the datapath of the IP. The CI has two send and receive queues
controlled by a control port. The block diagram of the new IP is shown
in Figure 4(b). In the C code of the application, the CI component is
programmed through a set of intrinsic-functions that are described in
GNR description of CI. The cycle-accurate compiler detects these
functions in the code and translates them to proper control signals for
the CI. The details of the bus protocol and CI drivers are available in
[16]. This IP is instantiated inside a behavioralIP as shown in Figure 1.
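
The translation of intrinsic functions into control values for the CI can be pictured with a toy example. The Python sketch below is only a rough illustration under stated assumptions: the intrinsic names ci_send and ci_receive, the two-bit control encoding, and the regular-expression scan are hypothetical, and a real compiler would recognize the calls on its intermediate representation rather than on source text.

```python
import re

# Hypothetical intrinsic table: C function name -> value driven on the CI control port.
CI_INTRINSICS = {"ci_send": 0b01, "ci_receive": 0b10}

# Match a call to any listed intrinsic; \b prevents matches inside longer identifiers.
CALL = re.compile(r"\b(" + "|".join(map(re.escape, CI_INTRINSICS)) + r")\s*\(")


def ci_control_sequence(c_source):
    """Return the CI control values for the intrinsic calls, in source order."""
    return [CI_INTRINSICS[m.group(1)] for m in CALL.finditer(c_source)]


example = "x = ci_receive(); y = f(x); ci_send(y);"
sequence = ci_control_sequence(example)  # receive first, then send
```

Ordinary function calls such as f(x) are ignored, so only the communication intrinsics contribute control values for the CI.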

Figure 5- GNR description of the IP in Figure 4(a).

5. GENERATING RTL FROM GNR
Figure 6 shows the block diagram of our toolset. The inputs of the toolset are the GNR description of the system and the application C code. The outputs include synthesizable and simulatable RTL code.
The Pre-Processor first verifies the syntax of the given GNR file using the GNR Schema. Next, it completes the netlist by (a) resolving the parameters of the components, (b) adding the missing clock and control connections, and (c) adding the control fields, as explained in Section 4.2. The semantic correctness of the completed netlist is verified afterwards, and proper warning and error messages are reported by the Pre-Processor. The netlist checker reports unconnected ports, invalid connectivity, and non-existing referenced component and port names. GNR modeling enables additional checking that is not possible using HDL-based structural descriptions or even SystemC. For example, in GNR, if a data port is mistakenly connected to a clock port, or if multiple output-ports are connected to one input port of a non-bus

component, then it is possible to detect and report the problem. Note that such connections are valid in HDLs but they result in an incorrect design behavior. Using such simple checking in GNR, most architecture problems are quickly determined.

Figure 6- The flow of our toolset (blocks: GNR Model, C code, Pre-Processor, Core Translator, Cycle-accurate Compiler, Third-Party Core Generator, HDL Generator, Synthesizable Code, Simulatable Code).

The Cycle-accurate compiler compiles the C code of each PE on the given datapath using the algorithm presented in [9]. If a specific operation required by the C code is not supported by a given datapath, then the compiler displays proper error messages. After compilation, the compiler generates the contents of the data and control memories. The HDL Generator uses the GNR and the outputs of the compiler to produce the final simulatable and synthesizable code. The simulatable code is mostly behavioral and simulates much faster than the synthesizable code. The Core Translator generates the input files for the third-party core generator by extracting proper information and parameters from the GNR model. The produced cores are combined with the generated HDL code to form the synthesizable code. An online version of the toolset is available at [11].

6. EXPERIMENTAL RESULTS
As experimental results, we designed different system architectures using GNR, and ran a fixed-point MP3 decoder (11000 lines of C code downloaded from [14]) on them. We explored system-level customizations and PE-level customizations in order to maximize the performance gain. For all experiments, we generated Verilog RTL code, and simulated and synthesized it on a Xilinx Virtex II FPGA using the Xilinx ISE 8.1 toolset. We measured the execution delay of the MP3 decoder for processing one frame.
We profiled the MP3 decoder to identify its computationally intensive parts. The profiling results showed that during processing of each frame most of the execution time is spent inside DCT and IMDCT filters. Therefore, we can accelerate the execution of these filters using dedicated DCT and IMDCT cores. In this section, we present five system architectures: System1 includes a MicroBlaze and an OPB bus (Xilinx cores) for off-chip memory communication; System2 extends System1 by adding one DCT IP; System3 extends System1 by two parallel DCT IPs; System4 adds one DCT IP and two IMDCT IPs to System1; and System5 includes only one custom IP that runs the entire MP3 decoder. For the filters and the entire MP3 we designed two custom datapaths and used our cycle-accurate compiler to compile the corresponding code on them. The customizations include adding multiple constant fields, proper pipelining and data forwarding. In System5, we also added an integer divider core provided by Xilinx LogiCore.
Our current component library has several communication-interface components for double-handshake bus protocols (DHS). However, OPB uses a master-slave protocol that is not yet implemented in our library. Therefore, in order to communicate between MicroBlaze and our custom IPs, we used a bridge (similar to [15]) that converts the two protocols to each other. Figure 7 shows the block diagram of System4, which includes MicroBlaze, OPB bus, bridge, DHS bus, and three custom IPs (one DCT and two IMDCTs).

Figure 7- Block diagram of system 4.

Table 1- Performance, memory and area of the five systems.
            # Cycles (millions)   Delay (s)   Speedup   # FPGA Slices
System 1    2.7                   0.0540      1.00      1270
System 2    2.54                  0.0508      1.06      6008
System 3    2.47                  0.0494      1.09      8376
System 4    1.24                  0.0248      2.17      10750
System 5    0.83                  0.0166      3.25      2600

We captured all five systems including the two custom IPs in GNR and used our tools to compile the partitioned C code and generate Verilog RTL code for simulation and synthesis. Table 1 shows the performance and area of the five systems. The second column shows the total number of cycles for decoding a frame. The third column shows the overall delay of the systems running at a 50MHz clock frequency. The fourth column shows the speedup of the systems compared to System1. The fifth column shows the area of the designs in terms of number of FPGA slices. To play 38 frames per second (as required by the MP3 standard), processing one frame should not take more than 0.026 seconds. System1 processes each frame in 0.054s, and therefore cannot meet the deadline. Among the other four systems, only System4 and System5 can meet the deadline. System4 and System5 run 2.17 and 3.25 times faster than System1. However, System5 consumes 4.1 times less area compared to System4. Therefore, System5 is a better design choice for the MP3 application.

Table 2- Specification vs. Generated code size.
            GNR lines of code                     Verilog lines of code
            system   IP    Total   modified       simulatable code   synthesizable code
system 1    70       NA    70      -              NA                 NA
system 2    181      363   544     474            2600               22000
system 3    220      363   583     40             4100               41000
system 4    285      363   648     65             6400               50200
system 5    25       432   457     150            2400               32000

Table 2 shows the size of the GNR files compared to the size of the generated RTL files. The second column of the table shows the GNR lines of code for description of the systems. This includes instantiating the RTL and behavioralIP components in the systems and connecting them together. In case of System5, only one IP is instantiated and hence the size of this file is very small. The third column of the table shows the GNR lines of code for describing the IPs. Note that the same IP (with 363 lines), with different parameters, is instantiated once, twice, and three times in System2, System3, and System4, respectively. In System5, the IP is more complex and hence has more lines of GNR code (432). The fourth column of the table shows the total number of GNR lines of code for each system, i.e. the sum of GNR lines for describing the system and its IPs. Note that, since in our experiments

we changed one system to create the next, we did not need to rewrite the whole description again. The number of modified lines of code in each step is shown in the fifth column of the table. For example, when generating System3 from System2, we reused the IP description and only needed to modify the system description to instantiate and connect it (40 lines). The last two columns of the table show the size of the Verilog and other core related files that are generated automatically. Note that, while the GNR descriptions are only a few hundred lines of code, the generated files are several thousand lines. This shows the productivity gain of using the GNR.
Overall, we could perform different system level (coarse-grained) and IP level (fine-grained) architecture explorations using relatively small GNR descriptions. The productivity gain was due to several factors including: parametrizable component descriptions, static rule checking, and automatic compilation and RTL generation for the custom IPs. Since GNR enabled us to make detailed architectural adjustments, we were able to achieve significant performance improvement while meeting the area constraints.

7. RELATED WORKS
Over the past years, several ADLs and their supporting software tools have been introduced. A complete survey of these ADLs can be found in [1], [2]. Among these ADLs only the following have directly or indirectly addressed synthesis of the architecture.
LISA [3], a state-of-the-art commercial product, and EXPRESSION [4] are behavioral ADLs that capture a processor in terms of its instruction-set behavior and a high level block diagram of its pipeline. They were originally designed for compilation and simulation and have been recently extended to generate the RTL of the processor by synthesizing the instruction behaviors. Since instruction behaviors are described at a very high abstraction level in order to be used by the compiler, achieving a high quality synthesis in these approaches is less likely. Furthermore, the designer has no control over the details of the final implementation and is limited to describing the functionality of instructions. Since these ADLs are behavioral, they must capture all possible configurations of instructions. This can lead to very lengthy descriptions. For example, in LISA the description of two RISC processors with four and seven pipeline stages has been reported to be more than 2000 and more than 9000 lines of code, respectively [8].
UDL/I [5] is a hardware description language (HDL) that captures the architecture at the Register-Transfer (RT)-level. A target specific compiler can be generated based on the instruction set extracted from

8. CONCLUSION AND FUTURE WORK
In this paper we presented GNR, a formal modeling approach developed to improve the productivity of designing systems and processing elements, the same way that traditional ADLs improved productivity for designing processors. GNR captures a system as a hierarchical netlist of components annotated by compilation, simulation and synthesis aspects. Our tools and GNR improve the productivity of system design by means of parametrizable component descriptions, static rule checking, and automatic compilation and RTL generation for the custom PEs.
Furthermore, GNR enhances the designer's control over structural details of the design and hence improves design predictability. Using a relatively short GNR description, we explored several designs for implementing an MP3 decoder and achieved a speedup of 3.25 compared to a MicroBlaze processor. Future work will address TLM to GNR translation.

9. REFERENCES
[1] P. Mishra and N. Dutt, “Architecture Description Languages for Programmable Embedded Systems”, IEE Proc. on Computers and Digital Techniques (CDT), Special issue on Embedded Microelectronic Systems: Status and Trends, vol. 152, no 3, 2005.
[2] W. Qin and S. Malik, “Architecture Description Languages for Retargetable Compilation”, in The Compiler Design Handbook: Optimizations & Machine Code Generation. Y. N. Srikant and Priti Shankar, CRC Press, 2002.
[3] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, A. Wieferink, and H. Meyr. A Novel Methodology for the Design of Application Specific Instruction Set Processors (ASIP) Using a Machine Description Language. IEEE Transactions on Computer-Aided Design, 20(11):1338–1354, Nov. 2001.
[4] P. Mishra, A. Kejariwal, and N. Dutt, “Synthesis-driven Exploration of Pipelined Embedded Processors”, International Conference on VLSI Design, 2004.
[5] H. Akaboshi, “A Study on Design Support for Computer Architecture Design”, Doctoral Thesis, Depart. of Information Systems, Kyushu Univ., Japan, Jan. 1996.
[6] R. Leupers and P. Marwedel, “Retargetable Code Generation based on Structural Processor Descriptions,” Design Automation for Embedded Systems, vol. 3, no. 1, 1998.
[7] R. Leupers, P. Marwedel, “Retargetable Generation of Code Selectors from HDL Processor Models”, European Design and Test, 1997.
the UDL/I description. UDL/I cannot support architecture with any [8] A. Chattopadhyay, D. Kammler, E. Witte, O. Schliebusch, H.
instruction level parallelism. Ishebabi, B. Geukes, R. Leupers, G. Ascheid, “Automatic Low Power
MIMOLA [6] is another HDL that captures the architecture netlist at Optimizations during ADL-driven ASIP Design”, VLSI-DAT, 2006.
RT-Level and is used for hardware synthesis, simulation, test [9] M. Reshadi, D. Gajski, “A Cycle-Accurate Compilation Algorithm
generation, and code generation. The RECORD compiler [7] extracts for Custom Pipelined Datapaths”, CODES+ISSS, 2005.
behavioral model of instructions from MIMOLA HDL. It processes the [10] B. Gorjiara, D. Gajski, “Custom Processor Design Using NISC: A
structure of the datapath from destination storages towards source Case-Study on DCT algorithm”, ESTIMEDIA, 2005.
storages to extract valid register transfers (RTs). After analyzing the [11] http://www.cecs.uci.edu/~nisc
controller and the instruction decoder, it rejects illegal RTs that do not [12] XML: http://www.w3.org/XML/
correspond to an instruction, and uses the remaining RTs in the [13] XML Schema: http://www.w3.org/XML/Schema
compiler. MIMOLA does not support pipelined architectures and [14] http://www.underbit.com/products/mad/
assumes single cycle operations. Furthermore, designer must describe [15] H. Cho, S. Abdi, D. Gajski, “Design and Implementation of
the instruction decoder from which the compiler will extract the set of Transducer for ARM-TMS Communication”, In Proc. ASPDAC,
valid operations. Although RT-level descriptions are more amicable to Design Contest, 2006.
hardware designers, describing the instruction decoder at RT-level is
[16] B. Gorjiara, M. Reshadi, D. Gajski, “NISC Communication
very tedious. Also instruction set extraction from RT-level is very
Interface”, Center for Embedded Computer Systems (CECS)
difficult and is typically possible only for limited target scope.
Technical Report TR 06-05, 2006.
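
Illustrative note on the related-work discussion: the destination-to-source traversal attributed to the RECORD compiler [7] can be pictured with a small Python sketch. This is not RECORD's actual algorithm or data model. The toy netlist, the node kinds ("storage", "mux", "op"), and the function names are all hypothetical, and the decoder analysis is reduced to a set of multiplexer selections that the decoder is assumed to be able to produce.

```python
from itertools import product

# Hypothetical toy netlist: every node lists the nodes driving its inputs.
# "storage" nodes end a backward walk, "mux" nodes pick one input,
# "op" nodes combine all of their inputs.
NETLIST = {
    "ALU": {"kind": "op", "inputs": ["R0", "MUX"]},
    "MUX": {"kind": "mux", "inputs": ["R1", "MEM"]},
    "R0": {"kind": "storage", "inputs": ["ALU"]},
    "R1": {"kind": "storage", "inputs": ["MEM"]},
    "MEM": {"kind": "storage", "inputs": ["R0"]},
}


def driven_values(node):
    """Walk backward from a node; return (expression, mux selections) pairs."""
    info = NETLIST[node]
    if info["kind"] == "storage":
        # A source storage terminates the walk.
        return [(node, frozenset())]
    if info["kind"] == "mux":
        # Each mux input is a separate alternative; record the selection made.
        results = []
        for src in info["inputs"]:
            for expr, sels in driven_values(src):
                results.append((expr, sels | {(node, src)}))
        return results
    # An operation needs one value on every input: take all combinations.
    results = []
    for combo in product(*(driven_values(src) for src in info["inputs"])):
        args = ", ".join(expr for expr, _ in combo)
        sels = frozenset().union(*(s for _, s in combo))
        results.append((f"{node}({args})", sels))
    return results


def extract_transfers():
    """Enumerate register transfers, starting from each destination storage."""
    transfers = []
    for dest, info in NETLIST.items():
        if info["kind"] != "storage":
            continue
        for driver in info["inputs"]:
            for expr, sels in driven_values(driver):
                transfers.append((dest, expr, sels))
    return transfers


def legal_transfers(allowed_selects):
    """Keep only transfers whose mux selections the decoder is assumed to allow."""
    return [(d, e) for d, e, s in extract_transfers() if s <= allowed_selects]


if __name__ == "__main__":
    for dest, expr in legal_transfers({("MUX", "R1")}):
        print(f"{dest} <- {expr}")
```

In this toy example the structural walk finds four candidate transfers, and the assumed decoder restriction rejects the one that routes MEM through the multiplexer. A real extraction would additionally have to handle combinational loops, timing, and the full controller description, which is part of why the text calls instruction set extraction from RT-level descriptions difficult.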
