SYSTEM DESIGN
AND VERIFICATION
Pao-Ann Hsiung
Marco D. Santambrogio
Chun-Hsian Huang
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged, please write and let us know so
we may rectify this in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Hsiung, Pao-Ann.
Reconfigurable system design and verification / Pao-Ann Hsiung, Marco D.
Santambrogio, and Chun-Hsian Huang.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-6266-3 (hardcover : alk. paper)
1. Embedded computer systems. 2. System design. 3. Computer
systems--Verification. I. Santambrogio, Marco D. (Marco Domenico) II. Huang,
Chun-Hsian. III. Title.
TK7895.E42H79 2009
004.2'1--dc22    2008044105
To my beloved family
- Pao-Ann Hsiung, Marco D. Santambrogio, and Chun-Hsian Huang
Preface
Readership
This book can be used as both a textbook and a reference book. As a textbook,
it can be used for courses such as reconfigurable computing, reconfigurable sys-
tem design, FPGA systems, and hardware-software co-design. The examples,
case studies, and exercises provided with each chapter illustrate the theory be-
hind the techniques and thus help students and readers to put the knowledge
into practice.
Organization
The book focuses on the individual techniques used in the design and verifi-
cation of reconfigurable systems and not on individual applications or frame-
works. The aim of the book is to make the techniques available to different
applications as required by the designer. The book is divided into seven
chapters, starting with an introduction to reconfigurable computing and con-
cluding with a detailed design and implementation flow. The seven chapters
provide an introduction to reconfigurable systems, FPGA technology with dy-
namic reconfiguration, reconfigurable hardware design techniques, operating
system design for reconfigurable systems, dynamic reconfigurable system de-
sign flows, reconfigurable system verification techniques, and reconfigurable
system design implementation techniques.
Contents
In Chapter 1, the introduction to reconfigurable computing, we set the back-
ground for reconfigurable system design by introducing reconfiguration tech-
niques and architectures, tools and platforms, and design and verification
methodologies. Application examples are given to illustrate the techniques
and tools.
In Chapter 2, FPGA technology, we introduce the various types of reconfig-
uration architectures with emphasis on the FPGA, including its configuration
bitstream, chip families, configuration conventions, file formats, and different
reconfiguration techniques.
In Chapter 3, the reconfigurable hardware design techniques, we first de-
scribe the model for reconfigurable hardware and then describe in detail the
partitioning and scheduling techniques for reconfigurable hardware.
In Chapter 4, the operating system for reconfigurable systems, we describe
the motivation and the requirements for such an OS and then describe a
layered architecture for the OS, including the hardware layer and the configuration layer.
Teaching Supplements
The web page for the book (http://www.cs.ccu.edu.tw/~pahsiung/RSDVbook)
contains slides for teaching, model course syllabi, and all source code used in
the book.
Suggestions
We welcome all kinds of suggestions and comments on the book, including
any errors or omissions that you identify. Please e-mail them to
rsvd-book@embedded.cs.ccu.edu.tw.
Pao-Ann Hsiung
Department of Computer Science and Information Engineering
National Chung Cheng University, Taiwan, ROC.
Marco D. Santambrogio
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Milano, Italy
Chun-Hsian Huang
Department of Computer Science and Information Engineering
National Chung Cheng University, Taiwan, ROC.
[Figure 1.1: A reconfigurable computing architecture: a general-purpose processor coupled with reconfigurable hardware (tasks RH4, C5, RH6), shown before and after a reconfiguration.]
When a reconfigurable hardware task, such as the encryption task C1, has finished its
computation, the processor reconfigures the hardware logic to execute another
task, such as the signal processing task C6. During this reconfiguration pro-
cess, the image processing task C5 continues to execute without interruption.
The reconfigurable computing architecture can also be concisely described
as Hardware-On-Demand [164], general-purpose custom hardware [74], or a
hybrid approach between ASIC and GPP [170].
From the above illustration, we can also define reconfigurable computing
as a discipline in which the functions of a system or an application can be
altered by configuring a fixed set of logic resources through memory settings.
Examples of application functions include various kinds of transforms, filters,
codecs, and protocols. Examples of the fixed set of logic resources include
logic blocks, I/O blocks, routing blocks, memory blocks, and application-
specific blocks. Memory settings can be achieved by the programming of
configuration bits that control the functions of the logic resources.
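As a toy illustration of this definition, consider the following sketch (ours, not the book's) in Python: it models a 2-input LUT whose function is determined entirely by its four configuration bits, so "reconfiguring" it means rewriting memory settings rather than changing any wiring.

# A 2-input LUT modeled as 4 configuration bits: the logic resource is
# fixed; only the memory settings decide which function it computes.
def make_lut(config_bits):
    assert len(config_bits) == 4
    def lut(a, b):
        # The two inputs select one of the stored configuration bits.
        return config_bits[(a << 1) | b]
    return lut

and_gate = make_lut([0, 0, 0, 1])  # truth table of AND
xor_gate = make_lut([0, 1, 1, 0])  # same resource, new memory settings
assert and_gate(1, 1) == 1 and xor_gate(1, 1) == 0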
Reconfigurable computing also blurs the boundary between hardware and software [193],
which requires a system design methodology different from the classic design.
Formerly, the unified view of hardware and software
required by the codesign methodology could only be provided by cosimula-
tion platforms that struggled to cope with the large difference in simulation
speeds of hardware and software. With the advent of reconfigurable comput-
ing, hardware-software prototyping platforms solved this difference as both
hardware and software were running in real-time.
The classic design methodology starts by partitioning a system into hard-
ware and software. Since a software designer may not know the final hardware
architecture design and the hardware designer may also be unacquainted with
the software design flow, hardware and software are often implemented
independently and then integrated towards the end of the design flow. If prob-
lems crop up during integration, changing either the hardware or the software
could be quite difficult. This increases the maintenance difficulty and also
delays the time-to-market.
To address the problems mentioned above, a system design methodology
called HW-SW codesign was proposed, which emphasizes the consistency and
the integration between hardware and software. It is based on a system-level
view and thus eases both software verification and hardware error detection.
The HW-SW codesign methodology reduces the cost of design and also short-
ens the time-to-market.
Nevertheless, the high cost in hardware design is a major issue of the HW-
SW codesign flow because hardware must go through a time-consuming flow
including design, debug, manufacturing, and test. The inconvenience of hardware
manufacturing forces designers to search for alternatives. One way
is to use modeling languages such as SystemC [23] to simulate hardware and
software. Another method is to use concurrent process models to simulate
hardware and software tasks. However, simulation speed is a major bot-
tleneck [164]. To overcome the drawback, prototyping using reconfigurable
architectures has become the most appropriate choice for HW-SW codesign.
In contrast to ASIC design, reconfigurable hardware can be much more eas-
ily used to design hardware prototypes that can be integrated with software
to obtain performance and functional analysis results much more efficiently
and accurately. We can thus say that reconfigurable computing has acceler-
ated and enhanced the HW-SW codesign flow. Figure 1.2 compares classic
design, hardware-software codesign, and reconfigurable design. There is an
obvious schedule compaction in the system design flow when moving from
classic design to co-design, and then from co-design to reconfigurable design.
The first transition mainly results in hardware and software being designed
concurrently and thus reducing the design time. The second transition re-
sults in a compaction of the system design, which previously required lengthy
hardware-software cosimulation time.
In the rest of this chapter, we will first introduce the technology and the de-
vices used for reconfiguration, including the well-known Field-Programmable Gate
Arrays (FPGA) [73], in Section 1.4. We will then introduce some representative
tools and platforms.
[Figure 1.2: Classic design, hardware-software co-design, and reconfigurable design compared on a time axis: each transition (classic to co-design, co-design to reconfigurable design) yields a schedule compaction of the system analysis, HW, and SW design activities.]
Table 1.2: Commercial Reconfiguration Tools

Functionality        Tool Name            FPGA/EDA Company
Design Analysis      PlanAhead            Xilinx
FPGA Suite Tools     ISE Foundation       Xilinx
                     Quartus              Altera
                     FPGA Advantage       Mentor Graphics
FPGA Synthesizer     Synplify Pro         Synplicity
                     FPGA Compiler        Synopsys
                     Leonardo Spectrum    Mentor Graphics
                     Precision Synthesis  Mentor Graphics
Simulator            ModelSim             Mentor Graphics
                     NC-Sim               Cadence
                     Scirocco Simulator   Cadence
                     Spexsim              Verisity
                     VCS                  Synopsys
                     Verilog-XL           Cadence
Reconfigurable computing has found many popular and widely applied application
domains. In this section, we will discuss how reconfigurable computing has been
or is being applied to embedded systems, network security applications,
multimedia applications, scientific computing, and SATisfiability solvers.
More detailed applications can be found in [73].
Summary
This chapter was basically a whirlwind tour of the book, introducing several
new concepts that will be covered in more detail in later chapters. References
to all the other chapters can be found in this chapter. We discussed why we
need reconfigurable computing and what exactly reconfigurable computing is.
We also suggested viewing the progress in reconfigurable computing through
its transition from hardware-software codesign. Later, we introduced the
technology, the tools, the platforms, and the methodologies for designing and
verifying reconfigurable systems. We concluded this chapter with some
application examples.
2. Why did the FPGA devices, invented in the mid-1980s, take more than
two decades to be used in reconfigurable systems?
5. What are the main differences between Xilinx and Altera FPGAs?
6. Is there any difference between the tools and platforms used for the de-
sign of reconfigurable systems compared to those for conventional sys-
tems?
This chapter will then illustrate a particular technique that has been viable
for most recent FPGA devices, Partial Dynamic Reconfiguration. To fully
understand what this technique is, the concepts of reconfigurable computing,
static and dynamic reconfiguration, and the taxonomy of dynamic reconfig-
uration itself must be analyzed. In this way partial dynamic reconfiguration
can be correctly placed in the set of system development techniques that can
be implemented on a modern FPGA chip.
The Input/Output Blocks (IOBs) connect the signals of the internal logic to the
pins of the FPGA package. There is one and only one IOB for every I/O pin
of the chip package. The IOBs have their own configuration memory, storing
the voltage standards with which the pin must comply and configuring the
direction of the communication on it, making it possible to establish
mono-directional links in either direction, or bidirectional ones.
Finally, the interconnection resources within an FPGA allow the arbi-
trary connection of CLBs and IOBs. The main modes of interconnections are
direct and segmented.
Figure 2.2: FPGA using the direct interconnection model (a) and the seg-
mented interconnection one (b).
The three building blocks presented so far are interconnected in the device
to create a communication infrastructure composed of the communication
lines and IOBs around a bi-dimensional array of CLBs, which covers most of
the available die area and represents the true FPGA building block. The
memory cells attached to every configurable resource control its key features:
the I/O voltage standards of an IOB are controlled by particular values in its
corresponding memory cells, the interconnections among the communication
infrastructure are controlled by setting appropriate bits in the configuration
memory, and the equations in the LUTs are controlled in the same way.
All of this configuration memory is made of SRAM memory
elements, and it is therefore volatile: when the device is power-cycled all of its
configuration is lost and it must be started afresh with a new configuration.
Usually an external machine downloads the configuration onto the FPGA via
one of its configuration interfaces and sends a start command to signal that
the configuration has taken place. Some boards also offer a ROM where the
configuration is stored, so that it can be downloaded onto the FPGA at
power-up. The file that stores the information that is copied into the
configuration SRAM memory of the FPGA is called a bitstream, and it can be
either full or partial according to the extent of configuration memory
addressed in it.
Most FPGAs are not composed only of the three components described in the
previous paragraphs; they also have additional resources directly embedded
on the die. Such resources are RAM cells (called block RAM in Xilinx
FPGAs), processors, DSPs, multipliers, and so on. In this way system designers
can take advantage of hardware already present on the chip and integrate it
into their designs, without having to implement all of the desired
functionalities on the configurable resources, instead exploiting the speed
and the functions of pre-made embedded hardware cores. As an example, the
resource array of the Xilinx Virtex II FPGA XC2VP20 [216] is heterogeneous
because it is composed of a CLB array interleaved with hard-cores such as
an embedded PowerPC processor, 88 blocks of RAM each capable of 18Kb
of storage, 8 multi-gigabit transceivers, and so on. A design can therefore be
composed of two kinds of hardware: hard-cores and soft-cores. The hard-cores
on an FPGA thus enrich its functionalities and improve the overall speed of
architectures that make use of them.
[Figure 2.3: Structure of a configuration bitstream: a comment and padding words (0xFFFFFFFF), the synchronization word (0xAA995566), and then a sequence of packets, each made of a type 1 or type 2 header followed by its payload of configuration words.]
After the configuration logic has been aligned with the special synchroniza-
tion word, the bitstream is processed and interpreted on 32-bit boundaries,
word after word. The register writes at the beginning and at the end of the
actual bitstream, i.e., the part of the bitstream after the sync word, are
realized via a packet structure made of a header and a certain number of
payload words.
Type 1 packets are the most common packet type in the configuration
bitstream. Type 2 packets are only used when the number of bits in the
word count field of the type 1 header is insufficient to encode a large number of
configuration words. The type 2 packet was created only for the purpose
of reducing the overhead caused by the insertion of too many type 1 packets
in long writes to the same configuration register. The header for the type 2
packet is defined in [100]. The majority of the header is used to store the
count of the subsequent words and no space is devoted to the register address,
which is always defined in the previous word, which must be a type 1 packet
header with word count set to zero. Therefore a type 2 packet must always
follow a type 1 header, and it can be seen as just an extended word count
field for the type 1 packet.
The information needed to understand the bitstream format is therefore in
the headers of type 1 packets. In these headers, the opcode field is usually set
to a null operation or to a write operation in configuration bitstream. When
in write mode, the last 5 bits of the Address field select one of the possible
configuration registers, while the word count field indicates how many words
the configuration logic must expect as a payload of the header.
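The packet structure just described can be captured in a short parser sketch. The field offsets below follow the Virtex II style layout implied by the text (a 3-bit type, a 2-bit opcode, an address field whose last 5 bits select the register, and a word count); they are assumptions to be checked against [100] for any specific device.

def parse_packets(words):
    # words: the 32-bit words following the 0xAA995566 sync word.
    # Yields (type, opcode, register, word_count) per packet header.
    i, last_reg = 0, None
    while i < len(words):
        hdr = words[i]
        ptype = (hdr >> 29) & 0x7
        opcode = (hdr >> 27) & 0x3            # null or write operation
        if ptype == 1:
            last_reg = (hdr >> 13) & 0x1F     # last 5 bits of the address field
            count = hdr & 0x7FF               # assumed 11-bit word count
        elif ptype == 2:
            # A type 2 packet reuses the register address of the previous
            # type 1 header (which must have word count zero).
            count = hdr & 0x07FFFFFF          # extended word count
        else:
            break                             # not a packet header
        yield ptype, opcode, last_reg, count
        i += 1 + count                        # skip the payload words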
The Spartan 3 and Virtex II Pro FAR formats are defined in [191] and [218], while the
Virtex 4 FAR format is defined in [100].
Xilinx documentation has various names for the Block Address: it is also called Column Address.
[Figure 2.4: The fields of the Frame Address Register packed into a 32-bit configuration word: the TOP/BOTTOM bit, the ROW address, the MAJOR (column) address, the MINOR address, and the BLOCK address.]
For this reason the bitstream contains some padding data in order to complete the
final writes to the configuration memory. As stated previously, the FAR address is
automatically incremented whenever a whole frame has been written to the
FDRI register.
The majority of a bitstream is thus made of the words that are written to
the FDRI register, as shown in Figure 2.3.
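To make the FAR fields concrete, the sketch below packs them into one 32-bit word and mimics the auto-increment behavior. The bit offsets are placeholders chosen only to illustrate the idea; the real positions are family-specific and given in [191], [218], and [100].

# Hypothetical field offsets, for illustration only.
FAR = {"top_bottom": 22, "block": 19, "row": 14, "major": 6, "minor": 0}

def make_far(top_bottom, block, row, major, minor):
    # Pack the FAR fields into a single 32-bit configuration word.
    return ((top_bottom << FAR["top_bottom"]) | (block << FAR["block"])
            | (row << FAR["row"]) | (major << FAR["major"])
            | (minor << FAR["minor"]))

def next_far(far, frames_per_column):
    # Auto-increment as the configuration logic does after a whole frame:
    # bump the MINOR address, carrying into the MAJOR (column) address.
    minor = (far >> FAR["minor"]) & 0x3F
    if minor + 1 < frames_per_column:
        return far + (1 << FAR["minor"])
    return (far & ~(0x3F << FAR["minor"])) + (1 << FAR["major"])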
Figure 2.5: Virtex II Pro FPGA configuration memory mapping, from Xilinx
Virtex II Pro User Guide [218].
The figure gives information about what particular column of resources in the
array is being configured with the frames belonging to a column, and shows
the progression of the configured resources as the FAR is automatically in-
cremented by the configuration logic: the first column configures the central
clock routing resources, the next column configures the left IOB column, then their
1 in the Block Address field means that the frames of that particular column
are configuring the contents of the block ram columns on the device, and a
value of 2 in the Block Address means that the frames are configuring the
interconnections of the ram cells with the rest of the array. The variables in
this addressing scheme are given by the number of CLB columns and by the
number of BRAM columns, marked with the letters k and j, respectively, in
Figure 2.5.
Family               Model
Spartan 3 [217]      XC3S200
Virtex II Pro [216]  XCVP7, XCVP20
Virtex 4 [219]       XC4VFX12

Table 2.1: FPGA families, models, and reference manuals used in this book.
2.3.1 Spartan 3
The Spartan 3 family is the simplest of the three, and it offers fairly low-cost
devices. The available resources besides CLBs, block RAM, and IOBs are dedi-
cated 18-bit × 18-bit hardware multipliers and digital clock manager (DCM)
hard-cores. In particular the chosen model, XC3S200, has a CLB array com-
posed of 24 rows by 20 columns, interleaved by two columns in which reside
216 Kbits of block RAM and 12 hardware multipliers. Four digital clock man-
ager hard-cores are placed on the perimeter of the device in place of IOBs
in the positions corresponding to BRAM columns. An internal representation
of the device is illustrated in Figure 2.6, where the resources that break the
regular structure of the slice array have been highlighted in dark gray. Con-
figuration of the resources is performed on a columnar basis, i.e., a subset of
slices in a vertical column cannot be addressed for reconfiguration without
configuring the remainder of the column as well.
Figure 2.7: XCVP7 (left) and XCVP20 (right) internals as shown in Xilinx
Floorplanner.
2.3.3 Virtex 4
The Virtex 4 family is the most recent of the families taken into consider-
ation. The most important feature regarding dynamic reconfiguration is its
capability to be configured without the columnar constraint. Configuration
frames are now provided with a row address that points to a specific row of
resources on the device. The single row of CLBs is not addressable individ-
ually, as the rows addressed by the frames are groups of CLB rows.
ISE is the Integrated Software Environment, the set of software tools developed by Xilinx
for going from an HDL description of the hardware to the bitstream(s) to be configured on
the FPGA device.
A macro can be relocated to a different position, given that that position has
the same features as the original position where the macro was developed,
namely resources and inter-resource relative placements. This feature of the
RPM grid will be useful during partial bitstream relocation, where it is
important to know both the resources that the bitstream configures and their
relative placement on the FPGA die.
Figure 2.10: An example of the numbering style adopted in the RPM grid.
Figure 2.10 shows a small example that illustrates the pairing of the con-
ventional numbering system described in Section 2.4.1 and the RPM num-
bering system described here. Different box sizes represent different kinds
of resources, as in real devices where resources can be visually identified as
belonging to a particular class according to their shape. In each box, the first
line of text represents the coordinates in the conventional numbering system,
while the second line has the coordinates of a sample RPM system. The
point of reference of a resource in the RPM grid is its lower left corner, and
the main constraint is that if the horizontal and/or vertical position of this
point in the global visual representation is smaller than the position of the
lowest left corner of another arbitrary resource, the former resource will have
an RPM X coordinate and/or an RPM Y coordinate smaller than the latter.
Increments in RPM X or Y coordinates are not always equal to one, as in the
conventional numbering system, but can be any larger integers, provided that
the positional order of resources is respected as before.
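The ordering constraint just stated is easy to express as a predicate: whenever one resource's lower left corner lies strictly left of (or below) another's, its RPM coordinate on that axis must be strictly smaller. A small sketch of the check (our formulation):

def rpm_grid_is_consistent(resources):
    # Each resource is (x, y, rpm_x, rpm_y), with (x, y) the position of
    # its lower left corner in the global visual representation.
    for (x1, y1, rx1, ry1) in resources:
        for (x2, y2, rx2, ry2) in resources:
            if x1 < x2 and not rx1 < rx2:
                return False
            if y1 < y2 and not ry1 < ry2:
                return False
    return True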
Configuration data are loaded on a column basis, with the smallest load unit
being a configuration bitstream frame, which varies in size based on the
target device.
Active partial reconfiguration of Virtex devices, or simply partial reconfigu-
ration, is accomplished in either slave SelectMAP mode or Boundary-Scan
(JTAG) mode. Instead of resetting the device and performing a complete re-
configuration, new data are loaded to reconfigure a specific area of the device,
while the rest of the device is still in operation [96].
The scenario shown in Figure 2.11 turns into the scenario proposed in Figure
2.12.
The first subdivision (who and where) is between external and internal re-
configuration. External reconfiguration implies that an active array may be
partially reconfigured by an external device such as a personal computer,
while ensuring the correct operation of those active circuits that are not be-
ing changed.
Internal or self-reconfiguration, on the other hand, is performed completely
within the FPGA boundaries. Clearly the integrity of the control circuits must
be guaranteed during reconfiguration, so by definition self-reconfiguration is
a specialized form of dynamic reconfiguration [183]. An important feature of
FPGA architectures is the ability to reconfigure not only the whole device but
also a portion of it while the remainder of the design is still operational. Once
interface that can be driven by the logic configured on the array. Starting with
Xilinx Virtex II parts, this interface is called the internal configuration access
port, ICAP [60, 97]. These devices can be configured by loading application-
specific data into the configuration memory, which is segmented into frames,
the smallest unit of reconfiguration. The number of frames and the bits per
frame are different for the different devices of the Virtex II family. The number
of frames is proportional to the CLB width of the device. The number of bits
per frame is proportional to the CLB height of the device.
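The two proportionalities can be combined into a back-of-the-envelope estimate of the full configuration size; the constants in the sketch below are purely hypothetical, since the text only guarantees the proportional relationships.

def full_bitstream_bits(clb_width, clb_height,
                        frames_per_clb_col=4, bits_per_clb_row=64):
    # Hypothetical proportionality constants, for illustration only.
    n_frames = clb_width * frames_per_clb_col       # frames ~ CLB width
    bits_per_frame = clb_height * bits_per_clb_row  # frame size ~ CLB height
    return n_frames * bits_per_frame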
The last property is the dimension. One can distinguish between two different
possibilities: mono-dimensional (1D) and bi-dimensional (2D) reconfiguration.
In 1D reconfiguration the resources affected by the process always span
the whole height of the device. In other words, it is impossible to configure
one set of resources without having to reconfigure the whole columns they
span as well. This reconfiguration scheme is imposed by the architecture of
some devices such as those in Xilinx Spartan 3 and Virtex II families because
the smallest addressable configurable unit in these devices is the frame, a
vertical line of resources. On the other hand, 2D reconfiguration allows
the partitioning of the configurable resources in a 2D space, making it
possible to define different reconfigurable regions one on top of the other.
The 1D limitation has been overcome in more recent FPGA families such
as Virtex 4 and Virtex 5, which can reconfigure specific portions of the de-
vice without affecting the entire columns of corresponding resources. In this
way more sophisticated architectures can be designed, exploiting the added
dimension of freedom in the choice of reconfigurable areas. Figure 2.14
illustrates these two different schemes. The 1D scheme is presented in Figure
2.14(a), showing how the whole FPGA height is occupied by a reconfigurable
region. In such a scenario it is not possible to define two reconfigurable re-
gions that share two different parts of the same column, so it is compulsory
to assign a column to a single reconfigurable region (see Figure 2.14(a)): this
observation explains why, even using a 2D placement, it is not possible to
perform a true 2D reconfiguration.
Figure 2.14: (a) 1D and 2D placement constraints vs. (b) 1D and 2D recon-
figuration.
These buses are implemented using a standard 32-bit-wide data bus and a
32-bit-wide address bus, and they run at the same working frequency.
The communication between the fixed and the reconfigurable area is obtained
by exporting from the fixed side the OPB signals needed to correctly bind
each reconfigurable module to it. The reconfigurable side is composed of one
or more modules that can be dynamically replaced. The proposed architecture
uses a mono-dimensional approach to reconfigure itself; thus reconfigurable
modules are as high as the entire FPGA, according to the
reconfiguration-by-column technique. Reconfigurable modules are attached
to the OPB bus through signals exported by the fixed component and act
like slaves on the OPB. This architecture has two main weaknesses: the
communication and the portability over different FPGA families. The com-
munication infrastructure between the fixed side and the reconfigurable one
implies that, while a reconfiguration process takes place, inter-module commu-
nications are disrupted. The proposed infrastructure also limits the number
of reconfigurable modules. The second problem is caused by the instantiation
of a PowerPC processor inside the fixed side. This limits the portability of
the architecture to only those FPGAs containing this processor inside their die.
Figure 3.1: A reconfigurable device (a), a program specification (b), the time
partitioning schedule (c), and the optimal schedule (d).
A conventional approach will at best identify the schedule in Figure 3.1(c), since the chip,
for the sake of simplicity, is considered a homogeneous device organized in three different
reconfigurable regions and must be reconfigured completely each time. The
size of each task is represented horizontally, while time increases
downwards. Gray areas represent reconfiguration time. If instead one allows
3.1 Model
In order to formally describe the properties of the actual FPGAs it is necessary
to define a model that will allow us to clarify the problem at hand, and thus
to better study its properties. We will strive to build a model that is general
enough to be applicable to several different chips, but neither too abstract
nor too simple, so that it retains the important features of the problem.
The specification is given in terms of a Data Flow Graph (DFG) $\langle O, P \rangle$,
where the node set $O$ represents the operations and the arc set $P$ represents
the precedence relations: an arc $(o_i, o_j)$ indicates that operation $o_i \in O$ must
terminate before operation $o_j \in O$ starts. For convenience, the DFG also
includes a dummy start node and a dummy end node, which in graph $\langle S, P \rangle$
correspond to the singleton subsets $S_s$ and $S_e$, respectively. In the setting of
the DRESD project [152], the DFG is usually obtained semi-automatically from
a high-level specification (SystemC, C, VHDL) [64].
The reconfigurable device is modeled as a set $U = \{u_1, u_2, \ldots, u_{|U|}\}$ of
reconfigurable units (RUs): each $u \in U$ is made up of a fixed number of CLBs.
The RUs are linearly ordered, by which we mean that $u_k$ is adjacent to $u_{k-1}$
on the FPGA, for every $1 < k \le |U|$. The operations can be physically
implemented on the target device using a set $E$ of execution units (EUs),
which correspond to different configurations of the resources (RUs) available
on the device.
In such a scenario, the reconfigurable partitioning and scheduling problem
amounts to deciding which EU implements each operation, and when and where
on the RUs each EU is (re)configured and executed.
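The model can be transcribed almost literally into code. In the sketch below (all names are ours) the DFG, the linearly ordered RUs, and the EUs are plain data structures of the kind the partitioning and scheduling techniques discussed next would take as input.

from dataclasses import dataclass

@dataclass
class DFG:
    operations: set     # the node set O
    precedences: set    # the arc set P: (oi, oj) means oi ends before oj starts

@dataclass
class ReconfigurableUnit:
    index: int          # position in the linear order u1, ..., u|U|
    clbs: int           # number of CLBs the RU is made of

@dataclass
class ExecutionUnit:
    name: str
    implements: set     # operations this configuration can execute
    rus_needed: int     # how many adjacent RUs the configuration occupies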
Approaches to this problem can be found, for example, in [150, 144, 113, 36, 35, 195].
Among these we can identify approaches that only carry out temporal
partitioning of the device's resources, i.e., the whole device is reconfigured
at once when the application requires it, and approaches that also perform
spatial partitioning of the device into reconfigurable units of smaller size.
The latter approaches are those that are closest to our point of view, and are
elsewhere referred to as dynamically reconfigurable architectures. Let us first
focus on some of the works that propose total reconfiguration.
The first approach simply orders the vertices (by increasing value of a priority
function) and adds vertices to a partition until no more will fit on the device,
at which point it creates another partition and continues in the same way
until the whole specification has been considered. The second approach, on
the other hand, aims at minimizing the overhead due to data exchange between
different partitions by trying, when possible, to assign successors of a given
node to the same partition.
An estimate of the total execution time was provided in this work as
$T_{total} = (n \times T_{reconf}) + T_{exec}$, simply stating that the total time required
for executing the partitioned specification is given by the total reconfiguration
time added to the total execution time. The authors themselves, however,
noted how the reconfiguration time was, for their hardware devices, orders
of magnitude larger than the execution time, and suggested improvements to
tackle this disadvantage, such as applying the method to applications in which
each configuration can be used multiple times before being swapped out for
the next one, e.g., applying the first phase of a JPEG encoder to all the
input images before going on to configure the subsequent stages.
A similar approach is presented in [94], which also considers some amount of
spatial partitioning while still requiring total reconfiguration of the physical
devices, due to the use of multiple FPGAs in a single system.
A series of papers [144, 113, 145, 112, 114, 70, 78] present approaches to
the temporal partitioning of a system specification by exploring different ideas
and improvements. In [112], an Integer Linear Programming (ILP) model is
proposed to describe the temporal partitioning problem. It takes into account
the dependencies between different parts of the specification and the resource
availability on the FPGA, and also models the constraints due to memory
needs for exchanging data between the different partitions identified. As its
input, this approach uses a task graph, i.e., a graph in which each node is
not a single operation but a set of operations. This forces the algorithm
to miss a large number of possible solutions, since the minimum working
granularity is set at the task level. Also, many of the parameters
needed by the model to function properly are to be supplied by the designer,
apparently with no general guideline for choosing reasonable values, e.g., for
the maximum number of partitions or the maximum number of functional
units to be instantiated for each type.
This work is extended in [144] with the application of a genetic algorithm
that adds spatial partitioning in order to split the specification onto
multiple interconnected reconfigurable devices. The lack of indications
towards the choice of the ILP model parameters is tackled in [113], which
revises some aspects of the model itself: each task is associated with a set of
possible implementations, each characterized with an estimate of its latency
and area usage. The ILP solver chooses among them which to employ. Choice
of the ILP parameters is tackled here in an enumerative fashion: the ILP solver
is run multiple times for each bound on the number of partitions in order to
find which latency constraints yield the best solution, and this is repeated
until an upper bound on the number of partitions is reached. A clear disadvantage
of this way of operating, however, is that the ILP model is solved multiple
times, thus increasing the computational burden.
In [145], a further modification of the approach is introduced: the operations
inside each task are now allowed to be assigned to different partitions, thus
refining the granularity at which the scheduling is performed. The authors
stress here that, given the imbalance between the configuration and execution
times, in order to obtain a good overall latency it is necessary to minimize
the need for reconfiguration, and thus the number of temporal partitions
identified.
The approach that they propose to this end is the sharing of the func-
tional units instantiated in each temporal partition between different oper-
ations allocated to the same partition. The authors present as an example
the application of their approach to the description of a matrix multiplier,
showing that making use of sharing can yield a gain in execution latency.
The approach proposed therein employs a resource-constrained force-directed
list scheduling, the resource set of which is modified to signify the change
in the chosen temporal partitioning in a local search-like fashion. A major
issue that the authors face is, once again, the remarkable imbalance
between the time spent in computation and in reconfiguration, the latter being
orders of magnitude larger than the former: several seconds spent in
reconfiguration vs. just microseconds needed for execution.
This observation leads the authors, in a further work [114], to explore the
possibility of lessening the impact of reconfiguration time on total latency.
They observe here that many data-intensive applications can be modeled as
the subsequent application of some repeating tasks, in a loop-like fashion. If
each one of the tasks were to be implemented as a temporal partition, then a
reconfiguration would be necessary each time the application needed to begin
execution of a new task. The possibility explored here is that, depending on
A series of works related to the MorphoSys platform [128, 130] tries to ap-
proach dynamic reconfiguration with respect to the particular features of this
architecture. The core of the architecture is made of an 8x8 array of reconfig-
urable cells, each of which resembles a microprocessor's datapath and receives
control information from a dedicated control register, which in turn loads its
contents from the context memory. The task of controlling the behavior of the
reconfigurable hardware is delegated to a dedicated RISC control processor.
The proposed approach uses a library of kernels as its starting point, i.e.,
a collection of subprograms written in C each of which has its counterpart
implemented as reconfigurable hardware. It is also assumed that a generic
specification can be expressed as a loop on a certain sequence of kernels.
Note that this assumption has been adopted in other works such as [114],
presented earlier. The reason this approach is considered as only performing
temporal partitioning is that each kernel is assumed to require the use of the
whole processing array at once. Having said this, it appears clear that the key
to optimizing the implementation is an appropriate way to manage the loops
that the specification is composed of. Not unexpectedly, the first approach
proposed is the modification of the loop schedules in order to execute all
necessary runs of each kernel before reconfiguring the hardware to execute the
next one, with the objective of minimizing the number of required reconfigu-
rations. Different ways of rescheduling the kernels to this end are proposed in
[130], taking into account the limitations imposed by the hardware architec-
ture. Once the loops have been scheduled, the authors focus in turn on the
generation of the control code, which states when context changes take place,
and how data are transferred between subsequent partitions. A primary goal
is the maximization of the overlapping between computation on part of the
reconfigurable array and data transfer or context loading, along with optimal
management of the memory in order to simplify the control code.
The authors of [144, 113, 145, 112, 114, 70, 78], cited above, also explore
the possibility of extending their work to take advantage of partially recon-
figurable architectures. Their main concern [70] is to hide the reconfiguration
times: their approach is to divide the reconfigurable device into two parti-
tions of fixed sizes. At a given time, one of the two parts is computing while
the other one is being reconfigured and getting ready for its next task. It is
assumed that the configuration of one of the two parts starts synchronously
with the execution of the other one, which implies that in order to properly
hide the reconfiguration times, the task assigned to a part has to be long
enough in execution time to last at least as long as the configuration of the
other part. In order to ensure this, the now well-known idea of embedding a
loop inside each partition is proposed.
This is made harder, clearly, by the fact that each partition can only occupy
half of the resources on the device, and thus can only fit smaller loop bodies.
The authors' focus changes in [78], where they try to minimize the com-
putational effort needed in the design phase, as opposed to optimizing the
execution further, by exploiting a library of pre-placed and routed macros as
functional units, which the design phase then only has to place on the device.
functions among random instances that are not part of the datapath, in order
to place them into the bit-slices to which they show the closest relation, thus
saving communication costs.
The work presented in [121], also in the VLSI design field, has the goal of
identifying functionally equivalent slices, i.e., subgraphs of the netlist gener-
ated by the system's HDL description, in order to ease the subsequent logic
optimization step. The proposed approach starts with the identification of
seed sets, i.e., initial groups of known equivalent gates, to be used as a start-
ing point.
Seed sets are built considering heuristics closely related to the application
field, such as considering successors of high-fanout nodes, or entities with
similar names as generated by the HDL tool. At the core of the described
algorithm is an expansion of the already known functionally equivalent slices
driven by the comparison of the so-called regularity signatures between the
corresponding entities in each of the slices. These signatures encode informa-
tion about the structure that would be obtained by expanding the current
slices. Basically, they are subgraphs that include the neighbors being con-
sidered for inclusion into the slices along with the edges connecting the new
neighbors to entities already included in the slices. By comparing the signa-
tures, the algorithm can ensure functional equivalence of the slices obtained
after the expansion.
Another way to approach the regularity extraction problem is presented
in [153], which also deals with integrated circuit design. Here, the focus of
the work is on recognizing in the system graph instances of a given set of
templates. The authors propose an approach that involves using a linear
representation of the graphs, in the form of strings called K-formulas [22].
By representing the templates to be recognized as K-formulas, the authors
formulate the matching problem as a string matching one, and propose ways to
restrict the K-formula representation (which in itself would allow a given graph
to have multiple legal string encodings) so as to obtain a unique representation for a
given directed graph. The authors, however, assume that the template library
is given and thus focus on the recognition of their instances in the circuit. The
problem of generating templates is, on the other hand, given little space.
At this point, all that remains to be done for the current pair is to see
whether the template they have in common is a known one or if, instead, it is
the first time such a template is encountered. In either case, at the end of the
computation for the current pair the template will have been inserted in the set $S$,
and the current pair of nodes will have been added to the $rootNodes_{template}$
set.
Let us now consider how the $LargestTemplate$ function (Algorithm 2) car-
ries out its task. In order to understand its workings, we first have to clarify
how a template is treated in this approach to the problem. Thanks to the fact
that we are identifying tree-shaped templates, we can efficiently store only the
type of the root node, and keep track of which other templates are children
of a given one.
A template is thus composed of two elements: the type of its root node, and
the list of the templates that are its children.
In the pseudocode explaining the algorithm, this is represented with a C-like dot notation
for the sake of readability.
This might be somewhat confusing, but we must keep in mind here that the term tree-
shaped is used in a somewhat stretched way: we are in fact identifying templates that are
shaped like reversed trees, i.e., their root is a node that has no children inside the template.
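A minimal rendering of this representation (field names are ours):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Template:
    root_type: str      # operation type of the root node
    children: List["Template"] = field(default_factory=list)
    # For tree-shaped templates, the root type plus the child templates
    # are enough to reconstruct the whole shape.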
Figure 3.3: Example of how tree-shaped templates are generated. (a) shows
the way the generated templates are stored, while (b) shows where the iden-
tified templates have their instances in the graph.
3.2.5.1.1 Local Isomorphism Check Let us now consider the local iso-
morphism check step, which was mentioned above. At first glance, it may
seem that making sure the operation types for two neighbors match is enough
to ensure the isomorphism of the clusters with the neighbors added to them.
This is not quite true: we must keep in mind, in fact, that we are consider-
ing node-induced subgraphs, i.e., subgraphs that include the vertices in their
vertex set and all the edges from the original graph that have both source and
destination in the subgraph. Figure 3.5 illustrates this problem with a simple
example. In the figure, the six nodes surrounded by the grey area are already
part of the two isomorphic clusters, which consist of three nodes each. The
nodes labeled c1 and c2, as in the algorithm's pseudocode, are part of the pair
that has been dequeued from P and whose neighbors are being considered
for addition to the clusters. After the matching problem has been solved, the
nodes labeled m1 and m2 have been matched. Their operation type is the
same, but adding them to the clusters would not produce isomorphic clusters
of size 4!
Notice, in fact, how adding m1 and m2 to their respective subgraphs would
also cause the addition of the edges from m1 to the light blue node and from
m2 to the yellow node. The resulting subgraphs would then not be isomorphic.
[Figure 3.4: Three steps of the cluster expansion on an example specification. At each step a pair of matching statements is dequeued from P and added to the isomorphic clusters: first (a2 = i1 - 4;, b2 = i2 - 10;) in step 1, then (i1 = i1 * 2;, i2 = i2 * 4;) in step 2; in step 3 no more additions are possible and the expansion stops.]
[Figure 3.5: The pair (c1, c2) and the matched neighbors m1 and m2 whose addition would break the isomorphism of the clusters.]
In order to avoid this kind of issue, the local isomorphism check has been
implemented as shown in Algorithm 5. The idea behind the check is indeed
pretty simple: we build two sets, $E_1$ and $E_2$, which represent the edges that
would be added in the left and right clusters, respectively (left and right are
intended as the left and right sets of the bipartite matching), if we added the
node currently being checked. Each edge is represented by a triple
$\langle inout, pair, num \rangle$, in which:
- $inout$ can take two values: $in$ if the edge being described has the newly
added node as its destination, or $out$ if the opposite happens;
- $pair$ is the pair of nodes, in the isomorphism bijection, to which the node
already in the cluster that is the source (or sink) of the current edge belongs;
- $num$ is the input number provided by the edge.
The algorithm for local isomorphism check simply populates the two sets, with
edges to and from the nodes being considered for addition to the clusters.
When the sets have been populated, a simple comparison is performed: if
the contents of the two sets are the same, then the local isomorphism check is
passed. Otherwise, the addition of the two new nodes is rejected.
Algorithm 5 IsomorphismCheck
  el ← ∅
  er ← ∅
  for each edge (ml → o) ∈ edges(G), with o ∈ V1 do
    el ← el ∪ {⟨out, (o, φ(o)), num(ml → o)⟩}
  end for
  for each edge (mr → o) ∈ edges(G), with o ∈ V2 do
    er ← er ∪ {⟨out, (φ⁻¹(o), o), num(mr → o)⟩}
  end for
  for each edge (o → ml) ∈ edges(G), with o ∈ V1 do
    el ← el ∪ {⟨in, (o, φ(o)), num(o → ml)⟩}
  end for
  for each edge (o → mr) ∈ edges(G), with o ∈ V2 do
    er ← er ∪ {⟨in, (φ⁻¹(o), o), num(o → mr)⟩}
  end for
  if el = er then
    return true
  else
    return false
  end if
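A direct transcription of Algorithm 5 into Python is given below. Here phi is the current isomorphism bijection (phi_inv its inverse), edge_num returns the input number carried by an edge, and out_edges/in_edges enumerate a node's neighbors; all of these names are assumptions of this sketch.

def isomorphism_check(ml, mr, V1, V2, out_edges, in_edges,
                      phi, phi_inv, edge_num):
    # True when adding the matched pair (ml, mr) keeps the two clusters
    # V1 and V2 locally isomorphic.
    el, er = set(), set()
    for o in out_edges(ml):
        if o in V1:
            el.add(("out", (o, phi[o]), edge_num(ml, o)))
    for o in out_edges(mr):
        if o in V2:
            er.add(("out", (phi_inv[o], o), edge_num(mr, o)))
    for o in in_edges(ml):
        if o in V1:
            el.add(("in", (o, phi[o]), edge_num(o, ml)))
    for o in in_edges(mr):
        if o in V2:
            er.add(("in", (phi_inv[o], o), edge_num(o, mr)))
    return el == er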
Latency estimates, i.e., an estimate of the time a given core will take to
execute its functionality once configured on the target device. This is of
course essential to the subsequent scheduling phases, but can nonetheless
be useful during the core choice phases, e.g., to choose between two cores
with identical configuration time but different latency: which one will
yield the most advantage if chosen?
Other ways to characterize the identified cores were defined, among which
the so-called Hierarchical Template Graph (HTG for short) is worth noting.
The result of several runs of an algorithm solving Isomorphic Subgraphs
is a set $\mathcal{V}$ of pairs of isomorphic subgraphs. This set, though, is highly
unstructured: it might contain, for instance, two pairs whose elements are
isomorphic (say $\{S^a, S^b\}, \{S^a, S^c\} \in \mathcal{V}$), which would be better represented
as a single set $\{S^a, S^b, S^c\} \in \mathcal{V}$. Also, notice that even if two sets $S^a, S^b$
in some pairs of $\mathcal{V}$ are not isomorphic, it might well be that there exist
$S^{a'} \subseteq S^a$ and $S^{b'} \subseteq S^b$ that indeed are isomorphic. For instance, say we
have $\mathcal{V} = \{\{S^a, S^b\}, \{S^c, S^d\}\}$ and that there exists $S^{a'} \subseteq S^a$ isomorphic to
$S^c$. For scheduling purposes it is not clear whether we prefer to use fewer
isomorphic subgraphs of greater size ($\{S^a, S^b\}$) or more isomorphic subgraphs
of smaller size ($\{S^{a'}, S^{b'}, S^c, S^d\}$, where $S^{b'} \subseteq S^b$ isomorphic to $S^{a'}$ must
exist because $S^a$ is isomorphic to $S^b$).
An accurate measure of device resource usage can only be obtained further down the
design flow, due to the nondeterminism of some of the lower-level phases.
For this reason we build a Hierarchical Template Graph (HTG), whose nodes
contain collections of isomorphic subgraphs. Whenever two subgraphs are
isomorphic they share the same structure, so we introduce the term template
to identify the common structure of those subgraphs. Moreover, every arc
going from a set $S$ to a set $S'$ implies that the template for $S$ (and hence
any graph in $S$) is a subgraph of the template for $S'$ (and hence of any graph
in $S'$).
If we now want to know how many copies of a subgraph $S$ there are in the
original specification, we can look at the representative node for $S$ in the HTG
and traverse all the children of that node, since they are bound to contain
graphs that have subgraphs isomorphic to the representative, and hence to $S$. Therefore
we can say that in the HTG two templates have a common ancestor if they
contain it. This situation is especially common with the core generation approach
we presented above, which dealt with the progressive expansion of an initial
seed graph and kept track of each of the intermediate results, but it definitely
to explicitly represent such relations, in order to allow us to take advantage
of the structural similarity of the cores in their choice. Figure 3.6 shows an
example of the structure offered by the HTG, in which it is shown how the
seed templates of size 2 are parts of larger templates of size 3, which in turn
are subgraphs of the size 4 template. The templates used are taken from the
example used earlier for the core generation phase.
The HTG can improve the reconfiguration process in two different ways. On
one hand it allows the scheduling algorithm to compute a fairly precise trade-
off between the size of the templates to use on the FPGA and the number of
times these replicas can be used.
On the other hand, the scheduling algorithm can benefit from knowing that
a certain template can be configured on the FPGA as a modification of another
template rather than by blindly reconfiguring the CLBs. As an example,
consider two nodes $n_1, n_2 \in HTG$ that have a common ancestor
$a_i \in HTG$. This means that $n_1$ and $n_2$ contain two isomorphic subgraphs. It is
then possible to specify the reconfiguration process as the difference between
n1 and n2 , thus producing a smaller reconfiguration bitstream to download
on the FPGA.
As a proof-of-concept, consider Table 3.1, which shows some reconfiguration
time results computed on a Xilinx xc2vp7 FPGA. It is easy to see that the
reconfiguration time grows linearly with the size, in number of frames, of the
bitstream. Therefore, the ability to maximize the resource reuse is going to
drastically decrease the time needed to partially reconfigure the device.
RC1 to RC5, shown in Table 3.1, represent the bitstreams used to implement
a given specification on a reconfigurable device. Suppose we have an isomor-
phic relation, computed using the HTG, between some of them, e.g., RC4 and
RC5 have RC3 as a common ancestor. Suppose also that a data dependency
exists between RC3 and RC4 and between RC4 and RC5, so that RC5 has to
wait for RC4 and RC4 for RC3. In a scenario with just one feasible area to
configure all the RCs, there is no way to hide the reconfiguration time needed
to switch between the corresponding three configurations. Therefore the con-
figuration time is computed as 60.045 ms + 64.140 ms + 76.049 ms = 200.234 ms.
Consider now the case where the information provided by the HTG has been
taken into account to compute the reconfiguration bitstreams. In this scenario
the reconfiguration time needed to configure RC3 is still 60.045 ms, but the
time required to switch from RC3 to RC4 is no longer 64.140 ms but
something close to 4 ms, and the time to switch from RC4 to RC5 is closer to 20 ms;
therefore the overall configuration time is now 60.045 ms + 4 ms + 20 ms ≈ 84 ms.
All this information is already available thanks to the HTG. At present it
is not yet used in the scheduling process, but its exploitation is underway.
Building this tree may seem complicated, but in practice it is sufficient to
identify every node of the HTG with a representative list of operations from
the original graph. We start with an empty tree and add the pairs in turn,
in decreasing size order. If the tree already contains a node with the same list of
operations (we can order the lists of operations in some way to speed up the
comparison), we add the pair there. If the list of operations is a subset of the
list in an existing node, we create a new child node.
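The informal procedure above fits in a few lines. In this sketch (names ours) each HTG node is keyed by the sorted list of its representative operations; pairs are inserted in decreasing size order, and a subset relation makes the smaller template a child of the larger one, as described above.

def build_htg(pairs):
    # pairs: iterable of (subgraph_a, subgraph_b); each subgraph is a set
    # of operation identifiers from the original graph.
    htg = {}
    for a, b in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        key = tuple(sorted(a))              # representative operation list
        if key in htg:                      # same list: merge into that node
            htg[key]["pairs"].append((a, b))
            continue
        htg[key] = {"pairs": [(a, b)], "children": []}
        for other in htg:
            if other != key and set(key) < set(other):
                htg[other]["children"].append(key)  # smaller under larger
    return htg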
- Consider the core that has the largest size first, also known as the LFF
(Largest Fit First) heuristic [44]. The point of such a choice is to allow
greater optimization possibilities to later stages of the design flow, i.e.,
intra-module placement and routing, but it has the potential adverse effect
of choosing templates so large that they could use up a big part of the
device, thus leaving little room for configuring more modules at the same
time.
- Consider cores with the most instances first, also known as the MFF
(Most Frequent Fit First) heuristic [44] (clearly, this heuristic is
especially fit to work with partitioning policies that expose recurrent
structures). Conversely, this approach allows for more moderate area usage
in that it tends, in principle, to maximize reuse of the identified units,
possibly at the expense of a worse schedule length in the implemented
system.
In addition to these rather basic policies, however, we can define some that
are more specifically tailored to our problem, i.e., to the final goal of
implementing our reconfigurable system on a physical device such as an FPGA,
with its constraints related to the minimum size of a reconfigurable unit, the
total area available, and the limited communication resources.
As an example, we might evaluate actual area usage on the device by using
the following formula:
$$A_{used} = \left\lceil \frac{A_{core}}{Size_{rec}} \right\rceil \times Size_{rec}$$
i.e., the actual area used on the device by a given core is the area effectively
occupied by the core's logic rounded up to the next multiple of the
reconfigurable unit size.
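Both heuristics and the rounding formula take only a few lines; in this sketch (ours) a core is a dict with size, instances, and area fields.

import math

def order_cores(cores, policy="LFF"):
    # LFF: largest size first; MFF: most instances first.
    if policy == "LFF":
        return sorted(cores, key=lambda c: c["size"], reverse=True)
    if policy == "MFF":
        return sorted(cores, key=lambda c: c["instances"], reverse=True)
    raise ValueError(policy)

def used_area(core_area, size_rec):
    # Area charged to a core: its logic area rounded up to a whole
    # number of reconfigurable units (the A_used formula above).
    return math.ceil(core_area / size_rec) * size_rec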
$$\max\; x_1 + x_2 + x_3$$
$$\text{subject to}\quad x_1 + x_2 \le 1,\qquad x_2 + x_3 \le 1,\qquad x_1, x_2, x_3 \in \{0, 1\}$$
Figure 3.8: Covering Set Identification: (a) status after core choice, (b) order
in which the covering set identification considers the uncovered nodes, (c) final
status, partitions have been created for the previously uncovered parts.
The chosen cores, however, do not necessarily cover all the operations contained
in the original specification. Basically, two ways exist to overcome this
obstacle: either enlarging some of the cores to include the uncovered
operations or generating new cores that contain these operations. Given our
policy of detecting repeated structures, we chose the latter strategy, since
modifying the detected cores would mean losing the isomorphism between their
instances, ultimately wasting the information produced by our partitioning
phase.
Let us refer to Figure 3.8 to explain how we carry out this task, with
reference to the example of Figure 3.4, where different instructions have been
represented using different filler colors:
Figure 3.8(a) represents the status of our example specification after the
core choice phase, having applied our isomorphism-detecting partitioner
and an LFF heuristic for the core choice,k with a simple size measurement
based on the number of nodes contained in each core. Note how
parts of the specification are left uncovered by the chosen cores, i.e., the
entry node at the top and the three nodes at the bottom;
Figure 3.8(b) shows how the covering set identification algorithm tra-
verses the graph to generate the new subsets: the graph is traversed
in breadth-first order, with the aim of exploring the remaining nodes
by level (nodes which have already been covered by the previous steps
are simply skipped). The algorithm then adds the nodes to the current
subset as they are visited, with the following exceptions:
k Largest Fit First, which chooses cores simply in decreasing size order. Other heuristics are
MFF (Most Frequent Fit First), which chooses the core that has the most instances, and the
two versions of CWM (Communication Width Metric), which try to minimize communication
with the other cores or to maximize the communication inside the core.
put a task.
Under these assumptions the algorithm finds a provably optimal solution
[71], and the techniques used are based on the off-line paging problem as
solved by the Least Imminently Used (LIU) algorithm. This approach does
not take into consideration resource reuse and different task sizes. Furthermore, only the total
reconfiguration time is minimized, without considering the execution time and
the reconfiguration time impact described, for example, in [63] for a similar
architecture.
3.3.2.1 Constants
aij := [tasks i and j ∈ S are mapped onto the same EU] (by convention,
aii = 1);
li := latency of task i ∈ S;
Notice that index i replaces index e because the mapping of tasks on EUs is
univocal.
The scheduling time horizon T = {0, . . . , |T| − 1} is large enough to
reconfigure and execute all tasks. A good estimate of |T| is obtained via a
heuristic.
3.3.2.2 Variables
pihk := [task i is present on the FPGA at time h and column k is its
leftmost one];
3.3.2.3 Constraints
The constraints marked with an asterisk are written with the if-then trans-
formation (see [210]).
No overlap constraints A column can be used by a single task at a time:

$$\sum_{i \in S} \; \sum_{l=\max(k-r_i+1,\,1)}^{k} p_{ihl} \le 1 \qquad \forall h \in T,\ \forall k \in U \tag{3.2}$$

$$p_{ihk} + p_{iml} \le 1 \qquad \forall i \in S,\ \forall h, m \in T,\ \forall k, l \in U : k \ne l \tag{3.3}$$

$$p_{ihk} = 0 \qquad \forall i \in S,\ \forall h \in T,\ \forall k \ge |U| - r_i + 2 \tag{3.4}$$
Arrival time constraints The arrival time must not exceed the first time
step for which p is 1:

$$S_i^{on} \le h \sum_{k \in U} p_{ihk} + |T| \left(1 - \sum_{k \in U} p_{ihk}\right) \qquad \forall i \in S,\ \forall h \in T \tag{3.5}$$
Leaving time constraints The leaving time must not precede the last instant
for which p is 1:

$$S_i^{off} \ge h \sum_{k \in U} p_{ihk} \qquad \forall i \in S,\ \forall h \in T \tag{3.6}$$
Task length constraints A task must be present on the FPGA at least for
its execution time plus (if no module reuse occurs) its reconfiguration time
(reconfiguration prefetching is allowed, that is, the execution is not bound to
follow immediately the reconfiguration):
$$\sum_{h \in T} \sum_{k \in U} p_{ihk} \ge l_i + (1 - m_i)\, d_i \qquad \forall i \in S \tag{3.9}$$
$$\sum_{i \in S} \; \sum_{m=\max(1,\,h-d_i+1)}^{h} t_{im} \le 1 \qquad \forall h \in T \tag{3.12}$$

$$p_{i0k} = 0 \qquad \forall i \in S,\ \forall k \in U \tag{3.13}$$
Module reuse constraints A task can exploit module reuse only if in the
time step preceding its arrival time an equivalent task occupies the same
position:
$$p_{ihk} + m_i \le \sum_{j \in S} a_{ij}\, p_{j(h-1)k} + 1 \qquad \forall i \in S,\ \forall h \in T \setminus \{0\},\ \forall k \in U \tag{3.14}$$
$$t_e \ge S_i^{off} \qquad \forall i \in S \tag{3.15}$$
3.3.2.4 Objective
$$\min\ t_e \tag{3.16}$$
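To make the formulation concrete, the following sketch encodes the no-overlap constraint (3.2) and the objective (3.16) on a small hypothetical instance using the PuLP modeling library (an illustrative assumption, not the solver used for the results discussed here); the remaining constraints attach in the same way.

import pulp

# Hypothetical toy instance (not from the text): 2 tasks, 4 columns, 6 steps.
S = [0, 1]                         # tasks
U = [1, 2, 3, 4]                   # device columns
T = list(range(6))                 # time horizon
r = {0: 2, 1: 1}                   # columns required by each task

prob = pulp.LpProblem("fpga_schedule", pulp.LpMinimize)
p = pulp.LpVariable.dicts("p", (S, T, U), cat="Binary")   # p[i][h][k]
t_end = pulp.LpVariable("t_end", lowBound=0)

# No-overlap constraint (3.2): a column hosts at most one task per time step.
for h in T:
    for k in U:
        prob += pulp.lpSum(p[i][h][l] for i in S
                           for l in range(max(k - r[i] + 1, 1), k + 1)) <= 1

prob += t_end                      # objective (3.16): minimize the end time
# Constraints (3.4)-(3.15) are added to `prob` in exactly the same way.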
3.3.3 Heuristic
3.3.3.1 Salomone
Salomone solves the scheduling problem by taking advantage of the
solution of the task allocation problem; that is, it assigns a location on the
target device to each node of the input graph, reconfiguring the functionality
assigned to the same location. We will show the different phases involved
using the Task Dependence Graph shown in Figure 3.9 as an example.
The basic idea is to try to identify a feasible location for each node, remembering
that the switch between two nodes (two functionalities, two tasks) has
to be implemented by introducing a time penalty due to the reconfiguration.
This assignment provides useful information that, if used as input for the
scheduler, will lead to the definition of a good scheduling solution.
The first phase aims at defining the conflict graph using as input the tdg,
which can be thought of as provided as input to the scheduling phase by the
previous partitioning one. A conflict is modeled using an edge, meaning that
the two nodes cannot be mapped on the same area of the device. This is done
to identify the nodes able to be executed in parallel. Starting from
the same scheduling solution, more than one conflict graph can be obtained
from the algorithm without increasing the computed solution. Experimental
results proved that the best conflict graph is the one that has the smallest
number of overlapping nodes; therefore this graph will be provided as output
of this phase. At this point it is important to consider the following situation,
which introduces an interesting distinction with respect to the classical prob-
lem, proposed in [41]. Consider two nodes S and S′ having a data dependency
S → S′: it is desirable, since the reconfiguration time can introduce latency
overhead, to have S′ mapped on the chip as soon as S ends its computation.
Therefore Salomone does not solve the allocation problem by using a coloring
technique on the tdg, as done in the compiler design area [41, 40]: rather, it
considers a graph that is obtained by merging the Conflict Graph and the Task
Dependence Graph; the graph obtained from the task graph in Figure 3.9 is
shown in Figure 3.10.
Salomone then colors this new graph, partitioning it into several node sets,
each of them corresponding to a specific color. This solves the allocation
problem of each node onto the FPGA, partitioning the FPGA into a number
of areas equal to the number of colors. This assignment will imply that each
node belonging to a color set will be mapped onto the same FPGA area,
Algorithm 7 Adj
AllColors ← ∅;
for all (v ∈ V) do
    Colors[v] ← 0;
end for
for all (v ∈ V) do
    if Colors[v] = 0 then
        ColorThis(v);
        ColorNb(v);
    else
        if Colors[v] = 1 then
            ColorThis(v);
        else
            ColorNb(v);
        end if
    end if
end for
The choice of the actual color being assigned to a node is done via a
heuristic based on the Friend(c, Cols) function which, among the colors Cols,
chooses the one that is most frequently adjacent to c in the graph at a par-
ticular time.
The Friend function checks how many edges exist with certain endpoint
colors. Its complexity is therefore O(|E|). The CheckConflict function,
instead, has O(|V|) worst-case complexity (for a dense graph). As for the
ColorThis function, steps 1 and 2 require O(|V|) while the else branch takes
O(|V|) plus the complexity of Friend. Hence we obtain O(|V| + |E|). It is
then easy to see that ColorNb takes O(|V|(|V| + |E|)). From this it can be
deduced that an upper bound for the worst-case complexity of the overall
algorithm is O(|V|²(|V| + |E|)). This bound is not tight at all, though, since it assumes
that condition (a) in the main body is always true. Obviously this is not
the case, since some coloring is performed by ColorNb executed for previous
nodes. In the case of sparse graphs (and those arising from our problems
always are), however, assuming that the cardinality of every neighborhood is
bounded by a constant, one gets O(1) for CheckConflict, O(|E|) for ColorThis
and hence for ColorNb, so that the total complexity is O(|E||V|).
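One possible reading of the Friend function, sketched below over hypothetical data structures (an edge list and a partial color map), makes the O(|E|) bound evident, since a single pass over the edges suffices:

from collections import Counter

def friend(c, cols, edges, color_of):
    """Among candidate colors `cols`, return the one most frequently
    adjacent to color `c`; one O(|E|) pass over the edge list."""
    count = Counter()
    for u, v in edges:
        cu, cv = color_of.get(u), color_of.get(v)
        if cu == c and cv in cols:
            count[cv] += 1
        if cv == c and cu in cols:
            count[cu] += 1
    return max(cols, key=lambda col: count[col])

# e.g. friend(1, {2, 3}, [("a","b"), ("a","c")], {"a": 1, "b": 2, "c": 2}) -> 2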
uler used to change the order in which the nodes are mapped on the FPGA
according to the environment changes that can influence the performance of
the system at runtime.
3.3.3.2 Napoleon
Salomone does not provide task relocation. Once a task has been assigned to
a specific SCONO it cannot be relocated at runtime to a different position,
since the communication infrastructure is built on top of the SCONOs'
floorplanning assignment. In such a scenario each SCONO is assigned the
largest area constraint, defined by considering the most demanding task that
has to be implemented in that area, with a considerable waste of resources.
To avoid these problems, a new heuristic, named Napoleon, has been defined.
This new solution also tries to maximize task reuse without introducing
reconfiguration overhead whenever possible.
Napoleon is a reconfiguration-aware scheduler for dynamically partially re-
configurable architectures. Its distinguishing features are the exploitation of
configuration prefetching, module reuse, and anti-fragmentation techniques.
Algorithm 12 shows the pseudo-code of the proposed algorithm. First of
all, Napoleon performs an infinite-resource scheduling in order to sort the task
set S by increasing alap values. Then, it builds subset RN with all tasks
having no predecessor. In the following, RN will be updated so as to include
all tasks whose predecessors have all been already scheduled (available tasks).
As long as the dummy end task is unscheduled, the algorithm performs the
following operations. First it scans the available tasks in increasing alap order
to determine whether any of them can reuse the modules currently placed
on the FPGA. Each time this occurs, the task is allocated to the compatible
module p that is the farthest from the center of the FPGA. This farthest
placement criterion is an anti-fragmentation technique that aims at keeping
the central area of the device free for the tasks still to be placed.
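A minimal sketch of this reuse step (the data layout is hypothetical, and Algorithm 12 itself is not reproduced here):

def pick_reusable_module(task_fn, modules, fpga_width):
    """Return the configured module compatible with `task_fn` that lies
    farthest from the device center, or None if reuse is impossible.
    `modules` holds (leftmost_column, width, function_id) tuples."""
    center = fpga_width / 2.0
    compatible = [m for m in modules if m[2] == task_fn]
    if not compatible:
        return None                       # the task must be (re)configured instead
    # farthest-from-center choice: the anti-fragmentation rule described above
    return max(compatible, key=lambda m: abs(m[0] + m[1] / 2.0 - center))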
Summary
This chapter provides an insight into the advantages of partial dynamic re-
configuration. In order to formally describe the properties of the actual FP-
GAs, a model, general enough to be applicable to several different chips, but
also not too abstract or simple, thus retaining the important features of the
problem, is presented. The specification is given in terms of a DFG ⟨O, P⟩,
where the node set O represents the operations and the arc set P represents
the precedence relations: an arc oi → oj indicates that operation oi ∈ O must
terminate before operation oj ∈ O starts. Different partitioning approaches
are described in Section 3.2, comparing them to identify the most promising
one in achieving the best results using the partial reconfigurable capabilities.
This chapter presents how to best exploit such capabilities, starting from the
given specification, by defining specific solutions to the partitioning and the
scheduling problem tailored for self-dynamically reconfigurable systems. The
problem at hand will be described from the point of view of the actual physi-
cal architecture and then a novel model will be used to describe it in a precise
mathematical way to allow its analysis. Finally, scheduling techniques are
illustrated in Section 3.3.
5. Explain the reasons why the local isomorphism check has to be done.
6. Hierarchical template graph (HTG) and partial reconfiguration: list at
least two benefits produced by the HTG during the runtime partial
reconfiguration.
7. Try to extend the proposed ILP to support the 2D reconfiguration.
form interface for communication between hardware and software tasks. Ad-
vanced services of an OS4RS could include dynamically partitioning a task
as hardware or software, relocating a hardware task as a software one and
vice versa [134], de-fragmentation of the reconfigurable logic resource space,
module management such that hardware and software implementations of a
task are reused across different applications, and support for Network-on-Chip
(NoC) [131, 140].
From the previous discussions, we can clearly see that an OS4RS needs to
provide services at different levels of the system hierarchy. Thus, a layered
architecture is generally proposed for an OS4RS. Another reason is that the
metamorphosis of a task between its hardware and software implementations
requires a layered approach such that application-level, task-level, and
function-level attributes and characteristics are clearly distinguished through
the different layers.
[Figure: the layered architecture of an OS4RS — application layer, module
layer, scheduling layer, placement layer, configuration layer, and hardware
layer; the hardware layer consists of the processor, memory, HW accelerator,
reconfigurable logic, and function ROM.]
The reference architecture can be easily extended with more advanced fea-
tures as described in the following.
Network-on-Chip (NoC): NoC is a more efficient communication infra-
structure that can allow multiple concurrent data transfers using differ-
ent switching and routing schemes. Reconfigurable NoC is also currently
a hot topic of research because it not only makes the NoC design more
adaptable to dynamic application changes, but also brings unforeseen
advantages to the traditional NoC [168] such as run-time migration of
processing elements (PE), de-allocation of unused PE, dynamic map-
ping of PE to the available NoC routers, replacing a simple (complex)
router with another complex (simple) router based on the needs of the
applications, shutting down part of the NoC and restarting it when
and where required, reconfiguration-based congestion avoidance strate-
gies, predictable data transfers, and reduction in power consumption, to
name a few. The OS4RS will play an important role in such a reconfig-
urable NoC due to the added features.
Reconfigurable Module Sequencer (RMS): RMS [169] is more like a hard-
ware task manager, but its main job is to alleviate the burden of the
microprocessor and the OS4RS in handling the data communications
among different hardware tasks. System calls are implemented in the
OS4RS to drive the RMS. A conventional program must be first trans-
formed into a chained program, which has a set of chains that can be
executed by the RMS. A chain is a sequence of reconfigurable hardware
functions. The data transfers between consecutive hardware functions
in a chain are handled by the RMS. Thus, the communication perfor-
mance of a reconfigurable system with RMS is higher than that without
RMS.
Task Configuration Controller (TCC): The basic function of a task con-
figuration controller is to read configuration bitstreams and send them
to a configuration port such as ICAP. It could be implemented in hard-
ware or software depending on the availability of hardware computing
resources and real-time constraints. More advanced functionalities of
a TCC could include task context saving and restoring, task switching,
and task relocation. Details of these advanced functions can be found in
Section 4.3.5. To support TCC, an OS4RS must include related device
drivers such as ICAP driver and control programs such as the mediator
program [180].
Data Sequencer: A major performance bottleneck for reconfigurable
hardware tasks is data streaming. The hardware tasks are usually
slaves and thus data need to be streamed into and out of the tasks.
To solve this problem, either the hardware tasks should be implemented
as master devices or the system architecture could include a data se-
quencer. Master hardware tasks are similar to the Hthreads concept
[7], where a hardware thread can access shared IPC structures, synchro-
nization primitives, and perform I/O just like software threads. A data
sequencer can be used to accelerate data I/O to and from the hardware
tasks through a dedicated communication infrastructure such as a bus or
NoC.
Slotted Area Model: In the slotted area model, as shown in Figure 4.3(a),
the reconfigurable area is divided into a number of fixed-size slots, each of
which can accommodate at most one hardware task.
Best/Worst Fit with Exact Fit Heuristic: Among all free spaces that can
accommodate a new task, the best (worst) fit with exact fit heuristic
[201] chooses the smallest (biggest) rectangle that has exactly the same
width or exactly the same height as the task. If no such rectangle is
found, the task will be placed according to the conventional best (worst)
fit strategy.
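A compact sketch of this heuristic over a hypothetical free-rectangle list of (width, height) pairs; swapping min for max turns best fit into worst fit:

def best_fit_exact(task_w, task_h, free_rects):
    """Best fit with exact fit: prefer the smallest free rectangle that
    shares the task's exact width or height; otherwise fall back to the
    conventional best fit among all rectangles that can host the task."""
    fitting = [r for r in free_rects if r[0] >= task_w and r[1] >= task_h]
    if not fitting:
        return None
    exact = [r for r in fitting if r[0] == task_w or r[1] == task_h]
    pool = exact if exact else fitting
    return min(pool, key=lambda r: r[0] * r[1])

# e.g. best_fit_exact(3, 4, [(3, 10), (5, 5), (8, 4)]) -> (3, 10)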
From the above fragmentation minimization heuristics, one can see that
the best/worst fit with exact fit heuristic is the simplest but might not re-
duce fragmentation as much as the other heuristics. Further, the adjacency
heuristic is simpler than the fragmentation heuristic; however, it might not
provide as many feasible solutions as the fragmentation heuristic. Neverthe-
less, the fragmentation heuristic requires more computation. Irrespective of
the heuristic, the placement algorithm complexity is O(n²), where n is the
number of running tasks.
$$TotalRCost(i, n) = \min_{x_n, y_n} \left\{ \sum_{i=1}^{n-1} \left[ \left( (\Delta x(i,n))^2 + (\Delta y(i,n))^2 \right) w_{in} \right] \right\} \tag{4.3}$$

$$\frac{\partial \left\{ \sum_{i=1}^{n-1} \left[ \left( (\Delta x(i,n))^2 + (\Delta y(i,n))^2 \right) w_{in} \right] \right\}}{\partial x_n} = 0 \tag{4.4}$$

$$\frac{\partial \left\{ \sum_{i=1}^{n-1} \left[ \left( (\Delta x(i,n))^2 + (\Delta y(i,n))^2 \right) w_{in} \right] \right\}}{\partial y_n} = 0 \tag{4.5}$$

$$x_n = \frac{\sum_{i=1}^{n-1} w_{in} \left( x_i + w_i/2 - w_n/2 \right)}{\sum_{i=1}^{n-1} w_{in}} \tag{4.6}$$

$$y_n = \frac{\sum_{i=1}^{n-1} w_{in} \left( y_i + h_i/2 - h_n/2 \right)}{\sum_{i=1}^{n-1} w_{in}} \tag{4.7}$$
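The closed forms (4.6) and (4.7) are cheap to evaluate; a direct transcription, under a hypothetical tuple layout, looks as follows:

def routing_optimal_position(placed, w_n, h_n):
    """Evaluate (4.6) and (4.7): the position of task n minimizing the
    total routing cost.  `placed` lists already-placed tasks as
    (x_i, y_i, w_i, h_i, w_in) tuples, w_in being the routing weight."""
    total = sum(t[4] for t in placed)
    x_n = sum(t[4] * (t[0] + t[2] / 2.0 - w_n / 2.0) for t in placed) / total
    y_n = sum(t[4] * (t[1] + t[3] / 2.0 - h_n / 2.0) for t in placed) / total
    return x_n, y_n   # then snapped to the nearest feasible position on the device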
from the upper right corner of a reconfigurable logic area. The minimization
of fragmentation (MF) algorithm starts from the lower right corner and the
first-fit (FF) algorithm starts from the lower left corner. If a fourth placement
algorithm is to be incorporated into MOHP, the upper left corner can then
be used. Given a large enough reconfigurable logic device such as Virtex 5
or Altera Stratix III, effective separation of the different types of tasks allows
the simultaneous satisfaction of the different goals.
The flow chart for the MOHP algorithm is illustrated in Figure 4.8. MOHP
employs three queues to schedule the different types of tasks that are ready
to be placed and executed. The routing queue consists of the communicating
tasks, the fragment queue consists of the independent non-real-time tasks,
and the urgent queue consists of the independent real-time tasks. The MR,
MF, and FF algorithms are used for placing the tasks from the routing queue,
the fragment queue, and the urgent queue, respectively. If a feasible location
is found by the algorithms, then the hardware task is placed. Otherwise, the
task waits for the next chance, which could be the next scheduling point such
as when a hardware task terminates.
which combines ALAP and dynamic ASAP to calculate task priority. Dy-
namic ASAP computes task priority at run time, and all dynamic ASAP val-
ues are recalculated once a task is scheduled. Thus task priorities are changed
dynamically [5, 8].
[Figure: column–time charts (15 columns × 30 cycles) comparing the
schedules of tasks A–G produced by the Horizon (left) and Stuffing (right)
algorithms.]
the area requirement, and ei is the execution time of task ti . If SU R(ti ) < 1
then the task ti is classified as a low SUR task. Otherwise, it is a high
SUR task. The classified stuffing method places the low SUR tasks and the
high SUR tasks starting from the leftmost and the rightmost column in the
reconfigurable area, respectively. The difference in the placement produced
by classified stuffing, as compared to conventional stuffing, is illustrated in
Figure 4.11, where SU R(D) < 1 means task D is a low SUR task and is
placed from the rightmost column. The hole is thus not as large as that in
the conventional stuffing. The classified stuffing technique shows significant
benefits in terms of a shorter system schedule and a more compact placement
with smaller fragmentation of allocatable resources.
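Ignoring the time dimension of stuffing, the column choice can be sketched as follows (the data layout is hypothetical, SUR values are assumed precomputed, and which class takes which end is illustrative only):

def classified_stuffing(tasks, n_columns):
    """Allocate columns with low- and high-SUR tasks stuffed from opposite
    ends of the reconfigurable area, so the two classes do not interleave.
    `tasks` is a list of (name, sur, n_cols) tuples."""
    left, right = 1, n_columns
    placement = {}
    for name, sur, n_cols in tasks:
        if sur < 1:                          # low-SUR task: one end of the device
            placement[name] = (left, left + n_cols - 1)
            left += n_cols
        else:                                # high-SUR task: the opposite end
            placement[name] = (right - n_cols + 1, right)
            right -= n_cols
    return placement

# e.g. classified_stuffing([("A", 0.5, 3), ("B", 2.0, 4)], 15)
#      -> {"A": (1, 3), "B": (12, 15)}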
[Figure: column–time charts (15 columns × 30 cycles) comparing the
placements of tasks A–F produced by conventional stuffing (left) and
classified stuffing (right).]
[Figure: architecture for relocatable hardware-software task scheduling —
relocatable system task configurations and relocatable HW/SW tasks feed a
common scheduler; software tasks run under an EDF scheduler on the
microprocessor and hardware tasks under a placer on the FPGA, with
context (re)store and a common HW-SW task interface supporting
HW-to-SW and SW-to-HW relocation.]
partitioning such that the software tasks are scheduled by an EDF scheduler
on the microprocessor and the hardware tasks are placed on the FPGA for
execution. At two different scheduling points, including when a new task
arrives and when a hardware task terminates, the RHSS is invoked again.
Task relocation further requires the support of two more mechanisms, includ-
ing context saving and restoring for both hardware and software tasks and
a common hardware software task interface such that a relocated task can
resume communication with other tasks.
Compared to AHSA, RHSS considers hardware configuration time explicitly
and takes contiguous area placement restrictions into account, which results
in more realistic models. Application experiments show that in spite of the
reduced time complexity, the proposed RHSS method can achieve an even
higher resource utilization than AHSA.
tions of the task. The goal is to find an execution order for the functions
on the processor and one for the FPGA such that all task deadlines, func-
tion precedence relations, and FPGA area constraint are satisfied while the
schedules consume the least energy.
The coscheduling method [126] consists of two phases, namely a design
time phase and a runtime phase. As illustrated in Algorithm 15, at design
time, hardware functions that are common to two or more tasks are identified
along with their corresponding leading software functions, where a leading
software function is one that precedes the common hardware function in a
function graph, with no other software function coming between them along
any path. Each path from a leading software function to a common hardware
function is called a key path. A delay time-table is constructed for each pair
of key paths corresponding to the same hardware function. The table keeps
information about the time difference that is allowed between two leading
software functions for the common hardware functions to be scheduled to use
the same configuration. The slack time of a leading software function is shared
amongst all the corresponding common hardware functions, if there are more
than one of them.
duction, but DVS also helps reduce the power usage of software tasks. Hence,
the overall effect is the combined efforts of hardware configuration reuse and
dynamic voltage scaling.
Example 4.1
To illustrate the above described energy-efficient hardware-software co-sched-
uling phases, we use an example system consisting of 3 tasks, whose task
graphs are shown in Figure 4.14. The function associated with a node TXi is
given inside the node as Fj . Functions F1 , F7 , and F10 are software, while the
rest are all hardware. The software functions are usually the control programs
or drivers, while the hardware functions could be any IP such as (I)DCT, ma-
trix multiplier, FFT, motion estimator, that is, circuit designs that could be
used across different applications. The arrival times for tasks A, B, C are,
respectively, 0, 2600, 5000, their priorities are 2, 3, 1, where a higher value rep-
resents higher priority, and their deadlines are 6500, 6600, 5500. The function
attributes are as shown in Table 4.1, where SEi , HEi , HCi , and HSi are the
software execution time, the hardware execution time, the hardware configu-
ration time, and the hardware space requirements, respectively, for function
Fi .
and the execution power in FPGA is 1000 mW. The prolongment of F7 results
not only in lesser energy consumption by the software function F7 , but also
saves one reconfiguration of two columns of FPGA due to the configuration
reuse now possible for hardware function F6 common to tasks A and B, along
key paths P1 and P2 , respectively. Fewer reconfigurations save both time and
energy.
At time 5000, task C arrives, but has to wait till time 6200 for the processor
to be available to execute leading software function F10 . Figure 4.15 shows the
details of why and how the execution of F10 can be prolonged by 600 time units
by applying our co-scheduling method. Further, one more reconfiguration of
1 column FPGA is saved for F9 common to tasks B and C, along paths P3
and P5 , respectively. The final scheduling results are shown in Figure 4.15.
Let us compare the energy-efficient hardware-software coscheduling method
with two conventional methods as follows. Method M1 does not apply any
acceleration technique, while method M2 applies configuration prefetch and
reuse, but without DVS. The comparisons are shown in Table 4.4, where
we can see that the energy-efficient coscheduling method outperforms the
other two methods. Compared to the bare method M1 , the energy-efficient
coscheduling method not only reduces the total energy and the total
configuration energy by 24.23% and 38.5%, respectively, and allows 33.3%
more task deadlines to be satisfied, but also reduces the total execution time
by 16%. Compared to method M2, the energy-efficient coscheduling method
similarly satisfies all task deadlines, but it requires an additional execution
time of about 7.2%, while saving 24.1% total energy and 33.3% configuration
energy.

[Figure: final schedule of the tasks on FPGA columns 3 and 4 over time
0–10000. Annotation for key paths P2 and P4: the original finish time of
TC1 is at time 8600; 8600 − 6200 = 2400 > 1200, where 6200 = C2(TB1),
so the prolongment does not affect the execution time of TC1.]
[Figure: hardware task state machine — a task moves among the
Instantiated, Configured, Cached, and Terminated states via the
schedule/place/configure, allocate, start, preempt, wait, and release
transitions.]
Arrival Time (ai ): This is the time when a task is triggered for execu-
tion.
Period (pi ): This is the task period if the task is a periodic one.
Deadline (di ): This is the task deadline if the task is a real-time task.
In the JPEG encoder, the DCT, Quantization, and Huffman encoding can
be represented by the attributes listed in Table 4.5.
Summary
We introduced the motivation and the requirements for the design of an op-
erating system for reconfigurable systems. A layered architecture for OS4RS
was also described in detail. The design of several algorithms in an OS4RS
was also described. Finally, some real-world examples of OS4RS were cited
and compared. Though it is not yet possible to have a full-fledged OS that
meets all the requirements of an OS4RS that we have described, we believe we
will soon see such an OS as long as there is a consensus on the architecture of
reconfigurable systems. Interested readers may further explore the extensive
literature on this topic, some of which is given in the references.
The first one is the possibility to allow signals (routes) in the base design
to cross through a partially reconfigurable region without the use of a
bus macro.
The last feature is that this new approach supports Virtex-4 [101] and
[100] devices.
Developed exclusively for the 8.2, 9.1, and 9.2 versions of ISE.
Using Virtex, Virtex-II, and Virtex-II Pro Xilinx devices [218].
In such a scenario it can be noticed how this new flow provides a deep
improvement in the design of reconfigurable architectures, without changing
the context in which the previous one was also working. Consider some
aspects of the reconfiguration process, e.g., the bitstream relocation scenario:
it can be handled easily via the difference-bitstream approaches
[85, 65, 119, 47, 118, 108] and via blank-module approaches, but in the
EAPR case only with the introduction of a set of complex constraints that
will lead to the loss of all the presented advantages.
To derive the final architecture from the input specification, the dynamically
reconfigurable hardware has to be identified; each dynamically reconfigurable
hardware block can be considered as a hardware block that can be scheduled
for a certain time interval. During the partitioning phase it has to be decided,
for each part of the system, whether it has to be implemented in software,
in hardware, or in a reconfigurable hardware block. To aid in this decision,
some general guidelines have been developed. In the mapping phase the
functionalities defined by the executable specification are modified to obtain
thorough simulation results.
tion that can easily be applied to the system level of a design. In this phase,
in fact, it is possible to draw benefits from the general rules that guide the
partitioning and from the mapping phase. However, there is no detailed de-
scription of the following phases, which take place at the RTL level; thus
there are some implementation problems that cannot find a solution within
the ADRIATIC flow.
The aim of RECONF2 [30] is to allow the implementation of adaptive system
architectures by developing a complete design environment that benefits from
dynamic reconfiguration.
It is possible to use as input for this flow a conventional VHDL static de-
scription of the application or multiple descriptions of a given VHDL entity,
to enable dynamic switching between two architectures sharing the same in-
terfaces and area on the FPGA. The steps that characterize this approach are
the partitioning of the design code, the verification of the dynamic behavior,
and the generation of the configuration controller. The main limitation of
the RECONF2 solution is that it is not possible to integrate the system with
both a hardware and a software part, since both the partitioned application
and the reconfiguration controller are implemented in hardware in the final
system.
The works proposed in [202, 25, 39] are all examples of modular approaches
to the reconfiguration based on the Xilinx module-based [95] technique. The
flow proposed in [25] is largely manual; it is basically a proof of concept of
the Xilinx methodology, demanding deep interaction from a designer who
also has to be highly skilled in partial reconfigurable design techniques in all
the phases of the flow. In [202] the reconfigurable
computing platform is intended to be PC based. In such a context the host PC
will take care of the download of all the necessary files to partially reconfigure
the underlying reconfigurable architecture. Such an architecture has been
defined using a Xilinx FPGA, which can be accessed using common PC buses,
e.g., PCI, USB, or FireWire. The Proteus framework [202] does not provide substantial
improvements in the definition of a novel design flow for partial reconfigurable
architectures. This framework can be useful to develop applications that
will find benefits in the execution over the proposed reconfigurable platform,
since Proteus will take care of the communication between the host PC and
design synthesis and placement constraints assignment phase finds its rationale in the
identification and in the corresponding assignment of the physical placement
constraints needed to define the overall structure of the final architectural
solution.
On the other hand, in the reconfiguration management software part, there
is the need to develop, in addition to a control application that is able to man-
age the reconfiguration tasks, a set of drivers to handle both the reconfigurable
and the static components of the system. All these software applications are
compiled for the processor of the target system. The compiled software is
then integrated, in the Software Integration phase, with the bootloader, with
the Linux OS and with the Linux Reconfiguration Support, which extends
a standard OS, making it able both to perform reconfigurations of the re-
programmable device and to manage the reconfigurable hardware as well as
the static hardware, in order to allow component runtime plugin. The fol-
lowing step is the Bitstreams Generation, which is necessary to obtain the
bitstreams that will be used to configure and to partially reconfigure the re-
programmable device. Finally, the last step of the design flow is the Deploy-
ment Design phase, which aims at creating the final solution that consists of
the initial configuration bitstream, the partial bitstreams, the software part
(bootloader, OS, Reconfiguration Support, drivers, and controller) and the
deployment information that will be used to physically configure the target
device.
The self-dynamic reconfigurable architecture that will be used has a target
architecture defined by a static and a reconfigurable side. The physical
implementation of such an architecture, organized in three different layers, is
shown in Figure 5.2.
Starting from the bottom, Figure 5.2 shows the reference reconfigurable
architecture implemented over a Virtex II Pro VP7 (the layer containing the
BRAM memories and Power-PC 405 has not been depicted). The first layer
is the clock level that is, according to Xilinx Documentation [216, 95, 218],
routed at a different level with respect to the other signals. The second layer
contains the communication infrastructure layer between the static and the
reconfigurable side. The last one contains CLBs and Switch Matrices and
consequently all user logic, and the CoreConnect components are also imple-
mented at this level. The architecture foresees a static and a reconfigurable
portion; the static one is typically designed using standard tools, such as EDK
[99], or manually described in VHDL. It is composed of a processor (that can
be hardcore, i.e., a PowerPC, or softcore, i.e., a Xilinx Microblaze [215]) and a
set of cores used to implement an appropriate bridge to the reconfigurable por-
tion. It is important to notice how this kind of self-reconfigurable architecture
can be implemented using either the PowerPC or the Microblaze because of
the differences among the Xilinx FPGAs, i.e., a PowerPC can be used with
Virtex-II Pro FPGAs, but it is not available in the majority of the Virtex-4
and Virtex-5 devices.
communication infrastructure, and both its OPB and PLB buses will be used
to interface the hardware relocation filter proposed in this work. Whenever a
PowerPC-based architecture is used, the PLB version of the core will be
plugged into the system, while in a Microblaze solution the OPB version of
the core will be used. The processor will be used to execute the software side
of the application and to correctly manage the necessary calls to the hardware
components (with the corresponding reconfiguration, if necessary) designed to
speed up the overall computation. The reconfigurable part of the architecture
is a portion of the FPGA area on which different reconfigurable functional
units can be mapped, with only the requirement to implement an appropriate
bus interface. All the necessary configuration bitstreams can either be stored
in an internal memory or be loaded from external devices. In any case,
these files have to be sent over the bus to be used by the reconfiguration
controller. In this way an overhead is created on the bus, which can become
slower than the component that manages the reconfiguration process. This
scenario can lead to a slowdown of the entire process. Thus it is necessary
to introduce some sort of mechanism that can accelerate it.
solution, i.e., the number of reconfigurable slots and the need for runtime
relocation support.
The core that is responsible for the physical implementation of a
reconfiguration, i.e., an ICAP controller.
The input phase: this tool needs the VHDL description of the core and the
indication of the chosen communication architecture.
The Reader: it reads, interprets, and saves all the information needed by
the following step. The main operations performed by the reader are
the following:
The Writer: it writes the IP-Core VHDL description, performing the fol-
lowing actions:
Accept as inputs the generics and I/O signals lists, the core's entity
name, and the core's path.
Therefore, three output VHDL files are produced: the cleaned core, the
stub, and the top IP-Core description. The resulting IP-Core structure is
represented in Figure 5.4.
During the execution, if one of the two phases fails the tool is halted and an
error message, useful to understand where and why the problem occurred, is
generated. If the execution ends correctly the created IP-Core is ready to be
plugged into an EDK system. It is important to specify that the tool is not a
VHDL parser because it does not validate the analyzed code, except for the
entity declaration syntax. An important feature of the tool is that the address
decoding logic is automatically generated and included in the stub. Therefore,
with this solution, while creating a hardware module, the designer does not
have to deal with the core interconnections within the system, focusing his
efforts instead only on the functionality itself. In other words, the core must
not include any reference to the interfacing architecture.
The set of tests presented in Table 5.1 concerns several types of components,
starting from some small IP-Cores such as an IrDA interface to more complex
examples, e.g., a complex ALU, a video editing core that changes the image
coloring plane from RGB to YCbCr. Table 5.1 presents some relevant results,
considering both the input core (the first row for each core) that represents
the core logic, and the obtained component (the second row) that is the final
IP-Core produced by the IPGen tool [135]. For each one of them, the size
in terms of 4-Input LUTs and the number of occupied slices are illustrated,
both as absolute values and as the percentage with respect to the total size of
the FPGA. Columns labeled with ari and di present the number of available
resources (ari ) given the minimum area constraints to implement the core as
reconfigurable elements, and the density (di ) computed as the ratio between
the number of slices used to implement the core and the number of available
resources, ari . The last column shows the time needed by IPGen to create
the IP-Core.
On one hand the relative overhead due to the interface of the core logic with
the bus infrastructure is acceptable, both for the 4-Input LUTs and for the
occupied slices, especially when the core size is relevant. Therefore, from this
point of view, the best case is to have a big design with few and small ports,
while the worst one is to have a small core with many I/O signals. This result
may suggest the use of bigger cores, for example, putting together in a unique
module different functionalities. In system design, especially in a dynamic
reconfigurable architecture, this solution has to be adopted carefully, for two
main reasons: it implies a decrease in the degree of flexibility, which is one
of the main features of modular-based design, and the time required by the
reconfiguration becomes longer. Hence it is desirable to reach a reasonable
trade-off between the size of the modules and the occupation of the overhead
required by the communication logic.
In order to create the IP-Core, starting from the input core, IPGen needs
an execution time that is almost constant, 0.065 seconds on average.
The generated IP-Cores will then be provided as input to the Reconfigurable
Modules Creator or to the EDK System Creator tools, according to their
nature, static or reconfigurable, as defined in the system characterization file.
macro hardware file (NMC). Once this operation is also done, the hard macro
is ready to be used. An example of a custom bus macro, generated using
ComIC, is shown in Figure 5.6. Finally, the architecture generation stage
consists of the creation of the VHDL description of the top architecture where
the static component, the communication infrastructure, and the necessary
reconfigurable components are instantiated. This is achieved by analyzing the
VHDL descriptions generated by the previous stages. Experimental results
have shown that the most time-consuming stages are the static system cre-
ation and the architecture generation ones, therefore Table 5.2 reports the
results of a set of experiments where the static side of the architecture has
been designed using different kinds of processors, the PowerPC 405 and the
Xilinx Microblaze.
The first column reports the main characteristics of the static part of the
architecture under test. Let's take into consideration VP7MB1, meaning that
the target FPGA chosen to implement the final solution is the Xilinx Virtex II
Pro 7 (VP7) and that the processor instantiated on the static side was a Xilinx
Microblaze. We use 1 or 2 just to distinguish different static architectures,
where different means that they have been defined using different sets of
IP-Cores.
The accuracy of the #occupied slices value dramatically affects the quality of the
placement.
k We are now working on a new version of this framework, which is based on a non-greedy
approach.
number of reconfigurations [72]. This is possible because a module can be
requested for execution at different times and not only once. In such a
scenario there might be a placement solution able to keep this configuration
code on the reconfigurable device without affecting the quality of the schedule,
avoiding having to reconfigure the same module twice or more just because
it is no longer available on the reconfigurable device. Once the
new placement constraints are defined this information is stored in the system
characterization file and in the UCF file and provided, respectively, to the
system architecture generation stage in the System Description phase to im-
plement the correct communication infrastructure and to the context creation
stage in the System Generation phase.
A game theory-based technique is now under development. The idea is to
consider each core as a player in the game, where the objective of each player
is the best placement assignment [141]. The set of all the players identifies
the society. In this context the best solution for the society may not be equal
to the best assignment of each core. This means that a placement assignment
solution for a system corresponds to an equilibrium.
Previously it was underscored that one of the most critical steps to obtain a
good result consists of the choice of the initial floorplanning for each module.
A single objective function has been defined as
$$\Phi = A\,\frac{\tau_{longest}^2}{\max_{design}\{\tau_{longest}^2\}} + B\,\frac{\tau_{avg}^2}{\max_{design}\{\tau_{avg}^2\}} + C\,F \tag{5.2}$$

where τ_longest and τ_avg are the module's longest and average connection
delays and F is its fragmentation index.
In order to make each term of the sum belong to the interval [0, 1], the
time components have been scaled using the longest critical path. Hence,
if the considered module is responsible for the longest critical path then it
will have a normalized time component equal to 1, and any variation to the
critical path due to the floorplanning will be appreciated more than any other
improvement obtained, i.e., decreasing fragmentation.
the one used in the Xilinx ISE environment) that tend to spread logic along the
assigned area without caring, unless explicitly specified, about global timing.
Figure 5.10: Overview of the system used to assign the height and the width
of the area constraint of a generic module.
Height  Width  Height/Width  Longest Delay (Worst 10 Nets)  Average Conn. Delay  Fragmentation Index
13      13     1.00          3.90                           2.75                 0.85
17      17     1.00          3.87                           2.58                 0.49
21      21     1.00          3.45                           2.38                 0.32
23      23     1.00          3.99                           2.52                 0.27
 9      21     0.43          2.94                           2.39                 0.76
 9      17     0.53          3.20                           2.39                 0.94
 9      25     0.36          4.02                           2.67                 0.64
17       9     1.89          3.69                           2.79                 0.94
21       9     2.33          3.88                           2.54                 0.76
25       9     2.78          4.04                           2.42                 0.64
 5      31     0.16          3.74                           2.93                 0.93
The System Generation phase can be used to support both the Xilinx Early
Access Partial Reconfiguration-based [98] and the Module-based [95] reconfig-
urable architecture design flow. This phase has been divided into two different
stages: the Context Creation and the Bitstream Generation.
The Context Creation phase has been organized into three different
stages: Static Side Implementation, Reconfigurable Modules Implementation,
and Context Merging. The first one accepts as input the HDL files generated
during the system description phase and the placement information defined
via the design synthesis and placement constraints assignment phase. The
aim of this stage is the physical implementation of the static side of the final
architecture. The placement information of this component will be provided
as input to the Context Merging phase of the final architecture. A second
output, working with the Xilinx EAPR flow, is represented by the information
of the static side components that are placed into the reconfigurable region,
i.e., routing, CLBs usage. This information is stored in the arcs exclude file,
and it is represented by the dotted line, in Figure 5.11, used to bind the
Static Side Implementation and the Reconfigurable Modules Implementation
stages. The Reconfigurable Modules Implementation stage needs as input the
placement information for each module and the corresponding VHDL files
defined during the previous two phases and the arcs exclude file. This stage
defines the physical implementation of each reconfigurable component that
has to be passed as input to the hardware merging phase. It is composed of
three different steps: the NGDBuild, the mapping, and the final place and
route stage. The Reconfigurable Modules Implementation stage needs to be
executed for each reconfigurable component. Finally, the Context Merging
stage produces as a result the merging of the outputs produced by the two
previous stages. Table 5.4 reports the average execution time for each phase
composing the Context Creation stage. We report the results for all the
devices used in our tests: Xilinx Spartan3, Virtex II Pro, and Virtex 4.
Summary
This chapter, building on the information provided in Chapters 3 and 4, de-
scribes different solutions proposed in the literature to design and implement
dynamic reconfigurable systems. The main objective of this chapter is to de-
scribe a general and complete design methodology that can be followed as a
guideline for designing reconfigurable embedded systems. The idea behind
the methodology presented in this chapter is based on the assumption that
it is desirable to implement a flow that can output a set of configuration bit-
streams used to configure and, if necessary, partially reconfigure a standard
FPGA to realize the desired system. The proposed workflow aims at design-
ing a complete framework able to support different devices (i.e., [102, 101]),
multi-FPGA architecture (i.e., the RAPTOR2000 system [109]), different re-
configuration techniques ([95, 98]), and type of reconfiguration (i.e., internal
or external, mono and bi-dimensional) that allow a simple implementation of
an FPGA system specification, exploiting the capabilities of partial dynamic
reconfiguration. Regarding the software side of the desired solution, it can
be developed either as a standalone application or with the support of an
operating system, such as Linux, a solution that was presented in Chapter 4.
3. List at least three different generic design flows for partially reconfig-
urable systems.
5. What is a bus macro and what is it used for? Is there only one kind of
bus macro? If not, explain why.
List a couple of critical aspects that you have to take into account dur-
ing the design phase.
tool was applied to a face recognition system with image preprocessing such
as Bayer filter and face location based on Hough transform. Eight incorrect
reconfigurations of the hardware were detected by SymbC.
The Xilinx COREGen tool can be used to design reconfigurable cores for
the Xilinx FPGA platforms. Singh and Lillieroth [172] proposed dynamically
generating the question formula for a new dynamically reconfigured core
generated by COREGen and then using the theorem prover called Prover to
ensure that the dynamically calculated circuit is correct. When a system wishes
to perform architecture reconfiguration, it first updates its internal model
of the circuit and submits it to Prover for re-verification against the new
specification. Using lazy evaluation it can avoid re-proving parts of the circuit
that change, and the proof effort is concentrated on the reconfigured parts.
Fey et al. [67] proposed another approach to formal verification of reconfig-
urable systems. The configurations of a functional block in a reconfigurable
system were abstracted fully into a set of relations. Then, Binary Decision
Diagrams (BDDs) were used to represent the relations. Finally, circuits were
generated from the BDDs.
connector Service =
    role Client = request!x → result?y → Client
    role Server = invoke?x → return!y → Server
    glue = Client.request?x → Service.invoke!x →
           Service.return?y → Client.result!y → glue
then sequences the remaining three events and their data. This example il-
lustrates how the connector description language is capable of expressing the
conventional notion of providing and using a set of services.
Further, Wermelinger et al. [203] proposed an imperative language that can
be used to specify categorical diagrams and dynamic reconfiguration through
algebraic graph rewriting. This language supports component creation, com-
ponent refinement, connector creation, node removal, node query, and vari-
able assignment. The language uses category theory as a semantic foundation
for both configurations and reconfigurations. It has design characteristics for
computation, declarative characteristics for constraints, and operational char-
acteristics for reconfiguration. Following are some examples of how compo-
nents can be created, refined, and connected. The words in bold are keywords.
The keyword create is used to create a new component called Node1 of type
Design with conditions such as li := Expi. The component Node2 is created
as a refinement of the component Node1 with conditions li := Expi. A
connector called Node is created by applying refinements to Nodei.
[Node1 :=]
create Design
with l1 := Exp1 ∥ l2 := Exp2 . . .

[Node2 :=]
create Design2 as [Refinement(]Node1[)]
with l1 := Exp1 ∥ . . .

[Node :=]
apply Connector([Refinement1] Node1, . . .)
with l1 := Exp1 ∥ l2 := Exp2 . . .
Application Model: How must one define an application model such that
a generic application can be executed on the architecture model?
Section 6.4.6 gives the 4 task features and 5 partition features that
are evaluated in Perfecto.
posed for integrating the design algorithms and for the design space explo-
ration of dynamically reconfigurable systems through performance evaluation.
As shown in Figure 6.1, the design flow for dynamically reconfigurable sys-
tems is divided into two phases, namely a front-end design and a back-end
design. The front-end design phase takes an architecture model and an ap-
plication model, which might be derived from user-given system specification,
and then generates three kinds of tasks, namely hardware task, reconfigurable
hardware task, and software task. The back-end design phase synthesizes the
three kinds of tasks, performs more detailed hardware-software co-simulation
and then implements the full system using back-end tools such as hardware
synthesis tool, software compiler, gate-level simulator, and power estimation
tool. Perfecto nicely fits into the front-end design phase as the main tool for
design space exploration and performance evaluation.
As shown in Figure 6.2, Perfecto takes two inputs, namely an architecture
model and an application model, which are defined later in Sections 6.4.1.4
and 6.4.2, respectively. Hardware-software partitions are then generated by
Perfecto. For each partition, the scheduler in Perfecto schedules the recon-
figurable hardware and the software tasks. It is assumed here that the fixed
hardware accelerators are part of the dynamically reconfigurable system ar-
chitecture and thus not scheduled by Perfecto. Then, Perfecto simulates the
execution of the software tasks on a processor, places the hardware tasks in
the reconfigurable logic, and simulates their execution on the DRCL. Finally,
after all task executions in all partitions are simulated, Perfecto generates
several reports for the system designer, including partition evaluations, bus
access conflicts, and real-time placement information, which are described in
Section 6.4.6.
In the rest of this section, we will first describe the two inputs of Perfecto,
namely the reconfigurable architecture model and parameters in Section 6.4.1
and the application model and task parameters in Section 6.4.2. Later, in
Sections 6.4.3, 6.4.4, and 6.4.5, respectively, we will also describe the par-
titioning, scheduling, and placement algorithms that were implemented into
Perfecto for illustration purposes. Designers can always change these algo-
rithms or use the real-time information provided by Perfecto to construct a
suitable algorithm. It must be noted here that the algorithms themselves are
not the focus of design; rather, it is the evaluation framework design and how
it can benefit designers. The algorithms will be the focus of design when we
are designing an operating system for reconfigurable systems.
[Figures 6.1 and 6.2: in the overall design flow, the system specification
yields an architecture model and an application model, which feed Perfecto
(mapping, refinement, scheduling, and placement) in the front-end phase,
followed by co-synthesis and co-simulation with back-end tools. Within
Perfecto, partitions are generated, tasks are scheduled, software tasks execute
on the processor while hardware tasks are placed and executed in the DRCL,
and the loop repeats while more partitions remain. The target architecture
comprises a processor, RAM, a bus with an arbiter, the DRCL, and a
function ROM.]
The SystemC language [142] was used to develop this architecture model because design space
exploration is being performed at the system level and because the design
contains both hardware and software functions. In this architecture, a simple
bus model is used as the communication infrastructure for hardware tasks.
Here, simple means that there is no pipeline and no split transaction. The
bus model eliminates the need to do global routing after tasks are placed into
the DRCL. Function ROM is a memory storage to save configurations (e.g.,
bitstreams) that will be loaded into and executed in DRCL. The other system
models are described in the rest of this subsection.
2. If the last request had its lock flag set and the corresponding master is
requesting again, this request is selected from the waiting queue and
returned.
3. The request with the highest priority (smallest number) is returned from
the waiting queue.
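A sketch of this arbitration policy (the request records and field names are hypothetical, and the first selection rule, which precedes this excerpt, is omitted):

def select_request(waiting, last_granted):
    """Pick the next bus request: honor a locked transfer first, then
    fall back to priority order (smallest number = highest priority)."""
    if last_granted is not None and last_granted["lock"]:
        for req in waiting:                      # rule 2: same master re-requesting
            if req["master"] == last_granted["master"]:
                return req
    return min(waiting, key=lambda r: r["priority"])   # rule 3: highest priority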
to multiple behaviors and then selecting one of the behaviors for dynamic
execution. This method is the simplest, but quite inflexible because for every
new function, we need to modify the DRCL model. Similar to DRCF [147]
and DRC [151], Perfecto adopts this method because it is simple and can be
simulated quickly. For design space exploration, the speed of simulation is
of utmost importance. Other methods include the use of C function pointers
[162] and C++ templates, all of which might cause an overhead in SystemC
simulation performance and are thus not very suitable for design space ex-
ploration. Further, multiple sc_threads are used in the DRCL for modeling
partial reconfiguration. To avoid the use of function pointers, Perfecto uses a
function table and a task table, with interfaces for automatic insertion of new
functions and new tasks.
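The function-table mechanism can be pictured with the following sketch (conceptual only, and in Python rather than the SystemC that Perfecto actually uses; all names are hypothetical):

function_table = {}                      # name -> behavior

def register_function(name, behavior):
    """Insertion interface: new hardware functions register themselves."""
    function_table[name] = behavior

def drcl_execute(fn_name, data):
    """The DRCL model dispatches through the table, not raw pointers."""
    return function_table[fn_name](data)

register_function("dct", lambda block: "DCT(%s)" % block)
print(drcl_execute("dct", "8x8"))        # -> DCT(8x8)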
Wbus is the bus width. The basic unit is a word of 4 bytes. A designer
may specify the bus width in units of word. This parameter affects the
memory access counts of an application running in the target system.
Tmem is the memory access time. This parameter is the time for one
memory access. This parameter also affects the application response
time.
Nslice is the DRCL size. This is the total number of slices in a DRCL.
Apart is the partitioning algorithm. The default partitioning algorithm is
function-based partitioning. Users can implement their own partitioner
for task mapping or choose an optional method, including the common-
hardware-first or the random partitioning as described in Section 6.4.3.
Asched is the scheduling algorithm. The default scheduling algorithm is
simply FIFO. Users can implement their own scheduler to optimize an
application. Another scheduler also currently implemented in Perfecto
is an energy-efficient hardware-software co-scheduler [126] for reconfig-
urable systems.
Aplace is the placement algorithm. The default placement algorithm is a
rule-based one, which is described in Section 6.4.5. Users can implement
their own placer to place the hardware tasks into the DRCL.
Since Perfecto is used for design space exploration, an interface to TGFF
eliminates the need for users to specify an exact application for evaluation. The
TGFF interface allows users to thoroughly evaluate a reconfigurable system
along with its partitioning, scheduling, and placement algorithms because the
task graphs are randomly generated.
TGFF generates the task set information from a template by varying the
seed for the random number generator per template. The template param-
eters to be defined include the number of task graphs, the average number
of functions in a task graph, and the function attributes. The textual rep-
resentation of task graphs generated by TGFF is then automatically parsed
by Perfecto into intermediate task data structures that can then be used for
partitioning, scheduling, and placement, thus automating design space explo-
ration and performance evaluation.
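A minimal sketch of the intermediate task data structures into which the TGFF output might be parsed; the field names and the fixed successor bound are assumptions for illustration.

#define MAX_SUCC 8  /* assumed bound on out-degree for this sketch */

typedef struct task {
    int id;
    int func_id;                  /* function attribute from the TGFF template */
    int n_succ;
    struct task *succ[MAX_SUCC];  /* outgoing edges of the task graph          */
} task_t;

typedef struct {
    int n_tasks;
    task_t *tasks;  /* one entry per node of a parsed TGFF graph */
} task_graph_t;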
6.4.3 Partitioning
A partitioning algorithm maps each task in an application model to either software or hardware, based on some estimation criteria, such that the task is executed either on the microprocessor or in the DRCL, respectively. Perfecto adopts the simple function-based partitioning algorithm as its default, because the purpose was not to propose a new partitioning algorithm; the purpose was merely to check whether the framework can be used to efficiently evaluate dynamically reconfigurable system designs.
Though several partitioning algorithms were implemented in Perfecto, users
have to compare the results of the different partitioning algorithms manually
after Perfecto has generated the partitions by applying the algorithms one
by one. The partitioning results of different algorithms can be compared by
considering either the total number of partitions generated by an algorithm or
the quality of the partitions generated. After Perfecto applies the partitioning,
scheduling, and placement algorithms, the quality of the partitions can be
gauged. Details of the characteristics of partitions can be found in Section
6.4.6. An ideal partitioning algorithm is one that can generate the optimal
partition within a minimal number of partition results. Optimality in quality
and minimality in quantity are conflicting goals and thus a real partitioning
algorithm can only generate a near-optimal partition in a manageable number
of partitions. Users can thus select a partitioning algorithm based on the
desired trade-off between quality and quantity.
Users can also invent a new partitioning algorithm and implement it in Perfecto to check whether the evaluated performance improves. Perfecto was modularly designed such that independent data structures are used for the set of all tasks, the set of software tasks, and the set of hardware tasks. Thus, a user merely has to re-write the Generate_Partition() function to implement a new partitioning algorithm, as sketched below. Well-defined interfaces between this function and the subsequent scheduling algorithm thus help users implement new algorithms in Perfecto.
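The following C sketch, reusing the task_t type from the TGFF sketch above, shows the kind of Generate_Partition() rewrite a user might supply; the signature and the has_hw_estimate() criterion are hypothetical.

extern int has_hw_estimate(int func_id);  /* user-supplied criterion (hypothetical) */

typedef struct {
    task_t **all; int n_all;   /* set of all tasks                */
    task_t **sw;  int n_sw;    /* tasks mapped to the processor   */
    task_t **hw;  int n_hw;    /* tasks mapped to the DRCL        */
} partition_t;

static void Generate_Partition(partition_t *p) {
    p->n_sw = p->n_hw = 0;
    for (int i = 0; i < p->n_all; i++) {
        task_t *t = p->all[i];
        if (has_hw_estimate(t->func_id))
            p->hw[p->n_hw++] = t;   /* hardware task set */
        else
            p->sw[p->n_sw++] = t;   /* software task set */
    }
}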
6.4.4 Scheduling
A scheduling algorithm associates an ordering to the set of software tasks that
are ready so that they will be executed in that order on the microprocessor.
It also associates an ordering to the set of hardware tasks that are ready so
that they will be placed and executed in that order in the DRCL. This is
defined formally in Definition 6.4.
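A minimal C sketch of the default FIFO ordering, with one ready queue for software tasks and one for hardware tasks; the queue type and its fixed capacity are assumptions, and overflow checking is omitted.

typedef struct { task_t *buf[256]; int head, tail; } fifo_t;

static void fifo_push(fifo_t *q, task_t *t) { q->buf[q->tail++ % 256] = t; }

static task_t *fifo_pop(fifo_t *q) {
    return (q->head == q->tail) ? NULL : q->buf[q->head++ % 256];
}

/* One ready queue per task class, giving the two orderings the
   scheduling definition requires. */
static fifo_t sw_ready, hw_ready;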
6.4.5 Placement
A placement algorithm tries to find a feasible location in a fixed-size DRCL for
a given hardware task of some fixed size. The placement algorithm depends on
the underlying configuration model, namely paged 1-dimensional, segmented
1-dimensional, or 2-dimensional, where the basic units of configuration are a
fixed-size slot, a column, or a tile, respectively. In Perfecto, an abstract model
is used where the basic unit of configuration is simply called a slice. Thus, in
Perfecto the underlying configuration model could be any of the three.
[Figure: two snapshots (a) and (b) of DRCL slices S0 through S5, showing configured circuits C1 through C4 in the idle, execute, and done statuses.]
part, if any, is denoted by $loc''$, i.e., $n_{slice}(loc) = n_{slice}(loc') + n_{slice}(loc'')$, where $n_{slice}(loc') = n_{slice}(T_j)$. Then, the lists are updated as follows: $L'_{used} = L_{used} \cup \{loc'\}$ and $L'_{free} = \mathit{Merge\_Adj}(L_{free} \setminus \{loc\} \cup \{loc''\})$, where the function Merge_Adj() tries to merge adjacent free spaces into contiguous blocks of free space.
By changing the selection criteria for the feasible location at which to place a task, different placement algorithms can be invented. Currently, Perfecto uses the following placement policy (a C sketch follows the rules). The DRCL selects a block for configuration according to the following rules:
1. If there exists a block that is already configured with the same circuit,
but not executing currently, i.e., it is in the done status, then reuse the
block by selecting it for the current task.
2. If there exists a configured block with the same slice count, but with a
different function and at the done status, then configure the new task
in this block.
3. If there exists an unconfigured block, i.e., in the idle status, with enough
slices, then configure in that block.
4. If there is no free block with enough slices, the blocks in the done status are released into the idle status, and rules 1 to 3 are checked again.
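As mentioned above, the four rules can be sketched in C as follows; the block bookkeeping (status, circuit ID, slice count) is an assumed simplification of the DRCL model.

typedef enum { IDLE, EXEC, DONE } status_t;
typedef struct { status_t st; int func_id; int n_slices; } block_t;

static block_t *select_block(block_t *b, int n, int func, int slices) {
    int i;
    for (i = 0; i < n; i++)   /* rule 1: reuse an identical, finished circuit */
        if (b[i].st == DONE && b[i].func_id == func) return &b[i];
    for (i = 0; i < n; i++)   /* rule 2: same slice count, different function */
        if (b[i].st == DONE && b[i].n_slices == slices) return &b[i];
    for (i = 0; i < n; i++)   /* rule 3: an unconfigured block that fits      */
        if (b[i].st == IDLE && b[i].n_slices >= slices) return &b[i];
    for (i = 0; i < n; i++)   /* rule 4: release done blocks, then retry      */
        if (b[i].st == DONE) b[i].st = IDLE;
    for (i = 0; i < n; i++)   /* after releasing, only rule 3 can now match   */
        if (b[i].st == IDLE && b[i].n_slices >= slices) return &b[i];
    return NULL;              /* nothing available until a task finishes      */
}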
The following metrics are computed for each partition P, where t ranges over the tasks of P:

ADU(P) = \frac{\sum_t \bigl( TET(t,P) \times RR(t,P) \bigr)}{PET \times MS} \quad (6.1)

ACT(P) = \frac{\sum_t CT(t,P)}{\sum_t TET(t,P)} \quad (6.2)

AWT(P) = \frac{\sum_t BWT(t,P)}{\sum_t TET(t,P)} \quad (6.3)
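The following C sketch computes the three metrics from per-task statistics; the reading of Eq. (6.1), in particular the product TET x RR and the PET x MS denominator, follows the reconstruction above and should be treated as hedged.

typedef struct { double tet, ct, bwt, rr; } task_stats;

static void partition_metrics(const task_stats *s, int n,
                              double pet, double ms,
                              double *adu, double *act, double *awt) {
    double num = 0, tet_sum = 0, ct_sum = 0, bwt_sum = 0;
    for (int i = 0; i < n; i++) {
        num     += s[i].tet * s[i].rr;   /* assumed reading of Eq. (6.1) */
        tet_sum += s[i].tet;
        ct_sum  += s[i].ct;
        bwt_sum += s[i].bwt;
    }
    *adu = num / (pet * ms);   /* average DRCL utilization,   Eq. (6.1) */
    *act = ct_sum / tet_sum;   /* average configuration time, Eq. (6.2) */
    *awt = bwt_sum / tet_sum;  /* average bus waiting time,   Eq. (6.3) */
}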
The bus access conflict report shows, in real time, the number of tasks competing for bus access and which tasks are actually making requests. From this information, a designer can detect whether there is a bottleneck in system performance. The real-time placement information for each task in each partition can be used for further tuning and optimization.

After simulation, by analyzing the above results generated by Perfecto, a designer can select one or more partitions that best fit his or her needs. The criterion could be the least total execution time, the least average DRCL utilization, or the least average bus waiting time. All of these results become more apparent and intuitive through the application examples described in Section 6.4.7.
providing designers with detailed real-time information on how and when the tasks in each partition compete for bus access. From Table 6.3, we can observe that task T6 in partition P1 has the largest bus waiting time, 89 ns, among all the tasks in all partitions. This is also directly reflected in the average bus waiting time of partition P1 (2.8 ns), which is the largest among all eight partitions. To debug this performance bottleneck, we can analyze the bus access conflicts reported by Perfecto. Figure 6.8 shows the aggregate and the individual task bus accesses for partition P1 along the time axis. Analyzing the aggregate diagram, we can immediately identify a performance bottleneck in partition P1, where the maximum number of concurrent bus accesses is 3. Suppose the system designer would like to resolve this bottleneck. Through the individual task accesses provided by Perfecto, we can identify the tasks competing in this bottleneck of 3 concurrent bus accesses, namely tasks T1, T3, and T6, and we can pinpoint the exact time (200 ns) during which the three tasks compete for the bus. These details help designers resolve bottlenecks and optimize their designs.
[Figure 6.8: Aggregate (All) and individual bus accesses for tasks T1 through T6 of partition P1 along the time axis.]
[Figure: Real-time placement information showing hardware tasks T2, T3, T4, and T6 in DRCL slices S0 through S4, each moving through the idle, execution, and done statuses.]
Summary
We discussed the state of the art in verification technology for reconfigurable systems. System-level verification techniques, including formal verification and language-based approaches, were introduced. Hardware-software coverification techniques, including cosimulation and prototyping, were also discussed. Finally, we introduced a SystemC-based design space exploration framework called Perfecto that can be used for evaluating the performance of reconfigurable systems and also for designing an OS4RS.
With rapid technology progress, FPGAs are becoming more and more powerful and flexible, in contrast to inflexible ASICs. FPGAs such as the Xilinx Virtex II/II Pro, Virtex 4, and Virtex 5 can now be partially reconfigured at run-time to achieve higher system performance. Partial reconfiguration means that one part of the FPGA can be reconfigured while the other parts remain operational, unaffected by the reconfiguration. An embedded system incorporating such FPGAs, usually called a Dynamically Partially Reconfigurable System (DPRS), enables more and more intensive applications to be accelerated in hardware at run-time, thus reducing the overall system execution time [95, 213].
Many related research works use the capability for partial reconfiguration to enhance the flexibility and the performance of their proposed systems. Implementing a DPRS is a very complex task compared to implementing a traditional embedded system. A DPRS includes not only the software and hardware applications of a traditional embedded system, but also the dynamically partially reconfigurable hardware designs running on the FPGA. However, the detailed information on DPRS implementation is scattered, without any complete user or reference manual. Hence, in this chapter, we guide the designer through implementing a DPRS so that he or she can avoid spending a lot of time searching for and studying all the related documents and materials.
Section 7.1 introduces the capability for partial reconfiguration on the Xilinx Virtex family FPGAs, and then the physical constraints are described. After introducing the concepts, the latest partial reconfiguration design flow, namely the Early Access Partial Reconfiguration (EA PR) design flow provided by Xilinx [213], is introduced in Section 7.2. Section 7.3 gives a real case study on the design of a simple LED control design so that readers can more concretely understand the implementation steps for a DPRS design. Through the case study, readers can learn how to enhance an embedded system with the capability for partial reconfiguration using the Xilinx Virtex family FPGAs. In contrast to the pure hardware design with the capability for partial reconfiguration described in Section 7.3, the software-controlled approach to partial reconfiguration through a specific configuration controller provided with the Xilinx Virtex family FPGAs, namely the Internal Configuration Access Port (ICAP), is described in Section 7.4.
As shown in Figure 7.2, PRR_a and PRR_b are correctly placed in the FPGA. PRR_c and PRR_d overlap with each other in the same CLB column, which results in incorrect placement in the Virtex II/II Pro FPGAs.

For Xilinx Virtex 4 FPGAs, the basic configuration unit is a column of 16×1 CLBs that spans the height of a page and not the full chip, where the FPGA is divided into several regions called pages. Take the Virtex 4 XC4VFX12 FPGA, for example, as shown in Figure 7.3. This FPGA can be divided into eight pages, and a PRR can be placed across one or more pages, such as PRR_b. A PRR cannot include any Input/Output Block (IOB) column, which lies in the middle of the Virtex 4 FPGA, because a PRM cannot directly use the I/O blocks of the FPGA to interact with the external I/O devices. Thus, the placement of PRR_d is incorrect. Moreover, similar to the Virtex II/II Pro FPGAs, a basic configuration unit can be allocated to only one PRR, so the placement of PRR_e and PRR_f is incorrect. The Xilinx Virtex 5 FPGAs are also divided into a few pages for the placement of PRRs, except that the basic configuration unit is 20×1 CLBs.
Once the placement constraints for the different FPGA families are understood, every PRR allocated on the FPGA must include enough FPGA resources to (re)configure its PRMs. Here, the FPGA resource usage can be evaluated using the reports generated by a synthesis tool and a design analysis tool, such as the Xilinx PlanAhead tool. In the next section, a partial reconfiguration design flow and a specific hardware component, namely the bus macros provided by Xilinx, are described in detail for designing a DPRS.
[Figure 7.3: Acceptable and unacceptable PRR placements (PRR_a through PRR_f) on the Virtex 4 XC4VFX12, whose pages are 16 CLB rows high, with the IOB column in the middle of the device.]
The hdl directory is used to store all HDL files, including the top-level design, the static designs, and the PRMs.

The synth directory is used to store all the synthesis project files. The top folder includes the global logic hardware components and the top-level HDL file, where the static designs and PRMs are instantiated as black boxes. All the synthesis projects for the static designs are located in the static directory, while those for the PRMs are located in the corresponding PRR folders. For the DPRS architecture in Figure 7.4, the prr_a and prr_b directories each contain two different PRMs.

The non_pr directory includes all synthesized designs for the DPRS architecture. Though the DPRS architecture illustrated in Figure 7.4 can be configured into four different combinations of PRMs, it is strongly recommended that the functional correctness of each combination be validated before the partial reconfiguration flow is applied.
[Figure 7.4: DPRS architecture with two PRRs on a bus; PRR_a hosts PRM_a1 and PRM_a2, and PRR_b hosts PRM_b1 and PRM_b2.]
As shown in Figure 7.6, the EA PR design flow can be divided into several main steps. The Design Partitioning and Synthesis step is used to modify the HDL description to fit the DPRS architecture and to synthesize the designs into the hdl and synth directories. In the Design Budgeting step, the synthesized top-level design needs to meet the physical FPGA design constraints, including the locations of the bus macros and the global logic. The Non-PR Design step is used to integrate the synthesized top-level design, the synthesized static designs, all combinations of the synthesized PRMs, and the design constraints provided by the Design Budgeting step, so as to generate the full design bitstreams. By configuring each full bitstream, the functional correctness of the DPRS can be validated before going through the rest of the EA PR design flow. When the validation of the DPRS architecture in the Non-PR Design step is finished, the remaining steps of the PR implementation flow can be carried out.
[Figure 7.6: The EA PR design flow: from the DPRS specification through Design Partitioning and Synthesis, Design Budgeting, Non-PR Design, Top Level Implementation, Static Logic Implementation, PR Block Implementation, and Merge, producing the implemented static design, the implemented PRMs, the static_full bitstream, and the partial bitstreams for the PRMs.]
A version checker of the ISE tool can be used to check whether the installed ISE has been updated to the PR version. In the following subsections, the main steps are illustrated in detail.
Each design is separately synthesized using XST of the Xilinx ISE tool. Moreover, all PRMs in a PRR must be defined with the same port definitions and the same entity name, because the communication with the static designs or other PRRs is based on the ports of the PRR and not on those of the PRMs. In other words, every PRM in a PRR has the same interface for interacting with the static designs or other PRRs.
The signals of a PRR cannot directly interact with the external hardware designs. Here, a specific hardware component, namely the bus macro provided by Xilinx, must be used. The bus macros provided by Xilinx are hard macro designs, and thus they need to be defined as submodules for insertion into the top-level design. Before the top-level design, which is modified into a DPRS architecture, is synthesized, the bus macro sources have to be copied to the same directory. There exist several different types of bus macros to fit the requirements of the design architecture, which include the direction, synchronicity, physical width, and the FPGA family. The bus macro naming convention can be described as busmacro_device_direction_synchronicity_width.nmc for CLB bus macros and busmacro_device_synchronicity.nmc for single-slice bus macros, as shown in the following table.
The left side of Figure 7.7 shows a simple example illustrating the CLB bus macro usage, where four different directions of CLB bus macros are used to pass signals into the PRR. Regarding the width of CLB bus macros, as shown in the right side of Figure 7.7, the narrow bus macros are two CLBs wide, while the wide bus macros are four CLBs wide. However, the data width of a CLB bus macro is only 8 bits. To extend the bus width of CLB bus macros in a CLB row, three wide CLB bus macros can be nested in the same CLB row to achieve a 24-bit bus width, as shown in Figure 7.8. Compared to the CLB bus macros, the single-slice bus macros each occupy only a single slice.
[Figure 7.7: CLB bus macro usage: L2R, R2L, T2B, and B2T bus macros passing signals between the static region and the PRR; (a) narrow and (b) wide bus macros.]
MODE is used to define the PRR property, preventing errors during the static design and PRM implementation. The MODE of every PRR must be defined as RECONFIG, e.g., AREA_GROUP "pblock_name" MODE=RECONFIG;.
LOC is used to set the I/O pins, the global clock primitives, and the bus macros. However, there are some constraints on the placement of CLB bus macros. The CLB bus macros can only be placed on the boundary of a PRR, where the x-coordinate of the bus macro equals the x-coordinate of the PRR boundary.
The top-level design, the static designs, one combination of the PRMs, and the modified UCF file must be copied to the non_pr directory. Then the following commands are used to generate the bitstream.
For the DPRS architecture in Figure 7.4 and the directory structure in Figure 7.5, the four different combinations of PRMs in the DPRS need to be individually generated by following these commands. After validating the functional correctness of every combination of PRMs, the PR implementation design flow can be used to generate the static_full bitstream for the static area and the partial bitstreams for the PRMs.
7.2.7 Merge
The final step of the PR implementation flow is to merge the top-level design, the static designs, and the PRMs. The implemented static design and the implemented PRMs are copied to the corresponding PRM directories, and then the static_full bitstream and the partial bitstreams for each PRM can be generated using the following commands.
In the same way, the remaining partial bitstreams for the PRMs can be generated using similar commands. Additionally, specific partial bitstreams, namely blank bitstreams, are also generated to clear the PRM in a PRR. The following are the final bitstreams after going through the EA PR design flow:
[Figure 7.9: The LED control design on the XC4VFX12: the Base design connects the Button input through bus macros (BM) to the Led and Led_NSEWC blocks, which drive the LEDs.]
static_full.bit
prr_a_blank.bit (used to clear PRR_a)
prr_b_blank.bit (used to clear PRR_b)
prm_a1_routed_partial.bit (used to load PRM_a1 into PRR_a)
prm_a2_routed_partial.bit (used to load PRM_a2 into PRR_a)
prm_b1_routed_partial.bit (used to load PRM_b1 into PRR_b)
prm_b2_routed_partial.bit (used to load PRM_b2 into PRR_b)
[Figure 7.10: (a) The hierarchy of the LED controller: top.vhd with the submodules base.vhd, led.vhd, and led_NSEWC.vhd; (b) the top-level design after the BUFG and the bus macros are inserted.]
Among the LEDs are the North, South, East, West, and Center LEDs, which are triggered by five corresponding buttons. The LED controller design has a hierarchical DPRS architecture, consisting of a top-level design (top.vhd) and three submodule designs, namely base.vhd, led.vhd, and led_NSEWC.vhd, as illustrated in Figure 7.10(a).
As illustrated in Figure 7.9, the Base design block receives the inputs from the five buttons, which are then passed on to both the Led and Led_NSEWC design blocks. The Led block maps the input data {N, S, E, W} to {0, 1, 2, 3}, and the Led_NSEWC block maps {N, S, E, W, C} to {N, S, E, W, C}. The exact I/O mappings are determined by the designer at design time for each block. The mapped data are then sent from these two blocks to the Base design, which then controls all nine LEDs. In this case study, the Led and Led_NSEWC design blocks are implemented as PRMs. There are two versions of each of the two blocks, differing only in their mapping functions. By reconfiguring different PRMs into the two PRRs, different LEDs will be switched on for the same button inputs, based on the configured mapping functions (PRMs). Next, we illustrate, step by step, how to modify the LED control design into a DPRS architecture.
In the EA PR design flow described in Section 7.2, the first step is to modify the top-level HDL description as follows. A global buffer (BUFG) and bus macros have to be inserted in the top-level design, as shown in Figure 7.10(b). Some rules that can help designers clearly understand the HDL-description modification are as follows; Figure 7.11 illustrates how these four rules are applied to the LED controller example.
Rule 1: The PRMs cannot directly connect to the FPGA I/O pins in the top-level design. The signals of PRMs need to be passed to/from the static designs or connected through the bus macros in order to reach the FPGA I/O pins.
Rule 2: Every PRM must include a triggering (clock) signal.
After modifying the top-level HDL description, the four HDL files shown in Figure 7.10(a) must be separately synthesized following the approach described in Section 7.2.1. The top-level design includes the bus macros and the I/O buffer. Moreover, the FPGA pin assignments for the top-level design need to be defined in the UCF file, as shown in Figure 7.12. However, before the PR implementation flow is used, the synthesized top-level design, the bus macros, and the UCF file need to be copied to a directory, without including the synthesized Base, LED, and LED_NSEWC designs. In the following, the use of the PlanAhead tool for implementing the DPRS architecture is illustrated.
Enabling PlanAhead-PR
This step is required because the PlanAhead tool does not automatically
enable the partial reconfiguration feature. Enter the following command into
the PlanAhead Tcl console for enabling the partial reconfiguration features.
hdi::param set -name project.enablePR -bvalue yes
Floorplan
After the synthesized top-level design is inserted into the PlanAhead project,
the Netlist view should be as shown in Figure 7.13.
1. Select and right-click the Base design in the Netlist view, and select New Pblock; use the default name pblock_base_sub for the following constraint in the UCF file.
INST "BASE_sub" AREA_GROUP = "pblock_BASE_sub";
2. Select and right-click the LED and LED_NSEWC designs in the Netlist view, then select Draw Pblock to draw a slice range in the Device view for the following constraints in the UCF file.
AREA_GROUP "pblock_NSEWC_sub" RANGE=SLICE_X30Y106:SLICE_X41Y127;
AREA_GROUP "pblock_NSEWC_sub" RANGE=RAMB16_X2Y14:RAMB16_X2Y15;
AREA_GROUP "pblock_NSEWC_sub" RANGE=DSP48_X0Y28:DSP48_X0Y31;
3. Mark the Pblocks pblock_LED_sub and pblock_NSEWC_sub as reconfigurable. In the Physical Hierarchy view, select each of the Pblocks pblock_LED_sub and pblock_NSEWC_sub and click the Attributes tab in the Pblock Properties view. Then, click the Define New Attribute button, select MODE, and set its value to RECONFIG, creating the following constraints in the UCF file.
AREA_GROUP "pblock_NSEWC_sub" MODE=RECONFIG;
AREA_GROUP "pblock_LED_sub" MODE=RECONFIG;
4. Select the location for each of the bus macros and the global buffer. Change the selection mode to Create Site Constraint Mode as shown in Figure 7.14. Drag the four bus macros and the global buffer to the following locations in the UCF file.
INST "ToLED" LOC = SLICE_X2Y114;
INST "FromLED" LOC = SLICE_X2Y120;
INST "ToNSEWC" LOC = SLICE_X28Y110;
INST "FromNSEWC" LOC = SLICE_X28Y120;
INST "bufg_clk" LOC = BUFGCTRL_X0Y3;
NET "clk" LOC = AE14;
Netlist Export
After a design passes the DRC check, it can be used for the PR implementation. Select the menu File → Export Floorplan and set the export mode to Partial Reconfig. The synthesized top-level design, the new UCF file, and the bus macro files will be automatically copied to the top-level directory. The purpose of this step is to use the exported design to generate a static_full bitstream for the static design and a partial bitstream for each of the PRMs. The tool automatically runs the ngdbuild, map, par, trace, and PR assemble scripts to create the bitstreams. After the floorplan is exported, the PlanAhead project should include the directories shown in Figure 7.15. The synthesized static design, one of the LED PRMs, and one of the LED_NSEWC PRMs must be copied to the static, pblock_LED_sub_CV, and pblock_NSEWC_sub_CV directories, respectively.
After the required sources are copied to the corresponding directories, start the flow wizard from the menu Tool → Run Partial Reconfig. This will either create a script to be run from a command line or run in an interactive mode. Select Run Place & Route, and then run the implementation steps in order.
1. Budgeting
In the budgeting phase, the UCF file is used to ensure the correctness of the design. Select the Budgeting item to start the partial reconfiguration flow. Accept the default settings for the step and click Finish. The Place & Route Output window will appear; it should report 0 errors for this phase to pass.
2. Static Logic Implementation
In this step, the static portion of the design is implemented. Select the Static Logic Implementation item, accept the default settings for the step, and click Finish. The Place & Route Output window will appear; it should report 0 errors for this phase to pass.
3. PR Block Implementation
In this step, each of the PRMs in the design is implemented. Select the PR Block Implementation item and then click the Add button to add the pblock_LED_sub and pblock_NSEWC_sub Pblocks. Accept the default settings for the step and click Finish. The Place & Route Output window will appear; it should report 0 errors for this phase to pass.
4. Assemble
In this step, the static designs and PRMs are merged to generate the corresponding bitstreams for configuring the FPGA. Select the Assemble item, accept the default settings for the step, and click Finish. Finally, the bitstreams are generated in the merge directory.
In the same way, copy the other two PRM netlists to the pblock_LED_sub_CV and pblock_NSEWC_sub_CV directories. Then, go through the PR Block Implementation and Assemble steps again to generate the remaining partial bitstreams. The final generated bitstreams should include:
static_full.bit
pblock_led_sub_blank.bit (used to clear pblock_LED_sub)
pblock_nsewc_sub_blank.bit (used to clear pblock_NSEWC_sub)
pblock_led_sub_cv_routed_partial_A.bit
pblock_led_sub_cv_routed_partial_B.bit
pblock_nsewc_sub_cv_routed_partial_A.bit
pblock_nsewc_sub_cv_routed_partial_B.bit
Next, start the iMPACT tool to download the bitstreams to the FPGA in order to check the correctness of the DPRS. First, static_full.bit is configured into the FPGA. A partial bitstream for each of the pblock_LED_sub and pblock_NSEWC_sub Pblocks is then configured into the FPGA. Press the FPGA push buttons to check which LEDs are switched on. Next, reconfigure the other partial bitstreams of the pblock_LED_sub and pblock_NSEWC_sub Pblocks into the FPGA to verify the correctness of partial reconfiguration. After the partial reconfiguration is done, the LED results should be different.
[Figure: DPRS hardware architecture on the target board: a PowerPC on the PLB with a bridge to the OPB; the OPB connects the HWICAP (driving the ICAP), the CF card, the terminal, and the JTAG interface, with DDR SDRAM and a DCM for memory and clocking, and a PRR attached to the system.]
1. Create a new EDK project, and then select the Virtex-II Pro ML310
Evaluation Platform.
2. Set both data and instruction on-chip memory to 8 KB.
[Figure: DPRS implementation flow: system design, creation of a software-accessible interface, adding the PRM to the top-level design, synthesis, PR design budgeting, and generation of the static_full and partial bitstreams, with functional validation after each stage.]
3. Select the RS232_Uart, DDR_SDRAM_32Mx64, SPI_EEPROM, LED_8Bit, LCD_OPTIONAL, and SysACE_CompactFlash peripherals, where the RS232_Uart, DDR_SDRAM_32Mx64, SPI_EEPROM, and SysACE_CompactFlash peripherals must include interrupt support.
4. Set the memory size of PLB_BRAM_IF_CNTLR to 128 KB.
5. In the IP catalog, expand the item FPGA Reconfiguration and then select opb_hwicap to attach it to the OPB.
6. Use the Generate Addresses button to automatically assign the memory range for each hardware device.
5. In the IP catalog, expand the item Project Repository and then select the new peripheral, which was created using the IPIF and attached to the OPB. Finally, use the Generate Addresses button to automatically assign the memory range for each hardware device again.
6. Select the menu Hardware → Generate Netlists to generate the netlists. All HDL files of the peripherals will be located in the hdl directory.
In this step, the PRM needs to be inserted into the top-level design and connected to the software-accessible communication interface.
1. The RSA and CRC designs are first integrated with the partially reconfigurable template [93], as shown in Figure 7.18. Separately synthesize the RSA and CRC designs using the ISE tool, and then copy one of the CRC and RSA designs to the implementation directory.
2. The HDL of the top-level design needs to be re-modified to connect the communication interface and one of the RSA and CRC designs. Here are parts of the modified top-level HDL description.
4. After the communication interface is created using the IPIF, the drivers directory includes the standalone drivers with which users control the communication interface. The software application must include the header file of the communication interface in order to interact with it. Then, right-click the software project and set Build Project to generate the software executable file (.elf). Here are parts of the software application for the communication interface.
...
#include "comm.h"
...
int rsa() {
    ...
    /* write data, exponent, and modulus to the slave registers */
    COMM_mWriteSlaveReg2(XPAR_COMM_0_BASEADDR, data);
    COMM_mWriteSlaveReg3(XPAR_COMM_0_BASEADDR, exp);
    COMM_mWriteSlaveReg4(XPAR_COMM_0_BASEADDR, mod);
    xil_printf("Start RSA Encrypting!\n\r");
    /* pulse register 14 to start the RSA core */
    COMM_mWriteSlaveReg14(XPAR_COMM_0_BASEADDR, 1);
    COMM_mWriteSlaveReg14(XPAR_COMM_0_BASEADDR, 0);
    /* poll register 15 until the core signals completion */
    while (COMM_mReadSlaveReg15(XPAR_COMM_0_BASEADDR) != 1) ;
    xil_printf("Encryption Done!\n\r");
    cypher_text = COMM_mReadSlaveReg10(XPAR_COMM_0_BASEADDR);
    xil_printf("Cypher Text: 0x%08x \n\r", cypher_text);
}
5. When the top-level design is re-synthesized, click the menu item Device Configuration → Download Bitstream. The software executable file and the full bitstream will be combined into a bitstream named download.bit. Then the iMPACT tool is invoked to configure the bitstream.
Modify to PR Architecture
When the full bitstream is validated, the top-level HDL description needs to be modified once more. As illustrated in Section 7.2.1, the bus macros are added to act as a bridge between the communication interface and the PRR. Moreover, the clock signal cannot be passed directly to the PRR, so a global buffer is used to transfer the clock signal from the DCM to the PRR. In this design, the bus macros used include busmacro_xc2vp_l2r_async_narrow, among others.
The top-level design needs to be re-synthesized because its HDL description has been modified. Moreover, the bus macro files (.nmc format) must be copied to the synthesis directory before synthesizing the top-level design. Change to the synthesis directory and use the following command to re-synthesize the top-level design:
edk_project\synthesis> xst -ifn system_xst.scr
PR Design Budgeting
This step uses the top-level netlist for design budgeting. A PRR needs to be allocated, and the bus macros and the global buffer need to be assigned to specific locations in the FPGA.
1. Create a pr directory to store all the required files. Place copies of the top-level netlist (system.ngc) and the RAM memory map files (system.bmm and system_bd.bmm) in the implementation directory, and the base design constraint file (system.ucf) in the data directory, along with the bus macro files.
2. Open a new PlanAhead project and import the top-level netlist (system.ngc) and the base design constraint file (system.ucf). Create a Pblock with MODE = RECONFIG for the RSA and CRC PRMs, and use another Pblock to group the remaining hardware components. Lastly, assign the global buffer and the bus macros one by one to specific locations in the FPGA.
3. After passing the design rule check, export the PR design floorplan for PR implementation.
4. Generate the full bitstream of the non-PR design and use the iMPACT tool to download it into the FPGA. However, the bitstream includes only the hardware design, so the software design needs to be downloaded separately. Here, the XMD debug tool is used to download the software executable file: select the menu Debug → Launch XMD, use the default settings, and change to the software project directory. Finally, use the following commands to download the software executable file to the microprocessor.
XMD% rst (reset the processor)
XMD% dow executable.elf (download the executable file)
XMD% run (run the executable file)
After the floorplan is exported, the PlanAhead project should include three main directories: static, pblock_PRR_CV, and merge. Then, all netlists, except for the PRM and top-level netlists, are copied to the static directory, while one of the PRM netlists (PR_template0.ngc) is copied to the pblock_PRR_CV directory. Additionally, the RAM memory map files (system.bmm and system_bd.bmm) are copied to the merge directory for the bitstream generation.
When all the required files have been copied to the corresponding directories, enable PlanAhead-PR and run the EA PR design flow to generate the static_full and partial bitstreams. To generate the partial bitstream for another PRM, copy that PRM netlist to the pblock_PRR_CV directory and then run the PR Block Implementation and Merge steps again. The final bitstreams should include:
static_full.bit
pblock_prr_blank.bit
pblock_prr_cv_routed_partial_RSA.bit
pblock_prr_cv_routed_partial_CRC.bit
Use the iMPACT tool to download the static_full bitstream and the partial bitstream of RSA. Similarly, use XMD to download the software executable for validation. After reconfiguring the partial bitstream of CRC, run the software executable file again to check whether the CRC design works correctly.
When partial reconfiguration through the JTAG works correctly, in this step we use the ICAP to partially reconfigure the FPGA. By using the ICAP driver, the software application can dynamically reconfigure the FPGA without any external support.
Different from the partial reconfiguration described in Section 7.3, this case study includes both hardware and software designs. Furthermore, the software application can use the ICAP driver to reconfigure the partial bitstreams into the FPGA at run-time. Take this design, for example: while the system is running and there is a requirement for the hardware RSA design, the currently unused CRC PRM in the PRR can be partially reconfigured into the RSA PRM without switching off the system. Compared to traditional embedded systems, this dynamically adaptive capability can significantly enhance system performance and flexibility. Within its limited logic resources, the system can include multiple hardware functions whose total amount of required logic resources is much more than that available in the FPGA device.
The OS4RS includes not only the basic OS management but also the PRMs. A PRM in the OS4RS is akin to a software task in a conventional OS, and thus the PRMs can be called hardware tasks. However, the CRC and RSA designs are both integrated with the same partially reconfigurable template [93], such that a driver is designed for a PRR and not for the PRMs. The driver only provides an interface for the software application to interact with the PRR, where the functionality of the software-accessible registers differs according to the configured PRM. As shown in Figure 7.18, software-accessible register 2 is mapped to the first input data port of both the RSA and CRC designs. The driver is only responsible for writing the data to software-accessible register 2; however, the meaning of the data at the first input data port differs according to the definition of the configured PRM. The software application must be aware of this interaction so that it can control the PRM through the same driver, as sketched below.
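The sketch below illustrates this virtual-driver idea in C: the physical PRR driver exposes raw register access, while a per-PRM table gives register 2 its module-specific meaning. All names are assumptions; the actual OS4RS code differs in detail.

typedef struct {
    const char *name;                   /* "rsa" or "crc"                       */
    void (*write_input)(unsigned data); /* PRM-specific meaning of register 2   */
} prm_vdriver;

extern void prr_write_reg(int reg, unsigned val);  /* physical PRR driver (assumed API) */

static void rsa_write_input(unsigned d) { prr_write_reg(2, d); /* RSA plaintext  */ }
static void crc_write_input(unsigned d) { prr_write_reg(2, d); /* CRC input data */ }

static prm_vdriver vdrivers[] = {
    { "rsa", rsa_write_input },
    { "crc", crc_write_input },
};

/* Switched whenever a new PRM is reconfigured into the PRR. */
static prm_vdriver *active = &vdrivers[0];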
The OS4RS design flow is illustrated in Figure 7.19. The establishment of the DPRS architecture is similar to that in Section 7.4. The EDK project needs to export the information on the system environment, such as the memory size and the memory mapping of each device. For each PRR, when the IPIF is used to design a communication interface, a physical driver for the standalone design is generated by the Xilinx EDK tool. On top of this physical driver, we need a virtual driver for each PRM. All the physical PRR drivers and the virtual PRM drivers are configured into the OS4RS kernel. Then, the iMPACT tool is used to download the non-PR full bitstream, as described in Section 7.4. Finally, we use the XMD tool to download the OS image file to the board for validation. After the functional correctness is validated, the ICAP driver proposed by John Williams [209] is integrated into the OS4RS kernel. The partial bitstreams and the software executable files are stored in the root directory of the file system, and the ICAP driver is then used for partial reconfiguration. Here, uClinux is used to build the OS4RS. The detailed steps of the OS4RS design flow are illustrated in the following.
Compared to the standalone design, this step rebuilds the software libraries for OS integration.
Build OS Environment
The OS4RS is based on the uClinux OS, whose kernel version is 2.4.26. The cross-compiler for the PowerPC processor needs to be installed in the Linux development environment. In the following, the integration between the hardware architecture created by the EDK tool and the uClinux OS is discussed in detail.
1. Use the arch and driver directories that were generated in the previous step to replace the directories with the same names in the kernel source.
2. In the main directory of the kernel source, use the command make menuconfig to configure the kernel. Under the item General Setup, select Default bootloader kernel arguments and set Initial kernel command string to console=ttyS0,9600 root=/dev/xsysace/disc0/part3 rw to locate the file system in the third partition of the flash card, as shown in Figure 7.20.
3. Because the target microprocessor is a PowerPC, the main Makefile must be modified as follows.
ARCH := ppc
CROSS_COMPILE = powerpc-405-linux-gnu-
4. When the version of a device driver is updated, the Makefile for the device driver must include the object file of the new device. For example, xsysace_sinit.c is added to the sysace device, and thus the Makefile is modified as follows.
xilinx_sysace-objs += xsysace.o xsysace_sinit.o xsysace_g.o xsysace_intr.o xsysace_l.o
5. Use the make dep && make zImage command to generate the kernel image after all settings and modifications. The kernel image will be located in the directory arch/ppc/boot/images/ and the file is named zImage.elf.
6. The file system will be put on a flash card. First, use a partition tool, such as fdisk, to divide the flash card into three partitions. Partition 1 is formatted with the FAT16 format. Here, the mkdosfs tool provided by Xilinx is used in the Windows OS as follows.
mkdosfs -s 64 -F 16 -R 1 NAME:
(NAME is the drive letter of the flash card in the Windows OS.)
7. After the flash card is divided into three partitions, the file system ramdisk.image provided by Xilinx is placed in Partition 3 using the following commands in the Linux OS.
>gzip -d ramdisk.image.gz
>mkdir ramdisk_fs
>mount -o loop ramdisk.image ramdisk_fs
>cd ramdisk_fs
Finally, copy all files in the ramdisk to Partition 3 of the flash card.
8. Use the iMPACT tool to download the non-PR full bitstream, and then
use XMD to download the kernel image zImage.elf.
Similar to a common device driver in Linux, the PRR driver consists of the basic operations, such as the open, release, write, read, and ioctl operations. Different from the software application described in Section 7.4, the control of the software-accessible registers must be inserted into the basic driver operations. In the following, the basic method for inserting a new PRR driver into the Linux kernel is described, and a minimal driver skeleton is sketched below.
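The following is a minimal sketch of such a PRR character driver for the 2.4 kernel used here; the register base address, register offset, major number, and device name are assumptions, and memory mapping (ioremap) is omitted for brevity. The read and ioctl operations would follow the same pattern.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <asm/io.h>

#define PRR_BASE  0x40600000UL  /* assumed PRR register base (already mapped) */
#define PRR_MAJOR 240           /* local/experimental major number            */

static ssize_t prr_write(struct file *f, const char *buf,
                         size_t len, loff_t *off)
{
    unsigned int val;
    if (len < sizeof(val) || copy_from_user(&val, buf, sizeof(val)))
        return -EFAULT;
    /* forward the user data to software-accessible register 2 */
    out_be32((volatile unsigned *)(PRR_BASE + 2 * 4), val);
    return sizeof(val);
}

static int prr_open(struct inode *i, struct file *f)    { return 0; }
static int prr_release(struct inode *i, struct file *f) { return 0; }

static struct file_operations prr_fops = {
    owner:   THIS_MODULE,
    open:    prr_open,
    release: prr_release,
    write:   prr_write,
};

static int __init prr_init(void) {
    return register_chrdev(PRR_MAJOR, "prr", &prr_fops);
}
module_init(prr_init);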
5. Write a test program and compile it into an executable file. Then, use the make dep && make zImage command to regenerate the kernel image and validate the functional correctness.
This step integrates the John Williams ICAP driver [209] into the OS4RS for partial reconfiguration. Similar to the step for the PRR driver design, first create a xilinx_hwicap sub-directory in the driver/char directory of the kernel source to hold all the ICAP sources.
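Once the driver is in place, user-space reconfiguration reduces to streaming a partial bitstream into the ICAP device node, as in the following sketch; the /dev/icap node name is an assumption based on the character-device nature of the driver [209] and should be checked against the actual driver registration.

#include <stdio.h>

static int reconfigure(const char *bitfile) {
    FILE *in   = fopen(bitfile, "rb");
    FILE *icap = fopen("/dev/icap", "wb");  /* assumed device node */
    char buf[4096];
    size_t n;
    if (!in || !icap) return -1;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        fwrite(buf, 1, n, icap);            /* stream the partial bitstream */
    fclose(in);
    fclose(icap);                           /* reconfiguration completes    */
    return 0;
}

/* e.g., reconfigure("pblock_prr_cv_routed_partial_CRC.bit"); */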
This step combines the static_full bitstream and the OS image file into a System ACE file.
2. Select the menu item Project → Launch EDK Shell. A command window will be displayed. Use the following command to generate the System ACE file.
$ xmd -tcl genace.tcl -opt xupGenace.opt
3. Put the system.ace file into Partition 1 of the flash card. Finally, power on the ML310 board again. When uClinux boots successfully, run the software application to check whether the partial reconfiguration works.
If the software application can interact with the RSA and CRC designs and the partial reconfiguration works through the ICAP, a simple OS4RS design has been created.
Summary
This chapter introduced the implementation of the DPRS architecture. First, the pure-hardware partially reconfigurable design for controlling the LEDs was illustrated, along with many basic design methods and flows for a partially reconfigurable system architecture. The placement constraints for partial reconfiguration on the Xilinx Virtex families were also described. In Section 7.4, a software/hardware embedded design created with the EDK was modified into a dynamically partially reconfigurable system; through the implementation steps, the ICAP can be used to control the partial reconfiguration without using the JTAG. Lastly, the implementation of an actual OS4RS design was described. Through the illustration of the OS4RS implementation, a PRM can be defined as a hardware task in an embedded OS; in this way, computation-intensive functions can be dynamically executed as hardware tasks to enhance system performance. After reading this chapter, the implementation of a DPRS should be understood more concretely.
4. How can you modify the top-level design into a partially reconfigurable
architecture?
5. What are bus macros? How many types of bus macros are there?
6. What is the difference between the driver in a traditional embedded OS
and that in an OS4RS?
[69] W. Fornaciari and V. Piuri. Virtual FPGAs: some steps behind the physical barriers. In IPPS/SPDP Workshops, pages 7–12, 1998.
[70] S. Ganesan and R. Vemuri. An integrated temporal partitioning and partial reconfiguration technique for design latency improvement. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'00), pages 27–30, 2000.
[71] S. Ghiasi and M. Sarrafzadeh. Optimal reconfiguration sequence management. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pages 359–365, 2003.
[72] M. Giorgetta, M. D. Santambrogio, P. Spoletini, and D. Sciuto. A graph-coloring approach to the allocation and task scheduling for reconfigurable architectures. In 14th IFIP International Conference on Very Large Scale Integration (IFIP VLSI-SoC), pages 24–29, October 2006.
[73] M. Gokhale and P. S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, 2005.
[74] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor. PipeRench: a reconfigurable architecture and compiler. Computer, 33(4):70–77, April 2000.
[75] J. Gu, P. W. Purdom, J. Franco, and B. W. Wah. Algorithms for the satisfiability (SAT) problem: a survey. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 35:19–151, 1997.
[76] J. Hagemeyer, B. Kettelhoit, M. Koester, and M. Porrmann. A design methodology for communication infrastructures on partially reconfigurable FPGAs. In Proceedings of the 17th IEEE International Conference on Field Programmable Logic and Applications (FPL 2007), pages 331–338, August 2007.
[77] J. Hagemeyer, B. Kettelhoit, M. Koester, and M. Porrmann. Design of homogeneous communication infrastructures for partially reconfigurable FPGAs. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, pages 238–247. CSREA Press, June 2007.
[78] M. Handa, R. Radhakrishnan, M. Mukherjee, and R. Vemuri. A fast macro based compilation methodology for partially reconfigurable FPGA designs. In 16th International Conference on VLSI Design, pages 91–96, January 2003.
[79] M. Handa and R. Vemuri. An efficient algorithm for finding empty rectangles for online FPGA placement. In Proceedings of the 41st Annual Conference on Design Automation (DAC2004), pages 960–965. ACM Press, June 2004.
[80] M. Handa and R. Vemuri. Hardware assisted two dimensional ultra fast online placement. International Journal of Embedded Systems (IJES), 1:291–299, June 2006.
[81] R. Hartenstein. Why we need reconfigurable computing education. In Proceedings of the 1st International Workshop on Reconfigurable Computing Education, pages 1–11, March 2006.
[82] J. R. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, pages 12–21, April 1997.
[83] R. Hecht, S. Kubisch, H. Michelsen, E. Zeeb, and D. Timmermann. A distributed object system approach for dynamic reconfiguration. In Proceedings of the 20th International Parallel and Distributed Processing Symposium, April 2006.
[84] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach (Fourth Edition). Morgan Kaufmann, 2007.
[85] E. L. Horta and J. W. Lockwood. PARBIT: a tool to transform bitfiles to implement partial reconfiguration of field programmable gate arrays (FPGAs). Washington University, Department of Computer Science, Technical Report WUCS-01-13, July 2001.
[86] E. L. Horta and J. W. Lockwood. Automated method to generate bitstream intellectual property cores for Virtex FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Application, pages 975–979, 2004.
[87] E. L. Horta, J. W. Lockwood, D. E. Taylor, and D. Parlour. Dynamic hardware plugins in an FPGA with partial run-time reconfiguration. In Proceedings of the 39th Annual Conference on Design Automation (DAC2002), pages 343–348. ACM, 2002.
[88] P.-A. Hsiung, C.-H. Huang, and Y.-H. Chen. Hardware task scheduling and placement in operating systems for dynamically reconfigurable SoC. Journal of Embedded Computing, 2008. To appear.
[89] P.-A. Hsiung, C.-H. Huang, and C.-F. Liao. Perfecto: A SystemC-based performance evaluation framework for dynamically partially reconfigurable systems. In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications (FPL 2006), pages 190–198. IEEE CS Press, August 2006.
[90] P.-A. Hsiung, C.-S. Lin, and C.-F. Liao. Perfecto: A SystemC-based design space exploration framework for dynamically partially reconfigurable systems. ACM Transactions on Reconfigurable Technology and Systems, 1(3):1–30, 2008.
[91] P.-A. Hsiung and C.-W. Liu. Exploiting hardware and software low power techniques for energy efficient co-scheduling in dynamically reconfigurable systems. In Proceedings of the 17th IEEE International Conference on Field-Programmable Logic and Applications (FPL 2007), pages 165–170. IEEE CS Press, August 2007.
[92] P.-A. Hsiung, P.-H. Lu, and C.-W. Liu. Energy efficient hardware-software co-scheduling in dynamically reconfigurable systems. In Proceedings of the International Conference on Hardware-Software Codesign and System Synthesis, pages 87–92. ACM Press, September 2007.
[93] C.-H. Huang and P.-A. Hsiung. UML-based hardware/software co-design for partially reconfigurable systems. In Proceedings of the 13th IEEE Asia-Pacific Computer Systems Architecture Conference (ACSAC), pages 1–6, August 2008. doi: 10.1109/APCSAC.2008.4625436.
[94] R. D. Hudson, D. Lehn, J. Hess, J. Atwell, D. Moye, K. Shiring, and P. Athanas. Spatio-temporal partitioning of computational structures onto configurable computing machines. In Configurable Computing: Technology and Applications, Proc. SPIE 3526, pages 62–71. SPIE, The International Society for Optical Engineering, 1998.
[95] Xilinx Inc. Two Flows of Partial Reconfiguration: Module Based or Difference Based. Technical Report XAPP290, Xilinx Inc., November 2003.
[96] Xilinx Inc. Virtex Series Configuration Architecture User Guide. Technical Report XAPP151, Xilinx Inc., March 2003.
[97] Xilinx Inc. OPB HWICAP (v1.00.b) product specification. Technical report, Xilinx Inc., March 2005.
[98] Xilinx Inc. Early Access Partial Reconfiguration Guide. Xilinx Inc., 2006.
[99] Xilinx Inc. Embedded Development Kit EDK 8.2i. Xilinx Inc., 2006.
[100] Xilinx Inc. Virtex-4 configuration user guide. Technical Report UG071, Xilinx Inc., January 2007.
[101] Xilinx Inc. Virtex-4 user guide. Technical Report UG070, Xilinx Inc., March 2007.
[102] Xilinx Inc. Virtex-5 user guide. Technical Report UG190, Xilinx Inc., February 2007.
[103] Xilinx Inc. Xilinx: Virtex series FPGAs. http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/index.htm, 2007.
[104] Xilinx Inc. PlanAhead User Guide. Xilinx Inc., July 27, 2007.
[169] K.-J. Shih, P.-A. Hsiung, and C.-C. Hung. Reconfigurable hardware module sequencer: a tradeoff between networked and data flow architectures. In Proceedings of the International Conference on Field-Programmable Technology (ICFPT), pages 237–240. IEEE Computer Society Press, December 2007.
[170] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. C. Filho. MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 49(5):465–481, 2000.
[171] H. Singh, G. Lu, M.-H. Lee, E. Filho, R. Maestre, F. Kurdahi, and N. Bagherzadeh. MorphoSys: case study of a reconfigurable computing system targeting multimedia applications. In Proceedings of the 37th Annual Conference on Design Automation (DAC2000), pages 573–578, 2000.
[172] S. Singh and C. J. Lillieroth. Formal verification of reconfigurable cores. In Proceedings of the 7th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 25–32. IEEE CS Press, April 1999.
[173] L. Singhal and E. Bozorgzadeh. Multi-layer floorplanning on a sequence of reconfigurable designs. In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications (FPL 2006), pages 1–8, 2006.
[174] I. Skliarova and A. Ferrari. Reconfigurable hardware SAT solvers: A survey of systems. IEEE Transactions on Computers, 53(11):1449–1461, November 2004.
[175] H. K.-H. So and R. W. Brodersen. Improving usability of FPGA-based reconfigurable computers through operating system support. In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications (FPL 2006), pages 349–354, August 2006.
[176] H. K.-H. So and R. W. Brodersen. A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH. ACM Transactions on Embedded Computing Systems, 7(2):(article no. 14), February 2008.
[177] C. Steiger, H. Walder, and M. Platzner. Heuristics for online scheduling real-time tasks to partially reconfigurable devices. In Proceedings of the 13th IEEE International Conference on Field Programmable Logic and Applications (FPL 2003), pages 575–584, September 2003.
[178] C. Steiger, H. Walder, and M. Platzner. Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE Transactions on Computers, 53(11):1393–1407, November 2004.
[216] Xilinx Inc. Virtex-II Pro Platform FPGA Data Sheet, 2003.
[217] Xilinx Inc. Spartan-3 FPGA Family: Complete Data Sheet. Xilinx Inc., 2007.
[218] Xilinx Inc. UG012 Virtex-II Pro FPGA User Guide. Xilinx Inc., 4.2 edition, November 2007.
[219] Xilinx Inc. Virtex-4 Family Overview. Xilinx Inc., 2007.
[220] X. Yan and J. Han. CloseGraph: mining closed frequent graph patterns. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 286–295. ACM Press, 2003.
[221] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Verkest, and R. Lauwereins. Energy-aware runtime scheduling for embedded multiprocessor SoCs. IEEE Design and Test of Computers, 18(5):46–58, September 2001.
[222] M. J. Zaki. Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 17(8):1021–1035, 2005.